Sunday, May 20, 2012

Another look at over-representation analysis interpretation

Interpreting a list of differentially regulated genes can take many forms. One of the most widely used method is looking for enrichment of functional group of genes compared to a random sampling of gene from the same universe, namely an over-representation analysis (ORA).

The point I want to explore today is what is the best way to interpret the results of an ORA?
The list of GO categories one obtain often tells a complex message and leave us with a confuse feeling that we are cherry picking the categories that fit our hypothesis the best.

Let's have a look at an example. First, I extract a gene list from a publicly available experiment in Gene Expression Omnibus. I use GEOquery for that and obtain a list of 274 genes up- and down-regulated (code at the end).

From this gene list we can perform a GO ORA fairly easily using the GOstats package. I combined all the steps necessary in two functions (GO_over.r and write.GOhyper.r) that you can found on my GitHub repo. I usually download the functions directly from my R session using this function:
https://github.com/bobthecat/codebox/blob/master/source_https.r (copy and paste it in your R session or save it to a file call source_https.r)
source('source_https.r')
## You can now source R scripts from GitHub. The RAW URL is needed.
source_https('https://raw.github.com/bobthecat/codebox/master/GO_over.r')
## Define the universe
library(mouse4302.db)
uniqueId <- unique(as.vector(unlist(as.list(mouse4302ENTREZID))))
entrezUniverse <- uniqueId[!is.na(uniqueId)]
length(entrezUniverse)
[1] 20877
## ORA with conditional hypergeometric test
mfhyper <- GO_over(entrezUniverse, glist, annot='mouse4302.db')
## Information on the Directed Acyclic Graph (DAG)
goDag(mfhyper)
A graphNEL graph with directed edges
Number of Nodes = 2723
Number of Edges = 5643
## How many gene were mapped in the end?
geneMappedCount(mfhyper)
[1] 257
## Write out the results
source_https('https://raw.github.com/bobthecat/codebox/master/write.GOhyper.r')
mrnaGO <- write.GOhyper(mfhyper, filename="BP_mRNA_significant.xls")
dim(mrnaGO <- mrnaGO[mrnaGO$adjPvalue <= 0.05,])
[1] 59 8
head(mrnaGO)
GOBPID Pvalue adjPvalue OddsRatio ExpCount Count Size Term
1 GO:0007275 7.195117e-09 4.245119e-07 2.238934 47.7311790 86 3138 multicellular organismal development
2 GO:0031116 9.123574e-07 2.691454e-05 41.247520 0.1977391 5 13 positive regulation of microtubule polymerization
3 GO:0007155 6.134458e-06 1.206443e-04 2.792433 11.0429688 28 726 cell adhesion
4 GO:0048699 9.122060e-06 1.345504e-04 2.630388 12.5640388 30 826 generation of neurons
5 GO:0048468 1.143166e-05 1.348936e-04 2.326166 18.1463660 38 1193 cell development
6 GO:0031399 2.234055e-05 2.196821e-04 2.586734 11.8491359 28 779 regulation of protein modification process
view raw GO_over.r hosted with ❤ by GitHub
Here we are presented with a table of 59 GO categories that are all significant after multiple hypothesis testing correction. Cell adhesion, generation of neurons, cellular response to interferon-beta...

How to interpret this list?
One way to do that is to display the Directed Acyclic Graph (DAG) of the over-represented GO categories in the list. But in my opinion it is difficult to get a big picture of such representation. We know that the GO categories (and to a lower extend pathways) share common genes. My hypothesis is that visualizing the relationship between GO categories based on the amount of gene shared will likely help to interpret the results. So what I do, in addition, is to visualize the amount of gene shared between GO categories by plotting the results of the ORA using a heatmap (code below the plot).
Rows and columns are GO categories. The color of each square represents the percentage of gene shared between any two categories. Here we see that our gene list (274 genes) seems to preferentially contain genes from three ensembles of GO categories that are in yellow along the diagonal. Based on this observation we can interpret that the main events going on in these cells seems to be linked to regulation of metabolism, cytoskeleton re-organization and neurons development. Which make sense when you consider that we compared iPS cells to neurospheres cells.

I welcome comments about this approach (in fact this the purpose of this post). I would like to argue that such representation of a GO ORA is complementary to displaying a flat text table and plotting the DAG. Did anybody already used this approach to interpret GO ORA? Or has a better solution?
I acknowledge that it is not the perfect solution. For example, if a category does not share many genes with others it does not mean it is not worth investigating. It might even be the key to understanding the biological experiment but there are a lot of those categories... which one to pick? Plus, I think a GO ORA does not aim at fined grain analysis but at a global overview of the events.

Here is the code to produce the heatmap:
library(org.Mm.eg.db); library(gplots); library(bioDist)
# get gene in GO cat
gGOcat <- list()
for(i in 1:dim(mrnaGO)[1]){
gGOcat[[mrnaGO$GOBPID[i]]] <- as.vector(unlist(mget(mrnaGO$GOBPID[i], org.Mm.egGO2ALLEGS)))
}
# filter each GO cat by the genes found to be DE
f <- function(vecGO, vecDE){
return(vecGO[which(vecGO %in% vecDE)])
}
gGOcat <- lapply(gGOcat, FUN=function(x){f(x, glist)})
mat <- matrix(nrow=length(gGOcat), ncol=length(gGOcat))
for(i in 1:length(gGOcat)) {
for(j in 1:length(gGOcat)) {
# intersect / union
mat[i,j] <- (length(intersect(gGOcat[[i]], gGOcat[[j]])) / length(union(gGOcat[[i]], gGOcat[[j]])))
}
}
rownames(mat) <- names(gGOcat)
colnames(mat) <- names(gGOcat)
goNames <- as.vector(unlist(mget(names(gGOcat), GOTERM)))
goNames <- as.vector(unlist(lapply(goNames, FUN=function(x){x@Term})))
goNames <- paste(goNames, names(gGOcat), sep=" - ")
colnames(mat) <- goNames
rownames(mat) <- goNames
# color scale
c1 <- rainbow(32,v=seq(0.5,1,length=32),s=seq(1,0.3,length=32),start=4/6,end=4.0001/6);
pie(rep(1,32),col=c1); # show off these colors
c2 <- rainbow(32,v=seq(0.5,1,length=32),s=seq(1,0.3,length=32),start=1/6,end=1.0001/6);
pie(rep(1,32),col=c2); # show off these colors
c3 <- c(c1,rev(c2)); # rev reverses the list
rm(c1,c2)
pdf(file='GO_heatmap.pdf', height=13, width=13)
heatmap.2(mat, col=c3, margins = c(20, 20), dist=function(x){cor.dist(x)}, trace='none', density.info='none', cexRow=0.5, cexCol=0.5)
dev.off()
# keep it clean
rm(mat, goNames, gGOcat, f)
view raw GO_heatmap.r hosted with ❤ by GitHub

I put here last the code to generate the list of genes significantly differentially expressed used in this post. Briefly, the experiment consist of three cell types. (i) Neurosphere cells, which are adult stem cells found in the brain, (ii) induced pluripotent stem (iPS) cells that are derived from neurosphere cells and (iii) neurosphere cells derived from the iPS cells, making it a full loop in term cell types [reference].
library(GEOquery); library(RankProd); library(mouse4302.db)
## Download the data from GEO
gse12499 <- getGEO('GSE12499',GSEMatrix=TRUE)
e <- exprs(gse12499[[1]])
dim(e)
[1] 45101 10
## Rank Product
eRP <- RP(e[,c(1:3, 7:10)], cl=c(0,0,0,1,1,1,1), rand = 123)
table <- topGene(eRP, cutoff=0.001, method="pfp", logged=TRUE,
logbase=2, gene.names=rownames(e))
up <- unique(as.vector(unlist(mget(rownames(table$Table1), envir=mouse4302ENTREZID))))
down <- unique(as.vector(unlist(mget(rownames(table$Table2), envir=mouse4302ENTREZID))))
glist <- c(up, down)
glist <- glist[!is.na(glist)]
length(glist)
[1] 274
view raw get_the_data.r hosted with ❤ by GitHub

4 comments:

  1. Hey David,

    I suggest you check out Enrichment Map (http://www.baderlab.org/Software/EnrichmentMap) Although originally developed for visualization of GSEA results, it works with DAVID enrichment results as well. It creates a network of terms and connects them based on the amount of genes shared between them. It's pretty to look at as well.

    ReplyDelete
  2. Hi David,

    Another plug for Enrichment Map. I don't use the plugin directly, but the principle is easy to implement, and is related to your idea of using the degree of gene overlap between the annotations to organize them. It is pretty trivial to create a network of terms, and use the RCytoscape package to visualize the network interactively in Cytoscape. I actually used this method in my own Bioconductor package categoryCompare (http://bioconductor.org/packages/2.10/bioc/vignettes/categoryCompare/inst/doc/categoryCompare_vignette.pdf). Although I use it in this way to examine comparative ORA, I also tend to use it a lot for examining the results of single experiments, and find it very useful to organize the results of ORA.

    ReplyDelete
  3. Hi David,

    Thanks for a good post.
    I tried to run your script on my own data, however I get an error when I try to create the goNames object (line 25 in your heatmap script). I'm sure I missed something, but I haven't been able to figure out what "GOTERM' refers to in your mget() command. Could you maybe point me in the right direction?

    Many thanks

    ReplyDelete
    Replies
    1. GOTERM is an annotation object part of the GO.db package. This package is loaded automatically through a call for GOstats if you use the GO_over function.
      just load the GOstats library and it should be fine.

      Delete