The point I want to explore today is what is the best way to interpret the results of an ORA?
The list of GO categories one obtain often tells a complex message and leave us with a confuse feeling that we are cherry picking the categories that fit our hypothesis the best.
Let's have a look at an example. First, I extract a gene list from a publicly available experiment in Gene Expression Omnibus. I use GEOquery for that and obtain a list of 274 genes up- and down-regulated (code at the end).
From this gene list we can perform a GO ORA fairly easily using the GOstats package. I combined all the steps necessary in two functions (GO_over.r and write.GOhyper.r) that you can found on my GitHub repo. I usually download the functions directly from my R session using this function:
https://github.com/bobthecat/codebox/blob/master/source_https.r (copy and paste it in your R session or save it to a file call source_https.r)Here we are presented with a table of 59 GO categories that are all significant after multiple hypothesis testing correction. Cell adhesion, generation of neurons, cellular response to interferon-beta...
How to interpret this list?
One way to do that is to display the Directed Acyclic Graph (DAG) of the over-represented GO categories in the list. But in my opinion it is difficult to get a big picture of such representation. We know that the GO categories (and to a lower extend pathways) share common genes. My hypothesis is that visualizing the relationship between GO categories based on the amount of gene shared will likely help to interpret the results. So what I do, in addition, is to visualize the amount of gene shared between GO categories by plotting the results of the ORA using a heatmap (code below the plot).
I welcome comments about this approach (in fact this the purpose of this post). I would like to argue that such representation of a GO ORA is complementary to displaying a flat text table and plotting the DAG. Did anybody already used this approach to interpret GO ORA? Or has a better solution?
I acknowledge that it is not the perfect solution. For example, if a category does not share many genes with others it does not mean it is not worth investigating. It might even be the key to understanding the biological experiment but there are a lot of those categories... which one to pick? Plus, I think a GO ORA does not aim at fined grain analysis but at a global overview of the events.
Here is the code to produce the heatmap:
I put here last the code to generate the list of genes significantly differentially expressed used in this post. Briefly, the experiment consist of three cell types. (i) Neurosphere cells, which are adult stem cells found in the brain, (ii) induced pluripotent stem (iPS) cells that are derived from neurosphere cells and (iii) neurosphere cells derived from the iPS cells, making it a full loop in term cell types [reference].