Sunday, May 20, 2012

Another look at over-representation analysis interpretation

Interpreting a list of differentially regulated genes can take many forms. One of the most widely used method is looking for enrichment of functional group of genes compared to a random sampling of gene from the same universe, namely an over-representation analysis (ORA).

The point I want to explore today is what is the best way to interpret the results of an ORA?
The list of GO categories one obtain often tells a complex message and leave us with a confuse feeling that we are cherry picking the categories that fit our hypothesis the best.

Let's have a look at an example. First, I extract a gene list from a publicly available experiment in Gene Expression Omnibus. I use GEOquery for that and obtain a list of 274 genes up- and down-regulated (code at the end).

From this gene list we can perform a GO ORA fairly easily using the GOstats package. I combined all the steps necessary in two functions (GO_over.r and write.GOhyper.r) that you can found on my GitHub repo. I usually download the functions directly from my R session using this function: (copy and paste it in your R session or save it to a file call source_https.r)
Here we are presented with a table of 59 GO categories that are all significant after multiple hypothesis testing correction. Cell adhesion, generation of neurons, cellular response to interferon-beta...

How to interpret this list?
One way to do that is to display the Directed Acyclic Graph (DAG) of the over-represented GO categories in the list. But in my opinion it is difficult to get a big picture of such representation. We know that the GO categories (and to a lower extend pathways) share common genes. My hypothesis is that visualizing the relationship between GO categories based on the amount of gene shared will likely help to interpret the results. So what I do, in addition, is to visualize the amount of gene shared between GO categories by plotting the results of the ORA using a heatmap (code below the plot).
Rows and columns are GO categories. The color of each square represents the percentage of gene shared between any two categories. Here we see that our gene list (274 genes) seems to preferentially contain genes from three ensembles of GO categories that are in yellow along the diagonal. Based on this observation we can interpret that the main events going on in these cells seems to be linked to regulation of metabolism, cytoskeleton re-organization and neurons development. Which make sense when you consider that we compared iPS cells to neurospheres cells.

I welcome comments about this approach (in fact this the purpose of this post). I would like to argue that such representation of a GO ORA is complementary to displaying a flat text table and plotting the DAG. Did anybody already used this approach to interpret GO ORA? Or has a better solution?
I acknowledge that it is not the perfect solution. For example, if a category does not share many genes with others it does not mean it is not worth investigating. It might even be the key to understanding the biological experiment but there are a lot of those categories... which one to pick? Plus, I think a GO ORA does not aim at fined grain analysis but at a global overview of the events.

Here is the code to produce the heatmap:

Thursday, May 17, 2012

Installing EUtils perl module

In my recent post on Using R to graph a subject trend in PubMed I used the EUtils Perl module. There are detailed general instructions on how to install Perl module here for all major OS. What I did on my Mac is that.
I downloaded the archive from and ran those commands.
tar -xzf TGen-EUtils-0.13.tar.gz
cd TGen-EUtils-0.13
## instruction are in the INSTALL file
perl Makefile.PL
make test
sudo make install
And that's it folks.

Tuesday, May 15, 2012

Using R to graph a subject trend in PubMed

The traditional way to show that your topic is worth studying in front of an audience is to show the state of the field based on a literature review. This is especially true if your subject is obscure except to a handful of scientists in the world.
I was confronted with this problem more than once and the last time I decided to plot the state-of-the-field using a few scripts.
I wrote three scripts for that: pubmed_trend.r that take your PubMed query and send it to the NCBI using the Eutils tools (Perl script). Then I plot the results. The details of the scripts are below but here is how you create your trend.
In this example, we plot the trend for the number of publications per year for papers annotated with MeSH terms for "sex characteristics" and "pain" and compare this search to the number of publication/year for "sex characteristics" and "Analgesics". We will run this search between 1970 and 2011. And here is the plot.
What we see here is that the number of publications per year talking about sex difference and pain or analgesics is growing but the number of publication per year is still small and more research is needed.
...and you are good to go, your talk is launched

Here are the details of the scripts and functions. The pubmed_trend.r takes a PubMed query string as you would type it in the search box through the web interface (space have to be replaced by '+').

Tuesday, May 8, 2012

Hello world

The Brain Chronicle blog is an attempt to share with the R and scientific community at large some methods, recipes and other thoughts that emerge from my day-to-day work as a computer biologist researcher.