R Chronicle: 2012

Friday, December 21, 2012

Computing an empirical pFDR in R

The positive false discovery rate (pFDR) has become a classical procedure to control for false positives. It is one of my favourites because it relies on a re-sampling approach.

I base my implementation on John Storey's PNAS paper and the technical report he published with Rob Tibshirani while at Stanford [1-2] (I find the technical report much more didactic than the PNAS paper).

I will not describe here why you need to control for multiple hypothesis testing when considering many tests simultaneously (here is a link for that: Wikipedia). However, I will re-state the definition of the pFDR in the terms of Storey et al. [1]:
The pFDR is the expected value of the ratio (# of false positives) / (# of significant features), conditional on there being at least one significant result, which can be written as:


pFDR = E[F/S|S>0]

That said, having at least one significant value is almost certain when we have lots of features (genes), meaning that the above equation can be re-written as
pFDR ~= FDR ~= E[F]/E[S].
What the pFDR measures is the probability that a feature is a false positive when it is called significant at its own p-value level. For example, gene A has a p-value of 0.07 and a pFDR of 0.0001. If we consider gene A to be significant, the chance that it is a false positive is very low. Storey wrote it like this:

"[...] the pFDR can be written as Pr(feature i is truly null | feature i is significant)[...]"

In short this is a p-value for your p-value.
Now let's compute the pFDR for a concrete example. Here, we will use a genome-wide gene expression study comparing two groups of patients (genes as rows, patients as columns). The same scenario can be applied to any type of data (SNPs, proteins, patient values...) as long as you have class labels. That said, it is a tiny bit idiotic to implement an FDR test for gene expression data, as several R packages already provide this functionality, but the aim here is to understand how it works.

Practical thoughts to keep in mind regarding the FDR:
1. Because it relies on a random sampling procedure, results will never be exactly the same between runs. Increasing the number of random shufflings will make the results more and more similar.
2. It might be obvious to some, but it is worth noting that if you do not have groups or an order in your columns (= samples) you will not be able to shuffle the labels and thus cannot compute the FDR.

The work:
Starting from a normalized gene expression matrix generated like this:

	library(GEOquery)
	## Download the data from GEO
	GDS3716 <- getGEO('GDS3716')
	# transform the GDS to an ExpressionSet
	eset <- GDS2eSet(GDS3716,do.log2=TRUE)
	phenoData <- pData(eset)
	# keep only the ER+ and ER-
	samples <- phenoData$sample[grep("ER", phenoData$specimen)]
	# subsetting the expressionSet
	eset <- eset[,samples]
	# transforming to matrix
	e <- exprs(eset)
	# For the sake of the example here we keep only the first 2000 probes.
	e <- e[1:2000,samples]


From this small expression matrix of 2000 genes by 18 samples with two groups we will compute a p-value using a two-sided t-test.

	## EXTRACTING CLASS LABELS
	classLabel <- sub("^ER(.*) breast cancer", "\\1", grep("ER", phenoData$specimen, value=TRUE))
	classLabel
	[1] "-" "-" "-" "-" "-" "-" "-" "-" "-" "+" "+" "+" "+" "+" "+" "+" "+" "+"

	## COMPUTING P-VALUE DISTRIBUTION
	minus = which(classLabel=="-")
	plus = which(classLabel=="+")
	p <- apply(e, 1, function(x){t.test(as.numeric(x[minus]), as.numeric(x[plus]))$p.value})


We obtained a vector of p-values that we will use to obtain the q-values. To do that, we need to estimate the number of false positives (pFDR ~= E[F]/E[S]).
We will achieve that by re-computing the p-values using shuffled class labels to see whether random chance alone can produce p-values as low as those obtained with the original class labels. The shuffling will re-label some ER+ samples as ER- and vice versa.
In a way, we aim at estimating the robustness of the p-values we obtained.
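
The shuffling idea can be tried on simulated data before touching the real matrix. The following is only a minimal sketch under stated assumptions: the data are random numbers with a shift spiked into the first 20 rows, the helper names (row_pvalues, pfdr_at) and the number of shufflings B are made up for illustration, and the estimate leaves out the proportion-of-true-nulls factor that Storey's estimator includes, so it tends to be conservative.

```r
## Sketch of a permutation-based pFDR estimate on simulated data
## (illustrative only, names and sizes are assumptions)
set.seed(42)
n_genes <- 200
labels  <- rep(c("-", "+"), each = 9)
expr    <- matrix(rnorm(n_genes * length(labels)), nrow = n_genes)
# spike a real group difference into the first 20 genes
expr[1:20, labels == "+"] <- expr[1:20, labels == "+"] + 3

# two-sided t-test p-value for every row of the matrix
row_pvalues <- function(mat, lab) {
  apply(mat, 1, function(x) t.test(x[lab == "-"], x[lab == "+"])$p.value)
}

# observed p-values with the true labels
p_obs <- row_pvalues(expr, labels)

# null p-values: one column per random shuffling of the labels
B <- 20
p_null <- replicate(B, row_pvalues(expr, sample(labels)))

# pFDR estimate at threshold t, following pFDR ~= E[F]/E[S]
pfdr_at <- function(t) {
  S  <- sum(p_obs <= t)              # features called significant
  EF <- mean(colSums(p_null <= t))   # average count of null calls per shuffle
  if (S == 0) NA_real_ else min(1, EF / S)
}

pfdr_at(0.01)
```

Evaluating pfdr_at() at each observed p-value gives a per-gene estimate. Raising B makes the estimate more stable between runs, at the cost of computing time.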

Religious restrictions index: how do countries compare?

The Guardian DataBlog published an interesting article yesterday that graphically explores religious intolerance across the world. The data come from a report published by the Pew Research Center's Forum on Religion and Public Life. I like the DataBlog philosophy a lot: providing the raw data for everyone to look at.
However, I felt that the visualization could be improved. First, the data are longitudinal and no temporal representation is provided. So I downloaded the Google Spreadsheet and worked on it in R with googleVis. googleVis is the R interface to the Google graphic library.
The data are composed of two indices:

The Government Restriction Index (GRI) [measures government laws, policies and actions that restrict religious beliefs or practices]
The Social Hostilities Index (SHI) [measures acts of religious hostility by private individuals, organizations and social groups]

The R code is the following:

	library(xlsx)
	library(googleVis)
	# I downloaded the Excel file, cleaned the headers and worked a bit
	# the column title.
	da <- read.xlsx("~/Downloads/religion.xlsx", sheetName=1)
	rownames(da) <- da$COUNTRY.
	da <- da[,-1]
	religion <- data.frame(country=rep(rownames(da), 3),
	year=c(rep(2007, dim(da)[1]), rep(2009, dim(da)[1]), rep(2010, dim(da)[1])),
	GRI=c(da$GRI_2007, da$GRI_2009, da$GRI_2010),
	SHI=c(da$SHI_2007, da$SHI_2009, da$SHI_2010)
	)

	M <- gvisMotionChart(religion, idvar="country", timevar="year")
	plot(M)


I find this a better way to explore the data. Select a country of interest and follow it over time.
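
The wide-to-long reshaping done in the listing above is the step that makes the motion chart possible, since it needs one row per country and year. Here is a small sketch of that step on a toy table. The countries and index values are invented, and the column naming (GRI_2007, SHI_2007, ...) is assumed to match the cleaned spreadsheet.

```r
# Toy wide table standing in for the cleaned spreadsheet (values are invented)
da <- data.frame(
  GRI_2007 = c(1.2, 5.0), GRI_2009 = c(1.5, 5.5), GRI_2010 = c(1.4, 6.1),
  SHI_2007 = c(0.8, 3.1), SHI_2009 = c(0.9, 3.5), SHI_2010 = c(1.0, 4.2),
  row.names = c("CountryA", "CountryB")
)

years <- c(2007, 2009, 2010)

# Stack the yearly columns: one row per country and year
religion <- data.frame(
  country = rep(rownames(da), times = length(years)),
  year    = rep(years, each = nrow(da)),
  GRI     = unlist(da[paste0("GRI_", years)], use.names = FALSE),
  SHI     = unlist(da[paste0("SHI_", years)], use.names = FALSE)
)

religion
```

The resulting data frame has the shape expected by gvisMotionChart(religion, idvar="country", timevar="year").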

Tuesday, July 31, 2012

Twitter analysis of air pollution in Beijing

One of the air pollution monitors in Beijing (at the American Embassy) is connected to Twitter and tweets about the air quality in real time. By default the monitor in Beijing outputs the 24-hour summary of PM2.5 air pollution. PM2.5 is defined here.

The next step will be to compare the pollution level between different cities such as LA and Beijing. But it turns out the air quality data for California are not so easy to get programmatically.

Here is the code I used to produce this analysis:

Rcpp vs. R implementation of cosine similarity

While speeding up some code the other day for a project with a colleague, I ended up trying Rcpp for the first time. I re-implemented the cosine distance function with RcppArmadillo relatively easily, using bits and pieces of code I found scattered around the web. But the speed increase of the Rcpp code over pure R was not as large as I expected.

	require(inline)
	require(RcppArmadillo)

	## cosine distance (1 - cosine similarity) between columns
	cosine <- function(x) {
	y <- t(x) %*% x
	res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y))))
	return(res)
	}

	cosineRcpp <- cxxfunction(
	signature(Xs = "matrix"),
	plugin = c("RcppArmadillo"),
	body='
	Rcpp::NumericMatrix Xr(Xs); // creates Rcpp matrix from SEXP
	int n = Xr.nrow(), k = Xr.ncol();
	arma::mat X(Xr.begin(), n, k, false); // reuses memory and avoids extra copy
	arma::mat Y = arma::trans(X) * X; // matrix product
	arma::mat res = (1 - Y / (arma::sqrt(arma::diagvec(Y)) * arma::trans(arma::sqrt(arma::diagvec(Y)))));
	return Rcpp::wrap(res);
	')

	mat <- matrix(rnorm(100000), ncol=1000)

	x <- cosine(mat)
	y <- cosineRcpp(mat)
	identical(x, y)
	[1] TRUE


And here is the speed comparison...
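
As a sanity check of the formula itself, here is a small pure-R sketch on a toy matrix where the answer is known by hand. It is a variant written with crossprod() and tcrossprod() rather than the exact function above, and the name cosine_dist is made up for this example. The timing line only shows how such a comparison can be measured; the numbers depend on the machine.

```r
# Pure-R cosine distance between columns, written with crossprod()
cosine_dist <- function(x) {
  y     <- crossprod(x)          # same as t(x) %*% x
  norms <- sqrt(diag(y))         # Euclidean norm of each column
  1 - y / tcrossprod(norms)      # divide by the outer product of the norms
}

# Columns a and c are parallel, column b is orthogonal to both
toy <- cbind(a = c(1, 0), b = c(0, 1), c = c(2, 0))
d <- cosine_dist(toy)
d

# Timing on a matrix of the same size as in the listing above
mat <- matrix(rnorm(100000), ncol = 1000)
system.time(cosine_dist(mat))
```

Parallel columns get a distance of 0 and orthogonal columns a distance of 1, which is a quick way to confirm that an implementation returns a distance and not a similarity.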

A new approach to discover pain related genes

Our latest paper in PLoS Computational Biology is out.
The project spanned 2 years, starting at the end of my first year of postdoctoral training until now. It has been a truly collaborative endeavor across institutions but also across sub-disciplines: using text-mining, leveraging public genomic data across diseases and genotyping a human twin cohort subjected to experimental pain. A big thank you to all my collaborators.
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002538

Briefly, we demonstrated that ranking diseases by pain level using a literature co-citation approach, and then extracting the genes whose expression changes are associated with this ranking, leads to interesting new pain gene candidates.
The beauty of the approach is that it can be applied to concepts other than pain. For example, we show in the paper that we can significantly prioritize genes involved in inflammation in a similar fashion.

Sunday, June 3, 2012

Obtaining a protein-protein interaction network for a gene list in R

Building a network of interactions between a bunch of genes can help a great deal in understanding the relationships between the seemingly disparate elements of your list. It can seem challenging at first to build such a network but it's less complicated than it looks. Here is an approach I use.

Resources to obtain interaction information are numerous. Logically we think of going to the central repository if it exists. Unfortunately, for protein-protein interactions (PPI) there are several (IntAct, BioGRID, HPRD, STRING...).
Using the APIs developed for these repositories would require time that we usually don't have. Fortunately, the gene web page from NCBI Entrez Gene compiles interactions from BioGRID and HPRD, which seems like a reasonable and robust compromise, and we can use the XML package to parse the web page.

First, we need a gene list. Here I refer you to an earlier post where we extract a list of 274 significantly differentially regulated genes.
Using the following little function you can scrape the interaction table from the NCBI web page.
[update: corrected bug where some genes returned an error]

	get.ppiNCBI <- function(g.n) {
		require(XML)
		ppi <- data.frame()
		for(i in seq_along(g.n)){
			o <- htmlParse(paste("http://www.ncbi.nlm.nih.gov/gene/", g.n[i], sep=''))
			# check if interaction table exists
			exist <- length(getNodeSet(o, "//table//th[@id='inter-prod']"))>0
			if(exist){
				p <- getNodeSet(o, "//table")
				## need to know which table is the good one
				for(j in seq_along(p)){
					int <- readHTMLTable(p[[j]])
					# isTRUE() guards against tables with fewer than two columns
					if(isTRUE(colnames(int)[2]=="Interactant")){break}
				}
				ppi <- rbind(ppi, data.frame(egID=g.n[i], intSymbol=int$`Other Gene`))
			}
			# play nice! and avoid being kicked out from NCBI servers
			Sys.sleep(1)
		}
		if(dim(ppi)[1]>0){
			ppi <- unique(ppi)
			print(paste(dim(ppi)[1], "interactions found"))
			return(ppi)
		} else{
			print("No interaction found")
		}
	}


Here is a quick example with the first 20 genes from my list. You obtain your edge list in the form of a data.frame.

	ppi <- get.ppiNCBI(head(glist, 20))
	[1] "7 interactions found"
	## Annotate the gene list with Mus musculus metadata
	library(org.Mm.eg.db)
	ppi$egSymbol <- mget(ppi$egID, envir=org.Mm.egSYMBOL, ifnotfound=NA)
	ppi$intID <- mget(ppi$intSymbol, envir=org.Mm.egSYMBOL2EG, ifnotfound=NA)
	ppi <- ppi[,c(3,2,1,4)]
	ppi
	  egSymbol intSymbol  egID  intID
	1  Ifi202b    Pou5f1 26388  18999
	2     Hes5      Jak2 15208  16452
	3     Eya1    Polr2a 14048  20020
	4     Eya1     Rbck1 14048  24105
	5     Eya1   Sharpin 14048 106025
	6     Cdk6    TGFBR1 12571     NA
	7   Bcl11a     Sirt1 14025  93759


The NCBI2R package provides a similar function but there is a bug in GetInteractions().

You can write this data frame to a text file and import it into Cytoscape directly, but you can also display and work on your network directly in R using the igraph package.

	library("igraph")
	gg <- graph.data.frame(ppi)

	plot(gg,
	layout = layout.fruchterman.reingold,
	vertex.label = V(gg)$name,
	vertex.label.color= "black",
	edge.arrow.size=0,
	edge.curved=FALSE
	)

	## interactive display using tk
	tkplot(gg,
	layout = layout.fruchterman.reingold,
	vertex.label = V(gg)$name,
	vertex.label.color= "black",
	edge.arrow.size=0,
	edge.curved=FALSE
	)


The network is simple and not fully connected, but keep in mind that we obtained interactions for only 5 genes out of 20 here.
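
For the Cytoscape route mentioned earlier, the edge list only needs to be written out as a delimited text file. Here is a minimal sketch with a toy edge list built from two of the rows shown above. The file name and the use of a temporary directory are assumptions made for the example.

```r
# Toy edge list with the same symbol columns as the ppi table
ppi_toy <- data.frame(
  egSymbol  = c("Hes5", "Eya1"),
  intSymbol = c("Jak2", "Polr2a"),
  stringsAsFactors = FALSE
)

# Tab-delimited file, no quotes and no row names
out_file <- file.path(tempdir(), "ppi_edges.txt")
write.table(ppi_toy, file = out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Check what was written
readLines(out_file)
```

Each line of the file is one edge, with the source node in the first column and the target node in the second.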

Sunday, May 20, 2012

Another look at over-representation analysis interpretation

Interpreting a list of differentially regulated genes can take many forms. One of the most widely used methods is to look for enrichment of functional groups of genes compared to a random sampling of genes from the same universe, namely an over-representation analysis (ORA).

The point I want to explore today is: what is the best way to interpret the results of an ORA?
The list of GO categories one obtains often tells a complex story and leaves us with the uneasy feeling that we are cherry-picking the categories that fit our hypothesis best.

Let's have a look at an example. First, I extract a gene list from a publicly available experiment in Gene Expression Omnibus. I use GEOquery for that and obtain a list of 274 genes up- and down-regulated (code at the end).

From this gene list we can perform a GO ORA fairly easily using the GOstats package. I combined all the necessary steps in two functions (GO_over.r and write.GOhyper.r) that you can find on my GitHub repo. I usually load the functions directly from my R session using this function:

https://github.com/bobthecat/codebox/blob/master/source_https.r (copy and paste it in your R session or save it to a file call source_https.r)

	source('source_https.r')
	## You can now source R scripts from GitHub. The RAW URL is needed.
	source_https('https://raw.github.com/bobthecat/codebox/master/GO_over.r')

	## Define the universe
	library(mouse4302.db)
	uniqueId <- unique(as.vector(unlist(as.list(mouse4302ENTREZID))))
	entrezUniverse <- uniqueId[!is.na(uniqueId)]
	length(entrezUniverse)
	[1] 20877

	## ORA with conditional hypergeometric test
	mfhyper <- GO_over(entrezUniverse, glist, annot='mouse4302.db')

	## Information on the Directed Acyclic Graph (DAG)
	goDag(mfhyper)
	A graphNEL graph with directed edges
	Number of Nodes = 2723
	Number of Edges = 5643

	## How many genes were mapped in the end?
	geneMappedCount(mfhyper)
	[1] 257

	## Write out the results
	source_https('https://raw.github.com/bobthecat/codebox/master/write.GOhyper.r')

	mrnaGO <- write.GOhyper(mfhyper, filename="BP_mRNA_significant.xls")
	dim(mrnaGO <- mrnaGO[mrnaGO$adjPvalue <= 0.05,])
	[1] 59 8

	head(mrnaGO)
	GOBPID Pvalue adjPvalue OddsRatio ExpCount Count Size Term
	1 GO:0007275 7.195117e-09 4.245119e-07 2.238934 47.7311790 86 3138 multicellular organismal development
	2 GO:0031116 9.123574e-07 2.691454e-05 41.247520 0.1977391 5 13 positive regulation of microtubule polymerization
	3 GO:0007155 6.134458e-06 1.206443e-04 2.792433 11.0429688 28 726 cell adhesion
	4 GO:0048699 9.122060e-06 1.345504e-04 2.630388 12.5640388 30 826 generation of neurons
	5 GO:0048468 1.143166e-05 1.348936e-04 2.326166 18.1463660 38 1193 cell development
	6 GO:0031399 2.234055e-05 2.196821e-04 2.586734 11.8491359 28 779 regulation of protein modification process


Here we are presented with a table of 59 GO categories that are all significant after multiple hypothesis testing correction. Cell adhesion, generation of neurons, cellular response to interferon-beta...

How to interpret this list?
One way to do that is to display the Directed Acyclic Graph (DAG) of the over-represented GO categories in the list. But in my opinion it is difficult to get the big picture from such a representation. We know that GO categories (and to a lesser extent pathways) share common genes. My hypothesis is that visualizing the relationships between GO categories based on the number of genes they share will help to interpret the results. So what I do, in addition, is visualize the proportion of genes shared between GO categories by plotting the results of the ORA as a heatmap (code below the plot).

Rows and columns are GO categories. The color of each square represents the percentage of genes shared between any two categories. Here we see that our gene list (274 genes) seems to preferentially contain genes from three ensembles of GO categories that are in yellow along the diagonal. Based on this observation we can interpret that the main events going on in these cells seem to be linked to regulation of metabolism, cytoskeleton re-organization and neuron development, which makes sense when you consider that we compared iPS cells to neurosphere cells.
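
The matrix behind such a heatmap can be sketched in a few lines. This is only an illustration under assumptions: go2genes is a hypothetical named list mapping each category to its genes from the list, the identifiers are toy values, and "percentage shared" is taken here as the size of the intersection divided by the size of the smaller category, which is one possible definition among several.

```r
# Hypothetical mapping from GO category to the genes it contains (toy values)
go2genes <- list(
  "GO:A" = c("g1", "g2", "g3", "g4"),
  "GO:B" = c("g3", "g4", "g5"),
  "GO:C" = c("g6", "g7")
)

# Percentage of genes shared, relative to the smaller of the two categories
shared_pct <- function(a, b) {
  100 * length(intersect(a, b)) / min(length(a), length(b))
}

# Square matrix of pairwise overlaps, categories as rows and columns
overlap <- sapply(go2genes, function(a) {
  sapply(go2genes, function(b) shared_pct(a, b))
})

overlap

# Clustered heatmap of the overlaps, without rescaling the values
heatmap(overlap, symm = TRUE, scale = "none")
```

Categories that share many genes cluster into blocks along the diagonal, which is the pattern described above.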

I welcome comments about this approach (in fact this is the purpose of this post). I would like to argue that such a representation of a GO ORA is complementary to displaying a flat text table and plotting the DAG. Has anybody already used this approach to interpret a GO ORA? Or does anyone have a better solution?
I acknowledge that it is not the perfect solution. For example, if a category does not share many genes with others, it does not mean it is not worth investigating. It might even be the key to understanding the biological experiment, but there are a lot of those categories... which one to pick? Plus, I think a GO ORA does not aim at a fine-grained analysis but at a global overview of the events.

Here is the code to produce the heatmap:

Installing EUtils perl module

In my recent post on Using R to graph a subject trend in PubMed I used the EUtils Perl module. There are detailed general instructions on how to install Perl modules here for all major OSes. Here is what I did on my Mac.
I downloaded the archive from http://bioinformatics.tgen.org/brunit/downloads/tgen-eutils/ and ran these commands.

tar -xzf TGen-EUtils-0.13.tar.gz
cd TGen-EUtils-0.13
## instructions are in the INSTALL file
perl Makefile.PL
make
make test
sudo make install

And that's it folks.

Tuesday, May 15, 2012

Using R to graph a subject trend in PubMed

The traditional way to show an audience that your topic is worth studying is to present the state of the field based on a literature review. This is especially true if your subject is obscure to all but a handful of scientists in the world.
I was confronted with this problem more than once, and the last time I decided to plot the state of the field using a few scripts.
I wrote three scripts for that: pubmed_trend.r takes your PubMed query and sends it to the NCBI using the EUtils tools (Perl script), and I then plot the results. The details of the scripts are below, but here is how you create your trend.
In this example, we plot the number of publications per year for papers annotated with the MeSH terms "sex characteristics" and "pain", and compare this search to the number of publications per year for "sex characteristics" and "Analgesics". We will run this search between 1970 and 2011.

	source('pubmed_trend.r')
	sex.pub <- pubmed_trend(search.str = 'Sex+Characteristics[mh] AND Pain[mh]', year.span=1970:2011)
	analgesic.pub <- pubmed_trend(search.str = 'Sex+Characteristics[mh] AND Analgesics[mh]', year.span=1970:2011)

	source('plot_bar.r')
	library("RColorBrewer")

	pdf(file='sex_pain.pdf', height=8, width=8)
	par(las=1)
	colorfunction = colorRampPalette(brewer.pal(9, "Reds"))
	mycolors = colorfunction(length(sex.pub))
	plot_bar(x=sex.pub, linecol="#525252", cols=mycolors, addArg=FALSE)

	colorfunction = colorRampPalette(brewer.pal(9, "Blues"))
	mycolors = colorfunction(length(analgesic.pub))
	plot_bar(x=analgesic.pub, linecol='black', cols=mycolors, addArg=TRUE)
	title('Number of publication per year')
	legend('topleft',
	legend=c('Sex and Pain', 'Sex and Analgesics'),
	fill=c("red", "blue"),
	bty="n",
	cex=1.1
	)
	dev.off()


And here is the plot.

What we see here is that the number of publications per year about sex differences and pain or analgesics is growing, but it is still small and more research is needed.
...and you are good to go, your talk is launched.

Here are the details of the scripts and functions. pubmed_trend.r takes a PubMed query string as you would type it in the search box of the web interface (spaces have to be replaced by '+').
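
To make the query format concrete, here is a small sketch that only builds one query URL per year and sends nothing to NCBI. It is not the pubmed_trend.r script, which goes through the Perl module. The helper name build_esearch_urls is hypothetical, and the use of the E-utilities esearch endpoint with rettype=count and the [dp] (date of publication) tag is an assumption about one way to obtain yearly counts.

```r
# Hypothetical helper: one esearch URL per year for a given PubMed query
build_esearch_urls <- function(search.str, year.span) {
  # spaces are not allowed in the query, replace them by '+'
  term <- gsub(" ", "+", search.str)
  paste0(
    "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
    "?db=pubmed&rettype=count&term=", term,
    "+AND+", year.span, "[dp]"   # restrict to one publication year
  )
}

urls <- build_esearch_urls("Sex Characteristics[mh] AND Pain[mh]", 1970:1972)
urls
```

Because paste0() is vectorized over year.span, the function returns one URL per year, which maps naturally onto one count per bar in the plot.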

Hello world

The Brain Chronicle blog is an attempt to share with the R and scientific communities at large some methods, recipes and other thoughts that emerge from my day-to-day work as a computational biology researcher.
