Saturday, June 9, 2012

Rcpp vs. R implementation of cosine similarity

While speeding up some code the other day working on a project with a colleague I ended up trying Rcpp for the first time. I re-implemented the cosine distance function using RcppArmadillo relatively easily using bits and pieces of code I found scattered around the web. But the speed increase was not as much as I expected comparing the Rcpp code to pure R.
require(inline)
require(RcppArmadillo)
## extract cosine similarity between columns
cosine <- function(x) {
y <- t(x) %*% x
res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y))))
return(res)
}
cosineRcpp <- cxxfunction(
signature(Xs = "matrix"),
plugin = c("RcppArmadillo"),
body='
Rcpp::NumericMatrix Xr(Xs); // creates Rcpp matrix from SEXP
int n = Xr.nrow(), k = Xr.ncol();
arma::mat X(Xr.begin(), n, k, false); // reuses memory and avoids extra copy
arma::mat Y = arma::trans(X) * X; // matrix product
arma::mat res = (1 - Y / (arma::sqrt(arma::diagvec(Y)) * arma::trans(arma::sqrt(arma::diagvec(Y)))));
return Rcpp::wrap(res);
')
mat <- matrix(rnorm(100000), ncol=1000)
x <- cosine(mat)
y <- cosineRcpp(mat)
identical(x, y)
[1] TRUE
view raw Rcpp_cosine.r hosted with ❤ by GitHub
And here is the speed comparison...

Friday, June 8, 2012

A new approach to discover pain related genes

Our latest paper in PLoS Computational Biology is out.
The project spanned over 2 years starting at the end of my first year of postdoctoral training until now. It has been a truly collaborative endeavor across institutions but also across sub-disciplines using text-mining, leveraging public genomic data across diseases and genotyping a human twin cohort subjected to experimental pain. A big thank to all my collaborators.
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002538


Briefly, we successfully demonstrated that ranking diseases by pain level using a literature co-citation approach and then extracting the gene whose expression change is associated with this ranking lead to interesting new pain gene candidate.
The beauty of the approach is that it can be apply to other concept than pain. For example, we show in the paper that we can significantly prioritize genes involve in inflammation in a similar fashion.

Sunday, June 3, 2012

Obtaining a protein-protein interaction network for a gene list in R

Building a network of interaction between a bunch of genes can help a great deal in understanding the relationships between the seemingly disparate elements from your list. It can seems challenging at first to build such network but it's less complicated than it looks. Here is an approach I use.

Resources to obtain interactions information are numerous. Logically we think to go for the central repository if it exists. Unfortunately, for protein-protein interaction (PPI) there are severals (IntAct, BioGRID, HPRD, STRING...).
Using the API developed for these repo would require time and we usually don't have it. Fortunately, the gene web page from NCBI Entrez gene compile interactions from BioGRID and HPRD which seems like a reasonable and robust compromise. And on the other we can use the XML package to parse the web page.

First, we need a gene list, here I refer you to an earlier post where we extract a list 274 significantly differentially regulated genes.
Using the following little function you can scrap the interaction table from the NCBI web page.
[update: corrected bug where some genes returned an error]
get.ppiNCBI <- function(g.n) {
require(XML)
ppi <- data.frame()
for(i in 1:length(g.n)){
o <- htmlParse(paste("http://www.ncbi.nlm.nih.gov/gene/", g.n[i], sep=''))
# check if interaction table exists
exist <- length(getNodeSet(o, "//table//th[@id='inter-prod']"))>0
if(exist){
p <- getNodeSet(o, "//table")
## need to know which table is the good one
for(j in 1:length(p)){
int <- readHTMLTable(p[[j]])
if(colnames(int)[2]=="Interactant"){break}
}
ppi <- rbind(ppi, data.frame(egID=g.n[i], intSymbol=int$`Other Gene`))
}
# play nice! and avoid being kicked out from NCBI servers
Sys.sleep(1)
}
if(dim(ppi)[1]>0){
ppi <- unique(ppi)
print(paste(dim(ppi)[1], "interactions found"))
return(ppi)
} else{
print("No interaction found")
}
}
view raw get.ppiNCBI.r hosted with ❤ by GitHub
Here is a quick example with the first 20 genes from my list. You obtain your edge list in the form of a data.frame.
ppi <- get.ppiNCBI(head(glist, 20))
[1] "7 interactions found"
## Annotate the gene list with Mus musculus metadata
library(org.Mm.eg.db)
ppi$egSymbol <- mget(ppi$egID, envir=org.Mm.egSYMBOL, ifnotfound=NA)
ppi$intID <- mget(ppi$intSymbol, envir=org.Mm.egSYMBOL2EG, ifnotfound=NA)
ppi <- ppi[,c(3,2,1,4)]
ppi
egSymbol intSymbol egID intID
1 Ifi202b Pou5f1 26388 18999
2 Hes5 Jak2 15208 16452
3 Eya1 Polr2a 14048 20020
4 Eya1 Rbck1 14048 24105
5 Eya1 Sharpin 14048 106025
6 Cdk6 TGFBR1 12571 NA
7 Bcl11a Sirt1 14025 93759
view raw ppi.r hosted with ❤ by GitHub
The NCBI2R package provides a similar function but there is a bug in GetInteractions().

You can write this dataframe to a text file and import it in Cytoscape directly but you can also display and work your network directly in R using the igraph package.
library("igraph")
gg <- graph.data.frame(ppi)
plot(gg,
layout = layout.fruchterman.reingold,
vertex.label = V(gg)$name,
vertex.label.color= "black",
edge.arrow.size=0,
edge.curved=FALSE
)
## interactive display using tk
tkplot(gg,
layout = layout.fruchterman.reingold,
vertex.label = V(gg)$name,
vertex.label.color= "black",
edge.arrow.size=0,
edge.curved=FALSE
)
view raw ppiNetwork.r hosted with ❤ by GitHub


The network is simple and not fully connected but consider we obtained interaction for 5 genes out of 20 here only.