There are many resources for interaction data. The logical choice would be a central repository, if one existed. Unfortunately, for protein-protein interactions (PPI) there are several (IntAct, BioGRID, HPRD, STRING...).
Learning the API of each of these repositories takes time that we usually don't have. Fortunately, the NCBI Entrez Gene page for each gene compiles interactions from BioGRID and HPRD, which seems like a reasonable and robust compromise. On top of that, we can use the XML package to parse the web page.
First, we need a gene list; here I refer you to an earlier post where we extracted a list of 274 significantly differentially regulated genes.
Using the following little function you can scrape the interaction table from the NCBI web page.
[update: corrected bug where some genes returned an error]
get.ppiNCBI <- function(g.n) {
  library(XML)
  ppi <- data.frame()
  for (i in seq_along(g.n)) {
    o <- htmlParse(paste("http://www.ncbi.nlm.nih.gov/gene/", g.n[i], sep=''))
    # check if interaction table exists
    exist <- length(getNodeSet(o, "//table//th[@id='inter-prod']")) > 0
    if (exist) {
      p <- getNodeSet(o, "//table")
      ## need to know which table is the good one:
      ## it is the one whose second column is "Interactant"
      for (j in seq_along(p)) {
        int <- readHTMLTable(p[[j]])
        # identical() stays FALSE (no error) for tables with fewer than 2 columns
        if (identical(colnames(int)[2], "Interactant")) {break}
      }
      ppi <- rbind(ppi, data.frame(egID=g.n[i], intSymbol=int$`Other Gene`))
    }
    # play nice! and avoid being kicked out from NCBI servers
    Sys.sleep(1)
  }
  if (nrow(ppi) > 0) {
    ppi <- unique(ppi)
    print(paste(nrow(ppi), "interactions found"))
    return(ppi)
  } else {
    print("No interaction found")
  }
}
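As a side note, the table lookup can be tried offline on a small hand-written page. The sketch below assumes a page layout like the one the function expects (a header cell with `id='inter-prod'`, and "Interactant" and "Other Gene" columns); the HTML snippet and its contents are made up for illustration and are not copied from NCBI. It selects the interaction table directly with a single XPath expression instead of looping over every table.

```r
library(XML)

# Hypothetical page with one unrelated table and one interaction table
html <- '<html><body>
<table><tr><th>A</th><th>B</th></tr><tr><td>1</td><td>2</td></tr></table>
<table>
<tr><th id="inter-prod">Product</th><th>Interactant</th><th>Other Gene</th></tr>
<tr><td>P1</td><td>Q1</td><td>Jak2</td></tr>
</table>
</body></html>'

doc <- htmlParse(html, asText = TRUE)

# Keep only the table(s) containing the header cell with id 'inter-prod'
hits <- getNodeSet(doc, "//table[.//th[@id='inter-prod']]")
has.table <- length(hits) > 0

if (has.table) {
  tb <- hits[[1]]
  # Header labels and the third column ("Other Gene") read through XPath
  headers <- xpathSApply(tb, ".//th", xmlValue)
  other.gene <- xpathSApply(tb, ".//tr/td[3]", xmlValue)
  # The same node can also be handed to readHTMLTable() as in the function above
  int <- readHTMLTable(tb, stringsAsFactors = FALSE)
}
```

Selecting by the header id avoids relying on the position of the table in the page, but it still breaks if NCBI changes that id.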
ppi <- get.ppiNCBI(head(glist, 20))
[1] "7 interactions found"
## Annotate the gene list with Mus musculus metadata
library(org.Mm.eg.db)
## mget() expects character keys, hence as.character()
ppi$egSymbol <- mget(as.character(ppi$egID), envir=org.Mm.egSYMBOL, ifnotfound=NA)
ppi$intID <- mget(as.character(ppi$intSymbol), envir=org.Mm.egSYMBOL2EG, ifnotfound=NA)
## reorder columns: egSymbol, intSymbol, egID, intID
ppi <- ppi[,c(3,2,1,4)]
ppi
  egSymbol intSymbol  egID  intID
1  Ifi202b    Pou5f1 26388  18999
2     Hes5      Jak2 15208  16452
3     Eya1    Polr2a 14048  20020
4     Eya1     Rbck1 14048  24105
5     Eya1   Sharpin 14048 106025
6     Cdk6    TGFBR1 12571     NA
7   Bcl11a     Sirt1 14025  93759
You can write this data frame to a text file and import it directly into Cytoscape, but you can also display and work with your network directly in R using the igraph package.
library("igraph")
gg <- graph.data.frame(ppi)
plot(gg,
     layout = layout.fruchterman.reingold,
     vertex.label = V(gg)$name,
     vertex.label.color = "black",
     edge.arrow.size = 0,
     edge.curved = FALSE
)
## interactive display using tk
tkplot(gg,
       layout = layout.fruchterman.reingold,
       vertex.label = V(gg)$name,
       vertex.label.color = "black",
       edge.arrow.size = 0,
       edge.curved = FALSE
)

The network is simple and not fully connected, but keep in mind that we only obtained interactions for 5 genes out of 20 here.
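For the Cytoscape route mentioned earlier, here is a minimal sketch of the export step. It runs on a toy data frame rather than the real `ppi` object, and the helper name `flatten.cols` and the output file name are my own choices. One assumption worth stating: `mget()` returns lists, so annotation columns built that way may be list columns, which `write.table()` cannot write; the helper keeps the first value of each element.

```r
# Toy edge list (two rows borrowed from the table above)
ppi.toy <- data.frame(egSymbol  = c("Eya1", "Cdk6"),
                      intSymbol = c("Polr2a", "TGFBR1"),
                      stringsAsFactors = FALSE)
# Mimic a list column such as the ones mget() produces
ppi.toy$intID <- list("20020", NA)

# Hypothetical helper: turn list columns into plain character vectors,
# keeping only the first value of each element
flatten.cols <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.list(col)) {
      vapply(col, function(v) as.character(v)[1], character(1))
    } else {
      col
    }
  })
  df
}

flat <- flatten.cols(ppi.toy)

# Tab-delimited file with a header row
out <- file.path(tempdir(), "ppi_edges.txt")
write.table(flat, file = out, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to check the round trip
back <- read.delim(out, stringsAsFactors = FALSE)
```

In Cytoscape, the first two columns can then be mapped to source and target nodes when importing the file as a network.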
Great walk-through! It would be great to have a generic method to test if a group of proteins are more closely connected in terms of a PPI than allowed by random chance. There has to be work done on this already, but I'm not aware of an automated way to simply input a list of say 5 genes, and get a p-value for how closely related they are in terms of a PPI.
If I understand you correctly, you want to know the odds that your genes are connected compared to a random set of genes. Biological networks follow a power law, so many genes will have few connections and a few will be highly connected [I just read "Linked" by Barabási]. There should definitely be some work done on that.
Off the top of my head, you would need to compute the empirical distribution for your universe (e.g. your microarray): sample 5 random genes from it many times and observe how often the random network ends up more connected than yours. Basically an empirical (permutation-style) p-value.
The advantage is that the empirical distribution can be pre-computed and would not change except when the interaction repositories are updated.
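A rough sketch of that resampling idea in base R, under some loud assumptions: the function names, the toy universe and the toy edge list are all made up for illustration, "connectedness" is simply the number of interactions with both partners inside the gene set, and random sets are drawn uniformly from the universe (which ignores the degree distribution mentioned above).

```r
# Count interactions whose two partners both fall inside `genes`
count.internal.edges <- function(edges, genes) {
  sum(edges$a %in% genes & edges$b %in% genes)
}

# Empirical p-value: how often does a random gene set of the same size
# carry at least as many internal interactions as the observed set?
ppi.perm.test <- function(edges, universe, genes, n.perm = 1000) {
  observed <- count.internal.edges(edges, genes)
  random <- replicate(
    n.perm,
    count.internal.edges(edges, sample(universe, length(genes)))
  )
  # add-one correction so the p-value is never exactly zero
  p <- (1 + sum(random >= observed)) / (1 + n.perm)
  list(observed = observed, p.value = p)
}

# Toy data (not real interactions)
set.seed(1)
universe <- paste0("g", 1:50)
edges <- data.frame(a = c("g1", "g1", "g2", "g10", "g20"),
                    b = c("g2", "g3", "g3", "g11", "g21"),
                    stringsAsFactors = FALSE)

res <- ppi.perm.test(edges, universe, c("g1", "g2", "g3", "g4", "g5"))
res$observed
res$p.value
```

Because the random counts depend only on the universe and the set size, the `random` vector could be computed once per set size and reused, which is the pre-computation point made above.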