R Chronicle: Using R to graph a subject trend in PubMed

Tuesday, May 15, 2012

The traditional way to convince an audience that your topic is worth studying is to show the state of the field with a literature review. This is especially true if your subject is obscure to all but a handful of scientists in the world.
I was confronted with this problem more than once, and the last time I decided to plot the state of the field using a few scripts.
I wrote three scripts for that: pubmed_trend.r takes your PubMed query and sends it to NCBI through a Perl script (pubmed_trend.pl) that uses the EUtils tools, and plot_bar.r plots the results. The details of the scripts are below, but here is how you create your trend.
In this example, we plot the number of publications per year for papers annotated with the MeSH terms "sex characteristics" and "pain", and compare it to the number of publications per year for "sex characteristics" and "Analgesics". We run this search from 1970 to 2011.

	source('pubmed_trend.r')
	sex.pub <- pubmed_trend(search.str = 'Sex+Characteristics[mh] AND Pain[mh]', year.span=1970:2011)
	analgesic.pub <- pubmed_trend(search.str = 'Sex+Characteristics[mh] AND Analgesics[mh]', year.span=1970:2011)

	source('plot_bar.r')
	library("RColorBrewer")

	pdf(file='sex_pain.pdf', height=8, width=8)
	par(las=1)
	colorfunction = colorRampPalette(brewer.pal(9, "Reds"))
	mycolors = colorfunction(length(sex.pub))
	plot_bar(x=sex.pub, linecol="#525252", cols=mycolors, addArg=FALSE)

	colorfunction = colorRampPalette(brewer.pal(9, "Blues"))
	mycolors = colorfunction(length(analgesic.pub))
	plot_bar(x=analgesic.pub, linecol='black', cols=mycolors, addArg=TRUE)
	title('Number of publications per year')
	legend('topleft',
		legend=c('Sex and Pain', 'Sex and Analgesics'),
		fill=c("red", "blue"),
		bty="n",
		cex=1.1
	)
	dev.off()

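The colour vectors in the script above are built by stretching a short palette to one colour per year. Here is a minimal sketch of that step, assuming a two-colour base R ramp in place of the RColorBrewer palette so that it runs without extra packages:

```r
# Assumption: a white-to-red ramp stands in for brewer.pal(9, "Reds").
year.span <- 1970:2011

# colorRampPalette() returns a function that interpolates n colours
ramp <- grDevices::colorRampPalette(c("white", "red"))
mycolors <- ramp(length(year.span))

# One hex colour per year, so each bar gets its own shade
length(mycolors) == length(year.span)
```

Since the ramp runs from light to dark, the most recent years get the most saturated bars.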

And here is the plot.

The plot shows that the number of publications per year on sex differences and pain or analgesics is growing, but the yearly count is still small and more research is needed.
...and you are good to go: your talk is launched.

Here are the details of the scripts and functions. pubmed_trend.r takes a PubMed query string as you would type it in the search box of the web interface (spaces have to be replaced by '+').

	pubmed_trend <- function(search.str = 'Sex+Characteristics[mh] AND Pain[mh]', year.span=1970:2011) {
		require(XML)
		require(RCurl)

		results <- NULL
		tmpf <- "./tempfile.xml"
		## clean before (unlink does nothing if the file is absent)
		unlink(tmpf)

		for(i in year.span){
			## restrict the query to publication year i ([dp] = date of publication)
			queryString <- paste(search.str, ' AND ', i, '[dp]', sep="")
			print(paste('queryString:', queryString))
			## the Perl script writes the search result to tempfile.xml
			sysString <- paste('./pubmed_trend.pl "', queryString,'"', sep="")
			system(sysString)

			xml <- xmlTreeParse(tmpf, useInternalNodes=TRUE)
			## NA if the file has no <Count> node (e.g. the Perl fallback file)
			countNodes <- getNodeSet(xml, "//Count")
			pubTerm <- if (length(countNodes) > 0) as.numeric(xmlValue(countNodes[[1]])) else NA
			print(paste("#______num pub for",i,":",pubTerm))
			rm(xml)
			results <- append(results, pubTerm)
			## avoid being kicked out!
			Sys.sleep(1)
		}
		names(results) <- year.span
		## clean after
		unlink(tmpf)

		return(results)
	}

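The per-year loop in pubmed_trend can be separated from the network call, which makes it easy to check without contacting NCBI. The sketch below is one possible arrangement, not the original script; `trend_by_year` and `count_fun` are hypothetical names introduced for illustration.

```r
# Sketch only: trend_by_year() and count_fun are hypothetical names.
# count_fun takes a full query string and returns a single number.
trend_by_year <- function(search.str, year.span, count_fun) {
  counts <- vapply(year.span, function(yr) {
    # Same query construction as pubmed_trend: append the year with [dp]
    query <- paste0(search.str, " AND ", yr, "[dp]")
    as.numeric(count_fun(query))
  }, numeric(1))
  names(counts) <- year.span
  counts
}

# Stub that echoes the year found in the query, standing in for the
# real PubMed call (no data is fetched here).
echo_year <- function(query) sub(".* AND ([0-9]+)\\[dp\\]$", "\\1", query)

res <- trend_by_year("Sex+Characteristics[mh] AND Pain[mh]", 1970:1972, echo_year)
```

With a real `count_fun` that calls the Perl script and parses the XML file, the pause between requests (`Sys.sleep(1)`) would belong inside that function.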

The Perl script is straightforward: it writes the search result to an XML file (tempfile.xml) that is then parsed by the XML package in R.
[Update] I rely here on the TGen EUtils Perl module; instructions on how to install it can be found here

	#! /usr/bin/perl -w
	#
	# pubmed_trend.pl
	#
	# Created by David Ruau on 2011-02-17.
	# Department of Pediatrics/Div. System Medicine Stanford University.
	#
	##################### USAGE #########################
	#
	# Query PubMed with Eutils tools
	#
	#####################################################
	use Bio::TGen::EUtils;

	use strict;

	my $queryString = $ARGV[0];

	## query info
	my $eu = Bio::TGen::EUtils->new( 'tool' => 'pubmed_trend.pl',
		'email' => 'REPLACE_ME@gmail.com' );

	## ESearch
	my $query = $eu->esearch( db => 'pubmed',
		term => $queryString,
		usehistory => 'n' );

	$query->write_raw( file => 'tempfile.xml' );

	if (-z 'tempfile.xml') {
		# empty result file: try one more time
		my $query = $eu->esearch( db => 'pubmed',
			term => $queryString,
			usehistory => 'n' );

		$query->write_raw( file => 'tempfile.xml' );
		if (-z 'tempfile.xml') {
			# still empty: write a placeholder so the XML parser in R does not fail
			open (FILE, '>', 'tempfile.xml') or die "Could not open file, $!";

			print FILE "<begin>hello world</begin>";
			close (FILE);
		}
	}

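The Perl script above sends the query to NCBI's ESearch utility. As an alternative sketch, the same request can be built directly in R against the public E-utilities URL, asking only for the hit count with `rettype=count`. The helper names `esearch_count_url` and `parse_count` are hypothetical, and the regular expression is a shortcut; the XML package used in pubmed_trend.r is the more robust way to read the response.

```r
# Sketch only: esearch_count_url() and parse_count() are hypothetical
# helper names, not part of the original scripts.

# Build an ESearch URL restricted to one publication year.
# rettype=count asks ESearch to return only the number of hits.
esearch_count_url <- function(term, year) {
  base <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
  query <- paste0(term, " AND ", year, "[dp]")
  paste0(base, "?db=pubmed&rettype=count&term=",
         utils::URLencode(query, reserved = TRUE))
}

# Pull the first <Count> value out of the response text.
# Returns NA when no <Count> element is present.
parse_count <- function(xml_text) {
  xml_text <- paste(xml_text, collapse = "\n")
  m <- regmatches(xml_text, regexpr("<Count>[0-9]+</Count>", xml_text))
  if (length(m) == 0) return(NA_real_)
  as.numeric(gsub("</?Count>", "", m))
}

# Offline check with a hand-written response (not real PubMed output)
parse_count("<eSearchResult><Count>42</Count></eSearchResult>")

# Live use (needs network access; keep a pause between requests):
# url <- esearch_count_url("Sex Characteristics[mh] AND Pain[mh]", 1970)
# parse_count(readLines(url, warn = FALSE))
```

Returning NA for a response without a `<Count>` element keeps a failed request visible in the results instead of stopping the whole loop.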

And here is the plot function using barplot.

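	plot_bar <- function(x=sex.pub, linecol="royalblue", cols, addArg=TRUE) {
		## barplot() returns the bar midpoints, used as x positions for the trend line
		bp <- barplot(x, col=cols, add=addArg)
		## lowess smoother over the yearly counts (f = smoother span)
		fit <- stats::lowess(x, f=1/3)
		lines(x=bp, fit$y, col=linecol, lwd=3)
	}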

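To see why plot_bar draws the smoothed line at the positions returned by barplot rather than at 1, 2, 3, ..., here is a small sketch on made-up counts. The numbers are placeholders for illustration, not PubMed results, and the null PDF device is only there so the example runs without opening a window.

```r
# Made-up counts for illustration only; these are not PubMed results.
counts <- c(2, 3, 5, 4, 8, 12, 11, 15, 21, 25)
names(counts) <- 2001:2010

pdf(NULL)  # null device: nothing is written to disk

# barplot() returns the x coordinate of each bar's midpoint
bp <- barplot(counts, col = "grey80")

# lowess() smooths the counts against their index 1..n
fit <- stats::lowess(counts, f = 1/3)

# Draw the smoothed values at the bar midpoints so line and bars align
lines(x = bp, y = fit$y, lwd = 3)
dev.off()
```

Because bars have a width and a gap, their midpoints do not fall on the integers 1..n, so plotting `fit$y` against `fit$x` would drift away from the bars.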

13 comments:

Tom, May 16, 2012 at 12:07 AM
Thanks! A very useful and widely applicable routine. I'll have to use this for my next talk!
Max Gordon, May 16, 2012 at 4:51 AM
Wow, amazing post, thanks!
Gareth Palidwor, May 16, 2012 at 10:44 AM
Nice. A while ago I implemented something similar as a web app, though not in R; it's Perl with a Lucene index rather than an NCBI web query so it's quite fast.
http://www.ogic.ca/mltrends/
Sean, May 16, 2012 at 12:51 PM
I couldn't get this to work at all, would you mind describing a bit more where I'm supposed to put which files? I'm not an R neophyte, but I don't know much about shell commands, perl or the xml packages so it's difficult to trouble shoot what's happening here.
Yannick Pouliot, May 16, 2012 at 9:15 PM
Sean, as concerns e-utilities specifically, an easy intro is to use NCBI's cool ebot: http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi. It'll generate the code necessary to do simple tasks involving e-utilities. I use ebot just to save me having to code/remember how to use e-utilities. I can't imagine how to simplify it further...
produnis, May 23, 2012 at 10:05 AM
# In Ubuntu, install XML and Curl support.
# Otherwise, installation of R-Packages will fail

sudo apt-get install libxml2-dev curl libcurl4-openssl-dev

# the perl script needs the TGEN-EUtils
# download package TGen-EUtils-0.xxx.tar.gz from http://bioinformatics.tgen.org/brunit/downloads/tgen-eutils/
# extract it to some place you like, e.g. ~/Downloads/
# in terminal go to the extracted folder
# type in:

perl Makefile.PL
make
make test
sudo make install

# Start R
# install packages 'XML', 'RCurl', 'RColorBrewer'

install.packages(c('XML', 'RCurl', 'RColorBrewer'),dependencies=T)

# now the script should work
# works at least on Ubuntu 12.04 LTS precise 64bit
# GREAT WORK, BTW!!!
# have fun