Tuesday, May 15, 2012

Using R to graph a subject trend in PubMed

The traditional way to convince an audience that your topic is worth studying is to show the state of the field through a literature review. This is especially true if your subject is obscure to all but a handful of scientists in the world.
I have faced this problem more than once, and the last time I decided to plot the state of the field using a few scripts.
I wrote three scripts for that: pubmed_trend.r takes your PubMed query and sends it to NCBI through a Perl script that uses the EUtils tools, and a third script plots the results. The details of the scripts are below, but here is how you create your trend.
In this example, we plot the number of publications per year for papers annotated with the MeSH terms "sex characteristics" and "pain", and compare it with the number of publications per year for "sex characteristics" and "Analgesics". We run the search from 1970 to 2011. And here is the plot.
What we see here is that the number of publications per year on sex differences and pain or analgesics is growing, but it is still small and more research is needed.
...and you are good to go: your talk is launched.

Here are the details of the scripts and functions. pubmed_trend.r takes a PubMed query string as you would type it in the search box of the web interface (spaces have to be replaced by '+').
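To make the query format concrete, here is a minimal sketch of how per-year query strings could be assembled. The helpers `plus_encode` and `build_year_queries` are hypothetical names introduced for illustration, not functions from pubmed_trend.r; the `[dp]` (date of publication) suffix follows the query string echoed in the comments below.

```r
# Hypothetical helpers for illustration; not taken from pubmed_trend.r.

# Replace runs of spaces inside a term with '+', as the search string expects.
plus_encode <- function(term) {
  gsub(" +", "+", trimws(term))
}

# Build one query per year by appending a publication-date filter.
build_year_queries <- function(search.str, year.span) {
  paste0(search.str, " AND ", year.span, "[dp]")
}

query <- paste0(plus_encode("Sex Characteristics"), "[mh] AND Pain[mh]")
queries <- build_year_queries(query, 1970:2011)
queries[1]
# "Sex+Characteristics[mh] AND Pain[mh] AND 1970[dp]"
```

Each element of `queries` can then be sent to NCBI separately, giving one count per year.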

The Perl script is straightforward and returns an XML file that is parsed by the XML library in R.
[Update] I rely here on the TGen EUtils Perl module; instructions on how to install it can be found here.
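The parsing step can be sketched as follows, assuming the XML file has the standard NCBI eSearch layout with a top-level `<Count>` element. The function `parse_count` and the sample XML are illustrative assumptions, and the count in the sample is a placeholder rather than a real PubMed result.

```r
library(XML)

# Illustrative stand-in for the file returned by the Perl script.
# The count value is a placeholder, not a real PubMed result.
xml_text <- '<?xml version="1.0"?>
<eSearchResult>
  <Count>42</Count>
  <RetMax>0</RetMax>
  <RetStart>0</RetStart>
</eSearchResult>'

# Hypothetical helper: extract the total hit count from an eSearch result.
# xml_source is a file path, or the XML itself when as_text = TRUE.
parse_count <- function(xml_source, as_text = FALSE) {
  doc <- xmlParse(xml_source, asText = as_text)
  # Use the full path: nested <Count> elements can also appear
  # deeper in the result (for example inside <TranslationStack>).
  count <- xpathSApply(doc, "/eSearchResult/Count", xmlValue)
  as.integer(count)
}

n <- parse_count(xml_text, as_text = TRUE)
```

Calling `parse_count("./tempfile.xml")` once per year and collecting the results in a vector would give the series to plot.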
And here is the plot function using barplot.
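As a rough sketch of a grouped bar plot of two trends side by side, the following uses base R's `barplot`. The function name `plot_trend_bars`, its arguments, and the colors are assumptions made for illustration, and the counts are placeholder numbers, not real PubMed results.

```r
# Hypothetical plotting helper; name and arguments are assumptions.
# counts: numeric matrix with one row per query and one column per year.
plot_trend_bars <- function(years, counts, labels,
                            cols = c("steelblue", "orange")) {
  stopifnot(ncol(counts) == length(years),
            nrow(counts) == length(labels))
  mids <- barplot(counts, beside = TRUE, names.arg = years,
                  col = cols[seq_len(nrow(counts))], las = 2,
                  xlab = "Year", ylab = "Publications per year",
                  legend.text = labels,
                  args.legend = list(x = "topleft", bty = "n"))
  invisible(mids)  # bar midpoints, handy for adding annotations later
}

# Placeholder counts only (NOT real PubMed results).
years <- 1970:2011
counts <- rbind(seq_along(years), 2 * seq_along(years))
mids <- plot_trend_bars(years, counts,
                        c("Sex Characteristics[mh] AND Pain[mh]",
                          "Sex Characteristics[mh] AND Analgesics[mh]"))
```

With `beside = TRUE`, the two queries appear as adjacent bars for each year, which makes the comparison easier to read than stacked bars.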


  1. Thanks! A very useful and widely applicable routine. I'll have to use this for my next talk!

  2. Nice. A while ago I implemented something similar as a web app, though not in R; it's Perl with a Lucene index rather than an NCBI web query so it's quite fast.

    1. That's pretty cool. I did not know about your website. I see that you use GET requests, so I queried MLTrends directly from R like this:
      > x <- read.table("http://www.ogic.ca/mltrends/?search_type=titles;norm_type=publications;graph_scale=linear;query=pain;Graph%21=Graph%21&DOWNLOAD=1", sep="\t", header=T)
      > dim(x)
      [1] 62 2
      > plot(x$year, x$pain, type='l')

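      Building a little function with the 3 options you propose (search in, normalization, and scale) is straightforward (untested sketch; the defaults are the values from the URL above):
      mltrends <- function(searchTerm="pain", searchIn="titles", norm="publications", scale="linear"){
        # Assemble the GET query from the same parameters as the URL above
        URLquery <- paste0("http://www.ogic.ca/mltrends/?search_type=", searchIn, ";norm_type=", norm, ";graph_scale=", scale, ";query=", searchTerm, ";Graph%21=Graph%21&DOWNLOAD=1")
        read.table(URLquery, sep="\t", header=TRUE)
      }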


    2. Can we use MeSH terms through your interface?

    3. Currently we're only indexing title and abstract by date (also authors but those aren't accessible via the web interface).

  3. I couldn't get this to work at all. Would you mind describing a bit more where I'm supposed to put which files? I'm not an R neophyte, but I don't know much about shell commands, Perl or the XML packages, so it's difficult to troubleshoot what's happening here.

    1. Sean, save the R scripts in pubmed_trend.r and plot_bar.r as in the example. Save the Perl script in pubmed_trend.pl and make it executable. The example above assumes that you put your scripts in the folder where you run R. You need to have Perl installed, along with the required R packages.

    2. > sex.pub <- pubmed_trend(search.str = 'Sex+Characteristics[mh] AND Pain[mh]', year.span=1970:2011)
      rm: ./tempfile.xml: No such file or directory
      [1] "queryString: Sex+Characteristics[mh] AND Pain[mh] AND 1970[dp]"
      /bin/sh: ./pubmed_trend.pl: Permission denied

      If I go to the shell and try to execute the perl file I get:

      $ perl ~/pubmed_trend.pl
      Can't locate Bio/TGen/EUtils.pm in @INC (@INC contains: /Library/Perl/5.12/darwin-thread-multi-2level /Library/Perl/5.12 /Network/Library/Perl/5.12/darwin-thread-multi-2level /Network/Library/Perl/5.12 /Library/Perl/Updates/5.12.3 /System/Library/Perl/5.12/darwin-thread-multi-2level /System/Library/Perl/5.12 /System/Library/Perl/Extras/5.12/darwin-thread-multi-2level /System/Library/Perl/Extras/5.12 .) at /Users/swilts/pubmed_trend.pl line 13.
      BEGIN failed--compilation aborted at /Users/swilts/pubmed_trend.pl line 13.

      So, I guess this also depends on downloading and installing Bio/TGen/EUtils.pm ?

    3. I updated the post and added instructions on how to install the EUtils module here: http://brainchronicle.blogspot.com/2012/05/installing-eutils-perl-module.html

    4. OK, and a quick Google search shows the other half of my problem was not knowing that I needed to do this in the terminal:

      sudo chmod 755 pubmed_trend.pl

      Looks like it's working now, cheers!

  4. Sean, as concerns e-utilities specifically, an easy intro is to use NCBI's cool ebot: http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi. It'll generate the code necessary to do simple tasks involving e-utilities. I use ebot just to save me having to code/remember how to use e-utilities. I can't imagine how to simplify it further...

  5. # In Ubuntu, install XML and Curl support.
    # Otherwise, installation of R-Packages will fail

    sudo apt-get install libxml2-dev curl libcurl4-openssl-dev

    # the perl script needs the TGEN-EUtils
    # download package TGen-EUtils-0.xxx.tar.gz from http://bioinformatics.tgen.org/brunit/downloads/tgen-eutils/
    # extract it to some place you like, e.g. ~/Downloads/
    # in terminal go to the extracted folder
    # type in:

    perl Makefile.PL
    make test
    sudo make install

    # Start R
    # install packages 'XML', 'RCurl', 'RColorBrewer'

    install.packages(c('XML', 'RCurl', 'RColorBrewer'),dependencies=T)

    # now the script should work
    # works at least on Ubuntu 12.04 LTS precise 64bit
    # GREAT WORK, BTW!!!
    # have fun