R Chronicle

Tuesday, March 19, 2013

How not to reveal your MySQL DB login/password when sharing code on GitHub or BitBucket?

Solution: use your ~/.my/cnf
Inside your ~/.my.cnf file define the connection parameters to your databases. For example, here I define two groups called local and toto:

[local]
user = root
password = ultra_secret
host = localhost

[toto]
user = capitaine_flam
password = galaxy
host = milky.way.net

This allow me to connect to the databases defined in the group from R using the following command:

	library(RMySQL)
	con <- dbConnect(MySQL(), group='toto', dbname="user_profile")
	m <- dbGetQuery(con, "SELECT DISTINCT user_id FROM demographics WHERE gender='0'")
	dbDisconnect(con)

view raw connect.r hosted with ❤ by GitHub

This trick also apply when you are writing report using knitR to create Sweave or markdown reports. Your password is stored in clear in your .my.cnf file. So as long as your file system is not compromised you are fairly safe.

Sunday, February 24, 2013

Large correlation in parallel

A little improvement to the bigcor function proposed on Rmazing to compute huge correlation matrix in R, I made the function work in parallel using all the CPU cores available on the machine. The code is here.

Here is a benchmark of the 2 functions on my machine with 8 cores:

	R <- c(2000, 5000, 10000, 20000, 40000)
	## I hit the limit at ~50000 the ff function refuse to create the matrix.
	# Error in if (length < 0 \|\| length > .Machine$integer.max) stop("length must be between 1 and .Machine$integer.max") :
	# missing value where TRUE/FALSE needed
	# http://www.bytemining.com/2010/05/hitting-the-big-data-ceiling-in-r/
	normal <- numeric(length=length(R))
	for(i in 1:length(R)){
	split <- ifelse(R[i]<=20000, 10, 20)
	MAT <- matrix(rnorm(R[i] * 10), nrow = 10)
	normal[i] <- system.time(res <- bigcor(MAT, nblocks = split, verbose=FALSE))[3]
	}

	parallel <- numeric(length=length(R))
	for(i in 1:length(R)){
	split <- ifelse(R[i]<=20000, 10, 20)
	MAT <- matrix(rnorm(R[i] * 10), nrow = 10)
	parallel[i] <- system.time(res <- bigcorPar(MAT, nblocks = split, verbose=FALSE))[3]
	}

	d <- data.frame(time=c(normal, parallel), type=rep(c("normal", "parallel"), each=length(R)), size=rep(R, 2))

	library(ggplot2)
	pdf("bigcor_benchmark.pdf", height=7, width=7)
	qplot(size, time, data=d, group=type, colour=type, geom=c("point","path"),
	xlab="Matrix size", ylab="Time in sec.",
	main="Speed comparison bigcor / bigcorPar")
	dev.off()

view raw benchmark.r hosted with ❤ by GitHub

Monday, January 14, 2013

Air quality analysis from Beijing twitter feed.

As air pollution in Beijing reach new high [NYT article]. I re-ran the analysis I put online a few months ago.

"Crazy bad" is a good description when it reach those levels. But I am sure there are other place like Mexico city, LA etc... that also look as dramatic as those number for Beijing. The fact that the machine is tweeting make the analysis so easy. I hope it keep tweeting and that other place in the world do the same.

From the NYT article:

The existence of the embassy’s machine and the @BeijingAir Twitter feed have been a diplomatic sore point for Chinese officials. In July 2009, a Chinese Foreign Ministry official, Wang Shu’ai, told American diplomats to halt the Twitter feed, saying that the data “is not only confusing but also insulting,” according to a State Department cable obtained by WikiLeaks. Mr. Wang said the embassy’s data could lead to “social consequences.”

Friday, December 21, 2012

Computing an empirical pFDR in R

The positive false discovery rate (pFDR) has become a classical procedure to test for false positive. It is one of my favourite because it rely on a re-sampling approach.

I base my implementation on John Storey PNAS paper and the technical report he published with Rob Tibshirani while at Stanford [1-2] (I find the technical report much more didactic than the PNAS paper).

I will not describe here why when considering multiple tests simultaneously you need to control for multiple hypothesis testing (here is a link for that: Wikipedia). However, I will re-state in the terms of Storey et al.[1] the definition of pFDR:
The pFDR is the expected quantity defined by the # of false positive / # of significant features conditional there is at least one significant results. which can be written as:


pFDR = E[F/S|S>0]

That said, the probability of having at least one significant value is almost certain when we have lots of features (genes). Meaning that the above equation can be re-written as
pFDR ~= FDR ~= E[F]/E[S].
What the pFDR is measuring is the probability that the feature considered is false positive at the p-value level of this feature. For example gene A has a p-value of 0.07 and a pFDR of 0.0001. If we consider gene A to be significant the chance that it is a false positive is very low. Storey wrote it like that:

"[...] the pFDR can be written as Pr(feature i is truly null | feature i is significant)[...]"

In short this is a p-value for your p-value.
Now let's compute the pFDR for a concrete example. Here, we will use a genome-wide gene expression study comparing two groups of patients (gene as row, patients as column). But the scenario can be applied to any type of data (SNPs, proteins, patient values...) as long as you have class labels. That said, it is a tiny bit idiotic to implement an FDR test for gene expression data as there are several R packages providing this functionality already but the aim here is to know how it works.

Practical thoughts to keep in mind regarding the FDR:
1. Because it relies on a random sampling procedure, results between runs will never exactly look the same. But increasing the number of random shuffling will generate more and more similar results.
2. It might be obvious to some but it is worth noting that if you do not have groups or an order in your columns (=samples) you will not be able to shuffle the labels and thus compute the FDR.

The work:
Starting from a normalized gene expression matrix generated like that:

	library(GEOquery)
	## Download the data from GEO
	GDS3716 <- getGEO('GDS3716')
	# transform the GDS to and expressionSet
	eset <- GDS2eSet(GDS3716,do.log2=TRUE)
	phenoData <- pData(eset)
	# keep only the ER+ and ER-
	samples <- phenoData$sample[grep("ER", phenoData$specimen)]
	# subsetting the expressionSet
	eset <- eset[,samples]
	# transforming to matrix
	e <- exprs(eset)
	# For the sake of the example here we keep only the first 2000 probes.
	e <- e[1:2000,samples]

view raw getting_some_data.r hosted with ❤ by GitHub

From this small expression matrix of 2000 genes by 18 samples with two groups we will compute a p-value using a two-sided t-test.

	## EXTRACTING CLASS LABELS
	classLabel <- sub("^ER(.*) breast cancer", "\\1", grep("ER", phenoData$specimen, value=T))
	classLabel
	[1] "-" "-" "-" "-" "-" "-" "-" "-" "-" "+" "+" "+" "+" "+" "+" "+" "+" "+"

	## COMPUTING P-VALUE DISTRIBUTION
	minus = which(classLabel=="-")
	plus = which(classLabel=="+")
	p <- apply(e, 1, function(x){t.test(as.numeric(x[minus]), as.numeric(x[plus]))$p.value})

view raw empirical_FDR.r hosted with ❤ by GitHub

We obtained a vector of p-values that we will use to obtain the q-values. To do that, we need to evaluate the number of false positive (pFDR ~= E[F]/E[S]).
We will achieve that by re-computing the p-value using shuffled class labels to see if just by random chance we can obtain lower p-values than with the original class label. The shuffling will re-label some ER+ samples as ER- and vice-versa.
In a way, we aim at estimating the robustness of the p-value we obtained.

Religious restrictions index: how do countries compare?

The Guardian DataBlog published yesterday an interesting article exploring graphically the religious intolerance across the world. The data are coming from a report published by Pew Research Center's Forum on Religion and Public Life. I like the philosophy DataBlog a lot, providing the raw data for everyone to look at.
However, I felt that the visualization could be improved. First the data are longitudinal and no temporal representation is provided. So I downloaded the Google Spreadsheet and worked it in R with googleVis. googleVis is the R API to the Google graphic library.
The data are composed of two data type:

The Government Restriction Index (GSI) [measures government laws, policies and actions that restrict religious beliefs or practices]
The Social Hostilities Index (SHI) [measures acts of religious hostility by private individuals,organizations and social groups]

The R code is the following:

	library(xlsx)
	library(googleVis)
	# I downloaded the Excel file, cleaned the headers and worked a bit
	# the column title.
	da <- read.xlsx("~/Downloads/religion.xlsx", sheetName=1)
	rownames(da) <- da$COUNTRY.
	da <- da[,-1]
	religion <- data.frame(country=rep(rownames(da), 3),
	year=c(rep(2007, dim(da)[1]), rep(2009, dim(da)[1]), rep(2010, dim(da)[1])),
	GRI=c(da$GRI_2007, da$GRI_2009, da$GRI_2010),
	SHI=c(da$SHI_2007, da$SHI_2009, da$SHI_2010)
	)

	M <- gvisMotionChart(religion, idvar="country", timevar="year")
	plot(M)

view raw religion_index.r hosted with ❤ by GitHub

I like it better to explore those data. Select a country of interest and follow it.

Tuesday, July 31, 2012

Twitter analysis of air pollution in Beijing

One of the air pollution detection machine in Beijing (at the American Embassy) is connected to Twitter and tweet about the air quality in real time. By default the machine in Beijing output the 24hr summary PM2.5 air pollution information. What is PM2.5 is define here

Next will be to compare the pollution level between different cities such as LA and Beijing. But it turns out the air quality data for California are not so easy to get programmatically.

Here is the code I used to produce this analysis:

Rcpp vs. R implementation of cosine similarity

While speeding up some code the other day working on a project with a colleague I ended up trying Rcpp for the first time. I re-implemented the cosine distance function using RcppArmadillo relatively easily using bits and pieces of code I found scattered around the web. But the speed increase was not as much as I expected comparing the Rcpp code to pure R.

	require(inline)
	require(RcppArmadillo)

	## extract cosine similarity between columns
	cosine <- function(x) {
	y <- t(x) %*% x
	res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y))))
	return(res)
	}

	cosineRcpp <- cxxfunction(
	signature(Xs = "matrix"),
	plugin = c("RcppArmadillo"),
	body='
	Rcpp::NumericMatrix Xr(Xs); // creates Rcpp matrix from SEXP
	int n = Xr.nrow(), k = Xr.ncol();
	arma::mat X(Xr.begin(), n, k, false); // reuses memory and avoids extra copy
	arma::mat Y = arma::trans(X) * X; // matrix product
	arma::mat res = (1 - Y / (arma::sqrt(arma::diagvec(Y)) * arma::trans(arma::sqrt(arma::diagvec(Y)))));
	return Rcpp::wrap(res);
	')

	mat <- matrix(rnorm(100000), ncol=1000)

	x <- cosine(mat)
	y <- cosineRcpp(mat)
	identical(x, y)
	[1] TRUE

view raw Rcpp_cosine.r hosted with ❤ by GitHub

And here is the speed comparison...