Friday, December 21, 2012

Computing an empirical pFDR in R

The positive false discovery rate (pFDR) has become a classical procedure to test for false positive. It is one of my favourite because it rely on a re-sampling approach.

I base my implementation on John Storey PNAS paper and the technical report he published with Rob Tibshirani while at Stanford [1-2] (I find the technical report much more didactic than the PNAS paper).

I will not describe here why when considering multiple tests simultaneously you need to control for multiple hypothesis testing (here is a link for that: Wikipedia). However, I will re-state in the terms of Storey et al.[1] the definition of pFDR:
The pFDR is the expected quantity defined by the # of false positive / # of significant features conditional there is at least one significant results. which can be written as:
pFDR = E[F/S|S>0]
That said, the probability of having at least one significant value is almost certain when we have lots of features (genes). Meaning that the above equation can be re-written as
pFDR ~= FDR ~= E[F]/E[S].
What the pFDR is measuring is the probability that the feature considered is false positive at the p-value level of this feature. For example gene A has a p-value of 0.07 and a pFDR of 0.0001. If we consider gene A to be significant the chance that it is a false positive is very low. Storey wrote it like that:
"[...] the pFDR can be written as Pr(feature i is truly null | feature i is significant)[...]"
In short this is a p-value for your p-value.
Now let's compute the pFDR for a concrete example. Here, we will use a genome-wide gene expression study comparing two groups of patients (gene as row, patients as column). But the scenario can be applied to any type of data (SNPs, proteins, patient values...) as long as you have class labels. That said, it is a tiny bit idiotic to implement an FDR test for gene expression data as there are several R packages providing this functionality already but the aim here is to know how it works.

Practical thoughts to keep in mind regarding the FDR:
1. Because it relies on a random sampling procedure, results between runs will never exactly look the same. But increasing the number of random shuffling will generate more and more similar results.
2. It might be obvious to some but it is worth noting that if you do not have groups or an order in your columns (=samples) you will not be able to shuffle the labels and thus compute the FDR.

The work:
Starting from a normalized gene expression matrix generated like that:
From this small expression matrix of 2000 genes by 18 samples with two groups we will compute a p-value using a two-sided t-test.
We obtained a vector of p-values that we will use to obtain the q-values. To do that, we need to evaluate the number of false positive (pFDR ~= E[F]/E[S]).
We will achieve that by re-computing the p-value using shuffled class labels to see if just by random chance we can obtain lower p-values than with the original class label. The shuffling will re-label some ER+ samples as ER- and vice-versa.
In a way, we aim at estimating the robustness of the p-value we obtained.