I base my implementation on John Storey's PNAS paper and the technical report he published with Rob Tibshirani while at Stanford [1-2] (I find the technical report much more didactic than the PNAS paper).
I will not describe here why, when considering multiple tests simultaneously, you need to control for multiple hypothesis testing (here is a link for that: Wikipedia). However, I will restate, in the terms of Storey et al., the definition of the pFDR:
The pFDR is the expected ratio of the number of false positives to the number of significant features, conditional on there being at least one significant result. It can be written as:
pFDR = E[F/S | S > 0]
That said, the probability of having at least one significant value is almost certain when we have lots of features (genes), which means the above equation can be rewritten as
pFDR ~= FDR ~= E[F]/E[S].
What the pFDR measures is the probability that a feature called significant at its p-value level is a false positive. For example, gene A has a p-value of 0.07 and a pFDR of 0.0001: if we consider gene A significant, the chance that it is a false positive is very low. Storey wrote it like this:
"[...] the pFDR can be written as Pr(feature i is truly null | feature i is significant) [...]"
In short, this is a p-value for your p-value.
Now let's compute the pFDR for a concrete example. Here, we will use a genome-wide gene expression study comparing two groups of patients (genes as rows, patients as columns). But the scenario can be applied to any type of data (SNPs, proteins, patient values...) as long as you have class labels. That said, it is a tiny bit idiotic to implement an FDR test for gene expression data, as several R packages already provide this functionality, but the aim here is to understand how it works.
Practical thoughts to keep in mind regarding the FDR:
1. Because it relies on a random sampling procedure, results between runs will never look exactly the same. But increasing the number of random shufflings will generate more and more similar results.
2. It might be obvious to some, but it is worth noting that if you do not have groups or an order in your columns (= samples), you will not be able to shuffle the labels and thus cannot compute the FDR.
Starting from a normalized gene expression matrix generated like this:
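The original code for generating the matrix is not shown here, so as a stand-in here is a minimal sketch in Python/NumPy (the original analysis was presumably done in R). The 9/9 ER+/ER- split, the effect size, the fraction of differential genes, and the seed are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

n_genes, n_samples = 2000, 18
# Class labels: assumed 9 ER+ (1) and 9 ER- (0) samples; the post only
# says two groups of patients.
labels = np.array([1] * 9 + [0] * 9)

# Null genes: same standard-normal distribution in both groups,
# mimicking a normalized expression matrix (genes as rows).
expr = rng.normal(0.0, 1.0, size=(n_genes, n_samples))

# Make the first 100 genes differentially expressed by shifting the
# ER+ group (purely illustrative effect size).
expr[:100, labels == 1] += 1.5
```

With real data you would of course load your normalized matrix instead of simulating one.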
From this small expression matrix of 2,000 genes by 18 samples split into two groups, we will compute a p-value for each gene using a two-sided t-test.
We obtain a vector of p-values that we will use to compute the q-values. To do that, we need to estimate the number of false positives (pFDR ~= E[F]/E[S]).
We will achieve that by re-computing the p-values using shuffled class labels, to see whether just by random chance we can obtain p-values lower than with the original class labels. The shuffling will relabel some ER+ samples as ER- and vice versa.
In a way, we are estimating the robustness of the p-values we obtained.
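The permutation scheme can be sketched as follows (a Python stand-in with assumed simulated data, not the post's actual R code): for each gene, E[F] is the average number of permutation p-values at or below its observed p-value, E[S] is the observed number of p-values at or below it, and a final monotonicity pass turns the raw E[F]/E[S] ratios into proper q-values.

```python
import numpy as np
from scipy import stats

# Assumed simulated data standing in for the real expression matrix.
rng = np.random.default_rng(42)
labels = np.array([1] * 9 + [0] * 9)
expr = rng.normal(0.0, 1.0, size=(2000, 18))
expr[:100, labels == 1] += 1.5
_, p_obs = stats.ttest_ind(expr[:, labels == 1], expr[:, labels == 0], axis=1)

n_perm = 50  # increase for more stable q-values
n_genes = len(p_obs)
null_counts = np.zeros(n_genes)
for _ in range(n_perm):
    perm = rng.permutation(labels)  # shuffle the ER+/ER- labels
    _, p_null = stats.ttest_ind(expr[:, perm == 1], expr[:, perm == 0], axis=1)
    # For each observed p-value, count null p-values at or below it.
    null_counts += np.searchsorted(np.sort(p_null), p_obs, side="right")

# E[F]: average false positives per permutation at each threshold p_i.
# E[S]: observed number of genes with p <= p_i (always >= 1).
E_F = null_counts / n_perm
E_S = np.searchsorted(np.sort(p_obs), p_obs, side="right")
fdr = np.minimum(E_F / E_S, 1.0)

# Enforce monotonicity in p to obtain q-values: the q-value of a gene
# is the smallest FDR at which it (or any gene with a larger p) is called.
order = np.argsort(p_obs)
q = np.minimum.accumulate(fdr[order][::-1])[::-1]
q_values = np.empty_like(q)
q_values[order] = q
```

Note that this simple sketch implicitly sets the proportion of truly null genes (pi0) to 1; Storey's full method also estimates pi0 from the p-value distribution, which makes the q-values less conservative.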
What we can see here is that using the empirical FDR approach we obtain 184 significant results at the 0.05 level. Using a more stringent method such as the Benjamini-Hochberg procedure, we find no significant result. An important point to keep in mind is that because the method relies on re-sampling the class labels, if you do not have a sufficient number of samples the q-values will lose reliability. In those cases, procedures like Benjamini-Hochberg, Bonferroni, etc. are wiser choices. I personally judge the size of the data set by asking myself: are those samples representative of the entire population? In this example we have 18 human breast cancer samples, and I would say it is a bit borderline.
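For reference, the Benjamini-Hochberg step-up procedure needs no resampling at all; a small self-contained sketch (the input p-values are made up for illustration):

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjusted p-values: sort, scale each
    p-value by n/rank, then enforce monotonicity from the largest down."""
    p = np.asarray(p_values, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted

# Illustrative p-values (not from the expression example above).
p = np.array([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205])
adj = benjamini_hochberg(p)  # adj[0] == 0.008, i.e. 0.001 * 8 / 1
```

Comparing these adjusted p-values against your alpha level gives the BH-significant set.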
1. Storey JD, Tibshirani R. 2003. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 100: 9440–9445.
2. Storey JD, Tibshirani R. 2001. Estimating the positive false discovery rate under dependence, with applications to DNA microarrays. Technical report, Department of Statistics, Stanford University. [PDF]