Introduction

Earlier we looked at estimating contrasts. One issue I left in the air was how to deal with multiple comparisons. Specifically, if we fit a one-way ANOVA model to data with several levels of the categorical variable, we can reject the null hypothesis that all of the means are the same, but that does not tell us which means are different. Before, we looked at the problem when we have a pre-specified idea about what the differences could be, where we can set up and test contrasts. But the more usual situation is where we don’t have any strong hypotheses about sensible contrasts, and want to explore the differences we see, to try to conclude which might be real.

The approach that is often taken is to compare all pairwise combinations of levels (control vs treatment 1, control vs treatment 2, treatment 1 vs treatment 2, etc.). But this creates a lot of tests, and that brings other problems.

If we have \(m\) tests of \(H_0\), at level \(\alpha\), what is the probability that at least one is significant when \(H_0\) is true for all tests?

Answer

The probability of one test not being significant is \(1-\alpha\), so (if the tests are independent) the probability that none of the \(m\) tests is significant is \((1-\alpha)^m\), and the probability that at least one is significant is \(1-(1-\alpha)^m\).

Calculate the probability that no test is significant, for \(\alpha=0.05\) and \(m=1, 5, 10, 20\).

(there is a point to this, related to birthdays)

Answer

For \(m=1\), \((1-\alpha)^1=0.95\)

For \(m=5\), \((1-\alpha)^5=0.77\)

For \(m=10\), \((1-\alpha)^{10}=0.60\)

For \(m=20\), \((1-\alpha)^{20}=0.36\)

Notice how the probabilities go down: even with 5 tests there is a 23% chance of getting at least one significant test.

We can plot this for \(m\) up to 50:
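A minimal R sketch of such a plot, here showing the probability of at least one significant test, i.e. \(1-(1-\alpha)^m\):

```r
# Probability of at least one significant test out of m independent tests,
# each carried out at level alpha = 0.05
alpha <- 0.05
m <- 1:50
p_any <- 1 - (1 - alpha)^m
plot(m, p_any, type = "l", ylim = c(0, 1),
     xlab = "Number of tests, m",
     ylab = "P(at least one significant test)")
abline(h = alpha, lty = 2)  # nominal level for a single test
```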

So with lots of tests we have a high probability of getting at least one false positive. This means that if we just use these tests and don’t adjust anything, we are very likely to conclude that there are effects when it is just noise.

This is related to the birthday problem.

By chance, then, we would expect that some tests will be statistically significant. This can get extreme with lots of tests: in genetics, analyses regularly involve thousands of tests. This means that by chance we will get significant results, and if we are unlucky, most of our significant tests may be Type I errors (i.e. false positives). We need to correct for this, to try to reduce the proportion of false results. There are a few ways to do this.

One approach is based on the family-wise error rate. The idea is that if we do \(m\) tests, we want to control the probability of at least one false positive result, by setting that to \(\alpha_{FWER}\), and using that to determine the \(\alpha\) for each individual test.

A more recent alternative is the false discovery rate. This approach controls the proportion of positive tests that are false positives (i.e. false discoveries). So it lets some tests return positives, and controls the proportion of those that are not true positives.

Example: Potatoes and using multiple tests

As an example we will use the same data we have used before, on potato yields. But here we will look at differences between potato varieties.

This is the data; the crosses are the variety means.

The original data had 3 replicates per treatment per variety, hence there are 9 observations for each variety (we know that there is no treatment effect from previous weeks). Some varieties, such as the Duke of York, appear to have a lower yield.

Variety              Mean
Ajax                 3.34
Arran comrade        2.26
British queen        3.14
Duke of York         1.78
Epicure              2.16
Great Scot           3.40
Iron duke            3.25
K of K               3.96
Kerrs pink           3.31
Nithsdale            2.82
Tinwald perfection   3.04
Up-to-date           3.84
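A minimal R sketch of how these means could be calculated, assuming the data is in a data frame called potatoes with columns Yield and Variety (these names are assumptions, not taken from the original analysis):

```r
# Mean yield for each variety (data frame and column names are assumed)
with(potatoes, round(tapply(Yield, Variety, mean), 2))
```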

We can check this with a one-way ANOVA:

            Df   Sum Sq   Mean Sq   F value   Pr(>F)
Variety     11   43.638   3.967     13.572    < 0.0001
Residuals   96   28.061   0.292
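A sketch of the corresponding R call, under the same assumptions about the data frame and column names:

```r
# One-way ANOVA of yield on variety
fit <- aov(Yield ~ Variety, data = potatoes)
summary(fit)
```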

The varieties differ, but there are 12 of them, so comparing them is difficult. The standard approach in this case is to compare all pairs of means, which is \(k(k-1)/2\) comparisons for \(k\) groups, or 66 comparisons here. If we compare at 5%, we would expect about 3 significant tests by chance, even if there were truly no differences.

We can compare each pair of means with a simple t-test. We can see the first few here:

               Arran comrade  British queen  Duke of York  Epicure  Great Scot
Ajax                1.99           0.38          2.89        2.18      -0.12
Arran comrade                     -1.62          0.90        0.19      -2.11
British queen                                    2.51        1.81      -0.49
Duke of York                                                -0.71      -3.01
Epicure                                                                -2.30

For each of these tests we can calculate the p-values:

               Arran comrade  British queen  Duke of York  Epicure  Great Scot
Ajax               0.049          0.7077        0.0048      0.0315     0.9070
Arran comrade                     0.1091        0.3725      0.8504     0.0374
British queen                                   0.0136      0.0740     0.6230
Duke of York                                                0.4813     0.0034
Epicure                                                                0.0236
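In R, one standard way to get unadjusted pairwise p-values is pairwise.t.test(), shown here as a sketch; it uses the pooled standard deviation for every comparison, so the values may not match the tables above exactly:

```r
# All pairwise comparisons, with no multiplicity adjustment
with(potatoes, pairwise.t.test(Yield, Variety, p.adjust.method = "none"))
```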

The standard t-test at \(\alpha=0.05\) would have a critical value of 1.98, and 19 of the comparisons pass that threshold. We would expect about 3 to pass it by chance. So how many are “really” significant?

These are the comparisons for which the p-value is less than 0.05:

Variety 1       Variety 2            t statistic    P value
Duke of York    K of K                -4.034276    0.0001100
Duke of York    Up-to-date            -3.816429    0.0002397
Epicure         K of K                -3.327302    0.0012440
Arran comrade   K of K                -3.138227    0.0022579
Epicure         Up-to-date            -3.109455    0.0024672
Duke of York    Great Scot            -3.006697    0.0033709
Arran comrade   Up-to-date            -2.920380    0.0043570
Ajax            Duke of York           2.889553    0.0047693
Duke of York    Kerrs pink            -2.829953    0.0056697
Duke of York    Iron duke             -2.718975    0.0077722
British queen   Duke of York           2.513459    0.0136204
Duke of York    Tinwald perfection    -2.336715    0.0215328
Epicure         Great Scot            -2.299722    0.0236304
Ajax            Epicure                2.182578    0.0315060
Epicure         Kerrs pink            -2.122979    0.0363279
Arran comrade   Great Scot            -2.110648    0.0374017
K of K          Nithsdale              2.098317    0.0385028
Epicure         Iron duke             -2.012000    0.0470222
Ajax            Arran comrade          1.993504    0.0490441

The extreme varieties, Duke of York and K of K, appear quite frequently. Some of these comparisons are presumably significant by chance, so we want to try to identify which.

Family-Wise Error Rate

One approach to correcting for multiple testing is based on the family-wise error rate. The idea is that if we do \(m\) tests, we want the probability of at least one false positive to be \(\alpha_{FWER}\). Note that this will be conservative: if some of the null hypotheses are actually false, then we expect to see several positive tests, so the overall probability of at least one positive test will be \(>\alpha_{FWER}\).

There are a few ways to implement this idea, depending on the assumptions one wants to make. Here we will look at a couple of simple ways.

Šidák Correction

A conceptually simple idea is the Šidák Correction.

If we have \(m\) independent tests, each at level \(\alpha\), what is the probability that every test is negative? From this, what corrected value of \(\alpha\) would you use so that the probability that any test was significant was \(\alpha_{FWER}=0.05\)?

As you may guess, this is the Šidák Correction.

Answer

First, the probability of no test being significant is \((1-\alpha)^m\) (because they are independent), so the probability of at least one significant test is \(\alpha_{FWER} = 1-(1-\alpha)^m\). Then,

\[ \begin{align} \alpha_{FWER} &= 1-(1-\alpha)^m \\ (1-\alpha_{FWER})^{1/m} &= 1-\alpha \\ \alpha &= 1-(1-\alpha_{FWER})^{1/m} \end{align} \]

The Šidák Correction is still conservative: the value of \(\alpha\) is only exact if all of the null hypotheses are true, and if the tests are independent.

Example

With the Šidák Correction with 66 tests, the critical p-value is \(1-(1-0.05)^{1/66}=0.00078\). Of the 66 tests, 2 are significant: the comparisons of Duke of York with K of K and with Up-to-date.
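A quick R sketch of this calculation; here p is assumed to be the vector of 66 unadjusted p-values:

```r
m <- 66
alpha_sidak <- 1 - (1 - 0.05)^(1/m)   # about 0.00078
alpha_sidak
sum(p < alpha_sidak)                  # number of significant comparisons
```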

Bonferroni Correction

A more popular alternative to the Šidák Correction, and for many years the most popular multiple comparison method, was the Bonferroni Correction. It is slightly more conservative than the Šidák Correction, but simpler. It is based on Boole’s Theorem.

What, you are wondering, is Boole’s Theorem? It is this:

\[ Pr(\bigcup_{i=1}^\infty A_i ) \le \sum_{i=1}^\infty Pr(A_i) \]

where \(A_1, A_2, \dots\) are a countable collection of events. We only need the finite version of this, so we want to prove

\[ Pr(\bigcup_{i=1}^n A_i ) \le \sum_{i=1}^n Pr(A_i) \]

There are a few ways to do it (draw a Venn diagram if you want the intuitive explanation). Here we will use induction.

For \(n=1\) we know \(Pr(A_1) \le Pr(A_1)\). Of course, we know a bit more than that, but this is all we need.

Show that \(Pr(A_1 \cup A_2) \le Pr(A_1) + Pr(A_2)\), and thus \(Pr(\bigcup_{i=1}^n A_i ) \le \sum_{i=1}^n Pr(A_i)\)

(this is obvious if you know the right equation)

Answer

The first part is because \(Pr(A_1 \cup A_2) = Pr(A_1) + Pr(A_2) - Pr(A_1 \cap A_2) \le Pr(A_1) + Pr(A_2)\). Then by induction we can take the union of the first \(n\) sets and add one more element:

Then, more generally,

\[ \begin{align} Pr(\bigcup_{i=1}^{n+1} A_i ) &= Pr(\bigcup_{i=1}^{n} A_i ) + Pr(A_{n+1}) - Pr((\bigcup_{i=1}^{n} A_i) \cap A_{n+1}) \\ &\le \sum_{i=1}^n Pr(A_i) + Pr(A_{n+1}) \\ &= \sum_{i=1}^{n+1} Pr(A_i) \end{align} \]

Now that we have this, we can use it to develop a simple correction.

Use Boole’s Theorem to develop a correction for multiple tests.

(this is obvious if you know the right equation)

Answer

If \(A_i\) is the event that test \(i\) gives a false positive, and each test is carried out at the same level \(\alpha_B\), then Boole’s Theorem becomes

\[ Pr(\bigcup_{i=1}^m A_i ) \le \sum_{i=1}^m \alpha_B = m\alpha_B \]

Thus, if we want the family-wise error rate to be at most \(\alpha\), we need \(m\alpha_B \le \alpha\), so we can use \(\alpha_B = \alpha/m\).

This has the advantage over the Šidák Correction that it is easier to calculate, and does not assume independence. It is, however, very slightly more conservative.

Example

With the Bonferroni Correction with 66 tests, the critical p-value is \(0.05/66=0.00076\). Of the 66 tests, 2 are significant. These are the same two tests as before (not surprisingly).
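The same decisions can be obtained by adjusting the p-values rather than the threshold; a sketch, again assuming p is the vector of 66 unadjusted p-values:

```r
alpha_bonf <- 0.05 / 66                              # about 0.00076
sum(p < alpha_bonf)                                  # number of significant comparisons
# equivalently, multiply each p-value by m (capped at 1) and compare to 0.05
sum(p.adjust(p, method = "bonferroni") < 0.05)
```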

Tukey’s Range Test

An alternative to correcting the p-values is to look at the distribution of estimates, and compare them to what would be expected under the null hypothesis. This is §12.3 of Larsen and Marx.

We assume we are comparing the means of \(k\) groups, each of size \(n_j=r\), which are assumed to be normally distributed with a constant variance that is estimated as \(s^2\), with \(\nu\) degrees of freedom. Our null hypothesis is \(H_0: \mu_1=\mu_2= \dots =\mu_k\). So this is just the standard one-way ANOVA design we had in Week 9. We can use ANOVA to test \(H_0\), but that does not tell us which levels of the treatment are different. Tukey suggested looking at the distribution of the maximum difference in means.

We calculate our test statistic as

\[ q = \frac{\bar{y}_{max} - \bar{y}_{min}}{s/\sqrt{r}} \]

And this follows a studentised range distribution, \(f_R(q | k,\nu)\) with the following pdf:

\[ f_R(q | k,\nu) = \frac{\sqrt{2\pi}k(k-1)\nu^{\nu/2} }{\Gamma(\nu/2) 2^{\nu/2-1}} \int_0^\infty {s^\nu \phi( \sqrt{\nu} s) \left[ \int_{-\infty}^\infty {\phi( z+q s)\phi( z) [\Phi( z+q s)-\Phi(z)]^{k-2}} dz \right]} ds \]

We will not prove this, or look at it in more detail (\(\phi(z)\) is the pdf for a standard normal distribution, and \(\Phi(z)\) is the corresponding cdf), as it is rather horrifying.

In practice, we can find critical values with either tables or a computer, and then carry out the test using the same procedure as a normal t-test (although the critical value is rather different).

Example: Potatoes and using multiple tests

The smallest mean is 1.78 for Duke of York, and the largest is 3.96 for K of K. We have 12 means, \(r=9\), and the pooled standard deviation is \(s = 0.54\) (the square root of the residual mean square from the ANOVA table). Thus

\[ q = \frac{\bar{y}_{max} - \bar{y}_{min}}{s/\sqrt{r}} = \frac{ 3.96-1.78}{0.54/\sqrt{9}} = 12.11 \]

The critical value for a Studentized Range Distribution with 96 degrees of freedom and 12 means is 4.74. So the largest difference exceeds it, and we can reject \(H_0\). We can go further, and compare every pair of means against the same critical value.

Of the 66 pairwise differences, 27 are larger than this threshold. This is rather a lot, simply because several varieties really do differ.
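In R, the critical value can be found with qtukey(), and the full set of pairwise comparisons can be carried out with TukeyHSD() on a fitted aov object (here assuming the fit from the earlier sketch, called fit):

```r
# Critical value of the studentised range for 12 means and 96 residual df
qtukey(0.95, nmeans = 12, df = 96)    # about 4.74
# Tukey's honest significant differences for all pairs of varieties
TukeyHSD(fit)
```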

False Discovery Rate

The FWER approach has been criticised for being too conservative: there is an \(\alpha_{FWER}\) probability of even a single test coming out as significant. But if we are doing these tests then a priori we expect there to be at least one significant test, so strictly the justification for the FWER correction is wrong. We need something better, and the False Discovery Rate (FDR) is the main alternative.

The idea behind the false discovery rate is to control the proportion of false positive tests. The FDR is \(Q=E(V/R)\), where \(R\) is the number of rejected null hypotheses and \(V\) is the (unknown) number of false positives, i.e. it is the expected proportion of rejected hypotheses that should not have been rejected. So if all of the null hypotheses are true, every rejection is a false positive and \(V/R\) is 1 whenever anything is rejected; if some null hypotheses are false, it will be less than 1.

How do we use this when we don’t know how many false positives we have? One solution is the Benjamini–Hochberg procedure. This works by ordering the p-values so that \(P_{(1)} \le P_{(2)} \le \ldots \le P_{(m)}\), and finding the largest \(k\) such that

\[ P_{(k)} \le \frac{k}{m} \alpha \]

We can visualise this by plotting \(P_{(k)}\) against \(k\), and adding the line \(P_{(k)} = \frac{\alpha}{m}k\):

Then we reject all of the hypotheses up to the largest \(k\) for which \(P_{(k)}\) lies on or below the line (even if some earlier p-values happen to lie above it):
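A minimal sketch of the procedure in R, again assuming p is the vector of 66 unadjusted p-values; the built-in p.adjust() gives the same decisions:

```r
bh_reject <- function(p, alpha = 0.05) {
  m <- length(p)
  ord <- order(p)                                   # indices of sorted p-values
  below <- which(sort(p) <= (1:m) / m * alpha)      # points on or below the line
  k <- if (length(below) > 0) max(below) else 0     # largest such k
  rejected <- rep(FALSE, m)
  if (k > 0) rejected[ord[1:k]] <- TRUE
  rejected
}
sum(bh_reject(p))                                   # number of discoveries
sum(p.adjust(p, method = "BH") <= 0.05)             # same thing, via p.adjust
```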

We will not look at a formal proof of this (see the original paper if you want the details).

Example: Potatoes and FDR

For the potatoes we can carry out the procedure:

According to Benjamini–Hochberg we have 9 significant comparisons. These are the top 9 rows of the table of p-values above.