Introduction: Why?

So far we have mainly looked at data where the random error was normally distributed. But not all data are like that. Data can be weird: skewed, long-tailed, short-tailed, multimodal etc. There are several ways to deal with this: sometimes it is not a big problem, and alternatives include transforming the data or assuming a different distribution. However, the data may still not cooperate (or the alternatives may be too messy). So one approach has been to not deal with the distribution of the data at all, but instead to work with the relative order of the data, i.e. with the ranks. This area is called non-parametric statistics.

If we are dealing with ranks, we cannot test hypotheses about the mean of the distribution, because we are discarding that information. Instead, we work with the median and quantiles: half of the data are above the median, so we can use this idea to develop tests.

There are advantages and disadvantages to non-parametric methods. They relax the assumption that the data are from a particular distribution, but the cost of this is that they are not as powerful, i.e. if \(H_0\) is false, non-parametric tests need more data to show it (§6.4 discusses power in more detail if you want a reminder).

If we are going to work with comparing ranks, we have to work out their distribution under the null hypothesis. This is often not trivial, but there are three approaches:

  • derive the exact distribution
  • use an approximation: with a large enough sample size a lot of test statistics are approximately normally distributed
  • simulate the distribution

The third approach is probably the one most people would take now, but it’s the least interesting mathematically. So we will look at the other two approaches.

Sign Test

Probably the simplest non-parametric test is the sign test. The idea is simple: we draw \(Y\) from a continuous distribution and we want to test whether \(\tilde{y}>\mu\), where \(\tilde{y}\) is the median of the distribution (rather than the mean) and \(\mu\) is some hypothesised value.

Our null hypothesis is then that the number of data points above \(\mu\) should follow a binomial distribution, i.e. if our sample size is \(N\) and \(k\) values are above \(\mu\), then \(k\sim Binom(N, \frac{1}{2})\).

From this, we can first work out the exact test (for small samples in the pre-computer days), and then an approximate large-sample test.

Based on the binomial distribution, (a) what test would we use to test the null hypothesis that \(\tilde{y}>\mu\), where \(\tilde{y}\) is the median of \(Y\) (i.e. a one-tailed test)? And then (b) how would you test \(\tilde{y} \ne \mu\)?

Answer

We assume \(k\sim Binom(N, \frac{1}{2})\).

  1. For a one-tailed test we can calculate \(Pr(K \le k)\), where \(K\) is the number of values above \(\mu\), i.e. 

\[ S = \sum_{i=0}^{k} \frac{N!}{i! (N-i)!} \left(\frac{1}{2}\right)^i \left(\frac{1}{2}\right)^{N-i} = \frac{1}{2^N}\sum_{i=0}^{k} \frac{N!}{i! (N-i)!}. \]

We reject \(H_0\) if \(S < \alpha\) (e.g. \(\alpha=0.05\)).

  2. For a two-tailed test, if \(k < N/2\) we calculate \(Pr(K \le k) + Pr(K \ge N-k)\), i.e. 

\[ \begin{align} S &= \sum_{i=0}^{k} \frac{N!}{i! (N-i)!} \left(\frac{1}{2}\right)^i \left(\frac{1}{2}\right)^{N-i} + \sum_{i=N-k}^{N} \frac{N!}{i! (N-i)!} \left(\frac{1}{2}\right)^i \left(\frac{1}{2}\right)^{N-i}\\ &= \frac{N!}{2^N}\left(\sum_{i=0}^{k} \frac{1}{i! (N-i)!} + \sum_{i=N-k}^{N} \frac{1}{i! (N-i)!} \right). \end{align} \]
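As a quick check of the two formulas above, here is a minimal sketch in R (the numbers \(N=12\) and \(k=3\) are made up purely for illustration):

```r
# Exact sign test p-values from the binomial distribution with p = 1/2
N <- 12; k <- 3
pbinom(k, N, 0.5)                                    # one-tailed: Pr(K <= k)
pbinom(k, N, 0.5) + (1 - pbinom(N - k - 1, N, 0.5))  # two-tailed (here k < N/2)
binom.test(k, N, p = 0.5)                            # built-in equivalent
```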

Without access to a computer, this can be a pain to calculate if \(N\) is large. And even with a computer, it can be helpful to quickly calculate an approximation (e.g. to get a general idea of whether the null hypothesis is reasonable). We can do that because when \(N\) is large, the distribution of \(k\) is approximately normal.

So, let's develop an approximate sign test for large \(N\)!

If \(N\) is large, what (approximate) test could we use to test the null hypothesis that \(\tilde{y}\ne\mu\), where \(\tilde{y}\) is the median of \(Y\)?

Hint You will need to work out what the mean and variance of \(k\) are under the null hypothesis.
Answer

If \(k\sim Binom(N, \frac{1}{2})\), the mean is \(N\times\frac{1}{2}=N/2\) and the variance is \(N\frac{1}{2}(1-\frac{1}{2})=N/4\). So, approximately, \(k \sim N\left(\frac{N}{2}, \sqrt{N/4} \right)\) (written as mean and standard deviation).

If \(z_{1-\alpha/2}\) is the usual critical value of a standard normal distribution (e.g. \(z_{0.975}=1.96\)), we retain \(H_0\) if

\[ \frac{N}{2} - z_{1-\alpha/2} \sqrt{N/4}< k < \frac{N}{2} + z_{1-\alpha/2} \sqrt{N/4} \]

And we reject \(H_0\) if this is not true.
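As a small sketch (not a full implementation), the approximate test can be wrapped up in a couple of lines of R; the function name and the numbers are made up for illustration:

```r
# Large-sample sign test: standardised count of values above the hypothesised median
sign_test_z <- function(k, N) (k - N / 2) / sqrt(N / 4)
sign_test_z(k = 130, N = 225)   # compare with +/- 1.96 for a two-tailed test at 5%
```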

We can also use a sign test for paired data.

If we have paired data, \(X_i\), \(Y_i\), \(i=1, \dots, N\), and we want to test if \(\tilde{\mu}_X \ne \tilde{\mu}_Y\), how would we adapt the sign test to do this?

Hint This is largely about getting something that can be plugged into what we have above.
Answer

If \(\tilde{\mu}_X = \tilde{\mu}_Y\), \(Pr(X_i > Y_i)=\frac{1}{2}\), so we can count how many times \(X_i > Y_i\), and plug that into the test above.
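For instance, a minimal R sketch of the paired version, with invented paired measurements:

```r
# Paired sign test: count the pairs with x_i > y_i and treat that as Binom(N, 1/2)
x <- c(2.1, 3.4, 2.8, 4.0, 3.3, 2.9)
y <- c(2.5, 3.1, 3.0, 4.4, 3.9, 3.2)
k <- sum(x > y)
binom.test(k, length(x), p = 0.5)   # exact two-tailed sign test
```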

Example: Pipes

We can look at a problem from Chapter 7, specifically problem 7.4.21, which was an exercise. The data are the maximum pit depths in pipes, i.e. how much they have corroded. The question is whether the depths are less than 0.0042 inches. These are the data, with an indicator of whether each data point is above 0.0042:

Pipes: 0.0039 0.0041 0.0038 0.0044 0.0040 0.0036 0.0034 0.0036 0.0046 0.0036
Above 0.0042: 0 0 0 1 0 0 0 0 1 0

So 2 pipes are above 0.0042 inches. The exact test is then

\[ \begin{align} S &= \frac{N!}{2^N}\sum_{i=0}^{2} \frac{1}{i! (N-i)!} \\ &= \frac{10!}{2^{10}} \left( \frac{1}{0! \, 10!} + \frac{1}{1! \, 9!} + \frac{1}{2! \, 8!}\right)\\ &= \frac{3628800}{1024} \left( \frac{1}{3628800} + \frac{1}{362880} + \frac{1}{80640}\right) \\ &\approx 0.055. \end{align} \]

This suggests the result is not significant at the 5% level.

We can also try the normal approximation. Under the null hypothesis \(k\) is approximately normal with mean \(N/2 = 5\) and variance \(N/4 = 2.5\), so the test statistic is \(\frac{k - N/2}{\sqrt{N/4}} =\frac{2 - 5}{1.58} = -1.89\). The one-tailed p-value for this is 0.029, which would suggest it was significant at 5%.

Thus we get different conclusions depending on how we do the test. The approximate test is, well, approximate, and the sample size of 10 is probably too small to really trust it.

But also note that although the p-values are either side of the 5% cut-off, they are not that different: 0.055 and 0.029. Both are so close to the boundary that the decision could easily have gone the other way.
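If you want to check these numbers yourself, a short R sketch (using only the counts above) is:

```r
# Exact and approximate sign tests for the pipes data: k = 2 of N = 10 above 0.0042
pbinom(2, 10, 0.5)                                 # exact one-tailed p-value, about 0.055
pnorm((2 - 10 / 2) / sqrt(10 / 4))                 # normal approximation, about 0.029
binom.test(2, 10, p = 0.5, alternative = "less")   # built-in exact test
```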

Ranks

Rank-based tests are based on the idea that under the null hypothesis the ranks are arbitrary. But we can choose ranks in different ways. Here we will look at data from continuous distributions: if the distributions are discrete then we can get ties, which can be handled by modifying the tests slightly.

Wilcoxon Test: one-sample not-t test

Assume we have data \(y_1, y_2, \dots, y_n\) drawn from some continuous and symmetric distribution with median \(\mu\). We want to test \(H_0: \mu = \mu_0\), i.e. if the median of a distribution equals a set value (not the mean as the book says!).

We start by ranking the absolute deviations of the data from \(\mu_0\), i.e. \(r_i\) is the rank of \(|y_i-\mu_0|\). If the distribution is symmetric about \(\mu_0\) (as it is under \(H_0\)), the sign of \(y_i-\mu_0\) should not matter. So we define the sign, \(z_i\), as

\[ z_i = \left\{ \begin{array}{ll} 0 & \textrm{if } y_i - \mu_0 < 0 \\ 1 & \textrm{if } y_i - \mu_0 > 0 \\ \end{array} \right. \]

So, for each data point we have \(y_i\), \(|y_i-\mu_0|\), \(r_i\) and \(z_i\). The Wilcoxon signed rank statistic is the sum of the ranks of the data for which \(y_i - \mu_0 > 0\), i.e.

\[ w = \sum_{i=1}^n r_i z_i \]
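As a concrete sketch of this recipe in R (the data and \(\mu_0\) below are invented just to show the steps):

```r
# Computing the Wilcoxon signed rank statistic w by hand
y   <- c(5.1, 4.4, 6.2, 5.9, 4.8, 6.5)
mu0 <- 5
r <- rank(abs(y - mu0))        # ranks r_i of |y_i - mu0|
z <- as.numeric(y - mu0 > 0)   # z_i = 1 if y_i > mu0, 0 otherwise
w <- sum(r * z)                # the signed rank statistic
w
```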

Now we want to know the distribution of this statistic under \(H_0\). This is Theorem 14.3.1, which states that if we have data \(y_i, i=1,\dots,n\), drawn from continuous and symmetric distributions with pdfs \(f_i(y)\), all with the same mean \(\mu\), then under \(H_0: \mu=\mu_0\) the pdf of the signed rank statistic, \(p_W(w)\), is

\[ p_W(w) = Pr(W=w) = \left( \frac{1}{2^n}\right)c(w) \]

where \(c(w)\) is the coefficient of \(e^{wt}\) in the expansion of \(\prod_{i=1}^n(1+e^{it})\).

This is proved via moment generating functions, which are described in §3.12 of Larsen & Marx. If you can’t remember them, re-read that section, or read footnote 1 at the bottom of the page.

Our strategy is:

  1. Write out the pdf of \(w\), and get stuck working out what one term is.
  2. Write down the mgf of \(w\), including the term we are stuck with
  3. Re-write \(w\) in a different way, as \(U\), so that the mgf is easier to derive
  4. Derive the mgf of \(U\). This gives us a target to aim for
  5. Compare the two mgfs, see where they are the same, and from that derive the term we were stuck with
  6. Rejoice

Step 1

We have \(w = \sum_{i=1}^n r_i z_i\), and each \(z_i\) is either 0 or 1. There are therefore \(2^n\) possible sets of signs, and hence \(2^n\) ways of constructing the statistic. Under \(H_0\) these are equally likely, so each has probability \(2^{-n}\) of occurring. For \(c(w)\) of them the statistic equals \(w\), so the pdf is

\[ p_W(w) = \frac{c(w)}{2^n} \]

Step 2

Now we need to write down the mgf of \(W\), and then work out what \(c(w)\) is. We will approach the latter indirectly, by deriving the mgf for a distribution that is the same as \(W\). Writing down the mgf itself is (almost) too trivial to be an exercise: it is

\[ M_W(t) = E(e^{Wt}) = \sum_{w=0}^{n(n+1)/2}e^{wt}p_W(w) = \sum_{w=0}^{n(n+1)/2}e^{wt}\frac{c(w)}{2^n} \]

Why does the summation go from 0 to \(n(n+1)/2\)?

Answer

Because \(W\) is a sum of ranks, it can take values from 0 to \(\sum_{i=1}^n i = n(n+1)/2\). The value 0 occurs when all of the data are \(<\mu_0\).

I asked this because (a) it took me a bit of time to work it out, and (b) it helps emphasise this, which will help later. Also, Larsen & Marx get the start of the summation wrong. Possibly deliberately to see who is paying attention.

Step 3

\(W\) is the sum of the ranks of the data, so we can also write it as \(U=\sum_{i=1}^n U_i\) where

\[ U_i = \left\{ \begin{array}{ll} 0 & \textrm{with probability } \frac{1}{2} \\ i & \textrm{with probability } \frac{1}{2} \\ \end{array} \right. \]

To see that these are the same, imagine ordering the data by \(r_i\), the rank of the absolute deviations. Under the null hypothesis the \(i^{th}\) term contributes either 0 or \(i\) to the sum, each with probability \(\frac{1}{2}\), and the terms are independent.
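Here is a minimal simulation sketch of that claim, comparing the simulated distribution of \(U\) with R's built-in signed rank distribution (psignrank()):

```r
# Simulate U = sum of U_i, where U_i is 0 or i, each with probability 1/2
set.seed(1)
n <- 10
sims <- replicate(1e4, sum((1:n) * rbinom(n, 1, 0.5)))
mean(sims <= 16)   # simulated Pr(U <= 16)
psignrank(16, n)   # exact Pr(W <= 16) for the signed rank statistic
```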

Step 4

Show that the mgf for \(U=\sum_{i=1}^n U_i\) is \(\frac{1}{2^n}\prod_{i=1}^n (1 + e^{it})\)

Hint Work out the mgf of \(U_i\) first. Then, of course, \(U=\sum U_i\).
Answer

The m.g.f. of \(U_i\) is \(E(e^{U_i t})\), which is

\[ E(e^{U_i t}) = \frac{1}{2}e^{0t} + \frac{1}{2}e^{it} = \frac{1}{2}(1 + e^{it}) \]

The moment generating function for \(U = \sum_{i=1}^nU_i\) is then

\[ M_U(t) = \prod_{i=1}^n M_{U_i}(t) = \prod_{i=1}^n \frac{1}{2}(1 + e^{it}) = \frac{1}{2^n}\prod_{i=1}^n (1 + e^{it}) \]

For \(n=4\) we would get the following

\[ \begin{align} M_U(t) &= \frac{1}{2^4}\prod_{i=1}^4 (1 + e^{it}) \\ &= \frac{1}{2^4}(1 + e^{t})(1 + e^{2t})(1 + e^{3t})(1 + e^{4t}) \\ &= \frac{1}{16}\left(1 + e^t + e^{2t} + 2e^{3t} + 2e^{4t} + 2e^{5t} + 2e^{6t} + 2e^{7t} + e^{8t} + e^{9t} + e^{10t} \right) \end{align} \]

Note that the terms go from \(e^{0t}\) to \(e^{\frac{n(n+1)}{2} t}\).
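We can also get these coefficients numerically: the sketch below multiplies out \(\prod_{i=1}^n(1+x^i)\) by repeated convolution, and compares \(c(w)/2^n\) with R's built-in signed rank pmf (dsignrank()).

```r
# Coefficients c(w) of prod_{i=1}^n (1 + x^i), here for n = 4
n <- 4
coefs <- 1
for (i in 1:n) {
  term  <- c(1, rep(0, i - 1), 1)                     # coefficients of 1 + x^i
  coefs <- convolve(coefs, rev(term), type = "open")  # polynomial multiplication
}
round(coefs)                        # c(w) for w = 0, ..., n(n+1)/2
round(coefs) / 2^n                  # p_W(w)
dsignrank(0:(n * (n + 1) / 2), n)   # should match
```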

Step 5

The mgfs are

\[ M_U(t) =\frac{1}{2^n}\prod_{i=1}^n (1 + e^{it}) \]

and

\[ M_W(t) = \frac{1}{2^n}\sum_{w=0}^{n(n+1)/2}e^{wt}c(w) \]

And we know that if we expand \(\prod_{i=1}^n (1 + e^{it})\) we get a sum of terms in \(e^{at}\), where \(a=0,\dots,n(n+1)/2\). So set \(a=w\), and the two mgfs are the same if the coefficient of \(e^{wt}\) is \(c(w)\). And this proves Theorem 14.3.1.

Step 6

Using the Wilcoxon Test

Now we know what the null distribution is (or at least what the equation is).

How would you test the hypothesis \(H_0: \mu=\mu_0\) using a Wilcoxon Signed Rank Test?

Hint See Week 1 for setting up a test.
Answer

\(H_0: \mu=\mu_0\) and \(H_1: \mu \ne\mu_0\). Our test statistic is \(w = \sum_{i=1}^n r_i z_i\). We can create a critical region from two critical values, \(w_1^*\) and \(w_2^*\), in the sample space of \(W\) by defining \(Pr(W \le w_1^*) = 0.025\) and \(Pr(W \ge w_2^*) = 0.025\), where \(Pr(W \le w_1^*) = \sum_{w=0}^{w_1^*} p_W(w)\) and \(Pr(W \ge w_2^*) = \sum_{w=w_2^*}^{n(n+1)/2} p_W(w)\). We reject \(H_0\) if \(w \le w_1^*\) or \(w \ge w_2^*\).

These critical values are in Table A.6, but can also be calculated by hand or by computer.
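For example, here is a minimal sketch of the "by computer" option, using R's built-in signed rank distribution functions:

```r
# Critical values for a two-tailed signed rank test, n = 10, alpha = 0.05
n <- 10
qsignrank(0.025, n)   # lower critical value (roughly w1*)
qsignrank(0.975, n)   # upper critical value (roughly w2*)
psignrank(19, n)      # lower-tail probability for any observed w
```

(Because \(W\) is discrete, the achievable tail probabilities are not exactly 0.025, which is why table-based and quantile-based cut-offs can differ slightly.)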

Example: Pipes

Back to the pipes. The question is still whether the depths are less than 0.0042 inches. Now we need to calculate the ranks as well:

Pipes: 0.0039 0.0041 0.0038 0.0044 0.0040 0.0036 0.0034 0.0036 0.0046 0.0036
Rank: 6 8 5 9 7 2 1 3 10 4
Above 0.0042: 0 0 0 1 0 0 0 0 1 0

Note that there are ties in the data, but I have still given them unique ranks (here it does not matter).

Use a Wilcoxon signed rank test to test if the median pit depth is less than 0.0042 inches.

Answer

From above we see that the sum of the ranks for the data above 0.0042 is \(w=19\). This is a one-tailed test, so we reject the null if \(w\) is small enough, i.e. if \(W \le w_1^*\), where \(w_1^*\) is found from \(Pr(W \le w_1^*) = \sum_{w=0}^{w_1^*} p_W(w) = \alpha\).

If we look at Table A.6 we see that for \(n=10\) the least extreme entries listed are \(w\le 16\) and \(w \ge 39\), for which \(Pr(W \le 16) = Pr(W \ge 39) = 0.138\). Our observed \(w=19\) is less extreme than these, so it is certainly not significant at 5%.

Large Sample Test

Looking these statistics up in tables is a pain (even worse, someone had to calculate the values in the tables). But, as so often in statistics, asymptotics comes to our rescue. It turns out that when \(n\) is large, the distribution of \(w\) is approximately normal. So we can just calculate the mean and variance, and assume normality.

Under \(H_0\), what is \(E(W)\), the expected value of the Wilcoxon signed rank statistic?

You will need a couple of results: \(\sum_{i=1}^n i = \frac{n(n+1)}{2}\) and \(\sum_{i=1}^n i^2 = \frac{n(n+1)(2n+1)}{6}\).

Hint

Use the form \[ U_i = \left\{ \begin{array}{ll} 0 & \textrm{with probability } \frac{1}{2} \\ i & \textrm{with probability } \frac{1}{2} \\ \end{array} \right. \]

Answer Theorem 14.3.2: \(E(W) = \frac{n(n+1)}{4}\).

Under \(H_0\), what is the variance of the Wilcoxon signed rank statistic?

Note: See equation 3.12.6.

Answer Also Theorem 14.3.2: \(Var(W) = \frac{n(n+1)(2n+1)}{24}\).

Example: Pipes

For the pipes data above, test whether the median pit depth is less than 0.0042 inches using the normal approximation to the Wilcoxon Signed Rank test. Do the test at 5%.

Answer

\(n=10\), so

\[ E(W) = \frac{n(n+1)}{4} = \frac{10\times 11}{4} = 27.5 \]

\[ Var(W) = \frac{n(n+1)(2n+1)}{24} = \frac{10\times 11 \times 21}{24} = 96.25 \]

So under the null hypothesis \(W\) is approximately normal with mean 27.5 and variance 96.25. The observed value is 19, so the test statistic is \(\frac{19 - 27.5}{\sqrt{96.25}} = -0.87\). This is a long way from being significant (the critical value for a one-tailed test at 5% would be -1.645).
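These numbers are easy to check in R (this just re-does the arithmetic above):

```r
# Normal approximation for the pipes example: n = 10, observed w = 19
n <- 10; w <- 19
EW <- n * (n + 1) / 4                  # 27.5
VW <- n * (n + 1) * (2 * n + 1) / 24   # 96.25
z  <- (w - EW) / sqrt(VW)              # about -0.87
pnorm(z)                               # one-tailed p-value, well above 0.05
```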

There are a couple of extensions to the Wilcoxon signed rank test.

Paired Data

Assume we have paired data, \(x_i\) and \(y_i\), \(i=1,\dots,n\) (e.g. “before” and “after” measurements on individual \(i\)) and we want to ask if \(\mu_X=\mu_Y\) without assuming normality. We can do that by converting the problem into a Wilcoxon Signed Rank Test.

Derive a rank-based test for \(\mu_X=\mu_Y\) for paired data by converting the problem to a Wilcoxon Signed Rank Test.

Answer

For paired data the test of whether \(\mu_X=\mu_Y\) is the same as asking whether the differences \(D_i = x_i-y_i\) are centred on zero. So we can calculate \(D_i = x_i-y_i\) for all \(i\) and test this against a value of 0. The rest follows a standard Wilcoxon Signed Rank Test. See pp. 653-4 of Larsen & Marx.
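A minimal R sketch of this, with invented before/after data (wilcox.test() is the built-in version):

```r
# Paired Wilcoxon signed rank test via the differences
before <- c(12.1, 9.8, 11.4, 10.2, 13.0, 9.5)
after  <- c(11.3, 9.9, 10.1,  9.4, 12.2, 9.0)
d <- before - after
wilcox.test(d, mu = 0)                      # signed rank test on the differences
wilcox.test(before, after, paired = TRUE)   # equivalent built-in form
```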

Other Tests

Here are a few other non-parametric tests.

Two-sample test

Suppose we have two samples, \(x_1, x_2, \dots, x_n\) and \(y_1, y_2, \dots, y_m\), from distributions \(f_X(x)\) and \(f_Y(y)\), and we want to test if \(\mu_X=\mu_Y\). We will assume that \(n\) and \(m\) are large enough that a normal approximation will work.

The strategy is similar to the Wilcoxon Signed Rank Test. We rank all of the data together, so we get \(r_i, i=1,\dots,n+m\). Similar to before, we define a variable \(z_i\) where \(z_i=1\) if the \(i^{th}\) observation is from \(X\) and \(z_i=0\) otherwise. Then our test statistic is the sum of the ranks of the observations from \(X\):

\[ w' = \sum_{i=1}^{n+m}r_i z_i \]

The large sample results are in Theorem 14.3.4: essentially the mean and variance can be calculated, and a normal approximation used.
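As a sketch, the rank sum statistic can be computed directly, and compared with R's built-in two-sample test (the data here are invented):

```r
# Two-sample rank sum: rank everything together, sum the ranks of the x sample
x <- c(3.1, 2.7, 3.9, 3.3, 2.9)
y <- c(2.2, 2.8, 2.4, 3.0, 2.1, 2.6)
r <- rank(c(x, y))
w_prime <- sum(r[seq_along(x)])   # w' as defined above
w_prime
wilcox.test(x, y)   # built-in test; its statistic is w' minus n(n+1)/2
```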

Kruskal-Wallis: non-parametric one way ANOVA

The Kruskal-Wallis test is, as the title says, the non-parametric one-way ANOVA. The idea is that we have \(k\) groups (i.e. \(k\) levels of a factor). Under the null hypothesis \(\mu_1 = \mu_2 = \dots = \mu_k\), we would expect the distribution of ranks to be the same for each group. So we rank all of the observations together and use the sum of the ranks in group \(j\), \(R_{.j}\), to create our test statistic:

\[ B = \frac{12}{n(n+1)} \sum_{j=1}^k \frac{R_{.j}^2}{n_j} - 3(n+1) \]

And asymptotically, this follows a \(\chi^2_{k-1}\) distribution.

See §14.4 for more.
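In practice this is a one-liner in R; a small sketch with invented data for \(k=3\) groups:

```r
# Kruskal-Wallis test: non-parametric one-way ANOVA on ranks
yield <- c(5.1, 4.8, 5.6, 4.2, 4.0, 4.5, 6.0, 5.7, 6.3)
group <- factor(rep(c("A", "B", "C"), each = 3))
kruskal.test(yield ~ group)   # statistic compared with chi-squared on k - 1 df
```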

Friedman Test

The Friedman test is the non-parametric analogue of the two-way ANOVA. If we have rows and columns, we can say that the rows are blocks, and the columns are the treatments that we are interested in. Under the null hypothesis that the treatments are the same, the ranks should be random within each block. So we rank our data within each block and calculate the following statistic:

\[ g = \frac{12}{bk(k+1)} \sum_{j=1}^k {r_{.j}^2} - 3b(k+1) \]

Here \(b\) is the number of blocks, \(k\) the number of treatments, and \(r_{.j}\) the sum of the within-block ranks for treatment \(j\). If \(H_0\) is true, \(g \sim \chi^2_{k-1}\) (approximately). This is described in §14.5.

We can use the potato yield data from earlier in the course. There were 3 treatments, each applied to every variety of potato (we will use variety as the block). These are the data for five of the varieties:

Variety B Cl S
1 Ajax 3.09 3.24 3.69
4 Arran comrade 2.25 2.07 2.46
7 British queen 2.94 3.19 3.28
10 Duke of York 2.02 1.77 1.54
13 Epicure 2.16 2.23 2.10

For each variety (=block) we can calculate the ranks:

Variety B Cl S
1 Ajax 1 2 3
4 Arran comrade 2 1 3
7 British queen 1 2 3
10 Duke of York 3 2 1
13 Epicure 2 3 1

We calculate the rank sums \(r_{.j}\) by summing the ranks for each treatment over the blocks, and then square the sums. These are the values for the first 5 varieties:

##        B  Cl   S
## Sum    9  10  11
## SumSq 81 100 121

and for the full data they are

##         B  Cl   S
## Sum    21  26  25
## SumSq 441 676 625

To calculate \(g\) we need \(b=12\) varieties/blocks, \(k=3\) treatments, and \(\sum_{j=1}^3 r_{.j}^2 = 1742\). Then

\[ \begin{align} g &= \frac{12}{bk(k+1)} \sum_{j=1}^k r_{.j}^2 - 3b(k+1) \\ &= \frac{12}{12\times 3(3+1)} (441+676+625) - 3\times 12 (3+1) \\ &= \frac{1}{12} (1742) - 144 = 1.167 \end{align} \]

We compare this to a \(\chi^2_2\) distribution. The critical value is \(\chi^2_{0.95,2}=\) 5.99, so the test is not significant.
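As a check, the calculation of \(g\) (using only the rank sums quoted above) and the critical value can be reproduced in R; friedman.test() can compute the same statistic directly from a data matrix (assuming no ties within blocks):

```r
# Friedman statistic from the rank sums for the full potato data (b = 12, k = 3)
b <- 12; k <- 3
sum_r_sq <- 441 + 676 + 625                                 # sum of squared rank sums
g <- 12 / (b * k * (k + 1)) * sum_r_sq - 3 * b * (k + 1)
g                           # about 1.17
qchisq(0.95, df = k - 1)    # about 5.99, so g is not significant
```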

Parametric and Non-parametric tests

See §14.7 for a discussion. Basically, non-parametric tests have less power, so this has to be traded off against the bias from the data not obeying the assumptions of a parametric test.


  1. Briefly, a moment generating function is defined for discrete variables as \(M_W(t) = E(e^{tW}) = \sum_k e^{tk} p_W(k)\), where \(p_W(k) = Pr(W=k)\).

    This can be used to generate moments because (Theorem 3.12.1) \(M_W^{(r)}(0) = E(W^r)\)

    In other words, if we differentiate \(r\) times and evaluate at zero we get the \(r^{th}\) moment. This is useful here not because we want the \(r^{th}\) moment, but because if two distributions have the same moment generating function they also have the same density function, i.e. they are the same probability distribution. This is what we want here.

    We will also need the result (Theorem 3.12.3b) that if \(W_1, W_2, \dots, W_n\) are independent random variables and we are interested in \(W=\sum_{i=1}^n W_i\), the mgf of \(W\), \(M_W(t)\), is \(M_W(t) = \prod_{i=1}^n M_{W_i}(t)\).↩︎