Chapter 10: Goodness of Fit Tests

Hints and reminders are in bold

Questions appear in blue.

Some hints and answers are hidden away: click here Hello! I might be the answer. Or a hint.

Learning outcomes

  • Understand the multinomial distribution (and be able to work out its link to the Poisson distribution)

  • Be able to do GoF/\(\chi^2\) tests with known parameters

  • Be able to do GoF/\(\chi^2\) tests with unknown parameters, and understand the difference

  • Understand contingency tables

  • Understand likelihood ratio tests (LRTs)

The Multinomial Distribution

The multinomial distribution is a generalisation of the binomial distribution. We can think that in a binomial distribution there are two classes: “heads” and “tails”, or “success” and “failure”. But we can have more than two classes: “heads”, “tails” and “edge”, or “success”, “failure”, and “utter failure”. Or, more seriously, we can have categories like which of \(t\) political parties a person will vote for.

So, suppose our sample space is the integers \(i=1, \dots, t\). We sample these with probabilities \(p_1, p_2, \dots, p_t\). Then if we take \(n\) samples, the distribution of the vector of counts \((X_1, X_2, \dots, X_t)\) is

\[ Pr_{X_1, X_2, \dots, X_t}(k_1, k_2, \dots, k_t) = Pr(X_1=k_1, X_2=k_2, \dots, X_t=k_t) \\ = \frac{n!}{k_1!k_2! \dots k_t!} \prod_{j=1}^{t}{p_j^{k_j}} \] Prove this!

Do this in two steps:

  1. derive the probability of any sequence
  2. as the order of the sequence does not matter, work out how many ways a sequence can lead to the counts \((k_1, k_2, \dots, k_t)\).
Hint for second part Theorem 2.6.2 might help
Answer This is theorem 10.2.1, and the proof is there.
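
If you want to check the formula numerically, here is a minimal R sketch (the counts and probabilities below are made up): it compares a direct calculation from the formula with base R's dmultinom().

# Sanity check of the multinomial pmf, with made-up numbers
k <- c(3, 5, 2)                       # counts in each of t = 3 classes
p <- c(0.2, 0.5, 0.3)                 # class probabilities
n <- sum(k)

factorial(n) / prod(factorial(k)) * prod(p^k)   # the formula above
dmultinom(k, size = n, prob = p)                # base R's version: same value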

Relationship to Binomial

Theorem 10.2.2 says that if the vector \(\mathbf{X} = (X_1, X_2, \dots, X_t)\) follows a multinomial distribution then each \(X_i\) follows a binomial distribution. But we can go beyond that.

Corollary to Theorem 10.2.2: if the vector \(\mathbf{X} = (X_1, X_2, \dots, X_t)\) follows a multinomial distribution, then for any partition of the classes into two groups, the total count in each group follows a binomial distribution.

Prove this!

Hint Use the way Theorem 10.2.2 is proved.
Answer For any subset of classes \(P \subseteq \{1,\dots,t\}\), the probability that any individual observation falls in \(P\) is \(q_P=\sum_{j \in P} p_j\). So if \(Y_P = \sum_{j \in P}X_j\), then \(Y_P\) counts the number of “successes” in \(n\) independent Bernoulli trials with success probability \(q_P\), so it follows a Binomial\((n, q_P)\) distribution.
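
If you prefer simulation to algebra, here is a quick R sketch of the corollary (the partition and probabilities are made up): we draw multinomial samples, add up the counts in one group of classes, and compare with the corresponding binomial probability.

# Simulation check: counts summed over a group of classes look binomial
set.seed(1)
p <- c(0.1, 0.4, 0.2, 0.3); n <- 20
X <- rmultinom(1e5, size = n, prob = p)   # one multinomial draw per column
Y <- colSums(X[c(1, 3), ])                # total count in the group {1, 3}
qP <- sum(p[c(1, 3)])                     # probability of landing in that group

mean(Y == 5)                              # empirical Pr(Y = 5)
dbinom(5, n, qP)                          # Binomial(n, qP): should be very close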

Relationship to Poisson

This is not covered in L&M, but is one of my favourite results: it means we can treat problems with multinomial and Poisson distributions interchangeably, so we can use whichever makes life easier.

Suppose we observe \(n\) objects (e.g. fish in the river Seine, classified by species), each of which can be in one of \(t\) classes, with probability \(p_j\) of being in class \(j\). Clearly the numbers in each class follow a multinomial distribution.

Now suppose instead we spend time \(T\) looking for these objects (e.g. spending a day fishing on the Seine), and the rate of observing class \(j\) is \(\eta_j\), and all observations are independent. Then clearly the number of observations of class \(j\), \(r_j\), follows a Poisson distribution with mean \(\lambda_j = \eta_j T\) (if this is not clear, re-read §4.2, in particular the derivation of eqn. 4.2.2). i.e. 

\[ Pr(X_j=r_j) = \frac{(\lambda_j)^{r_j} e^{-\lambda_j}}{r_j!} \]

Now, if we condition on the total number observed, \(\sum_{j=1}^t{r_j} = n\), what is the joint probability across all \(t\) classes, i.e. \(Pr(\mathbf{X} = \mathbf{r}|\sum_{j=1}^t{r_j} = n)\)? It turns out that this is a multinomial distribution too.

Guess what? We’re going to prove it. The first step is to write down the conditional probability

\[ Pr(\mathbf{X} = \mathbf{r}|\sum_{j=1}^t{r_j} = n) = \frac{Pr(\mathbf{X}=\mathbf{r} \cap \sum_{j=1}^t{r_j} = n)}{Pr(\sum_{j=1}^t{r_j} = n)} = \frac{Pr(\mathbf{X}=\mathbf{r})}{Pr(\sum_{j=1}^t{r_j} = n)} \]

Why can we make the simplification above?

Hint This is a logic problem, about what \(\mathbf{X}=\mathbf{r}\) implies.
Answer Because \(\sum_{j=1}^t{r_j} = n\) is determined by \(\mathbf{r}\), i.e. \(Pr( \sum_{j=1}^t{r_j} = n | \mathbf{X}=\mathbf{r})=1\)

What is the expression for \(Pr(\mathbf{X}=\mathbf{r})\)?

Answer \[ Pr(\mathbf{X}=\mathbf{r}) = \prod_{j=1}^t{\frac{\lambda_j^{r_j} e^{-\lambda_j}}{r_j!}} \]

And now the denominator:

Write down the probability mass function for the total count, i.e. for \(Pr(\sum_{j=1}^t{r_j} = n)\)

(yes, you can derive it, using moment generating functions, or you can just use the result hidden away in the book)

Hint See Example 3.12.10
Answer

It’s a Poisson distribution:

\[ Pr(\sum_{j=1}^t{r_j} = n) = \frac{e^{-\sum_{j=1}^t{\lambda_j}} (\sum_{j=1}^t{\lambda_j})^n}{n!} \]

So we need to put these together.

What is the distribution of \(\mathbf{X} = \mathbf{r}|\sum_{j=1}^t{r_j} = n\)?

Note that you can write \(p_j = \lambda_j/{\sum_j \lambda_j}\).

Hint There is one cunning trick, based on this: \(a^{b} = a^m \times a^n\) if \(b = m+n\)
Answer

If you want to listen to me explain this (sorry about the handwriting), click here.

\[ \begin{align} Pr(\mathbf{X} = \mathbf{r}|\sum_{j=1}^t{r_j} = n) &= \frac{Pr(\mathbf{X}=\mathbf{r})}{Pr(\sum_{j=1}^t{r_j} = n)} \\ &= \frac{\prod_{j=1}^t{\frac{\lambda_j^{r_j} e^{-\lambda_j}}{r_j!}}} {\frac{e^{-\sum_{j=1}^t{\lambda_j}} (\sum_{j=1}^t{\lambda_j})^n}{n!}} \\ &= \frac{n!}{\prod_{j=1}^t{r_j!}} \frac{\prod\lambda_j^{r_j} e^{-\lambda_j}}{e^{-\sum{\lambda_j}} (\sum{\lambda_j})^n} \\ &= \frac{n!}{\prod_{j=1}^t{r_j!}} \frac{e^{-\sum\lambda_j}\prod\lambda_j^{r_j} }{e^{-\sum{\lambda_j}} (\sum{\lambda_j})^n} = \frac{n!}{\prod_{j=1}^t{r_j!}} \frac{\prod\lambda_j^{r_j} }{ (\sum{\lambda_j})^n} \end{align} \] Now the cunning trick. We write \((\sum{\lambda_j})^n =\prod(\sum{\lambda_j})^{r_j}\) and get

\[ \begin{align} Pr(\mathbf{X} = \mathbf{r}|\sum_{j=1}^t{r_j} = n) &= \frac{n!}{\prod_{j=1}^t{r_j!}} \frac{\prod\lambda_j^{r_j} }{ (\sum{\lambda_j})^n} \\ &= \frac{n!}{\prod_{j=1}^t{r_j!}} \frac{\prod\lambda_j^{r_j} }{ \prod(\sum{\lambda_j})^{r_j}} \\ &= \frac{n!}{\prod_{j=1}^t{r_j!}} \prod \frac{\lambda_j^{r_j} }{ (\sum{\lambda_j})^{r_j}} \\ &= \frac{n!}{\prod_{j=1}^t{r_j!}} \prod \left( \frac{\lambda_j}{ \sum{\lambda_j}} \right)^{r_j} \\ &= \frac{n!}{\prod_{j=1}^t{r_j!}} \prod p_j^{r_j} \end{align} \] where \(p_j = \lambda_j/\sum{\lambda_j}\). This is the multinomial distribution, as promised.

Why is this interesting? Because it suggests that we can consider different data in the same way: we do not need to worry about fixing the number of trials when we are interested in the proportions of each class in the data. It also means that if we have multinomial data with a fixed \(n\), we can treat it as Poisson distributed data. This means that when we analyse the sort of data we are looking at this week, we do not have to worry about whether the totals are fixed. In more complicated multinomial models the same result holds, so we can still model the data as being Poisson distributed, although we do have to make sure the structure of the data is respected.
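
Here is a small numerical check of the result in R (the rates and counts are made up): the conditional Poisson probability and the multinomial probability agree.

lambda <- c(2, 5, 3)             # made-up Poisson rates for t = 3 classes
r <- c(1, 4, 2); n <- sum(r)     # observed counts, and their total

prod(dpois(r, lambda)) / dpois(n, sum(lambda))     # Pr(X = r) / Pr(total = n)
dmultinom(r, size = n, prob = lambda/sum(lambda))  # multinomial with p_j = lambda_j / sum(lambda)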

Goodness of Fit tests

We want to use the multinomial distribution to do goodness of fit tests. These arise when we have a model for count data, i.e. when we know what proportions of counts we expect in each class, and we want to check whether that is what we actually have. So we can use a likelihood ratio test.

So, if our data is the vector of counts, \(\mathbf{r}\), and under \(H_0\) the expected proportions are \(\mathbf{p}\), then

\[ Pr(\mathbf{r}|H_0) = \frac{n!}{\prod_{j=1}^t{r_j!}} \prod p_j^{r_j} \] Under \(H_1\) we have to estimate \(\mathbf{p}\). The maximum likelihood estimate is easy if we use the Poisson formulation, where \(p_j = \lambda_j/\sum_j \lambda_j\), because we know that the MLE for \(\lambda_j\) is the mean of the observations (§…), i.e. \(r_j\), so \(\hat{p}_j=r_j/\sum_j r_j = r_j/n\).

Exercise

Derive the likelihood ratio \(\lambda = \frac{Pr(\mathbf{r}|H_0)}{Pr(\mathbf{r}|H_1)}\) in terms of the data, \(r_j\), and the probabilities under \(H_0\), \(p_j\).

Answer \[ \begin{align} \lambda &= \frac{Pr(\mathbf{r}|H_0) }{Pr(\mathbf{r}|H_1)} \\ &= \frac{\frac{n!}{\prod_{j=1}^t{r_j!}} \prod p_j^{r_j}}{\frac{n!}{\prod_{j=1}^t{r_j!}} \prod \left(\frac{r_j}{n}\right)^{r_j}} \\ &= \prod_{j=1}^t \left(\frac{np_j}{r_j} \right)^{r_j} \end{align} \]

From the likelihood ratio, show that \(\sum_{j=1}^t O_j \log \left(\frac{O_j}{E_j} \right)\) is an alternative statistic to use, where \(O\) = “observed”, and \(E\) = “expected”

Answer First, note that \(r_j = O_j\), and \(np_j = E_j\), so \[ \begin{align} \lambda &= \prod_{j=1}^t \left(\frac{np_j}{r_j} \right)^{r_j} = \prod_{j=1}^t \left(\frac{E_j}{O_j} \right)^{O_j} \\ \log(\lambda) &= \sum_{j=1}^t O_j \log \left(\frac{E_j}{O_j} \right) \\ -\log(\lambda) &= \sum_{j=1}^t O_j \log \left(\frac{O_j}{E_j} \right) \\ \end{align} \] The final line just flips the ratio in the log.

This statistic can be used as a test statistic, because \(G=-2\log(\lambda)\) approximately follows a \(\chi^2\) distribution; this is called the G-test. The result that \(-2\log(\lambda)\) follows a \(\chi^2\) distribution is actually much more general: it holds for any non-pathological likelihood ratio \(\lambda\). It also forms the basis of the goodness of fit statistics we will use. We use \(O\) and \(E\) because they are easier for non-mathematicians to remember.
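
As a sketch of how the G-test looks in practice (the counts and \(H_0\) probabilities below are made up), it is only a few lines of R; the degrees of freedom, \(t-1\) when the probabilities are known, are explained in the sections that follow.

O <- c(22, 41, 37)                        # made-up observed counts
p0 <- c(0.25, 0.5, 0.25)                  # probabilities under H0
E <- sum(O) * p0                          # expected counts

G <- 2 * sum(O * log(O / E))              # G = -2 log(lambda)
pchisq(G, df = length(O) - 1, lower.tail = FALSE)   # approximate p-value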

Taylor Series Approximation

Now we have a likelihood ratio test, we can approximate it by one that is easier to work with if you don’t have a calculator. We use a Taylor series approximation around \(E_j\) to get an approximation for \(O_j \log \left(\frac{O_j}{E_j} \right)\), and can then sum up the terms.

To remind you, the Taylor series expansion of a function \(f(x)\) around a point \(x^{*}\) is

\[ f(x) = \sum_{n=0}^\infty{\frac{f^{(n)}(x^{*})}{n!}(x-x^{*})^n} = f(x^{*}) + f'(x^{*})(x-x^{*}) + \frac{f''(x^{*})}{2}(x-x^{*})^2 + \dots \]

We only need the terms up to \((x-x^{*})^2\) for this approximation, and we will expand \(f(O_j) = O_j \log \left(\frac{O_j}{E_j} \right)\) around \(E_j\).

Derive the \(\chi^2\) statistic, \(\chi^2 = \sum_j (O_j-E_j)^2/E_j\), as an approximation of \(G=2\sum_{j=1}^t O_j \log \left(\frac{O_j}{E_j} \right)\). Then make a wild guess of what distribution it approximately follows.

Answer

We can approximate each term in the sum, and add them up at the end. Formally this is a Taylor series in multiple dimensions, but as each dimension is independent, there are no cross-dimension terms.

First, let’s differentiate!!!!

\[ \begin{align} f(O_j) &= O_j \log \left(\frac{O_j}{E_j} \right) \\ \frac{df(O_j)}{dO_j} &= \log \left(\frac{O_j}{E_j} \right) + O_j \cdot \frac{1}{O_j} = \log \left(\frac{O_j}{E_j} \right) + 1 \\ \frac{d^2f(O_j)}{dO_j^2} &= \frac{1}{O_j} \end{align} \]

And plug these in, and evaluate at \(O_j=E_j\). We will look at each term \(g_j = O_j \log(O_j/E_j)\) in \(-\log\lambda=\sum_j g_j\) separately and sum them up afterwards:

\[ \begin{align} g_j &\approx E_j \log\left(\frac{E_j}{E_j}\right) + \left[\log\left(\frac{E_j}{E_j}\right)+1 \right](O_j-E_j) + \frac{1}{2}\frac{1}{E_j}(O_j-E_j)^2 \\ &= 0 + (O_j-E_j) + \frac{1}{2}\frac{(O_j-E_j)^2}{E_j} \end{align} \]

Now we can sum this up:

\[ \begin{align} -\log{\lambda} &\approx \sum_j(O_j-E_j) + \sum_j\frac{1}{2}\frac{(O_j-E_j)^2}{E_j} \end{align} \] Now note that \(\sum_j O_j=n\) and \(\sum_j E_j=n\), so the first sum is zero and we end up with

\[ \begin{align} G = -2\log{\lambda} &\approx \sum_j\frac{(O_j-E_j)^2}{E_j} = \chi^2 \end{align} \]

With a name like this, you will not be surprised to learn that the \(\chi^2\) statistic approximately follows a \(\chi^2\) distribution. The degrees of freedom are something we will explain as we go on to use this statistic to do goodness of fit tests.
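
You can check how good the Taylor approximation is with made-up counts in R:

O <- c(34, 16, 10)                        # made-up observed counts
E <- sum(O) * c(0.5, 0.3, 0.2)            # expected counts under a made-up H0

c(G = 2 * sum(O * log(O / E)),            # the likelihood ratio (G) statistic
  chi2 = sum((O - E)^2 / E))              # its Taylor approximation: the two should be similar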

Goodness of Fit test with known parameters

First we will look at the case when we know the parameters, i.e. we already have a model for the \(p_j\). For example, in genetics we know that some traits are controlled by a single gene, and in species like humans, which have two copies of each gene, one copy can be dominant over the other (so genotypes AA, Aa, and aA all look like AA, and aa is different). This means that if we cross two individuals who are Aa, 75% of their offspring should look like an A, and 25% should look like an a. So we can use this to test whether a gene shows dominant behaviour. And we can look at multiple genes to get more complicated ratios¹.

So we can plug this into the \(\chi^2\) statistic. If our data is a vector of counts, \(k_j\), with expected counts \(np_j\), then \(\chi^2 = \sum_j\frac{(k_j-np_j)^2}{np_j}\) approximately follows a \(\chi^2_{t-1}\) distribution. We will not prove this: Theorem 10.3.1 gives a proof for \(t=2\) and sketches the approach for \(t>2\).

As the \(\chi^2\) statistic is quadratic in the differences \(k_j-np_j\), a larger difference (on average) means a greater deviation from the null hypothesis, so we only want to use a one-tailed test. Thus our decision rule is (Theorem 10.3.1b): we reject \(H_0\) (i.e. that the data are drawn from the known distribution) if \(\sum_j\frac{(k_j-np_j)^2}{np_j} \ge \chi^2_{1-\alpha, t-1}\).

Note that this is an approximation. The main way the approximation becomes worse is if the expected counts \(np_j\) become small. The common advice is that the approximation is good when all of the expected counts are above 5, although this is a bit conservative.
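
Putting this together, here is a sketch of the dominant/recessive example in R, with made-up offspring counts and a 3:1 ratio under \(H_0\); chisq.test() does the same calculation as the formula above.

counts <- c(A = 290, a = 110)             # made-up offspring counts
p0 <- c(0.75, 0.25)                       # 3:1 ratio under H0

E <- sum(counts) * p0
sum((counts - E)^2 / E)                   # chi^2 statistic by hand
qchisq(0.95, df = length(counts) - 1)     # critical value, t - 1 = 1 degree of freedom

chisq.test(counts, p = p0)                # built-in version of the same test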

Suppose we observe \(c\) replicate data sets; for each one we have observed counts \(k_{sj}\) (\(s=1,...,c\), with \(\sum_{j=1}^t k_{sj}=n_s\)) and expected proportions \(p_{sj}\) (so \(\sum_{j=1}^t p_{sj}=1\)). What is the (approximate) distribution of the \(\chi^2\) goodness of fit statistic for all of the data, i.e. \(\sum_{s=1}^c\sum_{j=1}^t\frac{(k_{sj}-n_s p_{sj})^2}{n_s p_{sj}}\)? Specifically, what are the degrees of freedom?

Answer

We know that for any one data set, \(\sum_{j=1}^t\frac{(k_{sj}-n_s p_{sj})^2}{n_s p_{sj}} \sim \chi^2_{t-1}\) (approximately). And the sum of independent \(\chi^2\) random variables also follows a \(\chi^2\) distribution: if \(X_i \sim \chi^2_{d_i}\) independently, then \(\sum_i X_i \sim \chi^2_{\sum_i d_i}\).

For the whole data set we have \(c\) goodness of fit statistics that each follow a \(\chi^2_{t-1}\) distribution. So the sum over all of them must follow a \(\chi^2_{c(t-1)}\) distribution. So we have \(ct-c\) degrees of freedom, even though there are \(ct\) counts.

This is more than just an interesting exercise: it relates to how we think about degrees of freedom. For each category in \(1,\dots,t\) we get one piece of data. But we also have a constraint: \(\sum_j k_j=n\). This “costs” us a degree of freedom: \(n\) is fixed, so if we know \(k_1,\dots,k_{t-1}\) we know \(k_t\). Hence we have \(t-1\) degrees of freedom for each data set. With \(c\) data sets we have \(c\) constraints, so we lose \(c\) degrees of freedom.
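
If you want to convince yourself of this, here is a small simulation sketch (the probabilities, sample sizes and number of replicates are made up): the pooled statistic should have mean about \(c(t-1)\) and variance about \(2c(t-1)\), as a \(\chi^2_{c(t-1)}\) variable should.

set.seed(2)
c_rep <- 4; t_cat <- 3; n_s <- 50         # made-up replicates, classes, sample size
p <- c(0.2, 0.3, 0.5)

stat <- replicate(1e4, {
  k <- rmultinom(c_rep, size = n_s, prob = p)   # one replicate data set per column
  E <- n_s * p
  sum((k - E)^2 / E)                            # pooled chi^2 statistic
})

mean(stat)                                # should be close to c_rep * (t_cat - 1) = 8
var(stat)                                 # should be close to 2 * 8 = 16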

Goodness of Fit test with unknown parameters

Often we will not know all of the parameters of a model. For example, with a binomial distribution we may not know the value of \(p\). Here’s an example, from Saxony between 1876 and 1885. These are the number of boys in families of 8 children:

library(vcdExtra)
## Loading required package: vcd
## Loading required package: grid
## Loading required package: gnm
data("Geissler")
Geissler8 <- Geissler[Geissler$size==8, ]

knitr::kable(t(c(Geissler8[,"Freq"], sum(Geissler8[,"Freq"]))),
             col.names = c(0:8, "Total"))
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Total |
|-----|------|------|------|-------|------|------|------|-----|-------|
| 161 | 1152 | 3951 | 7603 | 10263 | 8498 | 4948 | 1655 | 264 | 38495 |

If the sex ratio is constant, then this data should obviously come from a Binomial distribution. But we cannot assume that \(p=0.5\) (Laplace was “morally certain” that it was not 0.5 in Paris).

So we have to estimate the sex ratio. It is not too difficult, but to save you the trouble, the estimate is \(\hat{p}=0.52\).

Now we need to calculate the proportion of families with 8 children who have 0, 1, 2 etc boys.

n <- sum(Geissler8$Freq)
phat <- sum(Geissler8$boys*Geissler8$Freq)/(8*sum(Geissler8$Freq))

Geissler8$prob <- dbinom(Geissler8$boys, 8, phat)
Geissler8$Expected <- Geissler8$prob*n
Geissler8$diff <- Geissler8$Freq-Geissler8$Expected
Geissler8$chi2 <- ((Geissler8$Freq-Geissler8$Expected)^2)/Geissler8$Expected

knitr::kable(Geissler8[,c("boys", "Freq", "prob", "Expected", "diff", "chi2")], 
             digits = 1, row.names = FALSE,
             col.names = c("Number of Boys", "Frequency", 
                           "Probability", "Expected Frequency", "diff",
                           "chi^2"))
| Number of Boys | Frequency | Probability | Expected Frequency | diff | chi^2 |
|----------------|-----------|-------------|--------------------|--------|-------|
| 0 | 161 | 0.0 | 116.7 | 44.3 | 16.9 |
| 1 | 1152 | 0.0 | 993.4 | 158.6 | 25.3 |
| 2 | 3951 | 0.1 | 3701.2 | 249.8 | 16.9 |
| 3 | 7603 | 0.2 | 7879.8 | -276.8 | 9.7 |
| 4 | 10263 | 0.3 | 10485.0 | -222.0 | 4.7 |
| 5 | 8498 | 0.2 | 8928.9 | -430.9 | 20.8 |
| 6 | 4948 | 0.1 | 4752.4 | 195.6 | 8.1 |
| 7 | 1655 | 0.0 | 1445.4 | 209.6 | 30.4 |
| 8 | 264 | 0.0 | 192.3 | 71.7 | 26.7 |

So the \(\chi^2\) statistic is 159.41. But how many degrees of freedom does it have?

It turns out (= we will not see the proof) that we lose one degree of freedom for every estimated parameter. So if we estimate \(s\) parameters we have \(t-s-1\) degrees of freedom. Here \(t=9\) and \(s=1\), so the statistic follows a \(\chi^2_7\) distribution, and we reject the null hypothesis if \(\sum_j\frac{(k_j-n \hat{p}_j)^2}{n\hat{p}_j} \ge \chi^2_{1-\alpha, t-s-1}\). Here the critical value \(\chi^2_{0.95, 7}\) is 14.1, so we reject \(H_0\) and say that the distribution of births is not binomial.
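
To finish the calculation in R, we can pool the chi2 column of the Geissler8 data frame built above and compare it with the \(\chi^2_{t-s-1}\) critical value:

chi2 <- sum(Geissler8$chi2)               # the chi^2 statistic, about 159.4
df <- nrow(Geissler8) - 1 - 1             # t - s - 1 = 9 - 1 - 1 = 7
qchisq(0.95, df)                          # critical value, about 14.1
pchisq(chi2, df, lower.tail = FALSE)      # p-value: essentially zero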

Degrees of Freedom

We will be meeting degrees of freedom throughout the rest of this course. We can understand them as a measure of information: for every data point we gain one degree of freedom. But if we have a constraint (e.g. the data must sum to \(n\)) we lose one, and we also lose one for every parameter we estimate. So with \(t\) data points, \(s\) estimated parameters and \(c\) constraints, the total degrees of freedom is \(t-s-c\).

Contingency Tables

Finally for this week, we can look at where goodness of fit tests can be useful: testing independence between variables. L&M give the example of two film critics rating films on a three-point scale, and then asking if their ratings are independent. So for any film there is a Row Score, \(R \in \{1,\dots, s\}\), and a Column Score, \(C \in \{1,\dots, t\}\). If they are independent then \(Pr(R=r, C=c) = Pr(R=r) Pr(C=c)\). The expected number of films in category \((r,c)\) would thus be \(E_{r,c} = n Pr(R=r) Pr(C=c)\).

Show (well, argue) that the estimated frequency in cell \(\{r,c\}\) is \(\frac{\sum_{i=1}^{s} n_{ic} \sum_{j=1}^{t} n_{rj}}{n}\)

Hint: independence means that we can estimate row- and column- probabilities separately, and plug them in afterwards

Answer

We know (see above) that the maximum likelihood estimate for a multinomial cell probability is the frequency divided by the total count. So, for the \(r^{th}\) row the estimate for the probability of being in column \(c\) is

\[ \hat{p}_{c|r} = \frac{n_{rc}}{\sum_{k=1}^{t} n_{rk}} \] And these are the same for each row, by independence, so

\[ \hat{p}_c = \frac{\sum_{i=1}^{s} n_{ic}}{\sum_{i=1}^{s} \sum_{j=1}^{t} n_{ij}} =\frac{\sum_{i=1}^{s} n_{ic}}{n} \]

And of course the same argument goes for the rows:

\[ \hat{p}_r = \frac{\sum_{j=1}^{t} n_{rj}}{n} \]

So if we plug these in to estimate \(n_{rc}\) we have \[ \hat{n}_{rc} = n \hat{p}_r\hat{p}_c = n \frac{\sum_{i=1}^{s} n_{ic}}{n} \frac{\sum_{j=1}^{t} n_{rj}}{n} = \frac{\sum_{i=1}^{s} n_{ic} \sum_{j=1}^{t} n_{rj}}{n} \]

Now we have these probabilities, we can calculate our \(\chi^2\) statistic:

\[ \chi^2 = \sum_{r=1}^{s} \sum_{c=1}^{t} \frac{({n_{rc} - n \hat{p}_r\hat{p}_c})^2}{n \hat{p}_r\hat{p}_c} \]

Of course we now have to work out how many degrees of freedom we have. There are \(st\) data points, but we estimate \(s-1\) and \(t-1\) parameters (because the probabilities must sum to 1, so if we know \(s-1\) of the row probabilities we can calculate the last by subtraction; we can also see this as estimating \(s\) probabilities, but having the constraint that they sum to 1). We also have the constraint that \(\sum_i \sum_j n_{ij}=n\). So we have \(st - (s-1) - (t-1)-1 = st - s - t+1 = (s-1)(t-1)\) degrees of freedom.
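
As a sketch (the table below is made up, not L&M's film-critic data), chisq.test() applied to a matrix carries out exactly this test of independence:

ratings <- matrix(c(24,  8, 13,
                    10, 13, 11,
                     8,  9, 24),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(critic1 = c("low", "medium", "high"),
                                  critic2 = c("low", "medium", "high")))

ind <- chisq.test(ratings)                # Pearson chi^2 test of independence
ind$expected                              # row total x column total / n, as derived above
ind$parameter                             # degrees of freedom: (s - 1)(t - 1) = 4
ind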


  1. For my PhD I worked on cereal mildews, which are simpler because they only have one copy of each gene. But I was in a cereal department, where they had to deal with wheat, which has 3 genomes, so there are either 2, 4, or 6 copies of a gene.↩︎