NOTE: Chapter 8 will be done later.

Hints and reminders are in bold

Questions appear in blue.

Some hints and answers are hidden away: click here Hello! I might be the answer. Or a hint.

References to §x.y are to section §x.y of the Larsen and Marx textbook. e.g. §4.3 is the section titled The Normal Distribution.

This is Chapter 9 of L&M: I will present things in a slightly different order, but I suggest you do also read the book.

Last week we looked at inference about a single sample. This week we will look at two samples (and later lots of samples, because that is how to count: 1, 2, >2).

With two samples the main questions are about how the two samples differ: usually whether the means are different, but sometimes it is the variances that matter. We will see that last week’s module on analysing a single sample has covered a lot of material that will be useful here.

Large-scale tests of proportion

We will start with §9.4, because once we have done a bit of easy work we get a slightly simpler problem.

Suppose we have two treatments, \(X\) and \(Y\). We take \(n\) samples from treatment \(X\), of which \(x\) are successes (however we define success). Similarly, we take \(m\) samples from treatment \(Y\), of which \(y\) are successes. All samples are independent, so \(x\) and \(y\) follow (independent) binomial distributions. We want to ask if the true probability of success is the same, i.e. if \(p_X = p_Y\). We will assume that \(n\) and \(m\) are large, so we can use the central limit theorem to make things Normal. A classic example of this would be opinion polls, e.g. comparing samples taken at different times to see if opinions have changed.

L&M walk through the LRT, but we can go straight to the asymptotic result.

Under the null hypothesis we have \(p_X = p_Y = p_e\), and we estimate the common probability by \(p_e = \frac{x+y}{n+m}\). Approximately, \(x \sim N(np_e, np_e(1-p_e))\), and \(y \sim N(mp_e, mp_e(1-p_e))\).

Under the alternative hypothesis, i.e. that \(p_X \ne p_Y\), we estimate \(p_X\) by \(\frac{x}{n}\) and \(p_Y\) by \(\frac{y}{m}\), and we have \(x \sim N(np_X, np_X(1-p_X))\), and \(y \sim N(mp_Y, mp_Y(1-p_Y))\).

We can use \(p_X - p_Y\) to develop a test statistic, and develop a test for \(H_0: p_X - p_Y = 0\).

What is the distribution of \(p_X - p_Y\) under \(H_0\)? In particular what is the mean and variance?

Answer

\(p_X - p_Y\) follows a normal distribution, because it is a linear combination of normal random variables. Under \(H_0\), \(p_X = p_Y\), so \(E(p_X - p_Y)=0\). The variance is also easy to calculate, from Theorem 3.9.5: \(Var(aX + bY) = a^2Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)\). We have \(a=1\), \(b=-1\) and \(Cov(X,Y)=0\), so, because \(Var(p_X)=p_e(1-p_e)/n\) and \(Var(p_Y)=p_e(1-p_e)/m\), we have

\[ \begin{align} Var(p_X - p_Y) &= Var(p_X) + Var(p_Y) \\ &= p_e(1-p_e)/n + p_e(1-p_e)/m \\ &= p_e(1-p_e)\left(\frac{1}{n} + \frac{1}{m}\right) \\ &= \frac{n+m}{nm}p_e(1-p_e) \end{align} \]

Based on the first question, what would you use as a test statistic, and how would you do the test?

Answer

Now that we know \(p_X - p_Y \sim N(0, \frac{n+m}{nm}p_e(1-p_e))\), we can calculate

\[ z = \frac{p_X - p_Y}{\sqrt{\frac{n+m}{nm}p_e(1-p_e)}} = \frac{\frac{x}{n} - \frac{y}{m}}{\sqrt{\frac{(x+y)(n+m-x-y)}{nm(n+m)}}} \] so that \(z \sim N(0,1)\), and we can use this as a test statistic. Specifically, if we are testing \(H_0: p_X = p_Y\) against \(H_1: p_X \ne p_Y\), we reject at the 5% level if \(|z|>1.96\).

This is all Theorem 9.4.1.
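As a concrete illustration, the whole test fits in a few lines of Python. This is only a sketch (the counts are made up), but it follows Theorem 9.4.1 exactly; 1.96 is the usual two-sided 5% critical value of the standard normal.

```python
import math

def two_proportion_z(x, n, y, m):
    """z statistic for H0: p_X = p_Y, using the pooled estimate p_e."""
    p_e = (x + y) / (n + m)                            # pooled success probability
    se = math.sqrt(p_e * (1 - p_e) * (1 / n + 1 / m))  # SE of p_X - p_Y under H0
    return (x / n - y / m) / se

# Illustrative (made-up) counts: 60/100 successes vs 40/100 successes
z = two_proportion_z(60, 100, 40, 100)
print(round(z, 3), abs(z) > 1.96)  # prints "2.828 True": reject H0 at the 5% level
```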

We can also calculate a 95% confidence interval. I will leave that as a problem for you to figure out.

The purpose of starting with this is that it is a simple example of a comparison of two samples: the only comparison is of the means, and these are normally distributed when \(n\) and \(m\) are large enough (30 is the usual value we use as a benchmark, but the further \(p_e\) is from 0.5, the larger \(n\) and \(m\) have to be).

Estimating the Difference in the Means of Normally Distributed Samples

Now we will look at estimating the difference in the means of the samples when they are normally distributed, but when we don’t know the variance. We will look at the case where the variances are the same, as this is easier (and is usually a good assumption anyway).

We assume that we take two samples, \(X_i\), \(i=1, \dots, n\), and \(Y_j\), \(j=1, \dots, m\), where each sample is drawn from an independent normal random variable: \(X_i \sim N(\mu_X, \sigma^2)\) and \(Y_j \sim N(\mu_Y, \sigma^2)\). Then all we need are these three statistics:

\[ \begin{align} \bar{X} &= \frac{1}{n}\sum_{i=1}^n{X_i} \\ \bar{Y} &= \frac{1}{m}\sum_{j=1}^m{Y_j} \\ S_P^2 &= \frac{(n-1)S_X^2 + (m-1)S_Y^2}{n+m-2} = \frac{\sum_{i=1}^n(X_i - \bar{X})^2 + \sum_{j=1}^m(Y_j - \bar{Y})^2}{n+m-2} \end{align} \]

In other words, we need the sample means and the pooled sample variance (we are using \(P\) to denote pooled here: pooled = combined).

We are interested in the difference between the means, so \(\mu_X - \mu_Y\) is probably the best quantity to consider. A sensible estimator of this is \(\bar{X} - \bar{Y}\), which is unbiased (because \(E(\bar{X} - \bar{Y})=\mu_X-\mu_Y\); this follows because \(E(Y_j)=\mu_Y\), and \(E(aY_j+b)=a\mu_Y+b\)). We can look at the distribution of this, but scaled by the standard deviation of the statistic (= the standard error):

\[ t = \frac{(\bar{X} - \bar{Y}) - (\mu_X-\mu_Y)}{S_P \sqrt{\frac{1}{n}+\frac{1}{m}}} \] To work out the distribution of \(t\) we could take a wild guess based on its name. Or we could divide top & bottom by \(\sigma\) to get this:

\[ t = \frac{\frac{(\bar{X} - \bar{Y}) - (\mu_X-\mu_Y)}{\sigma \sqrt{\frac{1}{n}+\frac{1}{m}}}}{S_P/\sigma} = \frac{ \frac{(\bar{X} - \bar{Y}) - (\mu_X-\mu_Y)} {\sigma \sqrt{\frac{1}{n}+\frac{1}{m}}} }{\sqrt{\frac{1}{n+m-2}{\left[ \sum_{i=1}^{n}{\frac{(X_i-\bar{X})^2}{\sigma^2}} + \sum_{j=1}^{m}{\frac{(Y_j-\bar{Y})^2}{\sigma^2}} \right]}}} \]

Which all looks very horrible. But you are going to sort it out. First, note that \(\mu_X\), \(\mu_Y\) and \(\sigma^2\) are all (unknown) constants, so only \(\bar{X}\), \(\bar{Y}\) and \(S^2_P\) are random variables (because they are functions of the data).

What is the distribution of the numerator, \(\frac{(\bar{X} - \bar{Y}) - (\mu_X-\mu_Y)}{\sigma \sqrt{\frac{1}{n}+\frac{1}{m}}}\)?

Answer

The numerator is normally distributed, because \(\bar{X}\) and \(\bar{Y}\) are both normally distributed: \(\bar{X} \sim N(\mu_X, \sigma^2/n)\), and \(\bar{Y} \sim N(\mu_Y, \sigma^2/m)\), so

\[ \bar{X} - \bar{Y} \sim N(\mu_X - \mu_Y, \sigma^2/n + \sigma^2/m) \] so \(\frac{(\bar{X} - \bar{Y}) - (\mu_X-\mu_Y)}{\sigma \sqrt{\frac{1}{n}+\frac{1}{m}}} \sim N(0,1)\).

Now we can look at the denominator.

What is the distribution of \(\sum_{i=1}^{n}{\frac{(X_i-\bar{X})^2}{\sigma^2}}\)?

Answer This follows a \(\chi^2_{n-1}\) distribution, from Theorem 7.3.2.

So what is the distribution of \(\sum_{i=1}^{n}{\frac{(X_i-\bar{X})^2}{\sigma^2}} + \sum_{j=1}^{m}{\frac{(Y_j-\bar{Y})^2}{\sigma^2}}\)?

Answer This is the sum of two independent \(\chi^2\) random variables (\(\chi^2_{n-1}\) and \(\chi^2_{m-1}\)), so it follows a \(\chi^2_{n+m-2}\) distribution.

Now we have all that, we will note that someone once proved that the numerator and denominator are independent. So…

So what is the distribution of \(t\)?

Answer

It must follow a t distribution with \(n+m-2\) degrees of freedom: it is a standard normal divided by the square root of an independent \(\chi^2_{n+m-2}\) random variable over its degrees of freedom.

This is all Theorem 9.2.1.
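The whole calculation is short enough to sketch in Python (the data here are made up, purely for illustration):

```python
import math

def pooled_t(xs, ys):
    """Two-sample t statistic for H0: mu_X = mu_Y, assuming equal variances.
    Returns the statistic and its degrees of freedom, n + m - 2."""
    n, m = len(xs), len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / m
    ssx = sum((x - xbar) ** 2 for x in xs)
    ssy = sum((y - ybar) ** 2 for y in ys)
    sp2 = (ssx + ssy) / (n + m - 2)                    # pooled variance S_P^2
    t = (xbar - ybar) / math.sqrt(sp2 * (1 / n + 1 / m))
    return t, n + m - 2

# Illustrative (made-up) data
t, df = pooled_t([1.0, 2.0, 3.0], [2.0, 4.0])
```

We would then compare \(|t|\) with the \(t_{\alpha/2, n+m-2}\) quantile from a table or software.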

Now we know this, we can derive a confidence interval for \(\mu_X-\mu_Y\).

What is the expression for the 95% Confidence Interval?

This will involve quantiles of the t distribution, of course.

Answer

The confidence interval is \[ \bar{X}-\bar{Y} \pm t_{\alpha/2, n+m-2} s_P\sqrt{\frac{1}{n}+\frac{1}{m}} \]

This is Theorem 9.5.1.
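Here is a Python sketch of the calculation. The data are made up, and the critical value is supplied by hand: for \(n=m=6\) we have \(n+m-2=10\) degrees of freedom, and \(t_{0.025,10}=2.228\) from a t table.

```python
import math

def pooled_ci(xs, ys, t_crit):
    """CI for mu_X - mu_Y; t_crit = t_{alpha/2, n+m-2}, looked up externally."""
    n, m = len(xs), len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / m
    sp2 = (sum((x - xbar) ** 2 for x in xs)
           + sum((y - ybar) ** 2 for y in ys)) / (n + m - 2)  # pooled variance
    half = t_crit * math.sqrt(sp2 * (1 / n + 1 / m))          # half-width
    return xbar - ybar - half, xbar - ybar + half

# Made-up data with n = m = 6, so df = 10 and t_{0.025,10} = 2.228
lo, hi = pooled_ci([4, 5, 6, 5, 4, 6], [2, 3, 4, 3, 2, 4], t_crit=2.228)
```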

Test of whether \(\mu_X=\mu_Y\)

We can also test whether we have enough data to conclude that \(\mu_X \ne \mu_Y\)1. We can do that with a likelihood ratio test. The derivation of this is similar to the likelihood ratio test for the mean from last week, and is in §9.A.1.

Under \(H_0\) we obviously have \(\mu_X = \mu_Y\), i.e. the means are equal. Under \(H_1\) we have \(\mu_X \ne \mu_Y\). So the likelihood under \(H_1\) is

\[ \begin{align} L_1 &= \prod_{i=1}^{n} f_X(x_i) \prod_{j=1}^{m} f_Y(y_j) \\ &= \left(\frac{1}{\sigma\sqrt{2 \pi}} \right)^{n+m} \exp{ \left( -\frac{1}{2\sigma^2} \left[\sum_{i=1}^{n}(x_i - \mu_X)^2 + \sum_{j=1}^{m}(y_j - \mu_Y)^2 \right] \right) } \end{align} \]

For \(H_0\) the likelihood is the same, but with \(\mu_X = \mu_Y = \mu_0\).

To get the maximum likelihood estimates, you need to solve \(\frac{\partial L_0}{\partial \mu_0}=0\), and \(\frac{\partial L_0}{\partial \sigma^2_0}=0\). These give us

\[ \hat{\mu}_0 = \frac{\sum_{i=1}^{n}x_i + \sum_{j=1}^{m}y_j}{n+m} \]

and

\[ \sigma^2_0 = \frac{\sum_{i=1}^{n}(x_i-\hat{\mu}_0)^2 + \sum_{j=1}^{m}(y_j-\hat{\mu}_0)^2}{n+m} \]

Derive the maximum likelihood estimates under \(H_0\). The results are below.

Answer (not available yet!) I should record a video about this.

Similarly for \(H_1\) we get

\[ \hat{\mu}_X = \bar{x}, \hat{\mu}_Y = \bar{y} \]

and

\[ \sigma^2_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + \sum_{j=1}^{m}(y_j-\bar{y})^2}{n+m} \]

Now you can plug these into the likelihoods, and after a bit of re-arrangement get

\[ \lambda = \frac{L_0}{L_1} = \left(\frac{\sigma_1^2}{\sigma_0^2} \right)^{\frac{n+m}{2}} \]

I was thinking of not going through this, but this contains a couple of important points.

First, the likelihood ratio for the means is a function of the estimates of the variances. This is an idea that crops up repeatedly. The intuition is that when we are modelling data, we are trying to explain the structure in the data. So here there is a possible structure of two groups with different means. But there is random variation on top of the structure. If the model is “good”, then it should explain more of the variation than just randomness. So we look at how much unexplained variation there is under \(H_0\) and \(H_1\). If \(\mu_X = \mu_Y\) then the value of \(\bar{x} -\bar{y}\) can purely be explained by random variation, i.e. \(\sigma_P^2\).

Secondly, the statistic used is the ratio of variances. Under \(H_0\) the expected value of the ratio should be 1. We will see later in the course that this idea can be extended to more complex models (but using the explained variance for \(H_1\)).

Back to the likelihood ratio test…

We can manipulate the likelihood ratio (this is the second page of §9.A.1) to get

\[ \lambda^{2/(n+m)} = \frac{n+m-2}{n+m-2 + t^2} \]

So the only random variable in the ratio is \(t^2\). We know that \(t\) has a t distribution with \(n+m-2\) degrees of freedom.
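We can convince ourselves of this identity numerically. Below is a Python sketch with made-up data: s0 and s1 are the MLEs \(\sigma^2_0\) and \(\sigma^2_1\) from above, and the two sides of the identity agree to floating-point precision.

```python
import math

# Made-up data, purely to check the identity numerically
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0]
n, m = len(xs), len(ys)
xbar, ybar = sum(xs) / n, sum(ys) / m

mu0 = (sum(xs) + sum(ys)) / (n + m)                      # MLE of mu under H0
s0 = (sum((x - mu0) ** 2 for x in xs)
      + sum((y - mu0) ** 2 for y in ys)) / (n + m)       # sigma^2_0
s1 = (sum((x - xbar) ** 2 for x in xs)
      + sum((y - ybar) ** 2 for y in ys)) / (n + m)      # sigma^2_1

sp2 = (n + m) * s1 / (n + m - 2)                         # pooled variance S_P^2
t = (xbar - ybar) / math.sqrt(sp2 * (1 / n + 1 / m))

lhs = s1 / s0                                            # = lambda^{2/(n+m)}
rhs = (n + m - 2) / (n + m - 2 + t ** 2)
print(abs(lhs - rhs) < 1e-12)  # prints "True": the identity holds
```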

Exercise: derive the hypothesis test, i.e. write down the regions of the sample space where \(H_0\) would be rejected.

Answer

We would reject \(H_0\) when \(\lambda^{2/(n+m)}\) is too small (i.e. below some critical value), which is equivalent to \(t^2\) being too large. i.e. we reject \(H_0\) if \(t^2 \ge t_{crit}^2\), or if \(|t| \ge |t_{crit}|\), where \(Pr(-t_{crit} <T <t_{crit}|H_0) = 1 - \alpha\). But we know (from above) that \(T\) follows a t distribution with \(n+m-2\) degrees of freedom, so \(-t_{crit} = t_{\alpha/2, n+m-2}\), and \(+t_{crit} = t_{1-\alpha/2, n+m-2}\).

The Behrens-Fisher problem: testing \(\mu_X=\mu_Y\) when \(\sigma^2_X \ne \sigma^2_Y\)

This is a problem with no exact solution (nowadays we would probably simulate the problem away). Instead approximate solutions have been suggested.

The natural test statistic to try to use is

\[ W = \frac{\bar{X} - \bar{Y} - (\mu_X-\mu_Y)}{\sqrt{\frac{S^2_X}{n} + \frac{S^2_Y}{m}}} \]

The good news is that this is approximately distributed as a t distribution. The bad news is the degrees of freedom:

\[ \frac{\left(\frac{\sigma_X^2}{n} + \frac{\sigma_Y^2}{m} \right)^2}{\frac{\sigma_X^4}{n^2(n-1)} + \frac{\sigma_Y^4}{m^2(m-1)}} \]

This was derived by Welch and published in 1938 (alongside a paper about The London Skull). The derivation is outlined on p458 of L&M. The idea, essentially, is to ask what \(\chi^2\) distribution has the same mean and variance as the sampling distribution of \(\left(\frac{S^2_X}{n} + \frac{S^2_Y}{m} \right)/\left(\frac{\sigma^2_X}{n} + \frac{\sigma^2_Y}{m} \right)\). The hope is that this approximation is close enough to the truth.
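In practice the unknown \(\sigma^2\)'s in the degrees-of-freedom formula are replaced by the sample variances. A Python sketch (the data are made up):

```python
import math

def welch(xs, ys):
    """Welch's statistic W and its approximate degrees of freedom,
    with the sample variances plugged in for the unknown sigma^2's."""
    n, m = len(xs), len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / m
    vx = sum((x - xbar) ** 2 for x in xs) / (n - 1)      # S_X^2
    vy = sum((y - ybar) ** 2 for y in ys) / (m - 1)      # S_Y^2
    w = (xbar - ybar) / math.sqrt(vx / n + vy / m)
    df = (vx / n + vy / m) ** 2 / (
        vx ** 2 / (n ** 2 * (n - 1)) + vy ** 2 / (m ** 2 * (m - 1)))
    return w, df

# Illustrative (made-up) data; note df is generally not an integer
w, df = welch([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0])
```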

Inference about Variances

Occasionally we want to make inferences about the variances (usually we will be looking at a more complex problem than just two samples, but a lot of the intuition carries over).

We are still thinking about the problem where \(X_i \sim N(\mu_X, \sigma^2_X), i=1,\dots,n\) and \(Y_j \sim N(\mu_Y, \sigma^2_Y), j=1,\dots,m\). But now we want to focus on \(\sigma^2_X\) and \(\sigma^2_Y\). It is natural to look at their ratio, \(\sigma^2_X/\sigma^2_Y\).

First, the theory. We know that

\[ \sum_{i=1}^{n}{\frac{(X_i-\bar{X})^2}{\sigma_X^2}} \sim \chi^2_{n-1} \]

and similarly for \(Y_j\). We want the ratio \(\frac{\sigma_X^2}{\sigma_Y^2}\), and both variances involve \(\chi^2\) distributions, so we can predict an \(F\) distribution in the near future. Last week’s notes will be worth checking, to remind yourself about the \(F\) distribution.

What is the distribution of \(\frac{\sigma_X^2}{\sigma_Y^2}\)?

Answer

We know that if \(U \sim \chi^2_n\) and \(V \sim \chi^2_m\) are independent, then \(\frac{V/m}{U/n}\) follows an \(F_{m,n}\) distribution.

So

\[ \frac{\sum_{i=1}^{n}{\frac{(X_i-\bar{X})^2}{(n-1)\sigma_X^2}}} {\sum_{j=1}^{m}{\frac{(Y_j-\bar{Y})^2}{(m-1)\sigma_Y^2}}} \sim F_{n-1,m-1} \] Of course, \(\sum_{i=1}^{n}{\frac{(X_i-\bar{X})^2}{(n-1)}}= s_X^2\), so

\[ \frac{s_X^2/\sigma_X^2}{s_Y^2/\sigma_Y^2} \sim F_{n-1,m-1} \] So, with a slight abuse of notation,

\[ \frac{\sigma_Y^2}{\sigma_X^2} \sim \frac{s^2_Y}{s_X^2} F_{n-1,m-1} \]

Estimates and Confidence Intervals

We can estimate the variance ratio \(\sigma^2_Y/\sigma^2_X\) with \(\frac{s^2_Y}{s_X^2}\) (and its reciprocal with \(\frac{s^2_X}{s_Y^2}\)).

But what is the \(100(1-\alpha)\)% confidence interval for \(\frac{\sigma_X^2}{\sigma_Y^2}\)?

Answer

We will need the critical values of the F distribution, \(F_{\alpha/2, a, b}\) and \(F_{1-\alpha/2, a, b}\). Note that because the F distribution is not symmetric, we need both quantiles: we cannot get one from the other with a simple sign flip, as we can with the t distribution. Given this, we know that the interval is defined by

\[ Pr\left(F_{\alpha/2, n-1, m-1} \le \frac{s_X^2/\sigma_X^2}{s_Y^2/\sigma_Y^2} \le F_{1-\alpha/2, n-1, m-1}\right) = 1 - \alpha \]

So

\[ Pr\left(F_{\alpha/2, n-1, m-1} \frac{s_Y^2}{s_X^2}\le \frac{\sigma_Y^2}{\sigma_X^2} \le F_{1-\alpha/2, n-1, m-1}\frac{s_Y^2}{s_X^2}\right) = 1 - \alpha \]

But this is the confidence interval for \(\sigma_Y^2/\sigma_X^2\)! Now, we could try a cunning manipulation, or we could simply reverse the roles of \(X\) and \(Y\). Note that we have to reverse the degrees of freedom too, so the interval becomes \(\left(F_{\alpha/2, m-1, n-1} \frac{s_X^2}{s_Y^2}, F_{1-\alpha/2, m-1, n-1}\frac{s_X^2}{s_Y^2}\right)\).
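A Python sketch of this interval. The F quantiles have to be supplied from tables or software (the standard library has no F quantile function), so they are parameters here; the values in the usage line are placeholders for illustration, not real F quantiles.

```python
def variance_ratio_ci(xs, ys, f_lo, f_hi):
    """CI for sigma_X^2 / sigma_Y^2.
    f_lo = F_{alpha/2, m-1, n-1} and f_hi = F_{1-alpha/2, m-1, n-1},
    looked up in tables or computed with statistical software."""
    n, m = len(xs), len(ys)
    xbar, ybar = sum(xs) / n, sum(ys) / m
    vx = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # s_X^2
    vy = sum((y - ybar) ** 2 for y in ys) / (m - 1)  # s_Y^2
    return f_lo * vx / vy, f_hi * vx / vy

# Made-up data; 0.25 and 4.0 are placeholder quantiles, NOT real F values
lo, hi = variance_ratio_ci([1.0, 2.0, 3.0], [2.0, 4.0, 6.0, 8.0],
                           f_lo=0.25, f_hi=4.0)
```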

Test of the difference in variances

Now we can also test the differences. We do not need to go through the details, just use the relationship between confidence intervals and hypothesis tests that you might have already noticed.

Write down the hypothesis test for \(\sigma_X^2 = \sigma_Y^2\)

Start by writing down \(H_0\) and \(H_1\)
Answer

\(H_0: \sigma_X^2 = \sigma_Y^2\)

\(H_1: \sigma_X^2 \ne \sigma_Y^2\)

For our test statistic we can use \(s^2_X/s^2_Y\): this should be close to 1 if \(H_0\) is true. So from above, we can reject \(H_0\) if either \(s^2_X/s^2_Y \ge F_{1-\alpha/2, n-1, m-1}\) or \(s^2_X/s^2_Y \le F_{\alpha/2, n-1, m-1}\).

This is theorem 9.3.1.

How to handle lack of normality. Or why do we care about this module?

This module has assumed that the data follow normal distributions. But, one may argue, nothing is normal. This is correct, but statisticians are happy to be only slightly wrong. Back in §7.4 the book discussed tests when normality is wrong, and their point was that the effect is to change the actual Type I error rate so that it is not 5%. If it is only slightly wrong (6% or 3%, say), the differences will be small.

How much should we worry about this? The simple answer is “it depends”. But t-tests are usually robust, i.e. a violation of the assumptions is unlikely to make a big difference. This is partly due to the central limit theorem: everything non-pathological becomes normally distributed if the sample size is large enough, and the t-statistic is a function of means.

Of course, this does not always work. And at least one book has been written about when it is a problem and what to do about it. But diagnosing these problems is not part of this course: it belongs to the Dark Arts of statistics, which some statisticians try to hide from mathematicians (they don’t want to scare them off).


  1. This is not the most serious comment. Sorry, I am not a fan of hypothesis tests.↩︎