Chapter 7: The Normal Distribution

This week’s module covers Chapter 7 of L&M. The module is arranged in a different order, because I find this easier to follow. If you prefer the book’s way of doing things, use that.

Learning outcomes

be able to

  • derive the t distribution from the Normal and Gamma distributions
  • derive the F distribution as ratio of gammas
  • explain the relationship between gamma and \(\chi^2\) distributions
  • make inferences about \(\mu\) when data comes from a normal distribution
  • make inferences about \(\sigma^2\) when data comes from a normal distribution

Hints and reminders are in bold.

Questions appear in blue.

Some hints and answers are hidden away: click here Hello! I might be the answer. Or a hint.

References to §x.y are to section §x.y of the Larsen and Marx textbook, e.g. §4.3 is the section titled The Normal Distribution.

Distributions

We are going to start by looking at inference for data that is normally distributed. There are three reasons for this: (1) it is used a lot in practice, (2) more complex models build on this theory, and (3) even models that aren’t built on this tend to look very similar (thanks to asymptotics: in essence, anything that isn’t pathological looks normal with a large enough sample size).

If we want to make inferences about the normal distribution, we need to know the distributions of the parameter estimators, and of the likelihood. Fortunately there are only three distributions we need for what we want, and these are straightforward (even if not all of the derivations are).

The Normal Distribution

We know the normal distribution from §4.3:

\[ f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \frac{(x - \mu)^2}{\sigma^2}} \]

It’s symmetric, and the log of the pdf is a quadratic function of \(x\). A lot more could be written about it, and some of that will be in this course.

We know (from Corollaries §4.3.1 & §4.3.2, which follow from Theorem 4.3.3) that the sum of independent normally distributed random variables is also normally distributed. So the mean of a sample from a normal distribution is also normally distributed (if you don’t believe me, prove it!). But which normal distribution? We need to know its mean and variance.

  • If we have independent \(X_i \sim N(\mu, \sigma^2)\), \(i=1,\dots, n\), what is the expectation of \(\frac{1}{n} \sum{X_i}\)?
Answer \[ E(\frac{1}{n}\sum_{i=1}^n{X_i}) = \frac{1}{n} \sum_{i=1}^n{E(X_i)} = \frac{1}{n} n\mu = \mu \]
  • What is the variance of \(\frac{1}{n} \sum{X_i}\)?
Answer \[ Var(\frac{1}{n}\sum_{i=1}^n{X_i}) = \left(\frac{1}{n}\right)^2 \sum_{i=1}^n{Var(X_i)} = \frac{1}{n^2} n \sigma^2 = \frac{\sigma^2}{n} \] Also note that this means that the standard deviation is \(\frac{\sigma}{\sqrt{n}}\).
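If you want to convince yourself of these two results numerically, here is a minimal simulation sketch in R (the values of \(\mu\), \(\sigma\) and \(n\) are arbitrary choices):

```r
set.seed(1)
mu <- 5; sigma <- 2; n <- 10
# 10,000 sample means, each from n independent N(mu, sigma^2) draws
means <- replicate(1e4, mean(rnorm(n, mean = mu, sd = sigma)))
mean(means)   # close to mu = 5
var(means)    # close to sigma^2/n = 0.4
sd(means)     # close to sigma/sqrt(n) = 0.632...
```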

Gamma Distribution

This is covered in §4.6. The gamma distribution has the pdf

\[ f_Y(y|\lambda, r) = \frac{\lambda^r}{\Gamma(r)}y^{r-1}e^{-\lambda y}, ~~~ y,\lambda, r > 0 \]

If we hide the normalising constants we see that

\[ f_Y(y|\lambda, r) \propto y^{r-1}e^{-\lambda y} \]

The Gamma crops up all over the place:

  • as the distribution of times to the \(r^{th}\) event
  • as the distribution of a sum of squared normal random variables
  • as an approximation to the likelihood of a Poisson distribution (as we will see later)

The shape of the gamma depends on \(r\), so we often call \(r\) the shape parameter.
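To see how the shape changes with \(r\), here is a minimal R sketch of the sort of plot that belongs here (the values of \(r\) are arbitrary, with \(\lambda\) fixed at 1):

```r
# Gamma densities for a few shape parameters, rate (lambda) fixed at 1
y <- seq(0.01, 10, length.out = 200)
shapes <- c(0.5, 1, 2, 5)
plot(y, dgamma(y, shape = shapes[1], rate = 1), type = "l", lty = 1,
     ylim = c(0, 1.2), xlab = "y", ylab = "Density")
for (i in 2:length(shapes)) lines(y, dgamma(y, shape = shapes[i], rate = 1), lty = i)
legend("topright", legend = paste("r =", shapes), lty = seq_along(shapes))
```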

The \(\chi^2\) Distribution

The \(\chi^2\) distribution is a special case of the Gamma distribution. It has one parameter, the Degrees of Freedom: a \(\chi^2\) distribution with \(m\) degrees of freedom is a Gamma distribution with \(r=\frac{m}{2}\) and \(\lambda=\frac{1}{2}\), and we denote it \(\chi^2_m\).
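As a quick numerical check of this correspondence (the choice of \(m\) and the evaluation points is arbitrary):

```r
m <- 4
u <- c(0.5, 1, 2, 5, 10)
# the chi-squared density with m d.f. equals the gamma density with shape m/2, rate 1/2
all.equal(dchisq(u, df = m),
          dgamma(u, shape = m/2, rate = 1/2))   # TRUE
```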

Exercise: prove that the sum of squares of standard normal random variables follows a \(\chi^2\) distribution.

The theorem:

Let \(U = \sum_{j=1}^m Z_j^2\), where \(Z_j \sim N(0,1)\) are independent standard normal variables. Then \(U\) follows a gamma distribution with \(r = \frac{m}{2}\) and \(\lambda = \frac{1}{2}\), i.e.

\[ f_U(u) = \frac{1}{2^{\frac{m}{2}} \Gamma(\frac{m}{2})} u^{\frac{m}{2}-1} e^{-\frac{u}{2}} \]

The strategy: work this out for \(m=1\), and then the sum \(\sum_{j=1}^m Z_j^2\) is straightforward.

  • What is the pdf for \(m=1\)?

Hint: work out the cdf for \(Z^2\) (i.e. \(Pr(Z^2 \le u)\)) by writing it as the integral of the pdf of \(Z\). Then differentiate it.

That hint wasn’t enough
  • \(Pr(Z^2 \le u) = Pr(-\sqrt{u} \le Z \le \sqrt{u})\)
  • the pdf of \(Z\) is symmetric around 0, so this equals \(2 Pr(0 \le Z \le \sqrt{u})\): you only need \(\sqrt{u}\) once in the integral.
I have the integral, but I want some help differentiating it

(I had to look this up)

If you have a function \(G(b) = \int_a^b g(x) dx\), then \(\frac{dG(b)}{db} = g(b)\) (the Fundamental Theorem of Calculus), but here the upper limit is \(\sqrt{u}\), so we need to use the chain rule.
  • What is the pdf for \(U\) when \(m>1\)?

This is easy if you use the relevant result about the Gamma distribution (§4.6).

Help, I don’t know the result to use Have a look at Theorem 4.6.4
I’ve done it, was I right?

See Theorem 7.3.1 in §7.3. Or watch this video (my apologies for the sound quality and the pen tapping)

Now you have proved that, it is worth pausing to think about what you have done. We know that the sum of normal random variables is normally distributed, and now we know that the sum of squares of standard normal random variables follows a \(\chi^2\) distribution, with the degrees of freedom equal to the number of variables in the sum. If we have a normal distribution with a known variance, we can standardise it and get a \(\chi^2\) distribution for the square. All of which means we will be meeting the \(\chi^2\) distribution quite a bit in our future.
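A quick simulation sketch of the theorem (the value of \(m\) and the number of replicates are arbitrary):

```r
set.seed(1)
m <- 3
# 10,000 realisations of the sum of m squared standard normals
U <- replicate(1e4, sum(rnorm(m)^2))
hist(U, breaks = 50, freq = FALSE)
curve(dchisq(x, df = m), add = TRUE)   # chi-squared density with m degrees of freedom
```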

The F Distribution

The F distribution is the distribution of the ratio of two independent \(\chi^2\) random variables, each divided by its degrees of freedom. It also crops up a lot with normal distributions, because (as we have just seen) sums of squares of standard normal random variables follow \(\chi^2\) distributions, and ratios of these are common when working with the normal distribution, e.g. in likelihood ratio tests.

The pdf for an F distribution is

\[ f_{m,n}(w) = \frac{ \Gamma(\frac{m+n}{2}) m^{m/2} n^{n/2} w^{m/2-1}}{\Gamma(\frac{m}{2}) \Gamma(\frac{n}{2}) (n + mw)^{(m+n)/2}} \]

Stripping away the normalising constants this is

\[ f_{m,n}(w) \propto \frac{ w^{m/2-1}}{ (n + mw)^{(m+n)/2}} \]
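Before deriving it, here is a simulation sketch of the claim that a ratio of independent \(\chi^2\) variables, each divided by its degrees of freedom, follows this distribution (the degrees of freedom are arbitrary choices):

```r
set.seed(1)
m <- 5; n <- 10
# (V/m) / (U/n), with V ~ chi-squared_m and U ~ chi-squared_n
W <- (rchisq(1e4, df = m)/m) / (rchisq(1e4, df = n)/n)
hist(W, breaks = 100, freq = FALSE, xlim = c(0, 6))
curve(df(x, df1 = m, df2 = n), add = TRUE)   # F(m, n) density
```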

Exercise: F as ratio of \(\chi^2\) distributions

Prove that if \(U \sim \chi^2_n\) and \(V \sim \chi^2_m\), then \(\frac{V/m}{U/n}\) follows an F distribution. We will do this in steps.

Step 1: Find the pdf for \(V/U\).

You will need to know that

  • \(f_U(u) = \frac{1}{2^{\frac{n}{2}} \Gamma(\frac{n}{2})} u^{\frac{n}{2}-1} e^{-\frac{u}{2}}\), and similarly for \(f_V(v)\) with \(m\) in place of \(n\)
  • \(f_{V/U}(w) = \int_0^\infty{|u| f_U(u) f_V(uw) du}\)
  • you can solve the integral by inspection (i.e. recognise the integrand - the thing being integrated - and from that deduce the solution)
I’m inspecting the integrand and don’t recognise it What have you seen recently that has the form \(x^a e^{-bx}\) (for various \(a, b\))?
Sorry, I’m still not seeing it It’s a Gamma distribution. You now need to work out what \(r\) and \(\lambda\) are.
Solution (check this before you get to step 2!)

\[ f_{V/U}(w) = \frac{\Gamma(\frac{n+m}{2})}{\Gamma(\frac{n}{2})\Gamma(\frac{m}{2})} \frac{w^{(m/2)-1}}{(1+w)^{\frac{n+m}{2}}} \]

Step 2: Find the pdf for \(\frac{V/m}{U/n}\).

This needs Theorem 3.8.2: if \(Y = aX + b\) then \(f_Y(y)=\frac{1}{|a|}f_X \left(\frac{y-b}{a} \right)\).

Step 2 mainly involves juggling a lot of constants to get the right pdf.

Solution See Theorem 7.3.3

The t distribution

The t-distribution has more to do with beer than tea. It is the ratio of a standard normal to the square root of a \(\chi^2\) divided by its degrees of freedom. The practical reason for it being important is that if we sample some data, \(Y_i\), from a normal distribution, then \(\frac{\bar{Y} - \mu}{S/\sqrt{n}}\) follows a t distribution.

We need to derive this. It takes a few steps, but we already have some of them.

First, the general result is that if \(Z \sim N(0,1)\) and \(U \sim \chi^2_n\) (i.e. \(Z\) follows a standard normal distribution, and \(U\) follows a \(\chi^2\) distribution with \(n\) degrees of freedom), and \(Z\) and \(U\) are independent, then \(T_n =\frac{Z}{\sqrt{U/n}}\) follows a t-distribution, i.e. 

\[ f_{T_n}(t) = \frac{\Gamma(\frac{n+1}{2})}{\sqrt{n \pi} \Gamma(\frac{n}{2}) } \left(1 + \frac{t^2}{n} \right)^{-(n+1)/2} ~~~~~~ -\infty <t<\infty \]

This is symmetric around 0 (L&M have a Lemma, hopefully you can see another proof of this).
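A simulation sketch of this result (the value of \(n\) is an arbitrary choice):

```r
set.seed(1)
n <- 5
# Z / sqrt(U/n), with Z standard normal and U ~ chi-squared_n
Tsim <- rnorm(1e4) / sqrt(rchisq(1e4, df = n)/n)
hist(Tsim, breaks = 100, freq = FALSE, xlim = c(-5, 5))
curve(dt(x, df = n), add = TRUE)   # t density with n degrees of freedom
```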

Exercise: Derive the t distribution

The proof goes via cumulative distribution functions, and the fact that \(T^2_n = \frac{Z^2}{U/n}\), which therefore has an \(F_{1, n}\) distribution.

  1. Show that \(F_{T_n}(t) = \frac{1}{2} + \frac{1}{2}F_{T^2_n}(t^2)\)
Hint

The density is symmetrical, so assume \(t>0\) and start by integrating from 0.

Also, note that the result has \(F_{T^2_n}(t^2) = Pr(0 \le T_n^2 \le t^2)\). When you take \(\sqrt{T_n^2}\), how do the limits change?

Also - I have no idea if this is now too easy or too difficult.
Solution This is the first half of Theorem 7.3.4.
  2. Differentiate \(F_{T_n}(t)\) to get the required result
Hints
  1. Look at the exercise proving that the sum of squares of standard normal random variables follows a \(\chi^2\) distribution.
  2. \(\Gamma(\frac{1}{2})=\sqrt{\pi}\)
Solution This is the second half of Theorem 7.3.4.

One way of looking at the difference between the normal and t distributions is that the t distribution is a mixture of normal distributions with different variances, so some are wider than others. What this does to the shape of the distribution is to make the peak higher (where there are normal distributions with low variance), and give it thicker tails (where there are normal distributions with high variance). Here we plot a normal distribution and a t distribution with 3 degrees of freedom (also see Fig. 7.3.2).
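A minimal R sketch of such a plot:

```r
x <- seq(-5, 5, length.out = 400)
plot(x, dnorm(x), type = "l", xlab = "x", ylab = "Density")   # standard normal
lines(x, dt(x, df = 3), lty = 2)                               # t with 3 degrees of freedom
legend("topright", legend = c("N(0,1)", "t, 3 d.f."), lty = 1:2)
```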

Inferences About the Normal Distribution

Now we have these distributions, and you have some idea how they fit together, we can use them to develop inference for a normal distribution.

We will assume that we have collected some data, \(Y_i\) (\(i=1,\dots,n\)), and we will assume it follows a normal distribution with an unknown mean \(\mu\) and variance \(\sigma^2\). We write this as \(Y_i \sim N(\mu, \sigma^2)\). Because we do not know \(\mu\) or \(\sigma^2\) we will have to estimate them.

There are a few questions we can ask about this distribution:

  • what is the mean, and how certain are we about this?
  • does the mean equal a specific value?
  • what is the variance?
  • does the variance equal a specific value?

The first two questions are most common, and once we understand these, we can build on them to look at how the mean changes between data points. First, we have a couple of results to establish.

We will be using the mean and variance of the data, i.e. \(\bar{Y}=\frac{1}{n}\sum_{i=1}^{n}{Y_i}\) and \(S^2 = \frac{1}{n-1}\sum_{i=1}^{n}{(Y_i - \bar{Y})^2} = \frac{1}{n-1}\left(\sum_{i=1}^{n}{Y_i^2} - n\bar{Y}^2\right)\).

We will also need the following result:

\[\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\]

which is probably not surprising, but L&M hide it in Appendix 7.A.1. Notice that the degrees of freedom here is \(n-1\). The loss of one degree of freedom comes from having to estimate the mean, \(\bar{Y}\) (in Appendix 7.A.1 this is the job of the final row of \(A\)).
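A simulation sketch of this result (the values of \(n\), \(\mu\) and \(\sigma\) are arbitrary):

```r
set.seed(1)
n <- 8; mu <- 2; sigma <- 3
# (n-1)S^2/sigma^2 for 10,000 simulated samples (var() uses the 1/(n-1) estimator)
stat <- replicate(1e4, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)
hist(stat, breaks = 50, freq = FALSE)
curve(dchisq(x, df = n - 1), add = TRUE)   # note n - 1, not n, degrees of freedom
```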

Proof that \(T_{n-1} = \frac{\bar{Y} - \mu}{S/\sqrt{n}}\) follows a \(t_{n-1}\) distribution

This is theorem 7.3.5. We have done all of the hard work, so let’s do the easy work.

  • what is the distribution of \(\bar{Y} - \mu\)? (in terms of \(\sigma^2\) and \(n\))
Answer We know that sums of normal random variables are normally distributed, and the mean of \(\bar{Y}\) is \(\mu\). The variance of \(\bar{Y}\) is \(\sigma^2/n\) (see the section above on the normal distribution), so \(\bar{Y} - \mu \sim N(0, \sigma^2/n)\).
  • We know the distribution of \(\frac{(n-1)S^2}{\sigma^2}\)

  • It is also known that \(\bar{Y}\) and \(S^2\) are independent, and

if \(Z \sim N(0,1)\) and \(U \sim \chi^2_n\), and \(Z\) and \(U\) are independent, then \(T_n =\frac{Z}{\sqrt{U/n}}\) follows a t-distribution.

  • what is the distribution of \(t = \frac{\bar{Y} - \mu}{S/\sqrt{n}}\)?
Answer

\(\bar{Y} - \mu \sim N(0, \sigma^2/n)\), so \((\bar{Y} - \mu)/(\sigma/\sqrt{n})\) must be a standard normal.

We know that \(\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}\)

and \(\frac{Z}{\sqrt{U/m}}\) follows a t-distribution with \(m\) degrees of freedom (where \(Z\) and \(U\) are independent).

Set \(Z = (\bar{Y} - \mu)/(\sigma/\sqrt{n})\), and \(U = \frac{(n-1)S^2}{\sigma^2}\) so \(m = n-1\). Then

\[ t = \frac{(\bar{Y} - \mu)}{\sigma/\sqrt{n}} \bigg/ \sqrt{\frac{(n-1)S^2}{(n-1)\sigma^2}} = \frac{\sqrt{n}(\bar{Y} - \mu)}{\sigma} \bigg/ \frac{S}{\sigma} = \frac{(\bar{Y} - \mu)}{S/\sqrt{n}} \]

and so must follow a t distribution with \(n-1\) degrees of freedom.

It is worth pausing to see what we have. For a single random variable we know that if \(X \sim N(\mu, \sigma^2)\), then \((X-\mu)/\sigma\) follows a standard normal distribution. Here we have the sample equivalent: if we can assume \(Y_i\), \((i=1,\dots, n)\), follow a normal distribution, then \(\frac{\bar{Y} - \mu}{S/\sqrt{n}}\) follows a t-distribution with \(n-1\) degrees of freedom. The difference comes, in essence, from having to estimate the variance.
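To see the result in action, here is a simulation sketch (the values of \(n\), \(\mu\) and \(\sigma\) are arbitrary):

```r
set.seed(1)
n <- 6; mu <- 10; sigma <- 4
# the t statistic for 10,000 simulated samples of size n
tstat <- replicate(1e4, {
  y <- rnorm(n, mu, sigma)
  (mean(y) - mu) / (sd(y)/sqrt(n))
})
hist(tstat, breaks = 100, freq = FALSE, xlim = c(-5, 5))
curve(dt(x, df = n - 1), add = TRUE)   # t density with n - 1 degrees of freedom
```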

Now we have lots of theory, let’s use it! The t statistic will be pivotal in inferences for the mean.

Maximum Likelihood Estimator for \(\mu\)

The likelihood for \(n\) observations \(Y_i\) from a normal distribution is

\[ f(Y_i|\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \frac{(Y_i - \mu)^2}{\sigma^2}} = \left(\frac{1}{\sqrt{2 \pi \sigma^2}}\right)^n e^{-\frac{1}{2} \frac{\sum_{i=1}^n(Y_i - \mu)^2}{\sigma^2}} \]

Exercise: derive the maximum likelihood estimator for \(\mu\)

Hint: use the log likelihood

Hint
  1. Expand the brackets in the log likelihood before differentiating
  2. \(\sum_{i=1}^n{Y_i} = n\bar{Y}\)
Answer

The log likelihood is

\[ \begin{align} l &= -n\log(\sqrt{2 \pi \sigma^2}) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n}(Y_i - \mu)^2 \\ &= C - \frac{1}{2 \sigma^2} \left(\sum_{i=1}^{n}Y_i^2 - 2\sum_{i=1}^{n}Y_i\mu + \sum_{i=1}^{n}\mu^2 \right) \\ &= C - \frac{1}{2 \sigma^2} \left(\sum_{i=1}^{n}Y_i^2 - 2n \bar{Y}\mu + n\mu^2 \right) \end{align} \]

Now differentiate w.r.t \(\mu\) and set to 0:

\[ \frac{dl}{d\mu} = -\frac{1}{2 \sigma^2} \left(-2n \bar{Y} + 2 n\mu \right) = 0 \] And then rearrange to get \(\mu = \bar{Y}\):

\[ \begin{align} -\frac{2n}{2 \sigma^2} \left(- \bar{Y} + \mu \right) &= 0 \\ -\bar{Y} + \mu &= 0 \\ \mu &= \bar{Y} \end{align} \]
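If you would rather check this numerically, here is a minimal R sketch: it maximises the log-likelihood over \(\mu\) on simulated data (the true values and the fixed \(\sigma^2\) are arbitrary choices; the m.l.e. of \(\mu\) does not depend on \(\sigma^2\)):

```r
set.seed(1)
y <- rnorm(20, mean = 3, sd = 2)   # simulated data, arbitrary true values

# log-likelihood as a function of mu (sigma^2 fixed; its value does not affect the maximiser)
loglik <- function(mu, y, sigma2 = 4) {
  sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}

optimize(loglik, interval = c(-10, 10), y = y, maximum = TRUE)$maximum
mean(y)   # should agree, up to numerical tolerance
```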

Confidence Interval for \(\mu\)

We know that \(\frac{\bar{Y} - \mu}{S/\sqrt{n}}\) follows a t distribution, so we can construct a confidence interval for \(\mu\) using this.

Remember that if we had a normal distribution with known variance, \(X_i \sim N(\mu, \sigma^2)\), we would calculate the 95% confidence interval for \(\mu\) as \(\bar{X} \pm 1.96\, \sigma/\sqrt{n}\), where 1.96 is the 97.5th percentile of the standard normal distribution, so we have 2.5% of the probability outside each end of the confidence interval. Because the normal distribution is symmetrical, this is (a) easy, and (b) the best interval1.

For the t distribution we can do something similar, but instead of 1.96 we need to know (or look up) the critical values of the t distribution, i.e. \(t_{\alpha/2, n-1}\) for a \(100(1-\alpha)\%\) confidence interval. Basically, we construct the interval in the same way, but with different critical values, taken from a different distribution.

Exercise: derive the confidence interval for \(\mu\)

Solution This is Theorem 7.4.1
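In R, the interval can be computed directly from the formula, and cross-checked against t.test() (the data below are simulated just for illustration):

```r
set.seed(1)
y <- rnorm(15, mean = 3, sd = 2)   # simulated data
n <- length(y)
alpha <- 0.05

# Ybar +/- t_{alpha/2, n-1} * S/sqrt(n)
mean(y) + c(-1, 1) * qt(1 - alpha/2, df = n - 1) * sd(y)/sqrt(n)

t.test(y)$conf.int                 # the same interval
```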

Likelihood Ratio Test for \(\mu = \mu_0\)

It turns out that this is just the same as asking if \(\mu_0\) is in the confidence interval. But let’s derive it properly.

Formally, the test is of \(H_0: \mu = \mu_0\) against \(H_1: \mu \ne \mu_0\). We need the maximum likelihood estimates of \(\mu\) and \(\sigma^2\) under \(H_0\) and under \(H_1\). We have some of these already, and I will save you the effort of deriving the m.l.e.s of \(\sigma^2\) (they are not too difficult, but nobody uses the m.l.e.s in practice2). I will denote the parameters under each hypothesis with subscripts, e.g. \(\sigma^2_1\) for \(\sigma^2\) under \(H_1\).

Under \(H_0\) we obviously have \(\mu_0 = \mu_0\). But then we have \(\sigma^2_0 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \mu_0)^2\).

Under \(H_1\) we have \(\mu_1 = \bar{Y}\) (see above). For \(\sigma^2_1\) we get \(\sigma^2_1 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \bar{Y})^2\).

Exercise: Show that the likelihood ratio, \(\lambda = L_{H_0}/L_{H_1}\), is

\[ \lambda = \frac{L_{H_0}}{L_{H_1}} = \left[\frac{\sum_{i=1}^{n}{(Y_i - \bar{Y})^2}} {\sum_{i=1}^{n}{(Y_i - \mu_0)^2}} \right]^{n/2} \]

Step 1: write down the likelihoods under \(H_0\) and \(H_1\), and simplify them

Do this by writing out the likelihoods under \(H_0\) and \(H_1\) separately, plugging in the estimators for \(\mu\) and \(\sigma^2\). Simplify the likelihoods before writing down the ratio.

Help!
  1. Once you have done the calculations for \(H_0\), it is almost the same for \(H_1\).
  2. There is some very convenient cancelling in the exponents, when you write down the likelihood under \(H_0\) (or \(H_1\))
Answer This is the first part of Theorem 7.A.2.1

Step 2: show that \[ \lambda = \left[1 + \frac{t^2}{n-1} \right]^{-n/2} \] where \(t = \frac{\bar{Y}-\mu_0}{s/\sqrt{n}}\).

Hint: notice that the power changes sign! So write the denominator out so you get the numerator plus another term. Then you can invert it, and the 1 appears.

Help!

Write the denominator as \[ \sum_{i=1}^n(Y_i - \mu_0)^2 = \sum_{i=1}^n[(Y_i-\bar{Y}) + (\bar{Y}-\mu_0)]^2 \] and re-arrange.

Answer This is the second part of Theorem 7.A.2.1

Note that we get qualitatively the right monotonic behaviour: as \(t^2\) increases, \(\lambda\) decreases. The only random variable in \(\lambda\) is \(t\), so the critical value of \(\lambda\) is the same as the critical value of \(t^2\) (and hence \(t\)).

Step 3: show that the critical value of \(\lambda\) to reject \(H_0\) is the same as the critical value of the \(t\)-statistic.

Answer This is the final part of Theorem 7.A.2.1
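As a quick numerical illustration of the equivalence, the sketch below computes \(\lambda\) directly and via the \(t\) statistic on simulated data (the data and \(\mu_0\) are arbitrary choices):

```r
set.seed(1)
y <- rnorm(20, mean = 1, sd = 2)   # simulated data
n <- length(y)
mu0 <- 0

# likelihood ratio, computed directly from the sums of squares
lambda <- (sum((y - mean(y))^2) / sum((y - mu0)^2))^(n/2)

# the same quantity via the t statistic
t_stat <- (mean(y) - mu0) / (sd(y)/sqrt(n))
c(lambda, (1 + t_stat^2/(n - 1))^(-n/2))   # identical
```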

Inference about the Variance, \(\sigma^2\)

Inference about the variance, \(\sigma^2\), is less common, but does occasionally occur. And more complex modelling of variances happens in a few fields, e.g. in quantitative genetics (a field basically invented by Fisher, but only properly developed later in Edinburgh and Birmingham. These two schools of work later had to re-write their textbooks so everyone knew what they were talking about).

We will need a couple of facts:

  1. The unbiased estimator of the variance is

\[ S^2 = \frac{1}{n-1}\sum_{i=1}^n(Y_i - \bar{Y})^2 \]

Note that this is not actually the maximum likelihood estimator, which is biased. Also note that the square root of this, \(S\), is not an unbiased estimator of the standard deviation.

  2. The ratio

\[ \frac{(n-1)S^2}{\sigma^2} = \frac{1}{\sigma^2}\sum_{i=1}^n(Y_i-\bar{Y})^2 \]

follows a \(\chi_{n-1}^2\) distribution. This makes hypothesis testing and calculation of confidence intervals straightforward.

Confidence Interval for \(\sigma^2\)

We can derive a confidence interval for \(\sigma^2\) from \((n-1)S^2/\sigma^2\), because we know it follows a \(\chi^2\) distribution, so

\[ Pr \left[\chi^2_{\alpha/2, n-1} \le \frac{(n-1)S^2}{\sigma^2} \le \chi^2_{1-\alpha/2, n-1} \right] = 1-\alpha \]

This is like a confidence interval for a normal (or t!) distribution: the statistic has a \(1-\alpha\) (e.g. 95%) probability of being within the limits. But here the limits are set by the \(\chi^2\) distribution, i.e. by \(\chi^2_{\alpha/2, n-1}\) and \(\chi^2_{1-\alpha/2, n-1}\).

Exercise: derive the confidence interval for the variance

Answer This is theorem 7.5.1.

These values are like the 1.96 we use for a 95% confidence interval for the normal distribution, but now they are different for the upper and lower limits. We can get them from statistical tables (see §7.5), or nowadays from software, e.g. in R (which is the statistical package I used to create this document) I would use qchisq(c(0.025, 0.975), n-1). Python, Julia and other modern programming languages have similar functions.
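Putting the pieces together, a minimal R sketch of the 95% interval for \(\sigma^2\) (the data are simulated for illustration):

```r
set.seed(1)
y <- rnorm(20, mean = 0, sd = 3)   # simulated data
n <- length(y)
alpha <- 0.05

# ( (n-1)S^2 / upper chi-squared quantile , (n-1)S^2 / lower chi-squared quantile )
(n - 1) * var(y) / qchisq(c(1 - alpha/2, alpha/2), df = n - 1)
```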

Also note that the interval is equal-tailed, i.e. it excludes the same amount of probability (\(\alpha/2\)) at each end of the distribution. This is convenient, but it means the interval is not the shortest possible.

Tests when \(X\) is not normal

This is discussed around Figure 7.4.5 (from p 400). The bottom line is that a t-test is usually robust, i.e. it usually works well even if the data are not normally distributed. A larger sample size helps, as does the actual distribution not being too far from a normal distribution.

In more complex data analysis this is an important issue: the problem is not just whether the data meet the assumptions of the model (because they never do), rather it’s how and how badly they deviate from the assumptions, and what are the effects of the deviations. But as this is largely a problem of applied statistics, rather than mathematical statistics, we will give it less attention than perhaps we should.


  1. it is best in that (a) it is the shortest possible confidence interval, (b) all points in the interval have a higher density than those outside, and (c) it has equal amounts of probability mass outside each end of the interval.↩︎

  2. the difference from the usual estimate of a variance is that the m.l.e. uses \(\frac{1}{n}\), not \(\frac{1}{n-1}\). But we all use \(\frac{1}{n-1}\) because it gives an unbiased estimator of the variance in finite samples.↩︎