Learning outcomes

This week is largely about revising what you have already learned, and putting it all into what is probably a slightly different context. There is more text here than I expect there to be in the future modules.

We are using the 6th edition of Larsen & Marx’s book An Introduction to Mathematical Statistics and its Applications, so citations will be to that edition. We will refer to sections with §x.y, so (for example) §3.10 is section 10 in Chapter 3, about Order Statistics.

Cover of Larsen & Marx textbook

The modules in this course will have text to read, and some videos. There will also be exercises and questions to answer, along with the answers. These are intended to help you learn the material (rather than just passively receive it), and they can be tackled on your own or in groups. Their purpose is to help you learn the material, not to test whether you know it. There are separate exercises to test how well you are learning the material: they will help you find out how much of the week’s material you have understood, and what you need help with. It is the job of the instructors to help you, of course.

Some of the questions have hints. These are there so you can try to answer the question on your own, but if you get stuck, have a look. You can also ask the instructors if the hints don’t help. The answers are also supplied, so you can see if you are right.

Questions appear in blue.

Hints and reminders are in bold.

Some hints and answers are hidden away: click here. Hello! I might be the answer. Or a hint. Hopefully it is obvious which.

Introduction

This module gets us up to speed with the inferential methods we will use. In other words, if we are to make inferences about a problem (say, whether the gender of the name given to a hurricane affects how much damage it causes), we need a general set of methods.

This module will outline the approach based on likelihood. Some of this will be revision, but there will also be some new material. By the end of the module, you should be able to

  • explain the procedure for deriving a maximum likelihood estimate
  • derive some (simple) maximum likelihood estimates
  • conduct (simple) likelihood ratio tests
  • explain the frequentist philosophy behind maximum likelihood estimation

Likelihood

The methods we use are all based on the likelihood, so it is good to remind ourselves what this is, and why. The mathematics is covered in §5.2.

Data Generating Model

The approach is based on the idea that we can model our data, to say that it came about according to some process, and there is some randomness in the data. So, for example, if we think about the classical (and rather boring) coin flip, we can think that the data were generated by the flip, so each datum¹ can be Heads or Tails. Our data are a sequence of Hs and Ts, which we can write as 0s and 1s. We will say that 1 corresponds to Heads, so if we have \(N\) trials, \(k\) will be Heads and \(N-k\) will be Tails.

Our model is that each datum has a probability \(p\) of being a 1, and the different data are independent. This means that if we have \(N\) tosses, the number of Heads, \(k\), will follow a binomial distribution (see §3.2 of L&M for a reminder). Thus,

\[ Pr(k|N, p) = \frac{N!}{k! (N-k)!} p^k (1-p)^{N-k} \] If we have the mythical fair coin, we have \(p=0.5\), and we can calculate this probability.

An Example not using a coin

Photo of an inflatable Earth. Not full size

We can also use binomial sampling to find out how much of the earth is land: we can randomly sample points on the earth and see if each point is land or sea. I have simulated this in a previous course using an inflatable globe. When I did this, I sampled 12 times before I got bored, and got Land 2 times.

What is the probability of getting this data if the earth is 50% land?

Hint \[ \frac{12!}{2! 10!} = 66 \]
Answer \[ \frac{12!}{2! 10!} 0.5^2 (1-0.5)^{10} = \frac{66}{2^{12}} = 66\times0.25\times0.0009765625 = 0.016 \]

According to Wikipedia, 29% of the earth is land, if this is true then what is the probability of the data?

Answer \[ \frac{12!}{2! 10!} 0.29^2 (1-0.29)^{10} = 66\times0.0841\times0.03255244 = 0.18 \]

Which proportion seems more likely, 0.5 or 0.29, and why?

Answer

0.29, because the probability of getting the data is about 10 times higher with that probability (it is 11.2 times higher).

Well done: you have just done your first likelihood ratio test, (a) without realising it, and (b) without any formal assessment. An 11.2 times higher likelihood might still be possible by chance.
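If you want to check these calculations on a computer, here is a minimal sketch (assuming Python with scipy; the course does not prescribe any particular software):

```python
# Check the two hand calculations above with scipy's binomial pmf.
# Data: N = 12 samples, k = 2 were Land.
from scipy.stats import binom

N, k = 12, 2
print(binom.pmf(k, N, 0.5))   # ~0.016: probability of the data if 50% is land
print(binom.pmf(k, N, 0.29))  # ~0.18:  probability of the data if 29% is land
print(binom.pmf(k, N, 0.29) / binom.pmf(k, N, 0.5))  # ratio, ~11.2
```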

Estimating Parameters

But what if we don’t know \(p\)? Then we want to estimate it. The way to do this using likelihood is to maximise the likelihood, i.e. treat it as a function of \(p\) and maximise that. Some notation is useful here. If, in general, we have some data, \(Y\), and some parameters, \(\theta\), we write the likelihood as

\[ P(Y|\theta) = L(\theta | Y) \]

Note that the conditioning reverses. The reason is that \(L(\theta | Y)\) is a function of \(\theta\), the parameters, so it’s easier to understand conditioned on the data (which, of course, we know) even if it is actually a probability of \(Y\).

In practice we usually use the log-likelihood:

\[ \ln(P(Y|\theta)) = \ln L(\theta | Y) = l(\theta | Y) \]

It turns out that this (with a natural log) is much more useful, for all sorts of reasons that we will come across throughout the course.

Here is the likelihood as a function of \(p\), for some data where \(N=13\) and \(k=7\). We call this a likelihood curve. Note that it is not a probability distribution.
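The curve can be drawn with a few lines of code (a sketch, assuming Python with numpy, scipy and matplotlib, none of which the course prescribes):

```python
# Plot the binomial likelihood as a function of p, for N = 13, k = 7.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

N, k = 13, 7
p = np.linspace(0.001, 0.999, 500)
L = binom.pmf(k, N, p)        # the pmf, viewed as a function of p

plt.plot(p, L)
plt.axvline(k / N, linestyle="--")  # mark k/N = 7/13 for reference
plt.xlabel("p")
plt.ylabel("L(p | N=13, k=7)")
plt.show()
```

Note that the curve does not integrate to 1 over \(p\), which is one way to see that it is not a probability distribution.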

How do we maximise this? First, a couple of tricks:

  • the value of \(\theta\) that maximises \(L(\theta | Y)\) also maximises \(\log(L(\theta | Y))\)
  • we can ignore any constants.

So we want to maximise \(L(p|N, k) = \frac{N!}{k! (N-k)!} p^k (1-p)^{N-k}\).

Show that the maximum likelihood estimate for \(p\) is \(\hat{p} = \frac{k}{N}\).

Notice that I have given the estimate a hat: \(\hat{p}\). This is to show that it is the maximum likelihood estimate of \(p\) (as opposed to any other value).

I don’t know where to start, please give me a hint

You need to find the maximum of the likelihood curve. This curve is well behaved, so you just need to find where the slope is 0, and solve for \(p\). Also note the hints above: use the log-likelihood and you can ignore any constants.

Of course, you should really check that this is a maximum, but we will skip that step.

Answer

Our likelihood is \[ L(p|N, k) = \frac{N!}{k! (N-k)!} p^k (1-p)^{N-k} \]

So the log-likelihood is

\[ l(p|N, k) = \log \left( \frac{N!}{k! (N-k)!} \right) + k \log(p) + (N-k)\log(1-p) \] This is a function of \(p\), which is what we want to estimate. We differentiate with respect to \(p\). The first term is a constant, so disappears. The second term is easy once we remember that \(\frac{d \log(y)}{dy} = \frac{1}{y}\). The third term uses the chain rule (with \(y=1-p\)):

\[ \frac{d \log(1-p)}{dp} = \frac{d \log(y)}{dy} \frac{d y}{dp} = \frac{1}{y} (-1) = -\frac{1}{1-p} \] And putting these together we get

\[ \frac{d l(p | N, k)}{dp} = \frac{k}{p} - \frac{N-k}{1-p} \]

Now we just set this to 0 and re-arrange to solve for \(p\).

\[ \begin{split} 0 &= \frac{k}{p} - \frac{N-k}{1-p} \\ \frac{N-k}{1-p} &= \frac{k}{p} \\ \frac{p}{1-p} &= \frac{k}{N-k} \end{split} \] We will pause here to note that \(\frac{p}{1-p}\) is the ratio of the probability of successes to failures, and \(\frac{k}{N-k}\) is the ratio of actual successes to failures. \(\frac{p}{1-p}\) is the odds of a success (which is used in gambling a lot). The log of this is called the log odds, and more complicated modelling of probabilities is often done by modelling the log odds.

OK, back to getting the estimate of \(p\). Let’s divide the top and bottom of the left hand side by \(p\) to get

\[ \begin{split} \frac{1}{1/p-1} &= \frac{k}{N-k} \\ \frac{N-k}{k} + 1 &= 1/p \\ \frac{N-k + k}{k} &= 1/p \\ \frac{N}{k} &= 1/p \\ \hat{p} &= \frac{k}{N} \end{split} \] So, to nobody’s surprise, the estimate for the probability of success is the proportion of successes.
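We can also check this result numerically, by maximising the log-likelihood directly (a sketch, assuming Python with scipy):

```python
# Numerical check: minimising the negative log-likelihood over p
# should recover p-hat = k/N.
from scipy.optimize import minimize_scalar
from scipy.stats import binom

N, k = 12, 2
neg_loglik = lambda p: -binom.logpmf(k, N, p)
res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / N)  # both ~0.1667
```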

What is the maximum likelihood estimate of the proportion of the earth that is land, based on our sample of 2 Lands and 10 Seas?

Answer (no, sorry, you don’t get a hint) \[ \frac{2}{12} = \frac{1}{6} = 0.167 \]

Interval Estimation

This was covered in §5.3. Briefly, if we have an estimate, say \(\hat{\theta}\), we also want to give a range of likely values. We call that a confidence interval, typically a 95% confidence interval. The precise definition is covered below, but roughly: there is a 95% probability that the estimate would be in the interval if the data were resampled and the maximum likelihood estimate were correct.

We can get an almost exact confidence interval for the binomial problem above. We can imagine re-sampling the data lots of times, and each time getting an estimate, \(\hat{p}\). These can only take values \(0/N, 1/N, \dots , N/N\). We know that the numerator will follow a binomial distribution, so we can calculate the cumulative distribution function. For example, if we have \(N=9\) trials and \(k=6\), we have \(\hat{p} = 6/9\) and there are 10 possible values:

\(\hat{p}\)   0     0.111 0.222 0.333 0.444 0.556 0.667 0.778 0.889 1
cdf           0     0.001 0.008 0.042 0.145 0.350 0.623 0.857 0.974 1

For a 95% confidence interval we typically take the central 95%, so leave out 2.5% of the probability at each side. From the cumulative distribution function, we can simply read off where the 2.5% and 97.5% points are. But there is no precise 2.5% point, so we are conservative and make the interval wider: it becomes (0.333, 1). This is actually a 99.2% interval, which is a bit too wide.
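The table and the conservative interval can be computed like this (a sketch, assuming Python with numpy and scipy):

```python
# Possible values of p-hat and the binomial cdf, for N = 9 trials,
# with the estimate p-hat = 6/9 treated as the true value.
import numpy as np
from scipy.stats import binom

N, k = 9, 6
ks = np.arange(N + 1)
cdf = binom.cdf(ks, N, k / N)
for k_i, c in zip(ks, cdf):
    print(f"{k_i / N:.3f}  {c:.3f}")

# Excluding p-hat <= 0.222 removes only 0.8% of the probability, and
# nothing can be excluded at the top, so the conservative interval is
# (0.333, 1), with coverage 1 - 0.008 = 0.992.
```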

If \(N\) is large, we can approximate the distribution of the estimator with a normal distribution, and calculate a confidence interval that way. This is because if the estimator \(\hat{\theta} \sim N(\mu_\theta, \sigma^2_\theta)\), where \(\mu_\theta\) and \(\sigma_\theta\) are the estimate and its standard error respectively, then a 95% confidence interval is \(\mu_\theta \pm 1.96\sigma_\theta\).

For the binomial distribution we can get the approximate confidence interval in the following steps (yes, I have split the steps into small ones), by assuming normality:

What is the estimator?
Answer \(k/N\)
What is the random variable in the estimator?
Answer \(k\)
If the random variable is approximately normally distributed, what are its mean and variance? Hence, what are the mean and variance of the estimator?
Answer \[k \sim N(Np, Np(1-p))\] Hence \[\hat{p} \sim N(p, p(1-p)/N)\] (because \(Var(aX) = a^2 Var(X)\))
What is the 95% confidence interval for the binomial distribution?
Answer \[\hat{p} \pm 1.96\sqrt{p(1-p)/N}\]
Exercise: For the Earth data, with 2 Lands and 10 Seas, what is the approximate confidence interval for the proportion of land?
Answer

We have \[\hat{p} \pm 1.96\sqrt{p(1-p)/N}\]

with \(\hat{p} = 1/6\), and \(N=12\), so this is \[ 1/6 \pm 1.96\sqrt{(1/6)(1-1/6)/12} = 0.167 \pm 0.211 = (-0.04, 0.38) \]

You are right: this makes no sense, because a negative proportion is impossible.

Exercise: Suppose we had collected more data, with 20 Lands and 100 Seas. What would the approximate confidence interval for the proportion of land be?

Answer

We have

\[\hat{p} \pm 1.96\sqrt{p(1-p)/N}\]

with \(\hat{p} = 1/6\), and \(N=120\), so this is \[ 1/6 \pm 1.96\sqrt{(1/6)(1-1/6)/120} = 0.167 \pm 0.067 = (0.10, 0.23) \]

This is at least possible, and is close to the exact confidence interval, which is (0.1, 0.22).
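Both approximate intervals come from the same formula, so it is worth wrapping it in a small helper function (a sketch, assuming Python with numpy and scipy; `approx_ci` is a name I have made up):

```python
# Normal-approximation confidence interval for a binomial proportion.
import numpy as np
from scipy.stats import norm

def approx_ci(k, N, level=0.95):
    p_hat = k / N
    z = norm.ppf(1 - (1 - level) / 2)      # 1.96 for a 95% interval
    se = np.sqrt(p_hat * (1 - p_hat) / N)  # standard error of p-hat
    return p_hat - z * se, p_hat + z * se

print(approx_ci(2, 12))    # ~(-0.04, 0.38): nonsensical, as noted above
print(approx_ci(20, 120))  # ~(0.10, 0.23)
```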

This is the mathematics of the approach, but why do we do it this way? And what exactly is \(\hat{p}\)? For that we need some philosophy of statistics. Don’t worry, we will get back to the maths soon enough.

Frequentism

There are two common schools of statistical thought: frequentist and Bayesian. Here we are focussing on the frequentist approach. Not necessarily because it is better or worse, but because the maths is easier (and a lot of the ideas are still useful if you decide that the Bayesian approach is The One True Way).

The frequentist approach is based on the idea that we are working with the likelihood, \(L(\theta|Y) = P(Y|\theta)\). The key insight is that it is \(Y\), the data, that is random: the parameters, \(\theta\), are fixed (if unknown). Thus any probability statements must be about the data. We cannot talk about \(P(\theta)\) because \(\theta\) is not a random variable.

This means that when we ask questions like “what is the most likely value of \(p\)?” we have to be asking questions about the data. We do this by acknowledging that our estimates (e.g. \(\hat{p}=k/N\)) are functions of the data.

So when we talk about uncertainty in estimates, we are talking about uncertainty in the data, and probabilities are reflections of that uncertainty. This leads us on to a formal concept of (frequentist) probabilities: they are long-run proportions, or in other words frequencies. So a frequentist probability of (say) 0.4 means that if the same situation was repeated lots of times, you would see that outcome 40% of the time.

How does this influence the interpretation of our statistical analyses? Because the only thing that is random is the data, any probability statement (e.g. \(\mu>0\)) has to be a statement about the data, or (more commonly) a function of the data.

If, for example, we want to know the proportion of the earth that is water, we can randomly sample 1000 points and ask what proportion are water. Say we get 652 water points. Then the estimand is the actual proportion that is water: according to the US government this is 71%. Our estimator for this is \(k/N\), where \(k\) is the number of points that are water, and \(N\) is the number of trials. Our estimate is 652/1000 = 0.652.

So the maximum likelihood estimate is the parameter value under which the data have the highest probability, if it were the true parameter.

There are a few important terms that tend to get thrown around a lot and refer to slightly different concepts. (the Norwegian is in brackets, using the translation here)

  • statistic (observator) - a number that is a function of the data. e.g. the proportion of times we sample land
  • estimand (estimand) - a parameter in a model we want to estimate, e.g. the actual proportion of the earth that is land
  • estimator (estimator) - a function of the data that is used to calculate an estimate, e.g. \(k/N\)
  • estimate (estimat) - the statistic that is used to estimate an estimand.

The difference between a statistic and an estimate is that an estimate is a statistic that is tied to an estimand. In essence, it is the estimate of something (e.g. the actual proportion of the earth that is land), whereas a statistic is only a number calculated from the data.

Not all estimates are statistics: I could use 42 as an estimate. It would be a bad estimate in most cases, of course, but it would still be an estimate. And it is not a statistic because it is not calculated from the data.

Interval Estimation

But what about confidence intervals? As with any estimate, a confidence interval is a statistic, i.e. a function of the data. Thus we can’t say “there is a 95% chance that the parameter is in this interval”, because that would be a statement about the parameter, which is fixed. Instead, the interval is random, so we say that if the MLE were the true value, and we sampled lots of repeated data sets, 95% of those data sets would have the MLE in their confidence interval.

Because the confidence interval is a function of the data, it can only take values which are functions of the data, which is why for the binomial distribution the intervals are not exact.

See Figure 5.3.2 and the text on the same page for another version of this explanation.

Hypothesis testing

This is covered in Chapter 6 of L&M.

Hypothesis tests are an area of statistics that generates a lot of argument, most of which is more heat than light. The bottom line is that they are over-used and are easily mis-interpreted. Despite this they are still useful, and worth knowing about.

With hypothesis tests we are comparing two hypotheses, which we call \(H_0\) and \(H_1\). \(H_0\) is the null hypothesis, roughly that nothing interesting is going on. \(H_1\) is the alternative hypothesis, that there is something interesting happening.

So, to use the classic (=boring) example of a coin toss, \(H_0\) is the hypothesis that the coin is fair, i.e. \(Pr(H)=0.5\). The alternative hypothesis is that the coin is not fair, i.e. \(Pr(H) \neq 0.5\), i.e. \(Pr(H) \in [0,1]\). Notice that strictly \(H_0\) is nested within \(H_1\). This is typical: \(H_0\) is usually \(H_1\) with one or more parameters fixed (and usually fixed to 0).

This is the process of hypothesis testing:

  1. Choose \(H_0\) and \(H_1\)
  2. Choose a test statistic
  3. Work out what the distribution would be under the null hypothesis
  4. Compare the value of the statistic from the data to the distribution under the null
  5. If the value is sufficiently extreme, declare the null hypothesis to be rejected. Otherwise say it is not rejected.

In order to decide if the value is sufficiently extreme, we have to define “sufficiently extreme”. We do this using the idea of repeated sampling. If the null hypothesis was true and we sampled the same situation many times, then each sample would be random. Our test statistic is a statistic, i.e. a function of the data, so will be a draw from some distribution. If the null hypothesis was true, then the real data would just be another draw from this same distribution. If we chose a good test statistic, then if the null hypothesis is wrong, the test statistic will not look like it came from the distribution under the null hypothesis.

More formally, we can set a value of the test statistic so that the observed statistic (from the data) is “sufficiently extreme” if it is bigger than this value. We call this the critical value. We usually assume that the middle 95% of the distribution is OK, with 2.5% at either extreme being too extreme (more on this below). The figure above is for a one-tailed test, so we still make the 5% too extreme, but now it is in just one tail of the distribution.
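One way to make this concrete is to simulate the test statistic under the null hypothesis and read off the critical values from the simulated distribution. Here is a sketch for the coin-flip example (assuming Python with numpy; \(N=100\) tosses is my choice, not the course’s):

```python
# Simulate the distribution of p-hat under H0: p = 0.5, and take the
# central 95% of the draws as the non-rejection region.
import numpy as np

rng = np.random.default_rng(1)
N = 100
null_draws = rng.binomial(N, 0.5, size=100_000) / N  # p-hat under H0
lower, upper = np.quantile(null_draws, [0.025, 0.975])
print(lower, upper)  # roughly (0.40, 0.60): values outside are "too extreme"
```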

Exercise: For the Earth data, test the hypothesis that the earth is 29% land, based on the data with 2 Lands and 10 Seas sampled.

Choose \(H_0\) and \(H_1\)
Answer

\(H_0\) would be that 29% of the earth is land, so \(p=0.29\)

\(H_1\) would be that the proportion of the earth that is land is not 29%, i.e. \(p \ne 0.29\)
Choose a test statistic
Answer A few could be used, but an obvious one is the estimate of \(p\) (i.e. \(k/N\)). But we could use \(k\) itself.
Work out what the distribution would be under the null hypothesis
Answer

We know that \(k\) follows a binomial distribution, so if we use that as the test statistic, it is easy. If we use \(k/N\) then the distribution looks the same, but we have \(k/N\) on the x-axis. i.e. if \(\pi\) is the probability of success under the null hypothesis and \(P\) is the estimate of \(p\) (i.e. the random variable) then we have:

\[ Pr(P=p | \pi, N) = \frac{N!}{(Np)!(N(1-p))!} \pi^{Np} (1 - \pi)^{N(1-p)} \] If \(N\) is large we can assume \(P\) approximately follows a normal distribution.

Compare the value of the statistic from the data to the distribution under the null

(the key thing here is to write down the equation for the p-value: once you have that you will see that it needs quite a bit of boring calculation)

Answer

Our statistic from the data is 2/12 = 1/6. So we want to know how likely it is to get that value or one more extreme under the null hypothesis. If we had a one-sided test we would want

\[ \sum_{i=0}^2 \frac{N!}{i!(N-i)!} \pi^{i} (1 - \pi)^{N-i} \]

(or an equivalent expression in terms of \(p\)). But as we have a two-sided hypothesis, we should double this, to account for \(k\) being either larger or smaller than expected under \(H_0\).

At this point you could do the calculations, which are a pain. I used the computer to find that the p-value is \(2 \times\) (0.02 + 0.08 + 0.18) = 0.56.
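For the record, this is the sort of computer calculation I mean (a sketch, assuming Python with scipy):

```python
# Two-sided p-value for H0: pi = 0.29, with k = 2 Lands in N = 12:
# twice the lower-tail probability P(k <= 2).
from scipy.stats import binom

p_value = 2 * binom.cdf(2, 12, 0.29)
print(p_value)  # ~0.56
```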

If the value is sufficiently extreme, declare the null hypothesis to be rejected. Otherwise say it is not rejected.
Answer Our p-value is above 0.05, so we cannot reject the null hypothesis: the data do not give us evidence against the hypothesis that 29% of the earth is land.

What is “sufficiently extreme”?

The test is significant if the statistic is sufficiently extreme. This needs to be defined formally. The way this is done is to refer back to probabilities: essentially, the statistic is sufficiently extreme if it is sufficiently unlikely under the null hypothesis. Typically we use a value of 5%, i.e. if there is a probability of <5% of getting a statistic as extreme as the observed one. We call this probability \(\alpha\).

Potentially we could work out a threshold based on a balancing of costs and benefits, but that would be too complex in most cases, so instead a standard of \(\alpha = 0.05\) has been adopted. This originates with R.A. Fisher, who basically thought it was about right, and then it stuck as a convention.

There have been attempts to change this, e.g. to suggest different thresholds such as 0.01, but the convention has stuck. The closest we have got to a change is the more nuanced view that the lower the p-value, the stronger the evidence against the null hypothesis. So we can talk about a value of 0.0001 suggesting there is more evidence against the null.

For the purposes of this course, you do not need to worry about these issues. But be warned.

One and two sided tests

Briefly…

Usually we use two-tailed tests. i.e. we compare \(\mu=0\) with \(\mu \ne 0\). We decide that \(\mu \ne 0\) if the statistic (\(\bar{x}\), say) is far enough away from 0 in either direction, i.e. if \(|\bar{x}|>c\). So we reject \(H_0\) if \(\bar{x}\) is in either tail of the distribution, i.e. if \(\bar{x}>c\) or if \(\bar{x}< -c\) (in fact we can use different upper and lower critical values, but let’s not get too complicated yet).

But sometimes we are interested in one direction only, so we compare \(\mu=0\) with \(\mu > 0\) (or \(\mu < 0\)). Then we only look at \(\bar{x}>c\). This is a one-tailed test, because we only need to look at one tail of the distribution.

For most sorts of problem we use two-tailed tests. The time to use one tailed tests is when we have an explicit hypothesis about the direction of an effect. e.g. if we start spraying fungicide on a fungus, we might expect there to be less fungus afterwards, so we don’t need to bother testing if there is more.

One-sided tests can be controversial, because people have an unfortunate habit of wanting to see “positive” results. With a two-sided test, there is a 2.5% probability of getting a significant result where \(\hat{\mu}>0\) (say). But if we use a one-tailed test, the probability is doubled. So people tend to get worried that using a one-sided test is naughty, and that they will be accused of being naughty².

Someone once suggested “one and a half tailed tests”. What do you think these are, and why were they suggested?

(this is a question for discussion)

Answer

One and a half tailed tests are asymmetric. Rather than have the rejection region being 2.5% on each side of the sampling distribution of the test statistic, we have most of it, say 4.9% on one side, and (say) 0.1% on the other.

The justification for this is that we want to do a one-sided test, but know that we might be utterly wrong, so want to allow for that possibility. For example, we might be testing a drug to see if it reduces mortality. But there is the possibility that it could actually increase mortality (if it has side effects), so we want to allow for that possibility.

In practice, this idea hasn’t been used much, in part because one-tailed tests tend not to be used a lot.

Type I and Type II Errors. And why most tests are silly.

The decisions are probabilistic, so there is a probability that they will be wrong. There are two possible errors:

  • Type I errors: false positives, i.e. significance when \(H_0\) is true. If the test is well designed, this has probability \(\alpha\).
  • Type II errors: false negatives, i.e. non-significance when \(H_1\) is true. The probability of this is called \(\beta\), and is a function of the deviation from \(H_0\): it depends on how far away \(H_1\) is from \(H_0\). The power of the test is \(1-\beta\), the probability of correctly rejecting \(H_0\).

Exercise: calculate power

For the Earth data, assume we have a large data set with 120 observations (so we can assume normality), and \(H_0\) is that \(p=0.29\). If the true value of \(p\) is 0.4, what is the type II error, i.e. the probability that we accept the null hypothesis when it is wrong.

Calculate the critical region for the test that \(H_0\) is true. You can use \(p\) as the test statistic, and assume that it is normally distributed. You can assume that we are testing at the 5% level.

Hint You want the distribution of the test statistic, which means working out its mean and variance, and from that the critical limits.
Answer

We are assuming \(p\) is normally distributed, so under \(H_0\) it has a mean of 0.29, and a variance of \(p(1-p)/N = 0.29 \times (1-0.29)/120 =\) 0.0017.

So \(p|H_0 \sim N(0.29, 0.041^2)\). The critical limits bound the central 95% of this distribution, i.e. \(0.29 \pm 1.96\times0.041 = (0.21, 0.37)\): we reject \(H_0\) if the estimate falls outside this region. (1.96 is the upper 2.5% point of the standard normal distribution.)

For a “true” value of 0.4, calculate the probability that a sample would fall between the critical limits (i.e. that we would not reject \(H_0\)).

Hint You want the distribution of the test statistic, which means working out its mean and variance, and from that the critical limits.
Answer

We want to know the probability that \(\hat{p}\) falls between the critical limits when \(p=0.4\) and \(N=120\). We can work out that \(\hat{p} \sim N(0.4, 0.045^2)\). So we want to know the probability that a draw from this distribution is between the critical values you just calculated (which should be (0.21, 0.37)). We can get this by converting to a standard normal, and looking the values up in a table of z-scores (or plugging them into the appropriate function on your computer). These are

\(z_{upper} = (0.37 - 0.4)/0.045 = -0.67\) and \(z_{lower} = (0.21 - 0.4)/0.045 = -4.22\)

So the probability is \(\Phi(-0.67) -\Phi(-4.22)=\) 0.25 - 0 = 0.25.
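Here is the same calculation done in one go (a sketch, assuming Python with numpy and scipy):

```python
# Type II error: probability that p-hat falls between the critical
# limits (computed under H0: p = 0.29) when the true p is 0.4.
import numpy as np
from scipy.stats import norm

N, p0, p1 = 120, 0.29, 0.4
se0 = np.sqrt(p0 * (1 - p0) / N)                  # sd under H0
lower, upper = p0 - 1.96 * se0, p0 + 1.96 * se0   # ~(0.21, 0.37)

se1 = np.sqrt(p1 * (1 - p1) / N)                  # sd under the true p
beta = norm.cdf((upper - p1) / se1) - norm.cdf((lower - p1) / se1)
print(beta)  # ~0.26 (the hand calculation above, with rounding, gives 0.25)
```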

Power is most useful when calculated for several values, to get a power curve. This is a plot of the true value of the parameter against \(1-\beta\). This is the power curve for our earth data, for the actual data and for a larger data set.

The “true” value is on the x-axis, so the further away it is from the null hypothesis, the higher the probability of rejecting the null hypothesis.

Question: Why is the red line narrower than the black one?

Answer Because the red line is for the case with more data, the uncertainty is lower: the standard error of the estimate of \(p\) is \(\sqrt{p(1-p)/N} \propto 1/\sqrt{N}\). As \(N\) increases, the curve gets narrower.

Question: for the red line (with a large N), what is the minimum value, and what value of \(p\) is it for?

Hint Think about the value of \(p\) first. Once you know that, the value of \(1-\beta\) might be clear.
Answer The minimum is at \(p=0.29\), i.e. when the null hypothesis is true. When the null hypothesis is true, if we are testing at 5% then there should be a 5% probability of rejecting the null hypothesis, so this is the value of \(1-\beta\) at this value of \(p\).

The reasons why the minimum of the black line is different are rather difficult to understand (= I am not sure about them either), but are to do with the fact that this is an approximation.

The use and misuse of tests

How does this relate to the silliness of most tests? Because for most tests nobody would believe \(H_0\). In the example of a “fair” coin, how likely is it that a real coin would have a probability of heads of exactly 0.5, as opposed to (say) \(0.5 + 10^{-37}\)? For almost all tests, we know already that the null is false.

So why do we do it? The fundamental issue is simplicity: if we don’t need a hypothesis, then life is easier. \(H_1\) is usually a more complex hypothesis, so if we can show that we need it, great. If we cannot, then we stick with (simpler) \(H_0\). And now we get a second issue (also covered in §6.6 of L&M). Our decision about whether to reject \(H_0\) or not depends on whether our test statistic is likely under our null hypothesis: if it is sufficiently unlikely, then we say that it is “statistically significant”. But this is not the same as practical significance. If a coin gives us Heads with a probability of 0.7, then this is practically significant. But if it gives a Heads with probability 0.501? Or 0.5001 or…

If we look at significance tests, then the critical region gets smaller as the sample size increases. Thus, we could even argue that if we know that \(H_0\) is false, a significance test is simply testing whether we have enough data to know that it is false.

The good news is that significance tests are not useless. Rather, they should be treated with care. They can give us confidence about our models, and whether they are just modelling random noise. This also means they can be used to simplify models, so that they are not representing what looks like noise.

Likelihood Ratio Tests

As noted above, we can use any reasonable test statistic. If, for example, we are testing whether a mean is zero, it makes sense to use that mean as the test statistic. But we can also think about this more generally: we are comparing models \(H_0\) and \(H_1\), so we could use summaries of the model fit, i.e. the likelihood, as a test statistic. This leads us to the idea of the likelihood ratio test.

The likelihood ratio test is more than just a nice idea: it (a) gives us a general approach to choosing test statistics, and (b) will work when \(H_0\) and \(H_1\) differ in more complex ways, in particular if they differ by more than one parameter.

Here is the basic idea, using the notation in §6.5: the data are \(Y\), the parameters are \(\omega\) for \(H_0\) and \(\Omega\) for \(H_1\).

  • We have a null hypothesis, \(H_0\), which is a model \(P(Y | \omega, H_0) = L(\omega|Y)\)
  • We have an alternative hypothesis, \(H_1\), which is a model \(P(Y | \Omega, H_1) = L(\Omega|Y)\)

The likelihood ratio is

\[ \lambda = \frac{P(Y | \hat{\omega}, H_0)}{P(Y | \hat{\Omega}, H_1)} \]

This is the ratio of probabilities. So it says that the data are \(\lambda^{-1}\) times more likely if \(H_1\) were true than if \(H_0\) were true. Notice that this is a statement about the data (which are random).

I gave \(\omega\) and \(\Omega\) hats to be clear that these are the estimates of the parameters, i.e. we first have to calculate the maximum likelihood estimates, and then given those calculate the likelihoods under \(H_0\) and \(H_1\). In order to do the significance test we also need to know the distribution of \(\lambda\) under \(H_0\). This is something we will cover later in the course for several models.

Exercise: calculate the LRT for z-test

Assume we have \(n\) data points drawn from a normal distribution, i.e. \(x_i \sim N(\mu, \sigma^2)\), where we (mystically) know \(\sigma^2\). We want to test whether the mean equals \(\mu_0\).

Prove that the likelihood ratio statistic is \(\lambda = e^{-\frac{n}{2\sigma^2} \left(\mu_0 - \bar{x} \right)^2}\).

We do this in stages:

What is the likelihood for \(n\) data points drawn from a normal distribution with mean \(\mu\) and variance \(\sigma^2\)?

Answer \(L(\mu, \sigma^2 | X) = \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^n e^{-\sum_{i=1}^n{\frac{( {x_i}-\mu)^2}{2\sigma^2}}}\)

What are the null and alternative hypotheses?

Answer

\(H_0\): \(\mu = \mu_0\). \(H_1\): \(\mu \ne \mu_0\), i.e. \(\mu = \mu_1\), where we need to estimate \(\mu_1\). We can use the maximum likelihood estimate, which is \(\bar{x}\).

Write down the likelihood ratio

(we will simplify it in the next step)

Answer (before we try to simplify it)

First, write the whole ratio down. Initially I will write the means under the hypotheses as \(\mu_0\) and \(\mu_1\). Note that the terms not in the exponent are \(1/\sqrt{2 \pi \sigma^2}\) for both hypotheses, so these cancel.

\[ \lambda = \frac{P(Y | \omega, H_0)}{P(Y | \Omega, H_1)} = \frac{e^{-\sum_{i=1}^n{\frac{( {x_i}-\mu_0)^2}{2\sigma^2}}}} {e^{-\sum_{i=1}^n{\frac{( {x_i}-\mu_1)^2}{2\sigma^2}}}} = e^{-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n{( {x_i}-\mu_0)^2} - \sum_{i=1}^n{( {x_i}-\mu_1)^2} \right)} \]

Now prove that this equals \(\lambda = e^{-\frac{n}{2\sigma^2} \left(\mu_0 - \bar{x} \right)^2}\).

Hints
  • \(\sigma^2\) & \(n\) are constant.
  • work with \(\log{\lambda}\): it just means you have one less thing to keep track of.
  • For \(H_1\), the mean is \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n}{x_i}\) (Example 5.2.5 in L&M). This also means that \(\sum_{i=1}^{n}{x_i} = n \bar{x}\)
  • there is a lot of book-keeping, expanding quadratics and simplifying. Try to keep track of all the terms.
Answer

We can work with \(\log(\lambda)\). We just expand the quadratics, work out what the summations are, and simplify a few things:

\[\begin{align} log(\lambda) &= -\frac{1}{2\sigma^2} \left(\sum_{i=1}^n{( {x_i}-\mu_0)^2} - \sum_{i=1}^n{( {x_i}-\bar{x})^2} \right) \\ &= -\frac{1}{2\sigma^2} \left( \sum_{i=1}^n{x_i^2} - \sum_{i=1}^n{2 x_i \mu_0} + \sum_{i=1}^n{\mu_0^2} -\sum_{i=1}^n{x_i^2} + \sum_{i=1}^n{2x_i \bar{x}} - \sum_{i=1}^n\bar{x}^2 \right) \\ &= -\frac{1}{2\sigma^2} \left( - 2n \bar{x} \mu_0 +n\mu_0^2 + 2 n \bar{x}^2 - n \bar{x}^2 \right) \\ &= -\frac{n}{2\sigma^2} \left(\mu_0^2 - 2\bar{x} \mu_0 + \bar{x}^2 \right) \\ &= -\frac{n}{2\sigma^2} \left(\mu_0 - \bar{x} \right)^2 \end{align}\]

To conduct an actual test we would need to know what the distribution of this is under the null hypothesis. It turns out to follow a \(\chi^2\) distribution, which will crop up repeatedly in this course. So we can leave this here for now.
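If you want to see this in action, here is a small simulation (a sketch, assuming Python with numpy and scipy; the \(\chi^2\) connection is the one just mentioned, with 1 degree of freedom because \(H_0\) fixes one parameter):

```python
# Simulate data under H0 and compute the likelihood ratio statistic
# we just derived; -2 log(lambda) is compared to a chi-squared(1).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, mu0, sigma = 30, 0.0, 1.0
x = rng.normal(mu0, sigma, size=n)   # data generated with H0 true

log_lambda = -n / (2 * sigma**2) * (mu0 - x.mean()) ** 2
p_value = chi2.sf(-2 * log_lambda, df=1)  # upper-tail probability
print(p_value)
```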


  1. datum is the singular of data. One datum, two data points. The grammatical issues get a bit more complicated, and if anyone uses data as a singular, most people won’t care.↩︎

  2. There are so many ways one can be naughty and mis-use statistics. So many that several books have been written on this topic.↩︎