Regression

This week we will start to look at modelling data. This builds on the work we have already done with the normal distribution (so expect to see more F, t, and \(\chi^2\) distributions), and will also help set up work that will follow.

We will use some real data to illustrate the problem. These data are the official fastest times for men and women for the 100m sprint (running) from 1948 to 2004, at the Olympic games:

The data are here, if you want to play with them.

The purpose of the original analysis of these data was to suggest that women will eventually run more quickly than men. We will not look at that precise question, but will initially only look at the women's times.

The problem here is sometimes called “simple linear regression”. More complicated linear regressions are also used. The method of fitting a line to data was first developed in France¹, using least squares, but most people now treat it as an example of maximum likelihood estimation.

This is how Stephen Stigler² describes least squares:

The method of least squares is the automobile of modern statistical analysis: despite its limitations, occasional accidents, and incidental pollution, this method and its numerous variations, extensions, and related conveyances carry the bulk of statistical analysis, and are known and valued by nearly all.

This week we will learn what the regression car looks like, and how it works under the hood. We will also give it a quick test drive. Next week we will learn how to drive it in anger, and how to identify some of the limitations and accidents.

Learning outcomes

By the end of this week, you should be able to

  • explain the relationship between maximum likelihood and least squares
  • derive the maximum likelihood estimates for the slope and intercept in a simple linear regression
  • derive the sampling distributions of the slope and intercept in a simple linear regression.

Hints and reminders are in bold

Questions appear in blue.

Some hints and answers are hidden away: click here Hello! I might be the answer. Or a hint.

References to §x.y are to section §x.y of the Larsen and Marx textbook. e.g. §4.3 is the section titled The Normal Distribution.

The Problem

We have pairs of points, \((x_1, y_1), (x_2, y_2), \dots (x_n, y_n)\). We assume that \(x_i\) is fixed, in the sense that we know its precise value: it may be set by an experiment (e.g. if the temperature of a chemical reaction is controlled), or may be observed without error (e.g. the year of the Olympics). We call \(x_i\) the predictor, or a covariate, or a predictor variable.

We call \(y_i\) the response, and assume it is random. This is what we are interested in studying, e.g. the winning times in the 100m, and how they change over time. Or how the speed of a chemical reaction depends on temperature.

We can also look at this problem as one of trying to use \(x_i\) to explain \(y_i\): \(y_i\) is our data, which is what is random and varying, and \(x_i\) is something known which hopefully can be used to predict \(y_i\). We will assume that the expected relationship between \(X\) and \(Y\) is linear, i.e. we want a model \(\mu_i = E(y_i) = a + b x_i\). We will look at some non-linear extensions of this later: suffice it to say that there are a lot of ways of relaxing these assumptions, but most are extensions of the model here.

Historically, this problem was first solved using a method called least squares, minimising \(\sum_{i=1}^n(y_i-\mu_i)^2\). It should be clear that we can use this for any definition of \(\mu_i=f(x_i)\), although non-linear functions can get messy.

So far in this course we have been using maximum likelihood to fit models and estimate parameters. So it would be natural to do this here. Given what we have done so far in this course, it will not surprise you to learn that we use a normal distribution. This means that a lot of the ideas about inference from normal distributions can carry over to this problem (and indeed more complex problems).

The Solution

We assume the following model:

\[ \begin{align} y_i &= \beta_0 + \beta_1 x_i + \varepsilon_i = \mu_i + \varepsilon_i \\ \varepsilon_i &\sim N(0, \sigma^2) \end{align} \]

So \(\varepsilon_i\) is the discrepancy between \(y_i\) and \(\mu_i\). It might help to visualise this in a figure:

\(\varepsilon_i\) is a vector parallel to the y-axis.
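If you like to see things in code as well as figures, here is a minimal simulation sketch (in Python, which is not part of the course materials; the parameter values are made up) showing that each \(y_i\) is its expected value \(\mu_i\) plus a normal error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up parameter values, purely for illustration
beta0, beta1, sigma = 40.0, -0.015, 0.2

x = np.arange(1948, 2008, 4)                 # fixed covariate (e.g. Olympic years)
mu = beta0 + beta1 * x                       # expected values, E(y_i)
eps = rng.normal(0.0, sigma, size=x.size)    # errors, N(0, sigma^2)
y = mu + eps                                 # observed responses

print(np.round(y - mu, 3))                   # the vertical discrepancies: exactly eps
```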

If we are going to maximise this likelihood, we need to write it down first. This involves a bit of typing but is otherwise straightforward:

\[ L = \prod_{i=1}^n {\frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{1}{2} \frac{(y_i - \beta_0-\beta_1 x_i)^2}{\sigma^2}}} \]

We want to find the values of \(\beta_0\), \(\beta_1\) and \(\sigma^2\) that maximise \(L\).

Show that maximising the likelihood for \(\beta_0\) and \(\beta_1\) is equivalent to minimising the sum of squares, \(\sum_{i=1}^n(y_i - \beta_0-\beta_1 x_i)^2\)

Hint Use the log-likelihood
Answer

First we transform to the log likelihood:

\[ \begin{align} \log(L) &= \sum_{i=1}^n \log{\frac{1}{\sqrt{2 \pi \sigma^2}}} - {\frac{1}{2} \sum_{i=1}^n \frac{(y_i - \beta_0-\beta_1 x_i)^2}{\sigma^2}} \\ &= -\frac{n}{2}\log{(2 \pi \sigma^2)} - {\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0-\beta_1 x_i)^2} \end{align} \]

The first term does not depend on \(\beta_0\) or \(\beta_1\), and the second term is a negative multiple of the sum of squares, \(-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \beta_0-\beta_1 x_i)^2\). So maximising the log-likelihood over \(\beta_0\) and \(\beta_1\) is the same as minimising \(\sum_{i=1}^n(y_i - \beta_0-\beta_1 x_i)^2\).

Least squares and maximum likelihood are thus mathematically equivalent. The advantage of viewing this as a likelihood problem is that we can use the likelihood machinery: the estimates of the parameters \(\hat{\beta_0}\), \(\hat{\beta}_1\) and \(\hat{\sigma}^2\) are random variables, so we can talk about their sampling distributions.
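If you want a numerical sanity check of this equivalence, here is a sketch (Python, with toy data I made up; not part of the course materials): minimising the negative log-likelihood numerically gives essentially the same \(\hat{\beta}_0\) and \(\hat{\beta}_1\) as an off-the-shelf least squares fit.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)                     # toy covariate
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)   # toy responses

def negloglik(theta):
    """Negative normal log-likelihood; theta = (beta0, beta1, log sigma)."""
    b0, b1, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)
    resid = y - b0 - b1 * x
    return 0.5 * len(y) * np.log(2.0 * np.pi * sigma2) + 0.5 * np.sum(resid**2) / sigma2

mle = minimize(negloglik, x0=[0.0, 0.0, 0.0], method="Nelder-Mead").x
slope_ls, intercept_ls = np.polyfit(x, y, deg=1)   # least squares fit

print(mle[:2])                   # ML estimates of (beta0, beta1)
print(intercept_ls, slope_ls)    # least squares (intercept, slope): essentially identical
```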

First, of course, we need to find the maximum likelihood estimates.

Finding the maximum likelihood estimates for \(\beta_0\) and \(\beta_1\)

The solution for \(\beta_0\) is a function of \(\beta_1\), and vice versa, so first write them in terms of the other

Show that the maximum likelihood estimates for \(\beta_0\) in terms of \(\beta_1\) is \(\hat{\beta_0}=\bar{y} -\beta_1 \bar{x}\)

Answer for \(\beta_0\) (in terms of \(\beta_1\))

\[ \log(L) = C - {\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0-\beta_1 x_i)^2} \]

We can use the chain rule, with \(a_i = y_i - \beta_0-\beta_1 x_i\), so \(\log(L) = C - {\frac{1}{2\sigma^2} \sum_{i=1}^n a_i^2}\)

\[ \begin{align} \frac{\partial \log(L)}{\partial a_i} &= -{\frac{2}{2\sigma^2} \sum_{i=1}^n a_i} \\ \frac{\partial a_i}{\partial \beta_0} &= -1 \end{align} \] So

\[ \begin{align} \frac{\partial \log(L)}{\partial \beta_0} &= (-1)\left(-{\frac{2}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0-\beta_1 x_i)}\right)=0 \\ &=\frac{1}{\sigma^2} \left(\sum_{i=1}^n y_i - \sum_{i=1}^n\beta_0- \sum_{i=1}^n\beta_1 x_i\right) \\ &= \sum_{i=1}^n y_i - n\beta_0-\beta_1 \sum_{i=1}^nx_i \\ \beta_0 &= \frac{\sum_{i=1}^n y_i -\beta_1 \sum_{i=1}^n x_i}{n} \\ &= \bar{y} -\beta_1 \bar{x} \\ \end{align} \]

Notice what this estimate means: if \(\bar{x}=0\), we would have \(\hat{\beta}_0=\bar{y}\). If \(\bar{x}\ne0\), the estimate “starts” at \(\bar{y}\) and moves the intercept up or down by \(\beta_1 \bar{x}\). This implies that, unless \(\bar{x}=0\), \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are correlated.

Show that the maximum likelihood estimates for \(\beta_1\) in terms of \(\beta_0\) is \(\hat{\beta}_1 = \frac{\sum_{i=1}^n x_i y_i -\beta_0 \sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2}\)

Answer for \(\beta_1\) (in terms of \(\beta_0\))

This proceeds in the same way

\[ \log(L) = C - {\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0-\beta_1 x_i)^2} \]

We can use the chain rule, with (as before) \(a_i = y_i - \beta_0-\beta_1 x_i\), so \(\log(L) = C - {\frac{1}{2\sigma^2} \sum_{i=1}^n a_i^2}\)

\[ \begin{align} \frac{\partial \log(L)}{\partial a_i} &= -{\frac{2}{2\sigma^2} \sum_{i=1}^n a_i} \\ \frac{\partial a_i}{\partial \beta_1} &= -x_i \end{align} \]

So

\[ \begin{align} \frac{\partial \log(L)}{\partial \beta_1} &= \left(-{\frac{2}{2\sigma^2} \sum_{i=1}^n (-x_i)(y_i - \beta_0-\beta_1 x_i)}\right)=0 \\ &=\frac{1}{\sigma^2} \left(\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \beta_0- \sum_{i=1}^n\beta_1 x_i^2\right) \\ &= \sum_{i=1}^n x_i y_i - \beta_0\sum_{i=1}^n x_i -\beta_1 \sum_{i=1}^n x_i^2 \\ \beta_1 &= \frac{\sum_{i=1}^n x_i y_i -\beta_0 \sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2} \end{align} \]

Now we have a pair of linear equations, as we can see if we re-arrange them:

\[ \begin{align} \bar{y} &= \beta_0 + \beta_1 \bar{x} \\ \sum_{i=1}^n x_i y_i &= \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2 \end{align} \]

Thus by Cramer’s rule there is a unique solution (as long as the \(x_i\) are not all equal). So we can plug the equation for \(\beta_0\) into the equation for \(\beta_1\) to find \(\hat{\beta}_1\)
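As an aside, you could also solve this pair of equations numerically. A sketch (Python, toy data; the variable names are mine, not the course's), with the algebraic solution worked through below:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

# The two equations above, written as A @ [beta0, beta1] = b
A = np.array([[1.0,     x.mean()],
              [x.sum(), np.sum(x**2)]])
b = np.array([y.mean(), np.sum(x * y)])

beta0, beta1 = np.linalg.solve(A, b)   # unique solution if the x_i are not all equal
print(beta0, beta1)
print(np.polyfit(x, y, deg=1)[::-1])   # same answer from a library least-squares fit
```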

Find \(\hat{\beta}_1\), the maximum likelihood estimate for \(\beta_1\)

Solution

There are a few ways of writing the solution, so if you didn’t get to the final version, that’s fine.

\[ \begin{align} \hat{\beta_1} &= \frac{\sum_{i=1}^n x_i y_i -\beta_0 \sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2}\\ \hat{\beta_1} \sum_{i=1}^n x_i^2 &= \sum_{i=1}^n x_i y_i - (\bar{y} - \hat{\beta_1} \bar{x}) \sum_{i=1}^n x_i\\ &= \sum_{i=1}^n x_i y_i - \bar{y}\sum_{i=1}^n x_i + \hat{\beta_1} \bar{x} \sum_{i=1}^n x_i\\ \hat{\beta_1} \left(\sum_{i=1}^n x_i^2 -\bar{x}\sum_{i=1}^n x_i \right) &= \sum_{i=1}^n x_i y_i - \bar{y}\sum_{i=1}^n x_i \\ \hat{\beta_1} &= \frac{\sum_{i=1}^n x_i y_i - \bar{y}\sum_{i=1}^n x_i}{\sum_{i=1}^n x_i^2 -\bar{x}\sum_{i=1}^n x_i} \\ &= \frac{\sum_{i=1}^n x_i y_i - \sum_{i=1}^n y_i \sum_{i=1}^n x_i/n}{\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i \right)^2/n} \\ &= \frac{n \sum_{i=1}^n x_i y_i - \sum_{i=1}^n y_i \sum_{i=1}^n x_i}{n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i \right)^2} \\ &= \frac{n \left(\sum_{i=1}^n x_i y_i - n \bar{x}\bar{y} \right)}{n \left(\sum_{i=1}^n x_i^2 - n \bar{x}^2 \right)} = \frac{\sum_{i=1}^n x_i y_i - n \bar{x}\bar{y} }{\sum_{i=1}^n x_i^2 - n \bar{x}^2} \end{align} \]

Note that we can also write this as

\[ \hat{\beta}_1 = \frac{Cov(X,Y)}{Var(X)} \]
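Here is a quick numerical check (Python, toy data; not from the course materials) that the sum-based formula and the covariance/variance form give the same slope. Note that the choice of denominator (\(n\) or \(n-1\)) cancels in the ratio, as long as it is the same for both:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

n = len(x)
# Sum-based formula from the derivation above
beta1_sums = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x**2) - n * x.mean()**2)

# Covariance / variance form; the (n-1 vs n) denominator cancels in the ratio
beta1_cov = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)

print(beta1_sums, beta1_cov)   # identical (up to floating-point rounding)
```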

To get \(\hat{\beta}_0\), we simply plug \(\hat{\beta}_1\) into \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\).

Estimating \(\sigma^2\)

Find \(\hat{\sigma}^2\), the maximum likelihood estimate for \(\sigma^2\)

You might be able to guess what the result will be (or be slightly out), but obviously we still want to derive it formally. To do this you will need the partial derivative of \(\log(L)\) w.r.t. \(\sigma^2\).

\[ \log(L) = -\frac{n}{2}\log{2 \pi} -\frac{n}{2}\log{\sigma^2} - {\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_0-\beta_1 x_i)^2} \]

Hint to help you start Plug \(\beta_0 = \bar{y} -\beta_1 \bar{x}\) into the likelihood first
Solution

First we follow the hint

\[ \begin{align} \log(L) &= -\frac{n}{2}\log{2 \pi} -\frac{n}{2}\log{\sigma^2} - {\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - (\bar{y} - \beta_1 \bar{x}) -\beta_1 x_i)^2} \\ &= -\frac{n}{2}\log{2 \pi} -\frac{n}{2}\log{\sigma^2} - {\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \bar{y} - \beta_1 (x_i - \bar{x}))^2} \end{align} \]

So that

\[ \begin{align} \frac{\partial \log(L)}{\partial \sigma^2} &= -\frac{n}{2}\frac{1}{\sigma^2} - \left(-\frac{1}{(\sigma^2)^2}\right) \frac{1}{2} \sum_{i=1}^n (y_i - \bar{y} - \beta_1 (x_i - \bar{x}))^2 \\ &= -\frac{1}{2 \sigma^2} \left(n -\frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \bar{y} - \beta_1 (x_i - \bar{x}))^2 \right) \end{align} \] Set this to 0, so \(-\frac{1}{2 \sigma^2}\) disappears, and we get

\[ \begin{align} 0 &= n -\frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \bar{y} - \beta_1 (x_i - \bar{x}))^2 \\ \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \bar{y} - \beta_1 (x_i - \bar{x}))^2 &= n \\ \sigma^2 &= \frac{\sum_{i=1}^n (y_i - \bar{y} - \beta_1 (x_i - \bar{x}))^2}{n} \\ \end{align} \] Plugging in \(\hat{\beta}_1\) (and recalling \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\)) gives \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2\), which looks like the usual sample variance, except that the fitted values \(\hat{\beta}_0 + \hat{\beta}_1 x_i\) replace \(\bar{y}\) (and we divide by \(n\) rather than \(n-1\)).
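A small sketch of this estimate in code (Python, toy data; not from the course materials): \(\hat{\sigma}^2\) is the mean of the squared residuals. The version that divides by \(n-2\), which ties in with the \(n-2\) degrees of freedom mentioned later, is shown for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

beta1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # slope estimate
beta0 = y.mean() - beta1 * x.mean()                      # intercept estimate

resid = y - beta0 - beta1 * x
sigma2_mle = np.mean(resid**2)                      # maximum likelihood estimate, RSS / n
sigma2_unbiased = np.sum(resid**2) / (len(y) - 2)   # divide by n - 2 instead

print(sigma2_mle, sigma2_unbiased)
```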

This is great as far as it goes, but we want to know the sampling distributions of the estimates. We will not look at \(\hat{\sigma}^2\) (see Theorem 11.3.3 for the result).

The Sampling Distributions of \(\beta_0\) and \(\beta_1\)

The sampling distributions of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are not too difficult to obtain, with a bit of trickery and juggling sums of squares. The key is to note that the \(x_i\) are considered fixed, so only the \(y_i\) are random. Thus, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are functions of the \(y_i\) and some constants.

Note that the denominator of \(\hat{\beta}_1\) can take a few forms:

\[ n \sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i \right)^2 = n \sum_{i=1}^n x_i^2 - \left(n\bar{x}\right)^2 = n \left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right) = n\sum_{i=1}^n (x_i - \bar{x})^2 \]

Which you can prove for yourselves.
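If you would rather check it numerically first, here is a two-line verification (Python; any numbers will do, these are made up):

```python
import numpy as np

x = np.array([1948., 1952., 1956., 1960., 1964., 1968., 1972.])  # any numbers will do
n = len(x)

lhs = n * np.sum(x**2) - np.sum(x)**2      # n * sum(x^2) - (sum(x))^2
rhs = n * np.sum((x - x.mean())**2)        # n * sum((x - xbar)^2)

print(lhs, rhs)   # the same number
```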

We will start with \(\beta_1\).

\[ \hat{\beta_1} = \frac{n \sum_{i=1}^n x_i y_i - \sum_{i=1}^n y_i \sum_{i=1}^n x_i}{n\sum_{i=1}^n (x_i - \bar{x})^2} \]

Deriving the sampling distribution of \(\beta_1\)

(1) Write \(\hat{\beta_1}\) so that it is a linear function of the \(y_i\)s

Answer \[ \begin{align} \hat{\beta_1} &= \frac{n \sum_{i=1}^n x_i y_i - \sum_{i=1}^n y_i \sum_{i=1}^n x_i}{n\sum_{i=1}^n (x_i - \bar{x})^2} \\ &= \frac{\sum_{i=1}^n y_i \left(n x_i - \sum_{i=1}^n x_i \right)}{n\sum_{i=1}^n (x_i - \bar{x})^2} \\ &= \frac{\sum_{i=1}^n y_i \left(n x_i - n \bar{x} \right)}{n\sum_{i=1}^n (x_i - \bar{x})^2} \\ &= \frac{\sum_{i=1}^n y_i \left(x_i - \bar{x} \right)}{\sum_{i=1}^n (x_i - \bar{x})^2} \end{align} \]
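In other words, \(\hat{\beta}_1 = \sum_i w_i y_i\) with fixed weights \(w_i = (x_i - \bar{x})/\sum_j (x_j - \bar{x})^2\). A quick check of this in code (Python, toy data; not part of the course materials):

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0.0, 10.0, 30)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, x.size)

w = (x - x.mean()) / np.sum((x - x.mean())**2)   # fixed weights: depend only on the x_i
beta1_weighted = np.sum(w * y)                   # beta1-hat as a linear combination of the y_i

print(beta1_weighted)
print(np.polyfit(x, y, deg=1)[0])                # same slope from a standard fit
```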

(2) From this deduce what distribution \(\beta_1\) follows (working out what the parameters of that distribution are will come in a moment)

Answer It’s a linear combination of the \(y_i\)s, which are normally distributed, so \(\hat{\beta}_1\) must also follow a normal distribution.

(3) Show that it is unbiased, i.e. \(E(\hat{\beta_1})=\beta_1\)

Hint

\(E(y_i) = \beta_0 + \beta_1 x_i\)

Answer Theorem 11.3.2b, specifically the top of p548.

(4) Show that the sampling variance is \(Var(\hat{\beta_1})=\frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\)

Hint
  1. Write the denominator as \(\sum_{i=1}^n(x_i - \bar{x})^2\)
  2. \(Var(aX) = a^2 Var(X)\)
  3. Only \(y_i\) is random.
Answer

Theorem 11.3.2c, specifically the middle of p548.
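You can also check this sampling variance by simulation (a Python sketch with made-up values, not from the course materials; the \(x_i\) stay fixed and the \(y_i\) are re-drawn many times):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0.0, 10.0, 30)
beta0, beta1, sigma = 2.0, 0.5, 1.0
sxx = np.sum((x - x.mean())**2)

# Simulate many data sets with the same fixed x, re-estimating the slope each time
slopes = []
for _ in range(10_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)
    slopes.append(np.sum((x - x.mean()) * y) / sxx)

print(np.var(slopes))    # empirical sampling variance of beta1-hat
print(sigma**2 / sxx)    # theoretical value, sigma^2 / sum (x_i - xbar)^2
```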

Deriving the sampling distribution of \(\beta_0\)

This follows the proofs for \(\beta_1\).

(1) Write \(\hat{\beta_0}\) so that it is a linear function of the \(y_i\)s

Answer

We will write \(\sum_{i=1}^n (x_i - \bar{x})^2 = s_x^2\)

\[ \begin{align} \hat{\beta_0} &= \bar{y} - \frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{\sum_{i=1}^n (x_i - \bar{x})^2}\bar{x} \\ &= \frac{\sum_{i=1}^n y_i}{n} - \frac{\sum_{i=1}^n y_i(x_i - \bar{x})}{s_x^2} \frac{\sum_{i=1}^n x_i}{n} \\ &= \frac{s_x^2 \sum_{i=1}^n y_i - n \bar{x}\sum_{i=1}^n y_i(x_i - \bar{x})}{n s_x^2} \\ &= \frac{ \sum_{i=1}^n y_i (s_x^2 - n \bar{x} (x_i - \bar{x}))}{n s_x^2} \end{align} \]

(2) From this deduce what distribution \(\beta_0\) follows (working out what the parameters of that distribution are will come in a moment)

Answer It’s a linear combination of the \(y_i\)s, which are normally distributed, so \(\hat{\beta}_0\) must also follow a normal distribution. I know, you’re shocked.

(3) Show that it is unbiased, i.e. \(E(\hat{\beta_0})=\beta_0\)

Hints
  1. \(E(y_i) = \beta_0 + \beta_1 x_i\)
  2. Note the different ways of writing the denominator above.
Answer

There might be a quicker way to this result, but here we go… \[ \begin{align} E(\hat{\beta_0}) &= \frac{ \sum_{i=1}^n E(y_i) (s_x^2 - n \bar{x} (x_i - \bar{x}))}{n s_x^2} \\ &= \frac{ \sum_{i=1}^n (\beta_0 + \beta_1 x_i) (s_x^2 - n \bar{x} (x_i - \bar{x}))}{n s_x^2} \\ &= \frac{ \sum_{i=1}^n \beta_0(s_x^2 - n \bar{x} (x_i - \bar{x})) + \beta_1 x_i (s_x^2 - n \bar{x} (x_i - \bar{x}))}{n s_x^2} \\ &= \frac{ \sum_{i=1}^n \beta_0 s_x^2 - n \bar{x} \beta_0\sum_{i=1}^n(x_i - \bar{x}) + \sum_{i=1}^n \beta_1 x_i s_x^2 - \sum_{i=1}^n \beta_1 x_i n \bar{x} (x_i - \bar{x})}{n s_x^2} \\ &= \frac{n \beta_0 s_x^2 - 0 + \beta_1 s_x^2 \sum_{i=1}^n x_i - \sum_{i=1}^n \beta_1 x_i n \bar{x} (x_i - \bar{x})}{n s_x^2} \\ &= \frac{n \beta_0 s_x^2 + n \beta_1 s_x^2 \bar{x} - \beta_1 n \bar{x} \sum_{i=1}^n x_i (x_i - \bar{x})}{n s_x^2} \\ \end{align} \]

Before this gets too horrible, note that \(\sum_{i=1}^n x_i (x_i - \bar{x}) = \sum_{i=1}^n x_i^2 -\sum_{i=1}^n x_i \bar{x} =\sum_{i=1}^n x_i^2 -n \bar{x}^2 = \sum_{i=1}^n (x_i-\bar{x})^2 =s^2_x\). Now back to the horrible expression: \[ \begin{align} E(\hat{\beta_0}) &= \frac{n \beta_0 s_x^2 + n \beta_1 s_x^2 \bar{x} - \beta_1 n \bar{x} s_x^2}{n s_x^2} \\ &= \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} \\ &= \beta_0 \end{align} \]

(4) Show that the sampling variance is \(Var(\hat{\beta_0})=\frac{\sigma^2 \sum_{i=1}^n x^2_i}{n\sum_{i=1}^{n}{(x_i - \bar{x})^2}} = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}{(x_i - \bar{x})^2}}\right]\)

This is not too difficult if you work from \(\beta_0 = \bar{y} - \beta_1 \bar{x}\), and note that \(Var(\bar{y}) = Var(\sum y_i/n)\).

Hint
  1. \(Var(aX+bY) = a^2 Var(X) + b^2 Var(Y)\)
  2. Only \(y_i\) is random.
Answer

First, we can break the variance down into components: \(Var(\hat{\beta}_0) = Var(\bar{y} - \hat{\beta}_1 \bar{x}) = Var(\bar{y}) + \bar{x}^2 Var(\hat{\beta}_1)\). (This uses the fact that \(\bar{y}\) and \(\hat{\beta}_1\) are uncorrelated, which follows because \(\sum_{i=1}^n(x_i - \bar{x})=0\).)

We know \(Var(\hat{\beta_1})=\frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\). We can work out \(Var(\bar{y})\): note that \(E(y_i)\) is not constant over \(i\), but \(Var(y_i)=\sigma^2\) (to be strict this should be \(Var(y_i|x_i)\)). So we have:

\(Var(\bar{y}) = Var \left( \frac{\sum_{i=1}^ny_i}{n}\right) = \frac{1}{n^2}\sum_{i=1}^nVar(y_i) =\frac{1}{n^2}n\sigma^2=\frac{\sigma^2}{n}\)

And we can plug these in:

\[ \begin{align} Var(\hat{\beta}_0) &= Var(\bar{y}) + \bar{x}^2 Var(\hat{\beta}_1) \\ &= \frac{\sigma^2}{n} + \bar{x}^2 \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &= \sigma^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right) \\ \end{align} \]
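The same kind of simulation check works for \(\hat{\beta}_0\) (a Python sketch with made-up values, not from the course materials; the \(x_i\) stay fixed throughout):

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.linspace(0.0, 10.0, 30)
beta0, beta1, sigma = 2.0, 0.5, 1.0
sxx = np.sum((x - x.mean())**2)

intercepts = []
for _ in range(10_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, x.size)
    b1 = np.sum((x - x.mean()) * y) / sxx      # slope estimate
    intercepts.append(y.mean() - b1 * x.mean())

print(np.var(intercepts))                             # empirical variance of beta0-hat
print(sigma**2 * (1 / len(x) + x.mean()**2 / sxx))    # theoretical value
```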

The sampling distribution of \(\hat{\sigma}^2\)

We won’t look at this in any detail (sorry!). The result is Theorem 11.3, and essentially says that \(n \frac{\hat{\sigma}^2}{\sigma^2}\) follows a \(\chi^2_{n-2}\) distribution. This might not be too surprising: ever since week 2 we have seen that \(\frac{\hat{\sigma}^2}{\sigma^2}\) follows one gamma distribution or another. The only change here is that the degrees of freedom are \(n-2\). The intuitive reason is, as discussed in the module on goodness of fit, that you lose one degree of freedom for every parameter estimated: here we estimate \(\beta_0\) and \(\beta_1\), hence the \(n-2\).
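If you want to convince yourself of the \(n-2\), here is a simulation sketch (Python, made-up values; not part of the course materials) comparing the empirical mean and variance of \(n\hat{\sigma}^2/\sigma^2\) with those of a \(\chi^2_{n-2}\) distribution:

```python
import numpy as np

rng = np.random.default_rng(9)
x = np.linspace(0.0, 10.0, 30)
n = len(x)
beta0, beta1, sigma = 2.0, 0.5, 1.0
sxx = np.sum((x - x.mean())**2)

vals = []
for _ in range(10_000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, n)
    b1 = np.sum((x - x.mean()) * y) / sxx
    b0 = y.mean() - b1 * x.mean()
    sigma2_hat = np.mean((y - b0 - b1 * x)**2)   # MLE of sigma^2, RSS / n
    vals.append(n * sigma2_hat / sigma**2)

print(np.mean(vals), np.var(vals))   # empirical mean and variance
print(n - 2, 2 * (n - 2))            # a chi-squared(n-2) has mean n-2 and variance 2(n-2)
```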

Women’s Times Data

Now we can see this theory in action…

We can fit this model to the women’s times. So \(x_i\) is the \(i^{th}\) year, and \(y_i\) is the winning time in that year.

We could calculate this by hand, but it is easier to use a computer. So using that, I find

\(\bar{x}=\) 1984

\(\bar{y}=\) 11.02

\(Cov(X, Y)=\) -6.58

\(Var(X)=\) 506.67.

So \(\beta_1=\) -6.58/506.67 = -0.013.

And \(\beta_0=\) 11.02 - (-0.013) \(\times\) 1984 = 36.78.

The full model is thus

\(y_i =\) 36.78 - 0.013 \(x_i + e_i\)
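For completeness, here is a sketch of how such a fit could be computed (Python; the arrays `year` and `time` are placeholders for the real data, which are not reproduced here, and the numbers in the demonstration call are made up):

```python
import numpy as np

def fit_line(year, time):
    """Return (beta0_hat, beta1_hat) for time = beta0 + beta1 * year + error."""
    beta1 = np.cov(year, time, ddof=1)[0, 1] / np.var(year, ddof=1)
    beta0 = time.mean() - beta1 * year.mean()
    return beta0, beta1

# Fake numbers, just to show the call; replace with the real year/time arrays
fake_year = np.array([0.0, 4.0, 8.0, 12.0])
fake_time = np.array([12.0, 11.8, 11.9, 11.6])
print(fit_line(fake_year, fake_time))
```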

What does this mean? Obviously this is a straight line, which we can plot with the data:

The line represents the best fit to the data: if there were no error (i.e. \(e_i=0, \forall i\)), the data would lie exactly on this line. So we call the line the fitted line, and the values of the line at the data points are the fitted values.

Now some easy exercises, about interpreting the model parameters:

Exercise What is the change in expected winning time from one Olympic Games to the next (Olympic Games only occur every 4 years)?

Exercise What, according to the model, would have been the winning time in Year 0?

Exercise Rather than fitting the model \(y_i = a + b x_i\), we could fit the model \(y_i = a^* + b^* (x_i- \bar{x})\). How would the estimates differ?

Answers
  1. The change would be \(4\beta_1 = 4\times (-0.013) =\) -0.052s
  2. In year 0 the winning time would be \(\beta_0=\) 36.78s.
  3. \(y_i = a^* + b^* (x_i - \bar{x}) = a^* - b^* \bar{x} + b^* x_i\), so \(b^*=b\), i.e. the slope does not change, but the intercept changes to \(a = a^* - b^* \bar{x}\).

The point of the second question was to point out that we can predict the winning time 2000 years ago, but hopefully you will think that the number is unlikely to be realistic: I think I might have a chance of running 100m in that time. After I trained a bit.

The third question makes the point that if you are interested in the slope, you can centre the covariate wherever you like: \(\bar{x}\) is often used for convenience.

How can we (simply) transform the data to simplify the calculation (in fact, avoid plugging in \(\hat{\beta}_1\) into \(\bar{y} -\beta_1 \bar{x}\) at all)?

Hint We can transform \(x_i\)
Solution

We can write \(x^c_i=x_i - \bar{x}\). Then, because \(\hat{\beta}_1 = \frac{Cov(X,Y)}{Var(X)}\), subtracting the mean does not affect the slope. But the intercept becomes \(\hat{\beta}^c_0 =\bar{y} -\hat{\beta}_1 \bar{x}^c = \bar{y}\), because \(\bar{x}^c=0\) by definition.

We call this centering (which is why I use a \(c\) superscript). It is often a useful trick. Indeed, standardising the covariate, i.e. calculating \(x_i^s =\frac{x_i-\bar{x}}{s_x}\), where \(s_x\) is the standard deviation of \(X\), is often helpful in more complex problems.
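A short sketch of centering and standardising in code (Python, toy data with made-up parameter values; not part of the course materials), showing that the slope is unaffected by centering, the centred intercept is \(\bar{y}\), and standardising changes the slope's units to "per standard deviation of \(x\)":

```python
import numpy as np

rng = np.random.default_rng(10)
x = np.linspace(1948.0, 2004.0, 15)                  # toy "years"
y = 40.0 - 0.015 * x + rng.normal(0.0, 0.2, x.size)  # toy "times"

def fit(x, y):
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y.mean() - b1 * x.mean(), b1              # (intercept, slope)

xc = x - x.mean()                         # centred covariate
xs = (x - x.mean()) / np.std(x, ddof=1)   # standardised covariate

print(fit(x, y))    # raw scale
print(fit(xc, y))   # intercept is now y-bar; slope unchanged
print(fit(xs, y))   # slope is now per standard deviation of x
```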

Now we have the basic theory, and have seen a simple example. Next week we will look at how to use it, and some ways to check that the model is not horrible.

Footnotes


  1. I have claimed that least squares was invented by the French so they could work out where Paris was. This is almost - but not entirely - false.↩︎

  2. From Chapter 17 of Stigler, S.M. (1999) Statistics on the Table. If you fancy dipping into the history of statistics, this is a good book. It is well researched history, and written by someone who at times is having a lot of fun making it engaging.↩︎