Recap

Last week we started looking at regression models. We assume we have data with pairs \((x_i, y_i)\), and a model \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\), with \(\varepsilon_i \sim N(0,\sigma^2)\).

We found that we can fit this model by maximum likelihood, getting

\[ \begin{align} \hat{\beta}_0 &=\bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1&=\frac{Cov(X,Y)}{Var(X)} = \frac{\sum_{i=1}^nx_iy_i - n\bar{x} \bar{y}}{\sum_{i=1}^nx_i^2 - n \bar{x}^2}\\ \hat{\sigma}^2&=\frac{1}{n}{\sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2} \end{align} \] We found that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) were both normally distributed (Theorem 11.3.2):

\[ \begin{align} \hat{\beta}_0 &\sim N \left(\beta_0, \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \right] \right)\\ \hat{\beta}_1 &\sim N \left(\beta_1,\frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\right) \\ n \frac{\hat{\sigma}^2}{\sigma^2} &\sim \chi^2_{n-2} \end{align} \]

These are their distributions depending on the (unknown) true values of the parameters.
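
As a quick sanity check, these estimators are easy to compute directly. Here is a minimal sketch in R, on simulated data (the vectors x and y, and all values below, are made up for illustration):

set.seed(1)
x <- 1:20                                # made-up covariate
y <- 3 + 0.5*x + rnorm(20, sd = 1)       # simulate from a known model
n  <- length(x)
b1 <- (sum(x*y) - n*mean(x)*mean(y)) / (sum(x^2) - n*mean(x)^2)
b0 <- mean(y) - b1*mean(x)
sigma2.hat <- sum((y - b0 - b1*x)^2)/n   # ML estimate: note the divisor n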

This module will look at how to use these results when analysing data. Specifically we will see:

  • how to make inferences about the parameters without knowing the true values
  • how to make inferences about data (e.g. predicting new values)
  • how to assess whether the model is good, and some ways to improve it.

First, we will look at the model in a bit more detail: this will give us some terminology, and help us develop methods to check that the fitted model is a good one.

The parts of the regression model

From the model we can write the data as

\[ y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i = \mu_i + e_i \]

We call \(\mu_i\) the fitted value: we can also write it as \(\hat{y}_i\). We call \(e_i\) the residual (the residual is the bit left over after you’ve done everything else).

We can think that the model is made up of a systematic part (\(\mu_i\)) and a random part (\(e_i\)). Thus the \(e_i\)s should not have any structure: if they do, that should be in the systematic part of the model, so we should re-think our model.

One way of looking at least squares is that it minimises the variance of the residuals (as it minimises \(\sum_{i=1}^n(y_i-\mu_i)^2\)). Conversely, it means we are maximising the variance explained by \(\mu_i\). Thus, although we are usually not directly interested in \(\sigma^2\), indirectly the model fitting can be seen as focussed on making it as small as possible.

Inferences about the parameters: testing and confidence intervals

Inferences about \(\sigma^2\)

First we will look at \(\sigma^2\). We are not usually directly interested in this parameter, but \(\hat{\beta}_0\) and \(\hat{\beta}_1\) both depend on \(\sigma^2\), so we need to deal with it anyway. Plus, when we start to look at ANOVA, later in the course, we will see that the results here will be lurking.

We know that the maximum likelihood estimator is \(\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2\). We also know that \(n \frac{\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}\) (Theorem 11.3.3).

Thus, the unbiased estimator for \(\sigma^2\) is

\[ s^2 = \frac{1}{n-2}\sum_{i=1}^n(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2 \]

Show that \(s^2=\frac{n}{n-2} \hat{\sigma}^2\) is an unbiased estimator for \(\sigma^2\)

Hint If \(X \sim \chi^2_k\), \(E(X)=k\)
Answer See corollary 11.3.1. In short: since \(n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}\), we have \(E(n\hat{\sigma}^2/\sigma^2)=n-2\), so \(E(\hat{\sigma}^2)=\frac{n-2}{n}\sigma^2\), and hence \(E(s^2)=E\left(\frac{n}{n-2}\hat{\sigma}^2\right)=\sigma^2\).

If you are in the weird situation that you want to make an inference about \(\sigma^2\), how do you do it? We know the estimator (see above), so what is its confidence interval?

What is the confidence interval for \(\sigma^2\)

Hint Use the fact that \((n-2)s^2/\sigma^2 \sim \chi^2_{n-2}\), and the quantiles of the \(\chi^2_{n-2}\) distribution
Answer

See p555:

\[ \left[ \frac{(n-2) s^2}{\chi^2_{1-\alpha/2, n-2}}, \frac{(n-2) s^2}{\chi^2_{\alpha/2, n-2}} \right] \] where \(\chi^2_{\alpha, k}\) is the \(\alpha\) quantile of a \(\chi^2_k\) distribution.
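
Should you ever need it, this is a couple of lines in R (a sketch: the values of s2 and n here are placeholders):

n  <- 19
s2 <- 0.04                              # placeholder estimate
alpha <- 0.05
c((n-2)*s2/qchisq(1 - alpha/2, df = n-2),
  (n-2)*s2/qchisq(alpha/2, df = n-2))   # the 95% confidence interval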

Inferences about \(\beta_1\)

Now we will look at \(\beta_1\), the slope. This is usually the parameter that we are interested in, e.g. how much faster are women sprinters running every year? We want to know not just what the estimate is, but also the variation around it.

Because \(\hat{\beta}_1 \sim N \left(\beta_1,\frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\right)\), we know that

\[ Z = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/\sum_{i=1}^n(x_i - \bar{x})^2}} \] follows a standard normal distribution. We also know that \((n-2) \frac{s^2}{\sigma^2} \sim \chi^2_{n-2}\), so we can replace the unknown \(\sigma^2\) with its estimate \(s^2\).

What is the distribution of \(T = \frac{\hat{\beta}_1 - \beta_1}{\sqrt{s^2/\sum_{i=1}^n(x_i - \bar{x})^2}}\)? And how would you use this to test if \(\beta_1=0\)? And how would you calculate the confidence intervals?

Hint This is a re-use of ideas from Chapter 7, which we covered in Week 2. We will be using these ideas a lot in this module, so if you’re unsure, reread them, once you’ve found the right part.
Answer

It’s a t distribution: see 11.3.4. For the test, see 11.3.5. And for the confidence intervals see Theorem 11.3.6.

At some point this will (or already has) become repetitive: one t-test is like all the others. It’s just a question of what you put into it. But fear not, in a couple of weeks we will look at ANOVA, where everything is an F test. Actually, the test here is also an F test, if you do a bit of work on it (start from Theorem 7.3.4).
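
In R the whole test is a few lines. A sketch, continuing the simulated example from the recap above (so b0, b1, x, y and n are assumed to exist):

s2     <- sum((y - b0 - b1*x)^2)/(n - 2)       # unbiased variance estimate
se.b1  <- sqrt(s2/sum((x - mean(x))^2))        # standard error of the slope
t.stat <- b1/se.b1                             # tests H0: beta1 = 0
p.val  <- 2*pt(-abs(t.stat), df = n - 2)
ci     <- b1 + c(-1, 1)*qt(0.975, df = n - 2)*se.b1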

Inferences about \(\beta_0\)

It is rare that we want to make inferences about \(\beta_0\) on its own (although comparing \(\beta_0\)s is sometimes done). But just in case…

What statistic would you use to calculate confidence intervals, and test if \(\beta_0=b\)?

Hint You can follow the same approach used for \(\beta_1\)
Answer

It’s a t etc. etc.

See p554-555.

This is mainly to drive home the point that everything’s a t distribution, unless it’s an F distribution. Whilst this might be boring, it’s useful because it makes life a lot simpler.

Inferences about Data

So far we have looked at the parameters, but we can also be interested in the data, for example if our prime purpose is prediction. There are two problems we need to consider: predicting \(Y_{new}\), and estimating \(E(Y_{new})\):

  • \(Y_{new}\) is some “new” data, e.g. we might want to predict the winning time for the women’s 100m for 2024 (something that can make you a lot of money as a statistician. Not betting, but working for the betting companies).
  • \(E(Y_{new})\) is the expected value. This is not certain, because the parameters are not certain. It can be useful if we think that most of our uncertainty, i.e. the variance \(\sigma^2\), is because of observation error, and the process is largely deterministic. Then \(E(Y_{new})\) would be a better estimate of the underlying state.

The point estimates are the same, but they have different variances.

Why do \(Y_{new}\) and \(E(Y_{new})\) have different variances?

Answer

From the model we have \(y_{new} = \beta_0 + \beta_1 x_{new} + \varepsilon_{new}\), so \(E(y_{new}) = \beta_0 + \beta_1 x_{new}\). As \(y_{new}\) has the extra term, \(\varepsilon_{new}\), it has extra variance due to this.

If we look back at the figure above, \(E(Y_{new})\) is a point that lies along the fitted line, so any variance is due to the fitted line. \(Y_{new}\) is an actual data point, so there is extra variance because the data differ from the line.

Inferring \(E(Y_{new})\)

So what is the expected value for a new observation? The obvious estimate is \(\hat{Y}= \hat{\beta}_0 + \hat{\beta}_1 x_{new}\). This is unbiased, from theorem 11.3.2:

\[ E(\hat{Y})= E(\hat{\beta}_0) + E(\hat{\beta}_1) x_{new} = \beta_0 + \beta_1 x_{new} \]

But what is its sampling distribution?

What distribution does \(\hat{Y}=E(Y_{new})\) follow?

Answer The random variables in \(\hat{\beta}_0 + \hat{\beta}_1x_{new}\) are \(\hat{\beta}_0\) and \(\hat{\beta}_1\), and a sum of normal random variables is also normally distributed, so \(\hat{Y}\) follows a normal distribution.

What is the variance of \(\hat{Y}=E(Y_{new})\)?

Hint \(Var(\hat{Y}) = Var(\hat{\beta}_0 + \hat{\beta}_1 x_{new})\) also, use \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}\), and the variances of the parameter estimates
Answer

p557… \[ \begin{align} Var(\hat{Y}) &= Var(\hat{\beta}_0 + \hat{\beta}_1 x_{new}) \\ &= Var(\bar{Y} - \hat{\beta}_1 \bar{x} + x_{new} \hat{\beta}_1) \\ &= Var(\bar{Y} + \hat{\beta}_1 (x_{new}-\bar{x})) \\ &= Var(\bar{Y}) + Var(\hat{\beta}_1)(x_{new}-\bar{x})^2 \end{align} \] (the last step uses the fact that \(\bar{Y}\) and \(\hat{\beta}_1\) are uncorrelated). Now we can plug in \(Var(\bar{Y})\) and \(Var(\hat{\beta}_1)\). \(Var(\bar{Y}) = Var(Y)/n=\sigma^2/n\), and \(Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\):

\[ \begin{align} Var(\hat{Y}) &= Var(\bar{Y}) + Var(\hat{\beta}_1)(x_{new}-\bar{x})^2 \\ &= \frac{\sigma^2}{n} + \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}(x_{new}-\bar{x})^2 \\ &= \sigma^2 \left[\frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \right] \end{align} \]

The variance of the predictor is a function of the unknown variance \(\sigma^2\), so if we want a confidence interval we need to plug in an estimate of this, which will affect the sampling distribution of \(\hat{Y}\).

How should we calculate an \(\alpha\)% confidence interval for \(\hat{Y}=E(Y_{new})\)?

Hint See above, for \(\beta_1\) etc.
Answer Theorem 11.3.7. Yeah, it’s another t distribution

Inferring \(Y_{new}\)

So what is the predictive distribution for a new observation? As you might guess, it’s going to be a t distribution. But what t distribution?

The obvious point prediction is \(\hat{Y}= \hat{\beta}_0 + \hat{\beta}_1 x_{new}\), as above, and this is unbiased (because \(\hat{Y}\) is unbiased for \(E(Y_{new})\), and \(E(\varepsilon_{new})=0\)). But what is its sampling distribution?

By definition of the model, \(Y_{new}\) follows a normal distribution (the model is \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\), with \(\varepsilon_i \sim N(0, \sigma^2)\)).

What is the variance of \(Y_{new}\)?

Answer

p559…

\[ \begin{align} Var(Y_{new}) &= Var(\hat{\beta}_0 + \hat{\beta}_1 x_{new} + \varepsilon_{new}) \\ &= Var(\hat{Y}) +\sigma^2\\ &= \sigma^2 \left[\frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \right] + \sigma^2 \\ &= \sigma^2 \left[1 + \frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \right] \end{align} \]

Notice that the difference from the variance of \(\hat{Y}\) is the extra \(\sigma^2\), which comes from predicting the actual value, not just its expected value.

The random variables in \(\hat{\beta}_0 + \hat{\beta}_1x_{new} + \varepsilon_{new}\) are all normally distributed, so \(Y_{new}\) is also normally distributed.

How should we calculate an \(\alpha\)% confidence interval for \(Y_{new}\)?

Hint See above
Answer Theorem 11.3.8.

We can plot these for the women’s Olympic 100m times (we will look at the calculations in more detail further down):
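
(The plot can be reproduced with something like the R sketch below; it assumes a data frame Data with columns Year and WomenTimes, the names used in the code further down.)

fit  <- lm(WomenTimes ~ Year, data = Data)
grid <- data.frame(Year = seq(min(Data$Year), max(Data$Year), by = 1))
ci   <- predict(fit, newdata = grid, interval = "confidence")
pred <- predict(fit, newdata = grid, interval = "prediction")
plot(Data$Year, Data$WomenTimes, xlab = "Year", ylab = "Winning time (s)")
lines(grid$Year, ci[, "fit"])                                      # fitted line
lines(grid$Year, ci[, "lwr"], lty = 2); lines(grid$Year, ci[, "upr"], lty = 2)
lines(grid$Year, pred[, "lwr"], lty = 3); lines(grid$Year, pred[, "upr"], lty = 3)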

Obviously the prediction interval for \(y_{new}\) is wider. Also notice that the intervals are wider when \(x - \bar{x}\) is larger. This comes from the \((x - \bar{x})^2 Var(\hat{\beta}_1)\) term in the variances. It can be traced back to the uncertainty in the slope. Because the line has to go through \((\bar{x}, \bar{y})\), the uncertainty at \(\bar{x}\) is just uncertainty in \(\bar{y}\). Away from that, the uncertainty in the slope also has an effect, and this is larger the further from \(\bar{x}\) you are: the error in the estimate of \(\beta_1\) is \(\hat{\beta}_1-\beta_1\), so the error in a prediction is \((\hat{\beta}_1-\beta_1)\times(x_{new}-\bar{x})\).

Inference for the Women’s 100m

We have everything we need to look at the Women’s 100m winning times. The original purpose of collecting the data was to compare the changes in men’s and women’s times, which we will look at later. For now, we will do the calculations for the women’s times.

Estimating the model parameters

Step one is to fit the model to the data, which is the same as estimating the parameters of the model. These are our estimators:

\[ \begin{align} \hat{\beta}_0 &=\bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1&=\frac{Cov(X,Y)}{Var(X)} = \frac{\sum_{i=1}^nx_iy_i - n\bar{x} \bar{y}}{\sum_{i=1}^nx_i^2 - n \bar{x}^2}\\ \hat{\sigma}^2&=\frac{1}{n-2}{\sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2} \end{align} \] (note that for \(\sigma^2\) we use the unbiased estimator with divisor \(n-2\), i.e. the \(s^2\) from above, rather than the maximum likelihood estimator).

And last week we got \(\bar{x}=\) 1984, \(\bar{y}=\) 11.02, \(Cov(X, Y)=\) -6.58, \(Var(X)=\) 506.67.

From these we can calculate the parameters:

  • \(\hat{\beta}_1=\) -6.58/506.67 = -0.013
  • \(\hat{\beta}_0=\) 11.02 - (-0.013) \(\times\) 1984 = 36.78.

It is a (little) bit more difficult to calculate the estimate of \(\sigma^2\). We can expand the expression above into terms involving the parameters and \(\sum y^2\), \(\sum x^2\), \(\sum xy\), \(\bar{x}\) and \(\bar{y}\). Or we can use a computer:

b0 <- 36.78; b1 <- -0.013                   # parameter estimates from above
Fitted <- b0 + b1*Data$Year                 # fitted values
SumSq <- sum((Data$WomenTimes - Fitted)^2)  # residual sum of squares

To get \(\sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2=\) 0.61. So \(\hat{\sigma}^2=\) 0.61/(19-2)= 0.036.

The full model is thus

\(y_i =\) 36.78 - 0.013 \(x_i + \varepsilon_i\)

\(\varepsilon_i \sim N(0, 0.036)\)

Estimating Confidence Intervals

We will do this for \(\hat{\beta}_1\), as this is usually the most interesting.

From above, we know that \(\frac{\hat{\beta}_1 - \beta_1}{s/\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}\) follows a \(t_{n-2}\) distribution, so the \(\alpha\)% confidence interval is

\[ \hat{\beta}_1 \pm t_{(1-\alpha)/2,n-2}\sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}}. \]

We have \(\hat{\beta}_1\), \(\hat{\sigma}^2\), and can calculate \(\sum_{i=1}^n(x_i-\bar{x})^2 = \sum_{i=1}^nx_i^2-n\bar{x}^2 =\) 9120 (yes, I did use my computer for this). We can look up \(t_{(1-0.95)/2,19-2}=\) -2.11.

Plugging these in gives us \(-0.013 \pm 2.11\sqrt{\frac{0.036}{9120}} = -0.013 \pm\) 0.0042 = (-0.0172, -0.0088). So we can be confident that the slope is negative, with winning times getting quicker by between 9ms and 17ms every year.
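
(As a check, R’s built-in functions give the same interval — again assuming the Data frame used above:)

fit <- lm(WomenTimes ~ Year, data = Data)
confint(fit, "Year", level = 0.95)   # confidence interval for the slope
qt(0.025, df = 17)                   # the t quantile used above, -2.11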

Calculating Predictions

The next Olympic 100m sprints will be in 2024 in Paris, so we want to predict the winning time, and summarise our certainty about this with a confidence interval.

The point prediction1 for the winning time is:

\[ \hat{y}_{2024} = \hat{\beta}_0 + \hat{\beta}_1 \times2024 = 36.78-0.013\times2024 = 10.50 \] For the prediction interval we want to calculate

\[ \begin{align} Var(y_{2024}) &= \hat{\sigma}^2 \left[1 + \frac{1}{n} + \frac{(x_{new}-\bar{x})^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \right] \\ &=0.036 \times \left[1 + \frac{1}{19} + \frac{(2024-1984)^2}{9120} \right] \\ &=0.036 \times \left[\frac{19 + 1 + 10/3}{19} \right] \\ &= 0.044 \end{align} \]

The point of writing the bracket over a common denominator is just to see that most of the variance comes from the uncertainty in the actual time, rather than in the parameters, i.e. \(Var(\hat{y}_{2024})\) is relatively small.

Anyway, from this we can calculate the 95% prediction interval:

\[ \hat{y}_{2024} \pm t_{(1-\alpha)/2,n-2} \sqrt{Var(y_{2024})} = 10.50 \pm 2.11\times\sqrt{0.044} \]

So the interval is (10.06, 10.94). And we are suggesting that, if the model is correct, there is a 95% probability that the winning time will be in this interval. Or to be exact, if the race was run lots of times, in 95% of them the winning time would be in the interval.
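
(The same prediction interval drops straight out of predict(), reusing the fit object from the sketch above:)

predict(fit, newdata = data.frame(Year = 2024),
        interval = "prediction", level = 0.95)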

Inference for the Men’s 100m

Now you can do the same calculations for the men’s 100m.

Estimating the model parameters

Step one is to fit the model to the data, which is the same as estimating the parameters of the model. We will need the following statistics:

  • \(n=\) 19
  • \(\bar{x}=\) 1984
  • \(\bar{y}=\) 10.01
  • \(Cov(X, Y)=\) -4.64
  • \(Var(X)=\) 506.67
  • \(\sum_{i=1}^n(x_i-\bar{x})^2 =\) 9120
  • \(t_{(1-\alpha)/2,n-2} =\) -2.11
  • \(\sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2=\) 0.24

Some of these are the same as we have before, of course.

What are \(\hat{\beta}_0\) and \(\hat{\beta}_1\) for the men’s data?

Answer
  • \(\hat{\beta}_1= \frac{Cov(X,Y)}{Var(X)} =\) -4.64/506.67 = -0.0092
  • \(\hat{\beta}_0=\bar{y} - \hat{\beta}_1 \bar{x} =\) 10.01 - (-0.0092) \(\times\) 1984 = 28.169.

What is \(\hat{\sigma}^2\) for the men’s data?

Answer

\(\hat{\sigma}^2=\frac{1}{n-2}{\sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2} =\) 0.24/(19-2)= 0.014.

What is the 95% confidence interval for \(\hat{\beta}_1\) for the men’s data?

Answer

\[ \hat{\beta}_1 \pm t_{(1-\alpha)/2,n-2}\sqrt{\frac{\hat{\sigma}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}}. \]

which is -0.0092 \(\pm\) 2.11 \(\times (\) 0.014/9120\()^{0.5}\)

Plugging these in gives us \(-0.0092 \pm 2.11\sqrt{\frac{0.014}{9120}} = -0.0092 \pm\) 0.0027 = (-0.0118, -0.0065). So we can be confident that the slope is negative, with winning times getting quicker by between 6ms and 12ms every year.

How does the slope compare to the estimated slope for the women’s times?

We will look at this question more formally later; for the moment I want you to interpret the estimates: what do they mean to a sports fan who is not interested in just the numbers? You may also want to look at the plot of the data, from last week’s module.

Answer The slope is steeper for the women’s times, so they are getting quicker at a faster rate. They are also (at the moment) slower, so the gap between men and women is getting smaller.

What is the predicted winning time for the men’s 100m at the Paris Olympics, in 2024? And what is the 95% prediction interval?

Answer

The prediction is

\[ \hat{y}_{2024} = \hat{\beta}_0 + \hat{\beta}_1 \times2024 = 28.17-0.0092\times2024 = 9.55 \] (note that this is calculated using the rounded numbers above, in particular a slope of 0.0092. With a value of 0.009 the estimate is 9.95, a result so slow even a British sprinter could run faster than that. So, if you get slightly different numbers, it is probably the rounding, plus having values of \(X\) that are a long way from 0)

For the prediction interval we need the prediction variance:

\(Var(y_{2024}) = \hat{\sigma}^2 \left[1 + \frac{1}{19} + \frac{40^2}{9120}\right]=\) 0.017.

\[ \hat{y}_{2024} \pm t_{(1-\alpha)/2,n-2} \sqrt{Var(y_{2024})} = 9.55 \pm 2.11\times\sqrt{0.017} \]

So the interval is (9.27, 9.83).

Difference between slopes

We now have two separate models, one for men’s times and one for the women’s times. The specific reason for the interest in these data was that they suggested that at some point women may be running faster than men (and, as one wag pointed out, they would eventually be running in negative time). Estimating when is a bit beyond this course - not because it is technically difficult, but because it would need quite a bit of boring computing. So we will ask the slightly easier question - are the slopes actually different?

This is actually straightforward, and inevitably leads to a t distribution. Assume the slopes are \(\beta_1^a\) and \(\beta_1^b\) for the two data sets, which have \(n\) and \(m\) observations respectively. The difference in slopes is \(\beta_1^a-\beta_1^b\); we know that \(\hat{\beta}_1^a - \hat{\beta}_1^b\) follows a normal distribution (with a variance that depends on the unknown \(\sigma^2_a\) and \(\sigma^2_b\)). So we just need to work out the mean and variance.

What is the mean of \(\hat{\beta}_1^a - \hat{\beta}_1^b\)?

Answer

\({\beta}_1^a - {\beta}_1^b\)

Essentially, because both estimators are unbiased the linear combination of them is too.

What is the variance of \(\hat{\beta}_1^a - \hat{\beta}_1^b\)?

Hint
  1. This is the variance of a linear combination of independent estimators
  2. \(Var(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n(x_i - \bar{x})^2}\)
Answer

\[ \begin{align} \sigma^2_p = Var(\hat{\beta}_1^a - \hat{\beta}_1^b) &= Var(\hat{\beta}_1^a) + Var(\hat{\beta}_1^b) \\ &= \frac{\sigma_a^2}{\sum_{i=1}^n(x^a_i - \bar{x}_a)^2} + \frac{\sigma_b^2}{\sum_{i=1}^m(x^b_i - \bar{x}_b)^2} \end{align} \] This reduces to the result in Theorem 11.3.9 if you assume the variances are equal2.

Now we have the mean and variance, how would we calculate an \(\alpha\)% confidence interval?

Hint: yep, it’s a t distribution. It has \(n+m-4\) degrees of freedom (\(m+n\) data points, and we lose 4 because we have to estimate two \(\beta_0\)s and two \(\beta_1\)s)
Answer

The confidence interval is \[ \hat{\beta}^a_1 -\hat{\beta}^b_1 \pm t_{(1-\alpha)/2,n+m-4}\sqrt{\sigma^2_p} \] where

\[ \sigma^2_p = \frac{\sigma_a^2}{\sum_{i=1}^n(x^a_i - \bar{x}_a)^2} + \frac{\sigma_b^2}{\sum_{i=1}^m(x^b_i - \bar{x}_b)^2} \]

As always, you just (!) need to work out the mean and variance of the distribution, and the degrees of freedom, and plug them in. For this problem, of course, you need two means and two variances.
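
A sketch of the whole calculation in R, on simulated data (the vectors xa, ya, xb and yb, and all the numbers, are made up; with real data you would plug in the two data sets):

set.seed(1)
xa <- xb <- seq(1900, 1972, by = 4)                   # same x values in both sets
ya <- 36.8 - 0.013*xa + rnorm(length(xa), sd = 0.2)   # toy data set a
yb <- 28.2 - 0.0092*xb + rnorm(length(xb), sd = 0.1)  # toy data set b
fita <- lm(ya ~ xa); fitb <- lm(yb ~ xb)
d.slopes <- coef(fita)["xa"] - coef(fitb)["xb"]       # difference in slopes
varp <- summary(fita)$sigma^2/sum((xa - mean(xa))^2) +
        summary(fitb)$sigma^2/sum((xb - mean(xb))^2)
dof  <- length(xa) + length(xb) - 4
d.slopes + c(-1, 1)*qt(0.975, dof)*sqrt(varp)         # the confidence interval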

Compare men’s and women’s slopes

For the men’s and women’s 100m Olympic winning times, what is the difference in the slopes, and what is the 95% confidence interval? Is it plausible that they could be improving times at the same rate?

You should have almost all of the numbers calculated above, so you just need to plug in the correct numbers. The only number you don’t have is \(t_{0.025, 2\times19-4}\) = 2.03.

Answer

First, the difference \(\hat{\beta}_w - \hat{\beta}_m = -0.013-(-0.0092)=-0.0038\)

Next, the variance. The \(X\)s are the same, so this is \[ \sigma^2_p = \frac{\sigma_w^2}{\sum_{i=1}^n(x_i - \bar{x})^2} + \frac{\sigma_m^2}{\sum_{i=1}^n(x_i - \bar{x})^2} = \frac{\sigma_w^2 + \sigma_m^2}{\sum_{i=1}^n(x_i - \bar{x})^2} \]

And from above we have \(\sigma_w^2= 0.036\), \(\sigma_m^2= 0.014\), and \(\sum_{i=1}^n(x_i - \bar{x})^2=9120\). So the variance is \(\frac{0.036+0.014}{9120}=\frac{0.050}{9120}=5.5\times10^{-6}\)

And of course we have \(t_{0.025, 2\times19-4}\) = 2.03.

Putting these together, the confidence interval is \(-0.0038 \pm 2.03 \times \sqrt{5.5\times10^{-6}} = -0.0038 \pm 0.0048 = (-0.0086, 0.0010)\). Rounding might make a slight difference.

So our best estimate is that the gap is closing by (on average) about 4 ms per year, but the confidence interval includes 0, so the difference is not significantly different from zero: it is plausible that men’s and women’s times are improving at the same rate.
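
(A quick numerical check of this in R, using the rounded values from above:)

d    <- -0.013 - (-0.0092)                     # -0.0038
varp <- (0.036 + 0.014)/9120
d + c(-1, 1)*qt(0.975, df = 34)*sqrt(varp)     # approx (-0.0086, 0.0010)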

Summary

We have gone through a lot of the actual problems that statisticians are given. Although the simple linear regression is, well, simple, it is actually used quite often, and a lot of extensions follow a very similar path, including the use of the t distribution.

One potential problem with this model is that it makes quite a few assumptions, which may or may not be reasonable. For example, it assumes that the relationships are linear. Some of these assumptions may not be met, and indeed might be so critical that they can lead to nonsense conclusions. In the next module we will look at this in more detail: some approaches to checking if the assumptions are met, and at least one approach to correcting the model.


  1. We use point estimate and point prediction to describe an estimate (or prediction) that is a single value, e.g. \(\hat{y}\) is a point prediction, \(\hat{\beta}_0\) is a point estimate. Confidence intervals are interval estimates.↩︎

  2. Seriously, I had to restrain myself from loudly swearing when I realised that they were quietly making that assumption.↩︎