Hints and reminders are in bold

Questions appear in blue.

Some hints and answers are hidden away: click here. (Hello! I might be the answer. Or a hint.)

Recap

Last week we continued looking at regression models. We assume we have data with pairs \((x_i, y_i)\), and a model \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\), with \(\varepsilon_i \sim N(0,\sigma^2)\).

We found that we can fit this model by maximum likelihood, getting

\[ \begin{align} \hat{\beta}_0 &=\bar{y} - \hat{\beta}_1 \bar{x} \\ \hat{\beta}_1&=\frac{Cov(X,Y)}{Var(X)} = \frac{\sum_{i=1}^nx_iy_i - n\bar{x} \bar{y}}{\sum_{i=1}^nx_i^2 - n \bar{x}^2} = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2} \\ \hat{\sigma}^2&=\frac{1}{n}{\sum_{i=1}^n(y_i - \hat{\beta}_0-\hat{\beta}_1 x_i)^2} \\ \hat{y}_i &= \hat{\beta}_0 + \hat{\beta}_1 x_i \end{align} \]

We found that the estimates \(\hat{\beta}_0\), \(\hat{\beta}_1\) and \(\hat{y}_i\) all lead to t-distributed quantities once standardised (the last line is the corresponding result for predicting a new observation at \(x_{new}\)):

\[ \begin{align} \frac{\hat{\beta}_0 - \beta_0}{\sqrt{\frac{s^2 \sum{x_i^2}}{n\sum(x_i-\bar{x})^2}}} &\sim t_{n-2}\\ \frac{\hat{\beta}_1 - \beta_1}{s/\sqrt{\sum(x_i-\bar{x})^2}} &\sim t_{n-2} \\ \frac{\hat{y}_i - (\beta_0+\beta_1 x_i)}{s \sqrt{\frac{1}{n} + \frac{ (x_i-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} &\sim t_{n-2} \\ \frac{y_{new} - \hat{y}_{new}}{s \sqrt{1 + \frac{1}{n} + \frac{ (x_{new}-\bar{x})^2}{\sum(x_i-\bar{x})^2}}} &\sim t_{n-2} \end{align} \]

We can calculate these for the women’s 100m winning times data that we have been using. In reality, statistics packages have functions to do this, but here is the code in R for the calculations. It should be similar in other languages (Python, Fortran, Pascal, etc. etc.).

# Download the data from the web
Data100m <- 
  read.csv("https://www.math.ntnu.no/emner/ST1201/2022h/Week05/OlympicTimes.csv")

# Calculate some statistics 
n <- length(Data100m$Year) # number of data points
Xbar <- mean(Data100m$Year)
Ybar <- mean(Data100m$WomenTimes)
Xc <- Data100m$Year - Xbar # centre the covariate, so it has mean 0
Yc <- Data100m$WomenTimes - Ybar
XY <- Xc*Yc # (X_i-barX)(Y_i-barY)

# Estimate the parameters
beta1hat <- sum(XY)/sum(Xc^2)
beta0hat <- Ybar - beta1hat*Xbar
Yhat <- beta0hat + beta1hat*Data100m$Year # Fitted values
sigma2 <- sum((Data100m$WomenTimes - Yhat)^2)/(n-2) # s^2: note the n-2 denominator, not the n of the MLE
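
If we want confidence intervals as well, they follow directly from the t distributions above. A minimal sketch, using the quantities we have just calculated (remember that sigma2 in the code is \(s^2\), with the \(n-2\) denominator):

SEbeta0 <- sqrt(sigma2*sum(Data100m$Year^2)/(n*sum(Xc^2))) # standard error of beta0hat
SEbeta1 <- sqrt(sigma2/sum(Xc^2))                          # standard error of beta1hat
tcrit <- qt(0.975, df = n - 2)    # 97.5% point of the t distribution with n-2 df

beta0hat + c(-1, 1)*tcrit*SEbeta0 # 95% confidence interval for beta0
beta1hat + c(-1, 1)*tcrit*SEbeta1 # 95% confidence interval for beta1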

How Good is our Model?

Now that we have a model, we can estimate the parameters and predictions, along with intervals to summarise our uncertainty in them. But is the model actually any good?

There are two interpretations of “how good is our model?”. One is asking if it explains a lot of variation in the data: a bad model explains almost nothing, a good one can explain a lot. The other interpretation is “does it match the assumptions?”. If it does, it is a good model (in this sense). If it does not, it may still work well, or it may be a horrible disaster.

\(R^2\): How much variation does the model explain?

One important question about a model is how much of the variation in the data it explains. If it explains 97% of the variation, then it is probably a very good model. If it only explains 4%, it is not so impressive.

Luckily for linear models this is straightforward to calculate and interpret. We can start by looking at the variance in the data:

\[ Var(Y) = Var(\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_1^2 Var(x_i) + Var(\varepsilon_i) = \beta_1^2 Var(x_i) + \sigma^2 \]

The amount of variation explained by the \(x_i\)s is obviously \(\beta_1^2 Var(x_i)\), so we can express that as a proportion of the total variance.

How do we calculate this for the data? We can reduce it to sums of squares, which we can write down as:

\[ \begin{align} SS_{Tot} &= \sum_{i=1}^n(y_i - \bar{y})^2 \\ SS_{TR} &= \sum_{i=1}^n(\hat{y_i}- \bar{y})^2 \\ SS_{E} &= \sum_{i=1}^n(y_i - \hat{y_i})^2 \\ SS_{X} &= \sum_{i=1}^n(x_i - \bar{x})^2 \end{align} \]

  • \(SS_{Tot}\) is the total sum of squares: \(Var(Y) = SS_{Tot}/(n-1)\)
  • \(SS_{TR}\) is the treatment sum of squares: it represents the variation in the predicted \(y_i\)s, because \(Var(\hat{Y})= SS_{TR}/(n-1)\).
  • \(SS_{E}\) is the error sum of squares, i.e. the unexplained variation: \(s^2 = SS_{E}/(n-2)\)
  • \(SS_{X}\) is the sum of squares of the covariate: \(Var(X) = SS_{X}/(n-1)\).

Last week you were given an exercise to prove that \(\sum_{i=1}^n(Y_i - \bar{Y})^2 = \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2\). I will assume you have done this, so you accept that this is the same as \(SS_{Tot} = SS_{TR} + SS_{E}\).

Then we can define \(R^2\) as

\[ R^2 = \frac{SS_{TR}}{SS_{Tot}} = 1 - \frac{SS_{E}}{SS_{Tot}} \]

And this is the proportion of the variance in the data explained by the fitted line (the “treatment”: the reason for this terminology is historical).

Why is this called \(R^2\)? Because it’s the square of \(r\), the correlation coefficient. As you might know (and if not, we will cover it next week) the correlation coefficient is \(Cov(X,Y)/\sqrt{Var(X)Var(Y)}\). For a sample of data we can calculate it as

\[ r = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

Show that \(R^2=Cor(X,Y)^2\)

Hint
  • Note the expression for \(\hat{\beta}_1\) above: it can be calculated in a few ways, and this is the nicest here
  • you need to write out \(\hat{y}\) and \(\hat{\beta}_0\) in terms of \(\hat{\beta}_1\)
Answer

\[ R^2 = \frac{SS_{TR}}{SS_{Tot}} = \frac{\sum_{i=1}^n(\hat{y_i}- \bar{y})^2}{\sum_{i=1}^n(y_i - \bar{y})^2} \\ \]

Now, looking at the numerator we have

\[ \begin{align} SS_{TR} &= \sum_{i=1}^n(\hat{y_i}- \bar{y})^2 \\ &= \sum_{i=1}^n(\hat{\beta}_0 + \hat{\beta}_1 x_i - \bar{y})^2 \\ &= \sum_{i=1}^n(\bar{y} - \hat{\beta}_1 \bar{x} + \hat{\beta}_1 x_i - \bar{y})^2 \\ &= \hat{\beta}_1^2 \sum_{i=1}^n(x_i - \bar{x})^2 \end{align} \]

Now, \[ \hat{\beta}_1= \frac{\sum_{i=1}^nx_iy_i - n\bar{x} \bar{y}}{\sum_{i=1}^nx_i^2 - n \bar{x}^2} = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \]

So

\[ \begin{align} SS_{TR} &= \hat{\beta}_1^2 \sum_{i=1}^n(x_i - \bar{x})^2 \\ &= \left(\frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)^2 \sum_{i=1}^n(x_i - \bar{x})^2 \\ &= \frac{\left(\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \end{align} \]

OK, back to \(R^2\):

\[ \begin{align} R^2 = \frac{SS_{TR}}{SS_{Tot}} &= \frac{\sum_{i=1}^n(\hat{y_i}- \bar{y})^2}{\sum_{i=1}^n(y_i - \bar{y})^2} \\ &= \frac{\left(\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \bigg/ \sum_{i=1}^n(y_i - \bar{y})^2 \\ &= \frac{\left(\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n(y_i - \bar{y})^2} \end{align} \]

And this is the square of \(r\), as defined above.

The purpose of \(R^2\) is to summarise how well the model is doing: like a lot of statistics it can be over-interpreted.

We can calculate \(R^2\) for the women’s 100m data, using the statistics we calculated above:

# Calculate sums of squares
SST <- sum(Yc^2) # = sum((Data100m$WomenTimes - Ybar)^2)
SSTr <- sum((Yhat-Ybar)^2)
SSE <- sum((Data100m$WomenTimes - Yhat)^2)

(Rsquared <- SSTr/SST)
## [1] 0.7161751
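
As a quick check, this matches the squared sample correlation between year and winning time:

cor(Data100m$Year, Data100m$WomenTimes)^2 # should equal Rsquared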

So year explains just over 70% of the variation in the winning times, which is quite good for only one variable. But it still may not be a good model, because it might not fit the assumptions of the model, and these deviations might be important enough to make a difference.¹

Residuals

The strategy that is often taken to check the model is to check that the random parts (i.e. the estimates of \(\varepsilon_i\)) are actually random, and fit the distributional assumptions. There are a few ways to check this, but visual checks are often the most helpful. These checks are based on the residuals, the estimates of \(\varepsilon_i\).

We can estimate \(\varepsilon_i\) with \(e_i = y_i - \hat{y}_i\). If our model is good, the residuals should look like random draws from a normal distribution. If they do not look like this (and there are lots of ways they might not), it suggests that one or more of the model's assumptions is violated.
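
In R this is a single line, using the fitted values we calculated earlier (calling the vector Residuals is just a naming choice):

Residuals <- Data100m$WomenTimes - Yhat # e_i = y_i - yhat_i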

It is worth recapping the assumptions of the model:

  1. \(E(y_i) = \beta_0 + \beta_1 x_i\)
  2. \(\varepsilon_i \sim N(0, \sigma^2)\)

Or, to break it down a bit:

  1. The model is linear in \(x\)
  2. The errors (\(\varepsilon_i\)s) are normally distributed
  3. The errors have the same variance
  4. The errors are independent

Although there are tests for these, they are generally examined graphically, because this gives the analyst much more information. Independence is often not checked: usually it will be clear whether it might be violated (e.g. if observations are made in space, observations closer together might be more similar). Then we usually fit a more complex model to check whether this is a problem.

To see how residuals can be used, we can look at our residuals from the model for women’s 100m times. There are a few ways to do this, so here we will plot them against year:
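
A plot like this can be made with something like the following sketch, using the Residuals vector from above:

plot(Data100m$Year, Residuals, xlab = "Year", ylab = "Residual")
abline(h = 0, lty = 2) # the residuals should scatter evenly around zero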

We can see that the first and last few residuals are positive, and the ones in the middle are negative. This suggests that the relationship should not be a straight line, and some sort of curve would be better.

There are also two years (1960 and 1988) with lower values. Both of these were quicker because they were wind assisted.

Overall, this suggests that the relationship with time might not be linear: the improvements may be getting less over time (if you cannot see why this is so, work it out! You might find it helpful to draw some plots).

Residual plots like these are useful for seeing non-linearities and outliers (i.e. odd points a long way from the fitted line). But they are less good for looking at the assumption of normality.
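
A standard visual check of normality is a normal Q-Q plot of the residuals: if the points lie close to a straight line, the normality assumption looks reasonable. A minimal sketch:

qqnorm(Residuals) # sample quantiles of the residuals against normal quantiles
qqline(Residuals) # reference line: the points should lie close to it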

What to do with Bad Models

Models can be bad for a lot of reasons, so there are also many possible ways to deal with them. Sometimes the solution is as simple as correcting a typo. At other times the relationship between \(X\) and \(E(Y)\) is not linear, so we need to change this relationship. But sometimes the deviations from the model may not matter: they may have little effect on the modelled relationship. Here we are heavily into the area of statistics where mathematics does not help: any judgement should mainly depend on the context of the problem. For example, Case Study 11.2.3 is a nice example where the residual plot is needed to see that one data point has a much higher value than expected, and looks like it could be because of the person collecting the data. But does the difference matter? Just from looking at Figure 11.2.5, it is impossible to see any effect. But the residual plot shows it is there. I can’t say whether it is important or not, simply because I am not a chemist, so I have no idea how accurate they would expect or need the experiments to be.

If we think a deviation from a model does matter, we should try to correct this. Often the deviation is that the relationship is not a straight line (see Case Study 11.2.2 for an example). In this situation, we should try to find a better curve, i.e. a better relationship between \(X\) and \(E(Y)\). There are many different curves that could be used, and several different approaches. We can look at some of the effects of these transformations by first simulating some data. The left data set is linear; the right data set has an exponential transformation, and you can see that the relationship is curved.

We can fit a linear model to each of these, and look at the residuals:

We can see that the transformation affects not only the linearity but also the variance: in the data on the right the variance increases with the mean. There are a few possible solutions to these problems, and the best solution will depend on a few factors.
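
The exact simulation behind these figures is not shown here, but a minimal sketch of something similar is:

set.seed(42)
x <- seq(1, 10, length.out = 50)
y.lin <- 2 + 0.5*x + rnorm(50, sd = 0.5)      # linear data
y.exp <- exp(2 + 0.5*x + rnorm(50, sd = 0.5)) # exponentially transformed data

mod.lin <- lm(y.lin ~ x)  # fit a straight line to each data set
mod.exp <- lm(y.exp ~ x)
par(mfrow = c(1, 2))      # residual plots side by side
plot(x, resid(mod.lin), main = "Linear data"); abline(h = 0, lty = 2)
plot(x, resid(mod.exp), main = "Exponential data"); abline(h = 0, lty = 2)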

Sort-of Non-linear Curves

We can transform the data, so we fit the model corresponding to

\[ f(y_i) = a + b x_i + e_i \]

using some function \(f()\). This is the same as \(y_i = f^{-1}(a + b x_i + e_i)\). I have described this as sort-of non-linear because the equation used in the model fitting is linear. But the back-transformation is non-linear.

Exponential Regression

Rather than changing linearly, data can change exponentially. For example, if we have a population of size \(N_0\) at time 0, and each individual produces offspring at a rate \(\lambda\) then at time \(t\) the number of individuals is \(N_t = N_0 e^{\lambda t}\). This is exponential growth: it is a general model for organisms (including epidemics) at low densities.

We can transform this model:

\[ \log(N_t) = \log(N_0) + \lambda t \]

So if we want to find the growth rate, \(\lambda\), we can fit this model and the slope is \(\lambda\).

The full model, with the error, is \(\log(N_t) = \log(N_0) + \lambda t + \varepsilon_i\), or

\[ N_t = N_0 e^{\lambda t} e^{\varepsilon_i} \]

so the effect of the error, \(\varepsilon_i\), is multiplicative.
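
As a sketch of how this could be fitted (with made-up counts, simulated below), we log the counts and regress on time:

Time <- 0:20
N <- 50*exp(0.12*Time)*exp(rnorm(length(Time), sd = 0.1)) # growth with multiplicative error
ExpMod <- lm(log(N) ~ Time) # linear regression on the log scale
coef(ExpMod)                # intercept estimates log(N_0), slope estimates lambda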

We can see the effect on the two simulated data sets:

The residuals of right hand side data look good now, as might be expected because we have back-transformed the transformed data. The left hand side data set now has a curve in it, and also the variance decreases as \(x_i\) increases.

Other non-linear models

L&M describe a few similar models:

  • log-log relationship \(\log y_i = a + b \log x_i\)
  • logistic curve \(\log {\frac{L-y_i}{y_i}} = a + b x_i\) (L&M call this “logistic regression”, but that term is usually used for a slightly different model)

and list some others in Table 11.2.9.

Sort-of linear Curves

Rather than transforming \(Y\), we can also transform \(X\). For example we could use

\[ y_i = \beta_0 + \beta_1 \log(x_i) + \varepsilon_i \]

Obviously we fit this by transforming the \(x_i\)s and then applying our regression equations.
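
With the simulated data from the sketch above, this amounts to:

logx.lin <- lm(y.lin ~ log(x)) # regress the linear data on log(x)
logx.exp <- lm(y.exp ~ log(x)) # regress the exponential data on log(x)
par(mfrow = c(1, 2))
plot(log(x), resid(logx.lin), main = "Linear data"); abline(h = 0, lty = 2)
plot(log(x), resid(logx.exp), main = "Exponential data"); abline(h = 0, lty = 2)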

We can see that for these data this does not work: the left-hand data now has a curve. The right-hand data has the curvature fixed, but not the change in variance.

Of course, other transformations could be used.

log-log Regression

In exponential regression, the effects are multiplicative. Sometimes everything is multiplicative: \(y_i = a x_i^b \varepsilon_i\). This leads to log-transforming both sides:

\(\log(y_i) = \log(a) + b\log(x_i) + \log(\varepsilon_i)\)

So we can log-transform both \(y_i\) and \(x_i\), and use linear regression.
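
Again as a sketch, using the exponential simulated data from above:

loglog <- lm(log(y.exp) ~ log(x)) # log-log regression
coef(loglog)                      # intercept estimates log(a), slope estimates b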

Here the transformation messes things up. But often it does work well.

Exercises

For these residual plots, what do you think is wrong, and what could you do about them?

For all of these, a model \(y_i = \beta_0+\beta_1 x_i +\varepsilon_i\) was fitted. So if it helps, try to work out what the data look like.

Fitted model: \(y_i = 0.27 + 0.99 x_i +\varepsilon_i\)

Answer These are largely fine, but there is an outlier. This could be removed if there is a good reason for it being odd, or it might be a typo.

Fitted model: \(y_i = 0.51 - 0.07 x_i +\varepsilon_i\)

Answer Yep, this looks a mess. In fact, this is a logistic curve. So try using \(\ln{\frac{L-y}{y}}=\beta_0 + \beta_1 x_i\). You can try to estimate \(L\) by plotting the data.
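
As a sketch of how this could look, with made-up logistic-shaped data and \(L\) guessed by eye as a value just above the largest observation:

xl <- seq(-3, 3, length.out = 40)
yl <- 1/(1 + exp(1 - 2*xl))*exp(rnorm(40, sd = 0.05)) # logistic curve, multiplicative noise
L.guess <- 1.02*max(yl)                # eyeball L as a bit above the largest y
coef(lm(log((L.guess - yl)/yl) ~ xl))  # slope and intercept of the linearised curve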

Fitted model: \(y_i = 0.71 + 0.17 x_i +\varepsilon_i\)

Answer The variance decreases with the mean. It also might be a bit curved, so taking \(e^y\) could be a good idea. But this is probably the most difficult problem.

Fitted model: \(y_i = 3.84 + 0.81 x_i +\varepsilon_i\)

Answer These are fine: don’t touch them!

Summary

Residual analysis is very much part of the Dark Arts of statistics. The methods that are used have a mathematical basis, but actually understanding what residuals are doing is as much about intuition as about mathematics. Much of the fun of actually doing statistics comes from trying to understand what the data are trying to tell us, and poking around at residuals is one way of doing this.


  1. This can totally change the results. A few years ago some researchers claimed that hurricanes with female names caused more deaths than those with male names. Their model was more complicated, but if they had done what we are looking at here, they would have found out that their model was poor, and a simple fix led to a much better model, but no effect of the gender of the hurricane's name.