How good is my model?

The 95% confidence interval suggests Rexthor’s dog could also be a cat, or possibly a teapot.


Some Data: Women’s times

Here are the data on the winning times from the women’s 100m sprint at the Olympics (we will compare these to the men’s times later in the course).

Times <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv")
# fit a simple linear regression of winning time on year
WomenMod <- lm(WomenTimes~Year, data=Times)
plot(Times$Year, Times$WomenTimes, lwd=2,
     xlab="Year", ylab="Winning Time")
abline(WomenMod, col=2) # the fitted regression line
# vertical lines from each point to the fitted line: the residuals
segments(Times$Year, fitted(WomenMod), Times$Year, Times$WomenTimes, col="darkblue", lwd=2)

Motivation - is a straight line any good?

Here are some simulated data sets. For all of them I used the same errors, but manipulated the data in different ways. For each one, you should decide whether a straight line is a good description of the data.

How good is my model? A Summary

The simulated data above suggest that simply fitting a straight line is not always the best idea: the data might not follow a straight line, there might be strange values, or there might be other oddities in the data. When we have one X and one Y, plotting them is helpful for seeing a lot of problems, but later we will have several X’s, and these can hide problems. So we need some tools to pick apart the model and the data, to find out how well the model describes the data.

Another View of Regression

We can think of the model as being made up of two parts: the systematic part (which is explained by the covariates, our X’s), and the random part (which is the normally distributed error). For a simple regression the two parts look like this:

\[ \begin{aligned} y_i &= \color{red}{\mu_i} &+ \color{darkblue}{\varepsilon_i} \\ &= \color{red}{\alpha + \beta x_i} &+ \color{darkblue}{\varepsilon_i} \end{aligned} \]

All of the models we will see have this general form, but both parts can be much more complicated. For a good model the systematic part should explain a lot of the variation in the data. The random part of the model should be, well, random. This means there shouldn’t be any structure in it.

The red line is the systematic part of the model, and the dark blue lines are the random part: the residual variation that is left over.
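We can see this decomposition directly in R. For the WomenMod fit from above, the fitted values are the systematic part and the residuals are the random part, and adding them back together reconstructs the observed winning times (a small sanity check, assuming there are no missing times in the data):

# systematic part + random part gives back the observed data
Reconstructed <- fitted(WomenMod) + residuals(WomenMod)
round(Reconstructed, 2)[1:5]
round(Times$WomenTimes, 2)[1:5] # should match the line above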

How much variation does the model explain?

The total variation is

\[ \begin{aligned} \text{Var}(y_i) &= \color{red}{\text{Var} (\alpha + \beta x_i)} + \color{darkblue}{\text{Var} (\varepsilon_i)} \\ &= \color{red}{ \beta^2 \text{Var}(x_i)} + \color{darkblue}{\sigma^2} \end{aligned} \]

\[ R^2 = \frac{\color{red}{\text{Variance Explained}} }{\text{Total Variance}} = 1 - \frac{\color{darkblue}{\text{Residual Variance}} }{\text{Total Variance}} \]

After a bit of maths, we get

\[ R^2 = 1 - \frac{\sum (y_i - \mu_i)^2}{\sum (y_i - \bar{y})^2} \]

where \(\sum (y_i - \mu_i)^2\) is the residual variance (i.e. the sum of squared differences from the expected values), and \(\sum (y_i - \bar{y})^2\) is the total variance (i.e. the sum of squared differences from the grand mean). We usually write \(R^2\) as a percentage, so we multiply it by 100.

An obvious question is: what is a good \(R^2\)? Unfortunately, there is no simple answer. In general, a value of 4% is bad and 99% is good, but the context is important. 30% is usually not great (most of the variation is random), and 70% is usually good. But in the laboratory there should be less variation, so 70% might still not be great (especially if you have designed an experiment where there should be a huge effect).

It might be that this is the best we can do, because sometimes data are just very noisy. But it might be that we can do better. \(R^2\) can’t tell us whether there is a better model, but we have other tools to do that. It is useful, though, in summarising how well our model is explaining the data.

\(R^2\) is easy to extract in R. It is calculated automatically as part of summary(), so we can get it from this:

R2 <- summary(WomenMod)$r.squared
R2
## [1] 0.6723703
round(100*R2, 1)
## [1] 67.2
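We can also calculate it by hand from the formula above, which should give the same answer (a quick check, again assuming no missing values in the response):

# residual sum of squares and total sum of squares
RSS <- sum(residuals(WomenMod)^2)
TSS <- sum((Times$WomenTimes - mean(Times$WomenTimes))^2)
1 - RSS/TSS # should match summary(WomenMod)$r.squared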

Exercise

Exercise: calculate the \(R^2\) for the 8 plots. Do some seem better than others?

You will need to read in the data, and fit the models. Note that x is the same for all y’s except y7. Here is some code to get you started

Data <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week6/SimRegression.csv")
mod1 <- lm(y1 ~ x, data=Data) # This works for y1, ..., y6, y8
mod7 <- lm(y7 ~ x7, data=Data) # for y7 we need to use x7

summary(mod1)$r.squared
## [1] 0.8708701

Regression Assumptions, and how to check them

Regression models make a number of assumptions:

  • the relationship between the response and the covariates is linear
  • the errors are independent of each other
  • the errors have a constant variance
  • the errors are normally distributed
  • the errors have a mean of zero

The final assumption is forced by the maximum likelihood estimation, so we don’t need to worry about it.

Model is systematic part + random part

\[ \begin{aligned} y_i &= \color{red}{\mu_i} &+ \color{darkblue}{\varepsilon_i} \\ &= \color{red}{\alpha + \beta x_i} &+ \color{darkblue}{\varepsilon_i} \end{aligned} \]

If we are to check these assumptions, we need some tools. The most basic is the concept of the residual. The model for the data is \(y_i = \color{red}{\alpha + \beta x_i} + \color{darkblue}{\varepsilon_i}\), and we estimate this with the fitted model:

\[ y_i = \color{red}{\hat{\alpha} + \hat{\beta} x_i} + \color{darkblue}{e_i} \]

\(\hat{\alpha}\) and \(\hat{\beta}\) are the parameter estimates, so \(\hat{\alpha} + \hat{\beta} x_i\) is the prediction for \(y_i\). \(\color{darkblue}{e_i}\) are the residuals: these are estimates of \(\color{darkblue}{\varepsilon_i}\). We sometimes standardise the residuals, to make it easier to interpret their sizes:

\[ t_i = \frac{e_i}{\sqrt{\text{Var}(e_i)}} \]
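In R the standardised residuals can be extracted with the rstandard() function (the denominator R uses is the estimated standard deviation of each residual, which is what the formula above asks for):

# standardised residuals for the women's times model
Women.std <- rstandard(WomenMod)
round(Women.std, 2)[1:5]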

Residuals should have no structure, beyond being normally distributed¹. So we can look for structure to check whether the residuals are behaving as they should.

We can extract residuals in R with the residuals() function. Although we could stare at them, it is more useful to plot them in different ways, for example against the fitted values (\(= \color{red}{\hat{\alpha} + \hat{\beta} x_i}\)), or against covariates that are in the model:

Women.res <- residuals(WomenMod)
round(Women.res, 2)[1:5]
##     1     2     3     4     5 
##  0.36  0.02  0.08 -0.35  0.11
Women.fit <- fitted(WomenMod)
round(Women.fit, 2)[1:5]
##     1     2     3     4     5 
## 11.54 11.48 11.42 11.35 11.29
par(mfrow=c(1,2))
plot(Women.fit, Women.res, main = "Plot against fitted values")
plot(Times$Year, Women.res, main = "Plot against predictor")

These two plots do look similar, because we only have one predictor, and the fitted value is just a linear transformation of it. When we get on to more than one predictor, these plots will be different.

Residuals should not have any structure, and we can use them to see structures such as curvature, outliers and heteroscedasticity (the variance changing).

Regression Assumptions

For data sets 5 - 8, which assumption is wrong?

Plot the residuals against the fitted values for all 8 plots.
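Here is a sketch to get you started, assuming the SimRegression.csv data from the earlier exercise (remember y7 is paired with x7, the rest use x):

Data <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week6/SimRegression.csv")
par(mfrow=c(2,4))
for(resp in paste0("y", 1:8)) {
  covar <- if(resp=="y7") "x7" else "x" # y7 has its own covariate
  mod <- lm(Data[[resp]] ~ Data[[covar]])
  plot(fitted(mod), residuals(mod), main=resp,
       xlab="Fitted values", ylab="Residuals")
}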

Normal Probability Plots

Residual plots can show some deviant patterns, but they are poor as a test of normality. There are formal tests of normality, but they generally don’t work too well. Instead we can use normal probability plots. These are a particular sort of ‘Q-Q plot’ (Q-Q = quantile-quantile). The idea of a Q-Q plot is that if we have \(N\) samples from a distribution, then the smallest should be approximately the same as the \(1/(N+1)\)th quantile, the next smallest approximately the \(2/(N+1)\)th quantile, etc.

So, for example, for a uniform distribution the quantiles should be evenly spaced:

N <- 50
NUnif <- runif(N, 0, 1)

par(mfrow=c(2,1), mar=c(2,2,1,1))
plot(1:N, NUnif, xlab="Order drawn", ylab="Value", main="Un-sorted")
plot((1:N)/(N+1), sort(NUnif), xlab="Rank", ylab="Value", main="Sorted")
abline(0,1)

For a normal distribution the spacing is uneven, but we can calculate it:

N <- 50
# NNorm <- scale(rnorm(N))
NNorm <- rnorm(N, 0, 1)

par(mfrow=c(3,1), mar=c(2,2,1,1))
plot(1:N, NNorm, xlab="Order drawn", ylab="Value", main="Un-sorted")
plot((1:N)/(N+1), sort(NNorm), xlab="Rank", ylab="Value", main="Sorted")
plot(qnorm((1:N)/(N+1)), sort(NNorm), xlab="Rank", ylab="Value", main="Quantiles")
abline(0,1)

R even has code specifically to make these plots, so everything’s easier. The line that is drawn is the expected line, which the points should fall along.

qqnorm(NNorm)
qqline(NNorm)

What you can see

Your Turn…

You can simulate and change your data like this:

# simulate a normal distribution with 50 points, a mean of 0 and variance 1
SimNormData <- rnorm(50, 0, 1)
# Change the 13th point
SimNormDataChange <- SimNormData; SimNormDataChange[13] <- 50
# Transform the data
SimNormDataTransform <- SimNormData^2

Make your own bad data

For each question below, create the data using the code, plot the graphs, and try to interpret them!

Start off with creating an outlier using the code below.

This is easy: add a “bad” point

Remember indexing with [] (some notes in this week’s exercise)

You will need to remove the comments (the #s) from the plotting lines

# make good data 
x.badness <- 21:70
y <- rnorm(length(x.badness), x.badness, 10)
plot(x.badness,y)

# add an outlier
y.outlier <- y
y.outlier[20] <- 200
mod.outlier <- lm(y.outlier~x.badness)
#qqnorm(resid(mod.outlier)) 
#qqline(resid(mod.outlier))

Now try some skewness.

# add skew
err.skew <- rnorm(length(x.badness), 4, 1)^2
y.skew <- x.badness + err.skew
mod.skew <- lm(y.skew~x.badness)
par(mfrow=c(1,2))
#plot(x.badness, y.skew)
#qqnorm(resid(mod.skew))
#qqline(resid(mod.skew))

Finally some thick tails.

# make tails thick
err.tailed <- rnorm(length(x.badness), 4, c(1,5)) # repeat std dev 1, 5
y.tailed <- x.badness + err.tailed
mod.tailed <- lm(y.tailed~x.badness)
par(mfrow=c(1,2))
#plot(x.badness, y.tailed)
#qqnorm(resid(mod.tailed)); qqline(resid(mod.tailed))

Leverage

This is less well known, but can be a problem, and can’t always be seen with residuals. We can use data sets 6 and 7 to see the problem. First, here are the data sets, with the troublesome points highlighted.

We can fit a model and plot the residuals (the y-axis is jittered: the points have had a small amount added to them, so they are easier to see):

For data set 6 the outlier is obvious in the residuals. But for data set 7 the obviously weird point has a residual that is almost 0. If we look at the regression line, we can see that the weird point pulls the line towards itself, which is why its residual ends up so small. We talk about it as an influential point, because it has a large influence on the fitted line.

We can’t use residuals to find influential points like these. Instead, we ask how much the fitted values for the other points change if we remove a data point. For each point \(i\) we predict the other points with the full model, calling these \(\hat{y}_j\), and also predict them with a model where point \(i\) has been removed (calling these \(\hat{y}_{j(-i)}\)). A large difference, \(\hat{y}_j-\hat{y}_{j(-i)}\), suggests a large influence. How large is ‘large’ will depend on how variable the data are, so we scale by the residual variance to get a statistic called Cook’s D:

\[ D_i = \frac{\sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(-i)})^2}{s^2} \] where \(s^2\) is the residual variance. Large values of \(D_i\) mean a large influence, so points with \(D_i > 1\), or with \(D_i > 4/n\) (depending on who you ask!), are usually flagged as influential.
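In R, Cook’s D is computed by cooks.distance(). For example, for data set 7 from earlier (assuming mod7 from that exercise), we can flag the points above either cut-off:

D7 <- cooks.distance(mod7)
n <- length(D7)
which(D7 > 4/n) # points flagged by the more liberal cut-off
which(D7 > 1)   # points flagged by the stricter cut-off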

Your task

Fit the model with and without the weird point. You can remove the point like this:

DataNotWeird <- Data[Data$x7<10,]

Look at the fitted models. How similar are they?
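One way to compare them is to put the estimated coefficients side by side (a sketch, using the Data and DataNotWeird objects from above):

mod7.all <- lm(y7 ~ x7, data=Data)
mod7.sub <- lm(y7 ~ x7, data=DataNotWeird)
cbind(All=coef(mod7.all), WithoutWeird=coef(mod7.sub))

We can also look at Cook’s D for the women’s times model from earlier: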

cooks.distance(WomenMod)[1:5]
##           1           2           3           4           5 
## 0.643742510 0.001416355 0.018007524 0.243659099 0.017246109

How good is my model? A Summary

Before looking at how to improve models, it is worth summarising how to check a model.

The model can be read as a sum of the fit and the residuals. The proportion of the total variation explained by the model is summarised with \(R^2\).

The residuals should have no pattern. Here we look at them graphically, because a couple of graphs can reveal a lot of problems (R can draw the standard set in one go, as sketched after this list):

  • Residual plots can show curvature, outliers and heteroscedasticity
  • Normal Probability Plots can show a lack of normality (e.g. skewness), and also outliers
  • Cook’s D can be used to identify influential points.
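R’s plot() method for a fitted lm object draws the standard set of diagnostic plots, including residuals against fitted values, a normal Q-Q plot, and a residuals-against-leverage plot with Cook’s distance contours:

# the default diagnostic plots for a fitted linear model
par(mfrow=c(2,2))
plot(WomenMod)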

How can we improve the model?

If we have identified some problems, what do we do? The first thing to do is to check the data and model for silly mistakes, like typos. Once that has been done, we should start to think about what the mis-fit means, and whether it will cause any real problems. For example, does it change our conclusions? Or, if we are making predictions, we want to know if the mis-fit will change the predictions.

If we find individual data points are a potential problem (e.g. they are outliers or influential), we first need to check that they are not typos, or other errors (e.g. computers reading them in incorrectly). If they are genuine, then we can remove them and see if that makes a big difference. If it does, we have to think hard. It is possible that they really should be ‘in’ the data, so that we would be inflating the fit of the model by removing them. And if there are more than one or two points that we could remove, it might be that the problem is not that they are outliers. For example, it could be that the distribution genuinely has more large values than expected from a normal distribution (the technical term for this is that the data are leptokurtic; this term will not be on the exam).

Other problems are due to the model, as a whole, not fitting. In particular, we can sometimes see curvature or skewness in the residuals. In these cases there are a few solutions, based on transformations. One possibility, if we only have curvature, or if the curvature is only apparent in the plot against one covariate (but not others), is to transform the covariate, e.g. \(\sqrt{x_i}\), \(x_i^2\), \(\log(x_i)\), so the model becomes

\[ y_i = \alpha + \beta x_i^p + \varepsilon_i \] where, for example, \(p=2\) for the squared transformation. The transformation can sometimes make biological sense. An alternative approach is to add more terms, for example a quadratic term:

\[ y_i = \alpha + \beta x_i + \gamma x_i^2 + \varepsilon_i \] There will be more about this later, when we get on to multiple regression.
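In R, both options can be written directly in the model formula. Here is a small sketch on simulated data (the variable names are made up for illustration); I() is needed so that ^ is treated as arithmetic rather than as part of the formula:

# simulate a response that curves with the covariate
x.curve <- seq(1, 10, length=50)
y.curve <- rnorm(length(x.curve), 2 + 0.5*x.curve^2, 2)
mod.sqrt <- lm(y.curve ~ sqrt(x.curve))          # transform the covariate
mod.quad <- lm(y.curve ~ x.curve + I(x.curve^2)) # add a quadratic term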

An alternative approach is to transform the response, e.g. \(\sqrt{y_i}\), \(y_i^2\), \(\log(y_i)\), to give a model like this:

\[ y_i^p = \alpha + \beta x_i + \varepsilon_i \] Doing this can change both the linearity and the skewness, so it can be a bit more tricky. But there are times when it is the clear thing to do. There is a general class of transformations that can be used, the Box-Cox family:

\[ y_i \rightarrow y_i^p \] e.g. if we want a square root transformation, we can use \(p = 1/2\). If we decide \(p = 0\) is appropriate, we use \(\log(y_i)\). The transformation has two effects: on the curvature of the response, and on how the variance changes with the mean. We can look at the effect on curvature first:

x <- seq(10,100, length=30)
y <- rnorm(length(x), x, 5)
ySq <- y^2
ySqrt <- sqrt(y)
ylog <- log(y)

par(mfrow=c(2,4), mar=c(4.1,2.1,3,1), oma=c(0,2,0,0))
plot(x, y, main="Untransformed")
 abline(lm(y~x), col=2)
plot(x, ySq, main="Squared Transformation")
 abline(lm(ySq~x), col=2)
plot(x, ySqrt, main="Square Root Transformation")
 abline(lm(ySqrt~x), col=2)
plot(x, ylog, main="log Transformation")
 abline(lm(ylog~x), col=2)

# residual plots for each transformation
plot(x, resid(lm(y~x)))
plot(x, resid(lm(ySq~x)))
plot(x, resid(lm(ySqrt~x)))
plot(x, resid(lm(ylog~x)))

And then how the variance changes with the mean. If the variance is constant, we say the data are homoscedastic. If the variance changes with the mean, it is heteroscedastic. These are two terms that will also not be on the exam (especially not an oral exam).

The Box-Cox transformation can change this relationship:

het.x <- rep(1:101, 2)
a <- 2
b <- 1
sigma2 <- het.x^1.3 # the variance increases with x
eps <- rnorm(length(het.x), mean=0, sd=sqrt(sigma2))
het.y <- a + b*het.x + eps
mod.het <- lm(het.y ~ het.x)
mod.het2 <- lm(sqrt(het.y) ~ het.x)
## Warning in sqrt(het.y): NaNs produced
par(mfrow=c(1,3), mar=c(4.1,2.1,3,1), oma=c(0,2,0,0))
plot(het.x, het.y, main="Data")
abline(mod.het, col=2)

plot(fitted(mod.het), resid(mod.het), main="Residuals")
plot(fitted(mod.het2), resid(mod.het2), main="Residuals, sqrt transformation")

Box-Cox in R

We could choose the transformation by trying several different ones. This is a good way of getting a feel for what the transformation is doing, and it doesn’t usually have to be precise. But R has a function to find the best Box-Cox transformation, which is in the MASS package:

library(MASS)
x <- 1:50
y <- rnorm(50,0.1*x, 1)^2
boxcox(lm(y ~ x)) # 0.5 is true transformation
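boxcox() also returns the values of \(p\) (called lambda) and their profile log-likelihoods, so we can pull out the value with the highest likelihood (here it should be close to the true value of 0.5):

bc <- boxcox(lm(y ~ x), plotit=FALSE)
bc$x[which.max(bc$y)] # the best-supported transformation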

Your Turn

Unfortunately boxcox() needs positive responses, so we can’t use the data we have already been using. Instead we can create data from a Gamma distribution with rgamma(). It has two parameters (after the number of points to simulate):

  • Shape: controls the skew (the higher the shape, the more symmetrical)
  • Scale: controls the mean (mean = shape*scale)
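As a quick check of the mean relationship (an aside, not part of the exercise), we can simulate a large sample:

mean(rgamma(1e5, shape=5, scale=2)) # should be close to 5*2 = 10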

We can create a (bad) model by keeping the shape constant but letting the scale vary with x:

x <- seq(1,10, length=50)
y1 <- rgamma(length(x), shape=5, scale=x)
y2 <- rgamma(length(x), shape=100, scale=x)
y3 <- rgamma(length(x), shape=5, scale=x^2)
par(mfrow=c(1,3))
plot(x,y2, main="Larger Shape")
plot(x,y1)
plot(x,y3, main="Scale Quadratic")

  • Look at the curves, and describe what they look like
  • Regress each y (i.e. y1, y2, y3) against X
  • Check the residuals.
  • See if a transformation helps, e.g.
gam.mod <- lm(y1 ~ x)
library(MASS)
boxcox(gam.mod)
  • if a transformation is suggested, try it, and check the residuals again
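For example, if the boxcox() plot points towards a log transformation (\(p\) close to 0), the refit and re-check might look like this (a sketch; the transformation suggested for your data may differ):

gam.mod.log <- lm(log(y1) ~ x) # refit with the transformed response
par(mfrow=c(1,2))
plot(fitted(gam.mod.log), resid(gam.mod.log), main="Residuals after transformation")
qqnorm(resid(gam.mod.log)); qqline(resid(gam.mod.log))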

A Word of Caution

Don’t overinterpret


Summary

We now know how to assess the model fit:

  • \(R^2\) shows how much variation the model explains
  • Residual plots and Normal Probability Plots can show curvature, outliers, and varying variance
  • Influential Points can be detected using Cook’s D. These may not be large outliers!
  • We should check outliers & other odd points - are they typos?
  • We can try to transform the response to get a better model

  1. This is not quite true: because we are estimating them, the estimation induces a slight correlation between the residuals. But this is almost universally ignored in practice.