---
title: "Multiple Regression"
author: "Bob O'Hara"
output:
  html_document:
    df_print: paged
---

This week: Multiple Regression

We will look at how to fit and interpret regression models with more than one covariate: the model and its design matrix, centring and standardising covariates, and using the same machinery to fit polynomials.

More Monsters

In the cellar of the museum in Frankfurt we had a population of Schey. These are small creatures that lurk in the dark and eat ancient dust and stale cobwebs. Some of us wanted to know more about them, and whether they could be trained to clean the museum collections.

We caught 100 and measured the amount of dust they could eat in 5 minutes, and wanted to explain that by their body size and their gape size (i.e. how large their mouths are).

The Data

Before, we looked at regression against a single covariate. Now we want to look at regression against two covariates: the extension to more than two is straightforward.

This is a more common situation in practice than regression against a single covariate. Several factors often affect a response, so we need to take this into account. Even worse, they might interact, i.e. the effect of one covariate might depend on the value of another. Sometimes we are interested in all of the effects; at other times we are only interested in some, but worry that others might also have an effect and need to be included in the analysis (leaving them out can bias the results, and also make the estimates worse).

This is our model for simple regression

\[ y_i = \color{red}{\alpha + \beta x_i} + \color{blue}{\varepsilon_i} \]

The obvious extension to two covariates is simply to add on the effect of the second covariate

\[ E(y_i) = \color{red}{\alpha + \beta_1 x_{1i} + \beta_2 x_{2i}} \]

This equation is for a plane (the extension of a line to 3D, of course). We can visualise the model (with a bit of difficulty). The plane is the black grid, the data are the red dots, which are either above or below the plane. The blue lines are the residuals: they project the points onto the plane, so we can see where their expected values are.

In many ways this model is almost the same as a simple regression. We have a fitted part, which is just a bit more complicated than before, and residuals, which are the same as before. Because the models are so similar, we can use the same tools on them (e.g. model checking can be done in the same way). Indeed, we use the same R function to fit the model (the maths is explained in more detail below):

FullMod <- lm(Dust ~ GapeSize + BodySize, data=Schey)
summary(FullMod)

The only change is in the formula. It was

Y ~ X

now it is

Y ~ X1 + X2
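As an aside, here is one possible way to sketch the fitted plane described above, using base R's persp(). This is only a sketch: the viewing angles and grid size are arbitrary choices, and the figure in the notes may have been drawn differently.

# Predict dust eaten over a grid of gape and body sizes, then draw the surface
gape <- seq(min(Schey$GapeSize), max(Schey$GapeSize), length=20)
body <- seq(min(Schey$BodySize), max(Schey$BodySize), length=20)
DustHat <- outer(gape, body, function(g, b)
  predict(FullMod, newdata=data.frame(GapeSize=g, BodySize=b)))
persp(gape, body, DustHat, theta=30, phi=20,
      xlab="Gape size", ylab="Body size", zlab="Dust")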

Your Turn

For the data:

Regression More Generally

The model above only has 2 covariates, but we can easily add more. The model will look like this

\[ \begin{aligned} y_i &= \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \dots + \beta_p x_{ip} + \varepsilon_i \\ y_i &= \alpha + \sum_{j=1}^p \beta_j x_{ij} + \varepsilon_i \end{aligned} \]

Obviously we can add as many covariates as we want, although the model will not fit if there are more covariates than data points, and in practice we would like to have far fewer covariates, because each covariate makes the model a little less certain.

Writing the model like this can get messy, especially if we want to manipulate it. But we can write it in matrix form. This is, in some ways, a detail, but the practical upshot is that we can work with the matrix formulation to find out how to fit the model, and then for any complicated model we just have to be able to write it in this matrix form, and everything else just follows.

The first step in writing this as a matrix is to turn the intercept into a covariate, by using a covariate with a value of 1 for every data point. Then we write all of the covariates in a matrix, \(X\).

\[ X = \left( \begin{array}{ccc} 1 & 2.3 & 3.0 \\ 1 & 4.9 & -5.3 \\ 1 & 1.6 & -0.7 \\ \vdots & \vdots & \vdots \\ 1 & 8.4 & 1.2 \\ \end{array} \right) \]

The first column is the intercept, the second is the first covariate, and the third is the second covariate. This is called the Design Matrix. Using matrix algebra, the regression model becomes

\[ \mathbf{Y} = X \mathbf{\beta} + \mathbf{\varepsilon} \]

where \(\mathbf{Y}\) and \(\mathbf{\varepsilon}\) are now vectors of length \(n\) (where there are \(n\) data points), and \(\mathbf{\beta}\) is a vector of length \(p+1\). \(X\) is an \(n \times (p+1)\) matrix. We will not look at the mathematics in any detail: the point here is that the model for the effect of covariates can be written in the design matrix. It turns out that this is very flexible; if we have more covariates, or interactions between covariates, we can still write them in a design matrix. The model, in all its ugly glory, is

\[ \left( \begin{array}{c} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \\ \end{array} \right) = \left( \begin{array}{ccc} 1 & 2.3 & 3.0 \\ 1 & 4.9 & -5.3 \\ 1 & 1.6 & -0.7 \\ \vdots & \vdots & \vdots \\ 1 & 8.4 & 1.2 \\ \end{array} \right) \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \\ \end{array} \right) + \left( \begin{array}{c} \varepsilon_1 \\ \varepsilon_2 \\ \varepsilon_3 \\ \vdots \\ \varepsilon_n \\ \end{array} \right) \]

After a bit of matrix algebra, one can find the maximum likelihood solution

\[ \mathbf{b} = (X^T X)^{-1}X^T \mathbf{Y} \] where \(\mathbf{b}\) is the MLE for \(\mathbf{\beta}\). We won’t show you the proof, and you won’t need to remember this. In practice the computer will do all the calculations for you.
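If you are curious, you can check this formula against lm() in R. This is a small sketch, assuming the Schey data and FullMod from above:

# The design matrix: an intercept column of 1s plus the two covariates
X <- model.matrix(~ GapeSize + BodySize, data=Schey)
# (X^T X)^{-1} X^T Y, done "by hand"
b <- solve(t(X) %*% X) %*% t(X) %*% Schey$Dust
# The two columns should agree
cbind(byHand = drop(b), fromLm = coef(FullMod))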

Understanding a model

The plot below shows the data and the line fitted for the model with just body size. The code also draws the line from the model with both body size and gape size, but we don't see it.

BodyMod <- lm(Dust ~ BodySize, data=Schey)  # the simple regression, body size only
plot(Schey$BodySize, Schey$Dust)
abline(a=coef(BodyMod)["(Intercept)"], b = coef(BodyMod)["BodySize"])
abline(a=coef(FullMod)["(Intercept)"], b = coef(FullMod)["BodySize"], col=2)
legend(9.025, 105.6, c("Body Size", "Body & Gape"), lty=1, col=1:2)

The reason the line isn’t drawn is that it’s in the wrong place. The model that was fitted was

\[ y_i = \color{red}{\hat{\alpha} + \hat{\beta_{1}} x_{i1} + \hat{\beta_{2}} x_{i2}} + \color{blue}{\varepsilon_i} \] (\(x_{i1}\) is Body Size, \(x_{i2}\) is Gape Size. The hats on Greek letters show that we are using the estimates of the parameters). But this code

abline(a = coef(BodyMod)["(Intercept)"], 
       b = coef(BodyMod)["BodySize"])

draws the line \(y_i = \color{red}{\hat{\alpha} + \hat{\beta_{1}} x_{i1}}\). There is no \(\hat{\beta_{2}} x_{i2}\). The fitted model is a plane (in 3D), but we are drawing a line (in 2D) by taking a slice through the plane. Unfortunately we are taking a slice in the wrong place. The code above takes a slice through the plane where \(x_{i2}=0\). But the smallest value of \(x_{i2}\) (gape size) in the data is 4.6.

We could cut the plane anywhere, by plugging in different values for gape size. For example, here are some lines:

# Intercepts for slices through the plane at a range of gape sizes
GapePreds <- seq(floor(min(Schey$GapeSize)), ceiling(max(Schey$GapeSize)), by=0.25)
BetterLines <- data.frame(
  GapePreds = GapePreds, 
  Better.as = coef(FullMod)["(Intercept)"] +  
    coef(FullMod)["GapeSize"]*GapePreds)

par(mar=c(4.1,4.1,1,1))
plot(Schey$BodySize, Schey$Dust)
abline(a=coef(BodyMod)["(Intercept)"], 
       b = coef(BodyMod)["BodySize"])
# Draw one red line per gape size, and label each with its gape size
apply(BetterLines, 1, function(ln, b, at.ln=0) {
  abline(a=ln["Better.as"], b=b, col=2)
  text(at.ln, ln["Better.as"]+at.ln*b, ln["GapePreds"], cex=1.3, font=2)
}, b = coef(FullMod)["BodySize"], at.ln=10.61)

If we want to select a single line to draw, the mean of gape size is a good choice

# Intercept for the slice through the plane at the mean gape size
Better.a <- coef(FullMod)["(Intercept)"] +  
  coef(FullMod)["GapeSize"]*mean(Schey$GapeSize)
par(mar=c(4.1,4.1,1,1))
plot(Schey$BodySize, Schey$Dust)
abline(a=coef(BodyMod)["(Intercept)"], 
       b = coef(BodyMod)["BodySize"])
abline(a=Better.a, b = coef(FullMod)["BodySize"], col=2)

Mean Centering

So far we have not been thinking about the intercept. It is a necessary part of the model, but is not so easy to interpret.

For our data, we can extend the plot so that the x-axis includes 0. This is where the intercept is (because the prediction there is \(E(y_i) = \alpha + \beta_1 0 + \beta_2 0 = \alpha\)). The interpretation is difficult, because we cannot have any Schey with a size and gape of 0, so as a prediction this is silly. That's fine if we don't want to interpret the intercept.

par(mar=c(4.1,4.1,1,1))
plot(Schey$BodySize, Schey$Dust, xlim=range(c(0, Schey$BodySize)), xaxs="i")
abline(a=coef(BodyMod)["(Intercept)"], 
       b = coef(BodyMod)["BodySize"])
abline(a=Better.a, b = coef(FullMod)["BodySize"], col=2)

Sometimes we find it helpful to have an intercept that is interpretable, if only so we can do a sanity check. We can easily move the intercept like this:

par(mfrow=c(1,2), mar=c(2,2,1,1), oma=c(2,2,0,0), bty="n", col="grey50")
plot(Schey$BodySize, Schey$Dust, col=2, xlim=c(0, max(Schey$BodySize)), yaxt="n")
axis(2, pos=0)
plot(Schey$BodySize, Schey$Dust, col=2, yaxt="n", xlim=c(0, max(Schey$BodySize)))
axis(2, pos=mean(Schey$BodySize))
mtext("Body Size", 1, outer=TRUE)
mtext("Dust", 2, outer=TRUE)

In practice this just means subtracting the mean from Body Size

Schey$BodySize.c <- Schey$BodySize - mean(Schey$BodySize)
plot(Schey$BodySize.c, Schey$Dust, col=2, 
     yaxt="n", bty="n")
axis(2, pos=0)
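Writing out the centred model (for one covariate) shows what the new intercept means:

\[ y_i = \alpha^* + \beta (x_i - \bar{x}) + \varepsilon_i \]

so when \(x_i = \bar{x}\) the expected value is just \(\alpha^*\): the intercept is now the expected amount of dust eaten by a Schey of average body size.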

Your task

Schey$BodySize.c <- Schey$BodySize - mean(Schey$BodySize)
Schey$GapeSize.c <- Schey$GapeSize - mean(Schey$GapeSize)

FullMod <- lm(Dust ~ GapeSize + BodySize, 
                data=Schey)
FullMod.c <- lm(Dust ~ GapeSize.c + BodySize.c, 
                  data=Schey)

Scaling and Standardisation

Body size was measured in grams, but it could also be measured in kg. If we fit the model with body size in kg, we see that the effect of body size is massive.

Schey$BodySize.kg <- Schey$BodySize/1000
mod.kg <- lm(Dust ~ GapeSize + BodySize.kg, data=Schey)

round(coef(mod.kg), 2)
## (Intercept)    GapeSize BodySize.kg 
##        9.38       10.62     4509.23

Why is the effect so massive? We want you to think about this, and discuss the parameters and what they mean.
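If you want to check your reasoning numerically, one way (assuming FullMod, with body size in grams, from earlier) is to compare the two sets of coefficients:

# Only the body size coefficient should change, and only by the re-scaling
# factor (1000 g per kg); the intercept and the gape size effect stay the same
round(coef(FullMod), 2)
round(coef(mod.kg), 2)
coef(mod.kg)["BodySize.kg"]/coef(FullMod)["BodySize"]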

Because we can re-scale the covariates (and with them the parameters) and still do the regression, we can (if we want) re-scale them to anything we think is sensible. For example, we could swap temperatures between Kelvin, Celsius and Fahrenheit depending on our whims. One way of re-scaling that is often used is to standardise the covariates to have a variance (and standard deviation) of 1. Here are two ways of doing this in R: the first does it “by hand”, the second uses an R function. Both do the same thing.

Schey$BodySize.s <- (Schey$BodySize - mean(Schey$BodySize))/
  sd(Schey$BodySize)
Schey$GapeSize.s <- scale(Schey$GapeSize)
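If you want to convince yourself that the two approaches give the same numbers, a quick check is:

# scale() returns a one-column matrix, so strip it back to a plain vector first
all.equal(Schey$BodySize.s, as.numeric(scale(Schey$BodySize)))  # should be TRUE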

Now a difference of 1 means a difference of 1 standard deviation in the data (e.g. 1 standard deviation in body size, rather than 1 g). But internally the model is the same (see footnote 1), although the parameters have different values and thus have to be interpreted slightly differently.

FullMod.s <- lm(Dust ~ GapeSize.s + BodySize.s, 
                data=Schey)
round(coef(FullMod.s), 3)

Fit the model with the standardised covariates

Polynomials

We can look back at Data Set 8 from last week. This was the relationship that wasn't straight:

SimData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week6/SimRegression.csv")
plot(SimData$x, SimData$y8, main="Data Set 8")

There are a few ways to deal with this. One is to use a curve rather than a straight line. There are many choices of curve, but one choice is to use polynomials. In practice most reasonable curves can be written as a Taylor series

\[ f(x) = \beta_0 + \beta_1 (x- \bar{x}) + \beta_2 (x- \bar{x})^2 + \beta_3 (x- \bar{x})^3 + \dots = \sum_{j=0}^{\infty} \beta_j (x- \bar{x})^j \]

So in principle we could fit any curve like this. But fitting a curve with an infinite number of parameters is tricky, so instead we use an approximation with only the terms up to order \(p\)

\[ f(x) \approx \beta_0 + \beta_1 (x- \bar{x}) + \beta_2 (x- \bar{x})^2 + \beta_3 (x- \bar{x})^3 + \dots + \beta_p (x- \bar{x})^p \]

Usually we would want \(p\) to be small (if it gets any bigger than about 4 we should be rethinking what we are doing). Fitting the curve is easy: we just regress \(Y\) against \(X\), \(X^2\), \(X^3\), etc. We don't have to centre, of course, although it might make the interpretation easier.

In R we can simply treat the extra terms as additional variables

linmod <- lm(y8 ~ x, data=SimData)
quadmod <- lm(y8 ~ x + I(x^2), data=SimData)

We need to write I(x^2) rather than just x^2 for slightly obscure reasons, to do with how formulas are written for more complex models (see footnote 2).
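To see what goes wrong without the I(), you can try the naive version (the name wrongmod is just for illustration):

# In a formula, x^2 means "x crossed with itself to degree 2", which
# simplifies to just x, so the quadratic term silently disappears
wrongmod <- lm(y8 ~ x + x^2, data=SimData)
coef(wrongmod)  # only an intercept and a single slope for x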

If we want to plot a polynomial, we can’t use abline(), unfortunately. Instead we have to predict new data, and plot that.

PredData <- data.frame(x=seq(min(SimData$x), 
                             max(SimData$x), length=50))
PredData$y.quad <- predict(quadmod, newdata = PredData)

cubmod <- lm(y8 ~ x + I(x^2) + I(x^3), data=SimData)  # cubic model

PredData$y.lin <- predict(linmod, newdata = PredData)
PredData$y.cub <- predict(cubmod, newdata = PredData)

plot(SimData$x, SimData$y8, main="Data Set 8")
lines(PredData$x, PredData$y.quad, col=2)
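For comparison, the other two fitted curves can be added to the same plot (the colours are arbitrary choices):

lines(PredData$x, PredData$y.lin, col=3)  # straight-line fit
lines(PredData$x, PredData$y.cub, col=4)  # cubic fit
legend("topleft", c("Quadratic", "Linear", "Cubic"), col=c(2,3,4), lty=1)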

Your tasks are to (1) fit the linear and quadratic models to y8.


  1. A quick bit of maths. The standardised model (for one variable) is \[y_i = \color{red}{\alpha + \beta \frac{ (x_{i}-\bar{x})}{s_{x}}} + \color{blue}{\varepsilon_i}\] where \(\bar{x}\) is the mean of \(x\) and \(s_x\) is the standard deviation of \(x\). We can expand the brackets and re-arrange to get \[y_i = \color{red}{\alpha - \beta \bar{x}/s_{x} + (\beta/s_{x}) x_{i}} + \color{blue}{\varepsilon_i}\] But \(\bar{x}\) is a constant (it does not vary between observations), so we have the same form of model as before, just with \(\alpha^* = \alpha - \beta \bar{x}/s_{x}\) and \(\beta^* = \beta/s_{x}\)

  2. Essentially, we want (A + B)^2 to expand to A + B + A:B. We will explain more about this later