Exercise 5: Multiple regression

SOLUTION

Instructions:

Hints and reminders are italic

Questions appear in blue.

Answers are in red or as R code!.

Resources:

http://reliawiki.org/index.php/Multiple_Linear_Regression_Analysis (This has far too much information but the first couple of pages have some nice figures and some background)
Google (or any other search engine)

The challenge: Will we be taken over by aliens?

It is a time in the near future, space travel has become more common. Ships leave from earth every few months on exploration missions of the galaxy. Ship “Explorer 5” has been searching to see if any life on other planets could be useful on earth. During the mission the crew found a planet with life and collected some organisms. The alien organism 101 is a multicoloured, modular, plant-like organism (the plant-animal divide is less clear for aliens). The organism seems to grow well in air similar to earth's atmosphere and could be useful as it grows flower-like structures made of gold.

Alien picture

On their way back to earth the scientific crew of “Explorer 5” have been conducting experiments on alien 101 to try and determine the conditions under which it grows best. The scientists grew alien 101 in containers under different temperatures (ºC) and rainfall (mm) conditions. They recorded the weight of biomass (g) of alien 101 per square metre after one week.

Unfortunately, on the journey back, the ship hit a satellite and has crash landed on Australia. The crew were able to survive, but the experimental lab has been badly damaged. Alien 101 has been released into the wild on earth.

You are a local team of scientists assisting the space travel company with predicting how the organism might spread. The company want to know where to focus their containment resources to stop the organism from taking over. Even if it could be useful, they don't know how it could damage earth's wildlife.

It is your job to predict where the alien organism will spread to and recommend how to deploy resources

The data can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week7/Alien101LabData.csv

The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

Alien <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week7/Alien101LabData.csv", 
                     header=T)

The next step is to plot the data.

hint: use pairs() to see all the data at once

pairs(Alien)

plot of chunk unnamed-chunk-2

1. Run a simple linear regression for each of i) temperature and biomass, ii) rainfall and biomass. What are the coefficient estimates, the confidence intervals, and how much variance do the models explain?

Hint1: think about which is your response and which is your explanatory variable

Hint2: for the lm() you will need to use format lm(y ~ x, data = YourData)

ModelTemp <- lm(Biomass ~ Temperature, data = Alien)
ModelRain <- lm(Biomass ~ Rainfall, data = Alien)

coef(ModelTemp)

##  (Intercept)  Temperature 
## 4998.9263681    0.2643539

coef(ModelRain)

##  (Intercept)     Rainfall 
##  5.00970e+03 -6.80879e-03

confint(ModelTemp)

##                    2.5 %       97.5 %
## (Intercept) 4997.7146160 5000.1381202
## Temperature    0.2112134    0.3174944

confint(ModelRain)

##                     2.5 %        97.5 %
## (Intercept)  5.008427e+03  5.010973e+03
## Rainfall    -8.274233e-03 -5.343348e-03

summary(ModelTemp)$r.squared

## [1] 0.4986085

summary(ModelRain)$r.squared

## [1] 0.4645216

2. Interpret the results of the separate linear regressions. What do the results suggest about the ideal conditions for alien 101?

Separately the models suggest a positive relationship between biomass and temperature (confidence intervals do not span 0) and a negative relationship between rainfall and biomass (confidence intervals also do not span 0). As the confidence intervals of each relationship do not span 0, we would be unlikely to see these effects if nothing was going on i.e. a null of no relationship was true. Including uncertainty, we still get relationships in the estimated direction. Altogether this suggests that the ideal conditions for alien 101 are hot and dry (these conditions would increase biomass).

You now have some results about how temperature and rainfall (individually) influence the growth of alien 101. But you want to generate predictions of how it could spread around Australia. You can do this by creating predictions of the amount of biomass for the actual average annual temperature and rainfall of Australia. The company have provided you with very simplifed data of the average annual temperature and total annual rainfall for different coordinates in Australia. Found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week7/Australia.csv

This is plotted below.

Using this you can try to guess where the alien might spread to based on what you have found out about the influence of temperature and rainfall on alien 101's growth. Here you assume that amount of biomass is an indicator of how suitable the conditions are for alien 101.

plot of chunk unnamed-chunk-4

While guessing is ok. It would be better to predict actual numbers based on our linear models. To do this, you will need to:

Import the Australia data.
Edit the code below to work for your data and model.

# First thing for the prediction is to create the newdata object
# you will need as an argument for the predict() function.

# Here you need two, one for temperature and one for rainfall.
# Newdata needs to be a datafame so the function data.frame() is very
# useful here. My data is called Australia and my models are ModelTemp
# and ModelRain.
# Column names in newdata MUST be identical to those in the
# data used to make the model (the lab data). 

# newdata for temperature (it makes a dataframe with a column called 
# Temperature from the Temp column in the Australia data)
newdataTemp <- data.frame(Temperature = Australia$Temperature)

# newdata for rainfall
newdataRain <- data.frame(Rainfall = Australia$Rainfall)

# now predict - this should be familiar now
PredictionsTemp <- predict(ModelTemp, newdata = newdataTemp, 
                           interval="prediction")

# now predict - this should be familiar now
PredictionsRain <- predict(ModelRain, newdata = newdataRain, 
                           interval="prediction")

Now you have predictions for the biomass in g/m² for the whole of Australia. If you look at the results, the numbers are very large. This is because biomass is in grams. From looking just at the numbers it is difficult to interpret these predictions. So you ask another member of your team to plot the results back onto the map of Australia. You have already noticed that the direction of the pattern is the same for the upper and lower prediction intervals, even if the actual biomass weight has uncertainty. So you only ask for a plot of the relative biomass (i.e. it is colour coded based on high, medium, low not absolute biomass values). This creates a simplified easily interpretable graph, but we should not forget about uncertainty when interpreting results.

plot of chunk unnamed-chunk-6

3. Based on these results where would you recommend the company focusses its containment resources?

The temperature predictions suggest high biomass only in the very north of Australia. This would suggest only a small number of resources needed in the this area. The rainfall predictions suggest high biomass in the very middle of Australia. To combine you could have resources in the middle and only in the very north. The rest of the island has low predicted biomass so does not need high resources.

Your team have made some predictions based on using temperature and rainfall separately. But you know what you can put both variables into a combined multiple regression model. So you now decide to do this.

4. Write a line of R code (and run it) to run a multiple linear regression including temperature and rainfall.

ModelBoth <- lm(Biomass ~ Temperature+Rainfall, data = Alien)

coef(ModelBoth)

##   (Intercept)   Temperature      Rainfall 
##  5.002993e+03  1.734513e-01 -2.802242e-03

confint(ModelBoth)

##                     2.5 %        97.5 %
## (Intercept)  4.998771e+03  5.007215e+03
## Temperature  6.887640e-02  2.780262e-01
## Rainfall    -5.592785e-03 -1.169943e-05

summary(ModelBoth)$r.squared

## [1] 0.5183331

5. How can we write the multiple linear regression as an equation? Hint: think about how this is different to a simple regression

You can write it as matrices or as a sum of all betas multiplied by all Xs. Writing as a matrix is neater and easier. It will also be useful for other linear models.

6. Compare the estimates from the multiple regression to those from the individual regressions. Include coefficients, confidence intervals, and R squared

Remember you can also plot the results. To do this you should plot each variable against biomass separately i.e. biomass against temperature and biomass agaisnt rainfall. (Example below)

R hints:

Use par(mfrow=c(1,2)) to create a plot area with one row and 2 columns this lets you see both of your plots next to each other.
To add the model line to the plot you can use abline().
Usually for abline you just include a=intercept and b=slope but for a multiple regression model you also need to include the influence of the other variable. Therefore, the intercept actually = intercept of variable + (mean of other variable equals slope of other variable)

The estimate values have changed a bit. They are still in the same direction as before. But the size of each has decreased; temperature (from 0.26 to 0.17), rainfall (from -0.006 to -0.003) The confidence intervals still do not cross 0 for either effect and the intercept has remained almost the same as it was for temperature. The R-squared has not increased much from the individual models (0.52 for joint), it is certainly not the sum of the two. The two variables do not explain completely different variation.

# EXAMPLE:

# make two plot window
par(mfrow=c(1,2))

# make the plot of temperature and biomass
plot(Alien$Temperature, Alien$Biomass)

# save the coefficient values as an object
# contains 3 values, intercept, slope for temperature,
# slope for rainfall
coefs <- coef(ModelBoth)

# use the coefficients to plot the line for temperature
abline(a=coefs[1]+(mean(Alien$Rainfall)*coefs[3]), b=coefs[2], col=2)

# make the plot of Rainfall and biomass
plot(Alien$Rainfall, Alien$Biomass)

# save the coefficient values as an object
# contains 3 values, intercept, slope for temperature,
# slope for rainfall
coefs <- coef(ModelBoth)

# use the coefficients to plot the line for temperature
abline(a=coefs[1]+(mean(Alien$Temperature)*coefs[2]), b=coefs[3], col=2)

7. How does R find the estimates of the parameters for these linear regressions? Hint: think back to earlier weeks - its the same

Maximum likelihood, we are still using it here.

8. Perform model checking of the multiple linear regression. Include 3 plots of model checking and decide if you think the fit is ok

Hint: code available in Exercise 4

# Take rounded fitted and residual values from the multiple regression
# model
BothFitted <- round(fitted(ModelBoth),2)
BothResiduals <- round(residuals(ModelBoth),2)

# Plot the fitted versus residuals 
plot(BothFitted, BothResiduals)
# add a horizontal line at 0
# line is grey and dashed (lty=2)
abline(h=0, lty=2, col="grey")

plot of chunk unnamed-chunk-9

# Plot a normal Q-Q plot
qqnorm(BothResiduals)
qqline(BothResiduals)

plot of chunk unnamed-chunk-9

# Plot Cook's distance
plot(ModelBoth, which=4)

plot of chunk unnamed-chunk-9

I am happy with the fitted versus residuals. It seems to meet the linearity assumption. The Normal QQ shows that it also meets the normality assumption well. The Cook's distance identifies 3 outliers, but these are not far from the rest of the data points and do not seem to have a big influence so I would not remove them. I would not find a strong reason to do so. They do not seem to be typos.

Now you have checked your model and hopefully are either happy with it or have fixed any assumptions that were violated. So now you can generate new predictions of biomass in Australia based on the multiple regression model. This is still not the full picture, remember to look at the uncertainty too. You can do this by looking at your object containing the predictions.

9. Predict where the organism will spread to, based on the multiple regression.

Hint: You can do this by editting the prediction code above so it predicts from the multiple regression. Now you also only need one newdata object with two columns.

Hint: you will need a single newdata object with two columns

PredictionsBoth <- predict(ModelBoth, newdata = data.frame(Temperature = Australia$Temp,
                                                           Rainfall = Australia$Rain), 
                           interval="prediction")

plot of chunk unnamed-chunk-11

10. What would you recommend in terms of deployment of resources based on your predicted spread? Has this changed from before?

Hopefully it has now changed from focussing purely on the very north and middle to now focussing more on the middle and all of the north of Australia. The multiple regression gives a slightly different picture than having two individual regression analyses. The pattern is similar but the multiple regression predicts a larger area of highest biomass. Without the multiple regression you might have underpredicted how many resources to use.

11. Can you draw any biological conclusions about this alien species? Can you say anything about the relative influence of temperature and rainfall? Are the variables in this analysis enough, what else could have an influence?

Alien 101's growth is influenced by both temperature and rainfall. The effects are in opposite directions. The alien grows best in warm but dry conditions. We would expect it to spread in desert like conditions. Could draw any links to biology we know about deserts. We cannot say much from this analysis about the relative influences of temperature and rainfall, because they have been measured on very different scales. Therefore the beta (slope) values are not directly comparable. If we were to standardise our explanatory variables to remove the units (divide by standard deviation and center on the mean) then we could do a comparison.

We could include any other variables that are biologically sensible e.g. herbivory, nutrients, human pressure, pollution. Basically what we have is not enough for a complete picture but it is a good start. We are also missing any explicit consideration of space or connectivity. We do not assume that any areas are similar beyond rain or temperature. Nor do we look at where the space station crashed and any barriers to disperal. These could all be added. This is not a full list of all possible additions. Other ideas could be right too, this is just an example.

12. Why do we use multiple regressions rather than running individual simple linear regressions?

Rarely does a single variable influence something. This could be the case for carefully designed experiments. But often many variables influence our observations and we should take account of these. As we saw here, it can change what we would predict and the conclusions we can draw. But we don't want to include everything. Models should still be a simplification of reality. More on this in model selection.