Exercise 5: Multiple regression

Instructions:

Hints and reminders are in italics.

Questions appear in blue.

This exercise needs to be completed and handed in by 28th February.


Resources:


The challenge: Will we be taken over by aliens?

It is a time in the near future, and space travel has become more common. Ships leave from earth every few months on exploration missions of the galaxy. Ship “Explorer 5” has been searching to see if any life on other planets could be useful on earth. During the mission the crew found a planet with life and collected some organisms. The alien organism 101 is a multicoloured, modular, plant-like organism (the plant-animal divide is less clear for aliens). The organism seems to grow well in air similar to earth's atmosphere and could be useful, as it grows flower-like structures made of gold.

Alien picture

On their way back to earth the scientific crew of “Explorer 5” have been conducting experiments on alien 101 to try and determine the conditions under which it grows best. The scientists grew alien 101 in containers under different temperature (ºC) and rainfall (mm) conditions. They recorded the biomass (g) of alien 101 per square metre after one week.

Unfortunately, on the journey back, the ship hit a satellite and crash-landed in Australia. The crew survived, but the experimental lab was badly damaged. Alien 101 has been released into the wild on earth.

You are a local team of scientists assisting the space travel company with predicting how the organism might spread. The company want to know where to focus their containment resources to stop the organism from taking over. Even if it could be useful, they don't know how it could damage earth's wildlife.

It is your job to predict where the alien organism will spread to and recommend how to deploy resources.


The data can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week7/Alien101LabData.csv

The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.
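As a minimal sketch of the import step, the code below uses read.csv(). So that it runs without an internet connection, it first writes a tiny stand-in file; the three rows of numbers are made up and are not the lab data. The object name Alien matches the later examples, and the column names are an assumption based on how the data are referred to in those examples.

```r
# Stand-in file so that the sketch runs offline.
# The values are MADE UP; they are not the real lab data.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Temperature,Rainfall,Biomass",
             "10,200,1500",
             "20,400,2600",
             "30,600,3900"), tmp)

# header = TRUE tells read.csv() that the first row holds the column names.
# With the real data, pass the whole web link instead of tmp, e.g.
# Alien <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week7/Alien101LabData.csv", header = TRUE)
Alien <- read.csv(tmp, header = TRUE)

str(Alien)   # check the column names and types
head(Alien)  # look at the first few rows
```

It is worth running str() or head() straight after importing, so you know the exact column names before you start modelling.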

The next step is to plot the data.

Hint: use pairs() to see all the data at once
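A short sketch of what pairs() does is shown here. The data are simulated purely so the code runs on its own; the numbers and the relationships built into them are invented and say nothing about the real alien 101 data.

```r
# SIMULATED stand-in data (invented values, not the lab data)
set.seed(1)
Alien <- data.frame(Temperature = runif(50, 5, 35),
                    Rainfall = runif(50, 0, 1000))
Alien$Biomass <- 500 + 40 * Alien$Temperature + 2 * Alien$Rainfall +
  rnorm(50, sd = 100)

# pairs() draws a scatterplot for every pair of columns in the data frame,
# so you see each explanatory variable against the response and against
# each other in one figure
pairs(Alien)
```

With your own imported data you only need the last line, with your object name in place of Alien.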

1. Run a simple linear regression for each of i) temperature and biomass, ii) rainfall and biomass. What are the coefficient estimates, the confidence intervals, and how much variance do the models explain?

Hint1: think about which is your response and which is your explanatory variable

Hint2: for lm() you will need to use the format lm(y ~ x, data = YourData)
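The sketch below shows one way to pull the three requested quantities out of a fitted model: coef() for the estimates, confint() for the confidence intervals and summary()$r.squared for the variance explained. The data are simulated so the code is self-contained, and the values are invented. Only the temperature model is shown; the rainfall model follows the same pattern.

```r
# SIMULATED stand-in data (invented values, not the lab data)
set.seed(1)
Alien <- data.frame(Temperature = runif(50, 5, 35))
Alien$Biomass <- 500 + 40 * Alien$Temperature + rnorm(50, sd = 100)

# response on the left of ~, explanatory variable on the right
ModelTemp <- lm(Biomass ~ Temperature, data = Alien)

coef(ModelTemp)               # intercept and slope estimates
confint(ModelTemp)            # 95% confidence intervals (the default level)
summary(ModelTemp)$r.squared  # proportion of variance explained
```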

2. Interpret the results of the separate linear regressions. What do the results suggest about the ideal conditions for alien 101?

You now have some results about how temperature and rainfall (individually) influence the growth of alien 101. But you want to generate predictions of how it could spread around Australia. You can do this by creating predictions of the amount of biomass for the actual average annual temperature and rainfall of Australia. The company have provided you with very simplified data of the average annual temperature and total annual rainfall for different coordinates in Australia. The data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week7/Australia.csv

This is plotted below.

Using this you can try to guess where the alien might spread to based on what you have found out about the influence of temperature and rainfall on alien 101's growth. Here you assume that amount of biomass is an indicator of how suitable the conditions are for alien 101.

[Figure: average annual temperature and total annual rainfall across Australia]

While guessing is OK, it would be better to predict actual numbers based on our linear models. To do this, you will need to use the predict() function:

# The first thing you need for the prediction is the newdata object
# that is passed as an argument to the predict() function.

# Here you need two, one for temperature and one for rainfall.
# newdata needs to be a data frame, so the function data.frame() is very
# useful here. My data is called Australia and my models are ModelTemp
# and ModelRain.
# Column names in newdata MUST be identical to those in the
# data used to make the model (the lab data).

# newdata for temperature (it makes a data frame with a column called
# Temperature from the Temperature column in the Australia data)
newdataTemp <- data.frame(Temperature = Australia$Temperature)

# newdata for rainfall
newdataRain <- data.frame(Rainfall = Australia$Rainfall)

# now predict - this should be familiar now
PredictionsTemp <- predict(ModelTemp, newdata = newdataTemp, 
                           interval="prediction")

# now predict for rainfall, in the same way as for temperature
PredictionsRain <- predict(ModelRain, newdata = newdataRain, 
                           interval="prediction")

Now you have predictions for the biomass in g/m2 for the whole of Australia. If you look at the results, the numbers are very large. This is because biomass is in grams. From the numbers alone it is difficult to interpret these predictions, so you ask another member of your team to plot the results back onto the map of Australia. You have already noticed that the direction of the pattern is the same for the upper and lower limits of the prediction interval, even if the actual biomass weight has uncertainty. So you only ask for a plot of the relative biomass (i.e. it is colour coded as high, medium or low rather than by absolute biomass values). This creates a simplified, easily interpretable graph, but we should not forget about uncertainty when interpreting results.
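To make the idea of relative biomass concrete, here is a sketch of how predictions could be grouped into low, medium and high classes. How your teammate actually defined the classes is not stated, so splitting at the tertiles with quantile() and cut() is only an assumption. The lab data and the Australia temperatures are simulated, invented values.

```r
# SIMULATED stand-in data (invented values, not the real data)
set.seed(1)
Alien <- data.frame(Temperature = runif(50, 5, 35))
Alien$Biomass <- 500 + 40 * Alien$Temperature + rnorm(50, sd = 100)
ModelTemp <- lm(Biomass ~ Temperature, data = Alien)
Australia <- data.frame(Temperature = seq(10, 30, length.out = 9))

newdataTemp <- data.frame(Temperature = Australia$Temperature)
PredictionsTemp <- predict(ModelTemp, newdata = newdataTemp,
                           interval = "prediction")

# the result is a matrix with columns fit (prediction),
# lwr and upr (limits of the prediction interval)
head(PredictionsTemp)

# ASSUMPTION: classes are defined by the tertiles of the predicted values
breaks <- quantile(PredictionsTemp[, "fit"], probs = c(0, 1/3, 2/3, 1))
RelativeBiomass <- cut(PredictionsTemp[, "fit"], breaks = breaks,
                       labels = c("low", "medium", "high"),
                       include.lowest = TRUE)
table(RelativeBiomass)
```

The classes only rank locations against each other. The lwr and upr columns are where the uncertainty in the absolute values can be seen.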

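[Figure: maps of Australia showing relative predicted biomass (high, medium, low) from the separate temperature and rainfall models]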

3. Based on these results, where would you recommend the company focusses its containment resources?

Your team have made some predictions based on using temperature and rainfall separately. But you know that you can put both variables into a combined multiple regression model. So you now decide to do this.

4. Write a line of R code (and run it) to run a multiple linear regression including temperature and rainfall.

5. How can we write the multiple linear regression as an equation? Hint: think about how this is different to a simple regression

6. Compare the estimates from the multiple regression to those from the individual regressions. Include coefficients, confidence intervals, and R squared.

Remember you can also plot the results. To do this you should plot each variable against biomass separately, i.e. biomass against temperature and biomass against rainfall. (Example below)

R hints:

# EXAMPLE:

# make the plot of temperature and biomass
plot(Alien$Temperature, Alien$Biomass)

# save the coefficient values as an object
# contains 3 values, intercept, slope for temperature,
# slope for rainfall
coefs <- coef(ModelBoth)

# use the coefficients to plot the line for temperature
abline(a=coefs[1]+(mean(Alien$Rainfall)*coefs[3]), b=coefs[2], col=2)
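The example above draws the temperature line with rainfall held at its mean, which is why the mean rainfall times its slope is added to the intercept. A sketch of the matching plot for rainfall, with temperature held at its mean, could look like this. The data are simulated, invented values, and the model formula is one way of writing the model rather than the only one.

```r
# SIMULATED stand-in data (invented values, not the lab data)
set.seed(1)
Alien <- data.frame(Temperature = runif(50, 5, 35),
                    Rainfall = runif(50, 0, 1000))
Alien$Biomass <- 500 + 40 * Alien$Temperature + 2 * Alien$Rainfall +
  rnorm(50, sd = 100)

ModelBoth <- lm(Biomass ~ Temperature + Rainfall, data = Alien)
coefs <- coef(ModelBoth)  # intercept, slope for temperature, slope for rainfall

# make the plot of rainfall and biomass
plot(Alien$Rainfall, Alien$Biomass)

# line for rainfall: temperature is fixed at its mean, so its contribution
# (mean temperature * temperature slope) moves into the intercept
abline(a = coefs[1] + (mean(Alien$Temperature) * coefs[2]), b = coefs[3],
       col = 2)
```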

7. How does R find the estimates of the parameters for these linear regressions? Hint: think back to earlier weeks - it's the same

8. Perform model checking of the multiple linear regression. Include 3 model checking plots and decide if you think the fit is OK.

Hint: code available in Exercise 4
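The Exercise 4 code is not reproduced here. As a reminder of the general idea, the sketch below draws three common checking plots with base R functions: residuals against fitted values, a normal Q-Q plot of the residuals and Cook's distance. It may differ from the code you used in Exercise 4, and the data are simulated, invented values.

```r
# SIMULATED stand-in data (invented values, not the lab data)
set.seed(1)
Alien <- data.frame(Temperature = runif(50, 5, 35),
                    Rainfall = runif(50, 0, 1000))
Alien$Biomass <- 500 + 40 * Alien$Temperature + 2 * Alien$Rainfall +
  rnorm(50, sd = 100)
ModelBoth <- lm(Biomass ~ Temperature + Rainfall, data = Alien)

par(mfrow = c(1, 3))  # three plots side by side

# 1. residuals vs fitted: look for curvature or changing spread
plot(fitted(ModelBoth), resid(ModelBoth),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# 2. normal Q-Q plot: points should lie close to the line
qqnorm(resid(ModelBoth))
qqline(resid(ModelBoth))

# 3. Cook's distance: look for points with unusually large influence
plot(cooks.distance(ModelBoth), type = "h", ylab = "Cook's distance")

par(mfrow = c(1, 1))  # reset the plotting layout
```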

Now you have checked your model and hopefully are either happy with it or have fixed any assumptions that were violated. So now you can generate new predictions of biomass in Australia based on the multiple regression model. This is still not the full picture, remember to look at the uncertainty too. You can do this by looking at your object containing the predictions.

9. Predict where the organism will spread to, based on the multiple regression.

Hint: You can do this by editing the prediction code above so it predicts from the multiple regression. Now you also only need one newdata object with two columns.
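A sketch of what a single newdata object with two columns might look like is given below. The names newdataBoth and PredictionsBoth are made up for this illustration, and both the lab data and the three Australia locations are simulated, invented values.

```r
# SIMULATED stand-in data (invented values, not the real data)
set.seed(1)
Alien <- data.frame(Temperature = runif(50, 5, 35),
                    Rainfall = runif(50, 0, 1000))
Alien$Biomass <- 500 + 40 * Alien$Temperature + 2 * Alien$Rainfall +
  rnorm(50, sd = 100)
ModelBoth <- lm(Biomass ~ Temperature + Rainfall, data = Alien)
Australia <- data.frame(Temperature = c(15, 22, 28),
                        Rainfall = c(300, 800, 150))

# one newdata object, with one column per explanatory variable;
# the column names must match those used to fit the model
newdataBoth <- data.frame(Temperature = Australia$Temperature,
                          Rainfall = Australia$Rainfall)

PredictionsBoth <- predict(ModelBoth, newdata = newdataBoth,
                           interval = "prediction")
PredictionsBoth

# width of each prediction interval, as a quick look at the uncertainty
PredictionsBoth[, "upr"] - PredictionsBoth[, "lwr"]
```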

[Figure: map of Australia showing relative predicted biomass from the multiple regression]

10. What would you recommend in terms of deployment of resources based on your predicted spread? Has this changed from before?

11. Can you draw any biological conclusions about this alien species? Can you say anything about the relative influence of temperature and rainfall? Are the variables in this analysis enough? What else could have an influence?

12. Why do we use multiple regressions rather than running individual simple linear regressions?