Exercise 5: Alien escape

SOLUTION

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are bold

Questions appear in blue.

Answers are in purple.

Needs to be completed and handed in by 5th March 23:59


### Rationale:

This week you will be looking at multiple regression, working out when to use it, practicing how to use it, and interpreting results.


R hints and general help

Resources:

R this week:

New functions:

Things to remember:

A bit more on the predict() function

You have been using the predict() function for a few weeks now, so it seemed like a good time to have a quick recap of how the function works.

The main arguments we use for predict are: object = YourModelObject, newdata = a dataframe of data you want to predict for, and interval = either prediction or confidence.

A few comments on the arguments:


The challenge: Will we be taken over by aliens?

It is a time in the near future, space travel has become more common. Ships leave from earth every few months on exploration missions of the galaxy. Ship “Explorer 5” has been searching to see if any life on other planets could be useful on earth. During the mission the crew found a planet with life and collected some organisms. The alien organism 101 is a multicoloured, modular, plant-like organism (the plant-animal divide is less clear for aliens). The organism seems to grow well in air similar to earth's atmosphere and could be useful as it grows flower-like structures made of gold.

Alien picture

On their way back to earth the scientific crew of “Explorer 5” have been conducting experiments on alien 101 to try and determine the conditions under which it grows best. The scientists grew alien 101 in containers under different temperatures (ºC) and rainfall (mm) conditions. They recorded the weight of biomass (g) of alien 101 per square metre after one week.

Unfortunately, on the journey back, the ship hit a satellite and has crash landed on Australia. The crew were able to survive, but the experimental lab has been badly damaged. Alien 101 has been released into the wild on earth.

You are a local team of scientists assisting the space travel company with predicting how the organism might spread. The company want to know where to focus their containment resources to stop the organism from taking over. Even if it could be useful, they don't know how it could damage earth's wildlife.

It is your job to predict where the alien organism will spread to and recommend how to deploy resources


The data can be found at https://www.math.ntnu.no/emner/ST2304/2021v/Week07/Alien101LabData.csv

The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

Alien <- read.csv("https://www.math.ntnu.no/emner/ST2304/2021v/Week07/Alien101LabData.csv", 
                     header=TRUE)

A1. Plot the data using the pairs() function. What can you tell from the plots?

pairs(Alien)

plot of chunk unnamed-chunk-2

The plot shows the paired relationships between the three variables: temperature, rainfall and biomass. It looks like they are have relationships with each other. Temperature and rainfall have a negative relationship (it looks like it anyway), rainfall and biomass seems negative too, but temperature and biomass look positive.

A2. Have a look at the raw data from Explorer 5, do you think rainfall and temperature were tested separately in their experiments? Why/why not?

Hint Here you need to look at the raw data i.e. the data frame you have. head() is one function you can use to do this. or click on the data object in the Environment panel on the top left.

You need to think about how the way data are collected relates to the way it is presented. For instance, if you tested temperature and rainfall separately, how would you write this into a table? How many columns would you have? What would each row represent?

If you aren't sure, pretend you are designing the experiment from the beginning and make a draft of the table you would use.


If you look at the raw data i.e. the numbers in the actual data frame, you will see that both temperature and rainfall are in each row, with only one biomass measure. So, it seems both rain and temperature were changed at the same time in the experiments. If they had not been you would expect one to be held fixed while the other changed or to be 0.

A3. Run a simple linear regression for each of i) temperature and biomass, ii) rainfall and biomass. Give the coefficient estimates (of each parameter), the confidence intervals, and say how much variance each model explains.

Remember to put the explanatory variable and the response in the right places, you might need to think about which is which.

I've forgotten how to do this!

use lm() for the model, coef() to see the estimates, confint() to get confidence intervals, and summary()$r.squared to see how much variance is explained.

If none of this is familiar - look over the exercises from previous weeks!


ModelTemp <- lm(Biomass ~ Temperature, data = Alien)
ModelRain <- lm(Biomass ~ Rainfall, data = Alien)

coef(ModelTemp)
## (Intercept) Temperature 
## 4998.396162    0.407344
coef(ModelRain)
##   (Intercept)      Rainfall 
##  5.011926e+03 -6.780394e-03
confint(ModelTemp)
##                    2.5 %       97.5 %
## (Intercept) 4996.9335738 4999.8587502
## Temperature    0.3445732    0.4701147
confint(ModelRain)
##                     2.5 %        97.5 %
## (Intercept)  5.010161e+03  5.013690e+03
## Rainfall    -8.888233e-03 -4.672554e-03
summary(ModelTemp)$r.squared
## [1] 0.6285664
summary(ModelRain)$r.squared
## [1] 0.2936914

The coefficient estimates and confidence intervals are listed above. The R squared tells us how much variance is explained, which is 63% for the temperature model and 29% for the rainfall model.

A4. Interpret the results of the separate linear regressions. What do the results suggest about the ideal conditions for alien 101?

Separately the models suggest a positive relationship between biomass and temperature (confidence intervals do not span 0) and a negative relationship between rainfall and biomass (confidence intervals also do not span 0). As the confidence intervals of each relationship do not span 0, then no relationship is outside of the plausible range for the true population parameter. We cannot say anything about how probable it is that the true parameter = any number. Only that when we including uncertainty, we still get relationships in the estimated direction. The effect of temperature is between 0.34 grams per degree and 0.47 grams per degree. The effect of rainfall is between -0.009 grams per mm and -0.005 grams per mm. Altogether this suggests that the ideal conditions for alien 101 are hot and dry (these conditions would increase biomass).

You now have some results about how temperature and rainfall (individually) influence the growth of alien 101. But you want to generate predictions of how it could spread around Australia. You can do this by creating predictions of the amount of biomass for the actual average annual temperature and rainfall of Australia. The company have provided you with very simplified data of the average annual temperature and total annual rainfall for different coordinates in Australia. Found at: https://www.math.ntnu.no/emner/ST2304/2021v/Week07/AustraliaEnvironmentalData.csv

This is plotted below.

Using this you can try to guess where the alien might spread to based on what you have found out about the influence of temperature and rainfall on alien 101's growth. Assume that amount of biomass is an indicator of how suitable the conditions are for alien 101.

plot of chunk unnamed-chunk-4

While guessing is ok. It would be better to predict actual numbers based on our linear models. To do this, you will need to:

A5. Predict biomass of alien 101 across Australia.

To make the predictions you will use the data on environmental conditions in Australia. These will become your new explanatory variable (x) values that you want to predict a response (y) from.

You need to create an object to be your newdata argument for the predict function. data.frame() is a good function to use for this. You will need two because you have two different models, one with temperature and one with rainfall.

Remember that the column names in newdata must be the same as those used to make your model.

If it doesn't work - check your lm() is written as described in the start of this document.

Make sure to read all comments. You may need to change the name of the datafile to match what you called the Australia data.

# First thing for the prediction is to create the newdata object
# you will need as an argument for the predict() function.

# Here you need two, one for temperature and one for rainfall.
# Newdata needs to be a datafame so the function data.frame() is very
# useful here. My data is called Australia and my models are ModelTemp
# and ModelRain.
# Column names in newdata MUST be identical to those in the
# data used to make the model (the lab data). 

# newdata for temperature (it makes a dataframe with a column called 
# Temperature from the Temp column in the Australia data)
newdataTemp <- data.frame(Temperature = Australia$Temp)

# newdata for rainfall
newdataRain <- data.frame(Rainfall = Australia$Rain)

# now predict - this should be familiar now
PredictionsTemp <- predict(ModelTemp, newdata = newdataTemp, 
                           interval="prediction")
head(PredictionsTemp)
##        fit      lwr      upr
## 1 4999.618 4993.228 5006.009
## 2 4999.618 4993.228 5006.009
## 3 4999.618 4993.228 5006.009
## 4 4999.917 4993.535 5006.300
## 5 4999.917 4993.535 5006.300
## 6 4999.917 4993.535 5006.300
# now predict - this should be familiar now
PredictionsRain <- predict(ModelRain, newdata = newdataRain, 
                           interval="prediction")
head(PredictionsRain)
##        fit      lwr      upr
## 1 5008.300 4999.618 5016.983
## 2 5008.169 4999.489 5016.849
## 3 5007.380 4998.707 5016.054
## 4 5008.432 4999.748 5017.116
## 5 5008.300 4999.618 5016.983
## 6 5008.169 4999.489 5016.849


Now you have predictions for the biomass in g/m2 for the whole of Australia. If you look at the results, the numbers are very large. This is because biomass is in grams. From looking just at the numbers it is difficult to interpret these predictions.

Another member of your team (Emily) has created a function in R to plot the predicted results back onto the map of Australia. It is quite a complex function and uses the package ggplot2. You can google this if you want to learn more but it can make some very pretty plots!

To use the function, first install the following packages using install.packages():

Then load them using the library() function. e.g. library(ggplot2)

The plotting function is available here: https://www.math.ntnu.no/emner/ST2304/2021v/Week07/plotting_function.R. You can bring the function into your R by using the source() function.

What does source really do?

The source() function is something we can use to run all of the code in an R script. We use this in this course to provide you with functions that we write ourselves. It is a bit like bringing in data, but instead of data, it is R code. Have a go at sourcing the script for this week using source() and using the url above as the argument. You should see that in the right hand Environment panel you now have a function called plotting_function(), it is ready to use.

plotting_function() was written by Emily, it takes two input arguments: Input_data which is the data for Australia and Predictions which is your prediction object.

A6. Plot your predictions using plotting_function().

# source the script
source("https://www.math.ntnu.no/emner/ST2304/2021v/Week07/plotting_function.R")

plotting_function(InputData = Australia, Predictions = PredictionsRain[,1])

plot of chunk unnamed-chunk-6

plotting_function(InputData = Australia, Predictions = PredictionsTemp[,1])

plot of chunk unnamed-chunk-6

A7. Based on these results where would you recommend the company focusses its containment resources? Include some mention of uncertainty in your answer.

The temperature predictions suggest high biomass only in the very north of Australia. This would suggest only a small number of resources needed in the this area. The rainfall predictions suggest high biomass in the very middle of Australia. To combine you could have resources in the middle and only in the very north. The rest of the island has low predicted biomass so does not need high resources.


Part B: Multiple regression

Your team have made some predictions based on using temperature and rainfall separately. But you know what you can put both variables into a combined multiple regression model. So you now decide to do this.

B1. Write a line of R code (and run it) to run a multiple linear regression including temperature and rainfall.

If you are not sure how to do this, have a look at question C3 from Exercise 4. In Exercise 4 you added a quadratic term to your model, this was actually creating a type of multiple regression. You add any other explanatory variables in the same way, but without I or ^2.

ModelBoth <- lm(Biomass ~ Temperature+Rainfall, data = Alien)

coef(ModelBoth)
##   (Intercept)   Temperature      Rainfall 
##  5.000778e+03  3.631127e-01 -1.986818e-03
confint(ModelBoth)
##                     2.5 %        97.5 %
## (Intercept) 4998.20254744  5.003354e+03
## Temperature    0.28984081  4.363846e-01
## Rainfall      -0.00377109 -2.025460e-04
summary(ModelBoth)$r.squared
## [1] 0.6463725

B2. Write the multiple linear regression as a mathematical equation? E.g. Y=…. Hint: think about how this is different to a simple regression

You can write it as matrices or as a sum of all betas multiplied by all Xs. Writing as a matrix is neater and easier. It will also be useful for other linear models. e.g. \(Y_i\) = \(\alpha\) + \(\beta_1\)\(x_1i\) + \(\beta_2\)\(x_2i\) or \(Y\) = \(X\)\(\beta\) + \(\epsilon\) (matrix)

B3. Compare the estimates from the multiple regression to those from the individual regressions. Include coefficients, confidence intervals, and R squared

Remember you can also plot the results. To do this you should plot each variable against biomass separately i.e. biomass against temperature and biomass against rainfall but you will need to adjust the line you are plotting with the mean value of the other variable.

To do this, you can still use abline()

The estimate values have changed a bit. They are still in the same direction as before. The mean estimate for temperature stayed the same at 0.36 but the uncertainty decreased from 0.34, 0.47 to 0.29, 0.44. The effect of rainfall decreased -0.007 to -0.002 and the uncertainty still does not span zero. (Confidence intervals = -0.0038, -2 × 10-4. The R-squared has not increased much from the individual models (0.65 for joint and 0.63 for temperature model), it is certainly not the sum of the two. The two variables do not explain completely different variation.

# EXAMPLE:

# make two plot window
par(mfrow=c(1,2))

# make the plot of temperature and biomass
plot(Alien$Temperature, Alien$Biomass,
     pch = 19, col = "red",
     ylab = "Biomass", xlab = "Temp")

# save the coefficient values as an object
# contains 3 values, intercept, slope for temperature,
# slope for rainfall
coefs <- coef(ModelBoth)

# use the coefficients to plot the line for temperature
abline(a=coefs[1]+(mean(Alien$Rainfall)*coefs[3]), b=coefs[2], col="black")

# make the plot of Rainfall and biomass
plot(Alien$Rainfall, Alien$Biomass,
          pch = 19, col = "blue",
          ylab = "Biomass", xlab = "Rain")

# save the coefficient values as an object
# contains 3 values, intercept, slope for temperature,
# slope for rainfall
coefs <- coef(ModelBoth)

# use the coefficients to plot the line for temperature
abline(a=coefs[1]+(mean(Alien$Temperature)*coefs[2]), b=coefs[3], col="black")

plot of chunk unnamed-chunk-8

B4. How does R find the estimates of the parameters for these linear regressions? Hint: think back to earlier weeks - its the same

Maximum likelihood, we are still using it here.

B5. Perform model checking of the multiple linear regression. Include 3 plots of model checking and decide if you think the fit is ok

Hint: code available in Exercise 4

# Take rounded fitted and residual values from the multiple regression
# model
BothFitted <- round(fitted(ModelBoth),2)
BothResiduals <- round(residuals(ModelBoth),2)

# Plot the fitted versus residuals 
plot(BothFitted, BothResiduals)
# add a horizontal line at 0
# line is grey and dashed (lty=2)
abline(h=0, lty=2, col="grey")

plot of chunk unnamed-chunk-9

# Plot a normal Q-Q plot
qqnorm(BothResiduals)
qqline(BothResiduals)

plot of chunk unnamed-chunk-9

# Plot Cook's distance
plot(ModelBoth, which=4)

plot of chunk unnamed-chunk-9

I am happy with the fitted versus residuals. It seems to meet the linearity assumption, there is a slight funnel shape. But not very strong. The Normal QQ shows that it also meets the normality assumption well - much better than we would usually expect for biological data. The Cook's distance identifies 3 outliers, but these are not far from the rest of the data points and do not seem to have a big influence so I would not remove them. I would not find a strong reason to do so. They do not seem to be typos.

Now you have checked your model and hopefully are either happy with it or have fixed any assumptions that were violated. So now you can generate new predictions of biomass in Australia based on the multiple regression model. This is still not the full picture, remember to look at the uncertainty too. You can do this by entering the upper or lower prediction as the Predictions argument in plotting_function(). To do this you need to use indexing like last week. The lower confidence interval is column 2 in the output from the predict() function e.g.

Predictions <- predict(YOURMODEL, newdata=newdata, interval="prediction")
Predictions[,2] # this bit gives the second column

B6. Predict where the organism will spread to, based on the multiple regression. Plot it using plotting_function().

Hint: You can do this by editting the prediction code above so it predicts from the multiple regression. Now you also only need one newdata object with two columns.

Try editting the code from A5 and A6

Australia = dataset of environmental conditions in Australia.

This shows code to plot the mean predictions. To check how uncertainty changes the patter you can change which column of the object PredictionsBoth that you plot.

PredictionsBoth <- predict(ModelBoth, newdata = data.frame(Temperature = Australia$Temp,
                                                           Rainfall = Australia$Rain), 
                           interval="prediction")

plot of chunk unnamed-chunk-12

B7. What would you recommend in terms of deployment of resources based on your predicted spread? Has this changed from before?

Hopefully it has now changed from focussing purely on the very north and middle to now focussing less on the middle and more on only the north of Australia. The multiple regression gives a slightly different picture than having two individual regression analyses. The pattern is similar but the multiple regression predicts a smaller area of highest biomass, but covering more of the north. It seems rain is not so important when we also include temperature, this is because we are uncertain about the direction of rainfall effect. Without the multiple regression you might have incorrectly predicted how many resources to use and where. You might have wasted a lot of money, or done a bad job by trying to cover too big an area but missing others.


Part C: Reflection

C1. Can you draw any biological conclusions about this alien species? Can you say anything about the relative influence of temperature and rainfall? Are the variables in this analysis enough, what else could have an influence?

Alien 101's growth is influenced by both temperature and rainfall. The effects are in opposite directions. The alien grows best in warm but dry conditions. We would expect it to spread in desert like conditions. But the effect of rainfall is not certain in direction. Could draw any links to biology we know about deserts. We cannot say much from this analysis about the relative influences of temperature and rainfall, because they have been measured on very different scales. Therefore the beta (slope) values are not directly comparable. If we were to standardise our explanatory variables to remove the units (divide by standard deviation and center on the mean) then we could do a comparison.

We could include any other variables that are biologically sensible e.g. herbivory, nutrients, human pressure, pollution. Basically what we have is not enough for a complete picture but it is a good start. We are also missing any explicit consideration of space or connectivity. We do not assume that any areas are similar beyond rain or temperature. Nor do we look at where the space station crashed and any barriers to disperal. These could all be added. This is not a full list of all possible additions. Other ideas could be right too, this is just an example.

C2. Why do we use multiple regressions rather than running individual simple linear regressions?

Rarely does a single variable influence something. This could be the case for carefully designed experiments. But often many variables influence our observations and we should take account of these. As we saw here, it can change what we would predict and the conclusions we can draw. But we don't want to include everything. In this case, because we know that both temperature and rainfall were changed at the same time in the experiment, we should include them both in our model. They did not occur in isolation. Models should still be a simplification of reality. More on this in model selection.


Part D: Feedback

D1. How do you think this exercise went? What do you think your group did well, what are you less sure about? (2 examples of each)

D2. What do you think you improved from last week?

D3. Are there any concepts you are very unsure of?

D4. What would you like feedback on this week?