This module accompanies the lectures from week 9.
This document contains information, questions, R code, and plots.
Hints and reminders are in bold
Questions appear in blue.
# R code looks like this.
# Comments are in grey and explain the code
2+2 # add two and two
## [1] 4
# outputs from R appear like the above ^^^^
lm()
and interpretting outputlm()
resultsIf you want a recap on variables and data types then go here: https://qmbio.math.ntnu.no/theory/all-about-data/
If you fit a linear model to following datasets what would you be asking each time?
Dataset 1: Data collected on wing length of butterflies from three different populations; one at sea level, one half-way up a mountain, and one at the top. Wing length is in mm, the populations are “low”, “mid”, and “high” altitude. There are 100 measurements per population.
You would be asking if there is a difference
in mean wing length between different populations. This is because we
would assume that altitude might explain wing length (you could maybe
argue the other way, but I think that would better be addressed looking
at wing length and ability to fly up or something similar) and altitude
here is represented by the variable population, which is categorical. This
means you should expect the output of the 'lm()
to give a mean and some differences.
Dataset 2: Data collected on body weight and length of frogs. Body weight is in grams and length in mm. There are 200 measurements.
You would be asking if there is a relationship between
body weight and length. This is because we
would assume that length would influence weight and these are both continuous variables. Therefore
you should expect the output of the 'lm()
to give an intercept and a slope, which
represent a continuous straight line.
Dataset 3: Data collected on bacteria growth rate under two different treatments. The first treatment controls food and is “low” and “high”. The second concerns the presence of a fungus and is “present” or “not”. There are 70 observations per treatment.
You would be asking if there is a difference
in mean bacteria growth rate between different food or fungus treatments. Growth rate
is the response variable and the treatments are the explanatory variables, which are both categorical. This
means you should expect the output of the 'lm()
to give a mean and some differences. It is
not completely clear from the description if this experiment had a full factorial design i.e. if
all combinations of both treatments were tried. So you could test an interaction or not,
both are correct.
How does R find estimates of the parameters (the coefficients)?
Using maximum likelihood estimation.
Write tips to your future self on how to decide which model to use when.
If I were to do this, I would emphasise checking what kind of data your variables are made from e.g. categorical or continuous and thinking about which is a response and which is explanatory.
Last week you had data from an experiment carried out at Rothamsted research lab in the UK.
# Import the data
Yields <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week08/Hoosfield_Yields.csv")
# Make sure to relevel the time variable
Yields$After1970 <- relevel(Yields$After1970, ref="Before") # Make before the intercept
The data had four fertiliser treatments: control, manure, fertilised, stopped. You analysed the effect of these different treatments last week using t-tests, linear models, and one-way ANOVA ideas.
Towards the end of last week Bob told you that in 1970 something changed in the experiment. We don't know exactly what changed, but we want to account for it. So, we introduced an extra variable:
After1970: with levels “Before” and “After”.
Run a model with both variables in.
Look at the output, what do the coefficients mean? Which parameters in the linear equation do they relate to?
\[ y_i = \alpha + \beta x_i + \varepsilon_i \]
Check your answer!
# Run the model
modelBoth <- lm(yield ~ Treatment+After1970, data = Yields)
# Look at coefficient estimates
coef(modelBoth)
## (Intercept) TreatmentFertilised TreatmentManure TreatmentStopped After1970After
## 0.6110417 1.9616667 3.0122222 0.8094444 0.9035417
# Look at uncertainty
confint(modelBoth)
## 2.5 % 97.5 %
## (Intercept) 0.2525636 0.9695197
## TreatmentFertilised 1.4836959 2.4396374
## TreatmentManure 2.5342515 3.4901930
## TreatmentStopped 0.3314737 1.2874152
## After1970After 0.5450636 1.2620197
The code to run the model and print the estimates, with uncertainty, is shown above.
The interpretation of the coefficients in terms of the linear equation is as follows:
(Intercept) = \(\alpha\) the \(y\) value when \(x\) = 0. In this case, \(x\) doesn't actually have numeric values, instead they are groups. These can be coded as dummy variables, which do have numeric values of 0 for one group and 1 for the other. In this case \(x\) = 0 = the control treatment group Before 1970. So, when \(x\) = that group then the \(y\) value = 0.611 = mean of this first group.
TreatmentFertilised = \(\beta_1\) the slope of the effect of Fertilised. Any \(\beta\) value tells us the amount of change in \(y\) for a change of 1 in \(x\). In this case, a change of 1 in \(x\) = movement of one group. This value captures the change from the first group to the Fertilised group (still Before 1970). It is a difference between the two group means. If you add this estimate to the Intercept value you should get the mean of the Fertiliser group before 1970.
TreatmentManure = \(\beta_2\) the slope of the effect of Manure. This is the same as \(\beta_1\) but for the Manure group. It is still relative to the Intercept group (still Before 1970). If you add this estimate to the Intercept value you should get the mean of the Manure group before 1970.
TreatmentStopped = \(\beta_3\) the slope of the effect of Stopped treatment. This is the same as \(\beta_1\) but for the Stopped group. It is still relative to the Intercept group (still Before 1970). If you add this estimate to the Intercept value you should get the mean of the Stopped group before 1970.
After1970After = \(\beta_4\) the slope of the effect of time After 1970. This is the same as \(\beta_1\) but for the After1970 group, this is for the second variable. It is still relative to the Intercept group and the treatment level will be control. If you add this estimate to the Intercept value you should get the mean of the Control group after 1970.
The uncertainty numbers are for each of the effects listed about. So, the intercept row is uncertainty in a mean, but the rest are uncertainty of differences.
To find the mean of the different Treatments, After 1970, you would need to add the estimate of the Treatment effect to the estimate of the effect of time and add both to the Intercept.
Now you know what the coefficients should mean, check them by calculating the mean of each group in your data.
Code hint.
The code below shows you how you can easily (easy in terms of less typing)
take the mean of each group in the data. Remember that []
are index brackets, they
tell R, “I want something inside the object I just mentioned” e.g. object[1]
takes the
first entry in object
.
The ==
means matching and selects only entries that match what follows the ==
.
So, in words mean(Yields$yield[Yields$Treatment == "Control" & Yields$After1970 == "Before"])
tells R to take the mean of the column yields
inside data frame Yields
but only for
rows where column Treatment
= Control and column After1970
= Before.
Why does it not add up?
Because there is an interaction and we are missing it.
Often in nature, effects are not separate but interact. This means that the effect of one variable is different for different values of another, e.g. plants might not be able to grow more in response to fertiliser, unless they also have enough light. So, the effect of fertiliser would be different in low and high light.
Often, we want to include interactions in our models. We do this by changing the +
in lm()
to
*
or :
. If you use :
need to use +
too, *
does everything. So we recommend
using *
! It's much easier. e.g. lm(Yield ~ Treatment * After 1970, data = Yields)
By adding in an interaction, we now change what R calculates from the model.
In the previous example, without an interaction, the output of the lm()
had
no row for the Fertilised, Manure, or Stopped treatments for the time period
After 1970. To work out what the estimated means of these groups
you would have had to add the effect of the fertiliser treatment to the control
and then add the effect of time after 1970. This assumes that the effect of
time is the same for each fertiliser treatment AND that the effect of
each fertiliser treatment is the same regardless of time.
But in reality, the effect of time might not be the same for each fertiliser treatment and vice versa, this is an interaction. So, when you fit a model with an interaction in R it fits a separate effect for each combination of groups within the interacting variables e.g. one for fertiliser after 1970 and a separate effect for fertiliser before 1970 and a separate one for control after 1970 etc.
When you run a model with an interaction, you should now get a row for each combination of time and fertiliser treatment.
Run the interactive model.
What do these coefficients mean? Which parameters do they relate to?
Check your answer!
The (Intercept) row is still the mean of the control group before 1970. Each row that is labelled with 'Treatment' followed by a group name e.g. 'TreatmentFertilised' (but without a :) represent the difference in mean between the (Intercept) group and the named group (all before 1970). E.g. Fertilised before 1970 group has a mean 1.6225 more than the Control group before 1970.
The row that is labelled with After1970After gives the difference in mean between the (Intercept) group and the Control group after 1970.
The last few numbers are the interactions. They all incluce a :
which
tells you they are an interaction. They represent the difference between the
mean of the Treatment group before 1970 and the mean of the Treatment
group after 1970. For example, the row labelled TreatmentManure:After1970After
represents the difference between the mean for the Manure treatment
before 1970 (0.89750000 + 2.14083333) and the mean for the Manure
treatment after 1970. After 1970, the Manure Treatment had a mean yield
approximately 2.6 higher than before 1970.
# Run the model
modelBothI <- lm(yield ~ Treatment*After1970, data = Yields)
# Look at coefficient estimates
coef(modelBothI)
## (Intercept) TreatmentFertilised
## 0.89750000 1.62250000
## TreatmentManure TreatmentStopped
## 2.14083333 0.87416667
## After1970After TreatmentFertilised:After1970After
## 0.04416667 1.01750000
## TreatmentManure:After1970After TreatmentStopped:After1970After
## 2.61416667 -0.19416667
# Look at uncertainty
confint(modelBothI)
## 2.5 % 97.5 %
## (Intercept) 0.6204900 1.1745100
## TreatmentFertilised 1.2307487 2.0142513
## TreatmentManure 1.7490820 2.5325847
## TreatmentStopped 0.4824153 1.2659180
## After1970After -0.4356288 0.5239621
## TreatmentFertilised:After1970After 0.3389668 1.6960332
## TreatmentManure:After1970After 1.9356334 3.2926999
## TreatmentStopped:After1970After -0.8726999 0.4843666
Do things add up now? (we are not aiming for perfection)
Mostly. Not perfect, but close.
What can you conclude from these results?
Need to use confidence intervals to get conclusions. Can conclude that in the period before 1970 all fertiliser treatments seem to have a positive effect on yield compared to the control. All have estimates of positive mean difference, manure is highest. The effect of time is different for each treatment. For control, no clear direction of pattern (CI span 0. For Fertilised there was an interaction and you see an extra increase in mean after 1970 (0.34 to 1.7 increase) same for manure, but even bigger extra increase (1.94 to 3.29) but for the stopped treatment, you do not see an interaction (effect spans 0). So control and stopped show no difference with time. Makes me wonder if they changed the type of manure/fertiliser in 1970? But we don't know! Remember, to find the mean of the Treatment groups e.g. fertilised, stopped, and manure, for after 1970 you must add the intercept the effect of time, the effect of the treatment AND the interaction effect.
What about continuous data? So far, treated both separately but there are many scenarios where we would want to have both types of explanatory variable in a single analysis.
Example 1: Body length and temperature and water treatment. Scientists measured the body length of insects in micrometers at different temperatures either with or without rainfall.
Example 2: Lay date and insect biomass and habitat. Scientists (including Emily) measured the date in the year when birds lay their eggs (it can take any value even 1.5 or -2) in different habitats in a woodland: good, bad, ok and also measured the biomass of insects close to each nest box. Biomass is measured in grams and there are several nest boxes in each habitat.
Example 3: Tail length and dinosaur family and age of fossil. Palentologists measured the tail length of fossils in cm and recorded the family the dinosaur belonged too and the age of the fossil in million years.
In the examples, which variable is categorical and which is continuous?
Example 1 = categorical (water) and continuous (temperature). Example 2 = categorical (habitat) and continuous (biomass). Example 3 = categorical (family) and continuous (fossil age).
The data for example 1 (made up) can be found here: https://www.math.ntnu.no/emner/ST2304/2020v/Week09/BodyLengthExperiment.csv
It is a csv file and has column headers.
Run an interactive model for example 1.
# Import data
BodyLength <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week09/BodyLengthExperiment.csv",
header=T)
# create the lm object
BodyLengthModel <- lm(length ~ temperature*water, data = BodyLength)
# extract the coefficient estimates
coef(BodyLengthModel)
## (Intercept) temperature waterYes temperature:waterYes
## 46.365831 5.191970 25.267954 -3.643074
# extract confidence intervals
confint(BodyLengthModel)
## 2.5 % 97.5 %
## (Intercept) 31.804175 60.927487
## temperature 4.239380 6.144560
## waterYes 4.674663 45.861245
## temperature:waterYes -4.990240 -2.295909
What do these coefficients mean? Which parameters do they relate to?
The (Intercept) is the intercept of the line for temperature when water = No. The next row is the slope of the effect of temperature when water = No. Then the next is the effect of water = Yes, in this case a change in intercept of the line. The final number is the interaction so the change in the effect of temperature when water = Yes, it is a difference in slope.
Try and draw out the results on your whiteboard.
When drawing think about what data you have and how you can best represent it in a graph. Think about what the response variable is, what the explanatory variables are and how you would like to model them e.g. a straight line or several? Or differences in means etc. Don't worry, we will go through this afterwards so it doesn't matter if you are wrong, just have a go!
PAUSE HERE
Correct your drawing of the results if you need to.
Interpret output in biological terms.
Temperature has positive effect on body length (warmer = longer). The strength of that effect is bigger when there is no water. But the effect of water itself, is to increase body length. Large uncertainty in the effect of water, but still doesn’t cross 0
Was an interaction term needed in this model?
Yes, the interaction effect was not close to 0 and the confidence intervals did not span 0.
Look at the model outputs above, match the following model types to the output:
They are all in different styles and I have deliberately given no information on the type of data. You have to work it out from the output.
Example 1
Example 2
Example 3
Example 4: lm(time ~ mean_max_temp1*April_lay)
How would interpretation bet different for a) a model with an interaction and the one with one categorical and one continuous explanatory variable b) the same data but no interaction?
span style=“color:red”> With no interaction there will be the same slope for each group but with interaction there is a difference slope for the continuous variable within each group of the categorical.