Exercise Module 9: Interactions

Instructions:

This module accompanies the lectures from week 9.

This document contains information, questions, R code, and plots.

Hints and reminders are in bold

Questions appear in blue.

# R code looks like this.
# Comments are in grey and explain the code

2+2 # add two and two

## [1] 4

# outputs from R appear like the above ^^^^

Contents of module

Part A = Choosing a model
Part B = Two categorical explanatory variables
Part C = Interactions
Part D = Mixed continuous and categorical variables
Part E = Detective skills
Part F = Practice

R skills

Using lm() and interpretting output
Plotting lm() results

Part A: Choosing a model

If you want a recap on variables and data types then go here: https://qmbio.math.ntnu.no/theory/all-about-data/

If you fit a linear model to following datasets what would you be asking each time?

Dataset 1: Data collected on wing length of butterflies from three different populations; one at sea level, one half-way up a mountain, and one at the top. Wing length is in mm, the populations are “low”, “mid”, and “high” altitude. There are 100 measurements per population.

You would be asking if there is a difference in mean wing length between different populations. This is because we would assume that altitude might explain wing length (you could maybe argue the other way, but I think that would better be addressed looking at wing length and ability to fly up or something similar) and altitude here is represented by the variable population, which is categorical. This means you should expect the output of the 'lm() to give a mean and some differences.

Dataset 2: Data collected on body weight and length of frogs. Body weight is in grams and length in mm. There are 200 measurements.

You would be asking if there is a relationship between body weight and length. This is because we would assume that length would influence weight and these are both continuous variables. Therefore you should expect the output of the 'lm() to give an intercept and a slope, which represent a continuous straight line.

Dataset 3: Data collected on bacteria growth rate under two different treatments. The first treatment controls food and is “low” and “high”. The second concerns the presence of a fungus and is “present” or “not”. There are 70 observations per treatment.

You would be asking if there is a difference in mean bacteria growth rate between different food or fungus treatments. Growth rate is the response variable and the treatments are the explanatory variables, which are both categorical. This means you should expect the output of the 'lm() to give a mean and some differences. It is not completely clear from the description if this experiment had a full factorial design i.e. if all combinations of both treatments were tried. So you could test an interaction or not, both are correct.

How does R find estimates of the parameters (the coefficients)?

Using maximum likelihood estimation.

Write tips to your future self on how to decide which model to use when.

If I were to do this, I would emphasise checking what kind of data your variables are made from e.g. categorical or continuous and thinking about which is a response and which is explanatory.

Part B: Two categorical explanatory variables

Last week you had data from an experiment carried out at Rothamsted research lab in the UK.

# Import the data
Yields <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week08/Hoosfield_Yields.csv")

# Make sure to relevel the time variable
Yields$After1970 <- relevel(Yields$After1970, ref="Before") # Make before the intercept

The data had four fertiliser treatments: control, manure, fertilised, stopped. You analysed the effect of these different treatments last week using t-tests, linear models, and one-way ANOVA ideas.

Towards the end of last week Bob told you that in 1970 something changed in the experiment. We don't know exactly what changed, but we want to account for it. So, we introduced an extra variable:

After1970: with levels “Before” and “After”.

Run a model with both variables in.

Look at the output, what do the coefficients mean? Which parameters in the linear equation do they relate to?

\[ y_i = \alpha + \beta x_i + \varepsilon_i \]

Check your answer!

# Run the model
modelBoth <- lm(yield ~ Treatment+After1970, data = Yields)

# Look at coefficient estimates
coef(modelBoth)

##         (Intercept) TreatmentFertilised     TreatmentManure    TreatmentStopped      After1970After 
##           0.6110417           1.9616667           3.0122222           0.8094444           0.9035417

# Look at uncertainty
confint(modelBoth)

##                         2.5 %    97.5 %
## (Intercept)         0.2525636 0.9695197
## TreatmentFertilised 1.4836959 2.4396374
## TreatmentManure     2.5342515 3.4901930
## TreatmentStopped    0.3314737 1.2874152
## After1970After      0.5450636 1.2620197

The code to run the model and print the estimates, with uncertainty, is shown above.

The interpretation of the coefficients in terms of the linear equation is as follows:

(Intercept) = $\alpha$ the $y$ value when $x$ = 0. In this case, $x$ doesn't actually have numeric values, instead they are groups. These can be coded as dummy variables, which do have numeric values of 0 for one group and 1 for the other. In this case $x$ = 0 = the control treatment group Before 1970. So, when $x$ = that group then the $y$ value = 0.611 = mean of this first group.

TreatmentFertilised = $\beta_1$ the slope of the effect of Fertilised. Any $\beta$ value tells us the amount of change in $y$ for a change of 1 in $x$. In this case, a change of 1 in $x$ = movement of one group. This value captures the change from the first group to the Fertilised group (still Before 1970). It is a difference between the two group means. If you add this estimate to the Intercept value you should get the mean of the Fertiliser group before 1970.

TreatmentManure = $\beta_2$ the slope of the effect of Manure. This is the same as $\beta_1$ but for the Manure group. It is still relative to the Intercept group (still Before 1970). If you add this estimate to the Intercept value you should get the mean of the Manure group before 1970.

TreatmentStopped = $\beta_3$ the slope of the effect of Stopped treatment. This is the same as $\beta_1$ but for the Stopped group. It is still relative to the Intercept group (still Before 1970). If you add this estimate to the Intercept value you should get the mean of the Stopped group before 1970.

After1970After = $\beta_4$ the slope of the effect of time After 1970. This is the same as $\beta_1$ but for the After1970 group, this is for the second variable. It is still relative to the Intercept group and the treatment level will be control. If you add this estimate to the Intercept value you should get the mean of the Control group after 1970.

The uncertainty numbers are for each of the effects listed about. So, the intercept row is uncertainty in a mean, but the rest are uncertainty of differences.

To find the mean of the different Treatments, After 1970, you would need to add the estimate of the Treatment effect to the estimate of the effect of time and add both to the Intercept.

Now you know what the coefficients should mean, check them by calculating the mean of each group in your data.

Code hint.

The code below shows you how you can easily (easy in terms of less typing) take the mean of each group in the data. Remember that [] are index brackets, they tell R, “I want something inside the object I just mentioned” e.g. object[1] takes the first entry in object.

The == means matching and selects only entries that match what follows the ==. So, in words mean(Yields$yield[Yields$Treatment == "Control" & Yields$After1970 == "Before"]) tells R to take the mean of the column yields inside data frame Yields but only for rows where column Treatment = Control and column After1970 = Before.

Why does it not add up?

Because there is an interaction and we are missing it.

Part C: Interactions

Often in nature, effects are not separate but interact. This means that the effect of one variable is different for different values of another, e.g. plants might not be able to grow more in response to fertiliser, unless they also have enough light. So, the effect of fertiliser would be different in low and high light.

Often, we want to include interactions in our models. We do this by changing the + in lm() to * or :. If you use : need to use + too, * does everything. So we recommend using *! It's much easier. e.g. lm(Yield ~ Treatment * After 1970, data = Yields)

By adding in an interaction, we now change what R calculates from the model. In the previous example, without an interaction, the output of the lm() had no row for the Fertilised, Manure, or Stopped treatments for the time period After 1970. To work out what the estimated means of these groups you would have had to add the effect of the fertiliser treatment to the control and then add the effect of time after 1970. This assumes that the effect of time is the same for each fertiliser treatment AND that the effect of each fertiliser treatment is the same regardless of time.

But in reality, the effect of time might not be the same for each fertiliser treatment and vice versa, this is an interaction. So, when you fit a model with an interaction in R it fits a separate effect for each combination of groups within the interacting variables e.g. one for fertiliser after 1970 and a separate effect for fertiliser before 1970 and a separate one for control after 1970 etc.

When you run a model with an interaction, you should now get a row for each combination of time and fertiliser treatment.

Run the interactive model.

What do these coefficients mean? Which parameters do they relate to?

Check your answer!

The (Intercept) row is still the mean of the control group before 1970. Each row that is labelled with 'Treatment' followed by a group name e.g. 'TreatmentFertilised' (but without a :) represent the difference in mean between the (Intercept) group and the named group (all before 1970). E.g. Fertilised before 1970 group has a mean 1.6225 more than the Control group before 1970.

The row that is labelled with After1970After gives the difference in mean between the (Intercept) group and the Control group after 1970.

The last few numbers are the interactions. They all incluce a : which tells you they are an interaction. They represent the difference between the mean of the Treatment group before 1970 and the mean of the Treatment group after 1970. For example, the row labelled TreatmentManure:After1970After represents the difference between the mean for the Manure treatment before 1970 (0.89750000 + 2.14083333) and the mean for the Manure treatment after 1970. After 1970, the Manure Treatment had a mean yield approximately 2.6 higher than before 1970.

# Run the model
modelBothI <- lm(yield ~ Treatment*After1970, data = Yields)

# Look at coefficient estimates
coef(modelBothI)

##                        (Intercept)                TreatmentFertilised 
##                         0.89750000                         1.62250000 
##                    TreatmentManure                   TreatmentStopped 
##                         2.14083333                         0.87416667 
##                     After1970After TreatmentFertilised:After1970After 
##                         0.04416667                         1.01750000 
##     TreatmentManure:After1970After    TreatmentStopped:After1970After 
##                         2.61416667                        -0.19416667

# Look at uncertainty
confint(modelBothI)

##                                         2.5 %    97.5 %
## (Intercept)                         0.6204900 1.1745100
## TreatmentFertilised                 1.2307487 2.0142513
## TreatmentManure                     1.7490820 2.5325847
## TreatmentStopped                    0.4824153 1.2659180
## After1970After                     -0.4356288 0.5239621
## TreatmentFertilised:After1970After  0.3389668 1.6960332
## TreatmentManure:After1970After      1.9356334 3.2926999
## TreatmentStopped:After1970After    -0.8726999 0.4843666

Do things add up now? (we are not aiming for perfection)

Mostly. Not perfect, but close.

What can you conclude from these results?

Need to use confidence intervals to get conclusions. Can conclude that in the period before 1970 all fertiliser treatments seem to have a positive effect on yield compared to the control. All have estimates of positive mean difference, manure is highest. The effect of time is different for each treatment. For control, no clear direction of pattern (CI span 0. For Fertilised there was an interaction and you see an extra increase in mean after 1970 (0.34 to 1.7 increase) same for manure, but even bigger extra increase (1.94 to 3.29) but for the stopped treatment, you do not see an interaction (effect spans 0). So control and stopped show no difference with time. Makes me wonder if they changed the type of manure/fertiliser in 1970? But we don't know! Remember, to find the mean of the Treatment groups e.g. fertilised, stopped, and manure, for after 1970 you must add the intercept the effect of time, the effect of the treatment AND the interaction effect.

Part D: Mixed continuous and categorical variables

What about continuous data? So far, treated both separately but there are many scenarios where we would want to have both types of explanatory variable in a single analysis.

Example 1: Body length and temperature and water treatment. Scientists measured the body length of insects in micrometers at different temperatures either with or without rainfall.

Example 2: Lay date and insect biomass and habitat. Scientists (including Emily) measured the date in the year when birds lay their eggs (it can take any value even 1.5 or -2) in different habitats in a woodland: good, bad, ok and also measured the biomass of insects close to each nest box. Biomass is measured in grams and there are several nest boxes in each habitat.

Example 3: Tail length and dinosaur family and age of fossil. Palentologists measured the tail length of fossils in cm and recorded the family the dinosaur belonged too and the age of the fossil in million years.

In the examples, which variable is categorical and which is continuous?

Example 1 = categorical (water) and continuous (temperature). Example 2 = categorical (habitat) and continuous (biomass). Example 3 = categorical (family) and continuous (fossil age).

The data for example 1 (made up) can be found here: https://www.math.ntnu.no/emner/ST2304/2020v/Week09/BodyLengthExperiment.csv

It is a csv file and has column headers.

Run an interactive model for example 1.

# Import data

BodyLength <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week09/BodyLengthExperiment.csv", 
                       header=T)

# create the lm object
BodyLengthModel <- lm(length ~ temperature*water, data = BodyLength)

# extract the coefficient estimates
coef(BodyLengthModel)

##          (Intercept)          temperature             waterYes temperature:waterYes 
##            46.365831             5.191970            25.267954            -3.643074

# extract confidence intervals
confint(BodyLengthModel)

##                          2.5 %    97.5 %
## (Intercept)          31.804175 60.927487
## temperature           4.239380  6.144560
## waterYes              4.674663 45.861245
## temperature:waterYes -4.990240 -2.295909

What do these coefficients mean? Which parameters do they relate to?

The (Intercept) is the intercept of the line for temperature when water = No. The next row is the slope of the effect of temperature when water = No. Then the next is the effect of water = Yes, in this case a change in intercept of the line. The final number is the interaction so the change in the effect of temperature when water = Yes, it is a difference in slope.

Try and draw out the results on your whiteboard.

When drawing think about what data you have and how you can best represent it in a graph. Think about what the response variable is, what the explanatory variables are and how you would like to model them e.g. a straight line or several? Or differences in means etc. Don't worry, we will go through this afterwards so it doesn't matter if you are wrong, just have a go!

PAUSE HERE

Correct your drawing of the results if you need to.

plot of chunk unnamed-chunk-7

Interpret output in biological terms.

Temperature has positive effect on body length (warmer = longer). The strength of that effect is bigger when there is no water. But the effect of water itself, is to increase body length. Large uncertainty in the effect of water, but still doesn’t cross 0

Was an interaction term needed in this model?

Yes, the interaction effect was not close to 0 and the confidence intervals did not span 0.

Part E: Detective skills

Look at the model outputs above, match the following model types to the output:

They are all in different styles and I have deliberately given no information on the type of data. You have to work it out from the output.

Example 1

Example 2

Example 3

Example 4: lm(time ~ mean_max_temp1*April_lay)

One categorical explanatory variable Example3
One categorical and one continuous explanatory variable Example1
One continuous explanatory variables Example2
Two continuous explanatory variables Example4
Model with an interaction Examples1 and 4

How would interpretation bet different for a) a model with an interaction and the one with one categorical and one continuous explanatory variable b) the same data but no interaction?

span style=“color:red”> With no interaction there will be the same slope for each group but with interaction there is a difference slope for the continuous variable within each group of the categorical.