Exercise 7: Interactions

Instructions:

Hints and reminders are italic

Questions appear in blue.

Needs to be completed and handed in by 14th March

Resources:

More details on the data p84 The New Statistics with R
Chapters 6 and 7 in The New Statistics with R (both examples in these chapters)

Extras (try to complete without these, but if you get stuck use them)

The challenge: How can we maximise plant productivity?

You are a team of agricultural scientists working for a large farming company. The company has employed you to find out how they can increase the productivity of their arable (plants) crops. The company don't want to pay for new experiments but they have given you some older data that you can analyse. The older data is from two different experiments from 2009 and 1990.

Your job is to find out how different management practices influence plant growth and recommend a plan for the company.

Dataset one

Meadow

credit Wikipedia

The first dataset the company has given you can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week9/FertilizerData.csv

The data were published in 2009 by Hautier, Niklaus, and Hector Paper link. The study was designed to look at the influence of fertilizer and light on grassland plants. 32 different plots were exposed to fertilizer addition, a light addition to the understory, neither (control), or both treatments. The treatments were conducted for two years and above ground biomass collected twice a year to mimic cutting regimes in European meadows.

The dataset includes the variables; Biomass.m2 the above ground biomass of grassland plants in grams per metre squared, Fert a column indicating if the plots had fertilizer treatment, Light a column indicating if the plots had light addition.

As always, the first step is to import the data and assign it to an object then plot it. You can use the whole web link above to import the data. It is a csv file with column names (header) included. pairs() should work here

Take a look at the plot and think about the data

1. What is the response variable and what are the explanatory variables here? What kind of data are each of these?

In biology, it is always important before creating a model or doing analyses to think about what we want to find out.

2. Write a biological question (e.g. does temperature influence lay date in birds?) that you can answer using the data you have here.

Now that you have a biological question, your team wants to create a model of the data. You can do this using lm(). It is important to remember that this time there are two explanatory variables, so both need to be included in the model. For now we will include them additively (with a plus).

We want our treatment columns (just the column names) as our X values and the response as our Y. lm(Y~X1+X2, data)

3. Run the lm() to look at the impact of fertilizer and light on above ground biomass. What are the coefficient estimates you get from the lm()? What do they represent?

4. Interpret the results in qu 3. What conclusions would you draw about the effectiveness of Fertilizer and Light from these results?

Given the results from your lm() above, it was not possible to see the combined effect of the two treatments (Fertilizer and Light). However, there was a treatment where they were both applied together so we should consider their combined effect.

5. Why is it important to consider the combined effect of the two treatments?

One way we can look at the combined effect from our model is to predict the mean biomass if both Fertilizer and Light = TRUE.

# Create newdata for F+ and L+
newdata <- data.frame(Fert = "F+", Light = "L+")
# Predict, my model was called FertModel
prediction <- predict(FertModel, newdata, interval = "prediction")

# look at the prediction 
prediction[,1]

## [1] 551.1672

The prediction based on our additive (adding each effect) linear model is that the mean biomass of the Fertilized and Light addition group should be 551 grams per square metre. We can check if this is right using the code below.

# take mean of the group F+ and L+
mean(FertData$Biomass.m2[FertData$Fert == "F+" & FertData$Light == "L+"])

## [1] 575.0187

The actual mean of this group is 575 grams of biomass per square metre.

6. Can you give a biological reason why this might be the case? i.e. why does the actual mean differ from our model prediction?

Your team has looked at the results above and decided that the additive model they fitted is not capturing the effects of Fertilizer and Light well enough. There seems to be something else going on when the two treatments are combined. The two effects are not simply added together. Therefore your team decides to fit a model including an interaction term.

To do this instead of a + you include a * in the lm() formula e.g. lm(Y~ X1*X2, data). This includes the individual effects and the interaction

7. Run a lm() with an interaction and interpret the output. How has this changed from the first lm() you ran?

Hint: now you have a model with an interaction the differences have changed a bit. While the individual effects are still differences from the (Intercept) group - control here. The coefficient/beta value/difference for the interaction effect, written FertF+:LightL+ in the output, is actually the difference from the additive assumption. I.e. the difference from what we would predict if we just added the effects of Fertilizer = True and Light = True to the control group. Code to explain this included below!

# save the coefficients as an object 
coefficients <- coef(FertModel2)
# the first number in the coefficients is the mean of the control
# then the effect of Fertilizer, the effect of Light
# finally the interaction

# The additive assumption for Fertilizer + Light group =
coefficients[1]+coefficients[2]+coefficients[3]

## (Intercept) 
##    479.6125

# Including the interaction = 
coefficients[1]+coefficients[2]+coefficients[3]+coefficients[4]

## (Intercept) 
##    575.0188

# which gives the actual mean of that group

Interpreting interaction effects from the numbers can be a bit tricky. Often it is easier to plot the results to see how the interaction works. We can do this using the function interaction.plot()

par(mfrow=c(2,1))
# with() just tells R which dataset to use

# make interaction plot for Fertilizer effect under different
# Light conditions
with(FertData, interaction.plot(Fert, Light, Biomass.m2, 
                                xlab="Fertilizer", ylab="Mean Biomass"))

# make interaction plot for Light effect under different
# Fertilizer conditions
with(FertData, interaction.plot(Light, Fert, Biomass.m2, 
                                xlab="Light", ylab="Mean Biomass"))

plot of chunk unnamed-chunk-12

8. Describe the interaction effect in your own words based on the plots above.

9. What would you recommend to the company as a strategy to maximise their production based on the results you have so far?

Dataset two

This section is material you haven't covered in this exact way before. You will need to combine what you know about linear models for continuous data with what you know about linear models for categorical data. There is also more discussion on this in Chapter 7 of The New Statistics with R, if you find you would like more explanation.

Soya bean

credit = Wikipedia

The second dataset that your team has access to is from Heggestad and Lesser (1990). This was published in a study in the Journal of Environmental Quality Paper link.

The data are from an experiment looking at the effects of low-level atmospheric pollution and drought on agricultural yields. Specifically this study looked at yields in soya bean plants and included treatments of water (either well watered or low water - i.e. water stressed) and a gradient of an atmospheric pollutant (sulphur dioxide).

The variables in the data are Yield log of the yield of the crop, Water the water treatment either Well-watered or Stressed and SO2 sulphur dioxide level (remember it is O not zero!)

The data can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week9/PollutionData.csv

10. What is the response variable and what are the explanatory variables here? What kind of data are each of these?

Hint: think carefully about whether values between the ones you have can exist

11. What is the biological question you want to address with these data?

You realise that you can, again, address this question using an lm() You decide to start simple and not include an interaction.

12. Run an lm() for the effect of water stress and sulphur dioxide on yield. Look at the output (do not interpret yet) can you work out what the coefficients mean?

Hint: Think carefully about what kind of data each variable is. This will dictate what its coefficient value means

This is another type of model, when we have mixed continuous and categorical data, where it is easier to see the effects on a graph. Below is a plot of the effect of SO2 on Yield with a line for each treatment level (Well-watered and Stressed).

plot of chunk unnamed-chunk-19

13. Interpret the output of your lm() using the coefficients and the plot.

While there are only 3 values for SO2 it is NOT categorical data

Hint: look at the numbers given in coefficient values and try to match them to the graph. This will help you work out what the (Intercept) and Well-watered values mean.

Recommendation

14. Based on all of your results, what would you recommend as a strategy for the company to improve productivity? Include discussion of what you cannot say from the current data and analyses.