Exercise 7: Maximising plant productivity

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are in bold.

Questions appear in blue.

This exercise needs to be completed and handed in by 23:59 on 19th March.

Resources:

More details on the data: p. 84 of The New Statistics with R
Chapters 6 and 7 in The New Statistics with R (both examples in these chapters)

R this week:

Things to remember:

lm() with interactions: use * between the explanatory variables in the model formula.
relevel(): a function to change the reference level of a categorical variable/factor. This controls which group of the categorical variable is represented by the (Intercept).
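As a reminder of what these two tools do, here is a minimal sketch on made-up data (the data frame toy and its columns growth, water and shade are invented for illustration and are NOT the exercise data). It only shows which coefficients each formula produces and how relevel() changes the reference level.

```r
# Made-up data (NOT the exercise data): two yes/no treatments and a response
toy <- data.frame(
  growth = c(10, 12, 11, 15, 17, 16, 13, 14, 12, 25, 27, 26),
  water  = factor(rep(c("no", "yes", "no", "yes"), each = 3)),
  shade  = factor(rep(c("no", "no", "yes", "yes"), each = 3))
)

# "+" fits the two effects separately; "*" also adds their interaction
additive <- lm(growth ~ water + shade, data = toy)
with_int <- lm(growth ~ water * shade, data = toy)

names(coef(additive))  # "(Intercept)" "wateryes" "shadeyes"
names(coef(with_int))  # the same three, plus "wateryes:shadeyes"

# relevel() moves the chosen level to the front, so it becomes the
# group described by the (Intercept)
toy$water2 <- relevel(toy$water, ref = "yes")
levels(toy$water)   # "no" "yes"
levels(toy$water2)  # "yes" "no"
```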

The challenge: How can we maximise plant productivity?

You are a team of agricultural scientists working for a large farming company. The company has employed you to find out how they can increase the productivity of their arable (plant) crops. The company don't want to pay for new experiments, but they have given you some older data that you can analyse. The data come from two different experiments, published in 2009 and 1990.

Your job is to find out how different management practices influence plant growth and recommend a plan for the company.

Part A: Dataset one - no interaction

Image: meadow (credit: Wikipedia)

The first dataset the company has given you can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week9/FertilizerData.csv

The data were published in 2009 by Hautier, Niklaus, and Hector (paper link). The study was designed to look at the influence of fertiliser and light on grassland plants. Thirty-two plots were exposed to fertiliser addition, light addition to the understory, neither (control), or both treatments. The treatments were applied for two years, and above-ground biomass was collected twice a year to mimic cutting regimes in European meadows.

The dataset includes the following variables: Biomass.m2, the above-ground biomass of grassland plants in grams per metre squared; Fert, a column indicating whether the plots had the fertiliser treatment; and Light, a column indicating whether the plots had light addition.

As always, the first step is to import the data and assign it to an object then plot it. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

pairs() should work here.

You might also want to relevel the factors (Fert and Light) to make sure that the reference level is the control treatment (no light addition and no fertiliser, i.e. the levels L- and F-).

To do this, use the function relevel(). Its arguments are the factor/categorical variable that you want to relevel and ref=, the level that you want to be the reference, e.g. FertData$Light <- relevel(FertData$Light, ref="L-").

You will also need to turn BOTH Light and Fert into factors using as.factor() before you relevel them, because relevel() only works on factors. This was done in the module from week 9.
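The import, factor and relevel steps can be sketched as follows. This is only an illustration on a small made-up CSV (the columns Height and Spray and all values are invented); for the exercise you would pass the web link to read.csv() instead of text =, and use the column and level names from the fertiliser data.

```r
# Made-up CSV text standing in for a file or web link (hypothetical values)
csv_text <- "Height,Spray
4.1,Sprayed
3.9,Sprayed
4.4,Sprayed
3.0,Unsprayed
3.2,Unsprayed
2.9,Unsprayed"

# header = TRUE tells R that the first row holds the column names
toy <- read.csv(text = csv_text, header = TRUE)

# Step 1: turn the treatment column into a factor
toy$Spray <- as.factor(toy$Spray)
levels(toy$Spray)  # alphabetical by default: "Sprayed" "Unsprayed"

# Step 2: make the control group the reference level
toy$Spray <- relevel(toy$Spray, ref = "Unsprayed")
levels(toy$Spray)  # "Unsprayed" "Sprayed"

# Step 3: plot every variable against every other variable
# (factors are shown as their integer codes 1, 2, ...)
pairs(toy)
```

Converting to a factor before releveling matters because relevel() stops with an error if it is given a plain character column.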

Take a look at the plot and think about the data

A1. What is the response variable and what are the explanatory variables here? What kind of data are each of these?

In biology, it is always important to think about what we want to find out before creating a model or doing analyses.

A2. Write a biological question (e.g. does temperature influence lay date in birds?) that you can answer using the data you have here.

Now that you have a biological question, your team wants to create a model of the data. You can do this using lm(). It is important to remember that this time there are two explanatory variables, so both need to be included in the model. For now we will include them without an interaction.

A3. Run the lm() to look at the impact of fertiliser and light on above ground biomass. What are the coefficient estimates you get? What do they represent?

A4. Interpret the results from question A3. What conclusions would you draw about the effectiveness of Fertiliser and Light from these results?

The results from your lm() above do not directly show the combined effect of the two treatments (Fertiliser and Light). However, there was a treatment where they were both applied together, so we should consider their combined effect.

A5. How can you use your model output to estimate the mean of the group that has fertiliser and light?

Hint.

In the output you have an estimate for the effect of fertiliser (the difference in mean between the control group and the fertiliser group). You also have an estimate of the effect of light (the difference in mean between the control group and the light group). In this model there is no interaction, so the model assumes that the fertiliser + light group receives both the fertiliser-only effect and the light-only effect, added together.
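The idea in this hint can be sketched on made-up data (the data frame toy and its columns are invented and are NOT the exercise data). In a model without an interaction, the estimate for the group that receives both treatments is the intercept plus both treatment effects, which is also what predict() returns. It does not have to match the mean observed in that group.

```r
# Made-up data (NOT the exercise data): two yes/no treatments and a response
toy <- data.frame(
  growth = c(10, 12, 11, 15, 17, 16, 13, 14, 12, 25, 27, 26),
  water  = factor(rep(c("no", "yes", "no", "yes"), each = 3)),
  shade  = factor(rep(c("no", "no", "yes", "yes"), each = 3))
)

# Model WITHOUT an interaction
additive <- lm(growth ~ water + shade, data = toy)
b <- coef(additive)

# Estimate for the group with both treatments, added up by hand
by_hand <- unname(b["(Intercept)"] + b["wateryes"] + b["shadeyes"])

# The same estimate from predict()
both <- data.frame(water = "yes", shade = "yes")
by_predict <- unname(predict(additive, newdata = both))

# The mean actually observed in that group
observed <- mean(toy$growth[toy$water == "yes" & toy$shade == "yes"])

c(by_hand = by_hand, by_predict = by_predict, observed = observed)
```

In this toy example the two model-based numbers agree with each other but not with the observed mean, because the data were deliberately made up so that the effects do not simply add.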

A6. Calculate the actual mean of the fertiliser + light group. Is the estimate you got in A5 correct? If not, why not?

How do I get the mean?

You can use the same code as in the module this week. Hint: it uses [] and ==.

No, I don't remember

# take mean of the group F+ and L+
mean(FertData$Biomass.m2[FertData$Fert == "F+" & FertData$Light == "L+"])

## [1] 575.0187

A7. Can you give a biological reason why this might be the case? i.e. why does the actual mean differ from our model estimate?

Your team has looked at the results above and decided that the first model they fitted is not capturing the effects of Fertiliser and Light well enough. There seems to be something else going on when the two treatments are combined. The two effects are not simply added together. Therefore, your team decides to fit a model including an interaction term.
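To see what an interaction term adds, here is a minimal sketch on made-up data (toy, growth, water and shade are invented names and values, NOT the exercise data). With two two-level factors and an interaction, the model has one coefficient for each of the four groups, so it can reproduce every group mean exactly.

```r
# Made-up data (NOT the exercise data): two yes/no treatments and a response
toy <- data.frame(
  growth = c(10, 12, 11, 15, 17, 16, 13, 14, 12, 25, 27, 26),
  water  = factor(rep(c("no", "yes", "no", "yes"), each = 3)),
  shade  = factor(rep(c("no", "no", "yes", "yes"), each = 3))
)

# Model WITH an interaction
with_int <- lm(growth ~ water * shade, data = toy)
b <- coef(with_int)

# Observed mean of each of the four groups (rows = water, columns = shade)
group_means <- tapply(toy$growth, list(water = toy$water, shade = toy$shade), mean)

# Estimate for the group with both treatments:
# intercept + water effect + shade effect + interaction
both_estimate <- unname(sum(b))

both_estimate
group_means["yes", "yes"]
```

The interaction coefficient is the extra amount, on top of the two separate effects, needed to reach the mean of the group that received both treatments.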

Part B: Dataset one - interaction

B1. Run an lm() with an interaction and interpret the output. How has this changed from the first lm() you ran?

B2. Describe the interaction effect in your own words. You might want to draw out the effects to help.

B3. What would you recommend to the company as a strategy to maximise their production based on the results you have so far?

Part C: Dataset two

Image: soya bean (credit: Wikipedia)

The second dataset that your team has access to is from Heggestad and Lesser (1990). The study was published in the Journal of Environmental Quality (paper link).

The data are from an experiment looking at the effects of low-level atmospheric pollution and drought on agricultural yields. Specifically this study looked at yields in soya bean plants and included treatments of water (either well watered or low water - i.e. water stressed) and a gradient of an atmospheric pollutant (sulphur dioxide).

The variables in the data are: Yield, the log of the yield of the crop; Water, the water treatment, either Well-watered or Stressed; and SO2, the sulphur dioxide level (remember it is the letter O, not zero!).

The data can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week9/PollutionData.csv

Again, you will need to convert some columns to factors. This time it is only the Water column.

C1. What is the response variable and what are the explanatory variables here? What kind of data is each of these? Give a reason for each answer.

Hint: think carefully about whether values between the ones you have can exist.

You can also check how they are stored in R, anything stored as num should be continuous numeric.
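One way to check how columns are stored is str(). The sketch below uses a made-up data frame (toy, with invented columns height, dose and site), not the pollution data. Treat the storage type as a clue to combine with the hint above, not as the whole answer.

```r
# Made-up data frame (NOT the exercise data)
toy <- data.frame(
  height = c(1.2, 3.4, 2.8),
  dose   = c(0, 5, 10),
  site   = c("north", "south", "north"),
  stringsAsFactors = FALSE
)

# str() lists each column with its storage type (num, chr, Factor, ...)
str(toy)

# The same information as a named vector
column_types <- sapply(toy, class)
column_types

# A text column has to be converted before it can be used as a factor
toy$site <- as.factor(toy$site)
str(toy)
```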

C2. What is the biological question you want to address with these data?

You realise that you can, again, address this question using an lm(). You decide to keep things simple and not include an interaction.

C3. Run an lm() for the effect of water stress and sulphur dioxide on yield. Look at the output (do not interpret it yet). Can you work out what the coefficients mean in terms of a regression line?

Hint: Think carefully about what kind of data each variable is. This will decide what its coefficient estimate means.
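To illustrate this hint, here is a sketch on made-up data (toy, x, group and y are invented, and the values were constructed to lie exactly on two parallel lines; this is NOT the pollution data). With one categorical and one continuous explanatory variable and no interaction, the coefficient of the categorical variable shifts the intercept and the coefficient of the continuous variable is a slope shared by both groups.

```r
# Made-up data (NOT the exercise data): group A follows y = 2 + 0.5 * x,
# group B follows y = 5 + 0.5 * x
toy <- data.frame(
  x     = rep(c(0, 1, 2, 3), times = 2),
  group = factor(rep(c("A", "B"), each = 4))
)
toy$y <- ifelse(toy$group == "A", 2, 5) + 0.5 * toy$x

# One categorical and one continuous explanatory variable, no interaction
fit <- lm(y ~ group + x, data = toy)
b <- coef(fit)
b

# Regression line for the reference group (A)
line_A <- c(intercept = unname(b["(Intercept)"]),
            slope     = unname(b["x"]))

# Regression line for group B: shifted intercept, same slope
line_B <- c(intercept = unname(b["(Intercept)"] + b["groupB"]),
            slope     = unname(b["x"]))

line_A
line_B
```

Picking coefficients by name, as here, instead of by position avoids mistakes if the explanatory variables are written in a different order in the formula.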

It could be easier to see the effects on a graph. Below is some code to plot the effect of SO2 on Yield, with a line for each treatment level (Well-watered and Stressed).

This is the same as some code you used in the module for week 9.

# make the plot of the data
plot(Yield ~ SO2, data = PollutionData, pch=16, 
     las=1, ylab="log Yield")
# las = 1 makes the y axis numbers display horizontal

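# add lines for each group (watered/stressed)
# this assumes your model from C3 is stored as PollutionModel and was fitted
# as Yield ~ Water + SO2, so that the third coefficient is the SO2 slope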
abline(a=coef(PollutionModel)[1], b=coef(PollutionModel)[3], col="grey")
abline(a=coef(PollutionModel)[1]+coef(PollutionModel)[2], 
       b=coef(PollutionModel)[3], col="blue")

# add a legend
legend("topright", c("Well-watered", "Stressed"), col=c("blue","grey"),
       lty=1, cex = 0.5)

# if you aren't sure what the code does, try running just one part at a time
# e.g. coef(PollutionModel)[1] to see what it does

C4. Interpret the output of your lm() using the coefficients and the plot.

While there are only 3 values for SO2, it is NOT categorical data.

Hint: look at the numbers given in coefficient values and try to match them to the graph. This will help you work out what the (Intercept) and Well-watered values mean.

Part D: Recommendation

D1. Based on all of your results from both datasets, what would you recommend as a strategy for the company to improve productivity? Include discussion of what you cannot say from the current data and analyses.

Part E: Feedback

E1. How do you think this exercise went? What do you think your group did well, what are you less sure about? (2 examples of each)

E2. What do you think you improved from last week?

E3. Are there any concepts you are very unsure of?

E4. What would you like feedback on this week?