Categorical variables continued:

from question to interpretation and back again

Instructions:

Hints and reminders are italic

Questions appear in blue.

Answers in red or as R code

Resources:

Background on Rothamsted and Fisher - why they are important in statistics (p8 The New Statistics with R)
Definition of categorical variables
Week 4 - t-test. Lecture p32, 36-38, and Exercise Q2.
Week 5 - regression. Lecture p6-10
Week 7 - design matrix. Lecture p14-18
Your brain - we will spend more time on this part of the exercise thinking through why we are doing things and what it shows rather than just running the analyses.

Extras (try to complete without these, but if you get stuck use them)

The challenge: Did we have a solution to insect pests back in 1942?

For the first half of this exercise we have looked at what impact different fertilizer regimes have on yield of a crop (how much of it we get). For the second half we will look at the influence of insecticides on insects of crop plots in Ontario, Canada in 1942.

The data comes from a scientific paper from 1942 http://www.bio.umass.edu/biology/kunkel/pub/Biometry/Beale_InSpra-Biom1942.pdf, pretty old. You have columns of the biomass (this was originally a count but we have altered it to be an indicator of biomass because we should do a different kind of model for counts - which we will cover in a few weeks) of the dead insects randomly sampled in the plot after the treatment, spray, which insecticide spray the plot was treated with (no one seems to know what these actually were - good lesson to always keep good notes!).

Here is a picture of an insect pest - the tomato hornworm Manduca quinquemaculata - in its most elegant form. (credit- wikipedia)

Insect picture

The experiment was run by taking several independent crop plots of nightshade solanaceae (tomatoes, peppers etc) and spraying them with one of 6 different insecticide sprays (again, we only know them as A,B,C,D,E,F). Insects were randomly sampled (dead ones) after the treatment. The data cover 72 plots, each spray was applied on 12 plots.

Your job is to find out whether these sprays influence the amount of insects killed on a crop.

The data can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week8/InsectData.csv

The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

InsectData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week8/InsectData.csv", header=T)

The first step, as always, is to plot our data. As we just have two columns we can use the pairs() function without a problem.

pairs(InsectData)

plot of chunk unnamed-chunk-2

Take a look at the plot and think about the experimental design.

1. What is the response variable and what is the explanatory variable here? What kind of data are each of these?

Explanatory = spray Response = biomass count = continuous and spray = categorical/factor/discrete/grouped

You want to try and model this data. Given your answer to question 1, 2. What would you try and model from this data? (i.e. What estimates would we want our model to produce? Put simply, what are we interested in finding out or capturing mathematically here?)

We want to model the differences between groups. We do this by estimating the differences in group means.

Hint: think about the kind of values we can get out of models e.g. we can characterise a relationship, estimate a difference between means, estimate a probability (these are all examples of models we have used)

2b. Can you write qu 2 as a question about biology?

Anything related to “Does the use of different insecticide sprays create a difference in mean insect biomass?”

Hopefully you have suggested something for question 2 that can be achieved with a linear model i.e. lm(). If this is the case, we can run an lm() for our data. You should know how to do this by now. We want our treatment column (just the column name) as our X value and the response as our Y. lm(Y~X, data)

The lm() uses the design matrix, like you created in the first part of the exercise, to fit the model. Here we will just let R do its thing, but we know what it is doing! This links directly to what you did in the first part of the exercise and it is here where the design matrix is used (to tell R how to run the model).

# run the model
InsectModel <- lm(biomass ~ spray, data = InsectData)

# get the coefficient estimates
coef(InsectModel)

## (Intercept)      sprayB      sprayC      sprayD      sprayE      sprayF 
##   3.7606784   0.1159530  -2.5158217  -1.5963245  -1.9512174   0.2579388

# and the confidence intervals
confint(InsectModel)

##                  2.5 %     97.5 %
## (Intercept)  3.3985262  4.1228306
## sprayB      -0.3962075  0.6281135
## sprayC      -3.0279822 -2.0036612
## sprayD      -2.1084850 -1.0841640
## sprayE      -2.4633779 -1.4390569
## sprayF      -0.2542217  0.7700993

3. What are the coefficient estimates you get from the lm()? What do they represent?

Hint: think about the equation for a linear model in week 4 and week 5.

Because we have categorical data the coefficient estimates will be (Intercept) = the mean of sprayA group each other value is the difference between that group and group A.

You already know that the count variable ahs been squareroot transformed to improve the fit of the linear model. But, we shouldn't just take someone else's word for this. We should also check ourselves. Even with categorical data, we still have pretty much the same assumptions for a linear model as we did for regression (but we no longer need a straight line!).

errors are independent
errors have the same variance
errors are normally distributed
errors have zero mean

4. Look at the residuals vs fitted and Normal QQ plots below. What do you think of the model fit?

Looks pretty good. Each group has similar variance around fitted values (equal variance assumption met). Can see from residuals vs fitted plot. We can also see no curve (linearity assumption met). Normal QQ suggests very close to theoretical normal distribution. So we are happy that normality assumption is met.

# Get the rounded residuals and fitted values from the model
InsectResiduals <- round(residuals(InsectModel),2)
InsectFitted <- round(fitted(InsectModel),2)


# Plot the fitted versus residuals 
plot(InsectFitted, InsectResiduals)
# add a horizontal line at 0
# line is grey and dashed (lty=2)
abline(h=0, lty=2, col="grey")

plot of chunk unnamed-chunk-4

# Plot a normal Q-Q plot
qqnorm(InsectResiduals)
qqline(InsectResiduals)

plot of chunk unnamed-chunk-4

Now that you have thought about what these numbers represent and checked our model fit:

5. Desribe the output from lm(), include the confidence intervals here too and think about what they are the confidence intervals for. What are the numbers showing? Describe the pattern.

# Get the rounded residuals and fitted values from the model
coef(InsectModel)

## (Intercept)      sprayB      sprayC      sprayD      sprayE      sprayF 
##   3.7606784   0.1159530  -2.5158217  -1.5963245  -1.9512174   0.2579388

confint(InsectModel)

##                  2.5 %     97.5 %
## (Intercept)  3.3985262  4.1228306
## sprayB      -0.3962075  0.6281135
## sprayC      -3.0279822 -2.0036612
## sprayD      -2.1084850 -1.0841640
## sprayE      -2.4633779 -1.4390569
## sprayF      -0.2542217  0.7700993

(Intercept) row doesn't tell us much, just the estimate of the mean for the first group (spray A) and the confidence interval around the mean estimate. We are not interested in means themselves but the differences between them. Differences between spray A and sprays C, D, and E are the largest (all > 1, > 2 if we transform back to original units). The CIs for these do not cross 0, so it is unlikely that we would see these level of differences if there was no difference. Spray B and F on the other hand have very low difference from Spray A (< 0.5). Their confidence intervals do cross 0, so including uncertainty, 0 would be a plausible difference between these groups and spray A. Should note all of these are compared to Spray A - could note whether this is sensible/helpful. Basically we can't say anything about the difference between groups other than A e.g. between B and C, from our current results.

Remember: confidence intervals represent the uncertainty around our estimate. Think very carefully about what any differences shown are. What are they showing the difference between? Is this the most helpful way to show the results?

The next step once we know what pattern the numbers give is to bring it all back to the biology.

6. Interpret the output from lm(), include the confidence intervals here too, can look at R squared as well. How do these results help you answer your question from 2b? (Think about the bold point above)

summary(InsectModel)$r.squared

## [1] 0.7724112

# The R squared is quite high, the model explains 77% of the variance in
# insect biomass. So it is a useful model.

The differences described above tell us that sprays A, B, and F have a similar effect, we cannot confidently distinguish the effect of B and F from that of A. But sprays C, D and E do appear to have an effect of reducing the mean insect biomass in their plots relative to A. By around 2 to 4 non-squarerooted insects. Because this is all relative to spray A, we cannot say anything about the differences between the other groups here. We can see from the plots at the beginning that C, D and E have the lowest mean insect biomass and all at a similar level. They could also notice there is no control (we don't know what the sprays are, one could be water, but we don't know). So if we want to know if these sprays kill more insects, we cannot tell. A, B and F could increase them from a control and C, D, and E hold them steady. It helps a bit with answering the question but would be better to compare all groups not just to the mean of A.

While we can get results here and draw conclusions. This data is quite old (more than 60 years old), and the data collection was not perfect. Also our aims might have changed. With invertebrate declines being more common, we might not want an insecticide that kills everything.

7. If you were to re-design this experiment now in 2019 in Norway what would you do? Write a brief experimental design for this study.

You should include:

What you want to find out (biological question).
What treatments you would use. (Can add new ones you think were missed in 1942).
Highlight one problem with the 1942 data and say how you would fix it.
Which assumptions of a linear model do you need to consider in the experimental design phase? Say how you will do this.
Maximum of 100 words, this should be short!

This question does not have a fixed answer. It is all about how you justify and explain choices. Any sensible biological question regarding insect numbers (up or down). Should include some kind of pesticide or maybe something that encourages insects and ideally a control. The main problem I could see was lack of obvious control here - I hope some notice that and fix with a control. Anything else sensible like more replicates is good too. Or a more targeted insecticide, or all crops same species etc. Need to think about randomness and independence of samples. Make sure plots allocated at random and suitably independenent i.e. not some in one field and others in another. These are some assumptions of linear models (and most statistical analyses)