Categorical variables continued:

from question to interpretation and back again

Instructions:

Hints and reminders are italic

Questions appear in blue.

Needs to be completed and handed in by 7th March

Before you complete this part of Exercise 6 you should have finished the first part.

Both parts need to be completed and handed in this week (lecture 8 and exercise 6 part 2).

Resources:

Background on Rothamsted and Fisher - why they are important in statistics (p8 The New Statistics with R)
Definition of categorical variables
Week 4 - t-test. Lecture p32, 36-38, and Exercise Q2.
Week 5 - regression. Lecture p6-10
Week 7 - design matrix. Lecture p14-18
Your brain - we will spend more time on this part of the exercise thinking through why we are doing things and what it shows rather than just running the analyses.

Extras (try to complete without these, but if you get stuck use them)

The challenge: Did we have a solution to insect pests back in 1942?

For the first half of this exercise we looked at what impact different fertilizer regimes have on the yield of a crop (how much of it we get). For the second half we will look at the influence of insecticides on insects in crop plots in Ontario, Canada in 1942.

The data comes from a scientific paper from 1942, http://www.bio.umass.edu/biology/kunkel/pub/Biometry/Beale_InSpra-Biom1942.pdf, so it is pretty old. You have two columns. The first is biomass, the biomass of the dead insects randomly sampled in the plot after the treatment (this was originally a count, but we have altered it to be an indicator of biomass because we should use a different kind of model for counts, which we will cover in a few weeks). The second is spray, which insecticide spray the plot was treated with (no one seems to know what these actually were, a good lesson to always keep good notes!).

Here is a picture of an insect pest, the tomato hornworm Manduca quinquemaculata, in its most elegant form (credit: Wikipedia).

Insect picture

The experiment was run by taking several independent crop plots of nightshades (Solanaceae: tomatoes, peppers etc.) and spraying each with one of 6 different insecticide sprays (again, we only know them as A, B, C, D, E, F). Dead insects were randomly sampled after the treatment. The data cover 72 plots, and each spray was applied to 12 plots.

Your job is to find out whether these sprays influence the amount of insects killed on a crop.

The data can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week8/InsectData.csv

The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.
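
A minimal sketch of the import step. The object name insects is our own choice, and stringsAsFactors = TRUE is an addition not mentioned in the exercise: since R 4.0.0, read.csv() keeps text columns as character by default, while functions such as pairs() expect numeric or factor columns. The fallback is an assumption too: it builds a stand-in from R's built-in InsectSprays data (the same 1942 experiment), with the count square-root transformed and the column names biomass and spray assumed, so it may not match the course file exactly.

```r
# URL given in the exercise text
data_url <- "https://www.math.ntnu.no/emner/ST2304/2019v/Week8/InsectData.csv"

# header = TRUE: the first row of the file holds the column names
# stringsAsFactors = TRUE: read text columns (the spray labels) as factors
insects <- tryCatch(
  read.csv(data_url, header = TRUE, stringsAsFactors = TRUE),
  error = function(e) {
    # ASSUMPTION: stand-in only, used when the link cannot be reached
    message("Could not read the URL; using a stand-in built from InsectSprays")
    data.frame(biomass = sqrt(InsectSprays$count),
               spray   = InsectSprays$spray)
  }
)

str(insects)   # column names and types
head(insects)  # first few rows
```

Checking str() straight after the import is a quick way to confirm that the column names and types are what you expect before plotting or modelling.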

The next step, as always, is to plot our data. As we just have two columns we can use the pairs() function without a problem.
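
A short sketch of the plotting step. To keep it self-contained it uses a stand-in built from R's built-in InsectSprays data rather than the course file; the column names biomass and spray and the square-root transform are assumptions. With your own imported object, only the last two calls are needed.

```r
# ASSUMPTION: stand-in for the course CSV, built from the built-in
# InsectSprays data; column names and transform are assumed
insects <- data.frame(biomass = sqrt(InsectSprays$count),
                      spray   = InsectSprays$spray)

# pairs() needs numeric or factor columns, so make sure spray is a factor
insects$spray <- factor(insects$spray)

# Scatterplot matrix of every column against every other column
pairs(insects)

# Number of plots per spray: a quick check of the experimental design
table(insects$spray)
```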

Take a look at the plot and think about the experimental design.

1. What is the response variable and what is the explanatory variable here? What kind of data are each of these?

You want to try and model this data. Given your answer to question 1:

2. What would you try and model from this data? (i.e. What estimates would we want our model to produce? Put simply, what are we interested in finding out or capturing mathematically here?)

Hint: think about the kind of values we can get out of models e.g. we can characterise a relationship, estimate a difference between means, estimate a probability (these are all examples of models we have used)

2b. Can you rewrite question 2 as a question about biology?

Hopefully you have suggested something for question 2 that can be achieved with a linear model i.e. lm(). If this is the case, we can run an lm() for our data. You should know how to do this by now. We want our treatment column (just the column name) as our X value and the response as our Y. lm(Y~X, data)
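
As a sketch of what the lm(Y~X, data) call can look like here, again on a stand-in built from R's built-in InsectSprays data (the column names biomass and spray, the transform, and the object names insects and fit are assumptions, not taken from the course file):

```r
# ASSUMPTION: stand-in for the course CSV, built from the built-in
# InsectSprays data; column names and transform are assumed
insects <- data.frame(biomass = sqrt(InsectSprays$count),
                      spray   = InsectSprays$spray)

# Response on the left of ~, treatment column on the right
fit <- lm(biomass ~ spray, data = insects)

coef(fit)     # the coefficient estimates on their own
summary(fit)  # estimates with standard errors and R squared
```

Naming the data argument (data = insects) rather than passing it by position makes the call easier to read and harder to get wrong.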

lm() uses a design matrix, like the one you created in the first part of the exercise, to fit the model. Here we will just let R do its thing, but we know what it is doing: the design matrix is what tells R how to run the model.
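
If you want to see the design matrix R builds, model.matrix() will show it. The sketch below uses a stand-in built from R's built-in InsectSprays data (column names and transform are assumptions). It also recomputes the estimates from the design matrix using the normal equations, which is one standard way to write least squares and is shown only as a check; lm() itself uses a QR decomposition internally.

```r
# ASSUMPTION: stand-in for the course CSV, built from the built-in
# InsectSprays data; column names and transform are assumed
insects <- data.frame(biomass = sqrt(InsectSprays$count),
                      spray   = InsectSprays$spray)

fit <- lm(biomass ~ spray, data = insects)

# The design matrix lm() built from the spray column
X <- model.matrix(fit)
dim(X)     # one row per plot, one column per coefficient
unique(X)  # the distinct rows, one pattern per spray

# Least squares by hand: solve (X'X) beta = X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% insects$biomass)

# Should agree with the estimates from lm()
all.equal(unname(drop(beta_hat)), unname(coef(fit)))
```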

3. What are the coefficient estimates you get from the lm()? What do they represent?

Hint: think about the equation for a linear model in week 4 and week 5.

You already know that the count variable has been square-root transformed to improve the fit of the linear model. But we shouldn't just take someone else's word for this. We should also check for ourselves. Even with categorical data, we still have pretty much the same assumptions for a linear model as we did for regression (but we no longer need a straight line!).

errors are independent
errors have the same variance
errors are normally distributed
errors have zero mean

4. Look at the residuals vs fitted and Normal QQ plots below. What do you think of the model fit?

Figure: residuals vs fitted and Normal QQ plots
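
Plots like the two referred to in question 4 can be drawn by calling plot() on a fitted lm object. This is a sketch on a stand-in built from R's built-in InsectSprays data (column names and transform are assumptions), so the pictures may differ slightly from the ones in the exercise.

```r
# ASSUMPTION: stand-in for the course CSV, built from the built-in
# InsectSprays data; column names and transform are assumed
insects <- data.frame(biomass = sqrt(InsectSprays$count),
                      spray   = InsectSprays$spray)

fit <- lm(biomass ~ spray, data = insects)

op <- par(mfrow = c(1, 2))  # two plots side by side
plot(fit, which = 1)        # residuals vs fitted
plot(fit, which = 2)        # Normal Q-Q
par(op)                     # restore the previous plot layout
```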

Now that you have thought about what these numbers represent and checked our model fit:

5. Describe the output from lm(). Include the confidence intervals too, and think about what they are confidence intervals for. What are the numbers showing? Describe the pattern.

Remember: confidence intervals represent the uncertainty around our estimate. Think very carefully about what any differences shown are. What are they showing the difference between? Is this the most helpful way to show the results?
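
Confidence intervals for the coefficients of a fitted lm object can be obtained with confint(), which gives 95% intervals by default. A sketch on a stand-in built from R's built-in InsectSprays data (column names and transform are assumptions):

```r
# ASSUMPTION: stand-in for the course CSV, built from the built-in
# InsectSprays data; column names and transform are assumed
insects <- data.frame(biomass = sqrt(InsectSprays$count),
                      spray   = InsectSprays$spray)

fit <- lm(biomass ~ spray, data = insects)

# 95% confidence intervals, one row per coefficient
ci <- confint(fit)

# Estimates and their intervals side by side
cbind(estimate = coef(fit), ci)
```

Each row of the table belongs to one coefficient, so the intervals describe the same quantities as the estimates in question 3.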

The next step once we know what pattern the numbers give is to bring it all back to the biology.

6. Interpret the output from lm(). Include the confidence intervals here too; you can look at R squared as well. How do these results help you answer your question from 2b? (Think about the bold point above)
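
R squared is stored in the model summary. The sketch below extracts it and recomputes it by hand as one minus the residual sum of squares over the total sum of squares. It uses a stand-in built from R's built-in InsectSprays data (column names and transform are assumptions), so the value may differ from the one you get with the course file.

```r
# ASSUMPTION: stand-in for the course CSV, built from the built-in
# InsectSprays data; column names and transform are assumed
insects <- data.frame(biomass = sqrt(InsectSprays$count),
                      spray   = InsectSprays$spray)

fit <- lm(biomass ~ spray, data = insects)

# R squared as reported by summary()
r2 <- summary(fit)$r.squared
r2

# The same quantity by hand: 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((insects$biomass - mean(insects$biomass))^2)
all.equal(r2, 1 - rss / tss)
```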

While we can get results here and draw conclusions, this data is quite old (more than 75 years old), and the data collection was not perfect. Our aims might also have changed. With invertebrate declines becoming more common, we might not want an insecticide that kills everything.

7. If you were to re-design this experiment now in 2019 in Norway what would you do? Write a brief experimental design for this study.

You should include:

What you want to find out (biological question).
What treatments you would use. (Can add new ones you think were missed in 1942).
Highlight one problem with the 1942 data and say how you would fix it.
Which assumptions of a linear model do you need to consider in the experimental design phase? Say how you will do this.
Maximum of 200-300 words. This should be short!