This week we will cover a couple of other topics with categorical variables:
This is the same data as last week.
Yields <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week8/Hoosfield_Yields.csv")
Yields$FYM.stopped <- Yields$Treatment=="FYM.stopped" # TRUE = 1
Yields$FYM <- Yields$Treatment=="FYM" # TRUE = 1
Yields$After1970 <- Yields$StartYear>1969 # create dummy variable
This data comes from Rothamsted, from an experiment started in 1852. Spring barley has been grown on the site (Hoosfield) continuously. Each plot can have one of 4 treatments:
The response is yield, i.e. how much barley was harvested from the field: a higher yield is obviously better. The units are tonnes per hectare (t/ha). We have data that are means over about 10 years: we will treat these as replicates.
Last week we looked at 2-way ANOVAs. We can look at them again, for the FYM and (inorganic) fertiliser treatment:
Fdata <- Yields[Yields$Treatment%in%c("FYM", "Fertiliser"),]
model1 <- lm(yield ~ Treatment + After1970, data=Fdata) # fit model
coef(model1)
confint(model1)
Interpret the parameters:
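As a sketch of how the coefficients combine into predictions (the data here are simulated for illustration, not the Hoosfield estimates): the predicted mean for each group is the sum of the relevant coefficients.

```r
# Simulated data with the same structure as Fdata (illustrative only)
set.seed(1)
d <- data.frame(Treatment = rep(c("Fertiliser", "FYM"), each = 10),
                After1970 = rep(c(FALSE, TRUE), times = 10))
d$yield <- 2 + 0.5*(d$Treatment == "FYM") + 1*d$After1970 + rnorm(20, sd = 0.1)
m <- lm(yield ~ Treatment + After1970, data = d)
# Predicted mean for FYM after 1970: intercept + FYM effect + after-1970 effect
sum(coef(m)[c("(Intercept)", "TreatmentFYM", "After1970TRUE")])
```

This is exactly what predict() does for that combination of levels.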
We can plot the fitted model, by predicting for the different treatment levels. The code for the plot has a lot in it, so you don't have to try to understand it all!
# unique() returns the unique values
# expand.grid() returns all combinations of the different columns
# try this simple example: expand.grid(x=1:3, y=1:2)
PredData <- expand.grid(Treatment=unique(Fdata$Treatment), After1970=unique(Fdata$After1970))
# PredData$FYM <- PredData$Treatment=="FYM"
# predict for all of the levels of the factors
PredData$Pred <- predict(model1, newdata=PredData)
Colours <- c("blue", "darkgreen") # these are the colours used in the plot. Feel free to change them!
# jitter() adds a small amount of noise to the vector it's jittering: the effect is to spread the points a bit, so they are easier to see but it's still obvious that they are in a group
plot(jitter(0+Fdata$After1970, factor=0.2), Fdata$yield,
col=Colours[1+Fdata$FYM], ylim=range(Fdata$yield), xaxt="n", xlim=c(-0.2,1.2),
xlab="", ylab="Yield", main="The Data and Estimates")
points(PredData$After1970, PredData$Pred, col=Colours[1+(PredData$Treatment=="FYM")], pch=4, cex=2)
# This adds some lines, so the differences between effects are clearer
segments(PredData$After1970[!PredData$After1970], PredData$Pred[!PredData$After1970],
PredData$After1970[PredData$After1970], PredData$Pred[PredData$After1970],
col=Colours[1+(PredData$Treatment=="FYM")])
# This is to make a nice axis label. I won't expect to see you do this:
# there are several details that are rarely used
axis(1, c("Before\n1970", "After\n1970"), at=0:1, padj=1)
We can also plot the residuals: the x-axis for the plot could be either of the factors, but here we plot them against the predicted values (i.e. the crosses in the plot above) to make the residuals easier to see.
plot(fitted(model1), resid(model1),
pch=1+Fdata$FYM, col=3+Fdata$After1970)
abline(h=0)
legend(3.6, 1, c("Fertilised, before 1970", "FYM, before 1970",
"Fertilised, after 1970", "FYM, after 1970"),
col=rep(3:4, each=2), pch=1:2)
The model does not seem to fit well: the effect of one factor seems to depend on the effect of the other.
The problem we see is called an interaction. Sometimes it is an annoyance; at other times it is exactly what we are interested in. For example, I once looked at density dependent selection in wheat powdery mildew: selection is a change over time, so I looked at the interaction between time and density. But how do we include interactions in our models?
First, let’s look at the model without the interaction. The design matrix for data from each combination of factors looks like this:
\[ X = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ \end{array} \right) \]
The first column is the intercept, the second is the treatment (either Fertilised or FYM), and the third is whether it is before or after 1970. FYM and After 1970 are indicated by 1s.
The parameters are a vector:
\[ \beta = \{ \alpha_0, \beta_{FYM}, \beta_{after} \} \] The model is \(Y = X \beta + \varepsilon\), where \(X \beta\) is the predicted value of \(Y\). We calculate this by matrix multiplication (something you will definitely NOT be asked about on the exam):
\[ X \beta = \left( \begin{array}{ccc} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ \end{array} \right) \left( \begin{array}{c} \alpha_0 \\ \beta_{FYM} \\ \beta_{after} \\ \end{array} \right) = \left( \begin{array}{c} 1 \times \alpha_0 + 0 \times \beta_{FYM} + 0 \times \beta_{after} \\ 1 \times \alpha_0 + 0 \times \beta_{FYM} + 1 \times \beta_{after} \\ 1 \times \alpha_0 + 1 \times \beta_{FYM} + 0 \times \beta_{after} \\ 1 \times \alpha_0 + 1 \times \beta_{FYM} + 1 \times \beta_{after} \\ \end{array} \right) = \left( \begin{array}{c} \alpha_0 \\ \alpha_0 + \beta_{after} \\ \alpha_0 + \beta_{FYM} \\ \alpha_0 + \beta_{FYM} + \beta_{after} \\ \end{array} \right) \]
So the fourth data point, with the FYM and After 1970 levels, is the intercept plus the sum of those two effects. The effects of FYM (say) are averaged over the Before 1970 and After 1970 levels.
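R builds this design matrix for us, and model.matrix() lets us inspect it. A small illustration, using one row per combination of the two factors (illustrative levels, not the Hoosfield data; the row order may differ from the matrix above):

```r
# All four combinations of the two factors
d <- expand.grid(Treatment = c("Fertiliser", "FYM"), After1970 = c(FALSE, TRUE))
# The design matrix: an intercept column plus one dummy variable per factor
model.matrix(~ Treatment + After1970, data = d)
```

The columns match the matrix above: the intercept, the FYM dummy, and the After 1970 dummy.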
So, how do we add an interaction? It has to have another parameter, so we could add it on to the last row, to give this:
\[ X \beta = \left( \begin{array}{c} \alpha_0 \\ \alpha_0 + \beta_{after} \\ \alpha_0 + \beta_{FYM} \\ \alpha_0 + \beta_{FYM} + \beta_{after} + \beta_{FYM:after}\\ \end{array} \right) \]
From which we can work out that the dummy variable for this would have zeroes in all other rows than the last. The design matrix then looks like this:
\[ X = \left( \begin{array}{cccc} 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ \end{array} \right) \]
The fourth column represents the interaction between the second and third columns. It is only 1 if both of these are 1: you can get this by multiplying the relevant entries in the second and third columns (this fact is mainly helpful if you want to do these calculations by hand, but of course the computer will do it for you). Because of this we represent the interaction by a multiplication:
model.interaction <- lm(yield ~ Treatment * After1970, data=Fdata)
coef(model.interaction)
## (Intercept) TreatmentFYM
## 2.5200000 0.5183333
## After1970TRUE TreatmentFYM:After1970TRUE
## 1.0616667 1.5966667
We see we now have a new parameter, TreatmentFYM:After1970TRUE, which is the interaction effect (i.e. \(\beta_{FYM:after}\)).
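You can check the "multiplication" fact directly: in the design matrix of the interaction model, the interaction column is the element-wise product of the two main-effect columns. A small check with illustrative factor levels (not the Hoosfield data):

```r
# All four combinations of the two factors
d <- expand.grid(Treatment = c("Fertiliser", "FYM"), After1970 = c(FALSE, TRUE))
X <- model.matrix(~ Treatment * After1970, data = d)
X
# The interaction column is the product of the two dummy variables
all(X[, "TreatmentFYM:After1970TRUE"] == X[, "TreatmentFYM"] * X[, "After1970TRUE"])
```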
Now you have to work out what this means. Plot the "The Data and Estimates" plot again. Add the estimates from the model with the interaction, by making more predictions (with PredData$InteractionPred <- predict(model.interaction, newdata=PredData)).
We can now look at the full data, and see if there are interactions. Fit the same model as above, i.e. yield ~ Treatment*After1970, to the Yields dataset (so Treatment now has 4 levels). Look at the parameters for the interaction and their confidence intervals.
The other parameters are less informative on their own, because they are only relevant for some combinations of factors.
You might be able to plot the data in the same way we did earlier, but it will need more than 2 colours.
In the analysis last week, we got models like this:
model1 <- lm(yield ~ Treatment, data=Yields)
coef(model1)
## (Intercept) TreatmentFYM TreatmentFYM.stopped
## 2.873889 1.050556 -1.152222
## TreatmentUnfertilized
## -1.961667
An obvious question is: where's the Fertiliser treatment? We saw that it has disappeared into the intercept: we can consider it a baseline, and the other coefficients are the differences between the other levels and the baseline. We can see that by plotting the group means, and then adding lines for the intercept and the intercept plus each treatment effect. Here it is for FYM:
# First calculate the means for each treatment.
# aggregate() takes a vector as the first argument (i.e. Yields$yield), splits it into pieces according to the second argument, a list (so here it is split into pieces for every level of Yields$Treatment), and then on each piece it calculates the function given by the third argument (i.e. mean())
Means <- aggregate(Yields$yield, list(Yields$Treatment), mean)
# Now plot the means:
# To get an informative x-axis we have to suppress the x-axis here (using xaxt="n"), and add it back with the second line
plot(1:4, Means$x, xaxt="n", xlab="Treatment", ylab="Yield (t/ha)")
axis(1, Means$Group.1, at=1:4)
# Now add a horizontal line for the intercept. We can see it is the same as the Fertiliser mean
abline(h=coef(model1)["(Intercept)"], lty=2)
# Now add a horizontal line for intercept + FYM. We can see it is the same as the FYM mean
abline(h=coef(model1)["(Intercept)"]+coef(model1)["TreatmentFYM"], lty=2, col=2)
(you won't need to know the extra R code for the exam, but you might find it useful in practice)
For most experiments, we would be more interested in the differences from a Control treatment. Here this is the Unfertilized treatment, so we should make everything a contrast to that. We can do this by changing the way R makes the factor, using the function relevel(). It simply changes the first level of the factor to be the one given by ref:
Yields$Treatment <- relevel(Yields$Treatment, ref = "Unfertilized")
model1a <- lm(yield ~ Treatment, data=Yields)
coef(model1a)
## (Intercept) TreatmentFertiliser TreatmentFYM
## 0.9122222 1.9616667 3.0122222
## TreatmentFYM.stopped
## 0.8094444
Now we see that the Unfertilized level is missing: it has become the baseline, so the intercept equals the mean for the Unfertilized treatment. The other 3 coefficients are the differences between each level and the intercept.
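We can check this relationship directly: the intercept plus a treatment coefficient recovers that group's mean. A self-contained sketch with simulated data (so the numbers are not the Hoosfield estimates):

```r
# Simulated yields for three treatments (illustrative only)
set.seed(2)
d <- data.frame(Treatment = factor(rep(c("Unfertilized", "Fertiliser", "FYM"), each = 5)))
d$yield <- rnorm(15, sd = 0.2) +
  ifelse(d$Treatment == "FYM", 4, ifelse(d$Treatment == "Fertiliser", 3, 1))
d$Treatment <- relevel(d$Treatment, ref = "Unfertilized")
m <- lm(yield ~ Treatment, data = d)
coef(m)["(Intercept)"] + coef(m)["TreatmentFYM"]  # equals the FYM group mean...
mean(d$yield[d$Treatment == "FYM"])               # ...calculated directly
```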
Check that the predictions are the same whichever level of Treatment is used as the baseline (e.g. in the model with Fertiliser as the baseline, the intercept gives the predicted value for the Fertiliser treatment. When Unfertilized is the baseline, the predicted value is the intercept plus the Fertiliser estimate). With Unfertilised as the baseline, explain what the results say (you may want to calculate the confidence intervals).

We can actually remove the intercept:
modelNoInt <- lm(yield ~ Treatment - 1, data=Yields)
coef(modelNoInt)
## TreatmentUnfertilized TreatmentFertiliser TreatmentFYM
## 0.9122222 2.8738889 3.9244444
## TreatmentFYM.stopped
## 1.7216667
This now gives us the group means. But these may not be helpful: does it matter if the fertiliser treatment gave an average yield of 2.9 t/ha? For a farmer not at Rothamsted, this may not be relevant: the comparison to other treatments is more important. When we have more than one factor, we get this:
modelNoIntInt <- lm(yield ~ Treatment + After1970 - 1, data=Yields)
coef(modelNoIntInt)
## TreatmentUnfertilized TreatmentFertiliser TreatmentFYM
## 0.6110417 2.5727083 3.6232639
## TreatmentFYM.stopped After1970TRUE
## 1.4204861 0.9035417
So After1970 still has a baseline and a contrast. We can also fit the model with the terms the other way around:
modelNoIntInt2 <- lm(yield ~ After1970 + Treatment - 1, data=Yields)
coef(modelNoIntInt2)
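Now it is Treatment that has a baseline and a contrast, and After1970 that gets a coefficient for each level. The two orderings are just different parameterisations of the same model, so the fitted values are identical. A quick check with simulated data (the Hoosfield data would behave the same way):

```r
# Simulated data with the same factor structure as Yields (illustrative only)
set.seed(3)
d <- expand.grid(Treatment = c("Unfertilized", "Fertiliser", "FYM", "FYM.stopped"),
                 After1970 = c(FALSE, TRUE))
d <- d[rep(1:8, each = 3), ]  # three replicates of each combination
d$yield <- rnorm(nrow(d))
m1 <- lm(yield ~ Treatment + After1970 - 1, data = d)
m2 <- lm(yield ~ After1970 + Treatment - 1, data = d)
all.equal(fitted(m1), fitted(m2))  # TRUE: same model, different parameterisation
```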