This Week: More Categorical Variables

This week we will cover a couple of other topics with categorical variables:

Yield Data

This is the same data as last week.

Yields <- read.csv("")

Yields$FYM.stopped <- Yields$Treatment=="FYM.stopped" # TRUE = 1
Yields$FYM <- Yields$Treatment=="FYM" # TRUE = 1
Yields$After1970 <- Yields$StartYear>1969 # create dummy variable

This data comes from Rothamstead, from an experiment started in 1852. Spring barley has been grown on the site (Hoosfield) continuously. Each plot can have one of 4 treatments:

The response is yield, i.e. how much barley was harvested from the field: a higher yield is obviously better. The units are tonnes per hecatare (t/ha). We have data that are means over about 10 years: we will treat these as replicates.


Last week we looked at 2-way ANOVAs. We can look at them again, for the FYM and (inorganic) fertiliser treatment:

Fdata <- Yields[Yields$Treatment%in%c("FYM", "Fertiliser"),]
model1 <- lm(yield ~ Treatment + After1970, data=Fdata) # fit model

Interpret the parameters:

  • is there a difference in yield between the Fertiliser and FYM treatments? Which one is better?
  • Is there a change in yield after 1970?
  • What does the model say about the difference between the Fertiliser and FYM treatments before and after 1970? (this is an easy, but important, question)

We can plot the fitted model, by predicting for the different treatment levels. THe code for the plot has a lot in it, so you don’t have to try to undetstand it all!

# unique() returns the unique values
# expand.grid() returns all combinations of the different columns
# try this simple example: expand.grid(x=1:3, y=1:2)
PredData <- expand.grid(Treatment=unique(Fdata$Treatment), After1970=unique(Fdata$After1970))
# PredData$FYM <- PredData$Treatment=="FYM"
# predict for all of the levels of the factors
PredData$Pred <- predict(model1, newdata=PredData)
Colours <- c("blue", "darkgreen") # these are the colours used in the plot. Fee free to change them!

# jitter() adds a small amount of noise to the vector it's jittering: hte effect is to spread the points a bit, so they are easier to see but it's still obvious that they are in a group
plot(jitter(0+Fdata$After1970, factor=0.2), Fdata$yield,
       col=Colours[1+Fdata$FYM], ylim=range(Fdata$yield), xaxt="n", xlim=c(-0.2,1.2),
     xlab="", ylab="Yield", main="The Data and Estimates")
points(PredData$After1970, PredData$Pred, col=Colours[1+(PredData$Treatment=="FYM")], pch=4, cex=2)
# This adds some lines, so the differnces between effects is clearer
segments(PredData$After1970[!PredData$After1970], PredData$Pred[!PredData$After1970],
         PredData$After1970[PredData$After1970], PredData$Pred[PredData$After1970],
# This is to make a nice axis label. I won't expect to see you do this:
#   there are several details that are rarely used
axis(1, c("Before\n1970", "After\n1970"), at=0:1, padj=1)

We can also plot the residuals: the x-axis for the plot could be either of the factors, but here we plot them againt the predicted values (i.e. the crosses in the plot above) to make the residuals easier to see.

plot(fitted(model1), resid(model1), 
     pch=1+Fdata$FYM, col=3+Fdata$After1970)
legend(3.6, 1, c("Fertilised, before 1970", "FYM, before 1970", 
                 "Fertilised, after 1970", "FYM, after 1970"), 
       col=rep(3:4, each=2), pch=1:2)

The model does not seem to fit well: the effect of one factor seems to depend on the effect of the other

  • BEFORE YOU DRAW THE PLOT, can you work out roughly what it will look like? (if it helps, try plotting the previous graph with different colours, to highlight each group).
  • What could be the biological reasons for the problem?
  • Can you think of an example from your other courses where you might see this effect?

Interactions (and also a bit more on design matrices and models)

The problem we see is called an interaction. Some times it is an annoyance, at other times it is what we are interested in, for example I ones looked at density dependent selection in wheat powdery mildew: selection is a change over time, so I looked at the interaction between time and density). But how do we include interactions into our models?

First, let’s look at the model without the interaction. The design matrix for data from each combination of factors looks like this:

\[ X = \left( \begin{array}{cc} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ \end{array} \right) \]

The first column is the intercept, the second is the treatment (either Fertilised or FYM), and the third is whether is it before or after 1970. FYM and After 1970 are indicated by 1s.

The parameters are a vector:

\[ \beta = \{ \alpha_0, \beta_{FYM}, \beta_{after} \} \] The model is \(Y = X \beta + \varepsilon\), where \(X \beta\) is the predicted value of \(Y\). We caclulate this by matrix multiplication (something you will definitely NOT be asked about on the exam):

\[ X \beta = \left( \begin{array}{cc} 1 & 0 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 1 & 1 \\ \end{array} \right) \left( \begin{array}{c} \alpha_0 \\ \beta_{FYM} \\ \beta_{after} \\ \end{array} \right) = \left( \begin{array}{cc} 1 \times \alpha_0 & 0 \times \beta_{FYM} & 0 \times \beta_{after} \\ 1 \times \alpha_0 & 0 \times \beta_{FYM} & 1 \times \beta_{after} \\ 1 \times \alpha_0 & 1 \times \beta_{FYM} & 0 \times \beta_{after} \\ 1 \times \alpha_0 & 1 \times \beta_{FYM} & 1 \times \beta_{after} \\ \end{array} \right) = \left( \begin{array}{cc} \alpha_0 \\ \alpha_0 + \beta_{after} \\ \alpha_0 + \beta_{FYM} \\ \alpha_0 + \beta_{FYM} + \beta_{after} \\ \end{array} \right) \]

So the fourth data point, with the FYM and After 1970 levels is the intercept plus the sum of those two effects. The effects of FYM (say) are averaged over the Before 1970 and After 1970 levels.

So, how do we add an interaction? It has to have another parameter, so we could add it on to the last row, to give this:

\[ X \beta = \left( \begin{array}{cc} \alpha_0 \\ \alpha_0 + \beta_{after} \\ \alpha_0 + \beta_{FYM} \\ \alpha_0 + \beta_{FYM} + \beta_{after} + \beta_{FYM:after}\\ \end{array} \right) \]

From which we can work out that the dummy variable for this would have zeroes in all other rows than the last. The design matrix then looks like this:

\[ X \beta = \left( \begin{array}{cc} 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 \\ \end{array} \right) \]

The fourth column represents the interaction between the second and third columns. It is only 1 if both of these are 1: you can get this by multiplying the relelvant entries in the second and this columns (this fact is mainly helpful if you want to do these calculations by hand, but of course the computer will do it for you). Because of this we represent the interaction by a multiplication:

model.interaction <- lm(yield ~ Treatment * After1970, data=Fdata)
##                (Intercept)               TreatmentFYM 
##                  2.5200000                  0.5183333 
##              After1970TRUE TreatmentFYM:After1970TRUE 
##                  1.0616667                  1.5966667

We see we now have a new parameter, TreatmentFYM:After1970TRUE, which is the interaction effect (i.e. \(\beta_{FYM:after}\))

Now you have to work out what this means. Plot the “The Data and Estimates” plot again. Add the estimates from the model with the interaction, by making more predictions (with PredData$InteractionPred <- predict(model.interaction, newdata=PredData)).

  • How does the new model differ from the old one?
  • Does the model suggest that the treatment effect is different after 1970? In what direction (check the parameter estimates and confidence intervals)?
  • What does this result mean biologically?

More Interactions

We can now look at the full data, and if there are interactions. Fit the same model as above to the full data, i.e. yield ~ Treatment*After1970 to the Yields dataset (so Treatment now has 4 levels). Look at the parameters for the interaction and their confidence intervals.

  • what do the interaction parameters tell you? Does the treatment effect change from before to after 1970?
  • what do the other parameters tell you?

The other parameters are less informative on their own, because the are only relevant for some combinations of factors.

  • How does the effect of FYM change from before and after 1970? What parameter go into the calculation? So what does the FYM parameter on its own tell you?

You might be able to plot the data in the same way we did earlier, but it will need more than 2 colours.


In the analysis last week, we got models like this:

model1 <- lm(yield ~ Treatment, data=Yields)
##           (Intercept)          TreatmentFYM  TreatmentFYM.stopped 
##              2.873889              1.050556             -1.152222 
## TreatmentUnfertilized 
##             -1.961667

An obvious question is - where’s the Fertilized treatment? We saw that it has disappeared into the intercept: we can consider it a baseline, and the other coefficients are the differences from those levels to the baseline. We can see that by plotting the group means, and then adding lines for the intercept and the intercept plus each treatment effect. Here it is for FYM:

# First calculate the means for each treatment.
#  aggregate takes a vector as the first argument (i.e. Yields$yield), splits it into pieces according to the second argument, a list (so here it is split into pieces for every level of Yield$Treatment), and then on each piece it calculates the function given by the third argument (i.e. mean())
Means <- aggregate(Yields$yield, list(Yields$Treatment), mean)
# Now plot the means: 
# To get an informative x-axis we have to not plot thex-axis here (using xaxt="n"), and add it back in the second line
plot(1:4, Means$x, xaxt="n", xlab="Treatment", ylab="Yield (t/ha)")
axis(1, Means$Group.1, at=1:4)

# Now add a horizontal line for the intercept. We can see it is the same as the Fertiliser mean
abline(h=coef(model1)["(Intercept)"], lty=2)
# Now add a horizontal line for intercept + FYM. We can see it is the same as the FYM mean
abline(h=coef(model1)["(Intercept)"]+coef(model1)["TreatmentFYM"], lty=2, col=2)

(you won’t need to know the extra R code for the exam, but you might find them useful in practice)

For most experiments, we would be more interested in the differences from a Control treatment. Here this is the Unfertilized treatment, so we should make everything a contrast to that. We can do this by changing the way R makes the factor, using the function relevel(). It simply changes the first level of the factor to be the one given by ref:

Yields$Treatment <- relevel(Yields$Treatment, ref = "Unfertilized")
model1a <- lm(yield ~ Treatment, data=Yields)
##          (Intercept)  TreatmentFertiliser         TreatmentFYM 
##            0.9122222            1.9616667            3.0122222 
## TreatmentFYM.stopped 
##            0.8094444

Now we see that the Unfertilized level is missing, so is the intercept (and also the Intercept equals the mean for the Unfertilized treatment). The other 3 levels are the differences between that level and the intercept.

  • Check that the model is the same, i.e. check that it gives the same predicted value for each level of Treatment (e.g. in the model with Fertiliser as the baseline, the intercept give the predicted value for the Fertiliser treatment. When Unfertilized is the baseline, the predicted value is the intercept plus the Fertiliser estimate)
  • Using the model with Unfertilisedas the baseline, explain what the results say (you may want to calculate the confidence intervals)
  • For this problem, which level do you prefer as the baseline, and why?

Why do we need contrasts?

We can actually remove the intercept:

modelNoInt <- lm(yield ~ Treatment - 1, data=Yields)
## TreatmentUnfertilized   TreatmentFertiliser          TreatmentFYM 
##             0.9122222             2.8738889             3.9244444 
##  TreatmentFYM.stopped 
##             1.7216667

This now gives us the group means. But these may not be helpful: does it matter if the fertiliser treatment gave an average yield of 2.9 t/ha? For a farmer not at Rothamsted, this may not be relevant: the comparison to other treatments is more important. When we have more than one factor, we get this:

modelNoIntInt <- lm(yield ~ Treatment + After1970 - 1, data=Yields)
## TreatmentUnfertilized   TreatmentFertiliser          TreatmentFYM 
##             0.6110417             2.5727083             3.6232639 
##  TreatmentFYM.stopped         After1970TRUE 
##             1.4204861             0.9035417

So After1970 still has a baseline and a contrast. We can also fit the model with the numbers the other way around:

modelNoIntInt2 <- lm(yield ~ After1970 + Treatment - 1, data=Yields)
  • What has changed, and why?
  • When might you prefer to use this version without the intercept? (your answers may vary!)