This Week: Categorical Variables, aka Factors

This week we will use this document to give the exercises. The reason for this is to try to build up your understanding of how categorical variables are modelled (the short version: it’s regression with just 0s and 1s). There will be questions in blue that you should provide answers to. There are other questions which you should attempt to answer as you work through this, even if you don’t have to provide us with answers.

Last week we looked at models for continuous variables. This week we will look at how we can model the effects of categorical variables on a continuous response, e.g. the effects of different varieties of barley on yield. A lot of the theory was developed for designed experiments, such as agricultural field trials.

This type of model is usually called an ANOVA model (ANOVA = ANalysis Of VAriance). If there is a single categorical variable, it is called a one-way ANOVA; if there are two, it is called a two-way ANOVA. The terminology comes from the way the models were developed: the approach was to partition the variation between the different parts of the model, so crop variety might explain some of the variation, and fungicide treatment might explain some more. The aim was to find out which crop varieties grew best, and what was the optimal treatment to apply. Nowadays the same approach is used for a much wider range of problems, often where this partitioning of the variation does not make any sense (it only really makes sense if you have a well-designed experiment; otherwise it is not possible to find only one way to partition the variation).
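The partitioning idea can be seen directly in R. This is a minimal sketch with simulated data (the variety and fungicide variables here are made up for illustration): anova() splits the total variation in the response into a part for each categorical variable, plus a residual part.

```r
# Simulated example: two crossed categorical variables
set.seed(1)
variety   <- factor(rep(c("A", "B"), each = 10))      # hypothetical crop varieties
treatment <- factor(rep(c("yes", "no"), times = 10))  # hypothetical fungicide treatment
y <- 5 + (variety == "B") * 1 + (treatment == "yes") * 0.5 + rnorm(20, sd = 0.3)

# The ANOVA table has one row per term plus a residual row,
# each with its own sum of squares
anova(lm(y ~ variety + treatment))
```

Because this simulated design is balanced, the partition is unique; with messy observational data the sums of squares can depend on the order of the terms, which is the point made above.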

Yield Data

This data comes from Rothamsted, from an experiment started in 1852. Spring barley has been grown continuously on the site (Hoosfield). Each plot can have one of 4 treatments:

  • Fertiliser: fertiliser is added
  • FYM: farmyard manure is added
  • FYM.stopped: farmyard manure was added before 1871, but not afterwards
  • Unfertilized: nothing is added

The response is yield, i.e. how much barley was harvested from the field: a higher yield is obviously better. The units are tonnes per hectare (t/ha). We have data that are means over about 10 years: we will treat these as replicates.

First we will look at the effects of fertiliser on yield, and ask if it makes a difference. But even before looking at the data, what difference in yield would you expect to see between the 4 treatments? Can you rank them from highest to lowest expected yield?

Yields <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week8/Hoosfield_Yields.csv")

Analysis 1: A t-test, also a One-Way ANOVA

We will start by comparing the unfertilised treatment and the treatment that was last fertilised in 1871. We can create a data set with just these levels of the treatment in it, but first we create a new variable: FYM.stopped is 1 if FYM used to be applied, and 0 if the plot was always unfertilised. The technical term for this 0/1 variable is a dummy variable: we will be using them throughout this week’s work.

# == finds entries in the part on the left of it that match the
# criteria on the right. It assigns trues and falses
Yields$FYM.stopped <- Yields$Treatment=="FYM.stopped" # TRUE = 1

# X%in%Y tests every value of X and asks if it is in Y, 
# e.g. (1:3)%in%c(2,4) would produce FALSE TRUE FALSE
# Because we have [ ] after Yields we are subsetting our
# original data here to only the FYM.stopped and Unfertilized treatments
tdata <- Yields[Yields$Treatment%in%c("FYM.stopped", "Unfertilized"),]

# So now we have data with a column 'FYM.stopped' filled with TRUE and FALSE
# This is ok because R treats TRUE as 1 and FALSE as 0
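You can check the logical-to-numeric coercion for yourself. A standalone sketch (no data needed):

```r
# R silently converts TRUE to 1 and FALSE to 0 in arithmetic
as.numeric(c(TRUE, FALSE, TRUE))  # 1 0 1
sum(c(TRUE, FALSE, TRUE))         # 2: each TRUE counts as 1

# And the %in% example from the comment above
(1:3) %in% c(2, 4)                # FALSE TRUE FALSE
```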

For this data set,

  • do a t-test, using t.test(). You haven’t used this function before so it might take some puzzling. Is there an effect of treatment? How large and what direction? Think about the output, which bit tells us if there is an effect? How do we quantify the effect?
  • with the FYM.stopped variable, fit a linear model with lm() (see Week 4). Is there an effect? How large and what direction? You will need to put the data into lm() differently to t.test. You need all of the response (yield) as Y and a single X which indicates which group the Y came from. Look back at how we did it for Zombies!
  • For the lm() model, write down the design matrix for an observation in each category (i.e. for one observation from the Unfertilized treatment, and one from the FYM stopped treatment). The design matrix is used to help R create the linear model, not created from the output of the lm(). Some examples are in last week’s lecture p14-16 https://www.math.ntnu.no/emner/ST2304/2019v/Week7/MultipleRegression.pdf
  • compare the two analyses. Are the results the same? In particular, look at the differences in the means and the confidence intervals
  • can you explain why the results are similar?
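If you want to see the equivalence in a setting where you know the true answer, here is a hedged sketch with simulated data (the variable names and effect sizes are made up, not the Hoosfield data): a two-sample t-test and a linear model with one 0/1 dummy estimate the same difference in group means.

```r
# Simulated two-group data
set.seed(42)
group <- rep(c(0, 1), each = 15)           # dummy: 0 = one group, 1 = the other
y <- 4 + 0.8 * group + rnorm(30, sd = 0.5)

# Classical two-sample t-test (equal variances, as lm() assumes)
t.test(y[group == 1], y[group == 0], var.equal = TRUE)

# The same comparison as a linear model:
# the 'group' coefficient is the difference in group means
summary(lm(y ~ group))
```

Note that t.test() defaults to the Welch test (unequal variances); with var.equal = TRUE its t statistic and p-value match the lm() output exactly, which is one way to answer the last question above.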

(As a historical aside, the t-test was invented by W.S. Gosset, who worked in Dublin turning barley into Guinness. He was helped by R.A. Fisher, who worked at Rothamsted on this and similar data.)

Analysis 2: A Two Way ANOVA

What if we want to explain yield with more than one categorical covariate, e.g. treatment and year? If these were continuous, we would use multiple regression. If both categorical variables can take 2 values, we can convert each of them to 0 and 1, and fit another linear model. Here we will do this to look at whether there is a difference in effects of fertilizer before and after 1970.

You should first create another dummy variable that is 1 if the start year is after 1969. Then you should fit the model with FYM.stopped and After1970 as predictors.

tdata$After1970 <- tdata$StartYear>1969 # create dummy variable
model1 <- lm(yield ~ FYM.stopped + After1970, data=tdata) # fit model
  • Write down the design matrix for this linear model with two explanatory categorical variables. Just do this for one example of each option, e.g. one row with FYM.stopped = TRUE and After1970 = TRUE, one row with FYM.stopped = FALSE and After1970 = TRUE, etc., covering all combinations of FYM.stopped and After1970 with no repeats (should be 4 rows). This is just to make the matrix shorter; otherwise you are writing out a lot!
  • From the model results, is there a difference in yield before and after 1970? In what direction, how large, and what are the confidence intervals? You will need to look at the output of the lm() and find the part that tells you about differences
  • Is there a difference in yield in the treatments that were never fertilised and those that were only fertilised before 1870? In what direction, how large, and what are the confidence intervals?
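If you want to check your hand-written design matrix, R can build it for you (this jumps ahead slightly to model.matrix(), introduced in Analysis 4). A standalone sketch with four synthetic rows, one per combination; the variable names follow this exercise:

```r
# One row for each combination of the two dummy variables
d <- data.frame(FYM.stopped = c(FALSE, TRUE, FALSE, TRUE),
                After1970   = c(FALSE, FALSE, TRUE, TRUE))

# The design matrix: an intercept column, then a 0/1 column per dummy
mm <- model.matrix(~ FYM.stopped + After1970, data = d)
mm
```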

We have now seen the basic idea of how to deal with categorical variables and to read the outputs from lm().

Analysis 3: A more complicated One-Way ANOVA

The trick of coding data as 0 or 1 can be extended to categorical variables with more than 2 categories. For, say, Treatment, we can create a dummy variable for each treatment: above we have only been using 2 of the 4 treatment groups, but we can use all 4.

i.e. we make the following variables:

  • Fertiliser: is 1 if the treatment was to add fertiliser, and zero otherwise,
  • FYM: 1 if the treatment was to add farmyard manure
  • FYM.stopped: 1 if the treatment was to add farmyard manure before 1871, but not afterwards

Now we have 3 dummy variables

  • Write down the design matrix for a model including all 4 treatments as the dummy variables above. Like Analysis 2, just do this for a single example of each combination to make it shorter (should be 7 rows this time).
  • How is this different to Analysis 2?
  • Why don’t we add a variable for the Unfertilized treatment?
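A hint for the last question, as a standalone sketch (the treatment names are the ones in this exercise): if we added a dummy for every treatment, the dummy columns would always sum to 1, i.e. to the intercept column, so one column would be redundant and the design matrix would not have full rank.

```r
# One row per treatment, with a dummy column for each treatment
trt <- c("Fertiliser", "FYM", "FYM.stopped", "Unfertilized")
X <- cbind(Intercept    = 1,
           Fertiliser   = as.numeric(trt == "Fertiliser"),
           FYM          = as.numeric(trt == "FYM"),
           FYM.stopped  = as.numeric(trt == "FYM.stopped"),
           Unfertilized = as.numeric(trt == "Unfertilized"))

rowSums(X[, -1])  # always 1: the dummies reproduce the intercept column
qr(X)$rank        # rank 4, not 5: one column carries no new information
```

Dropping one dummy (here Unfertilized) removes the redundancy, and the dropped group becomes the baseline that the other coefficients are compared to.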

Now, create each of these variables (we have already created FYM.stopped earlier, so you should adapt that line of code, creating TRUEs and FALSEs). Then fit a linear model with all 3 dummy variables.

  • Is there a difference in yield in the treatments? In what direction, how large, and what are the confidence intervals? Look at the output and find the parts that show difference - if unsure, try looking back at the linear model formula (Week 7 p8)
  • What specific differences are being estimated? What is the biological interpretation of these effects? What do the numbers refer to exactly? and can you write that with biological meaning (see Glossary on Blackboard for interpretation examples)
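The shape of this analysis can be sketched on simulated data (the yields below are made up; only the column and level names follow this exercise), which may help with the interpretation questions:

```r
# Simulated yields for the four treatment groups
set.seed(3)
d <- data.frame(Treatment = rep(c("Fertiliser", "FYM", "FYM.stopped",
                                  "Unfertilized"), each = 5),
                yield = rnorm(20, mean = 3))

# One dummy per treatment, except the baseline (Unfertilized)
d$Fertiliser  <- d$Treatment == "Fertiliser"
d$FYM         <- d$Treatment == "FYM"
d$FYM.stopped <- d$Treatment == "FYM.stopped"

m <- lm(yield ~ Fertiliser + FYM + FYM.stopped, data = d)
coef(m)  # intercept = mean of the Unfertilized group;
         # each other coefficient = that group's difference from it
```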

Analysis 4: ANOVA Without the hassle

If you are not already sick of writing dummy variables, you will be soon if you have to do it for every analysis. It will not surprise you to learn that R will do this for you. Not only that, but it will hide the computations from you. It is clever like that.

We can take a look at what R is doing. This is how you can see the design matrix:

# make the treatment column a factor
Yields$Treatment <- factor(Yields$Treatment) # actually, it already is.
# create the design matrix using model.matrix()
DesignMatrix <- model.matrix( ~ Treatment, data=Yields)
# look at the first 6 rows of the DesignMatrix
head(DesignMatrix)
##   (Intercept) TreatmentFYM TreatmentFYM.stopped TreatmentUnfertilized
## 1           1            0                    0                     1
## 2           1            0                    0                     1
## 3           1            0                    0                     1
## 4           1            0                    0                     1
## 5           1            0                    0                     1
## 6           1            0                    0                     1

Look at the whole matrix yourself, to see how it works. The first line makes Treatment a factor. We actually don’t need to do this here, as R does it automatically when read.csv() reads in a column with characters rather than numbers (if you want to stop this, you can use read.csv(..., stringsAsFactors=FALSE)).

Factors are the way R stores categorical variables that will be used in linear models (and similar, more complex, models). Each factor has several values: these are called levels (the groups or different treatments we have). We can see what the levels are of a factor with the levels() function:

levels(Yields$Treatment)
## [1] "Fertiliser"   "FYM"          "FYM.stopped"  "Unfertilized"

Internally R codes these as integers: which integer is given to which level is, essentially, arbitrary (a point we will return to). model.matrix() takes the full right hand side of the model and converts it into the design matrix. So here it does what you did in Analysis 3. You should check the design matrix with what you worked out earlier to see if it is correct.
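The integer coding, and how to change it, can be seen in a small standalone sketch (treatment names as in this exercise):

```r
# Levels are sorted alphabetically by default, then numbered
f <- factor(c("FYM", "Unfertilized", "Fertiliser"))
levels(f)      # "Fertiliser" "FYM" "Unfertilized"
as.integer(f)  # the (essentially arbitrary) internal integer codes

# relevel() moves a chosen level to the front, making it the
# baseline that lm() compares the other levels to
f2 <- relevel(f, ref = "Unfertilized")
levels(f2)     # "Unfertilized" is now the first level
```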

Now we can fit the model:

TheModel <- lm(yield ~ Treatment, data=Yields)
coef(TheModel)
  • Why do the results look different to those you have already? (hint: look at the column names in the design matrix)
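The hint can be illustrated with simulated data (the group names here are made up): which level is the baseline changes the coefficients you see, but not the model itself.

```r
# Simulated data with three groups
set.seed(8)
g <- factor(rep(c("A", "B", "C"), each = 10))
y <- rnorm(30, mean = c(2, 3, 4)[g])

coef(lm(y ~ g))                      # baseline level is "A"
coef(lm(y ~ relevel(g, ref = "C")))  # baseline "C": different numbers...

tapply(y, g, mean)                   # ...but both imply these same group means
```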

This difference can be a problem if you are not aware of it, but it has also been used to create some tools that can be useful. We will look at this in more detail later this week…