This Week: Categorical Variables, aka Factors

Part A - All about data

This type of model is usually called an ANOVA model (ANOVA = ANalysis Of VAriance). If there is a single categorical variable it is called a one-way ANOVA; if there are two it is called a two-way ANOVA. The terminology comes from how the models were developed: the approach was to partition the variation between the different parts of the model, so crop variety might explain some of the variation, and fungicide treatment might explain some more. The aim was to find out which crop varieties grew best, and which treatment was optimal to apply. Nowadays the same approach is used for a much wider range of problems, often where this partitioning of the variation does not make any sense (it only really makes sense if you have a well designed experiment; otherwise there is no single way to partition the variation).

Yield Data

This data comes from Rothamsted, from an experiment started in 1852. Spring barley has been grown on the site (Hoosfield) continuously. Each plot had one of 4 treatments, coded in the data as Control, Fertilised, Manure, and Stopped.

The response is yield, i.e. how much barley was harvested from the field: a higher yield is obviously better. The units are tonnes per hectare (t/ha). We have data that are means over about 10 years: we will treat these as replicates.

This reads in the data. We use stringsAsFactors = FALSE to stop R converting our variables to factors (we will do this later, when we need to).

Yields <- read.csv("https://www.math.ntnu.no/emner/ST2304/2020v/Week08/Hoosfield_Yields.csv", 
                   stringsAsFactors = FALSE)
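It is worth having a quick look at the data before doing anything with it. This is a sketch: the column names yield, Treatment and After1970 are the ones used in the code later in this module.

str(Yields)              # check the column types: yield should be numeric,
                         # Treatment and After1970 character (not factors yet)
head(Yields)             # look at the first few rows
table(Yields$Treatment)  # how many replicates per treatment?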

A1. Think of, and write down, a biological question you could ask using these data.

Part B - Analysis

Now we move on to the analysis part. We will start by comparing pairs of treatments, using a t-test. Here is some (very fake) data:

x <- rep(c("A", "B"), each=30)
xIsB <- as.numeric(x=="B") # use this to make different means
y <- rnorm(length(x), xIsB, 1) # group A has mean 0, group B has mean 1
# Put the two groups into 2 vectors
yA <- y[x=="A"]; yB <- y[x=="B"]; 

plot(y, jitter(xIsB), yaxt="n", ylab="")
points(c(mean(yA), mean(yB)), c(0,1), pch=16, cex=2, col=2)
axis(2, at=c(0,1), labels=c("Group A", "Group B"), las=1)

We want to compare the means, the red dots. Our model assumes that both groups are normally distributed, i.e. \(y^A_i \sim N(\mu_A, \sigma^2)\) and \(y^B_j \sim N(\mu_B, \sigma^2)\). The difference between the means is \(D = \mu_A - \mu_B\), which we estimate with \(\hat{D} = \hat{\mu}_A - \hat{\mu}_B = \bar{y}_A - \bar{y}_B\). But we need to know the uncertainty in this. It turns out that \(\hat{D}\) divided by its standard error follows a t-distribution: \[ t = \frac{\bar{y}_A - \bar{y}_B}{\hat{\sigma}\sqrt{1/n_A + 1/n_B}} \sim t_{n_A + n_B - 2} \] where \(n_A + n_B - 2\) is the degrees of freedom and \(\hat{\sigma}^2\) is the pooled estimate of the variance. Thus we can calculate a 95% confidence interval as \(\left(\hat{D} + t_{n_A+n_B-2}(0.025)\,\hat{\sigma}\sqrt{1/n_A + 1/n_B},\; \hat{D} + t_{n_A+n_B-2}(0.975)\,\hat{\sigma}\sqrt{1/n_A + 1/n_B}\right)\). Or, if we want to do it in R:

Dhat <- mean(yA) - mean(yB)

# Estimate s^2: the pooled variance. With equal group sizes this is just the
# average of the two group variances (sum(Resids^2)/df gives the same answer)
Resids <- c(yA - mean(yA), yB - mean(yB))
sigma2hat <- (var(yA) + var(yB))/2
StdErr <- sqrt(sigma2hat) * sqrt(1/length(yA) + 1/length(yB)) # standard error of Dhat
df <- length(yA) + length(yB) - 2 # degrees of freedom
(CI <- Dhat + qt(c(0.025, 0.975), df)*StdErr)
## [1] -1.18647229 -0.02989767

Of course, R provides a function, t.test(), to do this:

t.test(yA, yB, var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  yA and yB
## t = -2.1052, df = 58, p-value = 0.03961
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.18647229 -0.02989767
## sample estimates:
## mean of x mean of y 
## 0.4223766 1.0305615

Part B - t-tests in R

Now we move on to the analysis. We want you to pick the question that is closest to your answer to A1.

All of these questions can be answered with a t-test. For the question you have chosen, you will need to create vectors for each treatment:

ControlYield <- Yields$yield[Yields$Treatment=="Control"]
FertilisedYield <- Yields$yield[Yields$Treatment=="Fertilised"]
ManureYield <- Yields$yield[Yields$Treatment=="Manure"]
StoppedYield <- Yields$yield[Yields$Treatment=="Stopped"]
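If you want a sketch of what the t-test itself might look like in R, here is one comparing Manure and Stopped (swap in the vectors for the pair of treatments your question is about):

# Two-sample t-test assuming equal variances, comparing two of the treatments
t.test(ManureYield, StoppedYield, var.equal = TRUE)
# Look at the estimated group means, the confidence interval for the
# difference, and the p-value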

Part B - First interpretation

B1. What did you find out from your analysis? (what was the main conclusion)

B2. Find someone who looked at a different question. Share your results.

B3. Do you think your analysis gave the whole picture of the results of the experiment?

Part C - t-tests as linear models

Here we have 4 treatments, and comparing them all separately will be a mess. We might also have more than one type of treatment (e.g. we can decide to look at applying fertiliser and fungicide in the same experiment). It is easier to look at everything in one model, and this also improves the estimates, because the overall error is smaller (this is similar to last week, where a multiple regression improved the estimates of the effects of each variable). Because the models get more complicated, we need a general way of writing them, which is what this section builds towards.

First, we will write a t-test in three ways.

t-tests as t-tests

\(y^A_i\) and \(y^B_j\) are vectors with the response in them. They have means \(\mu_A\) and \(\mu_B\) and a common variance (\(\sigma^2\)). The t-test asks if \(\mu_A = \mu_B\).

t-tests as an ANOVA

We have one response, \(y_{ij}\), where \(i\) says which group \(y_{ij}\) is in (i.e. A or B), and \(j\) is the \(j^{th}\) observation in group \(i\).

\[ y_{ij} \sim N(\mu + \alpha_i, \sigma^2) \]

There is a common mean, \(\mu\), and group effects \(\alpha_i\). If the \(\alpha_i\)'s are different, there is an effect of the grouping. This is an analysis of variance because the classical approach has been to look at \(\sum \alpha_i^2\), which (conveniently) follows a \(\chi^2\) distribution.

t-tests as a regression model

Again we have one response, \(y_{i}\), where \(i\) denotes the \(i^{th}\) observation. It has a covariate \(X_i\), where

\[ X_i = \begin{cases} 0 & \text{if observation } i \text{ is in group A} \\ 1 & \text{if observation } i \text{ is in group B} \end{cases} \]

then \(y_{i} \sim N(\alpha + \beta X_i, \sigma^2)\). We thus simply do a regression against \(X_i\). We call \(X_i\) a dummy variable: it can only take the values 0 and 1, and later we will build more complicated analyses out of several dummy variables.

This works because

\[ E(y_i) = \begin{cases} \alpha & \text{if } X_i = 0 \text{ (group A)} \\ \alpha + \beta & \text{if } X_i = 1 \text{ (group B)} \end{cases} \]

All of the models are the same

For all of the models we have the same variance, and a mean for each of the two groups. But we write the means in different ways:

  • For the t-test we have 2 vectors, each with a mean
  • For the ANOVA we have 1 vector, plus a covariate which says whether the mean is \(\mu + \alpha_A\) or \(\mu + \alpha_B\)
  • For the linear model the mean is \(\alpha + \beta X_i\), which is \(\alpha\) or \(\alpha + \beta\), depending on \(X_i\)

A Note about Identifiability

For the ANOVA model we have \(\mu_A = \mu + \alpha_A\) and \(\mu_B = \mu + \alpha_B\). What if we add a constant, \(C\), to \(\mu\), and subtract the same constant from each \(\alpha_i\)? We would get

\(\mu_i = \mu + C + \alpha_i - C = \mu + \alpha_i\)

So we need to “fix” something. One way to do this is to say \(\sum_i n_i \alpha_i = 0\), so that \(\mu\) is the grand mean of the data. Another way is to say \(\alpha_A = 0\), so \(\mu_A = \mu\) and \(\mu_B = \mu + \alpha_B\). This looks like the linear model formulation.
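Here is a sketch of the two constraints in R, using the fake data from above. R's default is the second ("corner") constraint, called contr.treatment; the sum-to-zero constraint is contr.sum. Both give exactly the same fitted values, only the parameterisation differs.

xFac <- factor(x)
coef(lm(y ~ xFac))                                        # intercept = mean of group A
coef(lm(y ~ xFac, contrasts = list(xFac = "contr.sum")))  # intercept = mean of the group means
                                                          # (the grand mean here, as the groups are the same size)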

We will come back to this later. For now it is worth noting that we write the models in different ways, depending on what we want to do. The linear model formulation lets us focus on the difference, \(\mu_B - \mu_A = \beta\).

Fitting t-tests as linear models

We can do a t-test using lm()

xIsB <- as.numeric(x=="B")
mod0 <- lm(y ~ xIsB)
round(summary(mod0)$coefficients,2)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.42       0.20    2.07     0.04
## xIsB            0.61       0.29    2.11     0.04

xIsB can be 0 or 1, so this is a regression. Doing it this way means we have to convert our categorical variable into a number. But R can do this for us (which is a lot more convenient for more complicated situations). It calls categorical variables factors. Factors can only take specific values, which we call levels. We can give the different possible values of the factor sensible names, e.g. “Control”, “Fertilised”, and let R do the conversion to numbers internally. But we have to understand something about this process if we want to interpret the output of the analysis:

x.Factor <- factor(x)
mod0F <- lm(y ~ x.Factor)
round(summary(mod0F)$coefficients,2)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)     0.42       0.20    2.07     0.04
## x.FactorB       0.61       0.29    2.11     0.04

Here x.FactorB means that there is a variable x.Factor, and the estimate is for its level B. If x.Factor is A, it gets coded internally as 0; if x.Factor is B, it gets coded internally as 1.
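As a quick sanity check (a sketch using the objects created above), the x.FactorB estimate is just the difference between the two group means, the same quantity the t-test was estimating:

coef(mod0F)["x.FactorB"]   # the slope: difference between the group means
mean(yB) - mean(yA)        # the same number (note this is -Dhat from earlier)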

Part C - Exercise: Fit the models with lm()

For the question you looked at before, use lm() to fit the model (i.e. to do the t-test).

You will have to create the correct data frame to do this:

ManureStopped <- Yields[Yields$Treatment=="Manure" | Yields$Treatment=="Stopped" ,]
ManureStopped$Treatment <- factor(ManureStopped$Treatment)

modMS <- lm(yield ~ Treatment, data=ManureStopped)
coef(modMS)
summary(modMS)

C1. What did you find out from your analysis with the new model?

C2. Do you reach the same conclusion as you did before?

Part D - Factors with More than 2 Levels

Now we want to look at all of the treatments together. In Part C we made x into a factor. This is the type of object we use for categorical variables, because R knows how to use it in lm():

Yields$Treatment <- factor(Yields$Treatment)
levels(Yields$Treatment)
## [1] "Control"    "Fertilised" "Manure"     "Stopped"

Internally R converts the levels to numbers, and in the analysis it turns these into columns of 0s and 1s, so that it can regress the response against those columns:

(A.Factor <- rep(c("A", "B"), each=3))
## [1] "A" "A" "A" "B" "B" "B"
model.matrix(~A.Factor)[1:6,]
##   (Intercept) A.FactorB
## 1           1         0
## 2           1         0
## 3           1         0
## 4           1         1
## 5           1         1
## 6           1         1

The (Intercept) column is also in the form of a regression covariate: every data point has the value 1.

With more than 2 levels of a factor, R creates a multiple regression.

(A.Factor3 <- c("A", "B", "C"))
## [1] "A" "B" "C"
model.matrix(~A.Factor3)[1:3,]
##   (Intercept) A.Factor3B A.Factor3C
## 1           1          0          0
## 2           1          1          0
## 3           1          0          1

The variables can be thought of as “Is it B?” and “Is it C?”. If a data point is not either of these, it must be A. And no data point can be both B and C, i.e. each observation has exactly one level of the factor. So we can fit the model for the yield data:

mod.Treatments <- lm(yield ~ Treatment, data=Yields)
summary(mod.Treatments)

You should fit the model with the code above.

D1. How should you interpret the estimates of the Treatment effect? (Two hints: what level is missing from the summary table? And compare the estimates to those from the t-tests above.)

Part D - Categoricals in R: Interpretation

Identifiability was mentioned earlier. With more than 2 levels it is still a problem. With 3 levels we have \(\mu_A\), \(\mu_B\), and \(\mu_C\). We can write these as \(\mu + \alpha_i\), but the same problem appears: we can add \(C\) to \(\mu\) and subtract it from each \(\alpha_i\), and still get the same means. So we have to fix something. There are several ways to do this. R does it by setting one level to be a baseline, and the others are contrasts to that level. So the TreatmentFertilised effect is \(\mu_{\text{Fertilised}} - \mu_{\text{Control}}\).
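As a sketch of what this means for the yield data (using mod.Treatments from above), the TreatmentFertilised coefficient should be exactly the difference between the Fertilised and Control group means:

with(Yields, mean(yield[Treatment == "Fertilised"]) - mean(yield[Treatment == "Control"]))
coef(mod.Treatments)["TreatmentFertilised"]   # should be the same number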

Part E - Models with Two categorical variables

The treatments in the yields experiment changed over time. Some particularly large changes happened around 1970, so we want to know if these had an effect. Later we will ask if the effect differs between the treatments.

This is like a multiple regression, so we can do this:

Yields$After1970 <- factor(Yields$After1970) # Make After1970 a factor
Yields$After1970 <- relevel(Yields$After1970, ref="Before") # Make before the intercept
mod.2way <- lm(yield ~ Treatment + After1970, data=Yields)
summary(mod.2way)

But what does it mean?
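One way to see what the additive model means is to look at its fitted mean for each combination of the two factors: each fitted mean is the intercept plus the relevant Treatment and After1970 coefficients. Here is a sketch using predict() (it assumes mod.2way from the code above has been fitted):

newdat <- expand.grid(Treatment = levels(Yields$Treatment),
                      After1970 = levels(Yields$After1970))
newdat$fit <- predict(mod.2way, newdata = newdat)
newdat   # one fitted mean per Treatment x After1970 combination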

Part E - Models with Two categorical variables

Fit the one-way models (i.e. the model with Treatment, and the model with After1970), and then the two-way model.

If you can do this, all other models are built up in the same way.

Part E - Models with Two categorical variables

First, we can look at the models with one variable:

mod.Treat <- lm(yield ~ Treatment, data=Yields)
mod.After <- lm(yield ~ After1970, data=Yields)
round(coef(mod.Treat), 2)
##         (Intercept) TreatmentFertilised     TreatmentManure 
##                0.91                1.96                3.01 
##    TreatmentStopped 
##                0.81
coef(mod.After)
##     (Intercept) After1970Before 
##       2.9604167      -0.9035417

Our intercepts (= the reference levels) are:

  • Control for the Treatment model
  • After for the After1970 model (you can see this because the coefficient that is shown is After1970Before)

For the After1970 model the treatment levels are ignored, so its intercept is a mix (an average) of all of them.
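As a sketch of this, compare the raw period means to the coefficients of mod.After: because Treatment is not in the model, the intercept is just the mean yield in the baseline period, averaged over all four treatments.

tapply(Yields$yield, Yields$After1970, mean)   # mean yield in each period
coef(mod.After)   # the intercept is the mean for the baseline period;
                  # the other coefficient is the difference between the periods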