This data comes from Rothamstead, from an experiment started in 1852. Spring barley has been grown on the site (Hoosfield) continuously. Each plot can have one of 4 treatments:

The response is yield, i.e. how much barley was harvested from the field: a higher yield is obviously better. The units are tonnes per hecatare (t/ha). We have data that are means over about 10 years: we will treat these as replicates.

First we will look at the effects of fertiliser on yield, and ask if it makes a difference. But even before looking at the data, what difference in yield would you expect to see between the 4 treatments? Can you rank them from highest to lowest expected yield?

Yields <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week8/Hoosfield_Yields.csv")

Analysis 1: a t-test, also a One Way ANOVA

We will start by comparing the unfertilised treatment, and the treatment that was last fertilised in 1871. We can create a data set with just these levels of the treatment in it,but first we create : FYM.stopped is a variable that is 1 if FYM used to be applied, and is 0 if it was always unfertilised. The technical term for this 0/1 variable is a dummy variable: we will be using them throughout this week’s work.

# == finds entries in the part on the left of it that match the
# criteria on the right. It assigns trues and falses
Yields$FYM.stopped <- Yields$Treatment=="FYM.stopped" # TRUE = 1

# X%in%Y tests every value of X and asks if it is in Y, 
# e.g. (1:3)%in%c(2,4) would produce FALSE TRUE FALSE
# Because we have [ ] after Yields we are subsetting our
# original data here to only the FYM.stopped and Unfertilized treatments
tdata <- Yields[Yields$Treatment%in%c("FYM.stopped", "Unfertilized"),]

# So now we have data with a column 'FYM.stopped' filled with TRUE and FALSE
# This is ok because R treats TRUE as 1 and FALSE as 0

For this data set,

We want to compare two sets of tdata$yield data: for FYM.stopped and Unfertilized. First, we need to make the two vectors, one with the FYM.stopped treatment yields and another with the Unfertilized treatment yields. Because tdata only has the FYM.stopped and Unfertilized treatments, we can use the tdata$FYM.stopped variable (use str(tdata) if you wan to check what is in tdata). If tdata$FYM.stopped is TRUE, we have the FYM.stopped treatment, if it is false we have the Unfertilized treatment. So we can make the vectors like this:

FYM.stopped <- tdata$yield[tdata$FYM.stopped]
Unfertilized <- tdata$yield[tdata$FYM.stopped==FALSE]

For both, we only keep the values of tdata$yield when the vector in the square brackets is true. The two ways we do this are the same. In general, if we want to keep variables when A equals value X, we use A==X. But if A is a logical vecor (so either TRUE or FALSE), we can use A (as we do for FYM.stopped). if we want the values when A is FALSE, we could use !A (“not A”), so we could use Unfertilized <- tdata$yield[!tdata$FYM.stopped] instead.

Now we need to do the t-test:

t.test(FYM.stopped, Unfertilized)
## 
##  Welch Two Sample t-test
## 
## data:  FYM.stopped and Unfertilized
## t = 5.0686, df = 21.891, p-value = 4.527e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.478153 1.140736
## sample estimates:
## mean of x mean of y 
## 1.7216667 0.9122222

We can see the mean of x is 1.7 t/ha (the FYM stopped treatment), and of y is 0.91 t/ha, so the effect of having added manure before 1871 is to give a higher yield. The confidence interval is 0.48 - 1.14 t/ha, so the difference between the treatments is definitely positive, and large (even the lower 95% confidence interval suggests an increase in yield of around 50%).

You may find you get the same results except the signs are the other way around. This is fine: it means you used t.test(Unfertilized, FYM.stopped). All you need to do is to make sure you know what you did, and make sure you interpret the parameters in the right way, so if your estimates for the difference are negative, it means that y is smaller than x.

Fitting the model is easy: we just use the FYM.stopped variable as an explanatory variable:

tmodel <- lm(yield ~ FYM.stopped, data=tdata)

What is the effect? We can look at that with coef()

coef(tmodel)
##     (Intercept) FYM.stoppedTRUE 
##       0.9122222       0.8094444

The intercept is the predicted value when FYM.stopped equals zero (i.e. FALSE). You may notice that this is the same as the mean of the Unfertilised treatment in the t-test.

What is the FYM.stopped effect? This is the slope of the FYM.stopped effect. We can get an idea about what is happening if we plot the data and the fitted linear model:

plot(tdata$FYM.stopped, tdata$yield, xlab="FYM stopped treatment", ylab="Yield, t/ha")
abline(tmodel, col=2)

The red line is the model. It makes no sense to have the FYM stopped value of (say) 0.5: only 0 and 1 make sense. Now let’s add the intercept to the plot:

plot(tdata$FYM.stopped, tdata$yield, xlab="FYM stopped treatment", ylab="Yield, t/ha")
abline(tmodel, col=2)
abline(h=coef(tmodel)[1], col="hotpink")

We can see that this is where the fitted line meets the FYM stopped=0 treatment values (i.e. it is the unfertilised treatment mean). What about the FYM-stopped treatment? This has a value of 1, so the model for that is

\[ E(y) = \alpha + \beta x = 0.91 + 0.81 \times 1 = 1.72 \]

So this is the mean for the FYM stopped treatment. 0.81 is then simply the difference between the treatments: the yield in the treatment where manure was added before 1871 is 0.81 t/ha higher than the treatment where it was not added. Because the values of the explanatory variable can only be 0 or 1, the slope is also the difference between the means for the values.

OK, let’s take a deep breath. In the last lecture we had a design matrix like this:

\[ X = \left( \begin{array}{ccc} 1 & 2.3 & 3.0 \\ 1 & 4.9 & -5.3 \\ 1 & 1.6 & -0.7 \\ \vdots & \vdots & \vdots \\ 1 & 8.4 & 1.2 \\ \end{array} \right) \]

Each row is a data point. Each column is an explanatory variable. The first column is the intercept, so it is an explanatory variable that is always the same value. The second column is the values of the first explanatory variable, the third column is the values of the second explanatory variable. So what about our data? We have an intercept, and we have another explanatory variable, FYM.stopped. So the design matrix will look something like this:

\[ X = \left( \begin{array}{cc} 1 & ? \\ 1 & ? \\ 1 & ? \\ \vdots & \vdots \\ 1 & ? \\ \end{array} \right) \]

What are the question marks? These are the values for FYM.stopped explanatory variable. They can only be 0 or 1: 0 if FYM.stopped is at the intercept, i.e. the Unfertilised treatment, and 1 if it is the FYM stopped treatment. So if we have two data points, the first Unfertilised and the second FYM stopped, the design matrix would look like this:

\[ X = \left( \begin{array}{cc} 1 & 0 \\ 1 & 1 \\ \end{array} \right) \]

As these are the only possible types of data, all of the other rows would look like one of these.

The two analyses should be the same (or if there is any difference, it is because the signs are different). We have already seen that the point estimates are the same, i.e. the means for the two treatments (and this the difference between the treatments), but the confidence intervals are also the same. These are the confidence intervals for the t-test:

t.test(FYM.stopped, Unfertilized)$conf.int
## [1] 0.478153 1.140736
## attr(,"conf.level")
## [1] 0.95

And these for the linear model. Note the FYM.stopped effect:

confint(tmodel)
##                     2.5 %   97.5 %
## (Intercept)     0.6827324 1.141712
## FYM.stoppedTRUE 0.4848969 1.133992

They are similar because they are (almost) the same. In both cases, we are comparing the means of the two treatments, assuming the residuals are normally distributed. There is a alsight difference because t.test() assumes that the variance can be different between the x and y: if we use t.test(FYM.stopped, Unfertilized, var.equal = TRUE) we get exactly the same confidence interval.