Learning Outcomes

This is a question that you should try to answer

Answer This is where the answer will be. Unless it says “Hint”, in which case it is a hint to help you.

Categorical Variables and ANOVA

In the last couple of weeks we have looked at problems where there were continuous variables, and we were trying to explain \(Y\) by \(X\). Here we will start to look at another common problem, where our continuous response, \(Y\), is explained by some discrete factors. For example, we might ask whether the yield of a crop (e.g. the amount of barley) is affected by putting different fertilisers on it. The same design is used across a lot of science and technology: in clinical trials, for example, one or more treatments are compared to a control to see if they have an effect. Or we might ask whether wearing a red or blue shirt means you are more likely to win a judo bout.

For some of these examples the analyses are more complicated: the data are often not normally distributed, but a different distribution (e.g. Poisson, binomial) works well. And several factors and covariates often have to be accounted for, which makes the whole problem more complicated. But we will start with a simple model, because the ideas in this module form the basis of how to tackle more complex problems. We have already looked at a simple example, where we are comparing two categories: the two-sample t-test. Here we will extend that to more than two samples.

The type of problem we are looking at is often called ANOVA, for ANalysis Of VAriance. The name conflates two aspects of the problem: the model, where the explanatory variable (i.e. the \(x_i\)) is categorical, and the method used to test whether a factor has an effect, which works by decomposing the variance into components due to different factors.

This area of statistics also drifts into issues of how to design experiments: the ANOVA approach easily extends into more complex problems: there can be more than one type of treatment (e.g. plant variety and fertiliser), or the design is more complicated (e.g. the experiment is repeated over several fields). We can even combine categorical and continuous covariates, i.e. combine this week’s ANOVA with last week’s regression.

One-Way ANOVA: the Experimental Design

The basic design is that we have a set of treatments that we want to test. For example, we might have more than one fertiliser, or students in different courses. Our response is continuous, e.g. yield, or exam scores, and we assume they are normally distributed with a mean that depends on the category:

\[ Y_{ij} \sim N(\mu_j, \sigma^2) \]

Which we can also write as \(Y_{ij} =\mu_j + \varepsilon_{ij}\), \(\varepsilon_{ij} \sim N(0, \sigma^2)\), where \(j = 1,\dots,k\) indexes the groups, and there are \(i, i = 1,\dots,n_j\) observations for each treatment.

We call the set of treatments a factor, and say that it has different levels, i.e. each treatment is a level of the factor.

The questions we want to ask about this data revolve around whether the treatments are different, e.g. are they all the same, or are treatments different from a control? And how different are they?
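As a concrete sketch of this model, here is a small R simulation (the group means and standard deviation here are made up for illustration, not taken from any real trial):

```r
# Simulate one-way ANOVA data: k = 3 levels, n_j = 10 observations per level
set.seed(42)
k <- 3
n_j <- 10
mu <- c(5, 6, 7)       # hypothetical true means, one per level
sigma <- 1.5           # common standard deviation

group <- factor(rep(1:k, each = n_j))
y <- rnorm(k * n_j, mean = mu[group], sd = sigma)  # Y_ij = mu_j + eps_ij

tapply(y, group, mean)  # sample means: one estimate per level
```

The sample means should land near the true 5, 6, 7, with scatter governed by \(\sigma/\sqrt{n_j}\).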

Example: Potatoes

This data comes from a field trial on potato yields. The problem was to work out what is a good fertiliser. A field was sown with 36 plots of potatoes (12 varieties, but we will ignore that for now), and each plot was treated with dung plus one of three further treatments: a control, sulphate, or chloride. The control treats the plants in the same way, except the extra fertiliser is not added. The response is the yield, i.e. the weight of potatoes harvested. The measurement is pounds per plant: the pound is an English unit often written as lb; 1 lb ≈ 454 g.

This is the data:

(the different levels are on the y-axis. The data have been randomly “jittered” up or down a bit so they don’t totally overlap)

The questions the trial wanted to answer were whether the fertiliser improved the yield, and by how much (the fertiliser costs money, so a farmer will want to know if the cost is outweighed by the extra potatoes they will get). So, we want to compare the treatments to the control.

For today, there are 3 treatments, and 36 observations per treatment. This means 108 observations in total.

Some Notation

Much of what we will be doing involves means and sums of random variables. So we often use a dot notation to show what we are summing over. So if our variable is \(Y_{ij}\) where \(j\) is the treatment level and \(i\) is the \(i^\textrm{th}\) observation for level \(j\), the mean for level \(j\) is \(\bar{Y}_{.j} = \frac{1}{n_j}\sum_{i=1}^{n_j}Y_{ij}\).

Using the same notation, what is \(\bar{Y}_{..}\)?

Answer

It is the mean over all observations:

\(\bar{Y}_{..}= \frac{1}{n}\sum_{j=1}^{k}\sum_{i=1}^{n_j}Y_{ij} = \frac{1}{n}\sum_{j=1}^{k}n_j\bar{Y}_{.j}\), where \(n = \sum_{j=1}^{k}n_j\).

We call this the grand mean.
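In R, both means are one-liners. Note that with unequal group sizes the grand mean is the mean of all the observations, not the mean of the group means (the data here are made up for illustration):

```r
y <- c(2.1, 2.5, 1.9,  3.0, 3.4,  2.8, 2.6, 3.1)
g <- factor(c("A", "A", "A", "B", "B", "C", "C", "C"))  # unequal n_j

ybar_j     <- tapply(y, g, mean)  # level means, Ybar_.j
grand_mean <- mean(y)             # grand mean, Ybar_..

mean(ybar_j)  # NOT the grand mean when the n_j differ
grand_mean
```

With a balanced design (all \(n_j\) equal) the two agree, which is why the distinction is easy to miss.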

Estimation of Treatment Means

We will start by looking at the estimate of a treatment mean. This basically means taking results from Chapter 7 (= Week 2), so this is mainly to help you see the connections.

Show that \(\hat{\mu}_j= \frac{1}{n_j}\sum_{i=1}^{n_j}y_{ij}\)

Hint: The answer is pasted from Week 2’s module, so you don’t need to do the whole proof, just work out which proof it is, and why.

Answer

This is the same as the mean for a single normal distribution. The log likelihood is

\[ \begin{align} l &= -n_j\log(\sqrt{2 \pi \sigma^2}) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n_j}(Y_{ij} - \mu_j)^2 \\ &= C - \frac{1}{2 \sigma^2} \left(\sum_{i=1}^{n_j}Y_{ij}^2 - 2\sum_{i=1}^{n_j}Y_{ij}\mu_j + \sum_{i=1}^{n_j}\mu_j^2 \right) \\ &= C - \frac{1}{2 \sigma^2} \left(\sum_{i=1}^{n_j}Y_{ij}^2 - 2n_j \bar{Y}_j\mu_j + n_j\mu_j^2 \right) \end{align} \]

Now differentiate w.r.t \(\mu_j\) and set to 0:

\[ \frac{dl}{d\mu_j} = -\frac{1}{2 \sigma^2} \left(-2n_j \bar{Y}_j + 2 n_j\mu_j \right) = 0 \] And then rearrange to get \(\mu_j = \bar{Y}_j\):

\[ \begin{align} -\frac{2n_j}{2 \sigma^2} \left(- \bar{Y}_j + \mu_j \right) &= 0 \\ -\bar{Y}_j + \mu_j &= 0 \\ \mu_j &= \bar{Y}_j \end{align} \]

If we only had the data from treatment \(j\), \(\hat{\sigma}^2= S_j^2 = \frac{1}{n_j-1}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2\). But with several treatments we can get a better estimate. We know that \((n_j-1)S^2_j/\sigma^2 = \sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2/\sigma^2\) follows a \(\chi^2_{n_j-1}\) distribution (from Week 3). So what happens when we sum this up over all of the treatments, i.e. what happens to \(SSE = \sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2\)?

(SSE = “Sum of Squares of the Error”. It is also the residual sum of squares)

Show that \(SSE/\sigma^2\) follows a \(\chi^2_{n-k}\) distribution (remember \(n=\sum_{j=1}^{k}n_j\)).

Hint: use the properties of a \(\chi^2\) distribution (and the independence of each observation).

Answer Theorem 12.2.3a.

Show that \(\hat{\sigma}^2 = \frac{1}{n-k}\sum_{j=1}^{k}\sum_{i=1}^{n_j}(Y_{ij}-\bar{Y}_{.j})^2\) is an unbiased estimator for \(\sigma^2\), i.e. \(E\left(\frac{1}{n-k} SSE \right)=\sigma^2\)

Answer

Because \(SSE/\sigma^2 \sim \chi^2_{n-k}\), \(E(SSE/\sigma^2)=n-k\), so \(E(SSE)=(n-k)\sigma^2\).

Then \(E(\frac{1}{n-k}SSE) = \frac{1}{n-k} E(SSE) = \frac{1}{n-k} \sigma^2(n-k)=\sigma^2\)
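This is easy to check by simulation. A quick Monte Carlo sketch in R (not a proof; the group sizes, means, and variance are arbitrary):

```r
# Check that SSE/(n - k) is (approximately) unbiased for sigma^2
set.seed(1)
k <- 4; n_j <- 5; sigma2 <- 2
est <- replicate(5000, {
  g <- factor(rep(1:k, each = n_j))
  y <- rnorm(k * n_j, mean = as.integer(g), sd = sqrt(sigma2))  # means 1..4
  sse <- sum((y - ave(y, g))^2)  # ave() replaces each obs by its group mean
  sse / (k * n_j - k)            # SSE / (n - k)
})
mean(est)  # should be close to sigma2 = 2
```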

Derive the confidence interval for \(\hat{\mu}_j\). How is it different from if there is only one treatment?

Hint: this is almost the same as Chapter 7.

Answer

The confidence interval is \(\bar{Y}_{.j} \pm t_{\alpha/2, n-k}\sqrt{\frac{SSE}{(n-k)n_j}}\).

The difference is that we use an estimate of \(\sigma\) based on all of the data. This makes sense as we assume that the variance is the same across all of the data, so using all of the data gives a more precise estimate.
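A sketch of this interval in R, pooling the SSE across all groups (the function name and simulated data are mine, for illustration):

```r
# CI for each treatment mean, using the pooled estimate of sigma^2
group_ci <- function(y, g, level = 0.95) {
  g   <- factor(g)
  n   <- length(y); k <- nlevels(g)
  sse <- sum((y - ave(y, g))^2)                      # pooled SSE
  se  <- sqrt(sse / (n - k) / as.vector(table(g)))   # one s.e. per level
  m   <- as.vector(tapply(y, g, mean))
  tq  <- qt(1 - (1 - level) / 2, df = n - k)         # t quantile, n - k df
  data.frame(level = levels(g), mean = m,
             lower = m - tq * se, upper = m + tq * se)
}

set.seed(7)
y <- rnorm(30, mean = rep(c(10, 12, 11), each = 10))
g <- rep(c("a", "b", "c"), each = 10)
group_ci(y, g)
```

The only change from the one-sample interval is the degrees of freedom (\(n-k\) rather than \(n_j-1\)) and the pooled \(s^2\).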

Potato Yield

For the control treatment in the potato data, this is the data:

##  [1] 2.82 2.42 2.75 1.61 1.43 3.07 3.50 2.89 2.00 1.96 2.55 4.21 1.75 2.17 2.75
## [16] 2.00 2.25 3.25 2.32 4.20 3.00 2.86 3.39 3.64 4.71 2.17 3.32 2.46 2.79 3.50
## [31] 3.29 4.32 3.88 3.56 3.36 4.11

The mean is 2.95 lbs/plant. We can calculate the SSE for each treatment. This is some code to do it, which I hope is clear enough (the code is written in R, but this is very similar to C and Python):

(SSE.Control <- sum((Control.Yield - mean(Control.Yield))^2))
## [1] 23.7593
(SSE.Chloride <- sum((Chloride.Yield - mean(Chloride.Yield))^2))
## [1] 23.4516
(SSE.Sulphate <- sum((Sulphate.Yield - mean(Sulphate.Yield))^2))
## [1] 24.1385

The total SSE is just 23.76 + 23.45 + 24.14 = 71.35, and then \(\hat{\sigma}^2 = \frac{71.35}{108-3} = 0.68\).
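In R, the pooling above is just:

```r
# Pool the three within-treatment sums of squares computed above
SSE <- 23.7593 + 23.4516 + 24.1385
SSE                             # 71.3494
sigma2.hat <- SSE / (108 - 3)   # n - k = 105 degrees of freedom
round(sigma2.hat, 2)            # 0.68
```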

From this we can calculate the confidence interval for \(\mu_{Control}\) as 2.95 \(\pm t_{0.05/2, 105}\sqrt{\frac{71.35}{(108-3)\times 36}} = 2.95 \pm 1.98 \times 0.14 =\) (2.68, 3.22).

What are the confidence intervals for the other two treatments?

The treatment means are 3.03 for the Chloride treatment and 3.09 for the Sulphate treatment.

Answer

Chloride: 3.03 \(\pm t_{0.05/2, 105}\sqrt{\frac{71.35}{(108-3)\times 36}} =\) (2.76, 3.31).

Sulphate: 3.09 \(\pm t_{0.05/2, 105}\sqrt{\frac{71.35}{(108-3)\times 36}} =\) (2.82, 3.36).

Notice that the only differences are due to the different means. We use the same estimate of \(\hat{\sigma}^2\), and because there is the same number of observations in each treatment (the experimental design is “balanced”), the standard error has to be the same.

Whilst this estimation is useful, the purpose of collecting this data is to compare different treatments, e.g. to see if adding a fertiliser is better than a control.

Estimation of Contrasts

If we have 2 treatments, then a sensible comparison is the difference in means. This is particularly sensible because we have already looked at this problem, in Week 3 (or Chapter 7).

What is the estimate for \(\mu_2-\mu_1\), and how would you construct a confidence interval?

Hint: this is revision, so you can see how it fits in to what we are doing now. There is also an element of working out the notation.

Answer

The estimate is \(\bar{Y}_2 - \bar{Y}_1\). The confidence interval is

\[ \bar{Y}_2-\bar{Y}_1 \pm t_{\alpha/2, n-k} s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \]

where \(s = \sqrt{\frac{SSE}{n-k}}\).

Note that this works for the comparison of any 2 treatments even if there are more than 2. The generalisation works because the numerator is Gaussian and the denominator is the standard deviation of the numerator. This means we can generalise even further.

We know that if (in general) \(X_i \sim N(\mu_i, \sigma^2_i)\), \(i=1,\dots,m\), and are independent, \(\sum_{i=1}^m a_i X_i\sim N(\sum_{i=1}^m a_i\mu_i, \sum_{i=1}^m a^2_i\sigma^2_i)\). So we can use this to create more sophisticated estimates, which we call contrasts.

A contrast is a linear combination of the true means, i.e. \(C = \sum_{j=1}^k c_j \mu_j\) such that the \(c_j\)’s are constant (i.e. known, and set by whoever is doing the analysis), and \(\sum_{j=1}^k c_j = 0\).

For a t-test (i.e. k=2), what is the vector of contrasts?

Answer \(C=(1,-1)\). Although \(C=(-1,1)\) also works.

For a design with 5 treatments (i.e. \(k=5\)), what is the vector of contrasts to estimate the difference between treatments \(j=1\) and \(j=3\)?

Answer \(C=(1,0,-1,0,0)\) (or \(C=(-1,0,1,0,0)\)) The zeroes remove the treatments we are not interested in.

We often have experiments with a control and several treatments, e.g. a placebo (a sugar pill) and several drugs. One comparison would be if the control was different from the treatments.

If there are \(k\) treatments, with the first treatment being a control, what would the contrasts be to compare the control to the other treatments (assuming the other treatments are equivalent)?

Answer There are actually many possibilities, but the basic structure is \(C=(-(k-1),1,1,\dots,1)\). We can multiply this by any constant, of course. One possibility would be to multiply by \((k-1)^{-1}\), so the contrast is of treatment 1 to the mean of the other treatments.

For a design with \(k\) treatments, and a vector of coefficients \(C=(c_1,c_2,\dots,c_k)\), what unbiased estimator of the contrast, \(\sum_{j=1}^k c_j \mu_j\), would we use? And how would we construct a confidence interval?

Answer

The obvious estimator is \(\hat{C} = \sum_{j=1}^k c_j \bar{Y}_{.j}\). This is unbiased because

\[ E(\hat{C}) = E\left(\sum_{j=1}^k c_j \bar{Y}_{.j}\right) = \sum_{j=1}^k c_j E(\bar{Y}_{.j}) = \sum_{j=1}^k c_j \mu_j = C \]

And the variance is

\[ Var(\hat{C}) = Var\left(\sum_{j=1}^k c_j \bar{Y}_{.j}\right) = \sum_{j=1}^k c^2_j Var(\bar{Y}_{.j}) = \sum_{j=1}^k c_j^2 \frac{\sigma^2}{n_j} = \sigma^2\sum_{j=1}^k \frac{c_j^2}{n_j} \]

We can obviously use \(s^2= SSE/(n-k)\) as an estimator of \(\sigma^2\), so \(\hat{C}\) will follow a t-distribution (after centring and scaling), and the confidence interval is \(\hat{C} \pm t_{\alpha/2,n-k} \sqrt{\frac{SSE}{n-k} \sum_{j=1}^k \frac{c_j^2}{n_j}}\).
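As a sketch, the whole calculation in R (the function name and the simulated data are mine, for illustration):

```r
# Estimate a contrast sum(c_j * mu_j) with a t-based confidence interval
contrast_ci <- function(y, g, cvec, level = 0.95) {
  g <- factor(g)
  stopifnot(length(cvec) == nlevels(g), abs(sum(cvec)) < 1e-12)
  n  <- length(y); k <- nlevels(g)
  nj <- as.vector(table(g))
  C.hat <- sum(cvec * tapply(y, g, mean))   # sum of c_j * Ybar_.j
  sse   <- sum((y - ave(y, g))^2)           # pooled SSE
  se    <- sqrt(sse / (n - k) * sum(cvec^2 / nj))
  tq    <- qt(1 - (1 - level) / 2, df = n - k)
  c(estimate = C.hat, lower = C.hat - tq * se, upper = C.hat + tq * se)
}

set.seed(2)
y <- rnorm(30, mean = rep(c(1, 1, 2), each = 10))
g <- rep(c("ctrl", "t1", "t2"), each = 10)
contrast_ci(y, g, c(-1, 0, 1))   # compare ctrl with t2
```

The `stopifnot()` guard enforces the defining property of a contrast: the coefficients must sum to zero.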

Contrasts are discussed more in §12.4: we will return to them later. But now you know how to compare different treatments.

Potato Yield

The natural question is what is the difference between the control and (say) the Sulphate treatment.

The contrast could be (-1,0,1), if the treatments are in the order (Control, Chloride, Sulphate). The estimate of the contrast would be \(-1\times\) 2.95 \(+ 0\times\) 3.03 \(+ 1\times\) 3.09 = 0.14. The 95% confidence interval is

0.14 \(\pm t_{\alpha/2,n-k}\sqrt{\frac{71.35}{108-3}\left( \frac{(-1)^2}{36} + \frac{0^2}{36} + \frac{1^2}{36} \right)} =\) (-0.25, 0.52)

So the estimate is that yield changes by 0.14 lbs/plant (or about 64 g), but the confidence interval says that sulphate could reduce yield by 0.25 lbs/plant or increase it by 0.52 lbs/plant. Remember that the average yield for a control is 2.95 lbs/plant, so the increase could be 0.52/2.95 = 0.18 times, i.e. about a 20% increase. But it could also be a 10% decrease. These differences could be important (but ask an agronomist, i.e. someone who works with yields, to decide if they really are). So this suggests we really need more data (or a better model) to decide if there is an effect worth following up.

What is the estimate of the contrast where the control is compared to the average of the other treatments?

Answer

There are many contrasts we could use, but they should weight the treatments equally, and they should sum to 0. So (-2,1,1) is a simple choice.

The estimate of the contrast would be \(-2\times 2.95 + 1\times 3.03 + 1\times 3.09 = 0.22\). The 95% confidence interval is

0.22 \(\pm t_{\alpha/2,n-k}\sqrt{\frac{71.35}{108-3}\left( \frac{(-2)^2}{36} + \frac{1^2}{36} + \frac{1^2}{36} \right)} =\) (-0.45, 0.89)

Testing if there are any effects

Now we know we can estimate our parameters and compare different treatments. But the appeal of ANOVA is not in the estimation, but in the way it can be used to test hypotheses. We can motivate this with a t-test.

We know that if we have two groups,

\[ t = \frac{\bar{Y}_{.2} - \bar{Y}_{.1}}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \] follows a t distribution with \(n-2\) degrees of freedom, where \(s^2 = SSE/(n-2)\) is the pooled variance estimate. We also know (Theorem 7.3.4) that the square of a t distribution is an F distribution with \(1,n-2\) degrees of freedom. i.e. if the null hypothesis is true,

\[ \frac{(\bar{Y}_{.2} - \bar{Y}_{.1})^2}{s^2\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \sim F_{1,n-2} \]

Show that when \(n_1=n_2\), the numerator is the same as \(2\sum_{j=1}^{2} (\bar{Y}_{.j} - \bar{Y}_{..})^2\)

Answer

Note first that \(\bar{Y}_{..} = (\bar{Y}_{.1} + \bar{Y}_{.2})/2\) (because \(n_1=n_2\))

\[ \begin{align} \sum_{j=1}^{2} (\bar{Y}_{.j} - \bar{Y}_{..})^2 &= \sum_{j=1}^{2} (\bar{Y}_{.j} - (\bar{Y}_{.1} + \bar{Y}_{.2})/2)^2 \\ &=(\bar{Y}_{.1}/2 - \bar{Y}_{.2}/2)^2 + (\bar{Y}_{.2}/2 - \bar{Y}_{.1}/2)^2\\ &=2 (\bar{Y}_{.1}/2 - \bar{Y}_{.2}/2)^2\\ &=\frac{1}{2} (\bar{Y}_{.1} - \bar{Y}_{.2})^2\\ \end{align} \] Where the jump from second to third line is simply because \((-a)^2=a^2\).
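The t-squared-equals-F relationship is easy to check numerically in R (simulated data; `var.equal = TRUE` gives the pooled two-sample t-test):

```r
# The squared pooled t statistic equals the one-way ANOVA F statistic (k = 2)
set.seed(9)
y <- rnorm(20)
g <- factor(rep(1:2, each = 10))

t_stat <- t.test(y ~ g, var.equal = TRUE)$statistic
f_stat <- anova(lm(y ~ g))[1, "F value"]

all.equal(as.numeric(t_stat^2), f_stat)  # TRUE
```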

Based on this, if we have more than 2 treatment levels, we might want to use F tests to test if the means are the same. This leads us to the insight that the best way to compare means is to compare variance between the means, hence analysis of variance1.

The main trick is to use sums of squares. The log-likelihood for the normal distribution is \(l = C - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i - \mu_i)^2\) for some vector of means \(\mu_i\), so maximising the likelihood is the same as minimising the sum of squares. And sums of squares have a lot of convenient properties.

To start with, we can define a few sums of squares:

the total sum of squares, \(SS_{Tot} = \sum_{j=1}^k\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{..})^2\); the treatment sum of squares, \(SS_{TR} = \sum_{j=1}^k n_j(\bar{Y}_{.j} - \bar{Y}_{..})^2\); and the error sum of squares, \(SS_{E} = \sum_{j=1}^k\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{.j})^2\).

The total sum of squares represents the total variation in the data (\(Var(Y)=SS_{Tot}/(n-1)\)). The treatment sum of squares represents the variation in the treatment means, and the error sum of squares represents the residual variation that is not explained by anything else: \(\hat{\sigma}^2=SS_{E}/(n-k)\) (see above).

Now what makes this fun is that \(SS_{Tot} = SS_{TR} + SS_{E}\).

Prove that \(SS_{Tot} = SS_{TR} + SS_{E}\)

i.e. show \(\sum_{j=1}^k\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{..})^2 = \sum_{j=1}^k\sum_{i=1}^{n_j}(\bar{Y}_{.j} - \bar{Y}_{..})^2+\sum_{j=1}^k\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{.j})^2\).

Hint: work out how to get \(\bar{Y}_{.j}\) in there in a way you can get the sums of squares.

Hint

The approach is to write \(\sum_{j=1}^k\sum_{i=1}^{n_j}(Y_{ij} - \bar{Y}_{..})^2 = \sum_{j=1}^k\sum_{i=1}^{n_j}[(Y_{ij} -\bar{Y}_{.j}) + (\bar{Y}_{.j}- \bar{Y}_{..})]^2\), expand, and show that the cross-product term disappears.

Answer

Theorem 12.2.4. Enjoy, it’s neat and the trick is used all over the place.
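The identity is also easy to verify numerically (arbitrary simulated data):

```r
# Check SS_Tot = SS_TR + SS_E on simulated data
set.seed(3)
y <- rnorm(24)
g <- factor(rep(1:3, each = 8))
ybar_j <- ave(y, g)                  # each observation's group mean

SS_Tot <- sum((y - mean(y))^2)
SS_TR  <- sum((ybar_j - mean(y))^2)  # = sum over j of n_j (Ybar_.j - Ybar_..)^2
SS_E   <- sum((y - ybar_j)^2)

all.equal(SS_Tot, SS_TR + SS_E)      # TRUE
```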

It turns out that \(SS_{TR}\) and \(SS_E\) are independent (Theorem 12.2.3, which basically says see Theorem 7.3.2). We worked out above that \(SSE/\sigma^2 \sim \chi^2_{n-k}\), but what about \(SSTR\)?

Show that \(SS_{TR} = \sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\mu)^2 - n(\bar{Y}_{..}-\mu)^2\)

Hint This is the same trick as above: \((a-b)^2 = ((a-\mu) - (b-\mu))^2\), and the cross-product term is either 0 or minus twice one of the squared terms.
Answer This is the proof of equation 12.2.1

From \(SS_{TR} = \sum_{j=1}^{k}n_j(\bar{Y}_{.j}-\mu)^2 -n(\bar{Y}_{..}-\mu)^2\), show that \(E(SS_{TR}) = (k-1)\sigma^2 + \sum_{j=1}^k n_j(\mu_j-\mu)^2\)

Hint You will need Theorems 3.6.1 and 3.6.2 to deal with \(E[(\bar{Y}_{.j}-\mu)^2]\)
Answer This is Theorem 12.2.1

This is useful because if \(\mu_1=\mu_2=\dots=\mu_k\), \(E(SS_{TR})=(k-1)\sigma^2\). \(SS_{TR}/\sigma^2\) then also follows a \(\chi^2_{k-1}\) distribution (Theorem 12.2.2: there is a proof using moment generating functions in Appendix 12.A.1). Also, \(SS_{TR}\) and \(SS_{E}\) are independent (Theorem 12.2.3: we have not looked at this independence in any depth).

So, to test \(H_0: \mu_1=\mu_2=\dots=\mu_k\), we can use the ratio \(SS_{TR}/SS_{E}\)

If \(H_0: \mu_1=\mu_2=\dots=\mu_k\), what is the distribution of \(F=\frac{SS_{TR}/(k-1)}{SS_{E}/(n-k)}\)?

Hint If you are not sure, have a look at Week 2’s module.
Answer

Under \(H_0\), \(SS_{TR}/\sigma^2 \sim \chi^2_{k-1}\), and \(SS_{E}/\sigma^2 \sim \chi^2_{n-k}\) (see above, or Theorem 12.2.3a). And they are independent (note that the \(\sigma^2\)s cancel in the ratio). So

\[ F = \frac{SS_{TR}/(k-1)}{SS_{E}/(n-k)} \sim F_{k-1,n-k} \]

From the definition of the F distribution.

This is also Theorem 12.2.5.

Now we know the distribution of the test statistic, we can develop a test for equality of the means. Before we do that, we need one more piece of information: what \(SS_{TR}\) looks like when \(H_0\) is incorrect. Because \(E(SS_{TR}) = (k-1)\sigma^2 + \sum_{j=1}^k n_j(\mu_j-\mu)^2\), if \(H_0\) is wrong, at least one mean, \(\mu_j\), must differ from the overall mean \(\mu\), so \(\sum_{j=1}^k n_j(\mu_j-\mu)^2\) must be non-zero. And, being a sum of squares, it must be positive. Thus, if \(H_0\) is false, \(E(F)>1\). So we should use a one-sided test.

Thus, our test is whether the F-ratio (defined above) could come from an F distribution.

Example: Testing Potatoes

We can have a look at the test for whether the yield is different when the potatoes are treated with different fertilisers.

Above we calculated \(SSE = 71.35\). We can calculate \(SS_{TR}\) as \(\sum_{j=1}^k n_j(\bar{Y}_{.j}-\bar{Y}_{..})^2\):

# Calculate treatment means
(TreatmentMeans <- tapply(potato$Yield, potato$Fert, mean))
##        B       Cl        S 
## 2.951667 3.033333 3.090278
# Number of observations in each treatment (here, 36 each)
nk <- tapply(potato$Yield, potato$Fert, length)
(SSTR <- sum(nk*(TreatmentMeans - mean(potato$Yield))^2))
## [1] 0.3495019

So \(F = \frac{0.3495019/2}{71.35/105} = 0.26\). If \(H_0\) is true, this will follow an F distribution with 2 and 105 degrees of freedom. Using stats tables or a computer, we can calculate \(Pr(F>0.26|H_0) = 0.77\). So there is no evidence that any of the treatments differ in their effect on yield.
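The tail probability comes straight from `pf()`; in practice the whole test is one call to `anova()` (the `potato` data frame is not reproduced here, so that line is shown as a comment):

```r
# p-value for the observed F statistic under H0: all means equal
1 - pf(0.26, df1 = 2, df2 = 105)   # Pr(F > 0.26), about 0.77

# With the data frame available, the complete ANOVA table would be:
# anova(lm(Yield ~ Fert, data = potato))
```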

Testing Contrasts

Finally, we can look at testing contrasts. First, we should note that if a test that the means are all equal cannot be rejected, there is little point in testing whether any contrast is different from 0: if \(\mu_1=\mu_2=\dots=\mu_k\), any contrast will be 0 (because the \(c_j\)s sum to 0). But if some of the \(\mu_j\)s are different, the differences could be interesting.

We could test contrasts from their t-distribution, but we can also look at the square of that:

If \(C_i = \sum_{j=1}^k c_{ij} \mu_j\), \(SS_{C_i} = \frac{\hat{C}_i^2}{{\sum_{j=1}^k}\frac{c_{ij}^2}{n_j}}\) is a weighted sum of squares, and under \(H_0: C_i=0\), \(\frac{SS_{C_i}}{SS_E/(n-k)} \sim F_{1, n-k}\).
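A sketch of this F test in R, on simulated data (all names and numbers here are mine, for illustration):

```r
# F test of a single contrast: control vs the mean of the other treatments
set.seed(11)
y <- rnorm(30, mean = rep(c(0, 1, 1), each = 10))
g <- factor(rep(1:3, each = 10))

cvec <- c(-2, 1, 1)                      # contrast coefficients, sum to zero
nj   <- as.vector(table(g))
C.hat <- sum(cvec * tapply(y, g, mean))
SS_C  <- C.hat^2 / sum(cvec^2 / nj)      # contrast sum of squares
SSE   <- sum((y - ave(y, g))^2)

F.val <- SS_C / (SSE / (length(y) - 3))  # F_{1, n-k} under H0
p.val <- 1 - pf(F.val, 1, length(y) - 3)
c(F = F.val, p = p.val)
```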

Larsen & Marx (§12.4) show that this is related to \(SS_{TR}\) because \(SS_{TR}\) can be decomposed into a set of orthogonal contrasts. I have never seen this used, although it may be related to what we will look at next week: more complex ANOVA problems.

Multiple Testing

Often when a statistically significant effect has been found in an ANOVA the response is to test all of the possible contrasts to see which one(s) show an effect. This is horrible, but understandable: after all, one should be able (somehow) to say what the differences are.

The reason it is horrible is that there are a lot of tests. The first thing that is usually done is to test all pairwise combinations, and see which are different. But if there are (say) 5 treatment levels, there are \(\binom{5}{2}=10\) pairwise tests. If we are testing at the 5% level, the chance that at least one of these is significant purely by chance is already about 40%.
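The arithmetic is easy to check in R (for simplicity this treats the pairwise tests as independent, which they are not quite, but the order of magnitude is right):

```r
choose(5, 2)             # number of pairwise comparisons among 5 levels: 10
1 - 0.95^choose(5, 2)    # Pr(at least one "significant" result by chance)
```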

We will not delve into this further this week (sorry): there are several approaches to solving this, in essence finding a correction to the 5% error rate. If we get time later, I might add this to another module.


  1. Almost inevitably, Analysis of Variance was invented by R.A. Fisher. But he didn’t do this in the context of re-inventing statistics: instead it was when he was re-inventing evolutionary biology, by combining Darwin’s ideas with genetics.↩︎