Learning Outcomes

This is a question that you should try to answer

Answer This is where the answer will be. If it says “Hint” instead, it is a hint to help you.

Introduction

We have now looked at a few models, and how to estimate their parameters and test if effects are zero. This week I want to go back to Chapter 8, and look at types of data. The reason for this is to try to show how the different analyses relate to each other, and also how they are extended to more complex analyses.

A large part of this module is about how we classify data. This can be done in many ways: Chapter 8 presents one approach. As with anything like this, different people can agree and disagree about how to do it.

There is an assumption (or perhaps a hope) in a lot of statistics that you know what you want to do, i.e. you have not just pulled together a pile of data to see what it does (this even applies to machine learning, which can give that impression). This helps to classify the data in a way that will guide how it is analysed.

Types of Data

The first thing to do is to structure a data set into the response and the predictors.

In our potato data:

What is the predictor variable?

Answer

In the potato yield data it is the fertiliser treatment, and in the two-way ANOVA the potato variety as well. We can have lots of predictor variables.

More generally, it is the factor that we are using to explain the response.

What is the response variable?

Answer The potato yield.

What is the predictor variable in a one-sample t-test?

(this is sort-of a trick question)

Answer

The predictor is a constant. All of the observations have the same value of it, so it’s a bit boring.

There is a useful insight here: there is always a predictor variable, even if it is boring. In hypothesis tests the null hypothesis, \(H_0\), is often that the predictor is a constant.

Quantitative and Qualitative Data

There are two fundamentally different types of data: quantitative and qualitative data. Quantitative data is basically numbers: we have measured a quantity (length, yield of potatoes, number of potatoes, IQ of politicians etc.). We are interested in what affects these numbers.

Qualitative data are not based on numbers. These are things like what species a bird is, what country someone is from, what political party someone would vote for etc. If we are to do statistical analysis of these sorts of data, we have to reduce them to numbers, e.g. how many birds of a species we have seen, or what proportion of the population would vote for the Official Monster Raving Loony Party1. If the counts or proportions are large enough, they can be treated as being normally distributed. When they are smaller, they can be Poisson or multinomially distributed. This is where the goodness of fit tests come in: they can test whether the proportions of the quantities are what we would expect.
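To make this concrete, here is a minimal sketch of a goodness of fit test on counts in Python. The bird counts and the expected proportions are made up for illustration:

```python
from scipy import stats

observed = [45, 30, 25]          # birds seen of each of three species
expected_prop = [0.5, 0.3, 0.2]  # proportions we would expect under H0

n = sum(observed)
expected = [p * n for p in expected_prop]

# H0: the observed counts follow the expected proportions
chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi-squared = {chi2:.2f}, p = {p_value:.3f}")
```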

More complicated approaches can be used to look at qualitative data in more detail. These are based on extending quantitative models: in essence, the counts in each qualitative class are modelled as linear on the log scale, e.g. \(\log(E_i) = \beta_0 + \beta_1 x_i\).
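As a sketch of this idea, here is a Poisson log-linear model fitted with the statsmodels package; the counts and the covariate are invented:

```python
import numpy as np
import statsmodels.api as sm

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2, 3, 6, 7, 12, 19])  # counts in each class

X = sm.add_constant(x)  # design matrix with an intercept column

# Poisson family uses a log link by default, so log(E_i) = b0 + b1*x_i
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)  # estimates of beta_0 and beta_1
```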

The Response Variable(s)

The response variable is the thing we are trying to model, e.g. the yield of potatoes. For most problems there is only one of these: it is univariate. But there can be more, e.g. if we are interested in the shape of fish, we might measure their length, width, distance from front to eye etc. This sort of data is called multivariate. We have looked at the simplest version of this: the bivariate normal distribution.

So far we have almost exclusively looked at the normal distribution. But all of the models can be adapted to different distributions (the one-sample t-test is the easiest).

The Predictors

Our response has mostly been normal, so what about the predictors? Our normal models have looked like this:

\[ Y_i \sim N(\mu_i, \sigma^2) \]

and the differences have been in the \(\mu_i\). We have seen the following models:

  • \(\mu_i=\mu\): the one-sample t-test
  • \(\mu_i=\mu_j, j\in\{1,2\}\): the two-sample t-test
  • \(\mu_i=\mu_j, j\in\{1,\dots,k\}\): the one-way ANOVA
  • \(\mu_i=\mu_j + \beta_l, j\in\{1,\dots,k\}, l\in\{1,\dots,b\}\): the two-way ANOVA (the book calls this a block design, which it sometimes is)
  • \(\mu_i=\beta_0 + \beta_1 x_i\): a simple regression

Distinguishing between them comes down to working out what the model for \(\mu_i\) is. A large amount of modern statistics is based on the same idea: in essence, we just make \(\mu_i\) more complicated.
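To illustrate, here is a sketch of how all of these models can be fitted in one framework, using statsmodels’ formula interface. The data frame and its column names are hypothetical (`yield_` is used because `yield` is a reserved word in Python):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Invented potato data: yield per plot, with treatment and variety labels
potatoes = pd.DataFrame({
    "yield_":    [3.1, 2.8, 4.0, 3.7, 2.5, 3.3, 4.2, 3.9],
    "treatment": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "variety":   ["V1", "V1", "V1", "V1", "V2", "V2", "V2", "V2"],
    "x":         [1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5],
})

m_const  = smf.ols("yield_ ~ 1", data=potatoes).fit()          # mu_i = mu
m_groups = smf.ols("yield_ ~ treatment", data=potatoes).fit()  # one-way ANOVA
m_block  = smf.ols("yield_ ~ treatment + variety",
                   data=potatoes).fit()                        # two-way ANOVA
m_reg    = smf.ols("yield_ ~ x", data=potatoes).fit()          # simple regression
```

The only thing that changes between the fits is the formula, i.e. the model for \(\mu_i\).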

The predictors are of two types: continuous and categorical.

  • Continuous predictors can be considered as real numbers (even if they are, for example, integers).
  • Categorical predictors can only take a few values. They are often also qualitative, e.g. potato variety.

Sometimes a continuous predictor is treated as categorical, e.g. if a drug is used at only a few doses, say two to four. In practice the differences between the results are not too large, and it can be more convenient to compare “None”, “Low”, and “High” than the actual doses used.
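As a sketch of this, a formula interface lets us switch between the two treatments of dose. The drug data below are invented:

```python
import pandas as pd
import statsmodels.formula.api as smf

drug = pd.DataFrame({
    "response": [1.2, 1.5, 2.8, 3.1, 4.0, 4.3],
    "dose":     [0, 0, 5, 5, 10, 10],  # only three distinct doses
})

m_cont = smf.ols("response ~ dose", data=drug).fit()     # dose as a number
m_cat  = smf.ols("response ~ C(dose)", data=drug).fit()  # dose as categories
```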

How to use this information

If you are faced with a data set that you should analyse, there are a few things you need to check. Some of these are listed in §8.2. Larsen and Marx present a flowchart (Figure 8.2.4) that can be helpful in sorting out the different types of design of the data.

One aspect the book ignores is that you need to think about what you want to do. All of the methods we have looked at are designed to answer specific sorts of questions, and which method you choose will depend on your question. This is most apparent when comparing regression and correlation.

To recap: in regression we have a response variable, \(Y\), which we try to explain with the help of a covariate, \(X\). We assume that \(Y_i\) is normally distributed with a mean that depends on \(X_i\). In contrast, when we look at a correlation, we treat both \(X\) and \(Y\) as response variables, with some correlation between them.

So if we have a problem where we want to explain \(Y\) with \(X\), e.g. we want to predict the exam scores of students in ST1201 based on their scores in ST1101, we would use a regression. On the other hand, if we want to ask if there is some general aptitude of students to do statistics, then we might focus on the correlation between the two scores.
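Here is a small sketch of the two analyses side by side, with invented exam scores:

```python
from scipy import stats

st1101 = [55, 62, 70, 48, 81, 66, 74, 59]  # hypothetical ST1101 scores
st1201 = [58, 60, 75, 50, 78, 70, 72, 61]  # hypothetical ST1201 scores

# Regression: explain the ST1201 scores (Y) with the ST1101 scores (X)
reg = stats.linregress(st1101, st1201)
print(f"slope = {reg.slope:.2f}, intercept = {reg.intercept:.1f}")

# Correlation: treat both as responses, and ask how strongly they co-vary
r, p = stats.pearsonr(st1101, st1201)
print(f"r = {r:.2f}, p = {p:.3f}")
```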

Larsen and Marx suggest that whether units are similar or dissimilar should be used to decide whether a correlation or regression analysis should be used. I think this is a bizarre suggestion: there are examples where similar units should be used in a regression, and dissimilar units in a correlation. Checking if units are similar or dissimilar is a good thing to do: it can be used as a sanity check. You do not want to be adding weight in grams to height in metres. You also do not want to be adding lengths in metres to lengths in feet, at least not without converting both to the same scale.

So, feel free to use most of Figure 8.2.4, but ignore that one question.

Give an example of a problem where data with similar units should be used in a regression.

Answer

OK, there are lots. Including the case that the term “regression” was coined for: heights of parents and offspring. There, Francis Galton (and not Fisher!) was looking at their relationship, and whether the heights of parents could explain the heights of offspring.

The key point of whatever example you find is that \(X\) is being used to explain or predict \(Y\).

What question could you use in Fig. 8.2.4 to replace “Are the units similar or dissimilar?”

Answer Again, there are a few choices. But something along the lines of “Are you trying to explain \(Y\) with \(X\), or just asking if they are related?”

Design of Experiments

This is a whole field of statistics, and the field was started with fields. We saw in weeks 9 and 10 an experiment with potatoes. The intent of the experiment was to look at the effects of fertiliser, but the variety also had to be accounted for. This gave us useful information about which varieties are best to grow (or at least which are best to grow at Rothamsted). So we have two factors: variety and fertiliser treatment. We may have more (e.g. treatment with different pesticides). If we design the experiment well, we can test all of these together, as well as any interactions between them.

Larsen & Marx describe the two-way ANOVA as a “randomised block design”. Why a block? The idea was developed to deal with field data, and in particular fields. If one wants to test a lot of treatments, this takes time and space to do. But they also need to be replicated: all of the analyses rely on estimating \(\sigma^2\), and (in essence) you need degrees of freedom to do that, i.e. more data than parameters.

Replication can happen at different levels: we can have more than one potato plant, or more than one patch of plants in a field, or several fields. Ideally we would have replication at every level, but this can be expensive. So we use a slightly more complicated design. We have (say) several fields, and in each field we have one plot of each treatment. So there is no replication of treatments within a block, but they are replicated across blocks.

If there is one treatment factor and one blocking factor, the analysis is just like a two-way ANOVA, i.e. what Larsen & Marx call a randomised block design. So we have already seen how to do the analysis: the model is \(\mu_i=\mu_j + \beta_l, j\in\{1,\dots,k\}, l\in\{1,\dots,b\}\), with only one observation for each \((j,l)\). With more than one treatment factor (e.g. fertiliser and pesticide) the same ideas apply, but are obviously more complicated.
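As a sketch, here is such an analysis in Python, with invented yields for three treatments in four blocks (one observation per treatment-block combination):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

d = pd.DataFrame({
    "yield_":    [4.1, 3.6, 3.9, 4.4, 3.8, 4.0,
                  4.6, 4.1, 4.3, 4.2, 3.7, 4.1],
    "treatment": ["1", "2", "3"] * 4,
    "block":     ["A"] * 3 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3,
})

# Two-way ANOVA without an interaction: mu_i = mu_j + beta_l
fit = smf.ols("yield_ ~ treatment + block", data=d).fit()
print(sm.stats.anova_lm(fit))  # F-tests for treatment and block effects
```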

The key thing is that we are not interested in the blocks. We accept that we need them: either because they represent “real” variation that we need to account for, or because we can’t do our experiment without some sort of blocking. For example, our field may not be big enough for our replication, or each observation might take time, so we have to spread the observations over days: here the days would be the blocks.

Randomisation

Larsen & Marx call the two-way ANOVA a “randomised block design”. We have dealt with blocks, but what about the “randomised”?

Imagine we have 3 treatments (1,2,3) and 4 blocks (A, B, C, D). Each block could be a day with 3 spaces (e.g. Morning, Lunchtime, Afternoon). We could set them out like this:

            A  B  C  D
Morning     1  1  1  1
Lunchtime   2  2  2  2
Afternoon   3  3  3  3

This might work, but it might run into problems if the time of day has an effect, e.g. if there is a big effect of Morning, then we might interpret that as an effect of treatment 1. We call this confounding: the treatments and positions within a block are confounded.

What to do? The solution suggested by Fisher was to randomise the order. For example, this:

            A  B  C  D
Morning     3  1  3  2
Lunchtime   1  2  1  1
Afternoon   2  3  2  3
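Generating a randomisation like this is straightforward; here is a minimal sketch for three treatments and four blocks (days):

```python
import random

treatments = [1, 2, 3]
blocks = ["A", "B", "C", "D"]

design = {}
for block in blocks:
    order = treatments.copy()
    random.shuffle(order)  # random order of treatments within this block
    design[block] = order  # [morning, lunchtime, afternoon]

print(design)
```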

His argument was that in the long run you will be misled less often, because the randomisation means that you are less likely to have confounded variables. In this example, the randomisation is clearly not perfect (there are three treatment 1s at lunchtime), but with a larger data set this problem is less likely. If the number of blocks equals the number of treatments, the experiment can be set up so that each treatment occurs once in each position in the block, e.g. 

            A  B  C
Morning     1  3  2
Lunchtime   3  2  1
Afternoon   2  1  3

Notice how each row and each column has each treatment once. This is called a Latin Square design.
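A Latin square of any size can be built by cyclic shifts; here is a sketch (in practice one would also randomise the order of the rows, columns, and treatment labels):

```python
k = 3
treatments = list(range(1, k + 1))

# Shift each row one step relative to the previous one
square = [[treatments[(row + col) % k] for col in range(k)]
          for row in range(k)]

for row in square:
    print(row)

# Check: every row and every column contains each treatment exactly once
assert all(sorted(row) == treatments for row in square)
assert all(sorted(col) == treatments for col in zip(*square))
```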

We will not go into this further (fun as some of these ideas can be). But we will note that the reasons for designing experiments well are to get good inferences. For example:

  • any effects we estimate should be due to those effects (i.e. no unwanted confounding)
  • our estimates should be precise (i.e. small confidence intervals; we can always make them smaller by having more data, but we also want to minimise costs)
  • our estimates should be robust, e.g. they should not depend too much on a few values.

Explaining how to do this well would be a whole new course.


  1. yes, they do exist in the UK. They used to be led by Screaming Lord Sutch, and have included Lord Buckethead as a candidate in a general election. They sometimes appear to be one of the saner political parties.↩︎