This Week: Model selection

This week you will:

  • find out why model selection is needed
  • be able to use AIC and BIC to compare models
  • be able to compare hypotheses with F tests

The material is split into 3 modules (because one was getting too long). This is the first, the second is about hypothesis testing, and the third is about exploratory model selection. You should try to follow them in order.

Here we will cover one of the more contentious areas of statistics - how to decide what model to use. One reason for the controversy is the link to hypothesis testing. In essence, hypothesis testing is model selection: each hypothesis (null and alternative - \(H_0\) and \(H_1\)) is represented by a model, and hypothesis testing asks whether \(H_0\) can be supported by the data. As we will see, this makes the test depend on the amount of data. As \(H_0\) is usually wrong, this means we are asking if we have enough data to see if \(H_0\) is wrong.

This is not to argue that hypothesis testing is always a bad idea, rather it is one statistical tool, and should (like all tools) be used with caution.

First, a video click here or watch:

Why select models?

It is worth starting out by looking at why we want to select models. It is probably clear when we have specific hypotheses, e.g. when we are asking “does adding cow manure to fields improve yield?” we are asking whether the manure effect is non-zero. So comparing models when it is zero or non-zero is sensible. But when we are asking “which one of these 1000 loci explains how wide people’s faces are”, one might argue that it is better just to use all of the effects. But using all of the variables comes at a price…

Let us see what happens. We want you to simulate some data, with 100 points and (up to) 90 explanatory variables. The first variable will explain about ~1% of data, the rest will be simulated independently.

# This first line sets a seed to make sure you all get 
# the same data
NData <- 100 
NExplanatory <- 90
# Create data of explanatory variables
x <- matrix(rnorm(NData*NExplanatory), nrow=NData)
# Calculate "true" mean
mu <- 0.1*x[,1] # true R^2 = 0.1^2/(0.1^2 + 1) = 1%
# Simulate the data
y <- rnorm(NData, mu, 1)

Now we want to fit a regression model to the data. We know the true model is \(y_i \sim N(0.1x_{1i}, 1)\), but what happens when we add more variables?

We will use a function, GetR2(), to fit a model, and return the \(R^2\), and the estimate of the effect of the first variable. We use it multiple times, looping over the vector 1:Nexplanatory, each time using that number of covariates. Rather than writing a for loop, we use the sapply() function, which does the looping internally1:

R2 <- sapply(1:NExplanatory, GetR2, Xmat=x, Y=y)

Once we have these, we should plot estimate & \(R^2\):

par(mfrow=c(2,1), mar=c(2,4.1,1,1), oma=c(2,0,0,0))
plot(1:NExplanatory, R2[4,], ylab=expression(R^2), ylim=c(0,1))
plot(1:NExplanatory, R2[1,], ylim=range(R2[2:3,]),
     ylab="Coefficient for X1")
segments(2:NExplanatory, R2[2,], 2:NExplanatory, R2[3,])
mtext("Number of parameters", 1, outer=TRUE) 

Plot estimate & \(R^2\)