Sometimes we don’t have strong hypotheses, and instead we might be exploring which variables could have an effect. We might do this is we are interested in prediction, and not explanation, or if we want to ask some specific questions, but there are several other predictors that could have an effect.
The aim here is not to test specific hypotheses, rather it is to try to get a good model, whether it is to predict, or to explore what are probably the major drivers.
There is a famous maxim from George Box that “All models are wrong, but some are useful”. So we should look for a “good” model, but not necessarily the true model. So how do we find a good model?
As we now know, we can get a better fitting model simply by adding more parameters. But bigger models are more complicated, and more difficult to understand. But they are also less robust: models can end up explaining statistical noise, and over-estimating effects.
Thus, we want a model that fits the data well, is simple, and (ideally) is understandable. Because adding terms to a model improves the fit, there is a trade-off between fit to the data and simplicity. Fortunately we can measure both of these. The likelihood measures the fit to the model, and we can use the number of parameters as a measure of simplicity. The more parameters, the more complex the model.
There are several ways to penalise a model, depending on what we mean by a “good” model. Broadly, there are two ways of thinking of what we mean by good:
Unfortunately, it is impossible to develop an approach which will do both (yes, somebody actually proved this). But there are approaches to do each on them.
For each model we can calculate a statistic that summarises its adequacy (for either prediction or closeness to the truth). The model with the lowest value of the statistic is the “best” model. Here we will use these statistics:
There is some mathematics that shows that this is what these two statistics do, at least when you have a lot of data.
For these we use the log-likelihood, \(l = log p(y|\theta)\), the number of parameters in the model, \(p\), and the number of observations \(n\).
This finds the model that would best predict replicate data, from exactly the same conditions. It is calculated as:
\[ AIC = -2l + 2p \] or AIC = -2 Likelihood + 2 Number of Parameters.
If the data set is small, we can use the corrected AIC:
\[ AIC_c = -2l + 2p + \frac{2p(p+1)}{n-p-1} \]
which should do a better job of finding a good model.
This is designed to find the model which is most likely to be “true”. Thanks to George Box, we know that we can’t find the truth, so we are perhaps looking for the model closest to this. it is calculated as
\[ BIC = -2 l + \log(n) p \]
i.e. BIC = -2 Likelihood + log(N) Number of Parameters
We can see that AIC penalises the likelihood with \(2p\), and BIC penalises with \(\log(n) p\). So BIC penalises more if \(\log(n) > 2\), i.e. if \(n>7\). This means that BIC will usually find that smaller models are optimal.
If we are strict, we should always select the best model. But statistics is rarely that neat. By chance the best model might have a slightly higher (i.e. worse) AIC or BIC. The usualy guideline is that if the difference between two models is less than 2, they are roughly the same.
For an example we will use simulated data to look at this approach, with 100 observations and 20 predictors. Only the first predictor has a real effect.
# Just like in the previous part, we set a seed so you all get the same numbers
set.seed(25)
# Then set up the number of observations
NSmall <- 100
# and the number of predictors
PSmall <- 20
# Then we make the predictor data
xSmall <- matrix(rnorm(NSmall*PSmall), nrow=NSmall)
mu <- 0.1*xSmall[,1] # true R^2 = 0.1^2/(0.1^2 + 1) = 1%
# Finally we make a response variable
ySmall <- rnorm(NSmall, mu)
We can extract AIC from a model object with the AIC()
function:
model.null <- lm(ySmall ~ 1) # makes a null model
model.full <- lm(ySmall ~ xSmall) # makes the full model
model.2 <- lm(ySmall ~ xSmall[,2]) # makes a model with one predictor
AIC(model.null, model.2, model.full) # compare using the AIC
A lower value is better, so the null model is better than having all of the variables in it, and the model with variable 2 is slightly better still.
If we want BIC, we can use the BIC()
function:
BIC(model.null, model.2, model.full) # compare using the BIC
So here we see that the null model and model 2 have almost the same BIC.
Fit all of the models with one covariate (i.e. y~x[,1]
, y~x[,2]
etc.). Which one gives the best model (i.e. has the lowest AIC)?
model.1 <- lm(ySmall ~ xSmall[,1])
model.2 <- lm(ySmall ~ xSmall[,2])
model.3 <- lm(ySmall ~ xSmall[,3])
# ... up to
model.20 <- lm(ySmall ~ xSmall[,20])
AIC(model.1, model.2, model.3, model.20)
(if you are more comfortable with R, you could use apply()
)
apply()
thing
modelsAIC <- apply(xSmall, 2, function(X, y) {
AIC(lm(y ~ X))
}, y=ySmall)
which(modelsAIC==min(modelsAIC))
round(modelsAIC-min(modelsAIC), 2)
With AIC the first model is the best. If we look at the differnces in AIC, we can see that a modelw ith the second covaraite is fairly close, but the others are much worse.
If you want to watch the video, it’s below, or you can click here
First, we will fit all of the models. This is the “short” way to do it:
modelsAIC <- apply(xSmall, 2, function(X, y) {
AIC(lm(y ~ X))
}, y=ySmall)
(don’t worry if you used model.1 <- lm(ySmall ~ xSmall[,1])
20 times)
So modelsAIC
is a vector of AICs. We can see which has the lowest AIC like this:
which(modelsAIC==min(modelsAIC))
## [1] 1
And we see that the first model has the lowest AIC.
We can also look at the differences from the lowest AIC:
round(modelsAIC-min(modelsAIC), 2)
## [1] 0.00 1.85 5.45 4.44 5.77 5.81 5.47 5.12 4.13 4.00 5.90 5.67 5.80 5.89 5.84
## [16] 4.43 5.80 5.63 5.88 5.57
Most of the models have an AIC that is about 4-6 higher than the best model, but the model for x2
has and AIC that is less than 2 away. There is a rule of thumb that models within about 2 of each other have roughly the same support.
If we have a lot of models, writing code to go through them all to find the best model is messy. Instead, we can use the bestglm
package to do this:
library(bestglm) # might need install.packages("bestglm")
# This joins xSmall and ySmall together as columns
UseData <- data.frame(cbind(xSmall, ySmall))
# Use bestglm() for AIC and BIC
AllSubsetsAIC <- bestglm(Xy=UseData, IC="AIC")
AllSubsetsBIC <- bestglm(Xy=UseData, IC="BIC")
The bestglm
object has several pieces:
names(AllSubsetsAIC)
## [1] "BestModel" "BestModels" "Bestq" "qTable" "Subsets"
## [6] "Title" "ModelReport"
Subsets
gives the AIC (or BIC) for the best models:
AllSubsetsAIC$BestModels
You can run this yourselves, but the output is big.
BestModel
is the lm
object for the best model, e.g.
coef(AllSubsetsBIC$BestModel)
Run the code to compare the models.
Which model is best according to AIC, and which according to BIC?
If you want to watch the video, it’s below, or you can click here
AllSubsetsAIC <- bestglm(Xy=UseData, IC="AIC")
AllSubsetsAIC$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) V1 V2 V10 V16
## 0.0257 0.2580 0.2004 0.1717 -0.1380
According to AIC, a model with 4 variables (nos. 1,2,10 and 16) is best. What about BIC? Well…
AllSubsetsBIC <- bestglm(Xy=UseData, IC="BIC")
AllSubsetsBIC$BestModel
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Coefficients:
## (Intercept) V1
## 0.04905 0.24932
We see that using BIC leads to a smaller model, with only V1 being in it.
How good are the models? How do they compare with the truth?
This is about interpretation, and comparing the parameter estimates with the true values (which may be zero, of course).
The true value of \(\beta_1\) is 0.1, and the true \(R^2\) should be about 1%. So first for AIC…
summary(AllSubsetsAIC$BestModel)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.4571 -0.6835 0.1547 0.5894 2.4038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.02570 0.10220 0.251 0.8020
## V1 0.25802 0.10005 2.579 0.0114 *
## V2 0.20044 0.10996 1.823 0.0715 .
## V10 0.17169 0.09244 1.857 0.0664 .
## V16 -0.13800 0.09633 -1.433 0.1553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9946 on 95 degrees of freedom
## Multiple R-squared: 0.1355, Adjusted R-squared: 0.09909
## F-statistic: 3.722 on 4 and 95 DF, p-value: 0.007378
We see that the \(R^2\) is about 14%, with the estimate of V1 being larger than the truth. Of course this might not be true next time.
Now BIC:
summary(AllSubsetsBIC$BestModel)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.86076 -0.58678 0.00383 0.67267 2.42862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04905 0.10394 0.472 0.6381
## V1 0.24932 0.10215 2.441 0.0165 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 98 degrees of freedom
## Multiple R-squared: 0.0573, Adjusted R-squared: 0.04768
## F-statistic: 5.957 on 1 and 98 DF, p-value: 0.01645
For BIC we see, again , that the effect of V1 is over-estimated, but the \(R^2\) is lower.
In (simulated) reality only V1 has an effect, so BIC finds the true model, whereas AIC finds one that is more complex. This is typical, so the choice of whether to use AIC or BIC often depends on how complex the final model should be.
In this set of modules we have seen how to chose between different models, either by specific comparisons (using hypothesis testing) or by a more exploratory approach.
Model selection is also an area with a long history of statistical abuse. This is mainly because it has been seen as the end of the process. In particular hypothesis testing is seen as saying whether there is an effect or not. THIS IS NOT TRUE. Strictly it only says if the data are likely under the null hypothesis, and the assumption is that if they are not then it is because the alternative hypothesis is better (or perhaps even true). But the null hypothesis is almost always false, so in that sense a hypothesis test is testing whether you have enough data to see that.
This is not to say that hypothesis tests are useless, it is a reaction to decades of mis-use. But they should not be the end of the analysis: once you have decided that an alternative hypothesis (e.g. that womens’ times in the 100m have changed over the years), you need to go on to ask about how importsnt that change is: look at the parameter estimates and other statistics, like \(R^2\) to understand that. So you need to be able to understand what the parameters mean, and what (for example) a “large” effect would be. The statisical aspect is not the main issue, instead it is one of biology (or chemistry, or sports science or whatever you are studying), of interpreting what the parameters mean.