Exploring for Good Models

Sometimes we don’t have strong hypotheses, and instead we might be exploring which variables could have an effect. We might do this is we are interested in prediction, and not explanation, or if we want to ask some specific questions, but there are several other predictors that could have an effect.

The aim here is not to test specific hypotheses, rather it is to try to get a good model, whether it is to predict, or to explore what are probably the major drivers.

What does a good model look like?

There is a famous maxim from George Box that “All models are wrong, but some are useful”. So we should look for a “good” model, but not necessarily the true model. So how do we find a good model?

As we now know, we can get a better fitting model simply by adding more parameters. But bigger models are more complivcated, and more difficult to understand. But they are also less robust: models can end up explaining statistical noise, and over-estimating effects.

Thus, we want a model that fits the data well, is simple, and (ideally) is understandable. Because adding terms to a model improves the fit, there is a trade-off between fit to the data and simplicity. Fortunately we can measure both of these: the likelihood measures the fit to the model, and we can use the number of parameters as a measure of simplicity: the more parameters, the more complex the model.

There are several ways to penalise a model, depending on what we mean by a “good” model. Broadly, there are two ways of thinking of what we mean by good:

  • a model that is close to the truth
  • a model which predicts the data well

Unfortunately, it is impossible to develop an approach which will do both. But there are approches to do each on them.


For each model we can calculate a statistic that summarises its adequacy (for either prediction or closeness to the truth). The model with thelowest value of the statistic is the “best” model. Here we will use these statistics:

  • AIC: Akaike’s Information Criterion. This tries to find the model that best predicts the data
  • BIC: Bayesian Information Criterion. This tries to find the model most likely to be true

For these we use the log-likelihood, \(l = log p(y|\theta)\), the number of parameters in the model, \(p\), and the number of observations \(n\).


This finds the model that would best predict replicate data, from exactly the same conditions. It is calculated as:

\[ AIC = -2l + 2p \] or AIC = -2 Likelihood + 2 Number of Parameters.

If the data set is small, we can use the corrected AIC:

\[ AIC_c = -2l + 2p + \frac{2p(p+1)}{n-p-1} \]

which should do a better job of finding a good model.


This is designed to find the model which is most likely to be “true”. Thanks to eorge Box, we know that we can’t find the truth, so we are perhaps looking for the model closest to this. it is calculated as

\[ BIC = -2 l + \log(n) p \]

i.e. BIC = -2 Likelihood + log(N) Number of Parameters

So, we can see that AIC penalises the likelihood with \(2p\), and BIC penailises with \(\log(n) p\). So BIC penalises more if \(\log(n) > 2\), i.e. if \(n>7\). This means that BIC will usually find that smaller models are optimal.

If we are strict, we should always select the best model. But statistics is rarely that neat. By chance the best model might have a slightly higher (i.e. worse) AIC or BIC. The usualy guideline is that if the difference between two models is less than 2, they are roughly the same.

AIC and BIC in R

For an example we will use simulated data to look at this approach, with 100 observations and 20 predictors. Only the first predictor has a real effect.

# Just like in the previous part, we set a seed so you all get the same numbers
# Then set up the number of observations
NSmall <- 100 
# and the number of predictors
PSmall <- 20
# Then we make the predictor data
xSmall <- matrix(rnorm(NSmall*PSmall), nrow=NSmall)
mu <- 0.1*xSmall[,1] # true R^2 = 0.1^2/(0.1^2 + 1) = 1%
# Finally we make a response variable
ySmall <- rnorm(NSmall, mu)

We can extract AIC from a model object with the AIC()function:

model.null <- lm(ySmall ~ 1) # makes a null model
model.full <- lm(ySmall ~ xSmall) # makes the full model
model.2 <- lm(ySmall ~ xSmall[,2]) # makes a model with one predictor

AIC(model.null, model.2, model.full) # compare using the AIC

A lower value is better, so the null model is better than having all of the variables in it, and the model with variable 2 is slightly better still.

If we want BIC, we can use the BIC() function:

BIC(model.null, model.2, model.full) # compare using the BIC

So here we see that the null model and model 2 have almost the same BIC.

Your task

Fit all of the models with one covariate (i.e. y~x[,1], y~x[,2] etc.). Which one gives the best model (i.e. has the lowest AIC)?

Code Hint

model.1 <- lm(ySmall ~ xSmall[,1])
model.2 <- lm(ySmall ~ xSmall[,2])
model.3 <- lm(ySmall ~ xSmall[,3])
# ... up to
model.20 <- lm(ySmall ~ xSmall[,20])

AIC(model.1, model.2, model.3, model.20)

(if you are more comfortable with R, you could use apply())

Show me the apply() thing

modelsAIC <- apply(xSmall, 2, function(X, y) {
  AIC(lm(y ~ X))
}, y=ySmall)
## [1] 1
round(modelsAIC-min(modelsAIC), 2)
##  [1] 0.00 1.85 5.45 4.44 5.77 5.81 5.47 5.12 4.13 4.00 5.90 5.67 5.80 5.89 5.84
## [16] 4.43 5.80 5.63 5.88 5.57

With AIC the first model is the best. If we look at the differnces in AIC, we can see that a modelw ith the second covaraite is fairly close, but the others are much worse.


If we have a lot of models, writing code to go through them all to find the best model is messy. Instead, we can use the bestglm package to do this:

library(bestglm) # might need install.packages("bestglm")

# This joins xSmall and ySmall together as columns
UseData <- data.frame(cbind(xSmall, ySmall))

# Use bestglm() for AIC and BIC
AllSubsetsAIC <- bestglm(Xy=UseData, IC="AIC")
AllSubsetsBIC <- bestglm(Xy=UseData, IC="BIC")

The bestglm object has several pieces:

## [1] "BestModel"   "BestModels"  "Bestq"       "qTable"      "Subsets"    
## [6] "Title"       "ModelReport"

Subsets gives the AIC (or BIC) for the best models:


You can run this yourselves, but the output is big.

BestModel is the lm object for the best model, e.g.


Run the code to compare the models.

Which model is best according to AIC, and which according to BIC?


AllSubsetsAIC <- bestglm(Xy=UseData, IC="AIC")
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## Coefficients:
## (Intercept)           V1           V2          V10          V16  
##      0.0257       0.2580       0.2004       0.1717      -0.1380

According to AIC, a model with 4 vaiables (nos. 1,2,10 and 16) is best. What about BIC? Well…

AllSubsetsBIC <- bestglm(Xy=UseData, IC="BIC")
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## Coefficients:
## (Intercept)           V1  
##     0.04905      0.24932

We see that using BIC leads to a smaller model, with only V1 being in it.

How good are the models? How do they compare with the truth?


Note that the true value of \(\beta_1\) is 0.1, and the true \(R^2\) should be about 1%. So…

## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4571 -0.6835  0.1547  0.5894  2.4038 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.02570    0.10220   0.251   0.8020  
## V1           0.25802    0.10005   2.579   0.0114 *
## V2           0.20044    0.10996   1.823   0.0715 .
## V10          0.17169    0.09244   1.857   0.0664 .
## V16         -0.13800    0.09633  -1.433   0.1553  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 0.9946 on 95 degrees of freedom
## Multiple R-squared:  0.1355, Adjusted R-squared:  0.09909 
## F-statistic: 3.722 on 4 and 95 DF,  p-value: 0.007378

We see that the \(R^2\) is about 14%, with the estimate of V1 being larger than the truth. Of course this might not be true next time.

## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE), 
##     drop = FALSE], y = y))
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.86076 -0.58678  0.00383  0.67267  2.42862 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.04905    0.10394   0.472   0.6381  
## V1           0.24932    0.10215   2.441   0.0165 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.023 on 98 degrees of freedom
## Multiple R-squared:  0.0573, Adjusted R-squared:  0.04768 
## F-statistic: 5.957 on 1 and 98 DF,  p-value: 0.01645

For BIC we see, again , that the effect of V1 is over-estimated, but the \(R^2\) is lower.

If you have time…

A more realistic model for reality can be that everything has an effect, but some have a stronger effect than others. We can look at how model selection behaves then:

betas <- 0.3*(0.9^(1:PSmall))
# plot(1:PSmall, betas)
NSmall <- 100; PSmall <- 20
xSmall <- matrix(rnorm(NSmall*PSmall), nrow=NSmall)
mu <- sweep(xSmall, 2, betas, '*')
yTaper <- rnorm(NSmall, mu)
TaperData <- data.frame(cbind(xSmall, yTaper))

AllTaperAIC <- bestglm(Xy=TaperData, IC="AIC")
AllTaperBIC <- bestglm(Xy=TaperData, IC="BIC")

Look at the results.