Latest changes: 22.05: added a table of contents for the HTML version; changed from k to j in the CV formulas to avoid using k for two different things.
This is important both for
estimating the performance of different models (often different orders of complexity within one model class) in order to choose the best model, and
having chosen a final model, estimating its performance (prediction error) on new data.
From Compulsory Exercise 1, Problem 1
We aim to do model selection in KNN regression, where the true curve is \(f(x)=-x+x^2+x^3\) with \(x \in [-3,3]\), and \(n=61\) for the training data.
Regression problem: training data and true regression curve
\(n=61\) both for training and for test data (using same x-grid).
Regression problem: MSE for training and test set, M=1000 versions
\(K\) small: high complexity (left) and \(K\) large: low complexity (right).
Regression problem: bias-variance tradeoff
Left (high complexity): low squared-bias (red) and high variance (green). Right (low complexity): high squared-bias (red) and low variance (green).
Reminder - we will use
Mean squared error (quadratic loss) for regression problems: \[Y_i=f({\bf x}_i)+\varepsilon_i \text{ for }i=1,\ldots,n \text{ and } \text{MSE}=\frac{1}{n}\sum_{i=1}^n (y_i-\hat{f}({\bf x}_i))^2\]
0/1 loss for classification problems: \[P(Y=j\mid {\bf x}_0) \text{ for }j=1,\ldots,K \] and we classify to the class with the highest probability to get \(\hat{y}_i\). The misclassification rate is then \[\frac{1}{n}\sum_{i=1}^n \text{I}(y_i \neq \hat{y}_i)\]
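As a tiny numerical illustration of these two losses (the vectors below are made up just to show the computations):
# quadratic loss (MSE) for a small made-up regression example
y=c(1.2,0.5,2.3,1.8)
yhat=c(1.0,0.7,2.0,2.1)
mean((y-yhat)^2)
# 0/1 loss (misclassification rate) for a small made-up classification example
yclass=c(1,0,1,1)
yhatclass=c(1,1,1,0)
mean(yclass!=yhatclass)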
Our example was based on simulated data, so I had unlimited access to data. Now, let us move to real data!
If we had a large amount of data we could divide our data into three parts:
Q: Why not enough with training and validation (why also test)?
A: The validation error of the chosen model will be a bit optimistic (too low), since we selected the model precisely because its validation error was the smallest; the test set gives an honest estimate for the final model.
If this is the case - great - then you do not need Module 5. But, this is very seldom the case - so we will study other solutions based on efficient sample reuse with resampling data.
An alternative strategy for model selection (using methods penalizing model complexity) is covered in Module 6.
First we look at cross-validation, then at the bootstrap.
Consider the case when you have a data set consisting of \(n\) observations.
To fit a model and to test its predictive performance you randomly divide the data set into two parts, each of sample size \(n/2\):
First: focus on model selection
Auto data set (library ISLR): predict mpg (miles per gallon) using a polynomial function of horsepower (the engine's horsepower), \(n=392\).
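A minimal sketch of the validation set approach on these data (the seed, the 50/50 split and the degree-2 polynomial are arbitrary choices here):
library(ISLR)
set.seed(1)
n=dim(Auto)[1]
train=sample(1:n,n/2) # random half used for training
lm.fit=lm(mpg~poly(horsepower,2),data=Auto,subset=train)
# validation MSE on the held-out half
mean((Auto$mpg-predict(lm.fit,Auto))[-train]^2)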
Leave-one-out cross-validation (LOOCV): leave out observation \(i\), fit the model on the remaining \(n-1\) observations and predict \(\hat{y}_i\) for the left-out observation: \[\text{MSE}_i=(y_i-\hat{y}_i)^2\]
Repeating this for all \(i=1,\ldots,n\) gives \[CV_{n}=\frac{1}{n}\sum_{i=1}^n \text{MSE}_i\]
library(ISLR)
library(boot)
library(ggplot2)
set.seed(123)
n=dim(Auto)[1]
testMSEvec=NULL
start=Sys.time()
for (polydeg in 1:10)
{
glm.fit=glm(mpg~poly(horsepower,polydeg),data=Auto)
glm.cv1=cv.glm(Auto, glm.fit,K=n) # K=n gives LOOCV
testMSEvec=c(testMSEvec,glm.cv1$delta[1]) # delta[1] is the raw CV estimate of test MSE
}
stopp=Sys.time()
yrange=c(15,28)
plotdf=data.frame("testMSE"=testMSEvec,"degree"=1:10)
g0=ggplot(plotdf,aes(x=degree,y=testMSE))+geom_line()+geom_point()+scale_y_continuous(limits = yrange)+scale_x_continuous(breaks=1:10)+labs(y="LOOCV")
g0
\[\text{Var}(\sum_{i=1}^na_iX_i+b)=\sum_{i=1}^n\sum_{j=1}^n a_ia_j\text{Cov}(X_i,X_j)\] \[=\sum_{i=1}^na_i^2\text{Var}(X_i)+2\sum_{i=2}^n \sum_{j=1}^{i-1} a_ia_j\text{Cov}(X_i,X_j).\]
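A quick numerical check of this formula (the weights and the covariance matrix below are arbitrary choices, and mvrnorm from the MASS library is used to simulate correlated variables):
library(MASS)
set.seed(1)
a=c(2,-1) # arbitrary weights
Sigma=matrix(c(1,0.8,0.8,2),2,2) # Var(X1)=1, Var(X2)=2, Cov(X1,X2)=0.8
X=mvrnorm(100000,mu=c(0,0),Sigma=Sigma)
var(c(X%*%a)) # empirical variance of a1*X1+a2*X2
t(a)%*%Sigma%*%a # the formula above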
For least squares regression there is an amazing shortcut that gives LOOCV from a single model fit: \[ CV_{n}=\frac{1}{n}\sum_{i=1}^n \left( \frac{y_i-\hat{y}_i}{1-h_{ii}} \right) ^2\]
where \(\hat{y}_i\) is the fit from the full data set and \(h_{ii}\) is the \(i\)th diagonal element (leverage) of the hat matrix \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\).
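This shortcut can be checked against brute-force LOOCV; a small sketch on the Auto data (polynomial degree 2 chosen arbitrarily):
library(ISLR)
library(boot)
fit=glm(mpg~poly(horsepower,2),data=Auto) # least squares fit (gaussian glm)
h=hatvalues(fit) # leverages h_ii
mean((residuals(fit,type="response")/(1-h))^2) # LOOCV from the shortcut formula
cv.glm(Auto,fit,K=dim(Auto)[1])$delta[1] # brute-force LOOCV - same value (up to numerical error)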
See class notes for drawings!
\[\text{MSE}_j=\frac{1}{n_j}\sum_{i\in C_j}(y_i-\hat{y}_i)^2\] where \(\hat{y}_i\) is the fit for observation \(i\) obtained from the data with fold \(j\) (the \(n_j\) observations in \(C_j\)) removed.
\[CV_{k}=\frac{1}{n} \sum_{j=1}^k n_j \text{MSE}_j\]
Observe: setting \(k=n\) gives LOOCV.
library(ISLR)
library(boot)
library(ggplot2)
set.seed(123)
n=dim(Auto)[1]
testMSEvec5=NULL
testMSEvec10=NULL
start=Sys.time()
for (polydeg in 1:10)
{
glm.fit=glm(mpg~poly(horsepower,polydeg),data=Auto)
glm.cv5=cv.glm(Auto, glm.fit,K=5)
glm.cv10=cv.glm(Auto, glm.fit,K=10)
testMSEvec5=c(testMSEvec5,glm.cv5$delta[1])
testMSEvec10=c(testMSEvec10,glm.cv10$delta[1])
}
stopp=Sys.time()
yrange=c(15,28)
plotdf=data.frame("testMSE5"=testMSEvec5,"degree"=1:10)
g0=ggplot(plotdf,aes(x=degree,y=testMSE5))+geom_line()+geom_point()+scale_y_continuous(limits = yrange)+scale_x_continuous(breaks=1:10)+labs(y="CV")+ggtitle("5 and 10 fold CV")
g0+geom_line(aes(y=testMSEvec10),colour="red")+geom_point(aes(y=testMSEvec10),colour="red")+ggtitle("5 fold (black), 10 fold (red)")
Other loss functions: 0/1 loss, or maybe 1-AUC?
ISL book slides, page 17: model assessment.
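A sketch of how a 0/1 loss can be supplied to cv.glm through its cost argument; the Default data and the chosen predictors here are just for illustration:
library(ISLR)
library(boot)
set.seed(123)
glmfit=glm(default~balance+student,family="binomial",data=Default)
# cost: fraction of observations where the predicted probability is on the wrong side of 0.5
cost01=function(r,pi=0) mean(abs(r-pi)>0.5)
cv.glm(Default,glmfit,cost=cost01,K=10)$delta[1] # estimated misclassification rate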
We use the following two-step strategy to produce a classifier:
Step 1: Select the predictors that are most correlated with the class labels.
Step 2: Fit a classifier using only these selected predictors.
How can we use cross-validation to produce an estimate of the performance of this classifier?
Q: Can we apply cross-validation only to step 2? Why or why not?
A: No, step 1 is part of the training procedure (the class labels are used) and must be part of the CV to give an honest estimate of the performance of the classifier.
We will see in the Recommended Exercises that doing the wrong thing can give an estimated misclassification error of approximately 0, even though the “true” error rate is 50%.
Article by Christophe Ambroise and Geoffrey J. McLachlan, “Selection bias in gene extraction on the basis of microarray gene-expression data”, PNAS 2002. A direct quotation from the abstract of the article follows.
Problem 1: Explain how \(k\)-fold cross-validation is implemented
Hint: the words “loss function”, “fold”, “training”, “validation” are central.
Problem 2: What are the advantages and disadvantages of \(k\)-fold cross-validation relative to the validation set approach and to LOOCV?
Hint: the words “bias”, “variance” and “computational complexity” should be included.
Problem 3: Selection bias and the “wrong way to do CV”.
The task here is to devise an algorithm to “prove” that the wrong way is wrong and that the right way is right.
library(boot)
# GENERATE DATA
# reproducible
set.seed(4268)
n=50 #number of observations
p=5000 #number of predictors
d=25 #top correlated predictors chosen
kfold=10
#generating predictor data
xs=matrix(rnorm(n*p,0,4),ncol=p,nrow=n) #simple way to get uncorrelated predictors
dim(xs) # n times p
# generate class labels independent of the predictors - so if everything is classified as class 1 we expect 50% errors in general
ys=c(rep(0,n/2),rep(1,n/2)) #now really 50% of each
table(ys)
# WRONG CV - using cv.glm
# here select the most correlated predictors outside the CV
corrs=apply(xs,2,cor,y=ys)
hist(corrs)
selected=order(corrs^2,decreasing = TRUE)[1:d] #top d correlated selected
data=data.frame(ys,xs[,selected])
#apply(xs[,selected],2,cor,y=ys) # yes, these have the highest correlations
# then cv around the fitting of the classifier - use logistic regression and built in cv.glm function
logfit=glm(ys~.,family="binomial",data=data)
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
cvres=cv.glm(data=data,cost=cost,glmfit=logfit,K=kfold)
cvres$delta
# observe - near 0 misclassification rate
# WRONG without using cv.glm - should be similar (just added to see the similarity to the RIGHT version)
reorder=sample(1:n,replace=FALSE)
validclass=NULL
for (i in 1:kfold)
{
neach=n/kfold
trainids=setdiff(1:n,(((i-1)*neach+1):(i*neach)))
traindata=data.frame(xs[reorder[trainids],],ys[reorder[trainids]])
validdata=data.frame(xs[reorder[-trainids],],ys[reorder[-trainids]])
colnames(traindata)=colnames(validdata)=c(paste("X",1:p),"y")
data=traindata[,c(selected,p+1)]
trainlogfit=glm(y~.,family="binomial",data=data)
pred=plogis(predict.glm(trainlogfit,newdata=validdata[,selected]))
print(pred)
validclass=c(validclass,ifelse(pred > 0.5, 1, 0))
}
table(ys[reorder],validclass)
1-sum(diag(table(ys[reorder],validclass)))/n
# CORRECT CV
reorder=sample(1:n,replace=FALSE)
validclass=NULL
for (i in 1:kfold)
{
neach=n/kfold
trainids=setdiff(1:n,(((i-1)*neach+1):(i*neach)))
traindata=data.frame(xs[reorder[trainids],],ys[reorder[trainids]])
validdata=data.frame(xs[reorder[-trainids],],ys[reorder[-trainids]])
colnames(traindata)=colnames(validdata)=c(paste("X",1:p),"y")
foldcorrs= apply(traindata[,1:p],2,cor,y=traindata[,p+1])
selected=order(foldcorrs^2,decreasing = TRUE)[1:d] #top d correlated selected
data=traindata[,c(selected,p+1)]
trainlogfit=glm(y~.,family="binomial",data=data)
pred=plogis(predict.glm(trainlogfit,newdata=validdata[,selected]))
validclass=c(validclass,ifelse(pred > 0.5, 1, 0))
}
table(ys[reorder],validclass)
1-sum(diag(table(ys[reorder],validclass)))/n
Problem 4: Try out different versions of cross-validation in R, using data sets from the ISLR library. Use Section 5.3 in the ISL book as a starting point.
The inventor: Bradley Efron in 1979 - see interview.
The name? To pull oneself up by one’s bootstraps from “The Surprising Adventures of Baron Munchausen” by Rudolph Erich Raspe:
The Baron had fallen to the bottom of a deep lake. Just when it looked like all was lost, he thought to pick himself up by his own bootstraps.
Idea: Use the data itself to get more information about a statistic (an estimator).
Assume that we observe a random sample \(X_1, X_2, \ldots, X_n\) from an unknown probability distribution \(f\). We are interested in saying something about the population median, and to do that we calculate the sample median \(\tilde{X}\). But how accurate is \(\tilde{X}\) as an estimator?
The bootstrap was introduced as a computer-based method to estimate the standard deviation of an estimator, for example our estimator \(\tilde{X}\).
But before we look at the bootstrap method, let us first assume that we know \(f\) and can sample from \(f\), and use simulations to answer our question.
set.seed(123)
n=101
B=1000
estimator=rep(NA,B)
for (b in 1:B)
{
xs=rnorm(n)
estimator[b]=median(xs)
}
sd(estimator)
# large-sample approximation: the sd of the median of a standard normal
# sample is sqrt(pi/2)/sqrt(n), with sqrt(pi/2) approximately 1.253
1.253*1/sqrt(n)
## [1] 0.1259035
## [1] 0.1246782
ggplot(data=data.frame(x=estimator),aes(x=x))+
geom_density()
The bootstrap method uses the observed data to estimate the empirical distribution \(\hat{f}\), where each observed value of \(x\) is given probability \(1/n\).
A bootstrap sample \(X^*_1,X^*_2,\ldots, X^*_n\) is a random sample drawn from \(\hat{f}\).
A simple way to obtain the bootstrap sample is to draw with replacement from \(X_1, X_2, \ldots, X_n\).
This means that our bootstrap sample consists of \(n\) members of \(X_1, X_2, \ldots, X_n\) - some appearing 0 times, some 1, some 2, etc.
set.seed(123)
n=101
original=rnorm(n)
median(original)
boot1=sample(x=original,size=n,replace=TRUE) # one bootstrap sample
table(table(boot1)) # how many values appear once, twice, three times, ...
n-sum(table(table(boot1))) # how many of the original observations do not appear at all
median(boot1)
## [1] 0.05300423
##
## 1 2 3 4 5
## 42 16 6 1 1
## [1] 35
## [1] -0.1388914
Draw \(B\) bootstrap samples with replacement from the original data.
Evaluate the statistic on each of the \(B\) bootstrap samples to get \(\tilde{X}^*_b\) for the \(b\)th bootstrap sample.
Estimate the squared standard error by \[\frac{1}{B-1}\sum_{b=1}^B \left(\tilde{X}^*_b-\frac{1}{B}\sum_{b'=1}^B \tilde{X}^*_{b'}\right)^2\]
set.seed(123)
n=101
original=rnorm(n)
median(original)
B=1000
estimator=rep(NA,B)
for (b in 1:B)
{
thisboot=sample(x=original,size=n,replace=TRUE)
estimator[b]=median(thisboot)
}
sd(estimator)
## [1] 0.05300423
## [1] 0.1365856
ggplot(data=data.frame(x=estimator),aes(x=x))+
geom_density()
The boot function from the library boot
library(boot)
set.seed(123)
n=101
original=rnorm(n)
median(original)
summary(original)
## [1] 0.05300423
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.30917 -0.50232 0.05300 0.08248 0.68864 2.18733
boot.median=function(data,index) return(median(data[index]))
B=1000
boot(original,boot.median,R=B)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = original, statistic = boot.median, R = B)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.05300423 -0.01413291 0.136591
In bootstrapping we sample with replacement from our observations.
What if we instead sample without replacement?
(Then we would always get the same sample - given that the order of the sample points is not important for our estimator.)
(In permutation testing we sample without replacement to get samples under the null hypothesis - a separate field of research.)
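As a minimal sketch of that idea (a permutation test for a difference in group means on made-up data; not something we pursue in this module):
set.seed(123)
x=rnorm(20,mean=0)
y=rnorm(20,mean=0.5) # two made-up groups
obs=mean(x)-mean(y) # observed test statistic
pooled=c(x,y)
permstat=replicate(10000, {
  perm=sample(pooled) # reshuffle group labels by sampling WITHOUT replacement
  mean(perm[1:20])-mean(perm[21:40])
})
mean(abs(permstat)>=abs(obs)) # two-sided permutation p-value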
We assume, for observation \(i\): \[Y_i= \beta_0 + \beta_{1} x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \varepsilon_i,\] where \(i=1,2,...,n\). The model can be written in matrix form: \[{\bf Y}={\bf X} \boldsymbol{\beta}+{\boldsymbol{\varepsilon}}.\]
The least squares estimator \(\hat{\boldsymbol\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\) has \(\text{Cov}(\hat{\boldsymbol\beta})=\sigma^2({\bf X}^T{\bf X})^{-1}\).
We will in the recommended exercises look at how to use bootstrapping to estimate the covariance of this estimator. Why is that “needed” if we already know the mathematical formula for the standard deviation? Answer: it is not needed here - but it makes a nice example.
We will not do this here - but our bootstrap samples can also be used to make confidence intervals for the regression coefficients or prediction intervals for new observations. This means that we do not have to rely on assuming that the error terms are normally distributed!
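The mechanics would be the same as in the median example above; for instance, a percentile interval can be read off with boot.ci (just a sketch, we do not pursue this further here):
library(boot)
set.seed(123)
original=rnorm(101)
boot.median=function(data,index) return(median(data[index]))
bootout=boot(original,boot.median,R=1000)
boot.ci(bootout,type="perc") # percentile bootstrap confidence interval for the median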
Bagging is a special case of ensemble methods.
In Module 8 we will look at bagging, which is built on bootstrapping and the fact that it is possible to reduce the variance of a prediction by taking the average of many model fits.
We can therefore obtain a new model (our average of the individual models) that has a smaller variance than each of the individual models, because
\[\text{Var}(\bar{X})=\frac{\sigma^2}{n}\]
Since this averaged predictor has smaller variance than each of the individual predictors, we would assume that it gives more accurate predictions. However, the bagged prediction might be harder to interpret than the separate predictors.
Models that have poor prediction ability (as we may see can happen with regression and classification trees) might benefit greatly from bagging. More in Module 8.
Problem 1: We will calculate the probability that a given observation in our original sample is part of a bootstrap sample. This is useful for us to know in Module 8.
Our sample size is \(n\).
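One way to check your answer numerically (a simulation sketch; n=100 and tracking observation number 1 are arbitrary choices):
set.seed(123)
n=100
B=10000
# fraction of bootstrap samples of the indices 1:n that contain observation 1
mean(replicate(B, 1 %in% sample(1:n,n,replace=TRUE)))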
Problem 2: Explain with words and an algorithm how you would proceed to use bootstrapping to estimate the standard deviation of one of the regression parameters in multiple linear regression. Comment on which assumptions you make for your regression model.
Problem 3: Implement your algorithm from 2, both using a for-loop and using the boot function. Hint: see page 195 of our ISLR book. Use our SLID data set and provide standard errors for the coefficient for age. Compare with the theoretical value, obtained from \(\hat{\sigma}^2({\bf X}^T{\bf X})^{-1}\), that you find in the output from the regression model.
library(car)
library(boot)
SLID = na.omit(SLID)
n = dim(SLID)[1]
SLID.lm = glm(wages~., data = SLID)
summary(SLID.lm)$coeff[3,2] # theoretical standard error of the age coefficient
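One possible sketch for the bootstrap part of Problem 3, following the boot.fn pattern from page 195 of the ISLR book (R=1000 is an arbitrary choice); the std. error column in the output can be compared with the theoretical value printed above:
library(car)
library(boot)
SLID=na.omit(SLID)
set.seed(123)
# statistic: the age coefficient (number 3) from refitting the model on a bootstrap sample of the rows
boot.fn=function(data,index) return(coef(glm(wages~.,data=data[index,]))[3])
boot(SLID,boot.fn,R=1000)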
(Wednesday 10.15-12.00 and Friday 12.15-14.00)
On arrival: participate in the interactive lecture (IL, with the lecturer) or get supervision for CompEx1 (TAs). If supervision only: sit at an allocated table - or participate in the IL and take breaks to ask for supervision for CompEx1.
If IL - answer: name + study programme + year + “I found CompEx1 ok or difficult”. Groups are then formed (by the lecturer) to be as homogeneous as possible (so, not random).
11.00/13.00: Break, with fruit.
# packages to install before knitting this R Markdown file
# to knit the Rmd
install.packages("knitr")
install.packages("rmarkdown")
# cool layout for the Rmd
install.packages("prettydoc") # alternative to github
#plotting
install.packages("ggplot2") # cool plotting
install.packages("ggpubr") # for many ggplots
#datasets
install.packages("ElemStatLearn")
install.packages("ISLR")
# cross-validation and bootstrapping
install.packages("boot")