(This was Problem 2 of Compulsory exercise 2 in 2018.)
In this exercise we will study lasso and ridge regression. We continue with the ourAutoTrain
dataset from Problem 1 (of Compulsory exercise 2 in 2018).
library(ISLR)
# Recode cylinders into two intervals via cut(), and origin as a factor
ourAuto = data.frame("mpg" = Auto$mpg,
                     "cylinders" = factor(cut(Auto$cylinders, 2)),
                     "displace" = Auto$displacement,
                     "horsepower" = Auto$horsepower,
                     "weight" = Auto$weight,
                     "acceleration" = Auto$acceleration,
                     "year" = Auto$year,
                     "origin" = as.factor(Auto$origin))
colnames(ourAuto)
## [1] "mpg" "cylinders" "displace" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
ntot=dim(ourAuto)[1]
ntot
## [1] 392
set.seed(4268)
# Hold out 20% of the observations as a test set
testids = sort(sample(1:ntot, ceiling(0.2*ntot), replace = FALSE))
ourAutoTrain = ourAuto[-testids, ]
ourAutoTest = ourAuto[testids, ]
In a regression model with \(p\) predictors the ridge regression coefficients are the values that minimize
\[
\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda \sum_{j=1}^{p}\beta_j^2
\] while the lasso regression coefficients are the values that minimize \[
\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda \sum_{j=1}^{p}\lvert \beta_j \rvert.
\] In Figure 1 and Figure 2 you see the results from lasso and ridge regression applied to ourAutoTrain. Standardized coefficients \(\hat{\beta}_1, \ldots, \hat{\beta}_8\) are plotted against the tuning parameter \(\lambda\).
Figure 1. (Figure not reproduced: standardized coefficients plotted against the tuning parameter \(\lambda\).)
Figure 2. (Figure not reproduced: standardized coefficients plotted against the tuning parameter \(\lambda\).)
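A hedged sketch of how such coefficient-path plots can be generated with glmnet follows; this is an assumption about how the figures were made, not the original figure code.
library(glmnet)
x = model.matrix(mpg ~ ., ourAutoTrain)[, -1]  # dummy-coded predictors, intercept column removed
y = ourAutoTrain$mpg
plot(glmnet(x, y, alpha = 1, standardize = TRUE), xvar = "lambda", label = TRUE)  # lasso coefficient paths
plot(glmnet(x, y, alpha = 0, standardize = TRUE), xvar = "lambda", label = TRUE)  # ridge coefficient paths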
In the following, we will use functions in the glmnet package to perform lasso regression. The first step is to find the optimal tuning parameter \(\lambda\). This is done by cross-validation using the cv.glmnet() function:
library(glmnet)
set.seed(4268)
x=model.matrix(mpg~.,ourAutoTrain)[,-1] #-1 to remove the intercept.
head(x)
y=ourAutoTrain$mpg
lambda=c(seq(from=5,to=0.1,length.out=150),0.01,0.0001) # grid of tuning parameters; the small values approximate the least squares fit
cv.out=cv.glmnet(x,y,alpha=1,nfolds=10,lambda=lambda, standardize=TRUE) #alpha=1 gives lasso, alpha=0 gives ridge
plot(cv.out)
## cylinders(5.5,8.01] displace horsepower weight acceleration year origin2 origin3
## 1                  1      307        130   3504         12.0   70       0       0
## 2                  1      350        165   3693         11.5   70       0       0
## 4                  1      304        150   3433         12.0   70       0       0
## 5                  1      302        140   3449         10.5   70       0       0
## 8                  1      440        215   4312          8.5   70       0       0
## 9                  1      455        225   4425         10.0   70       0       0
Answer the following questions; a hedged solution sketch is given after the list.

- Explain what cv.glmnet does. Hint: help(cv.glmnet).
- Explain what the 1se-rule is. See help(cv.glmnet).
- Q17: Use cv.glmnet and the 1se-rule to choose the "optimal" \(\lambda\), and fit the corresponding model.
- A new car has displace=150, horsepower=100, weight=3000, acceleration=10, year=82 and comes from Europe. What is the predicted mpg for this car given the chosen model from Q17? Hint: you need to construct the new observation with the same coding as the observations in the model matrix x (the dummy-variable coding for cylinders and origin), and newx needs to be a matrix: newx=matrix(c(0,150,100,3000,10,82,1,0),nrow=1).
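A possible solution sketch, assuming the 1se-rule is applied through cv.out$lambda.1se and using the newx from the hint; this is an illustration, not necessarily the official solution.
lambda.1se = cv.out$lambda.1se # largest lambda whose CV error is within 1 SE of the minimum
fit.lasso = glmnet(x, y, alpha = 1, lambda = lambda, standardize = TRUE) # refit the lasso on the full grid
coef(fit.lasso, s = lambda.1se) # coefficients of the chosen model; some are shrunk to exactly zero
# New observation, coded as in x: cylinders dummy = 0, origin2 = 1 (Europe), origin3 = 0
newx = matrix(c(0, 150, 100, 3000, 10, 82, 1, 0), nrow = 1)
predict(fit.lasso, newx = newx, s = lambda.1se) # predicted mpg for the new car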
Write down the design matrix for a natural spline with \(X\) = year and one knot \(c_1 = 2006\). Let the boundary knots be the extreme values of year, that is \(c_0 = 2003\) and \(c_2 = 2009\). A general basis for a natural spline is \[
b_1(x_i) = x_i, \quad b_{k+2}(x_i) = d_k(x_i) - d_K(x_i), \quad k = 0, \ldots, K - 1,
\] where \[
d_k(x_i) = \frac{(x_i - c_k)^3_+ - (x_i - c_{K+1})^3_+}{c_{K+1} - c_k}.
\]
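As a sketch, this design matrix can also be computed numerically. Here the observed years are taken to be 2003 through 2009, an assumption made purely for illustration (the exercise only fixes the knots).
# With K = 1 interior knot c1 = 2006 and boundary knots c0 = 2003, c2 = 2009,
# the basis is b1(x) = x and b2(x) = d0(x) - d1(x)
dk = function(x, ck, cK1) (pmax(x - ck, 0)^3 - pmax(x - cK1, 0)^3)/(cK1 - ck)
year.grid = 2003:2009 # hypothetical observed years
b1 = year.grid # b_1(x_i) = x_i
b2 = dk(year.grid, 2003, 2009) - dk(year.grid, 2006, 2009) # b_2(x_i) = d_0(x_i) - d_1(x_i)
X.design = cbind(1, b1, b2) # columns: intercept, b1(x_i), b2(x_i)
X.design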
Load the Wage dataset by writing library(ISLR)
and attach(Wage)
. Use library(gam)
to fit an additive model with wage
as response, a polynomial for age
and a cubic spline for year
. Use 4 basis functions for each covariate.
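A hedged sketch of such a fit: "4 basis functions per covariate" is read here as poly(age, 4) for the polynomial and a cubic spline bs(year, df = 4); other knot and degree choices could also satisfy the exercise.
library(ISLR)
library(gam) # also loads the splines package, which provides bs()
# Additive model: degree-4 polynomial in age, cubic spline with 4 basis functions in year
gam.fit = gam(wage ~ poly(age, 4) + bs(year, df = 4), data = Wage)
summary(gam.fit)
par(mfrow = c(1, 2))
plot(gam.fit, se = TRUE, col = "blue") # fitted component functions with pointwise SE bands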
That was all for modules 6 and 7!