Version 07.03: corrected Q18 Europe coding
Maximal score is 10 points. You need a score of 4/10 for the exercise to be approved. Your score will make up 10% points of you final grade.
Supervision:
Practical issues:
echo=TRUE
and eval=TRUE
to make it simpler for us to read and grade.Template file for submission of Compulsory exercise 2 is here: https://www.math.ntnu.no/emner/TMA4268/2018v/CompEx2mal.Rmd
You need to install the following packages in R:
install.packages("ISLR")
install.packages("ggplot2")
install.packages("GGally")
install.packages("leaps")
install.packages("glmnet")
install.packages("gam")
Hint: pages 244-247 in our textbook “Introduction to Statistical Learning”.
We will now study the Auto
data set in the ISLR
package. We will use mpg
(miles pr gallon) as response, and selected covariates (see below, observe that we have recoded the cylinders
into two groups, and that origin
is a categorical covariate). We set aside 20% of the data (ourAutoTest
) to be used to assess the selected model(s), and use the rest for model selection (ourAutoTrain
).
library(ISLR)
ourAuto=data.frame("mpg"=Auto$mpg,"cylinders"=factor(cut(Auto$cylinders,2)),
"displace"=Auto$displacement,"horsepower"=Auto$horsepower,
"weight"=Auto$weight,"acceleration"=Auto$acceleration,
"year"=Auto$year,"origin"=as.factor(Auto$origin))
colnames(ourAuto)
ntot=dim(ourAuto)[1]
ntot
set.seed(4268)
testids=sort(sample(1:ntot,ceiling(0.2*ntot),replace=FALSE))
ourAutoTrain=ourAuto[-testids,]
ourAutoTest=ourAuto[testids,]
## [1] "mpg" "cylinders" "displace" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [1] 392
Below we have performed best subset selection with the BIC criterion using the function regsubsets
from the leaps
package.
library(leaps)
res=regsubsets(mpg~.,nbest=1,data=ourAutoTrain)
sumres=summary(res)
sumres
plot(res,scale="bic")
sumres$bic
which.min(sumres$bic)
coef(res,id=which.min(sumres$bic))
## Subset selection object
## Call: regsubsets.formula(mpg ~ ., nbest = 1, data = ourAutoTrain)
## 8 Variables (and intercept)
## Forced in Forced out
## cylinders(5.5,8.01] FALSE FALSE
## displace FALSE FALSE
## horsepower FALSE FALSE
## weight FALSE FALSE
## acceleration FALSE FALSE
## year FALSE FALSE
## origin2 FALSE FALSE
## origin3 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## cylinders(5.5,8.01] displace horsepower weight acceleration year
## 1 ( 1 ) " " " " " " "*" " " " "
## 2 ( 1 ) " " " " " " "*" " " "*"
## 3 ( 1 ) "*" " " " " "*" " " "*"
## 4 ( 1 ) "*" " " " " "*" " " "*"
## 5 ( 1 ) "*" "*" " " "*" " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*"
## origin2 origin3
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## [1] -343.1283 -494.5988 -501.7008 -505.3911 -505.5065 -514.2228 -516.9417
## [8] -512.8292
## [1] 7
## (Intercept) cylinders(5.5,8.01] displace
## -19.410696099 -3.533041185 0.032788184
## horsepower weight year
## -0.048833206 -0.005823726 0.784897712
## origin2 origin3
## 1.794007415 2.906285958
ourAutoTrain
and comment on the model fit. Report the MSE on ourAutoTrain
.ourAutoTest
and report the MSE.Refer to pages 248-249 in our textbook “Introduction to Statistical Learning”, and also solutions to Recommended Problem 5 in Module 6 for hints on how to make the R code.
ourAutoTrain
(no, do not use ourAutoTest
here). (Hint: smart to make a predict.regsubsets
function as on page 249 in “Introduction to Statistical Learning”.)ourAutoTrain
. Report interesting features of this final model. Also use this model fit to predict new values for ourAutoTest
and report the MSE. (If you get the same best model as Q4, just refer back to Q4 and Q5.)In this exercise we will study lasso and ridge regression. We continue using the ourAutoTrain
dataset from Problem 1.
In a regression model with \(p\) predictors the ridge regression coefficients are the values that minimize
\[
\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda \sum_{j=1}^{p}\beta_j^2
\] while the lasso regression coefficients are the values that minimize \[
\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda \sum_{j=1}^{p}\lvert \beta_j \rvert.
\] In Figure 1 and Figure 2 you see the results from lasso and ridge regression applied to ourAutoTrain
. Standardized coefficients \(\hat{\beta_1},...,\hat{\beta_8}\) are plotted against the tuning parameter \(\lambda\).
Figure 1.
Figure 2.
In the following, we will use functions in the glmnet
package to perform \(lasso\) regression. The first step is to find the optimal tuning parameter \(\lambda\). This is done by cross-validation using the cv.glmnet()
function:
library(glmnet)
set.seed(4268)
x=model.matrix(mpg~.,ourAutoTrain)[,-1] #-1 to remove the intercept.
head(x)
y=ourAutoTrain$mpg
lambda=c(seq(from=5,to=0.1,length.out=150),0.01,0.0001) #Create a set of tuning parameters, adding low value to also see least squares fit
cv.out=cv.glmnet(x,y,alpha=1,nfolds=10,lambda=lambda, standardize=TRUE) #alpha=1 gives lasso, alpha=0 gives ridge
plot(cv.out)
## cylinders(5.5,8.01] displace horsepower weight acceleration year origin2
## 1 1 307 130 3504 12.0 70 0
## 2 1 350 165 3693 11.5 70 0
## 4 1 304 150 3433 12.0 70 0
## 5 1 302 140 3449 10.5 70 0
## 8 1 440 215 4312 8.5 70 0
## 9 1 455 225 4425 10.0 70 0
## origin3
## 1 0
## 2 0
## 4 0
## 5 0
## 8 0
## 9 0
cv.glmnet
does. Hint: help(cv.glmnet)
.1se
-rule. See help(cv.glmnet)
.cv.glmnet
and the 1se-rule
to choose the “optimal”" \(\lambda\).displace=150
, horsepower=100
, weight=3000
, acceleration=10
, year=82
and comes from Europe. What is the predicted mpg
for this car given the chosen model from Q17? Hint: you need to construct the new observation in the same way as observations in the model matrix x
(the dummy variable coding for cylinders and origin) and newx
need to be a matrix newx=matrix(c(0,150,100,3000,10,82,1,0),nrow=1)
.In this exercise we take a quick look at different non-linear regression methods. We continue using the ourAutoTrain
dataset from Problem 1 and 2.
gam
from package gam
. Call the result gamobject
.
mpg
is the response,displace
is a cubic spline (hint: bs
) with one knot at 290,horsepower
is a polynomial of degree 2 (hint: poly
),weight
is a linear function,acceleration
is a smoothing spline with df=3
,origin
is a step function (what we previously have called dummy variable coding). Plot the resulting curves (hint: first set par(mfrow=c(2,3)
and then plot(gamobject,se=TRUE,col="blue")
). Comment on what you see.displace
).