Version 07.03: corrected Q18 Europe coding
The maximal score is 10 points. You need a score of 4/10 for the exercise to be approved. Your score will make up 10 percentage points of your final grade.
Supervision:
Practical issues:
Use echo=TRUE and eval=TRUE to make it simpler for us to read and grade. The template file for submission of Compulsory exercise 2 is here: https://www.math.ntnu.no/emner/TMA4268/2018v/CompEx2mal.Rmd
You need to install the following packages in R:
install.packages("ISLR")
install.packages("ggplot2")
install.packages("GGally")
install.packages("leaps")
install.packages("glmnet")
install.packages("gam")
Hint: pages 244-247 in our textbook “Introduction to Statistical Learning”.
We will now study the Auto data set in the ISLR package. We will use mpg (miles per gallon) as response, and selected covariates (see below; observe that we have recoded cylinders into two groups, and that origin is a categorical covariate). We set aside 20% of the data (ourAutoTest) to be used to assess the selected model(s), and use the rest for model selection (ourAutoTrain).
library(ISLR)
ourAuto=data.frame("mpg"=Auto$mpg,"cylinders"=factor(cut(Auto$cylinders,2)),
"displace"=Auto$displacement,"horsepower"=Auto$horsepower,
"weight"=Auto$weight,"acceleration"=Auto$acceleration,
"year"=Auto$year,"origin"=as.factor(Auto$origin))
colnames(ourAuto)
ntot=dim(ourAuto)[1]
ntot
set.seed(4268)
testids=sort(sample(1:ntot,ceiling(0.2*ntot),replace=FALSE))
ourAutoTrain=ourAuto[-testids,]
ourAutoTest=ourAuto[testids,]
## [1] "mpg" "cylinders" "displace" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [1] 392
Below we have performed best subset selection with the BIC criterion using the function regsubsets from the leaps package.
library(leaps)
res=regsubsets(mpg~.,nbest=1,data=ourAutoTrain)
sumres=summary(res)
sumres
plot(res,scale="bic")
sumres$bic
which.min(sumres$bic)
coef(res,id=which.min(sumres$bic))
## Subset selection object
## Call: regsubsets.formula(mpg ~ ., nbest = 1, data = ourAutoTrain)
## 8 Variables (and intercept)
## Forced in Forced out
## cylinders(5.5,8.01] FALSE FALSE
## displace FALSE FALSE
## horsepower FALSE FALSE
## weight FALSE FALSE
## acceleration FALSE FALSE
## year FALSE FALSE
## origin2 FALSE FALSE
## origin3 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## cylinders(5.5,8.01] displace horsepower weight acceleration year
## 1 ( 1 ) " " " " " " "*" " " " "
## 2 ( 1 ) " " " " " " "*" " " "*"
## 3 ( 1 ) "*" " " " " "*" " " "*"
## 4 ( 1 ) "*" " " " " "*" " " "*"
## 5 ( 1 ) "*" "*" " " "*" " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*"
## 7 ( 1 ) "*" "*" "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*"
## origin2 origin3
## 1 ( 1 ) " " " "
## 2 ( 1 ) " " " "
## 3 ( 1 ) " " " "
## 4 ( 1 ) " " "*"
## 5 ( 1 ) " " "*"
## 6 ( 1 ) " " "*"
## 7 ( 1 ) "*" "*"
## 8 ( 1 ) "*" "*"
## [1] -343.1283 -494.5988 -501.7008 -505.3911 -505.5065 -514.2228 -516.9417
## [8] -512.8292
## [1] 7
## (Intercept) cylinders(5.5,8.01] displace
## -19.410696099 -3.533041185 0.032788184
## horsepower weight year
## -0.048833206 -0.005823726 0.784897712
## origin2 origin3
## 1.794007415 2.906285958
Fit the selected model using ourAutoTrain and comment on the model fit. Report the MSE on ourAutoTrain. Then use the model to predict new values for ourAutoTest and report the MSE. Refer to pages 248-249 in our textbook “Introduction to Statistical Learning”, and also to the solutions to Recommended Problem 5 in Module 6 for hints on how to write the R code.
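A minimal sketch of the MSE computation, assuming the seven-variable model selected by BIC above (the variable set and the objects ourAutoTrain/ourAutoTest come from the earlier code):

```r
# Sketch: refit the BIC-selected model with lm and compute train/test MSE.
fit <- lm(mpg ~ cylinders + displace + horsepower + weight + year + origin,
          data = ourAutoTrain)
trainMSE <- mean((ourAutoTrain$mpg - predict(fit, ourAutoTrain))^2)
testMSE  <- mean((ourAutoTest$mpg  - predict(fit, ourAutoTest))^2)
c(trainMSE, testMSE)
```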
Use cross-validation on ourAutoTrain to choose the best model (no, do not use ourAutoTest here). (Hint: it is smart to make a predict.regsubsets function as on page 249 in “Introduction to Statistical Learning”.) Fit the chosen model on ourAutoTrain and report interesting features of this final model. Also use this model fit to predict new values for ourAutoTest and report the MSE. (If you get the same best model as Q4, just refer back to Q4 and Q5.)

In this exercise we will study lasso and ridge regression. We continue using the ourAutoTrain dataset from Problem 1.
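The predict.regsubsets helper hinted at above can be sketched as on page 249 of “Introduction to Statistical Learning”:

```r
# Sketch of a predict method for regsubsets objects (ISL, p. 249).
predict.regsubsets <- function(object, newdata, id, ...) {
  form  <- as.formula(object$call[[2]])        # formula used in the original fit
  mat   <- model.matrix(form, newdata)         # design matrix for the new data
  coefi <- coef(object, id = id)               # coefficients of the size-id model
  mat[, names(coefi), drop = FALSE] %*% coefi  # predictions
}
```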
In a regression model with \(p\) predictors the ridge regression coefficients are the values that minimize
\[
\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda \sum_{j=1}^{p}\beta_j^2
\] while the lasso regression coefficients are the values that minimize \[
\sum_{i=1}^{n}(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij})^2+\lambda \sum_{j=1}^{p}\lvert \beta_j \rvert.
\] In Figure 1 and Figure 2 you see the results from lasso and ridge regression applied to ourAutoTrain. The standardized coefficients \(\hat{\beta}_1,\ldots,\hat{\beta}_8\) are plotted against the tuning parameter \(\lambda\).
Figure 1.
Figure 2.
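As a sketch, coefficient paths like those in Figures 1 and 2 can be drawn with plot on a glmnet fit, assuming x (the model matrix without intercept) and y (mpg) are constructed as in the code further below:

```r
library(glmnet)
# Sketch: coefficient paths against log(lambda); alpha=1 is lasso, alpha=0 is ridge.
par(mfrow = c(1, 2))
plot(glmnet(x, y, alpha = 1, standardize = TRUE), xvar = "lambda")  # lasso
plot(glmnet(x, y, alpha = 0, standardize = TRUE), xvar = "lambda")  # ridge
```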
In the following, we will use functions in the glmnet package to perform lasso regression. The first step is to find the optimal tuning parameter \(\lambda\). This is done by cross-validation using the cv.glmnet() function:
library(glmnet)
set.seed(4268)
x=model.matrix(mpg~.,ourAutoTrain)[,-1] #-1 to remove the intercept.
head(x)
y=ourAutoTrain$mpg
lambda=c(seq(from=5,to=0.1,length.out=150),0.01,0.0001) #Create a set of tuning parameters, adding low value to also see least squares fit
cv.out=cv.glmnet(x,y,alpha=1,nfolds=10,lambda=lambda, standardize=TRUE) #alpha=1 gives lasso, alpha=0 gives ridge
plot(cv.out)
## cylinders(5.5,8.01] displace horsepower weight acceleration year origin2
## 1 1 307 130 3504 12.0 70 0
## 2 1 350 165 3693 11.5 70 0
## 4 1 304 150 3433 12.0 70 0
## 5 1 302 140 3449 10.5 70 0
## 8 1 440 215 4312 8.5 70 0
## 9 1 455 225 4425 10.0 70 0
## origin3
## 1 0
## 2 0
## 4 0
## 5 0
## 8 0
## 9 0
Explain what cv.glmnet does. Hint: help(cv.glmnet). Explain the 1se-rule; see help(cv.glmnet). Use cv.glmnet and the 1se-rule to choose the “optimal” \(\lambda\). A new car has displace=150, horsepower=100, weight=3000, acceleration=10, year=82 and comes from Europe. What is the predicted mpg for this car given the chosen model from Q17? Hint: you need to construct the new observation in the same way as the observations in the model matrix x (the dummy variable coding for cylinders and origin), and newx needs to be a matrix: newx=matrix(c(0,150,100,3000,10,82,1,0),nrow=1).

In this exercise we take a quick look at different non-linear regression methods. We continue using the ourAutoTrain dataset from Problems 1 and 2.
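A hedged sketch of the prediction step, assuming the cv.out object from the code above and the \(\lambda\) chosen by the 1se-rule (cv.out$lambda.1se):

```r
# Sketch: lasso fit at the 1se-rule lambda, then predict for the new car.
# Column order must match x: cylinders dummy, displace, horsepower, weight,
# acceleration, year, origin2, origin3; origin2=1 encodes Europe.
lasso.fit <- glmnet(x, y, alpha = 1, lambda = cv.out$lambda.1se)
newx <- matrix(c(0, 150, 100, 3000, 10, 82, 1, 0), nrow = 1)
predict(lasso.fit, newx = newx)
```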
Fit an additive model using the function gam from package gam. Call the result gamobject.

- mpg is the response,
- displace is a cubic spline (hint: bs) with one knot at 290,
- horsepower is a polynomial of degree 2 (hint: poly),
- weight is a linear function,
- acceleration is a smoothing spline with df=3,
- origin is a step function (what we previously have called dummy variable coding).

Plot the resulting curves (hint: first set par(mfrow=c(2,3)) and then plot(gamobject,se=TRUE,col="blue")). Comment on what you see.

displace).
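One possible gam call matching the term specification above, as a sketch (the exact model formula is part of the exercise; bs comes from the splines package):

```r
library(gam)
library(splines)  # for bs()
# Sketch: additive model with the terms specified in the exercise text.
gamobject <- gam(mpg ~ bs(displace, knots = c(290)) +  # cubic spline, knot at 290
                   poly(horsepower, 2) +               # degree-2 polynomial
                   weight +                            # linear term
                   s(acceleration, df = 3) +           # smoothing spline, df=3
                   origin,                             # step function (dummies)
                 data = ourAutoTrain)
par(mfrow = c(2, 3))
plot(gamobject, se = TRUE, col = "blue")
```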