(Last changes: January 31 - first version)
Maximal score is 15 points. You need a score of 6/15 for the exercise to be approved. Your score will make up 15% points of your final grade.
All supervision is in Smia.
General supervision:
echo=TRUE
and eval=TRUE
to make it simpler for us to read and grade.You need to install the following packages in R to run the code in this file.
install.packages("knitr") #probably already installed
install.packages("rmarkdown") #probably already installed
install.packages("GLMsData") #data set for Problem 1
install.packages("ggplot2") #plotting with ggplot (all code provided)
install.packages("nortest") #test that data comes from normal distribution
install.packages("class")# for function knn
install.packages("colorspace") #for nice colors in provided plotting code
install.packages("caret") #for confusion matrices
install.packages("pROC") #for ROC curves
[Maximal score: 4 points]
The lung capacity data lungcap
(from the GLMsData
R package) gives information on health and on smoking habits of a sample of 654 youths, aged 3 to 19, in the area of East Boston during middle to late 1970s.
We will focus on modelling forced expiratory volume FEV
, a measure of lung capacity. For each person in the data set we have measurements of the following 5 variables:
FEV
the forced expiratory volume in litres, a measure of lung capacity; a numeric vector,Age
the age of the subject in completed years; a numeric vector,Ht
the height in inches; a numeric vector,Gender
the gender of the subjects: a numeric vector with females coded as 0 and males as 1,Smoke
the smoking status of the subject: a numeric vector with non-smokers coded as 0 and smokers as 1First we transform the height from inches to cm. Then a multiple normal linear regression model is fitted to the data set with log(FEV)
as response and the other variables as covariates. The following R code may be used.
library(GLMsData)
data("lungcap")
lungcap$Htcm=lungcap$Ht*2.54
modelA = lm(log(FEV) ~ Age + Htcm + Gender + Smoke, data=lungcap)
summary(modelA)
##
## Call:
## lm(formula = log(FEV) ~ Age + Htcm + Gender + Smoke, data = lungcap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63278 -0.08657 0.01146 0.09540 0.40701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.943998 0.078639 -24.721 < 2e-16 ***
## Age 0.023387 0.003348 6.984 7.1e-12 ***
## Htcm 0.016849 0.000661 25.489 < 2e-16 ***
## GenderM 0.029319 0.011719 2.502 0.0126 *
## Smoke -0.046067 0.020910 -2.203 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1455 on 649 degrees of freedom
## Multiple R-squared: 0.8106, Adjusted R-squared: 0.8095
## F-statistic: 694.6 on 4 and 649 DF, p-value: < 2.2e-16
We call the model fitted above modelA
.
Q1: Write down the equation for the fitted modelA
.
Q2: Explain (with words and formulas) what the following in the summary
-output means, use the Age
and/or the Smoke
covariate for numerical examples.
Estimate
- in particular interpretation of Intercept
Std.Error
Residual standard error
F-statistic
Q3: What is the proportion of variability explained by the fitted modelA
? Comment.
Q4: Run the code below to produce diagnostic plots of “fitted values vs. standardized residuals” and “QQ-plot of standardized residuals” to assess the model fit. Comment on what you see.
library(ggplot2)
# residuls vs fitted
ggplot(modelA, aes(.fitted, .stdresid)) + geom_point(pch = 21) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
labs(x = "Fitted values", y = "Standardized residuals",
title = "Fitted values vs. standardized residuals",
subtitle = deparse(modelA$call))
# qq-plot of residuals
ggplot(modelA, aes(sample = .stdresid)) +
stat_qq(pch = 19) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
labs(x = "Theoretical quantiles", y = "Standardized residuals",
title = "Normal Q-Q", subtitle = deparse(modelA$call))
# normality test
library(nortest)
ad.test(rstudent(modelA))
Q5: Now fit a model, call this modelB
, with FEV
as response, and the same covariates as for modelA
. Would you prefer to use modelA
or modelB
when the aim is to make inference about FEV
? Explain what you base your conclusion on.
We use modelA
and focus on addressing the association between Age
and the response.
Q6: We would like to test if Age
has an effect on log(FEV)
in modelA
, that is we want to test \(H_0:\beta_{\text{Age}}=0\) against \(H_1:\beta_{\text{Age}}\neq 0\). What type of test can we perform? Perform the test and report the \(p\)-value. At which significance levels will you reject the null hypothesis?
Q7: Construct a 99% confidence interval for \(\beta_{\text{Age}}\) (write out the formula and calculate the interval numerically). Explain what this interval tells you. From this confidence interval, is it possible for you know anything about the value of the \(p\)-value for the test in Q6? Explain.
Consider a 16 year old male. He is 170 cm tall and not smoking.
new = data.frame(Age=16, Htcm=170, Gender="M", Smoke=0)
Q8: What is your best guess for his log(FEV)
? Construct a 95% prediction interval for his forced expiratory volume FEV
. Comment. Hint: first contruct values on the scale of the response log(FEV)
and then transform the upper and lower limits of the prediction interval. Do you find this prediction interval useful? Comment.
[Maximal score: 7 points]
We will look at statistics from the four major tennis tournaments in 2013. For more information on the data set see https://archive.ics.uci.edu/ml/datasets/Tennis+Major+Tournament+Match+Statistics.
Each row in the data set contains information about one match. The variables that end with .1
relate to player 1, while .2
concerns player 2.
In tennis, you have two attempts at the serve. You lose the point if you fail both. The number of these double faults commited is given in the variable DBF
, while the variable ACE
is the number of times your opponent fails to return your serve. Similarily, unforced errors (mistakes that are supposedly not forced by good shots of your opponent) are called UFE
. A skilled player will score many aces while committing few double faults and unforced errors.
Each match involves two players, and there are no draws in tennis. The result
of a match is either that
1
(so, success for player 1) or that0
(so, failure for player 1).For our two players, player 1 and player 2, the quality of player \(k\) (\(k=1,2\)) can be summarized as \(x_k=\mathrm{ACE.}k-\mathrm{UFE.}k-\mathrm{DBF.}k\).
We will try to predict the outcome of a match (success or failure for player 1) just using the quality statistics \(x_1\) from player 1 and \(x_2\) from player 2.
The code below puts the result of each match and the quality score of each player in a data frame. The rows in the data set used for training are called tr
, which means that test indices are -tr
.
In the scatter plot the training observations are depictured as dots (circles) and the test observations as triangles. Matches where player 1 won is in turquoise and where player 2 won in red.
library(class)# for function knn
library(caret)# for confusion matrices
## Loading required package: lattice
## Loading required package: ggplot2
library(ggplot2)# for ggplot
raw = read.csv("https://www.math.ntnu.no/emner/TMA4268/2019v/data/tennis.csv")
M = na.omit(data.frame(y=as.factor(raw$Result),
x1=raw$ACE.1-raw$UFE.1-raw$DBF.1,
x2=raw$ACE.2-raw$UFE.2-raw$DBF.2))
set.seed(4268) # for reproducibility
tr = sample.int(nrow(M),nrow(M)/2)
trte=rep(1,nrow(M))
trte[tr]=0
Mdf=data.frame(M,"istest"=as.factor(trte))
ggplot(data=Mdf,aes(x=x1,y=x2,colour=y,group=istest,shape=istest))+
geom_point()+theme_minimal()
For a positive integer \(K\) let \(\mathcal{N_0}\) be the \(K\) points in the training data nearest to \(x = (x_1, x_2)\).
Q9: Write down the mathmatical formula for the K-nearest neighbours estimator \(\hat{y}(x) \in \{0,1\}\).
Q10: Compute the misclassification error for the training data and the test data using KNN classification for all \(K\) in ks=1:30
. See ?class::knn
for help on the function knn
in the class
R package. Save the error rates of the training and test sets as train.e
and test.e
, each vectors of length 30.
The optimal \(K\) can be chosen by cross-validation. Consider the code below where we divide the training data into 5 folds using createFolds
from the R package caret
. The entry \((j,k)\) of the matrix cv
is the CV error rate for test fold \(j\) where \(K=k\) for the KNN-classifier.
set.seed(0)
ks = 1:30 # Choose K from 1 to 30.
idx = createFolds(M[tr,1], k=5) # Divide the training data into 5 folds.
# "Sapply" is a more efficient for-loop.
# We loop over each fold and each value in "ks"
# and compute error rates for each combination.
# All the error rates are stored in the matrix "cv",
# where folds are rows and values of $K$ are columns.
cv = sapply(ks, function(k){
sapply(seq_along(idx), function(j) {
yhat = class::knn(train=M[tr[ -idx[[j]] ], -1],
cl=M[tr[ -idx[[j]] ], 1],
test=M[tr[ idx[[j]] ], -1], k = k)
mean(M[tr[ idx[[j]] ], 1] != yhat)
})
})
Q11: Compute the average cv.e
and standard error cv.se
of the CV error over all 5 folds. Store the \(K\) corresponding to the smallest CV error as k.min
.
cv.e = # fill in
cv.se = #fill in
k.min = # fill in
Q12: Plot the misclassification errors using the code below. Will the bias in \(\hat{y}(x)\) increase with \(K\)? What about the variance?
library(colorspace)
co = rainbow_hcl(3)
par(mar=c(4,4,1,1)+.1, mgp = c(3, 1, 0))
plot(ks, cv.e, type="o", pch = 16, ylim = c(0, 0.7), col = co[2],
xlab = "Number of neighbors", ylab="Misclassification error")
arrows(ks, cv.e-cv.se, ks, cv.e+cv.se, angle=90, length=.03, code=3, col=co[2])
lines(ks, train.e, type="o", pch = 16, ylim = c(0.5, 0.7), col = co[3])
lines(ks, test.e, type="o", pch = 16, ylim = c(0.5, 0.7), col = co[1])
legend("topright", legend = c("Test", "5-fold CV", "Training"), lty = 1, col=co)
Q13: Run the code below. Instead of simply choosing the \(K\) that results in the smallest CV error rate, we will try a different strategy shown in the first line of the code below. Explain the proposed strategy for choosing \(K\).
k = tail(which(cv.e < cv.e[k.min] + cv.se[k.min]), 1)
size = 100
xnew = apply(M[tr,-1], 2, function(X) seq(min(X), max(X), length.out=size))
grid = expand.grid(xnew[,1], xnew[,2])
grid.yhat = knn(M[tr,-1], M[tr,1], k=k, test=grid)
np = 300
par(mar=rep(2,4), mgp = c(1, 1, 0))
contour(xnew[,1], xnew[,2], z = matrix(grid.yhat, size), levels=.5,
xlab=expression("x"[1]), ylab=expression("x"[2]), axes=FALSE,
main = paste0(k,"-nearest neighbors"), cex=1.2, labels="")
points(grid, pch=".", cex=1, col=grid.yhat)
points(M[1:np,-1], col=factor(M[1:np,1]), pch = 1, lwd = 1.5)
legend("topleft", c("Player 1 wins", "Player 2 wins"),
col=c("red", "black"), pch=1)
box()
Q14:
Now select the optimal choice of \(K\) that you found in Q13. Use the training set to produce the probability for player 1 winning the match for the matches in the test set (see code below). Then produce an ROC based on the test set. Explain how the ROC is produced. Also report the AUC and explain why random guessing would produce an AUC of 0.5.
K=# your choice from Q13
# knn with prob=TRUE outputs the probability of the winning class
# therefore we have to do an extra step to get the probability of player 1 winning
KNNclass=class::knn(train=M[tr,-1], cl=M[tr,1], test=M[-tr,-1], k = K,prob=TRUE)
KNNprobwinning=attributes(KNNclass)$prob
KNNprob= ifelse(KNNclass == "0", 1-KNNprobwinning, KNNprobwinning)
# now KNNprob has probability that player 1 wins, for all matches in the test set
library(pROC)
# now you use predictor=KNNprob and response=M[-tr,1]
# in your call to the function roc in the pROC library
Let us consider a different classifier, namely \(\tilde{y}(x) = \mathrm{argmax}_k(x_k)\), which is giving the win to the player with the highest quality score.
Q15: Add the decision boundary of \(\tilde{y}(x)\) in the plot from Q13. Make confusion matrices and report the estimated misclassification error of \(\hat{y}(x)\) from Q13 and \(\tilde{y}(x)\) using the test data. Which classifier do you prefer? Hint: the R package caret
can easily be used for producing a confusion matrix.
[Maximal score: 4 points]
In this problem we will first mathematically derive analytical formulas for the bias and variance terms for two regression estimators, and then plot curves in R.
Let \({\bf x}\) be a \((p+1)\times 1\) vector of covariates (including a constant 1 as the first term). We look at a regression problem \[ Y=f({\bf x})+\varepsilon, \text{ where } \text{E}(\varepsilon)=0 \text{ and } \text{Var}(\varepsilon)=\sigma^2.\] Assume that we know that the true function is a linear combination of observed \((p+1)\) covariates \(f({\bf x})={\bf x}^T{\boldsymbol \beta}\). This means that the irreducible error is \(\text{Var}(\varepsilon)=\sigma^2\).
We will consider two different estimators for \({\boldsymbol \beta}\), and let the prediction of a new response at some covariate vector \({\bf x}_0\) be \[\hat{f}({\bf x}_0)={\bf x}_0^T\hat{\boldsymbol \beta} \text{ or } \widetilde{f}({\bf x}_0)={\bf x}_0^T\widetilde{\boldsymbol \beta}\] for the two estimators \(\hat{\boldsymbol \beta}\) and \(\widetilde{\boldsymbol \beta}\) to be given below.
The estimators are based on a training set, where \({\bf X}\) is a \(n \times (p-1)\) design matrix and \({\bf Y}\) is a \(n\times 1\) vector of reponses, where we assume that the observed responses are independent of each other. That is, we assume that the expected vector is \(\text{E}({\bf Y})={\bf X}{\boldsymbol \beta}\) and the variance-covariance matrix is \(\text{Cov}({\bf Y})=\sigma^2 {\bf I}\).
The first estimator \(\hat{\boldsymbol \beta}\) we will consider is the least squares estimator \[ \hat{\boldsymbol \beta}=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y} \]
Q16: Find the expected value vector and variance-covariance matrix for \(\hat{\boldsymbol \beta}\).
Q17: Use this to find the expected value and variance for \(\hat{f}({\bf x}_0)={\bf x}_0^T\hat{\boldsymbol \beta}\), that is \(\text{E}(\hat{f}({\bf x}_0)\) and \(\text{Var}(\hat{f}({\bf x}_0) )\).
Q18: Use what you found in Q17 to write out \(\text{E}[(Y_0-\hat{f}({\bf x}_0))^2]\) as a sum of the squared bias, variance and irreducible error, that is
\[\text{E}[(Y_0-\hat{f}({\bf x}_0))^2]=[\text{E}(\hat{f}({\bf x}_0)-f({\bf x}_0)]^2+\text{Var}(\hat{f}({\bf x}_0) ) + \text{Var}(\varepsilon)\]
A competing estimator is given as \[ \widetilde{\boldsymbol \beta}=({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf Y} \] for some choice of a numerical parameter \(\lambda\) (called a regularization parameter). Observe that the ridge regression estimator equals the least squares estimator for \(\lambda=0\).
Q19: Find the expected value vector and variance-covariance matrix for \(\widetilde{\boldsymbol \beta}\).
Q20: Use this to find the expected value and variance for \(\widetilde{f}({\bf x}_0)={\bf x}_0^T\widetilde{\boldsymbol \beta}\), that is \(\text{E}(\widetilde{f}({\bf x}_0)\) and \(\text{Var}(\widetilde{f}({\bf x}_0) )\).
Q21: Use what you found in Q20 to write out \(\text{E}[(Y_0-\widetilde{f}({\bf x}_0))^2]\) as a sum of the squared bias, variance and irreducible error, that is
\[\text{E}[(Y_0-\widetilde{f}({\bf x}_0))^2]=[\text{E}(\widetilde{f}({\bf x}_0)-f({\bf x}_0)]^2+\text{Var}(\widetilde{f}({\bf x}_0) ) + \text{Var}(\varepsilon)\]
We will now plot the three components (squared bias, variance and irreducible error) as a function of \(\lambda\) for one set of values for \({\bf X}\), \({\bf x}_0\), \({\boldsymbol \beta}\) and \(\sigma\).
values=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/BVtradeoffvalues.dd")
X=values$X
dim(X)
x0=values$x0
dim(x0)
beta=values$beta
dim(beta)
sigma=values$sigma
sigma
## [1] 100 81
## [1] 81 1
## [1] 81 1
## [1] 0.5
Observe that all the elements X
, x0
and beta
, are given as matrices, to fit into your formulas. Hint: we perform matrix multiplication using %*%
, transpose of a matrix A
with t(A)
and inverse with solve(A)
.
Q22: Write a function sqbias
with input parameters lambda
, X
, x0
, beta
and return the squared bias (a scalar). Plot the function (use colour red) for values of \(\lambda\) in thislambda=seq(0,2,length=500)
, see code below. Does this curve look like you expected?
sqbias=function(lambda,X,x0,beta)
{
p=dim(X)[2]
value= #HERE YOU FILL IN
return(value)
}
thislambda=seq(0,2,length=500)
sqbiaslambda=rep(NA,length(thislambda))
for (i in 1:length(thislambda)) sqbiaslambda[i]=sqbias(thislambda[i],X,x0,beta)
plot(thislambda,sqbiaslambda,col=2,type="l")
Q23: Write a function variance
with input parameters lambda
, X
, x0
, sigma
which return the variance (a scalar). Plot the function (use colour blue) for values of \(\lambda\) in thislambda=seq(0,2,length=500)
, see code below. Does this curve look like you expected?
variance=function(lambda,X,x0,sigma)
{
p=dim(X)[2]
inv=solve(t(X)%*%X+lambda*diag(p))
value=#HERE YOU FILL IN
return(value)
}
thislambda=seq(0,2,length=500)
variancelambda=rep(NA,length(thislambda))
for (i in 1:length(thislambda)) variancelambda[i]=variance(thislambda[i],X,x0,sigma)
plot(thislambda,variancelambda,col=4,type="l")
Q24: Then plot the squared bias (red), the variance (blue), the irreducible error (orange) and the sum of the three (black) as function of lambda. See code below (then sqbiaslambda
and variancelambda
need to be calculated in Q22 and Q23). What is the optimal value of \(\lambda\) for this problem?
tot=sqbiaslambda+variancelambda+sigma^2
which.min(tot)
thislambda[which.min(tot)]
plot(thislambda,tot,col=1,type="l",ylim=c(0,max(tot)))
lines(thislambda, sqbiaslambda,col=2)
lines(thislambda, variancelambda,col=4)
lines(thislambda,rep(sigma^2,500),col="orange")
abline(v=thislambda[which.min(tot)],col=3)