(Last changes: final version of 04.05.2019)
You need to install the following packages in R to run the code in this file.
install.packages("knitr") #probably already installed
install.packages("rmarkdown") #probably already installed
install.packages("GLMsData") #data set for Problem 1
install.packages("ggplot2") #plotting with ggplot (all code provided)
install.packages("nortest") #test that data comes from normal distribution
install.packages("class")# for function knn
install.packages("colorspace") #for nice colors in provided plotting code
install.packages("caret") #for confusion matrices
install.packages("pROC") #for ROC curves
[Maximal score: 4 points]
The lung capacity data lungcap
(from the GLMsData
R package) gives information on health and on smoking habits of a sample of 654 youths, aged 3 to 19, in the area of East Boston during middle to late 1970s.
We will focus on modelling forced expiratory volume FEV
, a measure of lung capacity. For each person in the data set we have measurements of the following 5 variables:
FEV
the forced expiratory volume in litres, a measure of lung capacity; a numeric vector,Age
the age of the subject in completed years; a numeric vector,Ht
the height in inches; a numeric vector,Gender
the gender of the subjects: a numeric vector with females coded as 0 and males as 1,Smoke
the smoking status of the subject: a numeric vector with non-smokers coded as 0 and smokers as 1First we transform the height from inches to cm. Then a multiple normal linear regression model is fitted to the data set with log(FEV)
as response and the other variables as covariates. The following R code may be used.
library(GLMsData)
data("lungcap")
lungcap$Htcm=lungcap$Ht*2.54
modelA = lm(log(FEV) ~ Age + Htcm + Gender + Smoke, data=lungcap)
summary(modelA)
##
## Call:
## lm(formula = log(FEV) ~ Age + Htcm + Gender + Smoke, data = lungcap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63278 -0.08657 0.01146 0.09540 0.40701
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.943998 0.078639 -24.721 < 2e-16 ***
## Age 0.023387 0.003348 6.984 7.1e-12 ***
## Htcm 0.016849 0.000661 25.489 < 2e-16 ***
## GenderM 0.029319 0.011719 2.502 0.0126 *
## Smoke -0.046067 0.020910 -2.203 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1455 on 649 degrees of freedom
## Multiple R-squared: 0.8106, Adjusted R-squared: 0.8095
## F-statistic: 694.6 on 4 and 649 DF, p-value: < 2.2e-16
We call the model fitted above modelA
.
Write down the equation for the fitted modelA
.
Model A: \[\log{(\text{FEV})}=\beta_0 + \beta_1 \text{Age} + \beta_2 \text{Htcm} + \beta_3 \text{Gender} + \beta_4 \text{Smoke} + \epsilon\] with the fitted version \(\hat{\log(\text{FEV})}=\) -1.9439982 + 0.0233872 Age + 0.0168487 Htcm + 0.0293194 Gender + -0.0460675 Smoke
Explain (with words and formulas) what the following in the summary
-output means, use the Age
and/or the Smoke
covariate for numerical examples.
Estimate
- in particular interpretation of Intercept
Std.Error
Residual standard error
F-statistic
Estimate
column give the estimated regression coefficients, and are given by \(\hat{\beta}=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}\). The interpretation of \(\hat{\beta}_j\) is that when all other covariates are kept constant and the covariate \(x_j\) is increased to from \(x_j\) to \(x_j+1\) then on average the response increases by \(\hat{\beta}_j\). For example, an increase in height of one cm is associated with an increase in the mean \(\log(\text{FEV})\) by \(0.016849\), keeping all other variables constant. The quantitative variable Age can be interpreted in the same way. Parameter estimates for qualitative covariates indicate how much the value of explanatory variable changes compared to the reference level. For example the value of \(\log(\text{FEV})\) will change by a factor of \(-0.046067\) for smokers \((\text{Smoke=1})\), compared to non-smokers (\(\text{Smoke=0}\)). Std.Error
\(\hat{SD}(\hat{\beta_j})\) of the estimated coefficients is given by the square root of the diagonal entries of \(({\bf X}^T{\bf X})^{-1}\hat{\sigma^2}\), where \(\hat{\sigma}=\text{RSS}/(n-p-1)\). Here \(n=654\) and \(p=4\).F-statistic
is used test the hypothesis that all regression coefficients are zero,
\[\begin{align*}
H_0: & \beta_1 = \beta_2 = \cdots = \beta_p = 0 \quad \text{vs} \\
H_1: &\text{at least one $\beta$ is $\neq 0$} \\
\end{align*}\]
and is computed by
\[\begin{equation*}
F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}
\end{equation*}\]
where \(TSS= \sum_{i=1}^n(y_i-\bar y)^2\), \(RSS = \sum_{i=1}^n(y_i-\hat y_i)^2\), \(n\) is the number of observations and \(p\) is the number of covariates (and \(p+1\) the number of estimated regression parameters). If the \(p\)-value is less than 0.05, we reject the hypothesis that there are no coefficients with effect on the outcome in the model.What is the proportion of variability explained by the fitted modelA
? Comment.
Run the code below to produce diagnostic plots of “fitted values vs. standardized residuals” and “QQ-plot of standardized residuals” to assess the model fit. Comment on what you see.
library(ggplot2)
# residuls vs fitted
ggplot(modelA, aes(.fitted, .stdresid)) + geom_point(pch = 21) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
labs(x = "Fitted values", y = "Standardized esiduals",
title = "Fitted values vs. standardized residuals",
subtitle = deparse(modelA$call))
# qq-plot of residuals
ggplot(modelA, aes(sample = .stdresid)) +
stat_qq(pch = 19) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
labs(x = "Theoretical quantiles", y = "Standardized residuals",
title = "Normal Q-Q", subtitle = deparse(modelA$call))
# normality test
library(nortest)
ad.test(rstudent(modelA))
##
## Anderson-Darling normality test
##
## data: rstudent(modelA)
## A = 1.9256, p-value = 6.486e-05
Now fit a model, call this modelB
, with FEV
as response, and the same covariates as for modelA
. Would you prefer to use modelA
or modelB
when the aim is to make inference about FEV
? Explain what you base your conclusion on.
modelB = lm(FEV ~ Age + Htcm + Gender + Smoke, data=lungcap)
# residuls vs fitted
ggplot(modelB, aes(.fitted, .stdresid)) + geom_point(pch = 21) +
geom_hline(yintercept = 0, linetype = "dashed") +
geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
labs(x = "Fitted values", y = "Standardized residuals", title = "Fitted values vs. standardized residuals", subtitle = deparse(modelB$call))
# qq-plot of residuals
ggplot(modelB, aes(sample = .stdresid)) +
stat_qq(pch = 19) +
geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
labs(x = "Theoretical quantiles", y = "Standardized residuals", title = "Normal Q-Q", subtitle = deparse(modelB$call))
# normality test
library(nortest)
ad.test(rstudent(modelB))
##
## Anderson-Darling normality test
##
## data: rstudent(modelB)
## A = 1.2037, p-value = 0.003853
Not constant spread. Normality is better that modelA
, but the hypothesis of normality for \(a = 0.05\). We prefer modelA
beacause the residuals are linear in comparison with modelB
, and in both cases the hypothesis of normality is rejected.
We use modelA
and focus on addressing the association between Age
and the response.
We would like to test if Age
has an effect on log(FEV)
in modelA
, that is we want to test \(H_0:\beta_{\text{Age}}=0\) against \(H_1:\beta_{\text{Age}}\neq 0\). What type of test can we perform? Perform the test and report the \(p\)-value. At which significance levels will you reject the null hypothesis?
We use \(t\)-test based on \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p-1},\] where \(c_{jj}\) is diagonal element corresponding to \(\hat{\beta}_j\) of \(({\bf X}^T{\bf X})^{-1}\). In that case we set \(\beta_j = 0.\) P-value is give from the model output as Pr(>|t|)
and is equal to \(7.1\times 10^{-6}\). In order to reject the null hypotheses we should choose a significance level \(\ge 7.1\times 10^{-6}\), which is really a low level. This means that we really want to reject \(H_0\).
Construct a 99% confidence interval for \(\beta_{\text{Age}}\) (write out the formula and calculate the interval numerically). Explain what this interval tells you. From this confidence interval, is it possible for you know anything about the value of the \(p\)-value for the test in Q6? Explain.
\[T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p-1}\] \[ P(\hat{\beta}_j-t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma} \le \beta_j \le \hat{\beta}_j+t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma})=1-\alpha\] A \((1-\alpha)\)% CI for \(\beta_j\) is when we insert numerical values for the upper and lower limits: \([\hat{\beta}_j-t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma},\hat{\beta}_j+t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma}]\). In this case \(\alpha = 0.01\) and \(j = 2.\) This means that before we have collected the data this interval has a 99% chance of covering the true value of \(\beta_{Age}\). After the interval is made - now this is [0.01473670, 0.03203773] the the true value is either within the interval or not. But, collecting new data and making 99% CIs, then 99% of these will on average cover the true \(\beta_{Age}\).
n = dim(lungcap)[1]
p = dim(lungcap)[2]-1
betahat=modelA$coefficients[2]
sdbetahat=summary(modelA)$coeff[2,2]
UCI = betahat + qt(0.005, df = n-p-1, lower.tail = F)*sdbetahat
LCI = betahat - qt(0.005, df = n-p-1, lower.tail = F)*sdbetahat
c(LCI, UCI)
## Age Age
## 0.01473670 0.03203773
Since the interval does not cover 0, we know that the p-value is less than 0.01.
Consider a 16 year old male. He is 170 cm tall and not smoking.
new = data.frame(Age=16, Htcm=170, Gender="M", Smoke=0)
What is your best guess for his log(FEV)
? Construct a 95% prediction interval for his forced expiratory volume FEV
. Comment. Hint: first contruct values on the scale of the response log(FEV)
and then transform the upper and lower limits of the prediction interval. Do you find this prediction interval useful? Comment.
new = data.frame(Age=16, Htcm=170, Gender="M", Smoke=0)
pred = predict(modelA, newdata = new)
pred #Best guess for log(FEV)
# the inverse of log (natural) is exp
fev = exp(pred)
fev
logf.ci = predict(modelA, newdata = new, level = 0.95, interval = "prediction")
fev.ci = exp(logf.ci)
fev.ci
## 1
## 1.323802
## 1
## 3.75768
## fit lwr upr
## 1 3.75768 2.818373 5.010038
We see that the interval is rather wide - so it gives us limited information. https://en.wikipedia.org/wiki/Spirometry#/media/File:Normal_values_for_FVC,_FEV1_and_FEF_25-75.png
[Maximal score: 7 points]
We will look at statistics from the four major tennis tournaments in 2013. For more information on the data set see https://archive.ics.uci.edu/ml/datasets/Tennis+Major+Tournament+Match+Statistics.
Each row in the data set contains information about one match. The variables that end with .1
relate to player 1, while .2
concerns player 2.
In tennis, you have two attempts at the serve. You lose the point if you fail both. The number of these double faults commited is given in the variable DBF
, while the variable ACE
is the number of times your opponent fails to return your serve. Similarily, unforced errors (mistakes that are supposedly not forced by good shots of your opponent) are called UFE
. A skilled player will score many aces while committing few double faults and unforced errors.
Each match involves two players, and there are no draws in tennis. The result
of a match is either that
1
(so, success for player 1) or that0
(so, failure for player 1).For our two players, player 1 and player 2, the quality of player \(k\) (\(k=1,2\)) can be summarized as \(x_k=\mathrm{ACE.}k-\mathrm{UFE.}k-\mathrm{DBF.}k\).
We will try to predict the outcome of a match (success or failure for player 1) just using the quality statistics \(x_1\) from player 1 and \(x_2\) from player 2.
The code below puts the result of each match and the quality score of each player in a data frame. The rows in the data set used for training are called tr
, which means that test indices are -tr
.
In the scatter plot the training observations are depictured as dots (circles) and the test observations as triangles. Matches where player 1 won is in turquoise and where player 2 won in red.
library(class)# for function knn
library(caret)# for confusion matrices
library(ggplot2)# for ggplot
raw = read.csv("https://www.math.ntnu.no/emner/TMA4268/2019v/data/tennis.csv")
M = na.omit(data.frame(y=as.factor(raw$Result),
x1=raw$ACE.1-raw$UFE.1-raw$DBF.1,
x2=raw$ACE.2-raw$UFE.2-raw$DBF.2))
set.seed(4268) # for reproducibility
tr = sample.int(nrow(M),nrow(M)/2)
trte=rep(1,nrow(M))
trte[tr]=0
Mdf=data.frame(M,"istest"=as.factor(trte))
ggplot(data=Mdf,aes(x=x1,y=x2,colour=y,group=istest,shape=istest))+
geom_point()+theme_minimal()
For a positive integer \(K\) let \(\mathcal{N_0}\) be the \(K\) points in the training data nearest to \(x = (x_1, x_2)\).
Write down the mathmatical formula for the K-nearest neighbours estimator \(\hat{y}(x) \in \{0,1\}\).
\(\hat{y}(x)=\mathrm{argmax}_k \sum_{i \in \mathcal{N_0}}I(y_i = k)\).
Compute the misclassification error for the training data and the test data using KNN classification for all \(K\) in ks=1:30
. See ?class::knn
for help on the function knn
in the class
R package. Save the error rates of the training and test sets as train.e
and test.e
, each vectors of length 30.
The misclassification errors are computed below.
ks = 1:30
yhat = sapply(ks, function(k){
class::knn(train=M[tr,-1], cl=M[tr,1], test=M[,-1], k = k) #test both train and test
})
train.e = colMeans(M[tr,1]!=yhat[tr,])
test.e = colMeans(M[-tr,1]!=yhat[-tr,])
The optimal \(K\) can be chosen by cross-validation. Consider the code below where we divide the training data into 5 folds using createFolds
from the R package caret
. The entry \((j,k)\) of the matrix cv
is the CV error rate for test fold \(j\) where \(K=k\) for the KNN-classifier.
set.seed(0)
ks = 1:30 # Choose K from 1 to 30.
idx = createFolds(M[tr,1], k=5) # Divide the training data into 5 folds.
# "Sapply" is a more efficient for-loop.
# We loop over each fold and each value in "ks"
# and compute error rates for each combination.
# All the error rates are stored in the matrix "cv",
# where folds are rows and values of $K$ are columns.
cv = sapply(ks, function(k){
sapply(seq_along(idx), function(j) {
yhat = class::knn(train=M[tr[ -idx[[j]] ], -1],
cl=M[tr[ -idx[[j]] ], 1],
test=M[tr[ idx[[j]] ], -1], k = k)
mean(M[tr[ idx[[j]] ], 1] != yhat)
})
})
Compute the average cv.e
and standard error cv.se
of the average CV error over all 5 folds (that is, standard deviation over the 5 folds divided by sqrt of 5). Store the \(K\) corresponding to the smallest CV error as k.min
.
The sizes of the folds are approximately equal.
cv.e = colMeans(cv)
cv.se = apply(cv, 2, sd)/sqrt(5)
k.min = which.min(cv.e)
print(k.min)
## [1] 22
Plot the misclassification errors using the code below. Will the bias in \(\hat{y}(x)\) increase with \(K\)? What about the variance?
library(colorspace)
co = rainbow_hcl(3)
par(mar=c(4,4,1,1)+.1, mgp = c(3, 1, 0))
plot(ks, cv.e, type="o", pch = 16, ylim = c(0, 0.7), col = co[2],
xlab = "Number of neighbors", ylab="Misclassification error")
arrows(ks, cv.e-cv.se, ks, cv.e+cv.se, angle=90, length=.03, code=3, col=co[2])
lines(ks, train.e, type="o", pch = 16, ylim = c(0.5, 0.7), col = co[3])
lines(ks, test.e, type="o", pch = 16, ylim = c(0.5, 0.7), col = co[1])
legend("topright", legend = c("Test", "5-fold CV", "Training"), lty = 1, col=co)
The bias in \(\hat{y}(x)\) will increase with \(K\), but the variance will decrease.
Run the code below. Instead of simply choosing the \(K\) that results in the smallest CV error rate, we will try a different strategy shown in the first line of the code below. Explain the proposed strategy for choosing \(K\).
k = tail(which(cv.e < cv.e[k.min] + cv.se[k.min]), 1)
print(paste("Value of k",k))
size = 100
xnew = apply(M[tr,-1], 2, function(X) seq(min(X), max(X), length.out=size))
grid = expand.grid(xnew[,1], xnew[,2])
grid.yhat = knn(M[tr,-1], M[tr,1], k=k, test=grid)
np = 300
par(mar=rep(2,4), mgp = c(1, 1, 0))
contour(xnew[,1], xnew[,2], z = matrix(grid.yhat, size), levels=.5,
xlab=expression("x"[1]), ylab=expression("x"[2]), axes=FALSE,
main = paste0(k,"-nearest neighbors"), cex=1.2, labels="")
points(grid, pch=".", cex=1, col=grid.yhat)
points(M[1:np,-1], col=factor(M[1:np,1]), pch = 1, lwd = 1.5)
legend("topleft", c("Player 1 wins", "Player 2 wins"),
col=c("red", "black"), pch=1)
box()
abline(0,1, col="blue")
## [1] "Value of k 30"
The preferred number of neighbors is the largest \(K\) (in ks
) such that the misclassification error is still within one standard error of the minimum. We choose the largest because we want the less flexible model.
Let us consider a different classifier, namely \(\tilde{y}(x) = \mathrm{argmax}_k(x_k)\), which is giving the win to the player with the highest quality score.
Now select the optimal choice of \(K\) that you found in Q13. Use the training set to produce the probability for player 1 winning the match for the matches in the test set (see code below). Then produce an ROC based on the test set. Explain how the ROC is produced. Also report the AUC and explain why random guessing would produce an AUC of 0.5.
K=30# your choice from Q13
# knn with prob=TRUE outputs the probability of the winning class
# therefore we have to do an extra step to get the probability of player 1 winning
KNNclass=class::knn(train=M[tr,-1], cl=M[tr,1], test=M[-tr,-1], k = K,prob=TRUE)
KNNprobwinning=attributes(KNNclass)$prob
KNNprob= ifelse(KNNclass == "0", 1-KNNprobwinning, KNNprobwinning)
# now KNNprob has probability that player 1 wins, for all matches in the test set
library(pROC)
# now you use predictor=KNNprob and response=M[-tr,1]
# in your call to the function roc in the pROC library
KNNroc=roc(response=M[-tr,1],predictor=KNNprob)
ggroc(KNNroc)+ggtitle("ROC curve")+
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1, color="Random guess"),linetype="dashed")+ theme(legend.title=element_blank())
KNNroc$auc
## Area under the curve: 0.8178
The area under the straight line is 0.5. The ROC curve for random guessing is a straight line in expectation because the expected sensitivity equals one minus the expected spesificity. The expected values are
\[\begin{align}
\mathrm{E}(\text{Sensitivity}) &= \mathrm{E}(\mathrm{TP}/P) = \mathrm{Pr}(\hat y = 1|y=1) = \mathrm{Pr}(\hat y = 1), \\
\mathrm{E}(1-\text{Specificity}) &= \mathrm{E}(\mathrm{FP}/N) = \mathrm{Pr}(\hat y = 1|y=0) = \mathrm{Pr}(\hat y = 1).
\end{align}\]
Add the decision boundary of \(\tilde{y}(x)\) in the plot from Q13. Make confusion matrices and report the estimated misclassification error of \(\hat{y}(x)\) from Q13 and \(\tilde{y}(x)\) using the test data. Which classifier do you prefer? Hint: the R package caret
can easily be used for producing a confusion matrix.
See the last line in the code of Q13.
See the code below. We prefer \(\tilde{y}(x)\) as it has a lower estimated misclassification error and is simpler.
table(M$y[-tr],yhat[-tr,k])
mean(M$y[-tr]!=yhat[-tr,k])
ytilde = (M$x1 > M$x2) + 0
table(M$y[-tr], ytilde[-tr])
mean(M$y[-tr]!= ytilde[-tr])
##
## 0 1
## 0 138 56
## 1 44 156
## [1] 0.2538071
##
## 0 1
## 0 149 45
## 1 47 153
## [1] 0.2335025
[Maximal score: 4 points]
In this problem we will first mathematically derive analytical formulas for the bias and variance terms for two regression estimators, and then plot curves in R.
Let \({\bf x}\) be a \((p+1)\times 1\) vector of covariates (including a constant 1 as the first term). We look at a regression problem \[ Y=f({\bf x})+\varepsilon, \text{ where }\quad \text{E}(\varepsilon)=0\;\text{ and }\quad \text{Var}(\varepsilon)=\sigma^2.\] Assume that we know that the true function is a linear combination of observed \((p+1)\) covariates \(f({\bf x})={\bf x}^T{\boldsymbol \beta}\). This means that the irreducible error is \(\text{Var}(\varepsilon)=\sigma^2\).
We will consider two different estimators for \({\boldsymbol \beta}\), and let the prediction of a new response at some covariate vector \({\bf x}_0\) be \[\hat{f}({\bf x}_0)={\bf x}_0^T\hat{\boldsymbol \beta}\quad \text{ or }\quad \widetilde{f}({\bf x}_0)={\bf x}_0^T\widetilde{\boldsymbol \beta}\] for the two estimators \(\hat{\boldsymbol \beta}\) and \(\widetilde{\boldsymbol \beta}\) to be given below.
The estimators are based on a training set, where \({\bf X}\) is a \(n \times (p-1)\) design matrix and \({\bf Y}\) is a \(n\times 1\) vector of reponses, where we assume that the observed responses are independent of each other. That is, we assume that the expected vector is \(\text{E}({\bf Y})={\bf X}{\boldsymbol \beta}\) and the variance-covariance matrix is \(\text{Cov}({\bf Y})=\sigma^2 {\bf I}\).
The first estimator \(\hat{\boldsymbol \beta}\) we will consider is the least squares estimator \[ \hat{\boldsymbol \beta}=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y} \]
Find the expected value vector and variance-covariance matrix for \(\hat{\boldsymbol \beta}\).
\[ \text{E}(\hat{\boldsymbol \beta})=({\bf X}^T{\bf X})^{-1}{\bf X}^T\text{E}({\bf Y})=({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf X}\boldsymbol{\beta}=\boldsymbol{\beta}\]
\[ \text{Cov}(\hat{\boldsymbol \beta}) = ({\bf X}^T{\bf X})^{-1}{\bf X}^T \text{Cov}({\bf Y})(({\bf X}^T{\bf X})^{-1}{\bf X}^T)^T =({\bf X}^T{\bf X})^{-1} {\bf X}^T \sigma^2 {\bf I} {\bf X}({\bf X}{\bf X})^{-1}= \sigma^2({\bf X}^T{\bf X})^{-1} \]
Use this to find the expected value and variance for \(\hat{f}({\bf x}_0)={\bf x}_0^T\hat{\boldsymbol \beta}\), that is \(\text{E}(\hat{f}({\bf x}_0)\) and \(\text{Var}(\hat{f}({\bf x}_0) )\).
\[ \text{E}({\bf x}_0^T\hat{\boldsymbol \beta})={\bf x}_0^T\text{E}(\hat{\boldsymbol \beta})={\bf x}_0^T{\boldsymbol \beta} \] \[ \text{Var}({\bf x}_0^T\hat{\boldsymbol \beta})={\bf x}_0^T \text{Cov}(\hat{\boldsymbol \beta}) {\bf x}_0= {\bf x}_0^T \sigma^2({\bf X}^T{\bf X})^{-1} {\bf x}_0\]
Use what you found in Q17 to write out \(\text{E}[(Y_0-\hat{f}({\bf x}_0))^2]\) as a sum of the squared bias, variance and irreducible error, that is
\[\text{E}[(Y_0-\hat{f}({\bf x}_0))^2]=[\text{E}(\hat{f}({\bf x}_0)-f({\bf x}_0)]^2+\text{Var}(\hat{f}({\bf x}_0) ) + \text{Var}(\varepsilon)\]
It is given in the text that \(f({\bf x}_0)={\bf x}_0^T{\boldsymbol \beta}\) so the bias is zero. From the text it is also given that \(\text{Var}(\varepsilon)=\sigma^2\). This gives:
\[\text{E}[(Y_0-\hat{f}({\bf x}_0))^2]= ({\bf x}_0^T{\boldsymbol \beta}-{\bf x}_0^T{\boldsymbol \beta})^2+{\bf x}_0^T \sigma^2({\bf X}^T{\bf X})^{-1} {\bf x}_0+\sigma^2=0^2+{\bf x}_0^T \sigma^2({\bf X}^T{\bf X})^{-1} {\bf x}_0+\sigma^2=(1+{\bf x}_0^T ({\bf X}^T{\bf X})^{-1} {\bf x}_0)\sigma^2\]
A competing estimator is given as \[ \widetilde{\boldsymbol \beta}=({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf Y} \] for some choice of a numerical parameter \(\lambda\) (called a regularization parameter). Observe that the ridge regression estimator equals the least squares estimator for \(\lambda=0\).
Find the expected value vector and variance-covariance matrix for \(\widetilde{\boldsymbol \beta}\).
\[ \text{E}(\widetilde{\boldsymbol \beta})=({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T\text{E}({\bf Y})=({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf X}\boldsymbol{\beta} \] \[ \text{Cov}(\widetilde{\boldsymbol \beta}) =({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T \text{Cov}({\bf Y})(({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T)^T\]
\[ =({\bf X}^T{\bf X}+\lambda {\bf I})^{-1} {\bf X}^T \sigma^2 {\bf I} {\bf X}({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}= \sigma^2({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf X}({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}\]
Use this to find the expected value and variance for \(\widetilde{f}({\bf x}_0)={\bf x}_0^T\widetilde{\boldsymbol \beta}\), that is \(\text{E}(\widetilde{f}({\bf x}_0)\) and \(\text{Var}(\widetilde{f}({\bf x}_0) )\).
\[\text{E}({\bf x}_0^T\widetilde{\boldsymbol \beta})={\bf x}_0^T\text{E}(\widetilde{\boldsymbol \beta})={\bf x}_0^T({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf X}\boldsymbol{\beta}\] \[ \text{Var}({\bf x}_0^T\widetilde{\boldsymbol \beta})={\bf x}_0^T \text{Cov}(\widetilde{\boldsymbol \beta}) {\bf x}_0= {\bf x}_0^T \sigma^2({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf X}({\bf X}^T{\bf X}+\lambda {\bf I})^{-1} {\bf x}_0\]
Use what you found in Q20 to write out \(\text{E}[(Y_0-\widetilde{f}({\bf x}_0))^2]\) as a sum of the squared bias, variance and irreducible error, that is
\[\text{E}[(Y_0-\widetilde{f}({\bf x}_0))^2]=[\text{E}(\widetilde{f}({\bf x}_0)-f({\bf x}_0)]^2+\text{Var}(\widetilde{f}({\bf x}_0) ) + \text{Var}(\varepsilon)\]
We just insert what we found in Q20:
\[ \text{E}[(Y_0-\widetilde{f}({\bf x}_0))^2]=({\bf x}_0^T({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf X}\boldsymbol{\beta}-{\bf x}_0^T{\boldsymbol \beta})^2+{\bf x}_0^T \sigma^2({\bf X}^T{\bf X}+\lambda {\bf I})^{-1}{\bf X}^T{\bf X}({\bf X}^T{\bf X}+\lambda {\bf I})^{-1} {\bf x}_0+\sigma^2 \]
We will now plot the three components (squared bias, variance and irreducible error) as a function of \(\lambda\) for one set of values for \({\bf X}\), \({\bf x}_0\), \({\boldsymbol \beta}\) and \(\sigma\).
values=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/BVtradeoffvalues.dd")
X=values$X
dim(X)
x0=values$x0
dim(x0)
beta=values$beta
dim(beta)
sigma=values$sigma
sigma
## [1] 100 81
## [1] 81 1
## [1] 81 1
## [1] 0.5
Observe that all the elements X
, x0
and beta
, are given as matrices, to fit into your formulas. Hint: we perform matrix multiplication using %*%
, transpose of a matrix A
with t(A)
and inverse with solve(A)
.
Write a function sqbias
with input parameters lambda
, X
, x0
, beta
and return the squared bias (a scalar). Plot the function (use colour red) for values of \(\lambda\) in thislambda=seq(0,2,length=500)
, see code below. Does this curve look like you expected?
sqbias=function(lambda,X,x0,beta)
{
p=dim(X)[2]
value=(t(x0)%*%solve(t(X)%*%X+lambda*diag(p))%*%t(X)%*%X%*%beta-t(x0)%*%beta)^2
return(value)
}
thislambda=seq(0,2,length=500)
sqbiaslambda=rep(NA,length(thislambda))
for (i in 1:length(thislambda)) sqbiaslambda[i]=sqbias(thislambda[i],X,x0,beta)
plot(thislambda,sqbiaslambda,col=2,type="l")
See the R code above. I would expect that the bias increased with \(\lambda\), so yes. But maybe not expect the bump in the start.
Write a function variance
with input parameters lambda
, X
, x0
, sigma
which return the variance (a scalar). Plot the function (use colour blue) for values of \(\lambda\) in thislambda=seq(0,2,length=500)
, see code below. Does this curve look like you expected?
variance=function(lambda,X,x0,sigma)
{
p=dim(X)[2]
inv=solve(t(X)%*%X+lambda*diag(p))
value=sigma^2*(t(x0)%*%inv%*%t(X)%*%X%*%inv%*%x0)
return(value)
}
thislambda=seq(0,2,length=500)
variancelambda=rep(NA,length(thislambda))
for (i in 1:length(thislambda)) variancelambda[i]=variance(thislambda[i],X,x0,sigma)
plot(thislambda,variancelambda,col=4,type="l")
See the R code above. Yes, would expect that the variance decreased with increasing \(\lambda\) since the the data would get less to say.
Then plot the squared bias (red), the variance (blue), the irreducible error (orange) and the sum of the three (black) as function of lambda. See code below (then sqbiaslambda
and variancelambda
need to be calculated in Q22 and Q23). What is the optimal value of \(\lambda\) for this problem?
tot=sqbiaslambda+variancelambda+sigma^2
which.min(tot)
thislambda[which.min(tot)]
plot(thislambda,tot,col=1,type="l",ylim=c(0,max(tot)))
lines(thislambda, sqbiaslambda,col=2)
lines(thislambda, variancelambda,col=4)
lines(thislambda,rep(sigma^2,500),col="orange")
abline(v=thislambda[which.min(tot)],col=3)
## [1] 249
## [1] 0.993988
See R code above, and the printed optimal value is around 1 for \(\lambda\).