Solutions Recommended Exercises

TMA4268 Statistical Learning V2018. Module 3: LINEAR REGRESSION

Thea Roksvåg, Ingeborg Hem and Mette Langaas, Department of Mathematical Sciences, NTNU

26.01.2018


Problem 1

a)

\[\begin{align} E(\hat{\boldsymbol\beta})&=E((X^T X)^{-1}X^T Y)=(X^TX)^{-1}X^T E(Y) =(X^TX)^{-1}X^T E(X \boldsymbol\beta +\epsilon) \\ &=(X^TX)^{-1}X^T (X \boldsymbol\beta +0)=(X^TX)^{-1}(X^T X) \boldsymbol\beta = I \boldsymbol\beta = \boldsymbol\beta \end{align}\] \[\begin{align} Cov(\hat{\boldsymbol\beta})&=Cov((X^T X)^{-1}X^T Y)=(X^TX)^{-1}X^T Cov(Y)((X^TX)^{-1}X^T)^T \\ &=(X^TX)^{-1}X^T \sigma^2 I \, X(X^TX)^{-1}=\sigma^2 (X^TX)^{-1}X^T X(X^TX)^{-1}\\ &=\sigma^2 (X^TX)^{-1} \\ \end{align}\]

We need to assume that \(Y\) is multivariate normal. As \(\hat{\boldsymbol\beta}\) is a linear transformation of a multivariate normal vector \(Y\), \(\hat{\boldsymbol\beta}\) is also multivariate normal.

The components of a multivariate normal vector are univariate normal. This means that \(\hat{\beta}_j\) is normally distributed with expected value \(\beta_j\) and variance given by the \(j\)’th diagonal element of \(\sigma^2 (X^T X)^{-1}\).
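
A small sketch (simulated data, not part of the original solution) checking this numerically: vcov() of a fitted lm equals \(\hat{\sigma}^2(X^TX)^{-1}\).

set.seed(1)
n <- 50
x <- runif(n)
X <- cbind(1, x)                       # design matrix with intercept
y <- 2 + 3*x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)
sigma2hat <- summary(fit)$sigma^2      # estimate of sigma^2
max(abs(vcov(fit) - sigma2hat*solve(t(X)%*%X)))  # should be (numerically) zero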

b)

Fix the covariates \(X\). *Collect \(Y\), create a CI using \(\hat{\beta}\) and \(\hat{\sigma}\)*, repeat from * to * many times. In 95 % of the repetitions the CI will contain the true \(\beta\). "Collect \(Y\)" means simulating it with the true \(\beta\) as parameter(s). See also the R code below (Problem 3a).

c)

Same idea. Fix the covariates \(X\) and \(x_0\). *Collect \(Y\), create a PI using \(\hat{\beta}\) and \(\hat{\sigma}\), simulate \(Y_0\)*, repeat from * to * many times. In 95 % of the repetitions the PI will contain \(Y_0\). "Collect \(Y\) and \(Y_0\)" means simulating them with the true \(\beta\) as parameter(s). \(Y_0\) should not be used to estimate \(\beta\) or \(\sigma\). See also the R code below (Problem 3b).

d)

95 % CI for \(\mathbf{x}_0^T\boldsymbol\beta\): Same idea as for \(\beta_j\). Use that \(\mathbf{x}_0^T\hat{\boldsymbol\beta} \sim N(\mathbf{x}_0^T\boldsymbol\beta, \mathbf{x}_0^T\text{Var}(\hat{\boldsymbol\beta})\mathbf{x}_0)\) and proceed as for \(\beta_j\). Note that \(\mathbf{x}_0\) is a vector. The connection between the CI for \(\boldsymbol\beta\), the CI for \(\mathbf{x}_0^T\boldsymbol\beta\) and the PI for \(Y\) at \(\mathbf{x}_0\): the first is a CI for a parameter, the second is a CI for the expected regression line at the point \(\mathbf{x}_0\) (with only one covariate this may be more intuitive), and the last is a PI for the response \(Y_0\). The difference between the latter two is that \(Y_0\) is an observation, while \(\mathbf{x}_0^T\boldsymbol\beta\) is the expected value of the observations and hence a function of the model parameters (NOT an observation).
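
A small sketch (simulated data) showing the two intervals in R: predict() with interval = "confidence" gives the CI for \(\mathbf{x}_0^T\boldsymbol\beta\), and interval = "prediction" gives the (wider) PI for \(Y_0\).

set.seed(1)
x <- runif(100, 2, 5)
y <- 1 + 2*x + rnorm(100)
fit <- lm(y ~ x)
x0 <- data.frame(x = 3)
predict(fit, newdata = x0, interval = "confidence")  # CI for the regression line at x0
predict(fit, newdata = x0, interval = "prediction")  # PI for a new response Y0 (wider)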

e)

We have a model of the form \(Y=X \boldsymbol\beta + \epsilon\), where \(\epsilon\) is the error term. The errors are unknown and unobserved, but we can estimate them by what we call the residuals. The residuals are given by the difference between the observed response and the fitted value \[\hat{\epsilon}=Y-\hat{Y}=(I-X(X^TX)^{-1}X^T)Y.\]

Properties of raw residuals: Normally distributed with mean 0 and covariance \(Cov(\hat{\epsilon})=\sigma^2 (I-X(X^TX)^{-1}X^T).\) This means that the residuals may have different variance (depending on \(X\)) and may also be correlated.
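
A small sketch (simulated data) illustrating this: the hat matrix \(H=X(X^TX)^{-1}X^T\) reproduces the residuals from lm(), and the diagonal of \(I-H\) shows that the raw residuals have unequal variances.

set.seed(1)
n <- 30
x <- runif(n, 0, 10)
X <- cbind(1, x)
y <- 1 + 0.5*x + rnorm(n)
H <- X %*% solve(t(X)%*%X) %*% t(X)    # hat matrix
ehat <- (diag(n) - H) %*% y            # raw residuals (I-H)Y
max(abs(ehat - residuals(lm(y ~ x))))  # agrees with residuals() from lm
range(1 - diag(H))                     # Var(ehat_i) = sigma^2(1-h_ii) varies with i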

In a model check, we want to check that our errors are independent, homoscedastic (same variance for each observation) and not dependent on the covariates. As we don’t know the true errors, we use the residuals in their place, but, as mentioned, the residuals may have different variances and may be correlated. This is why we don’t want to use the raw residuals for model checking.

To amend this, we try to fix the residuals so that they at least have equal variances. We do that by working with standardized or studentized residuals.
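
A small sketch (simulated data) of what the standardization does: dividing each raw residual by \(\hat{\sigma}\sqrt{1-h_{ii}}\) reproduces rstandard().

set.seed(1)
x <- runif(30)
y <- 1 + 2*x + rnorm(30)
fit <- lm(y ~ x)
h <- hatvalues(fit)                    # leverages h_ii
r_manual <- residuals(fit)/(summary(fit)$sigma*sqrt(1 - h))
max(abs(r_manual - rstandard(fit)))    # should be (numerically) zero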

f)

RSS(small) \(\geq\) RSS(large), since RSS can only decrease (or stay the same) when more covariates are added to explain the variation (for a covariate that is completely unrelated to the response the change might be small, but RSS will not increase). \(R^2\) is directly related to RSS: \(R^2 = 1 - \text{RSS}/\text{TSS}\), and TSS does not change when the model changes, so \(R^2\)(small) \(\leq\) \(R^2\)(large).
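
A small sketch (simulated data) of this: adding a covariate that is unrelated to the response cannot increase RSS, and hence cannot decrease \(R^2\).

set.seed(1)
x <- runif(100)
y <- 1 + 2*x + rnorm(100)
junk <- rnorm(100)                     # covariate unrelated to y
small <- lm(y ~ x)
large <- lm(y ~ x + junk)
c(RSS_small = sum(residuals(small)^2), RSS_large = sum(residuals(large)^2))
c(R2_small = summary(small)$r.squared, R2_large = summary(large)$r.squared)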

Problem 2: Munich Rent index

a)

library(ggplot2)
library(gamlss.data)
library(dplyr)
data("rent99")

rent99$location <- as.factor(rent99$location)
formula1 <- rent ~ area + location + bath + kitchen + cheating
formula2 <- rentsqm ~ area + location + bath + kitchen + cheating

rent1 <- lm(formula1, data = rent99)
rent2 <- lm(formula2, data = rent99)

Look at the summary

summary(rent1)
## 
## Call:
## lm(formula = formula1, data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16
summary(rent2)
## 
## Call:
## lm(formula = formula2, data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4959 -1.4084 -0.0733  1.3847  9.4400 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.108319   0.168567  42.169  < 2e-16 ***
## area        -0.038154   0.001653 -23.077  < 2e-16 ***
## location2    0.628698   0.078782   7.980 2.04e-15 ***
## location3    1.686099   0.244061   6.909 5.93e-12 ***
## bath1        0.989898   0.162113   6.106 1.15e-09 ***
## kitchen1     1.412113   0.188299   7.499 8.34e-14 ***
## cheating1    2.414101   0.125297  19.267  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.1 on 3075 degrees of freedom
## Multiple R-squared:  0.2584, Adjusted R-squared:  0.2569 
## F-statistic: 178.6 on 6 and 3075 DF,  p-value: < 2.2e-16

Consider residual plots. We plot standardized residuals against fitted values.

yhat1=predict(rent1)
yhat2=predict(rent2)

estand1=rstandard(rent1)
estand2=rstandard(rent2)
 
ggplot(data.frame(yhat1,estand1),aes(yhat1,estand1))+geom_point(pch=19)+geom_abline(intercept=0,slope=0,col="red")

ggplot(data.frame(yhat2,estand2),aes(yhat2,estand2))+geom_point(pch=19)+geom_abline(intercept=0,slope=0,col="red")

We plot the true observations against the fitted values

ggplot(data.frame(yhat1,rent99),aes(rent99$rent,yhat1))+geom_point(pch=19)+geom_abline(intercept=0,slope=1,col="red")

ggplot(data.frame(yhat2,rent99),aes(rent99$rentsqm,yhat2))+geom_point(pch=19)+geom_abline(intercept=0,slope=1,col="red")

Normal Q-Q

ggplot(rent1, aes(sample = .stdresid)) +
  stat_qq(pch = 19) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
  labs(x = "Theoretical quantiles", y = "Standardized residuals", title = "Normal Q-Q", subtitle = deparse(rent1$call))

ggplot(rent2, aes(sample = .stdresid)) +
  stat_qq(pch = 19) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
  labs(x = "Theoretical quantiles", y = "Standardized residuals", title = "Normal Q-Q", subtitle = deparse(rent2$call))

We can’t really see that one response is better than the other, so we proceed with rent as the response.

b&c)

We consider the summary

summary(rent1)
## 
## Call:
## lm(formula = formula1, data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16

The Estimate column gives the \(\hat{\beta}\)’s. For example, if the living area increases by 1 square meter, the predicted response, net rent per month, increases by \(4.5788\) euro. The Std. Error column gives the estimated standard deviation (standard error) of the \(\hat{\beta}\)’s. The remaining columns report the \(t\)-value (t value) with the corresponding \(p\)-value (Pr(>|t|)). The \(RSE\), \(R^2\), \(R_{adj}^2\) and \(F\)-statistic are also reported as Residual standard error, Multiple R-squared, Adjusted R-squared and F-statistic respectively. See the module page for the definitions of these terms.
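
As a check (not part of the original solution), the t value and Pr(>|t|) columns can be reproduced from Estimate and Std. Error with \(n-p-1\) degrees of freedom:

cf <- summary(rent1)$coefficients
tval <- cf[, "Estimate"]/cf[, "Std. Error"]
pval <- 2*pt(abs(tval), df = rent1$df.residual, lower.tail = FALSE)
cbind(tval, pval)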

Interpretation of the intercept: This is the intercept for an average location (location = 1). (If location = 2, the intercept is \(-21.9733+39.2602\), and if location = 3, the intercept is \(-21.9733+126.0575\).)
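
For illustration, the effective intercepts per location can be computed directly from the fit (location 1 is the reference level):

b <- coef(rent1)
c(location1 = unname(b["(Intercept)"]),
  location2 = unname(b["(Intercept)"] + b["location2"]),
  location3 = unname(b["(Intercept)"] + b["location3"]))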

d)

orgfit=lm(rent~area+location+bath+kitchen+cheating,data=rent99)
summary(orgfit)
set.seed(1) #to be able to reproduce results
n=dim(rent99)[1]
IQ=rnorm(n,100,16)
fitIQ=lm(rent~area+as.factor(location)+bath+kitchen+cheating+IQ,data=rent99)
summary(fitIQ)

summary(orgfit)$sigma
summary(fitIQ)$sigma

summary(orgfit)$r.squared
summary(fitIQ)$r.squared
summary(orgfit)$adj.r.squared
summary(fitIQ)$adj.r.squared
## 
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = rent ~ area + as.factor(location) + bath + kitchen + 
##     cheating + IQ, data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -630.95  -89.50   -6.12   82.62  995.76 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -41.3879    19.5957  -2.112   0.0348 *  
## area                   4.5785     0.1143  40.056  < 2e-16 ***
## as.factor(location)2  39.2830     5.4467   7.212 6.90e-13 ***
## as.factor(location)3 126.3356    16.8748   7.487 9.18e-14 ***
## bath1                 74.1979    11.2084   6.620 4.23e-11 ***
## kitchen1             120.0756    13.0214   9.221  < 2e-16 ***
## cheating1            161.4450     8.6625  18.637  < 2e-16 ***
## IQ                     0.1940     0.1574   1.232   0.2179    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3074 degrees of freedom
## Multiple R-squared:  0.4507, Adjusted R-squared:  0.4494 
## F-statistic: 360.3 on 7 and 3074 DF,  p-value: < 2.2e-16
## 
## [1] 145.1879
## [1] 145.1757
## [1] 0.4504273
## [1] 0.4506987
## [1] 0.449355
## [1] 0.4494479

\(R^2\) will always increase (or stay the same) if we add a covariate to the model. Thus, we cannot use it alone for model selection. The adjusted \(R_{adj}^2\), however, is “punished” based on the number of parameters in the model and will not necessarily increase if we add a covariate to the model.

Sigma (or RSE) is given by \(\hat{\sigma}=\sqrt{\frac{1}{n-p-1}RSS}\). Multiple R-squared is given by \(R^2=1-\frac{RSS}{TSS}=1-\frac{\hat{\sigma}^2 (n-p-1)}{TSS}\).
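
As a check (not part of the original solution), these quantities can be computed from their definitions and compared with summary(rent1):

RSS <- sum(residuals(rent1)^2)
TSS <- sum((rent99$rent - mean(rent99$rent))^2)
n <- nrow(rent99)
p <- length(coef(rent1)) - 1
c(RSE = sqrt(RSS/(n - p - 1)),
  R2 = 1 - RSS/TSS,
  R2adj = 1 - (RSS/(n - p - 1))/(TSS/(n - 1)))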

e)

formula <- rent ~ area + location + bath + kitchen + cheating
rent1 <- lm(formula, data = rent99)#, contrasts = list(location = "contr.sum"))

rent99 <- rent99 %>% mutate(yearc.cat = cut(yearc, breaks = c(-Inf, seq(1920,2000,10)), labels = 10*1:9))

formula <- rent ~ area + location + bath + kitchen + cheating + yearc.cat
rent2 <- lm(formula, data = rent99)#, contrasts = list(location = "contr.sum"))

rent99 <- rent99 %>% mutate(yearc.cat2 = cut(yearc, breaks = c(-Inf, seq(1920,2000,20)), labels = c(20,40,60,80,00)))

formula <- rent ~ area + location + bath + kitchen + cheating + yearc.cat2
rent3 <- lm(formula, data = rent99)#,contrasts = list(location = "contr.sum"))
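
A quick check (not part of the original solution) of what the two categorical codings of construction year look like:

table(rent99$yearc.cat)
table(rent99$yearc.cat2)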

f)

library(MASS)
library(leaps)
best <- regsubsets(model.matrix(rent3)[,-1], y = rent99$rent,method="exhaustive")
summary(best)
## Subset selection object
## 10 Variables  (and intercept)
##              Forced in Forced out
## area             FALSE      FALSE
## location2        FALSE      FALSE
## location3        FALSE      FALSE
## bath1            FALSE      FALSE
## kitchen1         FALSE      FALSE
## cheating1        FALSE      FALSE
## yearc.cat240     FALSE      FALSE
## yearc.cat260     FALSE      FALSE
## yearc.cat280     FALSE      FALSE
## yearc.cat20      FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          area location2 location3 bath1 kitchen1 cheating1 yearc.cat240
## 1  ( 1 ) "*"  " "       " "       " "   " "      " "       " "         
## 2  ( 1 ) "*"  " "       " "       " "   " "      " "       " "         
## 3  ( 1 ) "*"  " "       " "       " "   " "      "*"       " "         
## 4  ( 1 ) "*"  " "       "*"       " "   " "      "*"       " "         
## 5  ( 1 ) "*"  "*"       "*"       " "   " "      "*"       " "         
## 6  ( 1 ) "*"  "*"       "*"       " "   " "      "*"       " "         
## 7  ( 1 ) "*"  "*"       "*"       " "   "*"      "*"       " "         
## 8  ( 1 ) "*"  "*"       "*"       "*"   "*"      "*"       " "         
##          yearc.cat260 yearc.cat280 yearc.cat20
## 1  ( 1 ) " "          " "          " "        
## 2  ( 1 ) " "          " "          "*"        
## 3  ( 1 ) " "          " "          "*"        
## 4  ( 1 ) " "          " "          "*"        
## 5  ( 1 ) " "          " "          "*"        
## 6  ( 1 ) " "          "*"          "*"        
## 7  ( 1 ) " "          "*"          "*"        
## 8  ( 1 ) " "          "*"          "*"

Best subset selection is used (you will learn more about this later). The output shows the best model of each size (1-8 covariates). The best model with one covariate uses only area, the best model with two covariates uses area and yearc.cat20, and so on.

summary(best)$cp
## [1] 1015.979023  540.680039  228.460243  184.179283  125.679898   75.667739
## [7]   32.571877    9.418625

Model 8 gives the lowest Mallows’ \(C_p\) and is the preferred model.
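
For illustration, the same conclusion can be reached programmatically:

k <- which.min(summary(best)$cp)   # index of the model with smallest Cp (here 8)
summary(best)$which[k, ]           # covariates included in that model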

Problem 3: Simulations in R

a)

# CI for beta_j

true_beta <- c(3.14, 10, 0.8) # choosing true betas
true_sd <- 10 # choosing true sd
set.seed(345); X <- matrix(c(rep(1, 100), runif(100, 2, 5), sample(1:100, 100, replace = TRUE)), 
            nrow = 100, ncol = 3) # fixing X. set.seed() is used to produce same X every time this code is used

# simulating and fitting models many times
ci_int <- ci_x1 <- ci_x2 <- 0; nsim <- 1000
for (i in 1:nsim){
  y <- rnorm(n = 100, mean = X%*%true_beta, sd = rep(true_sd, 100))
  mod <- lm(y ~ x1 + x2, data = data.frame(y = y, x1 = X[,2], x2 = X[,3]))
  ci <- confint(mod)
  ci_int[i] <- ifelse(true_beta[1] >= ci[1,1] && true_beta[1] <= ci[1,2], 1, 0)
  ci_x1[i] <- ifelse(true_beta[2] >= ci[2,1] && true_beta[2] <= ci[2,2], 1, 0)
  ci_x2[i] <- ifelse(true_beta[3] >= ci[3,1] && true_beta[3] <= ci[3,2], 1, 0)
}

c(mean(ci_int), mean(ci_x1), mean(ci_x2))
## [1] 0.952 0.944 0.945

b)

# PI for Y_0

true_beta <- c(3.14, 10, 0.8) # choosing true betas
true_sd <- 10 # choosing true sd
set.seed(345); 
X <- matrix(c(rep(1, 100), runif(100, 2, 5), 
              sample(1:100, 100,replace = TRUE)), nrow = 100, ncol = 3) # fixing X. 

#set.seed() is used to produce same X every time this code is used

x0 <- c(1,3,50)

# simulating and fitting models many times
pi_y0 <- 0; nsim <- 1000
for (i in 1:nsim){
  y <- rnorm(n = 100, mean = X%*%true_beta, sd = rep(true_sd, 100))
  mod <- lm(y ~ x1 + x2, data = data.frame(y = y, x1 = X[,2], x2 = X[,3]))
  y0 <- rnorm(n = 1, mean = x0%*%true_beta, sd = true_sd)
  pi <- predict(mod, newdata = data.frame(x1 = x0[2], x2 = x0[3]), interval = "predict")[,2:3]
  pi_y0[i] <- ifelse (y0 >= pi[1] && y0 <=pi[2], 1, 0)
}

mean(pi_y0)
## [1] 0.958

c)

library(ggplot2)
#Homoscedastic error
n=1000
x=seq(-3,3,length=n)
beta0=-1
beta1=2
xbeta=beta0+beta1*x
sigma=1
e1=rnorm(n,mean=0,sd=sigma)
y1=xbeta+e1
ehat1=residuals(lm(y1~x))

#ggplot-solution
ggplot(data.frame(x=x,y1=y1),aes(x,y1)) +
  geom_point(pch =19)+geom_abline(slope=beta1,intercept=beta0,col="red")

ggplot(data.frame(x=x,ehat1=ehat1),aes(x,ehat1))+geom_point(pch=19)+geom_hline(yintercept=0,col="red")

Correct model: We don’t see any pattern in the residual plot. The variance seems to be independent of the covariate \(x\).

#Heteroscedastic errors
sigma=(0.1+0.3*(x+3))^2
e2=rnorm(n,0,sd=sigma)
y2=xbeta+e2
ehat2=residuals(lm(y2~x))

#ggplot-solution
ggplot(data.frame(x=x,y2=y2),aes(x,y2)) +
  geom_point(pch =19)+geom_abline(slope=beta1,intercept=beta0,col="red")

ggplot(data.frame(x=x,ehat2=ehat2),aes(x,ehat2))+geom_point(pch=19)+geom_hline(yintercept=0,col="red")

Wrong model: The variance of the residuals increases as a function of \(x\).

d)

Reduce the sample size to, for example, \(n=10\). Then we see a difference between the standardized and the studentized residuals (red and blue points). The expressions for the standardized and studentized residuals of an observation \(y_i\) are identical, except that the latter estimates \(\hat{\sigma}\) without using observation number \(i\). When the sample size is large, it typically doesn’t matter whether we include \(y_i\) in the estimation or not.
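
For reference, with \(h_{ii}\) the \(i\)’th diagonal element of the hat matrix \(H=X(X^TX)^{-1}X^T\), the standardized and studentized residuals are \[ r_i=\frac{\hat{\epsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}} \quad \text{and} \quad t_i=\frac{\hat{\epsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}, \] where \(\hat{\sigma}_{(i)}\) is the residual standard error estimated without observation \(i\).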

n=10
beta=matrix(c(0,1,1/2,1/3),ncol=1)
set.seed(123)
x1=rnorm(n,0,1); x2=rnorm(n,0,2); x3=rnorm(n,0,3)
X=cbind(rep(1,n),x1,x2,x3)
y=X%*%beta+rnorm(n,0,2)
fit=lm(y~x1+x2+x3)
yhat=predict(fit)
summary(fit)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8557 -0.5714 -0.2191  0.6980  0.9667 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   0.4958     0.3067   1.617  0.15706   
## x1            1.8135     0.3744   4.844  0.00287 **
## x2            0.3260     0.1909   1.708  0.13853   
## x3            0.2076     0.1268   1.638  0.15262   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8686 on 6 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8278 
## F-statistic: 15.43 on 3 and 6 DF,  p-value: 0.003162
ehat=residuals(fit); estand=rstandard(fit); estud=rstudent(fit)

ggplot(data=data.frame(ehat,yhat,estand,estud),aes(yhat,ehat))+geom_point(pch=19)+geom_point(aes(yhat,estand),col="red")+geom_point(aes(yhat,estud),col="blue",pch=19)