TMA4315 Generalized linear models H2017

Module 2: MULTIPLE LINEAR REGRESSION

Mette Langaas, Department of Mathematical Sciences, NTNU – with contributions from Ingeborg Hem and Øyvind Bakke

28.08 and 04.09 [PL=plenary lecture in F3], 30.08 and 06.09 [IL=interactive lecture in Smia] (Version 13.09.2017)

install.packages("gamlss.data")
install.packages("tidyverse")
install.packages("GGally")
install.packages("Matrix")

For Compulsory 1: jump to sequential anova formula

Aim of this module

First week

  • Aim of multiple linear regression.
  • Define and understand the multiple linear regression model (matrix notation)
  • parameter estimation with maximum likelihood (and least squares), geometric interpretation
  • likelihood, score vector and Hessian (observed Fisher information matrix)
  • properties of parameter estimates (distribution)
  • categorical covariates: dummy or effect coding
  • simple vs multiple regression: why (and when) do estimated regression effects change from simple to multiple regression?
  • QQ-plots
  • assessing model fit (diagnostic), residuals

Jump to interactive for first week


Second week

  • statistical inference for parameter estimates
    • confidence intervals,
    • prediction intervals,
    • hypothesis test,
    • linear hypotheses
  • analysis of variance decompositions and \(R^2\), sequential ANOVA table
  • model selection with AIC and variants

Jump to interactive for second week

Textbook: Chapter 2.2, 3 and B.4. (Chapter 3 was on the reading list for TMA4267 Linear statistical models in 2016 and 2017, but TMA4255 Applied statistics used sum notation, not matrix notation, and a different textbook.)

Aim of multiple linear regression

  1. Construct a model to help understand the relationship between a response and one or several explanatory variables. [Correlation, or cause and effect?]

  2. Construct a model to predict the response from a set of (one or several) explanatory variables. [More or less “black box”]

\(\oplus\) Notation

Classnotes 28.08.2017

\({\bf Y}: (n \times 1)\) vector of responses [e.g. one of the following: rent, rent pr sqm, weight of baby, ph of lake, volume of tree]

\({\bf X}: (n \times p)\) design matrix [e.g. location of flat, gestation age of baby, chemical measurement of the lake, height of tree]

\({\bf \beta}: (p \times 1)\) vector of regression parameters (intercept included, so \(p=k+1\))

\({\bf \varepsilon}: (n\times 1)\) vector of random errors.

We assume that the pairs \(({\bf x}_i^T,y_i)\) (\(i=1,...,n\)) are measured on independent sampling units. That is, the observation pair \(({\bf x}_1^T,y_1)\) is independent of \(({\bf x}_2^T,y_2)\), and so on.


\(\odot\) Munich rent index: response and covariates

Study the print-out and discuss the following questions:

  • What can be response, and what covariates? (using what you know about rents) A: rent or rentsqm for response and all other as covariates.
  • What type of response(s) do we have? (continuous, categorical, nominal, ordinal, discrete, factors, …). A: rent, rentsqm (continuous).
  • What types of covariates? (continuous, categorical, nominal, ordinal, discrete, factors, …) A: area (continuous), yearc (ordinal or continuous), location (categorical - ordinal), bath, kitchen, cheating (categorical, nominal?), district (categorical - nominal).
  • Explain what the elements of model.matrix are. (Hint: coding of location) A: our design matrix, contains numerical values for the continuous covariates, and dummy variable coding for categorical covariates (more about this later).
# install.packages('gamlss.data') the package needs to be installed before
# this Rmd is run
library("gamlss.data")
ds = rent99
colnames(ds)
## [1] "rent"     "rentsqm"  "area"     "yearc"    "location" "bath"    
## [7] "kitchen"  "cheating" "district"
summary(ds)
##       rent            rentsqm             area            yearc     
##  Min.   :  40.51   Min.   : 0.4158   Min.   : 20.00   Min.   :1918  
##  1st Qu.: 322.03   1st Qu.: 5.2610   1st Qu.: 51.00   1st Qu.:1939  
##  Median : 426.97   Median : 6.9802   Median : 65.00   Median :1959  
##  Mean   : 459.44   Mean   : 7.1113   Mean   : 67.37   Mean   :1956  
##  3rd Qu.: 559.36   3rd Qu.: 8.8408   3rd Qu.: 81.00   3rd Qu.:1972  
##  Max.   :1843.38   Max.   :17.7216   Max.   :160.00   Max.   :1997  
##  location bath     kitchen  cheating    district   
##  1:1794   0:2891   0:2951   0: 321   Min.   : 113  
##  2:1210   1: 191   1: 131   1:2761   1st Qu.: 561  
##  3:  78                              Median :1025  
##                                      Mean   :1170  
##                                      3rd Qu.:1714  
##                                      Max.   :2529
dim(ds)
## [1] 3082    9
head(ds)
##       rent   rentsqm area yearc location bath kitchen cheating district
## 1 109.9487  4.228797   26  1918        2    0       0        0      916
## 2 243.2820  8.688646   28  1918        2    0       0        1      813
## 3 261.6410  8.721369   30  1918        1    0       0        1      611
## 4 106.4103  3.547009   30  1918        2    0       0        0     2025
## 5 133.3846  4.446154   30  1918        2    0       0        1      561
## 6 339.0256 11.300851   30  1918        2    0       0        1      541
str(ds$location)
##  Factor w/ 3 levels "1","2","3": 2 2 1 2 2 2 1 1 1 2 ...
contrasts(ds$location)
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1
X = model.matrix(rentsqm ~ area + yearc + location + bath + kitchen + cheating + 
    district, data = ds)
head(X)
##   (Intercept) area yearc location2 location3 bath1 kitchen1 cheating1
## 1           1   26  1918         1         0     0        0         0
## 2           1   28  1918         1         0     0        0         1
## 3           1   30  1918         0         0     0        0         1
## 4           1   30  1918         1         0     0        0         0
## 5           1   30  1918         1         0     0        0         1
## 6           1   30  1918         1         0     0        0         1
##   district
## 1      916
## 2      813
## 3      611
## 4     2025
## 5      561
## 6      541

Model

\[{\bf Y=X \beta}+{\bf\varepsilon}\] is called a classical linear model if the following is true:

  1. \(\text{E}(\bf{\varepsilon})=\bf{0}\).

  2. \(\text{Cov}(\bf{\varepsilon})=\text{E}(\bf{\varepsilon}\bf{\varepsilon}^T)=\sigma^2\bf{I}\).

  3. The design matrix has full rank, \(\text{rank}({\bf X})=k+1=p\).

The classical normal linear regression model is obtained if additionally

  1. \(\varepsilon\sim N_n(\bf{0},\sigma^2\bf{I})\) holds.

For random covariates these assumptions are to be understood conditionally on \(\bf{X}\).

\(\oplus\) Questions

  • What is “full rank”? Why is this needed? Example of rank less than \(p\)?
  • For the normal linear model - what is the distribution for each \(Y_i\)? Are the \(Y_i\)s independent?
  • Would the normal linear model be a possible model for the Munich rent data? Why?

Parameter estimation

In multiple linear regression there are two popular methods for estimating the regression parameters in \(\beta\): maximum likelihood and least squares. These two methods give the same estimator when we assume the normal linear regression model. We will in this module focus on maximum likelihood estimation, since that can be used also when we have non-normal responses (modules 3-6: binomial, Poisson, gamma, multinomial).

\(\oplus\) Likelihood theory (from B.4)

We assume that we have collected independent pairs of data from \(n\) units (individuals, sample points)- here we move on to define the likelihood, loglikelihood, score vector, observed and expected Fisher information. Then we solve “score vector=0” to find parameter estimates.

Classnotes 21.08.2017


Least squares and maximum likelihood estimator for \({\bf \beta}\):

\[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]

\(\odot\) Munich rent index: parameter estimates

Explain what the values under Estimate mean in practice.

fit = lm(rentsqm ~ area + yearc + location + bath + kitchen + cheating, data = ds)
summary(fit)
## 
## Call:
## lm(formula = rentsqm ~ area + yearc + location + bath + kitchen + 
##     cheating, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4303 -1.4131 -0.1073  1.3244  8.6452 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -45.475484   3.603775 -12.619  < 2e-16 ***
## area         -0.032330   0.001648 -19.618  < 2e-16 ***
## yearc         0.026959   0.001846  14.606  < 2e-16 ***
## location2     0.777133   0.076870  10.110  < 2e-16 ***
## location3     1.725068   0.236062   7.308 3.45e-13 ***
## bath1         0.762808   0.157559   4.841 1.35e-06 ***
## kitchen1      1.136908   0.183088   6.210 6.02e-10 ***
## cheating1     1.765261   0.129068  13.677  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.031 on 3074 degrees of freedom
## Multiple R-squared:  0.3065, Adjusted R-squared:  0.3049 
## F-statistic: 194.1 on 7 and 3074 DF,  p-value: < 2.2e-16

Reproduce the values under Estimate by calculating without the use of lm.

X = model.matrix(rentsqm ~ area + yearc + location + bath + kitchen + cheating, 
    data = ds)
Y = ds$rentsqm
betahat = solve(t(X) %*% X) %*% t(X) %*% Y
# betahat-fit$coefficients
print(betahat)
##                     [,1]
## (Intercept) -45.47548356
## area         -0.03233033
## yearc         0.02695857
## location2     0.77713297
## location3     1.72506792
## bath1         0.76280784
## kitchen1      1.13690814
## cheating1     1.76526110

Projection matrices: idempotent, symmetric/orthogonal:

First, we define predictions as \(\hat{\bf Y}={\bf X}\hat{\beta}\), and inserting the ML (and LS) estimate we get \(\hat{\bf Y}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}\).

We define the projection matrix \[ {\bf H}={\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T\] called the hat matrix. This simplifies the notation for the predictions, \[\hat{\bf Y}={\bf H}{\bf Y}\] so the hat matrix is putting the hat on the response \({\bf Y}\).

In addition we define residuals as \[\begin{align*} \hat{\varepsilon}&={\bf Y}-\hat{\bf Y} \\ \hat{\varepsilon}&={\bf Y}-{\bf HY}=({\bf I-H}){\bf Y}\\ \end{align*}\]

so we have a second projection matrix \[ {\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T \]
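
As a small numerical illustration (simulated data, not part of the original notes), we can verify that \({\bf H}\) and \({\bf I}-{\bf H}\) are symmetric and idempotent:

# simulate a small design matrix and check the properties of H and I-H
set.seed(1)
n = 10
X = cbind(rep(1, n), rnorm(n), runif(n))
H = X %*% solve(t(X) %*% X) %*% t(X)
max(abs(H - t(H)))  # symmetry: essentially zero
max(abs(H %*% H - H))  # idempotency: essentially zero
IH = diag(n) - H
max(abs(IH %*% IH - IH))  # I-H is also idempotent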


Geometry of Least Squares (involving our two projection matrices)

  • Mean response vector: \(\text{E}({\bf Y})={\bf X}{\bf \beta}\)
  • As \(\beta\) varies, \({\bf X}\beta\) spans the model plane of all linear combinations. I.e. the space spanned by the columns of \({\bf X}\): the column-space of \({\bf X}\).
  • Due to random error (and unobserved covariates), \({\bf Y}\) is not exactly a linear combination of the columns of \({\bf X}\).
  • LS-estimation chooses \(\hat{\beta}\) such that \({\bf X}\hat{\beta}\) is the point in the column-space of \({\bf X}\) that is closest to \({\bf Y}\).
  • The residual vector \(\hat{\varepsilon}={\bf Y}-\hat{{\bf Y}}=({\bf I-H}){\bf Y}\) is perpendicular to the column-space of \({\bf X}\).
  • Multiplication by \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\) projects a vector onto the column-space of \({\bf X}\).
  • Multiplication by \({\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\) projects a vector onto the space perpendicular to the column-space of \({\bf X}\).

Graphical presentations: Putanen, Styan and Isotalo: Matrix Tricks for Linear Statistical Models: Our Personal Top Twenty, Figure 8.3.


Restricted maximum likelihood estimator for \({\bf \sigma}^2\):

\[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}{\hat{\beta}})^T({\bf Y}-{\bf X}{\bf \hat{\beta}})=\frac{\text{SSE}}{n-p}\]

In the generalized linear models setting (remember the exponential family from Module 1) we will regard the parameter \(\sigma^2\) as a nuisance parameter = a parameter that is not of interest to us. Our focus will be on the parameters of interest - which are related to the mean of the response, modelled using our covariates - so the regression parameters \(\beta\) are our prime focus.

However, to perform inference we need an estimator for \(\sigma^2\).

The maximum likelihood estimator for \(\sigma^2\) is \(\frac{\text{SSE}}{n}\), which is found by maximizing the likelihood with our estimate \(\hat{\beta}\) inserted \[L(\hat{\beta},{\sigma^2})=(\frac{1}{2\pi})^{n/2}(\frac{1}{\sigma^2})^{n/2}\exp(-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}))\] \[\begin{align*} l(\hat{\beta},{\sigma^2})&=\text{ln}(L(\hat{\beta},{\sigma^2}))\\ &=-\frac{n}{2}\text{ln}(2\pi)-\frac{n}{2}\text{ln}\sigma^2-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})\\ \end{align*}\]

The score vector with respect to \(\sigma^2\) is \[\frac{\partial l}{\partial \sigma^2}=0-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}) \] Solving \(\frac{\partial l}{\partial \sigma^2}=0\) gives us the estimator \[ \frac{1}{n}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n}\] But, this estimator is biased, which can be seen using the trace-formula (from TMA4267).

But, it is asymptotically unbiased (unbiased when the sample size \(n\) increases to infinity).

However, an unbiased version is preferred, and is found based on restricted maximum likelihood (REML). We will look into REML-estimation in Module 7. In our case the (unbiased) REML estimate is \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\]

The restricted maximum likelihood estimate is used in lm.

Q: What does it mean that the REML estimate is unbiased? Where is the estimate \(\hat{\sigma}\) in the regression output? (See output from lm for the rent index example.) A: Residual standard error: 2.031 is \(\hat{\sigma}\).
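
As a check (not in the original notes, and assuming the rentsqm fit named fit from above is still available), we can compute \(\hat{\sigma}\) directly from SSE and compare it with the residual standard error reported by lm:

# compare the unbiased estimate sqrt(SSE/(n-p)) with summary(fit)$sigma
SSE = sum(residuals(fit)^2)
n = nrow(ds)
p = length(coef(fit))
sqrt(SSE/(n - p))  # should equal the residual standard error (2.031)
summary(fit)$sigma
SSE/n  # the biased ML estimate of sigma^2, for comparison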

Properties for the normal linear model

To be able to do inference (=make confidence intervals, prediction intervals, test hypotheses) we need to know about the properties of our parameter estimators in the (normal) linear model.

  • Least squares and maximum likelihood estimator for \({\bf \beta}\): \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).

  • Restricted maximum likelihood estimator for \({\bf \sigma}^2\): \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}\hat{\beta})^T({\bf Y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\] with \(\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}\).

  • Statistic for inference about \(\beta_j\), \(c_{jj}\) is diagonal element \(j\) of \(({\bf X}^T{\bf X})^{-1}\). \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] This requires that \(\hat{\beta}_j\) and \(\hat{\sigma}\) are independent (see below).

However, when we work with large samples then \(n-p\) becomes large and the \(t\) distribution goes to a normal distribution, so we may use the standard normal in place of the \(t_{n-p}\).
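
A sketch (assuming the rentsqm fit named fit from above) showing how the Std. Error column in the summary arises as \(\sqrt{c_{jj}}\hat{\sigma}\):

# reproduce the standard errors: sqrt(c_jj) * sigmahat
X = model.matrix(fit)
sigmahat = summary(fit)$sigma
se = sqrt(diag(solve(t(X) %*% X))) * sigmahat
cbind(se, summary(fit)$coefficients[, "Std. Error"])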


Are \(\hat{{\bf \beta}}\) and SSE independent? (optional)

Can be proven using knowledge from TMA4267 on properties of the multivariate normal distribution.

Independence: Let \({\bf X}_{(p \times 1)}\) be a random vector from \(N_p({\bf \mu},{\bf \Sigma})\). Then \({\bf A}{\bf X}\) and \({\bf B}{\bf X}\) are independent iff \({\bf A}{\bf \Sigma}{\bf B}^T={\bf 0}\).

We have:

  • \({\bf Y}\sim N_n({\bf X}{\bf \beta},\sigma^2{\bf I})\)

  • \({\bf AY}=\hat{{\bf \beta}}=({\bf X}^T{\bf X})^{-1}{\bf X}^T {\bf Y}\), and

  • \({\bf BY}=({\bf I}-{\bf H}){\bf Y}\).

  • Now \({\bf A}\sigma^2{\bf I}{\bf B}^T=\sigma^2 {\bf A}{\bf B}^T=\sigma^2 ({\bf X}^T{\bf X})^{-1}{\bf X}^T ({\bf I}-{\bf H})={\bf 0}\)

  • since \({\bf X}^T({\bf I}-{\bf H})={\bf X}^T-{\bf X}^T{\bf H}={\bf X}^T-{\bf X}^T={\bf 0}\), because \({\bf HX}={\bf X}\) and hence \({\bf X}^T{\bf H}={\bf X}^T\).

  • We conclude that \(\hat{{\bf \beta}}\) is independent of \(({\bf I}-{\bf H}){\bf Y}\),

  • and, since SSE is a function of \(({\bf I}-{\bf H}){\bf Y}\): SSE\(={\bf Y}^T({\bf I}-{\bf H}){\bf Y}\),

  • then \(\hat{{\bf \beta}}\) and SSE are independent, and the result with \(T_j\) being t-distributed with \(n-p\) degrees of freedom is correct.
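
A small numerical check (simulated design matrix, hypothetical) that \({\bf A}{\bf B}^T=({\bf X}^T{\bf X})^{-1}{\bf X}^T({\bf I}-{\bf H})\) really is the zero matrix:

# check that (X^T X)^{-1} X^T (I - H) = 0 (up to rounding error)
set.seed(42)
n = 20
X = cbind(rep(1, n), rnorm(n), rnorm(n))
H = X %*% solve(t(X) %*% X) %*% t(X)
A = solve(t(X) %*% X) %*% t(X)
max(abs(A %*% (diag(n) - H)))  # essentially zero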

Categorical covariates - dummy and effect coding

Example: consider our rent dataset with rent as response, and continuous covariate area and categorical covariate location. Let the location be a factor with levels average, good, excellent.

require(gamlss.data)
require(tidyverse)
require(GGally)
ds = rent99 %>% select(location, area, rent)
levels(ds$location)
## [1] "1" "2" "3"
# change to meaningful names
levels(ds$location) = c("average", "good", "excellent")
ggpairs(ds)

Q: comment on what you see in the ggpairs plot.


Categorical covariates may either be ordered or unordered. We will only consider unordered categories here. In general, we would like to estimate regression coefficients for all levels of the categorical covariates. However, if we want to include an intercept in our model we can only include codings for one level less than the number of levels we have - or else our design matrix will not have full rank.

Q: Assume you have a categorical variable with three levels. Check for yourself that making a design matrix with one intercept and three columns with dummy (0-1) variable coding will result in a matrix that is singular.

# make 'wrong' dummy variable coding with 3 columns
n = length(ds$location)
X = cbind(rep(1, n), ds$area, rep(0, n), rep(0, n), rep(0, n))
X[ds$location == "average", 3] = 1
X[ds$location == "good", 4] = 1
X[ds$location == "excellent", 5] = 1
X[c(1, 3, 69), ]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   26    0    1    0
## [2,]    1   30    1    0    0
## [3,]    1   55    0    0    1
require(Matrix)
dim(X)
## [1] 3082    5
rankMatrix(X)
## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.843415e-13

This is why we instead need to work with different ways of coding categorical variables. One solution is to not include an intercept in the model, but that is often not what we want. We will look at two other solutions - one where we decide on a reference category (that we do not include in the coding, so it is in a sense absorbed into the intercept - this is called “treatment coding”) and one where we require that the sum of the coefficients is zero (called “effect coding”). This mainly affects how we interpret parameter estimates and communicate our findings to the world.

If we fit a regression model with lm to the data with rent as response and area and location as covariates, a model matrix is made - and how the categorical variable is handled is either specified in the call to lm via contrasts=list(location="contr.treatment") (or to model.matrix) or globally for all categorical variables with options(contrasts=c("contr.treatment","contr.poly")) - where the first element gives the choice for unordered factors (treatment contrast is the default) and the second for ordered factors (this polynomial contrast is the default). We will only work with unordered factors now.


Dummy variable coding aka treatment contrast

This is the default coding. The reference level is automatically chosen as the “lowest” level (sorted alphabetically). For our example this means that the reference category for location is “average”. If we instead wanted “good” to be reference category we could relevel the factor.

X1 = model.matrix(~area + location, data = ds)
X1[c(1, 3, 69), ]
##    (Intercept) area locationgood locationexcellent
## 1            1   26            1                 0
## 3            1   30            0                 0
## 69           1   55            0                 1
ds$locationRELEVEL = relevel(ds$location, ref = "good")
X2 = model.matrix(~area + locationRELEVEL, data = ds)
X2[c(1, 3, 69), ]
##    (Intercept) area locationRELEVELaverage locationRELEVELexcellent
## 1            1   26                      0                        0
## 3            1   30                      1                        0
## 69           1   55                      0                        1

So, what does this mean in practice? Model 1 has average as reference category and model 2 has good.

fit1 = lm(rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
summary(fit1)
## 
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       128.0867     8.6947  14.732  < 2e-16 ***
## area                4.7056     0.1202  39.142  < 2e-16 ***
## locationgood       28.0040     5.8662   4.774 1.89e-06 ***
## locationexcellent 131.1075    18.2614   7.180 8.73e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16
fit2 = lm(rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
summary(fit2)
## 
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              156.0907     9.4950  16.439  < 2e-16 ***
## area                       4.7056     0.1202  39.142  < 2e-16 ***
## locationRELEVELaverage   -28.0040     5.8662  -4.774 1.89e-06 ***
## locationRELEVELexcellent 103.1034    18.4021   5.603 2.30e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16

Q: Comment on the print-out. How do we interpret the intercept estimate?


Effect coding aka sum-zero-contrast:

This is an equally useful and popular coding - and it is the coding that is preferred when working with analysis of variance in general. The effect coding requires that the effects for the levels of the factor sum to zero, and this is done with the following coding scheme (Model 3 with the original location and Model 4 with the releveled version).

X3 = model.matrix(~area + location, data = ds, contrasts = list(location = "contr.sum"))
X3[c(1, 3, 69), ]
##    (Intercept) area location1 location2
## 1            1   26         0         1
## 3            1   30         1         0
## 69           1   55        -1        -1
X4 = model.matrix(~area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
X4[c(1, 3, 69), ]
##    (Intercept) area locationRELEVEL1 locationRELEVEL2
## 1            1   26                1                0
## 3            1   30                0                1
## 69           1   55               -1               -1

Observe the coding scheme. This means that we find “the missing location level estimate” as the negative of the sum of the parameter estimates for the other estimated levels.

So, what does this mean in practice?

fit3 = lm(rent ~ area + location, data = ds, contrasts = list(location = "contr.sum"))
summary(fit3)
## 
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181.1238    10.6383  17.026  < 2e-16 ***
## area          4.7056     0.1202  39.142  < 2e-16 ***
## location1   -53.0372     6.6428  -7.984 1.98e-15 ***
## location2   -25.0331     6.7710  -3.697 0.000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16
fit4 = lm(rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
summary(fit4)
## 
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      181.1238    10.6383  17.026  < 2e-16 ***
## area               4.7056     0.1202  39.142  < 2e-16 ***
## locationRELEVEL1 -25.0331     6.7710  -3.697 0.000222 ***
## locationRELEVEL2 -53.0372     6.6428  -7.984 1.98e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16

Q: Comment on the print-out. How do we now interpret the intercept estimate?

Simple vs. multiple regression: why (and when) do estimated regression effects change from simple to multiple regression?

The Aquarium

(story originally told by Kathrine Frey Frøslie, look up her blog at http://statistrikk.no)

Imagine an aquarium - maybe 50 cm wide (x-axis) and 1 meter long (our y axis) and 1 meter high (our z-axis). Inside the aquarium there are many many small fish swimming together as a shoal of fish. Suddenly - we are able to freeze time and all the fish are fixated in the water.

We move to the short side of the aquarium, bend down and see the fish in two dimensions (x and z) - we are able to use a ruler to draw a line through the fish - they look as if they lie on a straight line (we do a simple linear regression). Then we move over to the long side of the aquarium, bend down and see the fish in two dimensions (now y and z) - we are able to use a ruler to draw a line through the fish - and again they look as if they lie on a straight line (and we can do simple linear regression).

Then I hand you a fictive glass plate and ask you to try to fit it to the shoal of fish in three dimensions - you do a multiple linear regression.

Q:

  1. Design matrix - orthogonal columns give a diagonal \(X^T X\). What does that mean? How can we get orthogonal columns?

  2. If we have orthogonal columns, will then simple and multiple estimated regression coefficients be different? (See the simulation sketch after this list.)

  3. What is multicollinearity?
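
A minimal simulation sketch (hypothetical data, not from the lecture) illustrating questions 1 and 2: when the covariate columns are orthogonal (here achieved by centring and orthogonalizing), the simple and multiple regression estimates coincide; with correlated covariates they differ.

# orthogonal covariates: simple and multiple regression give the same slope
set.seed(1)
n = 100
x1 = rnorm(n)
x1 = x1 - mean(x1)  # centre so x1 is orthogonal to the intercept column
x2 = residuals(lm(rnorm(n) ~ x1))  # orthogonal to both the intercept and x1
y = 1 + 2 * x1 + 3 * x2 + rnorm(n)
c(coef(lm(y ~ x1))["x1"], coef(lm(y ~ x1 + x2))["x1"])  # identical
# correlated covariates: the estimate for x1 changes from simple to multiple
x3 = x1 + 0.3 * rnorm(n)
y2 = 1 + 2 * x1 + 3 * x3 + rnorm(n)
c(coef(lm(y2 ~ x1))["x1"], coef(lm(y2 ~ x1 + x3))["x1"])  # differ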

Checking model assumptions

In the normal linear model we have made the following assumptions.

  1. Linearity of covariates: \({\bf Y}={\bf X}\beta+\varepsilon\)

  2. Homoscedastic error variance: \(\text{Cov}(\varepsilon)=\sigma^2 {\bf I}\).

  3. Uncorrelated errors: \(\text{Cov}(\varepsilon_i,\varepsilon_j)=0\) for \(i\neq j\).

  4. Additivity of errors: \({\bf Y}={\bf X}\beta+\varepsilon\)

  5. Assumption of normality: \(\varepsilon \sim N_n({\bf 0},\sigma^2{\bf I})\)

General theory on QQ-plots

Go to separate R Markdown or html document: QQ–plot as Rmd or QQ–plot as html

Residuals

If we assume the normal linear model then we know that the residual vector (\(n\times 1\)) \[\hat{\varepsilon}={\bf Y}-{\bf \hat{Y}}=({\bf I}-{\bf H}){\bf Y}\] has a (singular) normal distribution with mean \(\text{E}(\hat{\varepsilon})={\bf 0}\) and covariance matrix \(\text{Cov}(\hat{\varepsilon})=\sigma^2({\bf I}-{\bf H})\) where \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\).

This means that the residuals (possibly) have different variance, and may also be correlated.

Our best guess for the error \(\varepsilon\) is the residual vector \(\hat{\varepsilon}\), and we may think of the residuals as predictions of the errors. Be aware: don’t mix errors (the unobserved) with the residuals (“observed”).

But, we see that the residuals are not independent and may have different variance, therefore we will soon define variants of the residuals that we may use to assess model assumptions after a data set is fitted.

Q: How can we say that the residuals can have different variance and may be correlated? Why is that a problem?


We would like to check the model assumptions - we see that they are all connected to the error terms. But we have not observed the error terms \(\varepsilon\), so they cannot be used for this. However, we have made “predictions” of the errors - our residuals. And we want to use our residuals to check the model assumptions.

That is, we want to check that our errors are independent, homoscedastic (same variance for each observation), and not dependent on our covariates - and we want to use the residuals (observed) in place of the errors (unobserved). It would then be ideal if the residuals had these properties whenever the underlying errors do. To amend the problem we transform the residuals so that they at least have equal variances. We do that by working with standardized or studentized residuals.

Standardized residuals:

\[r_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\] where \(h_{ii}\) is the \(i\)th diagonal element of the hat matrix \({\bf H}\).

In R you can get the standardized residuals from an lm-object (named fit) by rstandard(fit).

Studentized residuals:

\[r^*_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}\] where \(\hat{\sigma}_{(i)}\) is the estimated error standard deviation in a model with observation number \(i\) omitted. This seems like a lot of work, but it can be shown that it is possible to calculate the studentized residuals directly from the standardized residuals: \[r^*_i=r_i (\frac{n-p-1}{n-p-r_i^2})^{1/2}\]

In R you can get the studentized residuals from an lm-object (named fit) by rstudent(fit).
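
A quick check (assuming some lm object named fit, e.g. the rentsqm fit from earlier) that rstandard really is the raw residual divided by \(\hat{\sigma}\sqrt{1-h_{ii}}\):

# standardized residuals computed by hand, compared with rstandard()
h = hatvalues(fit)  # the diagonal elements h_ii of the hat matrix
r = residuals(fit)/(summary(fit)$sigma * sqrt(1 - h))
max(abs(r - rstandard(fit)))  # essentially zero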


Plotting residuals - and what to do when assumptions are violated?

Some important plots

  1. Plot the residuals, \(r^*_i\), against the predicted values, \(\hat{y}_i\).
  • Dependence of the residuals on the predicted value: wrong regression model?

  • Nonconstant variance: transformation or weighted least squares is needed?

  2. Plot the residuals, \(r^*_i\), against each predictor variable or functions of the predictor variables. A trend suggests that a transformation of the predictors or more terms are needed in the regression.

  3. Assessing normality of errors: QQ-plots and histograms of residuals. As an additional aid a test for normality can be used, but must be interpreted with caution since for small sample sizes the test is not very powerful and for large sample sizes even very small deviations from normality will be labelled as significant.

  4. Plot the residuals, \(r^*_i\), versus time or collection order (if possible). Look for dependence or autocorrelation.

Residuals can be used to check model assumptions, and also to discover outliers.


Diagnostic plots in R

More information on the plots here: http://data.library.virginia.edu/diagnostic-plots/ and http://ggplot2.tidyverse.org/reference/fortify.lm.html

You can use the function fortify.lm in ggplot2 to create a dataframe from an lm object, which ggplot uses automatically when given an lm object. This can be used to create diagnostic plots.

We will use the dataset women (built into R) as illustration. The information fortify provides is

library(ggplot2)  # fortify() is from ggplot2
data(women)  # Load a built-in dataframe called women
fit <- lm(weight ~ height, data = women)  # Run a regression analysis
format(head(fortify(fit)), digits = 4L)
##   weight height    .hat .sigma   .cooksd .fitted   .resid .stdresid
## 1    115     58 0.24167  1.370 0.5276638   112.6  2.41667    1.8198
## 2    117     59 0.19524  1.556 0.0605634   116.0  0.96667    0.7066
## 3    120     60 0.15595  1.579 0.0125634   119.5  0.51667    0.3688
## 4    123     61 0.12381  1.587 0.0001541   122.9  0.06667    0.0467
## 5    126     62 0.09881  1.583 0.0038437   126.4 -0.38333   -0.2648
## 6    129     63 0.08095  1.567 0.0143093   129.8 -0.83333   -0.5700
# to show what fortify implicitly does in ggplot

Residuals vs fitted values

A plot with the fitted values of the model on the x-axis and the residuals on the y-axis shows if the residuals have non-linear patterns. The plot can be used to check the assumption of a linear relationship between the response and the covariates. If the residuals are spread around a horizontal line with no distinct patterns, it is a good indication of no non-linear relationships, and a good model.

Does this look like a good plot for this data set?

ggplot(fit, aes(.fitted, .resid)) + geom_point(pch = 21) + geom_hline(yintercept = 0, 
    linetype = "dashed") + geom_smooth(se = FALSE, col = "red", size = 0.5, 
    method = "loess") + labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted values", 
    subtitle = deparse(fit$call))


Normal Q-Q

This plot shows if the residuals are Gaussian (normally) distributed. If they follow a straight line it is an indication that they are; otherwise they are probably not.

ggplot(fit, aes(sample = .stdresid)) + stat_qq(pch = 19) + geom_abline(intercept = 0, 
    slope = 1, linetype = "dotted") + labs(x = "Theoretical quantiles", y = "Standardized residuals", 
    title = "Normal Q-Q", subtitle = deparse(fit$call))


Scale-location

This is also called spread-location plot. It shows if the residuals are spread equally along the ranges of predictors. Can be used to check the assumption of equal variance (homoscedasticity). A good plot is one with a horizontal line with randomly spread points.

Is this plot good for your data?

ggplot(fit, aes(.fitted, sqrt(abs(.stdresid)))) + geom_point() + geom_smooth(se = FALSE, 
    col = "red", size = 0.5, method = "loess") + labs(x = "Fitted values", y = expression(sqrt("Standardized residuals")), 
    title = "Scale-location", subtitle = deparse(fit$call))


Residual vs Leverage

This plot can reveal influential outliers. Not all outliers are influential in linear regression; even though data have extreme values, they might not be influential in determining the regression line (the results don’t differ much if they are removed from the data set). Influential outliers can be seen as observations that do not get along with the trend in the majority of the observations. In plot.lm, dashed lines are used to indicate the Cook’s distance, instead of using the size of the dots as is done here.

Cook’s distance is based on the Euclidean distance between \(\mathbf{\hat{y}}\) (the fitted values) and \(\mathbf{\hat{y}}_{(i)}\) (the fitted values calculated when the \(i\)-th observation is omitted from the regression). It is thus a measure of how much the model is influenced by observation \(i\). The distance is scaled, and a rule of thumb is to examine observations with Cook’s distance larger than 1, and give some attention to those with Cook’s distance above 0.5.

Leverage is defined as the diagonal elements of the hat matrix, i.e., the leverage of the \(i\)-th data point is \(h_{ii}\) on the diagonal of \(\mathbf{H = X(X^TX)^{-1}X^T}\). A large leverage indicates that observation \(i\) has a large influence on the estimation results, and that the covariate values (\(\mathbf{x}_i\)) are unusual.

ggplot(fit, aes(.hat, .stdresid)) + geom_smooth(se = FALSE, col = "red", size = 0.5, 
    method = "loess") + geom_point(aes(size = .cooksd)) + scale_size_continuous("Cook's dist.") + 
    labs(x = "Leverage", y = "Standardized residuals", title = "Residuals vs Leverage", 
        subtitle = deparse(fit$call))

Interactive tasks for the first week (of this module)

Theoretical questions

  1. Check what has been covered so far in this module - and stop and answer the questions (marked with Q) in the text. Discuss among yourselves whether there are common theoretical or practical questions.

  2. Using the normal linear regression model write down the likelihood. Then define the score vector and observed Fisher information matrix. What is the set of equations we solve to find parameter estimates? What if we could not find a closed form solution to our set of equations? What if we have so many observations that they don’t even fit in the computer memory?

  3. A core finding is \(\hat \beta\). \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).

Show that \(\hat{\beta}\) has this distribution with the given mean and covariance matrix. What does this imply for the distribution of the \(j\)th element of \(\hat{\beta}\)? In particular, how can we calculate the variance of \(\hat{\beta}_j\)?

  4. Explain the difference between error and residual. What are the properties of the raw residuals? Why don’t we want to use the raw residuals for model check? What is our solution to this?

  5. What is the theoretical intercept and slope of a QQ–plot based on a normal sample?


Munich Rent Index data

Fit the regression model with first rent and then rentsqm as response and the following covariates: area, location (dummy variable coding using location2 and location3), bath, kitchen and cheating (central heating).

  1. Look at diagnostic plots for the two fits. Which response do you prefer?

Concentrate on the response model you chose for the rest of the tasks.

  2. Explain what the parameter estimates mean in practice. In particular, what is the interpretation of the intercept?

  3. What if you wanted to recode the location variable (average-good-excellent) so that the good location is the reference level. How can you do that, and what would that mean for your fitted model? Hint: first define a location factor, and then use relevel.

  4. Go through the summary printout and explain the parts you know now, and also observe the parts you don’t know yet (on the agenda for next week?).

  5. If you wanted to tell the common man about your fitted regression model - preferably using ggplot graphics - how could you do that?

Next week: more on inference on this data set.


Dummy vs. effect coding in MLR

We will study a dataset where we want to model income as response and two unordered categorical covariates gender and place (location).

income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410, 300, 
    320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female", "Male")), 
    place = factor(place, levels = c("A", "B", "C")))
  1. First, describe the data set. How can you present the data? ggpairs?

  2. Check out the interaction.plot(data$gender,data$place,data$income). What does it show? Do we need an interaction term if we want to model a MLR with income as response? Is there a ggplot version of the interaction plot?

  3. Check out plot.design(income~place+gender, data = data). What does it show?

  4. First, use treatment contrast (dummy variable coding) and fit a MLR with income as response and gender and place as covariates. Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?

  5. Now, turn to sum-zero contrast (effect coding). Explain what your model estimates mean. Now, what is the intercept estimate? Calculate the estimate for place=C.

Next week we connect this to linear hypotheses and ANOVA.


Simulations in R

  1. For simple linear regression, simulate a data set with homoscedastic errors and one with heteroscedastic errors. Here is a suggestion for one solution - not using ggplot; you can use ggplot. Why do this? To see how things look when the model is correct and when it is wrong.
# Homoscedastic errors
n = 1000
x = seq(-3, 3, length = n)
beta0 = -1
beta1 = 2
xbeta = beta0 + beta1 * x
sigma = 1
e1 = rnorm(n, mean = 0, sd = sigma)
y1 = xbeta + e1
ehat1 = residuals(lm(y1 ~ x))
plot(x, y1, pch = 20)
abline(beta0, beta1, col = 1)
plot(x, e1, pch = 20)
abline(h = 0, col = 2)
# Heteroscedastic errors
sigma = (0.1 + 0.3 * (x + 3))^2
e2 = rnorm(n, 0, sd = sigma)
y2 = xbeta + e2
ehat2 = residuals(lm(y2 ~ x))
plot(x, y2, pch = 20)
abline(beta0, beta1, col = 2)
plot(x, e2, pch = 20)
abline(h = 0, col = 2)
  2. All this fuss about raw, standardized and studentized residuals - does it really matter in practice? Below is one example where the raw residuals are rather different from the standardized, but the standardized are identical to the studentized. Can you come up with a simulation model where the standardized and studentized residuals are very different? Hint: what about a smaller sample size?
n = 1000
beta = matrix(c(0, 1, 1/2, 1/3), ncol = 1)
set.seed(123)
x1 = rnorm(n, 0, 1)
x2 = rnorm(n, 0, 2)
x3 = rnorm(n, 0, 3)
X = cbind(rep(1, n), x1, x2, x3)
y = X %*% beta + rnorm(n, 0, 2)
fit = lm(y ~ x1 + x2 + x3)
yhat = predict(fit)
summary(fit)
ehat = residuals(fit)
estand = rstandard(fit)
estud = rstudent(fit)
plot(yhat, ehat, pch = 20)
points(yhat, estand, pch = 20, col = 2)

Demonstrations

  • Using github (Jarl) - fork, update, pull request, merge. You need to have made your own account on github to benefit the most from this demo.
  • You may then send me a pull request to add your Rmd and html versions of your work (for the interactive session) to the module folders at https://github.com/mettelang/TMA4315H2017

  • Kristin showed that there are many group tools for your compulsory exercises - blog and file upload are two.

SECOND WEEK

\(\oplus\) What to remember?

Model: \[{\bf Y=X \beta}+{\bf\varepsilon}\] with full rank design matrix. And classical normal linear regression model when \[\varepsilon\sim N_n(\bf{0},\sigma^2\bf{I}).\]

Parameter of interest is \(\beta\) and \(\sigma^2\) is a nuisance. Maximum likelihood estimator \[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] has distribution: \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).

Restricted maximum likelihood estimator for \({\bf \sigma}^2\): \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}\hat{\beta})^T({\bf Y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\] with \(\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}\).

Statistic for inference about \(\beta_j\), \(c_{jj}\) is diagonal element \(j\) of \(({\bf X}^T{\bf X})^{-1}\). \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] This requires that \(\hat{\beta}_j\) and \(\hat{\sigma}\) are independent.

Classnotes 04.09.2017

Inference

We will consider confidence intervals and prediction intervals, and then test single and linear hypotheses.

Confidence intervals (CI)

In addition to providing a parameter estimate for each element of our parameter vector \(\beta\) we should also report a \((1-\alpha)100\)% confidence interval (CI) for each element. (We will not consider simultaneous confidence regions in this course.)

We focus on element \(j\) of \(\beta\), called \(\beta_j\). It is known that \(T_j =\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\) follows a \(t\)-distribution with \(n-p\) degrees of freedom. Let \(t_{\alpha/2,n-p}\) be such that \(P(T_j>t_{\alpha/2,n-p})=\alpha/2\). REMARK: our textbook would here look at area to the left instead of to the right - but we stick with this notation. Since the \(t\)-distribution is symmetric around 0, then \(P(T_j< -t_{\alpha/2,n-p})=\alpha/2\). We may then write \[ P(-t_{\alpha/2,n-p}\le T_j \le t_{\alpha/2,n-p})=1-\alpha\]

require(ggplot2)
df=10
ggplot(data.frame(x=c(-4,4)), aes(x))+ 
  theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks = element_blank())+
  stat_function(fun=function(x) dt(x,df), geom="line", colour="red")+
  geom_vline(xintercept=qt(0.975,df),color="blue")+
  geom_vline(xintercept=-qt(0.975,df),color="blue")+
  geom_hline(yintercept=0)+
  annotate("text",x=qt(0.975,df)+0.6,y=0.03,label="alpha/2",parse=TRUE,size=5)+
  annotate("text",x=-qt(0.975,df)-0.6,y=0.03,label="alpha/2",parse=TRUE,size=5)+
  annotate("text",x=0,y=0.03,label="1-alpha",parse=TRUE,size=5)

(Blue lines at \(\pm t_{\alpha/2,n-p}\).)


Inserting \(T_j =\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\) and solving so \(\beta_j\) is in the middle gives: \[ P(\hat{\beta}_j-t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma} \le \beta_j \le \hat{\beta}_j+t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma})=1-\alpha\]

A \((1-\alpha)100\)% CI for \(\beta_j\) is obtained when we insert numerical values for the upper and lower limits: \([\hat{\beta}_j-t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma},\hat{\beta}_j+t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma}]\).

CIs can be found in R using confint on an lm object. (Here dummy variable coding is used for location, with average as reference location.)

require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
confint(fit)
##                  2.5 %      97.5 %
## (Intercept) -44.825534   0.8788739
## area          4.354674   4.8029443
## location2    28.579849  49.9405909
## location3    92.970636 159.1443278
## bath1        52.076412  96.0311030
## kitchen1     94.907671 145.9621578
## cheating1   144.427555 178.4000215
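
The same intervals can be reproduced directly from the formula \(\hat{\beta}_j \pm t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma}\) (a check, reusing the fit above; note that vcov(fit) equals \(\hat{\sigma}^2({\bf X}^T{\bf X})^{-1}\)):

# reproduce confint() from the formula
betahat = coef(fit)
se = sqrt(diag(vcov(fit)))  # sqrt(c_jj) * sigmahat
tval = qt(0.975, df = fit$df.residual)
cbind(betahat - tval * se, betahat + tval * se)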

Q (and A):

  1. What is the interpretation of a 95% confidence interval? A: see Classnotes 04.09.2017

  2. Does the CI for \(\hat{\beta}_{\text{area}}\) change if we change the regression model (e.g. not include cheating)? A: Yes!

  3. How can we in practice find a CI for location1 (average location) - when that is not printed above? (Yes, may use formula, but in R without maths?) A: relevel to give good as reference.


Prediction intervals

Remember, one aim for regression was to “construct a model to predict the response from a set of (one or several) explanatory variables - more or less black box”.

Assume we want to make a prediction (of the response - often called \(Y_0\)) given specific values for the covariates - often called \({\bf x}_0\). An intuitive point estimate is \(\hat{Y}_0={\bf x}_0^T \hat{\beta}\) - but to give a hint of the uncertainty in this prediction we also want to present a prediction interval for the \(Y_0\).


To arrive at such an estimate we start with the difference between the unobserved response \(Y_0\) (for a given covariate vector \({\bf x}_0\)) and the point prediction \(\hat{Y}_0\), \(Y_0-\hat{Y}_0\). First, we assume that the unobserved response at covariate \({\bf x}_0\) is independent of our previous observations and follows the same distribution, that is \(Y_0 \sim N({\bf x}_0^T \beta,\sigma^2)\). Further,

\[\hat{Y}_0={\bf x}_0^T \hat{\beta} \sim N({\bf x}_0^T \beta,\sigma^2 {\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0).\]

Then, for \(Y_0-{\bf x}_0^T \hat{\beta}\) we have \[\text{E}(Y_0-{\bf x}_0^T \hat{\beta})=0 \text{ and } \text{Var}(Y_0-{\bf x}_0^T \hat{\beta})=\text{Var}(Y_0)+\text{Var}({\bf x}_0^T \hat{\beta})=\sigma^2+\sigma^2{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0\] so that \[Y_0-{\bf x}_0^T \hat{\beta}\sim N(0,\sigma^2 (1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0)) \] Inserting our REML-estimate for \(\sigma^2\) gives

\[T=\frac{Y_0-{\bf x}_0^T \hat{\beta}}{\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}}\sim t_{n-p}.\]

Then, we start with \[ P(-t_{\alpha/2,n-p}\le \frac{Y_0-{\bf x}_0^T \hat{\beta}}{\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}} \le t_{\alpha/2,n-p})=1-\alpha\] and solve so that \(Y_0\) is in the middle, which gives \[P({\bf x}_0^T \hat{\beta}-t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0} \le Y_0 \le {\bf x}_0^T \hat{\beta}+t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0})=1-\alpha\]

A \((1-\alpha)100\)% PI for \(Y_0\) is obtained when we insert numerical values for the upper and lower limits: \([{\bf x}_0^T \hat{\beta}-t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}, {\bf x}_0^T \hat{\beta}+t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}]\).


PIs can be found in R using predict on an lm object, but make sure that newdata is a data.frame with the same names as the original data. We want to predict the rent - with PI - for an apartment with area 50, location 2 (“good”), nice bath and kitchen and with central heating.

require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
newobs = rent99[1, ]
newobs[1, ] = c(NA, NA, 50, NA, 2, 1, 1, 1, NA)
predict(fit, newdata = newobs, interval = "prediction", type = "response")
##        fit      lwr      upr
## 1 602.1298 315.5353 888.7243
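
The same prediction interval can be computed directly from the formula above (a sketch, reusing fit; the covariate vector \({\bf x}_0\) is written out by hand):

# prediction interval by hand for area=50, location 2, bath, kitchen, central heating
X = model.matrix(fit)
x0 = c(1, 50, 1, 0, 1, 1, 1)  # intercept, area, location2, location3, bath1, kitchen1, cheating1
pred = sum(x0 * coef(fit))
sigmahat = summary(fit)$sigma
halfwidth = qt(0.975, df = fit$df.residual) * sigmahat * sqrt(1 + t(x0) %*% solve(t(X) %*% X) %*% x0)
c(pred - halfwidth, pred + halfwidth)  # compare with lwr and upr from predict() above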

Q (and A):

  1. When is a prediction interval of interest? A: Always? Gives useful information in addition to a point prediction.

  2. Explain the result from predict above. A: fit is the point prediction, lwr is the lower and upr the upper limit of the 95% PI (95% is the default level; use the level argument for other choices).
  3. What is the interpretation of a 95% prediction interval? A: see Classnotes 04.09.2017

Single hypothesis testing set-up

In single hypothesis testing we are interested in testing one null hypothesis against an alternative hypothesis. In MLR the hypothesis is often about a regression parameter \(\beta_j\): \[ H_0: \beta_j=0 \text{ vs. } H_1: \beta_j\neq 0\] Remark: we implicitly say that our test is done given that the other variables are present in the model, that is, the other \(\beta_i\)s (\(j\neq i\)) are not zero.

Two types of errors:

  • “Reject \(H_0\) when \(H_0\) is true” = “false positive” = “type I error” = “miscarriage of justice”. These are our fake news, which it is very important for us to avoid.

  • “Fail to reject \(H_0\) when \(H_1\) is true (and \(H_0\) is false)” = “false negative” = “type II error” = “a guilty criminal goes free”.

We choose to reject \(H_0\) at some significance level \(\alpha\) if the \(p\)-value of the test (see below) is smaller than the chosen significance level. We say that the Type I error is “controlled” at significance level \(\alpha\), meaning that the probability of miscarriage of justice (Type I error) does not exceed \(\alpha\).

Q (and A):

Draw a 2 by 2 table showing the connection between “truth” (\(H_0\) true or \(H_0\) false) and “action” (reject “\(H_0\)” and accept “\(H_0\)”) - and place the two types of errors in the correct position within the table. What else should be written within the last two cells?

A: see Classnotes 04.09.2017


Hypothesis test on \(\beta_j\)

In our MLR our test statistic for testing \(H_0: \beta_j=0\) is \[ T_0=\frac{\hat{\beta}_j-0}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] Inserted observed values (and estimates) we have \(t_0\).

We would in a two-sided setting reject \(H_0\) for large values of \(\text{abs}(t_0)\). We may rely on calculating a \(p\)-value.


The \(p\)-value

A \(p\)-value \(p({\bf Y})\) is a test statistic satisfying \(0 \leq p({\bf Y}) \leq 1\) for every vector of observations \({\bf Y}\).

  • Small values give evidence that \(H_1\) is true.

  • In single hypothesis testing, if the \(p\)-value is less than the chosen significance level (chosen upper limit for the probability of committing a type I error), then we reject the null hypothesis, \(H_0\). The chosen significance level is often referred to as \(\alpha\).

  • A \(p\)-value is valid if \[ P(p({\bf Y}) \leq \alpha) \leq \alpha\] for all \(\alpha\), \(0 \leq \alpha \leq 1\), whenever \(H_0\) is true, that is, if the \(p\)-value is valid, rejection on the basis of the \(p\)-value ensures that the probability of type I error does not exceed \(\alpha\).

  • If \(P(p({\bf Y}) \leq \alpha) = \alpha\) for all \(\alpha\), \(0 \leq \alpha \leq 1\), the \(p\)-value is called an exact \(p\)-value.


In our MLR we use the \(t\)-distribution to calculate \(p\)-values for our two-sided test situation \(H_0: \beta_j=0\) vs. \(H_1: \beta_j \neq 0\). Assume we have observed that our test statistic \(T_0\) takes the numerical value \(t_0\). Since the \(t\)-distribution is symmetric around \(0\) we have

\[p\text{-value}=P(T_0>\text{abs}(t_0))+P(T_0<-\text{abs}(t_0))=2\cdot P(T_0>\text{abs}(t_0)).\]

We reject \(H_0\) if our calculated \(p\)-value is below our chosen significance level. We often choose the significance level \(\alpha=0.05\).
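
A sketch (using the rent model fitted in the next subsection) of computing \(t_0\) and the two-sided \(p\)-value by hand for one coefficient (area), and comparing with the summary output:

# t-statistic and two-sided p-value for H0: beta_area = 0, computed by hand
require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
X = model.matrix(fit)
cjj = diag(solve(t(X) %*% X))["area"]
t0 = coef(fit)["area"]/(sqrt(cjj) * summary(fit)$sigma)
c(t0, 2 * pt(abs(t0), df = fit$df.residual, lower.tail = FALSE))
summary(fit)$coefficients["area", c("t value", "Pr(>|t|)")]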


Munich rent index hypothesis test

We look at print-out using summary from fitting lm.

require(gamlss.data)
colnames(rent99)
## [1] "rent"     "rentsqm"  "area"     "yearc"    "location" "bath"    
## [7] "kitchen"  "cheating" "district"
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
summary(fit)
## 
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16

Q (and A):

  1. Where is hypothesis testing performed here, and which are the hypotheses rejected at level \(0.01\)? A: The column named t value gives the numerical value of the \(t\)-statistic and the column Pr(>|t|) gives the \(p\)-value for the two-sided test.

  2. Will the test statistics and \(p\)-values change if we change the regression model? A: Yes. Unless our design matrix has orthogonal columns.

  3. What is the relationship between performing an hypothesis test and constructing a CI interval? Remember:

require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
confint(fit)
##                  2.5 %      97.5 %
## (Intercept) -44.825534   0.8788739
## area          4.354674   4.8029443
## location2    28.579849  49.9405909
## location3    92.970636 159.1443278
## bath1        52.076412  96.0311030
## kitchen1     94.907671 145.9621578
## cheating1   144.427555 178.4000215

A: If a \((1-\alpha)100\)% CI covers the hypothesized value (here 0), then this is a sensible value for the parameter, and it can be shown that the \(p\)-value then will be larger than \(\alpha\). And, if a \((1-\alpha)100\)% CI does not cover the hypothesized value (here 0), then this is not a sensible value for the parameter, and it can be shown that the \(p\)-value then will be smaller than \(\alpha\).


Testing linear hypotheses in regression

We study a normal linear regression model with \(p=k+1\) covariates, and refer to this as model A (the larger model). We then want to investigate the null and alternative hypotheses of the following type(s): \[\begin{eqnarray*} H_0: \beta_{j}&=&0 \text{ vs. } H_1:\beta_j\neq 0\\ H_0: \beta_{1}&=&\beta_{2}=\beta_{3}=0 \text{ vs. } H_1:\text{ at least one of these }\neq 0\\ H_0: \beta_{1}&=&\beta_{2}=\cdots=\beta_{k}=0 \text{ vs. } H_1:\text{ at least one of these }\neq 0\\ \end{eqnarray*}\]

We call the restricted model (when the null hypothesis is true) model B, or the smaller model.

These null hypotheses and alternative hypotheses can all be rewritten as a linear hypothesis \[H_0: {\bf C}{\bf \beta}={\bf d} \text{ vs. } {\bf C}{\bf \beta}\neq {\bf d} \] by specifying \({\bf C}\) to be a \(r \times p\) matrix and \({\bf d}\) to be a column vector of length \(r\).

The test statistic for performing the test is called \(F_{obs}\) and can be formulated in two ways: \[\begin{eqnarray} F_{obs}&=&\frac{\frac{1}{r} (SSE_{H_0}-SSE)}{\frac{SSE}{n-p}} \label{Fobsnested}\\ F_{obs}&=&\frac{1}{r}({\bf C}\hat{{\bf \beta}}-{\bf d})^{\text T}[\hat{\sigma}^2 {\bf C}({\bf X}^{\text T}{\bf X})^{-1}{\bf C}^{\text T}]^{-1}({\bf C}\hat{{\bf \beta}}-{\bf d}) \label{FobsCbeta} \end{eqnarray}\]

where \(SSE\) is from the larger model A, \(SSE_{H_0}\) from the smaller model B, and \(\hat{{\bf \beta}}\) and \(\hat{\sigma}^2\) are estimators from the larger model A.


Testing a set of parameters - what is \({\bf C}\) and \({\bf d}\)?

We consider a regression model with intercept and five covariates, \(x_1, \ldots, x_5\). Assume that we want to know if the covariates \(x_3\), \(x_4\), and \(x_5\) can be dropped (due to the fact that none of the corresponding \(\beta_j\)s are different from zero). This means that we want to test:

\[H_0: \beta_{3}=\beta_{4}=\beta_{5}=0 \text{ vs. } H_1:\text{ at least one of these }\neq 0\] This means that our \({\bf C}\) is a \(3\times 6\) matrix and \({\bf d}\) a \(3 \times 1\) column vector \[ {\bf C}=\begin{pmatrix} 0 &0 &0 &1 &0&0 \\ 0&0&0&0&1&0\\ 0&0&0&0&0& 1\\ \end{pmatrix} \text{ and } {\bf d}=\begin{pmatrix} 0\\0\\0\\ \end{pmatrix}\]
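
A sketch (simulated data, hypothetical) of carrying out such a test, both from the \({\bf C}\beta\) formulation of \(F_{obs}\) and by comparing the nested models with anova:

# test H0: beta3 = beta4 = beta5 = 0 in two equivalent ways (simulated data)
set.seed(1)
n = 100
x = matrix(rnorm(n * 5), ncol = 5)
y = 1 + x[, 1] - x[, 2] + rnorm(n)  # true beta3 = beta4 = beta5 = 0
modA = lm(y ~ x)  # larger model A
modB = lm(y ~ x[, 1:2])  # smaller model B (under H0)
anova(modB, modA)  # F-test of H0
# the same F-statistic from the C-beta formulation
X = model.matrix(modA)
C = cbind(matrix(0, 3, 3), diag(3))
d = rep(0, 3)
r = 3
Fobs = t(C %*% coef(modA) - d) %*% solve(summary(modA)$sigma^2 * C %*% solve(t(X) %*% X) %*% t(C)) %*% (C %*% coef(modA) - d)/r
Fobs  # matches the F value from anova(modB, modA)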


Testing one regression parameter

If we set \({\bf C}=(0,1, 0, \cdots, 0)\), a \(1\times p\) row vector with 1 in position 2 and 0 elsewhere, and \({\bf d}=0\) (a scalar, since \(r=1\)), then we test \[ H_0: \beta_1=0 \text{ vs. } H_1: \beta_1\neq 0.\] Now \({\bf C}\hat{{\bf \beta}}=\hat{\beta}_1\) and \({\bf C}({\bf X}^{\text T}{\bf X})^{-1}{\bf C}^{\text T}=c_{11}\), the diagonal element of \(({\bf X}^{\text T}{\bf X})^{-1}\) corresponding to \(\beta_1\), so that \(F_{obs}\) is equal to the square of the \(t\)-statistic for testing a single regression parameter. \[F_{obs}=(\hat{\beta}_1-0)^T[\hat{\sigma}^2 c_{11}]^{-1}(\hat{\beta}_1-0)=T_1^2\]

Repeat the argument with \(\beta_j\) instead of \(\beta_1\).

Remark: Remember that \(T_{\nu}^2=F_{1,\nu}\).
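
A quick numerical check of this relationship (a sketch reusing the rent model; the coefficient name cheating1 is taken from the summary print-out below):

require(gamlss.data)
fitA <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
fitB <- update(fitA, . ~ . - cheating)  # drop the covariate being tested
tval <- summary(fitA)$coefficients["cheating1", "t value"]
c(tval^2, anova(fitB, fitA)$F[2])  # the squared t-statistic equals the F-statistic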


Testing “significance of the regression”

If we let \({\bf C}\) be the \(k\times p\) matrix consisting of a column of zeros followed by the \(k\times k\) identity matrix, and \({\bf d}=(0,0,\ldots,0)^T\) a column vector of \(k\) zeros, then we test \[ H_0: \beta_1=\beta_2=\cdots= \beta_k =0 \text{ vs. } H_1: \text{at least one different from zero}.\] This means we test whether at least one of the regression parameters (in addition to the intercept) is different from 0. The smaller model is then the model with only the intercept, and for this model SSE\(_{H_0}\) is equal to SST (sums of squares total, see below). Let SSE be the sums-of-squares of errors for the full model. If we have \(k\) regression parameters (in addition to the intercept), then the F-statistic becomes \[ F_{obs}=\frac{\frac{1}{k}(\text{SST}-\text{SSE})}{\frac{\text{SSE}}{n-p}}\] with \(k\) and \(n-p\) degrees of freedom under \(H_0\).

require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
summary(fit)
## 
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16

Q (and A): Is the regression significant? A: Yes, the F-test is given in the last line of the summary print-out and gives a \(p\)-value clearly below any sensible significance level.
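
As a small sketch, the overall F-statistic can be reproduced from SST and SSE, reusing fit from the code chunk above:

sst <- sum((rent99$rent - mean(rent99$rent))^2)
sse <- sum(residuals(fit)^2)
n <- nrow(rent99)
p <- length(coef(fit))  # 7 estimated regression parameters
k <- p - 1  # 6 covariate effects tested
((sst - sse)/k)/(sse/(n - p))  # approximately 420, as in the summary print-out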


Relation to Wald test

Since \(\text{Cov}(\hat{\beta})=\sigma^2 ({\bf X}^T{\bf X})^{-1}\), then \(\text{Cov}({\bf C}\hat{\beta})={\bf C}\sigma^2({\bf X}^T{\bf X})^{-1}{\bf C}^T\), so that \({\bf C}\hat{\sigma}^2({\bf X}^T{\bf X})^{-1}{\bf C}^T\) can be seen as an estimate of \(\text{Cov}({\bf C}\hat{\beta})\). Therefore, \(F_{obs}\) can be written

\[F_{obs}=\frac{1}{r}({\bf C}\hat{{\bf \beta}}-{\bf d})^{\text T}[\widehat{\text{Cov}}({\bf C}\hat{\beta})]^{-1}({\bf C}\hat{{\bf \beta}}-{\bf d})=\frac{1}{r}W\] where \(W\) is a so-called Wald test statistic. It is known that \(W\sim \chi^2_r\) asymptotically as \(n\) becomes large. We will study the Wald test in more detail later in this course.


Asymptotic result

It can in general be shown that \[r F_{r,n-p}\stackrel{n\rightarrow \infty}{\longrightarrow} \chi^2_r.\] That is, if we have a random variable \(F\) that is distributed as Fisher with \(r\) (numerator) and \(n-p\) (denominator) degrees of freedom, then when \(n\) goes to infinity (\(p\) kept fixed), then \(rF\) is approximately \(\chi^2\)-distributed with \(r\) degrees of freedom.

Also, if our error terms are not normally distributed, we can still use that \(rF_{r,n-p}\) is approximately \(\chi^2_r\) when the number of observations becomes very large.
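
A small sketch of how close the exact F-based and the asymptotic \(\chi^2\)-based \(p\)-values are here (testing the two location dummies, so \(r=2\), in the rent model):

require(gamlss.data)
fitA <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
fitB <- update(fitA, . ~ . - location)
tab <- anova(fitB, fitA)
Fobs <- tab$F[2]
r <- tab$Df[2]  # 2
dfA <- tab$Res.Df[2]  # 3075
pf(Fobs, r, dfA, lower.tail = FALSE)  # exact F p-value
pchisq(r * Fobs, r, lower.tail = FALSE)  # asymptotic chi-square p-value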

Analysis of variance decomposition and coefficient of determination, \(R^2\)

It is possible to decompose the total variability in the data, called SST (sums-of-squares total), into a part that is explained by the regression SSR (sums-of-squares regression), and a part that is not explained by the regression SSE (sums-of-squares error, or really residual).

Let \(\bar{Y}=\frac{1}{n}\sum_{i=1}^n Y_i\), and \(\hat{Y}_i={\bf x}_i^T\hat{\beta}\). Then, \[\begin{align*} \text{SST}&=\text{SSR}+\text{SSE}\\ \text{SST}&=\sum_{i=1}^n (Y_i-\bar{Y})^2={\bf Y}^T({\bf I}-\frac{1}{n}{\bf 1 1}^T){\bf Y}\\ \text{SSR}&=\sum_{i=1}^n (\hat{Y}_i-\bar{Y})^2={\bf Y}^T({\bf H}-\frac{1}{n}{\bf 1 1}^T){\bf Y}\\ \text{SSE}&=\sum_{i=1}^n (Y_i-\hat{Y}_i)^2=\sum_{i=1}^n \hat{\varepsilon}_i^2={\bf Y}^T({\bf I}-{\bf H}){\bf Y}.\\ \end{align*}\]

The proof can be found in Section 3.5 in our text book Regression.

Based on this decomposition we may define the coefficient of determination (\(R^2\)) as the ratio between SSR and SST, that is \[R^2=\text{SSR}/\text{SST}=1-\text{SSE}/\text{SST}\] The interpretation of this coefficient is that the closer it is to 1, the better the fit to the data. If \(R^2=1\) then all residuals are zero - that is, a perfect fit to the data. (A numerical check of the decomposition and of \(R^2\) is given in the code sketch after the list below.)

  1. In a simple linear regression \(R^2\) equals the squared correlation coefficient between the response and the predictor. In multiple linear regression \(R^2\) is the squared correlation coefficient between the observed and the predicted response.

  2. If we have two models M1 and M2, where model M2 is a submodel of model M1, then \[ R^2_{M_1}\ge R^2_{M_2}.\] This can be explained from the fact that \(\text{SSE}_{M_1}\le \text{SSE}_{M_2}\). (More in the Theoretical questions.)
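
A minimal numerical check of the decomposition and of \(R^2\) for the rent model:

require(gamlss.data)
fit <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
y <- rent99$rent
sst <- sum((y - mean(y))^2)
ssr <- sum((fitted(fit) - mean(y))^2)
sse <- sum(residuals(fit)^2)
c(sst, ssr + sse)  # SST = SSR + SSE
c(ssr/sst, 1 - sse/sst, summary(fit)$r.squared)  # all equal R^2 (about 0.45)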


Analysis of variance tables - with emphasis on sequential Type I ANOVA

It is possible to call the function anova on an lm-object. What does that function do?

require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
anova(fit)
## Analysis of Variance Table
## 
## Response: rent
##             Df   Sum Sq  Mean Sq  F value    Pr(>F)    
## area         1 40299098 40299098 1911.765 < 2.2e-16 ***
## location     2  1635047   817524   38.783 < 2.2e-16 ***
## bath         1  1676825  1676825   79.547 < 2.2e-16 ***
## kitchen      1  2196952  2196952  104.222 < 2.2e-16 ***
## cheating     1  7317894  7317894  347.156 < 2.2e-16 ***
## Residuals 3075 64819547    21080                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What is produced is a sequential table of the reductions in residual sum of squares (SSE) as each term in the regression formula is added in turn. This type of ANOVA is often referred to as “Type 1” (not to be confused with type I errors).

We can produce the same table by fitting larger and larger regression models.

require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
fit0 <- lm(rent ~ 1, data = rent99)
fit1 <- update(fit0, . ~ . + area)
fit2 <- update(fit1, . ~ . + location)
fit3 <- update(fit2, . ~ . + bath)
fit4 <- update(fit3, . ~ . + kitchen)
fit5 <- update(fit4, . ~ . + cheating)
anova(fit0, fit1, fit2, fit3, fit4, fit5, test = "F")
## Analysis of Variance Table
## 
## Model 1: rent ~ 1
## Model 2: rent ~ area
## Model 3: rent ~ area + location
## Model 4: rent ~ area + location + bath
## Model 5: rent ~ area + location + bath + kitchen
## Model 6: rent ~ area + location + bath + kitchen + cheating
##   Res.Df       RSS Df Sum of Sq        F    Pr(>F)    
## 1   3081 117945363                                    
## 2   3080  77646265  1  40299098 1911.765 < 2.2e-16 ***
## 3   3078  76011217  2   1635047   38.783 < 2.2e-16 ***
## 4   3077  74334393  1   1676825   79.547 < 2.2e-16 ***
## 5   3076  72137441  1   2196952  104.222 < 2.2e-16 ***
## 6   3075  64819547  1   7317894  347.156 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# anova(fit0, fit1)  # compare model 0 and 1 - NOT sequential
# anova(fit0, fit5)  # compare model 0 and 5 - NOT sequential

If we had changed the order of adding the covariates to the model, then our anova table would also change. You might check that if you want.

See the last page of the classnotes 04.09.2017 for mathematical notation on the sequential tests in anova; details on the print-out come next - NEW: now with formulas!


Details on the test anova(fit)

When running anova on one fitted regression, the \(F\)-test in anova is calculated as for “testing linear hypotheses” - but with a slight twist. Our large model is still the full regression model (from the fitted object), but the smaller model is replaced by the change from one model to the next.

Let SSE be the sums-of-squares-error (residual sums of squares) from the full (large, called A) model - this will be our denominator (as always). For our rent example the denominator will be SSE/(n-p)=64819547/3075 (see print-out above).

However, for the numerator we are not comparing one small model with the full (large) one; we are instead looking at the change in SSE between two (smaller) models (called model B1 and model B2). So, the numerator is the difference in SSE between models B1 and B2, scaled by the difference in the number of parameters estimated in models B1 and B2, “number in B2 minus number in B1” (which is the same as the difference in residual degrees of freedom between the two models).

So, B1 could be the model with only an intercept, and B2 could be the model with intercept and area. Then we calculate the SSE for model B1 and for model B2, and keep the difference (here 40299098). Then we count the number of parameters estimated in model B1 and in model B2 and compute “number in B2 minus number in B1” (here 1). We could instead have calculated the number of degrees of freedom in the smaller model minus the number of degrees of freedom in the larger model (which here is also 1). Our numerator is then 40299098/1.

This means that the test statistics we use are:

\[ F_0=\frac{\frac{\text{SSE}_{B1}-\text{SSE}_{B2}}{\text{df}_{B1}-\text{df}_{B2}}}{\frac{\text{SSE}_A}{\text{df}_A}}\] Remark: notice that the denominator is just \(\hat{\sigma}^2\) from the larger model A.

This makes our \(F\)-test statistic: \(f_0=\frac{40299098/1}{64819547/3075}=1911.765\) (remember that we swap from capital to small letters when we insert numerical values).

To produce a \(p\)-value for the test \[H_0: \text{"Model B1 and B2 are equally good" vs. } H_1:\text{"Model B2 is better than B1"}\] we compare \(f_0\) to the \(F_{\text{df}_{B1}-\text{df}_{B2},\ \text{df}_A}\) distribution. In our example we compare to an F-distribution with 1 and 3075 degrees of freedom. The \(p\)-value is the “probability of observing a test statistic at least as extreme as the one we have observed”, so we calculate the \(p\)-value as \(P(F>f_0)\). This gives a \(p\)-value that is practically 0.

If you then want to use the asymptotic version (relating to a chi-square instead of the F), then multiply your F-statistic with \(\text{df}_{B1}-\text{df}_{B2}\) and relate to a \(\chi^2\) distribution with \(\text{df}_{B1}-\text{df}_{B2}\) degrees of freedom, where \(\text{df}_{B1}-\text{df}_{B2}\) is the difference in number of parameters in models B1 and B2. In our example \(\text{df}_{B1}-\text{df}_{B2}=1\).

For the anova table we do this sequentially for all models, starting from the intercept-only model and going up to the full model A. This means you need to calculate SSE and df for models of all sizes to compute all of these \(F_0\)s. Assume that we have 4 covariates that are added to the model, and call the 5 possible models (given the order of adding the covariates):

  • model 1: model with only intercept

  • model 2: model with intercept and covariate 1

  • model 3: model with intercept and covariate 1 and covariate 2

  • model 4: model with intercept and covariate 1 and covariate 2 and covariate 3

  • model 5: model with intercept and covariate 1 and covariate 2 and covariate 3 and covariate 4

Fit a linear model (lm) for each of the models 1-5, and store the SSE and the residual degrees of freedom df (number of observations minus the number of estimated regression parameters) for each model. Call these SSE\(_1\) to SSE\(_5\) and df\(_1\) to df\(_5\).

The anova output has columns: Df Sum Sq Mean Sq F value Pr(>F) and one row for each covariate added to the model.

  • model 2 vs model 1: Df=df\(_1\)-df\(_2\), Sum Sq=SSE\(_1\)-SSE\(_2\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).

  • model 3 vs model 2: Df=df\(_2\)-df\(_3\), Sum Sq=SSE\(_2\)-SSE\(_3\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).

  • model 4 vs model 3: Df=df\(_3\)-df\(_4\), Sum Sq=SSE\(_3\)-SSE\(_4\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).

  • model 5 vs model 4: Df=df\(_4\)-df\(_5\), Sum Sq=SSE\(_4\)-SSE\(_5\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).

In R the \(p\)-value is calculated as 1-pf(f0, Df, df\(_5\)) (with Df and df\(_5\) as numerator and denominator degrees of freedom), or as 1-pchisq(Df*f0, Df) if the asymptotic chi-square distribution is used.

So, this is what is presented - a sequential record of the effect of adding a new covariate.
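
As a sketch, the first row of the anova(fit) table above (the area row) can be reproduced by hand, reusing fit0, fit1 and the full model fit from the code chunk above:

sseB1 <- sum(residuals(fit0)^2)  # intercept only
sseB2 <- sum(residuals(fit1)^2)  # intercept + area
dfB1 <- fit0$df.residual
dfB2 <- fit1$df.residual
sseA <- sum(residuals(fit)^2)  # full model A
dfA <- fit$df.residual
f0 <- ((sseB1 - sseB2)/(dfB1 - dfB2))/(sseA/dfA)  # 1911.765
1 - pf(f0, dfB1 - dfB2, dfA)  # F-based p-value
1 - pchisq((dfB1 - dfB2) * f0, dfB1 - dfB2)  # asymptotic chi-square version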

Q: What happens if you change the order in which the covariates enter the model? A: Then everything changes. That is the drawback of Type 1 (sequential) thinking.


A competing way of thinking is called type 3 ANOVA: instead of looking sequentially at adding terms, we (like in summary) calculate the contribution of a covariate (or factor) given that all other covariates are present in the regression model. Type 3 ANOVA is available from library car as the function Anova (where the type of ANOVA can be given as input).

Check: take a look at the print-outs from summary and anova and observe that for our rent data the \(p\)-values for each covariate are different, due to the different nature of the \(H_0\)s tested (sequential vs. “all others present”).

If we had orthogonal columns for our different covariates the type 1 and type 3 ANOVA tables would have been equal.
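
A minimal sketch of how to obtain a Type 3 table (assuming the car package is installed), to compare with summary(fit) and the sequential anova(fit) above:

require(car)
require(gamlss.data)
fit <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
anova(fit)  # Type 1 (sequential) tests
Anova(fit, type = "III")  # Type 3 tests: each covariate given all the others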

Optional (beyond the scope of this course)

There is also something called a type 2 ANOVA table, but that is mainly important if we have interactions in our model, so we do not consider it here. If you want to read more, this blog post http://goanna.cs.rmit.edu.au/~fscholer/anova.php is a good read. And, in combination with different variants of dummy and effect coding, read this: http://rstudio-pubs-static.s3.amazonaws.com/65059_586f394d8eb84f84b1baaf56ffb6b47f.html

Model choice

Quality measures

To assess the quality of the regression we can report the coefficient of determination \(R^2\). However, since adding covariates to the linear regression cannot make the SSE larger, adding covariates cannot make \(R^2\) smaller. SSE and \(R^2\) are therefore only useful measures for comparing models with the same number of estimated regression parameters.

If we consider two models with the same model complexity then SSE can be used to choose between (or compare) these models.

But, if we want to compare models with different model complexity we need to look at other measures of quality for the regression.

\(R^2\) adjusted (corrected)

\[R^2_{\text{adj}}=1-\frac{\frac{SSE}{n-p}}{\frac{SST}{n-1}}=1-\frac{n-1}{n-p}(1-R^2)\] Choose the model with the largest \(R^2_{\text{adj}}\).


AIC Akaike information criterion

AIC is one of the most widely used criteria, and is designed for likelihood-based inference. Let \(l(\hat{\beta}_M,\tilde{\sigma}^2)\) be the maximum of the log-likelihood, that is, the log-likelihood evaluated at the maximum likelihood estimates of the regression parameters and the nuisance parameter. Further, let \(\lvert M \rvert\) be the number of estimated regression parameters in our model. \[\text{AIC} =-2 \cdot l(\hat{\beta}_M,\tilde{\sigma}^2)+2(\lvert M\rvert +1)\]

For a normal regression model: \[\text{AIC} =n\ln(\tilde{\sigma}^2)+2(\lvert M\rvert +1)+C\] where \(C\) is a function of \(n\) (and is the same for two models fitted to the same data set). Note that \(\tilde{\sigma}^2=\text{SSE}/n\) - our ML estimator (not our unbiased REML estimator) - so that the first term in the AIC is just a function of the SSE. For MLR, AIC and Mallows Cp give the same result when comparing models.

Choose the model with the minimum AIC.

BIC Bayesian information criterion.

The BIC is also based on the likelihood (see notation above). \[\text{BIC} =-2 \cdot l(\hat{\beta}_M,\tilde{\sigma}^2)+\ln(n)\cdot (\lvert M\rvert +1)\]

For a normal regression model: \[ \text{BIC}= n\ln(\tilde{\sigma}^2)+\ln(n)(\lvert M\rvert +1)\] Choose the model with the minimum BIC.

AIC and BIC are motivated in very different ways, but the final result for the normal regression model is very similar. BIC has a larger penalty than AIC (\(\ln(n)\) vs. \(2\)) and will therefore often give a smaller (more parsimonious) model than AIC. In general we would not like a model that is too complex.
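
A small sketch comparing two candidate models for the rent data with adjusted \(R^2\), AIC and BIC (the smaller model is chosen here purely for illustration):

require(gamlss.data)
fitA <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
fitB <- lm(rent ~ area + location, data = rent99)
c(summary(fitA)$adj.r.squared, summary(fitB)$adj.r.squared)  # larger is better
AIC(fitA, fitB)  # smaller is better
BIC(fitA, fitB)  # smaller is better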

Model selection strategies

  • All subset selection: uses a smart “leaps and bounds” algorithm, and works fine when the number of covariates is of the order of 40.

  • Forward selection: choose a starting model (only the intercept), then add one new variable at each step - selected to give the best improvement in the model selection criterion. Stop when no further improvement is made.

  • Backward elimination: choose a starting model (the full model), then remove one variable at each step - selected to give the best improvement in the model selection criterion. Stop when no further improvement is made.

  • Stepwise selection: combine forward and backward (see the code sketch below).
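
A minimal sketch of backward, forward and BIC-based selection with stepAIC (assuming the MASS package is installed), again using the rent data for illustration:

require(MASS)
require(gamlss.data)
full <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
empty <- lm(rent ~ 1, data = rent99)
stepAIC(full, direction = "backward")  # backward elimination with AIC (k = 2)
stepAIC(empty, scope = formula(full), direction = "forward")  # forward selection
stepAIC(full, direction = "backward", k = log(nrow(rent99)))  # k = log(n) gives BIC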

Interactive tasks for the second week (of this module)

Theoretical questions

  1. What is the interpretation of a 95% confidence interval? Hint: repeat experiment (on \(Y\)), on average how many CIs cover the true \(\beta_j\)?

  2. What is the interpretation of a 95% prediction interval? Hint: repeat experiment (on \(Y\)) for a given \({\bf x}_0\).

  3. Construct a 95% CI for \({\bf x}_0^T \beta\). Explain the connections between a CI for \(\beta_j\), a CI for \({\bf x}_0^T \beta\) and a PI for \(Y\) at \({\bf x}_0\).

  4. Explain why using summary and anova on a fitted lm gives different \(p\)-values listed for each covariate. Maybe test out Anova in library car with type 3 ANOVA to compare?

  5. Consider a MLR model \(A\) and a submodel \(B\) (all parameters in \(B\) are in \(A\) also). We say that \(B\) is nested within \(A\). Assume that regression parameters are estimated using maximum likelihood. Why is then the following true: SSE for model \(A\) will always be smaller or equal to SSE for model \(B\). And thus, \(R^2\) for model \(A\) can never be worse than \(R^2\) for model B.

  6. Take a look at the details of getting from the general AIC criterion to the version used for MLR. We will partly be using AIC in this course, but with likelihoods from the binomial, Poisson, etc.


Munich Rent index

Last week all groups decided on using rent or rentsqm as response, and in short - there was not really a big difference. So, now use rent as the response.

  1. We now want to use model selection to arrive at a good model. Start by defining which covariates you want to include and how to code them (location as dummy or effect coding). What about year of construction - is that a linear covariate? Maybe you want to make intervals in time instead? Linear or categorical for the time? What about the district? We leave that since we have not talked about how to use spatial covariates.

Hint: if you want to test out interval versions of year of construction the function mutate (from dplyr) is useful:

require(dplyr)  # needed for %>% and mutate
rent99 <- rent99 %>% mutate(yearc.cat = cut(yearc, breaks = c(-Inf, seq(1920, 
    2000, 10)), labels = 10 * 1:9))

More on dplyr: Tutorial: http://genomicsclass.github.io/book/pages/dplyr_tutorial.html and Cheat sheet (data wrangling): https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf and dplyr in particular: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf

  2. There are many ways to perform model selection in MLR. One possibility is best subsets, which can be done using the regsubsets function from library leaps. You may define x from model.matrix(fit)[,-1] (not including the intercept term), and then run best=regsubsets(x=model.matrix(fit)[,-1],y=rent99$rent) and look at summary(best). Explain the print-out (with all the stars). Using the Mallows Cp (named cp in the list from summary(best)) will give the same result as using AIC (which is not available in this function). What is your preferred model? Hint: look at the R-code in Problem 2 (Figure 3) from the TMA4267V2017 exam: pdf, and maybe the solutions for the interpretation pdf

  3. Check what is done if you use stepAIC. Do you get the same model choice as with best subset selection based on AIC? Why, or why not?


Dummy vs. effect coding in MLR (continued)

We have studied the data set with income, place and gender - with focus on dummy variable coding (with different reference levels) and effect coding and the interpretation of parameter estimates. Now we continue with the same data set, but with focus on hypothesis testing (linear hypotheses) and analysis of variance decomposition.

  1. Previously, we have read in the data and fitted linear models - look back to see what we found.
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410, 300, 
    320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female", "Male")), 
    place = factor(place, levels = c("A", "B", "C")))
  2. Fit the model model = lm(income ~ place - 1, data = data, x = TRUE). Here x = TRUE tells the function to calculate the design matrix X, which is stored as model$x.

Examine the results with summary and anova. What parametrization is used? What is the interpretation of the parameters? Which null hypothesis is tested in the anova-call? What is the result of the hypothesis test?

  3. Fit the models:
model1 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.treatment"))
head(model1$x)
summary(model1)
model2 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.sum"))
head(model2$x)
summary(model2)

We have talked about dummy- and effect encoding of categorical covariates. What are the parametrizations used here? What is the interpretation of the parameters and how do the parameter interpretations differ between model1 and model2?

  4. We want to test the (one-way ANOVA) null hypothesis that there is no effect of place. Use \(F_{obs}\) to do this, both with the dummy-variable and with the effect coding of the place factor. Compare the results from the two coding strategies. Hint: in R this can be done using anova(model0, model1), where model0 is the model with only an intercept.

  5. Suppose now that there are two factors, place and gender.

model3 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.treatment", 
    gender = "contr.treatment"))
summary(model3)
## 
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE, 
##     contrasts = list(place = "contr.treatment", gender = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.500  -6.250   0.000   9.687  25.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  306.250      6.896  44.411  < 2e-16 ***
## placeB        47.500      8.446   5.624 1.67e-05 ***
## placeC        65.000      8.446   7.696 2.11e-07 ***
## genderMale    41.250      6.896   5.982 7.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared:  0.8322, Adjusted R-squared:  0.8071 
## F-statistic: 33.07 on 3 and 20 DF,  p-value: 6.012e-08
anova(model3)
## Analysis of Variance Table
## 
## Response: income
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## place      2 18100.0  9050.0  31.720 6.260e-07 ***
## gender     1 10209.4 10209.4  35.783 7.537e-06 ***
## Residuals 20  5706.2   285.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.sum", 
    gender = "contr.sum"))
summary(model4)
## 
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE, 
##     contrasts = list(place = "contr.sum", gender = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.500  -6.250   0.000   9.687  25.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  364.375      3.448 105.680  < 2e-16 ***
## place1       -37.500      4.876  -7.691 2.13e-07 ***
## place2        10.000      4.876   2.051   0.0536 .  
## gender1      -20.625      3.448  -5.982 7.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared:  0.8322, Adjusted R-squared:  0.8071 
## F-statistic: 33.07 on 3 and 20 DF,  p-value: 6.012e-08
anova(model4)
## Analysis of Variance Table
## 
## Response: income
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## place      2 18100.0  9050.0  31.720 6.260e-07 ***
## gender     1 10209.4 10209.4  35.783 7.537e-06 ***
## Residuals 20  5706.2   285.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are the parameterizations? What is the interpretation of the parameters? Does the ANOVA table look different for the two parametrizations? Hint: orthogonality of design matrix for this balanced design?

  6. Finally, fit a model with interactions (model formula place*gender) for both contrast codings, and check if the interaction effect is significant (a minimal sketch is given below).
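
A minimal sketch for this task (reusing the data frame data and model3 defined above; the name model5 is just for illustration):

model5 <- lm(income ~ place * gender, data = data, x = TRUE, contrasts = list(place = "contr.treatment", 
    gender = "contr.treatment"))
summary(model5)
anova(model3, model5)  # F-test for the interaction terms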

Simulations in R

  1. Make R code that shows the interpretation of a 95% CI for \(\beta_j\) (one possible starting point is sketched after this list). Hint: Theoretical question 1.

  2. Make R code that shows the interpretation of a 95% PI for a new response at \({\bf x}_0\). Hint: Theoretical question 2.
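
One possible starting point for task 1 (a sketch under simple assumptions - a single covariate and known true coefficients - not a full solution):

set.seed(1)
n <- 100
beta0 <- 1
beta1 <- 2
sigma <- 1
x <- runif(n)
covered <- replicate(1000, {
    # simulate a new response vector, refit, and check if the CI covers beta1
    y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
    ci <- confint(lm(y ~ x))["x", ]
    ci[1] < beta1 && beta1 < ci[2]
})
mean(covered)  # should be close to 0.95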


Quiz with Kahoot!

Around 13.45 in the IL on Wednesday:

One person in each group goes to https://kahoot.it on a mobile device or a laptop. (The lecturer will hijack the screen for showing the questions, so it is difficult to use the PC.)

Give the pin (shown soon) and then give the team nickname “Group1”-“Group8”, or make your own personalized group name. Then - if you want - add nicks for all group members. Work together and provide only one answer to each question per group. In team mode there is a short “team talk” period before you can provide the answer, so you have some time. 1000 points if you answer correctly immediately, 500 if you answer when the time is up, 0 for wrong answers.

There is a prize for the winning team!

\(\odot\) Further reading