(Lastest changes: 26.08.2018. Theory for w1 is up to date. Missing: QC of ILw1.)

Overview

Learning material

Textbook: Chapter 2.2, 3 and B.4. (Chapter 3 was on the reading list for TMA4267 Linear statistical 2016-2018, so much of this module is know from before.)
Classnotes 30.08.2018
Classnotes 06.09.2018

Topics

First week

Aim of multiple linear regression.
Define and understand the multiple linear regression model - traditional and GLM way
parameter estimation with maximum likelihood (and least squares)
likelihood, score vector and Hessian (observed Fisher information matrix)
big data implementation (if time)
properties of parameter estimators
assessing model fit (diagnostic), residuals, QQ-plots
design matric: how to code categorical covariates (dummy or effect coding), and how to handle interactions

Jump to IL for first week

Second week

What did we do last week?
Statistical inference for parameter estimates
- confidence intervals,
- prediction intervals,
- hypothesis test,
- linear hypotheses
analysis of variance decompositions and $R^2$, sequential ANOVA table
DEVIANCE???
model selection with AIC and variants

Jump to IL for second week

FIRST WEEK

Aim of multiple linear regression

Construct a model to help understand the relationship between a response and one or several explanatory variables. [Correlation, or cause and effect?]
Construct a model to predict the reponse from a set of (one or several) explanatory variables. [More or less “black box”]

Munich rent index

Munich, 1999: 3082 observations on 9 variables.

rent: the net rent per month (in Euro).
rentsqm: the net rent per month per square meter (in Euro).
area: Living area in square meters.
yearc: year of construction.
location: quality of location: a factor indicating whether the location is average location, 1, good location, 2, and top location, 3.
bath: quality of bathroom: a a factor indicating whether the bath facilities are standard, 0, or premium, 1.
kitchen: Quality of kitchen: 0 standard 1 premium.
cheating: central heating: a factor 0 without central heating, 1 with central heating.
district: District in Munich.

More information in Fahrmeir et. al., (2013) page 5.

library("gamlss.data")
library(GGally)
ggpairs(rent99,lower = list(combo = wrap(ggally_facethist, binwidth = 0.5)))

Interesting questions

Is there a relationship between rent and area?
How strong is this relationship?
Is the relationship linear?
Are also other variables associated with rent?
How well can we predict the rent of an appartment?
Is the effect of area the same on rent for appartments at average, good and top location? (interaction)

Notation

${\bf Y}: (n \times 1)$ vector of responses (random variable) [e.g. one of the following: rent, rent pr sqm, weight of baby, ph of lake, volume of tree]

${\bf X}: (n \times p)$ design matrix [e.g. location of flat, gestation age of baby, chemical measurement of the lake, height of tree]

${\bf \beta}: (p \times 1)$ vector of regression parameters (intercept included, so $p=k+1$)

${\bf \varepsilon}: (n\times 1)$ vector of random errors. Used in “traditional way”.

We assume that pairs $({\bf x}_i^T,y_i)$ ($i=1,...,n$) are measured from sampling units. That is, the observation pair $({\bf x}_1^T,y_1)$ is independent from $({\bf x}_2^T,y_2)$, and so on.

Hands on: Munich rent index — response and covariates

Study the print-out and discuss the following questions:

What can be response, and what covariates? (using what you know about rents)
What type of response(s) do we have? (continuous, categorical, nominal, ordinal, discrete, factors, …).
What types of covariates? (continuous, categorical, nominal, ordinal, discrete, factors, …)
Explain what the elements of model.matrix are. (Hint: coding of location)

## Answers 
## a) rent or rentsqm for response and all other as covariates.
## b) rent, rentsqm (continuous). 
## c) area (continuous), yearc (ordinal or continuous), location (categorical - ordinal), bath, kichen, cheating (categorical, nominal?), district (categorical - nominal).
## d) our design matrix, contains numerical values for the continuous covariates, and dummy variable coding for categorical covariates (more about this later).

library("gamlss.data")
ds=rent99
colnames(ds)
summary(ds)
dim(ds)
head(ds)
str(ds$location)
contrasts(ds$location)

X=model.matrix(rentsqm~area+yearc+location+bath+kitchen+cheating+district,data=ds)
head(X)

## [1] "rent"     "rentsqm"  "area"     "yearc"    "location" "bath"    
## [7] "kitchen"  "cheating" "district"
##       rent            rentsqm             area            yearc     
##  Min.   :  40.51   Min.   : 0.4158   Min.   : 20.00   Min.   :1918  
##  1st Qu.: 322.03   1st Qu.: 5.2610   1st Qu.: 51.00   1st Qu.:1939  
##  Median : 426.97   Median : 6.9802   Median : 65.00   Median :1959  
##  Mean   : 459.44   Mean   : 7.1113   Mean   : 67.37   Mean   :1956  
##  3rd Qu.: 559.36   3rd Qu.: 8.8408   3rd Qu.: 81.00   3rd Qu.:1972  
##  Max.   :1843.38   Max.   :17.7216   Max.   :160.00   Max.   :1997  
##  location bath     kitchen  cheating    district   
##  1:1794   0:2891   0:2951   0: 321   Min.   : 113  
##  2:1210   1: 191   1: 131   1:2761   1st Qu.: 561  
##  3:  78                              Median :1025  
##                                      Mean   :1170  
##                                      3rd Qu.:1714  
##                                      Max.   :2529  
## [1] 3082    9
##       rent   rentsqm area yearc location bath kitchen cheating district
## 1 109.9487  4.228797   26  1918        2    0       0        0      916
## 2 243.2820  8.688646   28  1918        2    0       0        1      813
## 3 261.6410  8.721369   30  1918        1    0       0        1      611
## 4 106.4103  3.547009   30  1918        2    0       0        0     2025
## 5 133.3846  4.446154   30  1918        2    0       0        1      561
## 6 339.0256 11.300851   30  1918        2    0       0        1      541
##  Factor w/ 3 levels "1","2","3": 2 2 1 2 2 2 1 1 1 2 ...
##   2 3
## 1 0 0
## 2 1 0
## 3 0 1
##   (Intercept) area yearc location2 location3 bath1 kitchen1 cheating1
## 1           1   26  1918         1         0     0        0         0
## 2           1   28  1918         1         0     0        0         1
## 3           1   30  1918         0         0     0        0         1
## 4           1   30  1918         1         0     0        0         0
## 5           1   30  1918         1         0     0        0         1
## 6           1   30  1918         1         0     0        0         1
##   district
## 1      916
## 2      813
## 3      611
## 4     2025
## 5      561
## 6      541

Model

The traditional way

\[{\bf Y=X \beta}+{\bf\varepsilon}\] is called a classical linear model if the following is true:

$\text{E}(\bf{\varepsilon})=\bf{0}$.
$\text{Cov}(\bf{\varepsilon})=\text{E}(\bf{\varepsilon}\bf{\varepsilon}^T)=\sigma^2\bf{I}$.
The design matrix has full rank, $\text{rank}({\bf X})=k+1=p$.

The classical normal linear regression model is obtained if additionally

$\varepsilon\sim N_n(\bf{0},\sigma^2\bf{I})$ holds.

For random covariates these assumptions are to be understood conditionally on $\bf{X}$.

The GLM way

Independent pairs $(Y_i, {\bf x}_i)$ for $i=1,\ldots,n$.

Random component: $Y_i \sim N$ with $\text{E}(Y_i)=\mu_i$ and $\text{Var}(Y_i)=\sigma^2$.
Systematic component: $\eta_i={\bf x}_i^T \boldsymbol{\beta}$.
Link function: linking the random and systematic component (linear predictor): Identity link and response function. $\mu_i=\eta_i$.

Questions

Compare the traditional and GLM way. Have we made the same assumptions for both?
What is the connection between each ${\bf x}_i$ and the design matrix?
What is “full rank”? Why is this needed? Example of rank less than $p$?
Why do you think we move from traditional to GLM way? Could we not just let $\varepsilon$ be from binomial, Poisson, etc. distribution?

Parameter estimation

In multiple linear regression there are two popular methods for estimating the regression parameters in $\beta$: maximum likelihood and least squares. These two methods give the same estimator when we assume the normal linear regression model. We will in this module focus on maximum likelihood estimation, since that can be used also when we have non-normal responses (modules 3-6: binomial, Poisson, gamma, multinomial).

Likelihood theory (from B.4)

Likelihood $L(\beta)$

We assume that pairs of covariates and response are measured independently of each other: $({\bf x}_i,Y_i)$, and $Y_i$ follows the distribution specified above, and ${\bf x}_i$ is fixed.

\[L(\beta)=\prod_{i=1}^n L_i(\beta)=\prod_{i=1}^n f(y_i; \beta)\] Q: fill in with the normal density for $f$ and the multiple linear regression model.

Loglikelihood $l(\beta)$

The log-likelihood is just the natural log of the likelihood, and we work with the log-likelihood because this makes the mathematics simpler - since we work with exponential families. The main aim with the likelihood is to maximize it to find the maximum likelihood estimate, and since the log is a monotone function the maximum of the log-likelihood will be in the same place as the maximum of the likelihood.

\[ l(\beta)=\ln L(\beta)=\sum_{i=1}^n \ln L_i(\beta)=\sum_{i=1}^n l_i(\beta)\\ \] Observe that the log-likelihood is a sum of invidual contributions for each observation pair $i$.

Q: fill in with the normal density for $f$ and the multiple linear regression model.

Repetition: rules for derivatives of vector

H¨{a}rdle and Simes (2015), page 65.

Let ${\beta}$ be a $p$-dimensional column vector of interest,
and let $\frac{\partial}{\partial {\beta}}$ denote the $p$-dimensional vector with partial derivatives wrt the $p$ elements of ${\beta}$.
Let ${\bf d}$ be a $p$-dimensional column vector of constants and
${\bf D}$ be a $p\times p$ symmetric matrix of constants.

Rule 1: \[\frac{\partial}{\partial \boldsymbol{\beta}}({\bf d}^T\boldsymbol{\beta})=\frac{\partial}{\partial \boldsymbol{\beta}}(\sum_{j=1}^p d_j \beta_j)={\bf d}\] Rule 2: \[\frac{\partial}{\partial \boldsymbol{\beta}}(\boldsymbol{\beta}^T{\bf D}\boldsymbol{\beta})=\frac{\partial}{\partial \boldsymbol{\beta}}(\sum_{j=1}^p \sum_{k=1}^p \beta_j d_{jk} \beta_k)=2{\bf D}\boldsymbol{\beta}\]

Rule 3: The Hessian of the quadratic form $\boldsymbol{\beta}^T{\bf D}\boldsymbol{\beta}$ is

\[\frac{\partial^2 \boldsymbol{\beta}^T{\bf D}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}\partial \boldsymbol{\beta}^T}= 2{\bf D}\]

Score function $s(\beta)$

The score function is a $p\times 1$ vector, $s(\beta)$, with the partial derivatives of the log-likelihood with respect to the $p$ elements of the $\beta$ vector.

\[s(\beta)=\frac{\partial l(\beta)}{\partial \beta}= \sum_{i=1}^n \frac{\partial l_i(\beta)}{\partial \beta}= \sum_{i=1}^n s_i(\beta)\]

Again, observe that the score function is a sum of individual contributions for each observation pair $i$.

Q: fill in for the multiple linear regression model.

To find the maximum likelihood estimate $\hat{\beta}$ we solve the set of $p$ equations: \[s(\hat{\beta})=0\]

Q: fill in for the multiple linear regression model. Specify what the normal equations are.

For the normal linear regression model, these equations $s(\hat{\beta})=0$ have a solution to be written on closed form.

Least squares and maximum likelihood (ML) estimator for ${\bf \beta}$:

\[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]

Q: How can you see that least squares and ML gives the same estimator?

Looking ahead: Hessian and Fisher information

But, for other distribution than the normal we get a set of non-linear equations when we look at $s(\hat{\beta})=0$, and then we will use the Newton-Raphson or Fisher Scoring iterative methods.

Observed Fisher information matrix $H(\beta)$

\[H(\beta) = -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} = -\frac{\partial s(\beta)}{\partial\beta^T}\]

so this is minus the Hessian of the loglikelihood. $H(\beta)$ may be considered as a local measure of information that the likelihood contains. The higher the curvature of the log-likelihood near its maximum the more information is providd by the likelihood about the unknown parameter. Since we look at minus the Hessian, we have a positive $H(\beta)$ near the maximum.

Q: Calculate this for the multiple linear regression model. What is the dimension of $H(\beta)$?

In addition we also use the expected Fisher information matrix $F(\beta)$ which we may find in two ways, one is by taking the mean of the observed Fisher information matrix:

\[F(\beta) = E\left( -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} \right).\]

Q: Calculate this for the multiple linear regression model. What is the dimension of $F(\beta)$?

In Module 3 we need the Fisher information matrix in the Newton-Raphson method, and also to find the (asympotic) covariance matrix of our estimated coefficents $\hat{\beta}$ - so much more about this then.

Hands on: Munich rent index parameter estimates

Explain what the values under Estimate mean in practice.

fit=lm(rentsqm~area+yearc+location+bath+kitchen+cheating,data=ds)
summary(fit)

## 
## Call:
## lm(formula = rentsqm ~ area + yearc + location + bath + kitchen + 
##     cheating, data = ds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4303 -1.4131 -0.1073  1.3244  8.6452 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -45.475484   3.603775 -12.619  < 2e-16 ***
## area         -0.032330   0.001648 -19.618  < 2e-16 ***
## yearc         0.026959   0.001846  14.606  < 2e-16 ***
## location2     0.777133   0.076870  10.110  < 2e-16 ***
## location3     1.725068   0.236062   7.308 3.45e-13 ***
## bath1         0.762808   0.157559   4.841 1.35e-06 ***
## kitchen1      1.136908   0.183088   6.210 6.02e-10 ***
## cheating1     1.765261   0.129068  13.677  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.031 on 3074 degrees of freedom
## Multiple R-squared:  0.3065, Adjusted R-squared:  0.3049 
## F-statistic: 194.1 on 7 and 3074 DF,  p-value: < 2.2e-16

Reproduce the values under Estimate by calculating without the use of lm.

X=model.matrix(rentsqm~area+yearc+location+bath+kitchen+cheating,data=ds)
Y=ds$rentsqm
betahat=solve(t(X)%*%X)%*%t(X)%*%Y
#betahat-fit$coefficients
print(betahat)

##                     [,1]
## (Intercept) -45.47548356
## area         -0.03233033
## yearc         0.02695857
## location2     0.77713297
## location3     1.72506792
## bath1         0.76280784
## kitchen1      1.13690814
## cheating1     1.76526110

Projection matrices: idempotent, symmetric/orthogonal

(Optional - known from TMA4267)

First, we define predictions as $\hat{\bf Y}={\bf X}\hat{\beta}$, and inserted the ML (and LS) estimate we get $\hat{\bf Y}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}$.

We define the projection matrix \[ {\bf H}={\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T\] called the hat matrix. This simplifies the notation for the predictions, \[\hat{\bf Y}={\bf H}{\bf Y}\] so the hat matrix is putting the hat on the response ${\bf Y}$.

In addition we define residuals as \[\begin{align*} \hat{\varepsilon}&={\bf Y}-\hat{\bf Y} \\ \hat{\varepsilon}&={\bf Y}-{\bf HY}=({\bf I-H}){\bf Y}\\ \end{align*}\]

so we have a second projection matrix \[ {\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T \]

Geometry of Least Squares — involving our two projection matrices

(Optional - known from TMA4267)

Mean response vector: $\text{E}({\bf Y})={\bf X}{\bf \beta}$
As $\beta$ varies, ${\bf X}\beta$ spans the model plane of all linear combinations. I.e. the space spanned by the columns of ${\bf X}$: the column-space of ${\bf X}$.
Due to random error (and unobserved covariates), ${\bf Y}$ is not exactly a linear combination of the columns of ${\bf X}$.
LS-estimation chooses $\hat{\beta}$ such that ${\bf X}\hat{\beta}$ is the point in the column-space of ${\bf X}$ that is closes to ${\bf Y}$.
The residual vector $\hat{\varepsilon}={\bf Y}-\hat{{\bf Y}}=({\bf I-H}){\bf Y}$ is perpendicular to the column-space of ${\bf X}$.
Multiplication by ${\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T$ projects a vector onto the column-space of ${\bf X}$.
Multiplication by ${\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T$ projects a vector onto the space perpendicular to the column-space of ${\bf X}$.

Nice graphical presentations: Putanen, Styan and Isotalo: Matrix Tricks for Linear Statistical Models: Our Personal Top Twenty, Figure 8.3.

Restricted maximum likelihood estimator for ${\bf \sigma}^2$

\[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}{\hat{\beta}})^T({\bf Y}-{\bf X}{\bf \hat{\beta}})=\frac{\text{SSE}}{n-p}\]

In the generalized linear models setting (remember exponential family from Module 1) we will look at the parameter $\sigma^2$ as a nuisance parameter = parameter that is not of interest to us. Our our focus will be on the parameters of interest - which will be related to the mean of the response, which is modelled using our covariate - so the regression parameters $\beta$ are therefore our prime focus.

However, to perform inference we need an estimator for $\sigma^2$.

The maximum likelihood estimator for $\sigma^2$ is $\frac{\text{SSE}}{n}$, which is found from maximizing the likelihood inserted our estimate of $\hat{\beta}$ \[L(\hat{\beta},{\sigma^2})=(\frac{1}{2\pi})^{n/2}(\frac{1}{\sigma^2})^{n/2}\exp(-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}))\] \[\begin{align*} l(\hat{\beta},{\sigma^2})&=\text{ln}(L(\hat{\beta},{\sigma^2}))\\ &=-\frac{n}{2}\text{ln}(2\pi)-\frac{n}{2}\text{ln}\sigma^2-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})\\ \end{align*}\]

The score vector with respect to $\sigma^2$ is \[\frac{\partial l}{\partial \sigma^2}=0-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}) \] Solving $\frac{\partial l}{\partial \sigma^2}=0$ gives us the estimator \[ \frac{1}{n}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n}\] But, this estimator is biased.

To prove this you may use the trace-formula, that is $\text{E}({\bf Y}^T {\bf A}{\bf Y})=\text{tr}({\bf A}\text{Cov}({\bf Y}))+\text{E}({\bf Y})^T{\bf A}\text{E}({\bf Y})$, and we use that $\text{SSE}={\bf Y}^T ({\bf I}-{\bf H}){\bf Y}$. This was done in class notes from TMA4267 - lecture 10

But, the estimator is asympotically unbiased (unbiased when the sample size $n$ increases to infinity).

When an unbiased version is preferred, it is found using restricted maximum likelihood (REML). We will look into REML-estimation in Module 7. In our case the (unbiased) REML estimate is \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\]

The restricted maximum likelihood estimate is used in lm.

Q: What does it mean that the REML estimate is unbiased? Where is the estimate $\hat{\sigma}$ in the regression output? (See output from lm for the rent index example.)

## Answer: 
## Residual standard error: 2.031 is sigmahat

Properties for the normal linear model

To be able to do inference (=make confidence intervals, prediction intervals, test hypotheses) we need to know about the properties of our parameter estimators in the (normal) linear model.

Least squares and maximum likelihood estimator for ${\bf \beta}$: \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with $\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})$.
Restricted maximum likelihood estimator for ${\bf \sigma}^2$: \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}\hat{\beta})^T({\bf Y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\] with $\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}$.
Statistic for inference about $\beta_j$, $c_{jj}$ is diagonal element $j$ of $({\bf X}^T{\bf X})^{-1}$. \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] This requires that $\hat{\beta}_j$ and $\hat{\sigma}$ are independent (see below).

However, when we work with large samples then $n-p$ becomes large and the $t$ distribution goes to a normal distribution, so we may use the standard normal in place of the $t_{n-p}$.

Asymptotically we have: \[\hat{\beta}\sim N_{p}(\beta,\tilde{\sigma}^2({\bf X}^T{\bf X})^{-1})\]. and \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\tilde{\sigma}}\sim N(0,1)\] where $\tilde{\sigma}^2=\frac{\text{SSE}}{n}$ (the ML estimator).

Q: Pointing forwards: do you see any connection between the covariance matrix of $\hat{\boldsymbol{\beta}}$ and the Fisher information?

Are $\hat{{\bf \beta}}$ and SSE are independent? (optional)

Can be proven using knowledge from TMA4267 on properties of the multivariate normal distribution.

Independence: Let ${\bf X}_{(p \times 1)}$ be a random vector from $N_p({\bf \mu},{\bf \Sigma})$. Then ${\bf A}{\bf X}$ and ${\bf B}{\bf X}$ are independent iff ${\bf A}{\bf \Sigma}{\bf B}^T={\bf 0}$.

We have:

${\bf Y}\sim N_n({\bf X}{\bf \beta},\sigma^2{\bf I})$
${\bf AY}=\hat{{\bf \beta}}=({\bf X}^T{\bf X})^{-1}{\bf X}^T {\bf Y}$, and
${\bf BY}=({\bf I}-{\bf H}){\bf Y}$.
Now ${\bf A}\sigma^2{\bf I}{\bf B}^T=\sigma^2 {\bf A}{\bf B}^T=\sigma^2 ({\bf X}^T{\bf X})^{-1}{\bf X}^T ({\bf I}-{\bf H})={\bf 0}$
since ${\bf X}({\bf I}-{\bf H})={\bf X}-{\bf HX}={\bf X}-{\bf X}={\bf 0}$.
We conclude that $\hat{{\bf \beta}}$ is independent of $({\bf I}-{\bf H}){\bf Y}$,
and, since SSE=function of $({\bf I}-{\bf H}){\bf Y}$: SSE=${\bf Y}^T({\bf I}-{\bf H}){\bf Y}$,
then $\hat{{\bf \beta}}$ and SSE are independent, and the result with $T_j$ being t-distributed with $n-p$ degrees of freedom is correct.

Remark: a similar result will exist for GLMs, using the concept of orthogonal parameters.

Checking model assumptions

In the normal linear model we have made the following assumptions.

Linearity of covariates: ${\bf Y}={\bf X}\beta+\varepsilon$. Problem: non-linear relationship?
Homoscedastic error variance: $\text{Cov}(\varepsilon)=\sigma^2 {\bf I}$. Problem: Non-constant variance of error terms
Uncorrelated errors: $\text{Cov}(\varepsilon_i,\varepsilon_j)=0$.
Additivity of errors: ${\bf Y}={\bf X}\beta+\varepsilon$
Assumption of normality: $\varepsilon \sim N_n({\bf 0},\sigma^2{\bf I})$

The same assumptions are made when we do things the GLM way for the normal linear model.

In addtion the following might cause problems:

Outliers
High leverage points
Collinearity

General theory on QQ-plots

Read this for yourself. You do not need to understand this in detail, but is useful to have a basic idea why we look for a straight line in a QQ-plot. There is one question about this in the ILw1.

Go to separate R Markdown or html document: QQ–plot as Rmd or QQ–plot as html

Residuals

If we assume the normal linear model then we know that the residuals ($n\times 1$ vector) \[\hat{\varepsilon}={\bf Y}-{\bf \hat{Y}}=({\bf I}-{\bf H}){\bf Y}\] has a normal (singular) distribution with mean $\text{E}(\hat{\varepsilon})={\bf 0}$ and covariance matrix $\text{Cov}(\hat{ \varepsilon})=\sigma^2({\bf I}-{\bf H})$ where ${\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T$.

This means that the residuals (possibly) have different variance, and may also be correlated.

Our best guess for the error $\varepsilon$ is the residual vector $\hat{\varepsilon}$, and we may think of the residuals as predictions of the errors. Be aware: don’t mix errors (the unobserved) with the residuals (“observed”).

But, we see that the residuals are not independent and may have different variance, therefore we will soon define variants of the residuals that we may use to assess model assumptions after a data set is fitted.

Q: How can we say that the residuals can have different variance and may be correlated? Why is that a problem?

We would like to check the model assumptions - we see that they are all connected to the error terms. But, but we have not observed the error terms $\varepsilon$ so they can not be used for this. However, we have made “predictions” of the errors - our residuals. And, we want to use our residuals to check the model assumptions.

That is, we want to check that our errors are independent, homoscedastic (same variance for each observation), and not dependent on our covariates - and we want to use the residuals (observed) in place of the errors (unobserved). Then it would have been great if the residuals have these properties when the underlying errors have. To amend our problem we need to try to fix the residual so that they at least have equal variances. We do that by working with standardized or studentized residuals.

Standardized residuals:

\[r_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\] where $h_ii$ is the $i$th diagonal element of the hat matrix ${\bf H}$.

In R you can get the standardized residuals from an lm-object (named fit) by rstandard(fit).

Studentized residuals:

\[r^*_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}\] where $\hat{\sigma}_{(i)}$ is the estimated error variance in a model with observation number $i$ omitted. This seems like a lot of work, but it can be shown that it is possible to calculated the studentized residuals directly from the standardized residuals: \[r^*_i=r_i (\frac{n-p-1}{n-p-r_i^2})^{1/2}\]

In R you can get the studentized residuals from an lm-object (named fit) by rstudent(fit).

Plotting residuals - and what to do when assumptions are violated?

Some important plots

Plot the residuals, $r^*_i$ against the predicted values, $\hat{y}_i$.

Dependence of the residuals on the predicted value: wrong regression model?
Nonconstant variance: transformation or weighted least squares is needed?

Plot the residuals, $r^*_i$, against predictor variable or functions of predictor variables. Trend suggest that transformation of the predictors or more terms are needed in the regression.
Assessing normality of errors: QQ-plots and histograms of residuals. As an additional aid a test for normality can be used, but must be interpreted with caution since for small sample sizes the test is not very powerful and for large sample sizes even very small deviances from normality will be labelled as significant.
Plot the residuals, $r^*_i$, versus time or collection order (if possible). Look for dependence or autocorrelation.

Residuals can be used to check model assumptions, and also to discover outliers.

Diagnostic plots in R

More information on the plots here: http://data.library.virginia.edu/diagnostic-plots/ and http://ggplot2.tidyverse.org/reference/fortify.lm.html

You can use the function fortify.lm in ggplot2 to create a dataframe from an lm-object, which ggplot uses automatically when given a lm-object. This can be used to plot diagnostic plots.

For simplicity we use the Munch rent index with rent as response and only area as the only covariate. (You may change the model to a more complex one, and rerun the code chunks.)

fit <- lm(rent ~ area, data=rent99) # Run a regression analysis
format(head(fortify(fit)), digits = 4L)
# to show what fortify implicitly does in ggplot
# for more information ggplot2::fortify.lm

##    rent area     .hat .sigma   .cooksd .fitted  .resid .stdresid
## 1 109.9   26 0.001312  158.8 5.870e-04   260.0 -150.00   -0.9454
## 2 243.3   28 0.001219  158.8 1.678e-05   269.6  -26.31   -0.1658
## 3 261.6   30 0.001130  158.8 6.956e-06   279.2  -17.60   -0.1109
## 4 106.4   30 0.001130  158.8 6.711e-04   279.2 -172.83   -1.0891
## 5 133.4   30 0.001130  158.8 4.779e-04   279.2 -145.85   -0.9191
## 6 339.0   30 0.001130  158.8 8.032e-05   279.2   59.79    0.3768

Residuals vs fitted values

A plot with the fitted values of the model on the x-axis and the residuals on the y-axis shows if the residuals have non-linear patterns. The plot can be used to test the assumption of a linear relationship between the response and the covariates. If the residuals are spread around a horizontal line with no distinct patterns, it is a good indication on no non-linear relationships, and a good model.

Does this look like a good plot for this data set?

ggplot(fit, aes(.fitted, .stdresid)) +
  geom_point(pch = 21) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
  labs(x = "Fitted values", y = "Standardized residuals", title = "Fitted values vs standardized residuals", subtitle = deparse(fit$call))

Normal Q-Q

This plot shows if the residuals are Gaussian (normally) distributed. If they follow a straigt line it is an indication that they are, and else they are probably not.

ggplot(fit, aes(sample = .stdresid)) +
  stat_qq(pch = 19) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
  labs(x = "Theoretical quantiles", y = "Standardized residuals", title = "Normal Q-Q", subtitle = deparse(fit$call))

library(nortest)
ad.test(rstudent(fit))

## 
##  Anderson-Darling normality test
## 
## data:  rstudent(fit)
## A = 6.4123, p-value = 9.809e-16

Scale-location

This is also called spread-location plot. It shows if the residuals are spread equally along the ranges of predictors. Can be used to check the assumption of equal variance (homoscedasticity). A good plot is one with a horizontal line with randomly spread points.

Is this plot good for your data?

ggplot(fit, aes(.fitted, sqrt(abs(.stdresid)))) +
  geom_point() +
  geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
  labs(x = "Fitted values", y = expression(sqrt("Standardized residuals")), title = "Scale-location", subtitle = deparse(fit$call))

Residual vs Leverage

This plot can reveal influential outliers. Not all outliers are influential in linear regression; even though data have extreme values, they might not be influential to determine the regression line (the results don’t differ much if they are removed from the data set). These influential outliers can be seen as observations that does not get along with the trend in the majority of the observations. In plot.lm, dashed lines are used to indicate the Cook’s distance, instead of using the size of the dots as is done here.

Cook’s distance is the Euclidean distance between the $\mathbf{\hat{y}}$ (the fitted values) and $\mathbf{\hat{y}}_{(i)}$ (the fitted values calculated when the $i$-th observation is omitted from the regression). This is then a measure on how much the model is influences by observation $i$. The distance is scaled, and a rule of thumb is to examine observations with Cook’s distance larger than 1, and give some attention to those with Cook’s distance above 0.5.

Leverage is defined as the diagonal elements of the hat matrix, i.e., the leverage of the $i$-th data point is $h_{ii}$ on the diagonal of $\mathbf{H = X(X^TX)^{-1}X^T}$. A large leverage indicated that the observation ($i$) has a large influence on the estimation results, and that the covariate values ($\mathbf{x}_i$) are unusual.

ggplot(fit, aes(.hat, .stdresid)) +
  geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
  geom_point(aes(size = .cooksd)) +
  scale_size_continuous("Cook's dist.") +
  labs(x = "Leverage", y = "Standardized residuals", title = "Residuals vs Leverage", subtitle = deparse(fit$call))

(Some observations does not fit our model, but if we fit a more complex model this may change.)

Categorical covariates - dummy and effect coding

(read for yourself - topic of ILw1)

Example: consider our rent dataset with rent as reponse, and continuous covariate area and categorical covariate location. Let the location be a factor with levels average, good, excellent.

library(gamlss.data)
library(tidyverse)
library(GGally)

ds=rent99 %>% 
  select(location, area,rent)
levels(ds$location)
# change to meaningful names
levels(ds$location)=c("average","good","excellent")
ggpairs(ds)

## [1] "1" "2" "3"

Q: comment on what you see in the ggpairs plot.

Categorical covariates may either be ordered or unordered. We will only consider unordered categories here. In general, we could like to estimate regression coefficients for all levels for the categorical covariates. However, if we want to include an intercept in our model we can only include codings for one less variable than the number of levels we have - or else our design matrix will not have full rank.

Q: Assume you have a categorical variable with three levels. Check for yourself that making a design matrix with one intercept and three columns with dummy (0-1) variable coding will result in a matrix that is singular.

# make "wrong" dummy variable coding with 3 columns
n=length(ds$location)
X=cbind(rep(1,n),ds$area,rep(0,n),rep(0,n),rep(0,n))
X[ds$location=="average",3]=1
X[ds$location=="good",4]=1
X[ds$location=="excellent",5]=1
X[c(1,3,69),]
library(Matrix)
dim(X)
rankMatrix(X)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1   26    0    1    0
## [2,]    1   30    1    0    0
## [3,]    1   55    0    0    1
## [1] 3082    5
## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.843415e-13

This is why we need to instead work with different ways of coding categorical variables. One solution is to not include an intercept in the model, but that is often not what we want. We will look at two other solutions - one where we decide on a reference category (that we not include in the coding, and therefore is kind of included in the intercept - this is called “treatment coding”) and one where we require that the the sum of the coeffisients are zero (called “effect coding). This mainly effects how we interpret parameter estimates and communicate our findings to the world.

If we fit a regression model with lm to the data with rent as response and area and location as covariates, a model matrix is made - and how to handle the categorical variable is either specified the call to lm in contrasts=list(location="contr.treatment") (or to model.matrix) or globally for all categorical variables with options(contrasts=c("contr.treatment","contr.poly"))- where first element give choice for unordered factor (then treatment contrast is default) and second for ordered (and then this polynomial contrast is default). We will only work with unordered factors now.

Dummy variable coding aka treatment contrast

This is the default coding. The reference level is automatically chosen as the “lowest” level (sorted alphabetically). For our example this means that the reference category for location is “average”. If we instead wanted “good” to be reference category we could relevel the factor.

X1=model.matrix(~area+location,data=ds)
X1[c(1,3,69),]
ds$locationRELEVEL=relevel(ds$location,ref="good")
X2=model.matrix(~area+locationRELEVEL,data=ds)
X2[c(1,3,69),]

##    (Intercept) area locationgood locationexcellent
## 1            1   26            1                 0
## 3            1   30            0                 0
## 69           1   55            0                 1
##    (Intercept) area locationRELEVELaverage locationRELEVELexcellent
## 1            1   26                      0                        0
## 3            1   30                      1                        0
## 69           1   55                      0                        1

So, what does this mean in practice? Model 1 has average as reference category and model 2 good.

fit1=lm(rent~area+location,data=ds,contrasts = list(location="contr.treatment"))
summary(fit1)
fit2=lm(rent~area+locationRELEVEL,data=ds,contrasts = list(locationRELEVEL="contr.treatment"))
summary(fit2)

## 
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       128.0867     8.6947  14.732  < 2e-16 ***
## area                4.7056     0.1202  39.142  < 2e-16 ***
## locationgood       28.0040     5.8662   4.774 1.89e-06 ***
## locationexcellent 131.1075    18.2614   7.180 8.73e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              156.0907     9.4950  16.439  < 2e-16 ***
## area                       4.7056     0.1202  39.142  < 2e-16 ***
## locationRELEVELaverage   -28.0040     5.8662  -4.774 1.89e-06 ***
## locationRELEVELexcellent 103.1034    18.4021   5.603 2.30e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16

Q: Comment on the print-out. How do we interpret the intercept estimate?

Effect coding aka sum-zero-contrast:

This is an equally useful and popular coding - and this is the coding that is preferred when working with analysis of variance in general. The effect coding assumes that the sum of the effects for the levels of the factor sums to zero, and this is done with the following coding scheme (Model 3 with the original location and 4 with the releveled version.)

X3=model.matrix(~area+location,data=ds,contrasts=list(location="contr.sum"))
X3[c(1,3,69),]
X4=model.matrix(~area+locationRELEVEL,data=ds,contrasts=list(locationRELEVEL="contr.sum"))
X4[c(1,3,69),]

##    (Intercept) area location1 location2
## 1            1   26         0         1
## 3            1   30         1         0
## 69           1   55        -1        -1
##    (Intercept) area locationRELEVEL1 locationRELEVEL2
## 1            1   26                1                0
## 3            1   30                0                1
## 69           1   55               -1               -1

Observe the coding scheme. This means that when we find “the missing location level estimate” as the negative of the sum of the parameter estimates for the other estimated levels.

So, what does this mean in practice?

fit3=lm(rent~area+location,data=ds,contrasts = list(location="contr.sum"))
summary(fit3)
fit4=lm(rent~area+locationRELEVEL,data=ds,contrasts = list(locationRELEVEL="contr.sum"))
summary(fit4)

## 
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181.1238    10.6383  17.026  < 2e-16 ***
## area          4.7056     0.1202  39.142  < 2e-16 ***
## location1   -53.0372     6.6428  -7.984 1.98e-15 ***
## location2   -25.0331     6.7710  -3.697 0.000222 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -790.98 -100.89   -4.87   94.47 1004.98 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      181.1238    10.6383  17.026  < 2e-16 ***
## area               4.7056     0.1202  39.142  < 2e-16 ***
## locationRELEVEL1 -25.0331     6.7710  -3.697 0.000222 ***
## locationRELEVEL2 -53.0372     6.6428  -7.984 1.98e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared:  0.3555, Adjusted R-squared:  0.3549 
## F-statistic:   566 on 3 and 3078 DF,  p-value: < 2.2e-16

Q: Comment on the print-out. How do we now interpret the intercept estimate?

Interactions

(read for yourself)

To illustrate how interactions between covariates can be included we use the ozone data set from the ElemStatLearn library. This data set is measurements from 1973 in New York and contains 111 observations of the following variables:

ozone : ozone concentration (ppm)
radiation : solar radiation (langleys)
temperature : daily maximum temperature (F)
wind : wind speed (mph)

We start by fitting a multiple linear regression model to the data, with ozone as our response variable and temperature and wind as covariates.

ozone	radiation	temperature	wind
41	190	67	7.4
36	118	72	8.0
12	149	74	12.6
18	313	62	11.5
23	299	65	8.6
19	99	59	13.8

## 
## Call:
## lm(formula = ozone ~ temperature + wind, data = ozone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.160 -13.209  -3.089  10.588  98.470 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -67.2008    23.6083  -2.846  0.00529 ** 
## temperature   1.8265     0.2504   7.293 5.32e-11 ***
## wind         -3.2993     0.6706  -4.920 3.12e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.72 on 108 degrees of freedom
## Multiple R-squared:  0.5817, Adjusted R-squared:  0.574 
## F-statistic:  75.1 on 2 and 108 DF,  p-value: < 2.2e-16

The model can be written as: \[Y = \beta_0 + \beta_1 x_t + \beta_2 x_w + \varepsilon\] In this model we have assumed that increasing the value of one covariate is independent of the other covariates. For example: by increasing the temperature by one-unit always increases the response value by $\beta_2 \approx 1.651$, regardless of the value of wind.

However, one might think that the covariate wind (wind speed) might act differently upon ozone for different values of temperature and vice verse. \[\begin{aligned} Y &= \beta_0 + \beta_1 x_t + \beta_2 x_w + \beta_3\cdot(x_t \cdot x_w) +\varepsilon \\ &= \beta_0 + (\beta_1 + \beta_3 x_w) \cdot x_t + \beta_2 x_w + \varepsilon \\ &= \beta_0 + \beta_1 x_t + (\beta_2 + \beta_3 x_t) \cdot x_w + \varepsilon \end{aligned}.\] We fit this model in R. An interaction term can be included in the model using the * symbol.

Q: Look at the summary below. Is this a better model than without the interaction term? It the term significant?

ozone.int = lm(ozone~temperature+wind+ temperature*wind, data=ozone)
summary(ozone.int)

## 
## Call:
## lm(formula = ozone ~ temperature + wind + temperature * wind, 
##     data = ozone)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.929 -11.190  -3.037   8.209  97.440 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -239.94146   48.59004  -4.938 2.92e-06 ***
## temperature         4.00151    0.59311   6.747 8.02e-10 ***
## wind               13.60882    4.28070   3.179  0.00193 ** 
## temperature:wind   -0.21747    0.05446  -3.993  0.00012 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.36 on 107 degrees of freedom
## Multiple R-squared:  0.636,  Adjusted R-squared:  0.6258 
## F-statistic: 62.31 on 3 and 107 DF,  p-value: < 2.2e-16

Below we see that the interaction term is highly significant. The $p$-value is very small, so that there is strong evidence that $\beta_3 \neq 0$. Furthermore, $R^2_{\text{adj}}$ has increased, indicating that more of the variability in the data has been explained by the model (than without the interaction).

Interpretation of the interaction term:

If we now increase the temperature by $10^{\circ}$ F, the increase in wind speed will be \[(\hat \beta_1+\hat \beta_3 \cdot x_w) \cdot 10 = (4.0 -0.22 \cdot x_w) \cdot 10 = 40-2.2 x_w \text{ units}.\]
If we increase the wind speed by 10 mph, the increase in temperature will be \[(\hat \beta_2 + \hat \beta_3 \cdot x_t) \cdot 10 = (14 -0.22 \cdot x_t) \cdot 10 = 140-2.2 x_t \text{ units}.\]

The hierarchical principle

It is possible that the interaction term is higly significant, but the main effects are not.

In our ozone.int model above: the main effects are temperature and wind. The hierarchical principle states that if we include an interaction term in our model, the main effects are also to be included, even if they are not significant. This means that if the coefficients $\hat \beta_1$ or $\hat \beta_2$ would be insignificant, while the coefficient $\hat \beta_3$ is significant, $\hat \beta_1$ and $\hat \beta_2$ should still be included in the model.

There reasons for this is that a model with interaction terms, but without the main effects is hard to interpret.

Interactions between qualitative (discrete) and quantitative (continuous) covariates

We create a new variable temp.cat which is a temperature as a qualitative covariate with two levels and fit the model: \[\begin{aligned}y&=\beta_0 + \beta_1 x_w + \begin{cases} \beta_2 + \beta_3 x_w &\text{ if temperature="low"}\\ 0 &\text{ if temperature = "high"}\end{cases} \\\\ &= \begin{cases} (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \cdot x_w &\text{ if temperature="low"}\\ \beta_0 + \beta_1 x_w &\text{ if temperature="high""} \end{cases} \end{aligned}\]

temp.cat = ifelse(ozone$temperature < mean(ozone$temperature), "low", "high")
ozone2 = cbind(ozone, temp.cat)
kable(head(ozone2))
ozone.int2 = lm(ozone~wind + temp.cat+ temp.cat*wind, data=ozone2)
summary(ozone.int2)

ozone	radiation	temperature	wind	temp.cat
41	190	67	7.4	low
36	118	72	8.0	low
12	149	74	12.6	low
18	313	62	11.5	low
23	299	65	8.6	low
19	99	59	13.8	low

## 
## Call:
## lm(formula = ozone ~ wind + temp.cat + temp.cat * wind, data = ozone2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.291  -9.091  -1.307  11.227  71.815 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      119.0450     7.5004  15.872  < 2e-16 ***
## wind              -6.7235     0.8195  -8.204 5.61e-13 ***
## temp.catlow      -92.6316    12.9466  -7.155 1.09e-10 ***
## wind:temp.catlow   6.0544     1.1999   5.046 1.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.26 on 107 degrees of freedom
## Multiple R-squared:  0.6393, Adjusted R-squared:  0.6291 
## F-statistic:  63.2 on 3 and 107 DF,  p-value: < 2.2e-16

interceptlow = coef(ozone.int2)[1]+coef(ozone.int2)[3]
slopelow = coef(ozone.int2)[2]+coef(ozone.int2)[4]
intercepthigh = coef(ozone.int2)[1]
slopehigh = coef(ozone.int2)[2]
ggplot(ozone)+geom_line(aes(y=interceptlow+slopelow*wind, x = wind), col="blue")+geom_line(aes(y=intercepthigh+slopehigh*wind, x=wind), col="red")+ylab("ozone")

Interactive lectures- problem set first week

Theoretical questions

Problem 1

Write down the GLM way for the multiple linear regression model. Explain.
Write down the likelihood and loglikelihood. Then define the score vector.

What is the set of equations we solve to find parameter estimates? What if we could not find a closed form solution to our set of equations - what could we do then?

Define the observed and the expected Fisher information matrix. What dimension does these matrices have? What can these matrices tell us?
A core finding is $\hat \beta$. \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with $\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})$.

Show that $\hat{\beta}$ has this distribution with the given mean and covariance matrix. What does this imply for the distribution of the $j$th element of $\hat{\beta}$? In particular, how can we calculate the variance of $\hat{\beta}_j$?

Explain the difference between error and residual. What are the properties of the raw residuals? Why don’t we want to use the raw residuals for model check? What is our solution to this?
That is the theoretical intercept and slope of a QQ–plot based on a normal sample?

Interpretation and understanding

Problem 2: Munich Rent Index data

Fit the regression model with first rent and then rentsqm as reponse and following covariates: area, location (dummy variable coding using location2 and location3), bath, kitchen and cheating (central heating).

#CODE

Look at diagnostic plots for the two fits. Which response do you prefer?

Consentrate on the response-model you choose for the rest of the tasks.

Explain what the parameter estimates mean in practice. In particular, what is the interpretation of the intercept?

Go through the summary printout and explain the parts you know now, and also observe the parts you don’t know yet (on the agenda for next week?).

Next week: more on inference on this data set.

Problem 3: Simple vs. multiple regression

In a design matrix orthogonal columns gives diagonal ${\bf X}^T {\bf X}$. What does that mean? How can we get orthogonal columns?
If we have orthogonal columns, will then simple (only one covariate) and multiple estimated regression coefficients be different? Explain.
What is multicollinearity? Is that a problem? Why (not)?

Problem 4: Dummy vs. effect coding in MLR

Background material for this task: [Categorical covariates - dummy and effect coding)(#categorical)

We will study a dataset where we want to model income as response and two unordered categorical covariates genderand place (location).

income <- c(300, 350, 370, 360, 400, 370, 420, 390,
            400,430,420, 410, 300, 320, 310, 305,
            350, 370, 340, 355,370, 380, 360, 365)
gender <- c(rep("Male", 12),rep("Female",12))
place <- rep(c(rep("A",4),rep("B",4),rep("C",4)),2)
data <- data.frame(income,gender=factor(gender,levels=c("Female","Male")),
                   place=factor(place,levels=c("A","B","C")))

First, describe the data set.

#CODE ggpairs

Check out the interaction.plot(data$gender,data$place,data$income). What does it show? Do we need an interaction term if we want to model a MLR with income as response? Is there a ggplot version of the interaction plot?
Check our plot.design(income~place+gender, data = data). What does it show?
First, use treatment contrast (dummy variable coding) and fit a MLR with income as response and gender and place as covariates. Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?
Now, turn to sum-zero contrast (effect coding). Explain what your model estimates mean. Now, what is the intercept estimate? Calculate the estimate for place=C.

Next week we connect this to linear hypotheses and ANOVA.

Problem 5: Interactions

This part of the module was marked “self-study”. Go through this together in the group, and make sure that you understand.

Problem 6: Simulations in R (optional)

(this problem was also given in TMA4268 Statistical learning)

For simple linear regression, simulate at data set with homoscedastic errore and with heteroscedastic errors. Here is a suggestion of one solution - not using ggplot. You use ggplot. Why this? To see how things looks when the model is correct and wrong.

#Homoscedastic errore
n=1000
x=seq(-3,3,length=n)
beta0=-1
beta1=2
xbeta=beta0+beta1*x
sigma=1
e1=rnorm(n,mean=0,sd=sigma)
y1=xbeta+e1
ehat1=residuals(lm(y1~x))
plot(x,y1,pch=20)
abline(beta0,beta1,col=1)
plot(x,e1,pch=20)
abline(h=0,col=2)
#Heteroscedastic errors
sigma=(0.1+0.3*(x+3))^2
e2=rnorm(n,0,sd=sigma)
y2=xbeta+e2
ehat2=residuals(lm(y2~x))
plot(x,y2,pch=20)
abline(beta0,beta1,col=2)
plot(x,e2,pch=20)
abline(h=0,col=2)

All this fuss about raw, standardized and studentized residuals- does really matter in practice? Below is one example where the raw residuals are rather different from the standardized, but the standardized is identical to the studentized. Can you come up with a simuation model where the standardized and studentized are very different? Hint: what about at smaller sample size?

n=1000
beta=matrix(c(0,1,1/2,1/3),ncol=1)
set.seed(123)
x1=rnorm(n,0,1); x2=rnorm(n,0,2); x3=rnorm(n,0,3)
X=cbind(rep(1,n),x1,x2,x3)
y=X%*%beta+rnorm(n,0,2)
fit=lm(y~x1+x2+x3)
yhat=predict(fit)
summary(fit)
ehat=residuals(fit); estand=rstandard(fit); estud=rstudent(fit)
plot(yhat,ehat,pch=20)
points(yhat,estand,pch=20,col=2)

SECOND WEEK

UNDER CONSTRUCTION :-)

R packages

install.packages(c("gamlss.data","tidyverse", "GGally", "Matrix","nortest"))

References and further reading

Slightly different presentation (more focus on multivariate normal theory): Slides and written material from TMA4267 Linear Statistical Models in 2017, Part 2: Regression (by Mette Langaas).
And, same source, but now [Slides and written material from TMA4267 Linear Statistical Models in 2017, Part 3: Hypothesis testing and ANOVA] (http://www.math.ntnu.no/emner/TMA4267/2017v/TMA4267V2017Part3.pdf)

Module 2: MULTIPLE LINEAR REGRESSION

TMA4315 Generalized linear models H2018

Mette Langaas, Department of Mathematical Sciences, NTNU – with contributions from Øyvind Bakke and Ingeborg Hem

30.08 and 06.09 [PL], 31.08 and 07.09 [IL]