install.packages("gamlss.data")
install.packages("tidyverse")
install.packages("GGally")
install.packages("Matrix")
For Compulsory 1: jump to sequential anova formula
Aim of this module
First week
- Aim of multiple linear regression.
- Define and understand the multiple linear regression model (matrix notation)
- parameter estimation with maximum likelihood (and least squares), geometric interpretation
- likelihood, score vector and Hessian (observed Fisher information matrix)
- properties of parameter estimates (distribution)
- categorical covariates: dummy or effect coding
- simple vs multiple regression: why (and when) do estimated regression effects change from simple to multiple regression?
- QQ-plots
- assessing model fit (diagnostic), residuals
Jump to interactive for first week
Second week
- statistical inference for parameter estimates
- confidence intervals,
- prediction intervals,
- hypothesis test,
- linear hypotheses
- analysis of variance decompositions and \(R^2\), sequential ANOVA table
- model selection with AIC and variants
Jump to interactive for second week
Textbook: Chapter 2.2, 3 and B.4 (Chapter 3 was on the reading list for TMA4267 Linear statistical models in 2016 and 2017, but for TMA4255 Applied statistics sum and not matrix notation was used and a different textbook.)
Aim of multiple linear regression
Construct a model to help understand the relationship between a response and one or several explanatory variables. [Correlation, or cause and effect?]
Construct a model to predict the response from a set of (one or several) explanatory variables. [More or less “black box”]
\(\oplus\) Notation
\({\bf Y}: (n \times 1)\) vector of responses [e.g. one of the following: rent, rent pr sqm, weight of baby, ph of lake, volume of tree]
\({\bf X}: (n \times p)\) design matrix [e.g. location of flat, gestation age of baby, chemical measurement of the lake, height of tree]
\({\bf \beta}: (p \times 1)\) vector of regression parameters (intercept included, so \(p=k+1\))
\({\bf \varepsilon}: (n\times 1)\) vector of random errors.
We assume that the pairs \(({\bf x}_i^T,y_i)\) (\(i=1,\ldots,n\)) are measured on independent sampling units; that is, the observation pair \(({\bf x}_1^T,y_1)\) is independent of \(({\bf x}_2^T,y_2)\), and so on.
\(\odot\) Munich rent index: response and covariates
Study the print-out and discuss the following questions:
- What can be response, and what covariates? (using what you know about rents) A: rent or rentsqm for response and all other as covariates.
- What type of response(s) do we have? (continuous, categorical, nominal, ordinal, discrete, factors, …). A: rent, rentsqm (continuous).
- What types of covariates? (continuous, categorical, nominal, ordinal, discrete, factors, …) A: area (continuous), yearc (ordinal or continuous), location (categorical - ordinal), bath, kitchen, cheating (categorical, nominal?), district (categorical - nominal).
- Explain what the elements of model.matrix are. (Hint: coding of location.) A: our design matrix, contains numerical values for the continuous covariates, and dummy variable coding for categorical covariates (more about this later).
# install.packages('gamlss.data') the package needs to be installed before
# this Rmd is run
library("gamlss.data")
ds = rent99
colnames(ds)
## [1] "rent" "rentsqm" "area" "yearc" "location" "bath"
## [7] "kitchen" "cheating" "district"
summary(ds)
## rent rentsqm area yearc
## Min. : 40.51 Min. : 0.4158 Min. : 20.00 Min. :1918
## 1st Qu.: 322.03 1st Qu.: 5.2610 1st Qu.: 51.00 1st Qu.:1939
## Median : 426.97 Median : 6.9802 Median : 65.00 Median :1959
## Mean : 459.44 Mean : 7.1113 Mean : 67.37 Mean :1956
## 3rd Qu.: 559.36 3rd Qu.: 8.8408 3rd Qu.: 81.00 3rd Qu.:1972
## Max. :1843.38 Max. :17.7216 Max. :160.00 Max. :1997
## location bath kitchen cheating district
## 1:1794 0:2891 0:2951 0: 321 Min. : 113
## 2:1210 1: 191 1: 131 1:2761 1st Qu.: 561
## 3: 78 Median :1025
## Mean :1170
## 3rd Qu.:1714
## Max. :2529
dim(ds)
## [1] 3082 9
head(ds)
## rent rentsqm area yearc location bath kitchen cheating district
## 1 109.9487 4.228797 26 1918 2 0 0 0 916
## 2 243.2820 8.688646 28 1918 2 0 0 1 813
## 3 261.6410 8.721369 30 1918 1 0 0 1 611
## 4 106.4103 3.547009 30 1918 2 0 0 0 2025
## 5 133.3846 4.446154 30 1918 2 0 0 1 561
## 6 339.0256 11.300851 30 1918 2 0 0 1 541
str(ds$location)
## Factor w/ 3 levels "1","2","3": 2 2 1 2 2 2 1 1 1 2 ...
contrasts(ds$location)
## 2 3
## 1 0 0
## 2 1 0
## 3 0 1
X = model.matrix(rentsqm ~ area + yearc + location + bath + kitchen + cheating +
district, data = ds)
head(X)
## (Intercept) area yearc location2 location3 bath1 kitchen1 cheating1
## 1 1 26 1918 1 0 0 0 0
## 2 1 28 1918 1 0 0 0 1
## 3 1 30 1918 0 0 0 0 1
## 4 1 30 1918 1 0 0 0 0
## 5 1 30 1918 1 0 0 0 1
## 6 1 30 1918 1 0 0 0 1
## district
## 1 916
## 2 813
## 3 611
## 4 2025
## 5 561
## 6 541
Model
\[{\bf Y=X \beta}+{\bf\varepsilon}\] is called a classical linear model if the following is true:
1. \(\text{E}(\bf{\varepsilon})=\bf{0}\).
2. \(\text{Cov}(\bf{\varepsilon})=\text{E}(\bf{\varepsilon}\bf{\varepsilon}^T)=\sigma^2\bf{I}\).
3. The design matrix has full rank, \(\text{rank}({\bf X})=k+1=p\).
The classical normal linear regression model is obtained if additionally
- \(\varepsilon\sim N_n(\bf{0},\sigma^2\bf{I})\) holds.
For random covariates these assumptions are to be understood conditionally on \(\bf{X}\).
\(\oplus\) Questions
- What is “full rank”? Why is this needed? Example of rank less than \(p\)?
- For the normal linear model - what is the distribution for each \(Y_i\)? Are the \(Y_i\)s independent?
- Would the normal linear model be a possible model for the Munich rent data? Why?
Parameter estimation
In multiple linear regression there are two popular methods for estimating the regression parameters in \(\beta\): maximum likelihood and least squares. These two methods give the same estimator when we assume the normal linear regression model. We will in this module focus on maximum likelihood estimation, since that can be used also when we have non-normal responses (modules 3-6: binomial, Poisson, gamma, multinomial).
\(\oplus\) Likelihood theory (from B.4)
We assume that we have collected independent pairs of data from \(n\) units (individuals, sample points). Here we move on to define the likelihood, log-likelihood, score vector, and observed and expected Fisher information. Then we solve “score vector = 0” to find the parameter estimates.
Least squares and maximum likelihood estimator for \({\bf \beta}\):
\[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]
\(\odot\) Munich rent index: parameter estimates
Explain what the values under Estimate mean in practice.
fit = lm(rentsqm ~ area + yearc + location + bath + kitchen + cheating, data = ds)
summary(fit)
##
## Call:
## lm(formula = rentsqm ~ area + yearc + location + bath + kitchen +
## cheating, data = ds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4303 -1.4131 -0.1073 1.3244 8.6452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.475484 3.603775 -12.619 < 2e-16 ***
## area -0.032330 0.001648 -19.618 < 2e-16 ***
## yearc 0.026959 0.001846 14.606 < 2e-16 ***
## location2 0.777133 0.076870 10.110 < 2e-16 ***
## location3 1.725068 0.236062 7.308 3.45e-13 ***
## bath1 0.762808 0.157559 4.841 1.35e-06 ***
## kitchen1 1.136908 0.183088 6.210 6.02e-10 ***
## cheating1 1.765261 0.129068 13.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.031 on 3074 degrees of freedom
## Multiple R-squared: 0.3065, Adjusted R-squared: 0.3049
## F-statistic: 194.1 on 7 and 3074 DF, p-value: < 2.2e-16
Reproduce the values under Estimate by calculating them without the use of lm.
X = model.matrix(rentsqm ~ area + yearc + location + bath + kitchen + cheating,
data = ds)
Y = ds$rentsqm
betahat = solve(t(X) %*% X) %*% t(X) %*% Y
# betahat-fit$coefficients
print(betahat)
## [,1]
## (Intercept) -45.47548356
## area -0.03233033
## yearc 0.02695857
## location2 0.77713297
## location3 1.72506792
## bath1 0.76280784
## kitchen1 1.13690814
## cheating1 1.76526110
Projection matrices: idempotent and symmetric (orthogonal projections):
First, we define the predictions as \(\hat{\bf Y}={\bf X}\hat{\beta}\), and inserting the ML (and LS) estimate we get \(\hat{\bf Y}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}\).
We define the projection matrix \[ {\bf H}={\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T\] called the hat matrix. This simplifies the notation for the predictions, \[\hat{\bf Y}={\bf H}{\bf Y}\] so the hat matrix is putting the hat on the response \({\bf Y}\).
In addition we define residuals as \[\begin{align*} \hat{\varepsilon}&={\bf Y}-\hat{\bf Y} \\ \hat{\varepsilon}&={\bf Y}-{\bf HY}=({\bf I-H}){\bf Y}\\ \end{align*}\]so we have a second projection matrix \[ {\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T \]
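The projection properties above can be checked numerically. A minimal sketch on simulated data (the design matrix and variable names here are ours, not the rent data):

```r
# A minimal numerical check of the projection properties of the hat matrix.
set.seed(1)
n <- 20
X <- cbind(1, runif(n), runif(n))          # intercept + two covariates
Y <- drop(X %*% c(1, 2, -1)) + rnorm(n)

H <- X %*% solve(t(X) %*% X) %*% t(X)      # hat matrix

stopifnot(all.equal(H %*% H, H))           # idempotent: HH = H
stopifnot(all.equal(H, t(H)))              # symmetric

# H Y gives the fitted values and (I - H) Y the residuals, as from lm
fit <- lm(Y ~ X - 1)
stopifnot(all.equal(drop(H %*% Y), unname(fitted(fit))))
stopifnot(all.equal(drop((diag(n) - H) %*% Y), unname(residuals(fit))))
```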
Geometry of Least Squares (involving our two projection matrices)
- Mean response vector: \(\text{E}({\bf Y})={\bf X}{\bf \beta}\)
- As \(\beta\) varies, \({\bf X}\beta\) spans the model plane of all linear combinations. I.e. the space spanned by the columns of \({\bf X}\): the column-space of \({\bf X}\).
- Due to random error (and unobserved covariates), \({\bf Y}\) is not exactly a linear combination of the columns of \({\bf X}\).
- LS-estimation chooses \(\hat{\beta}\) such that \({\bf X}\hat{\beta}\) is the point in the column-space of \({\bf X}\) that is closest to \({\bf Y}\).
- The residual vector \(\hat{\varepsilon}={\bf Y}-\hat{{\bf Y}}=({\bf I-H}){\bf Y}\) is perpendicular to the column-space of \({\bf X}\).
- Multiplication by \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\) projects a vector onto the column-space of \({\bf X}\).
- Multiplication by \({\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\) projects a vector onto the space perpendicular to the column-space of \({\bf X}\).
Graphical presentations: Putanen, Styan and Isotalo: Matrix Tricks for Linear Statistical Models: Our Personal Top Twenty, Figure 8.3.
Restricted maximum likelihood estimator for \({\bf \sigma}^2\):
\[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}{\hat{\beta}})^T({\bf Y}-{\bf X}{\bf \hat{\beta}})=\frac{\text{SSE}}{n-p}\]
In the generalized linear models setting (remember the exponential family from Module 1) we will treat the parameter \(\sigma^2\) as a nuisance parameter, i.e. a parameter that is not of interest to us. Our focus will be on the parameters of interest - those related to the mean of the response, which is modelled using our covariates - so the regression parameters \(\beta\) are our prime focus.
However, to perform inference we need an estimator for \(\sigma^2\).
The maximum likelihood estimator for \(\sigma^2\) is \(\frac{\text{SSE}}{n}\), which is found by maximizing the likelihood with our estimate \(\hat{\beta}\) inserted \[L(\hat{\beta},{\sigma^2})=(\frac{1}{2\pi})^{n/2}(\frac{1}{\sigma^2})^{n/2}\exp(-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}))\] \[\begin{align*} l(\hat{\beta},{\sigma^2})&=\text{ln}(L(\hat{\beta},{\sigma^2}))\\ &=-\frac{n}{2}\text{ln}(2\pi)-\frac{n}{2}\text{ln}\sigma^2-\frac{1}{2\sigma^2} ({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})\\ \end{align*}\]The score function with respect to \(\sigma^2\) is \[\frac{\partial l}{\partial \sigma^2}=-\frac{n}{2\sigma^2}+\frac{1}{2\sigma^4}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta}) \] Solving \(\frac{\partial l}{\partial \sigma^2}=0\) gives us the estimator \[ \frac{1}{n}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n}\] But this estimator is biased, which can be seen using the trace-formula (from TMA4267).
But it is asymptotically unbiased (the bias vanishes as the sample size \(n\) goes to infinity).
However, an unbiased version is preferred, and it is found based on restricted maximum likelihood (REML). We will look into REML estimation in Module 7. In our case the (unbiased) REML estimate is \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf y}-{\bf X}\hat{\beta})^T({\bf y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\]
The restricted maximum likelihood estimate is used in lm.
Q: What does it mean that the REML estimate is unbiased? Where is the estimate \(\hat{\sigma}\) in the regression output? (See output from lm for the rent index example.) A: Residual standard error: 2.031 is \(\hat{\sigma}\).
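The two estimates of \(\sigma^2\) can be compared directly. A sketch on simulated data (our own variable names, not the rent data); the REML version is what summary(lm) reports as the residual standard error:

```r
# ML estimate SSE/n is biased; REML estimate SSE/(n - p) is what lm reports.
set.seed(1)
n <- 100
X <- cbind(1, runif(n), runif(n))          # p = 3 columns
p <- ncol(X)
Y <- drop(X %*% c(1, 2, -1)) + rnorm(n, sd = 2)
fit <- lm(Y ~ X - 1)

SSE <- sum(residuals(fit)^2)
sigma2_ML   <- SSE / n         # biased (too small on average)
sigma2_REML <- SSE / (n - p)   # unbiased

stopifnot(all.equal(sqrt(sigma2_REML), summary(fit)$sigma))
```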
Properties for the normal linear model
To be able to do inference (=make confidence intervals, prediction intervals, test hypotheses) we need to know about the properties of our parameter estimators in the (normal) linear model.
Least squares and maximum likelihood estimator for \({\bf \beta}\): \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
Restricted maximum likelihood estimator for \({\bf \sigma}^2\): \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}\hat{\beta})^T({\bf Y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\] with \(\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}\).
Statistic for inference about \(\beta_j\), \(c_{jj}\) is diagonal element \(j\) of \(({\bf X}^T{\bf X})^{-1}\). \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] This requires that \(\hat{\beta}_j\) and \(\hat{\sigma}\) are independent (see below).
However, when we work with large samples then \(n-p\) becomes large and the \(t\) distribution goes to a normal distribution, so we may use the standard normal in place of the \(t_{n-p}\).
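The statistic \(T_j\) can be reproduced by hand. The following sketch (simulated data, our own names) recomputes the standard errors and t-values reported by summary(lm) from the diagonal of \(({\bf X}^T{\bf X})^{-1}\):

```r
# Reproduce Std. Error and t value columns of summary(lm) from c_jj.
set.seed(1)
n <- 50
X <- cbind(1, runif(n), runif(n))
Y <- drop(X %*% c(1, 2, -1)) + rnorm(n)
fit <- lm(Y ~ X - 1)

betahat  <- coef(fit)
sigmahat <- summary(fit)$sigma
cjj <- diag(solve(t(X) %*% X))   # c_jj, diagonal of (X^T X)^{-1}

se   <- sigmahat * sqrt(cjj)     # standard errors
tval <- betahat / se             # T_j under H0: beta_j = 0

ctab <- summary(fit)$coefficients
stopifnot(all.equal(unname(se),   unname(ctab[, "Std. Error"])))
stopifnot(all.equal(unname(tval), unname(ctab[, "t value"])))
```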
Are \(\hat{{\bf \beta}}\) and SSE independent? (optional)
This can be proven using knowledge from TMA4267 on properties of the multivariate normal distribution.
Independence: Let \({\bf X}_{(p \times 1)}\) be a random vector from \(N_p({\bf \mu},{\bf \Sigma})\). Then \({\bf A}{\bf X}\) and \({\bf B}{\bf X}\) are independent iff \({\bf A}{\bf \Sigma}{\bf B}^T={\bf 0}\).
We have:
\({\bf Y}\sim N_n({\bf X}{\bf \beta},\sigma^2{\bf I})\)
\({\bf AY}=\hat{{\bf \beta}}=({\bf X}^T{\bf X})^{-1}{\bf X}^T {\bf Y}\), and
\({\bf BY}=({\bf I}-{\bf H}){\bf Y}\).
Now \({\bf A}\sigma^2{\bf I}{\bf B}^T=\sigma^2 {\bf A}{\bf B}^T=\sigma^2 ({\bf X}^T{\bf X})^{-1}{\bf X}^T ({\bf I}-{\bf H})={\bf 0}\)
since \({\bf X}^T({\bf I}-{\bf H})={\bf X}^T-{\bf X}^T{\bf H}={\bf X}^T-{\bf X}^T={\bf 0}\), because \({\bf H}{\bf X}={\bf X}\) gives \({\bf X}^T{\bf H}={\bf X}^T\).
We conclude that \(\hat{{\bf \beta}}\) is independent of \(({\bf I}-{\bf H}){\bf Y}\),
and, since SSE is a function of \(({\bf I}-{\bf H}){\bf Y}\) (SSE \(={\bf Y}^T({\bf I}-{\bf H}){\bf Y}\)),
\(\hat{{\bf \beta}}\) and SSE are independent, and the result that \(T_j\) is \(t\)-distributed with \(n-p\) degrees of freedom follows.
Categorical covariates - dummy and effect coding
Example: consider our rent dataset with rent as response, continuous covariate area and categorical covariate location. Let location be a factor with levels average, good, excellent.
require(gamlss.data)
require(tidyverse)
require(GGally)
ds = rent99 %>% select(location, area, rent)
levels(ds$location)
## [1] "1" "2" "3"
# change to meaningful names
levels(ds$location) = c("average", "good", "excellent")
ggpairs(ds)
Q: comment on what you see in the ggpairs plot.
Categorical covariates may either be ordered or unordered. We will only consider unordered categories here. In general, we would like to estimate regression coefficients for all levels of a categorical covariate. However, if we want to include an intercept in our model we can only include codings for one level less than the number of levels we have - or else our design matrix will not have full rank.
Q: Assume you have a categorical variable with three levels. Check for yourself that making a design matrix with one intercept and three columns with dummy (0-1) variable coding will result in a matrix that is singular.
# make 'wrong' dummy variable coding with 3 columns
n = length(ds$location)
X = cbind(rep(1, n), ds$area, rep(0, n), rep(0, n), rep(0, n))
X[ds$location == "average", 3] = 1
X[ds$location == "good", 4] = 1
X[ds$location == "excellent", 5] = 1
X[c(1, 3, 69), ]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 26 0 1 0
## [2,] 1 30 1 0 0
## [3,] 1 55 0 0 1
require(Matrix)
dim(X)
## [1] 3082 5
rankMatrix(X)
## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.843415e-13
This is why we instead need to work with different ways of coding categorical variables. One solution is to not include an intercept in the model, but that is often not what we want. We will look at two other solutions - one where we decide on a reference category (which we do not include in the coding, so it is in effect absorbed into the intercept - this is called “treatment coding”) and one where we require that the sum of the coefficients is zero (called “effect coding”). This mainly affects how we interpret parameter estimates and communicate our findings to the world.
If we fit a regression model with lm to the data with rent as response and area and location as covariates, a model matrix is made. How the categorical variable is handled is either specified in the call to lm (or to model.matrix) with contrasts=list(location="contr.treatment"), or globally for all categorical variables with options(contrasts=c("contr.treatment","contr.poly")) - where the first element gives the choice for unordered factors (treatment contrast is the default) and the second for ordered factors (polynomial contrast is the default). We will only work with unordered factors now.
Dummy variable coding aka treatment contrast
This is the default coding. The reference level is automatically chosen as the “lowest” level (sorted alphabetically). For our example this means that the reference category for location is “average”. If we instead wanted “good” to be the reference category we could relevel the factor.
X1 = model.matrix(~area + location, data = ds)
X1[c(1, 3, 69), ]
## (Intercept) area locationgood locationexcellent
## 1 1 26 1 0
## 3 1 30 0 0
## 69 1 55 0 1
ds$locationRELEVEL = relevel(ds$location, ref = "good")
X2 = model.matrix(~area + locationRELEVEL, data = ds)
X2[c(1, 3, 69), ]
## (Intercept) area locationRELEVELaverage locationRELEVELexcellent
## 1 1 26 0 0
## 3 1 30 1 0
## 69 1 55 0 1
So, what does this mean in practice? Model 1 has average as reference category and model 2 has good.
fit1 = lm(rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
summary(fit1)
##
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.0867 8.6947 14.732 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationgood 28.0040 5.8662 4.774 1.89e-06 ***
## locationexcellent 131.1075 18.2614 7.180 8.73e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
fit2 = lm(rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
summary(fit2)
##
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.0907 9.4950 16.439 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationRELEVELaverage -28.0040 5.8662 -4.774 1.89e-06 ***
## locationRELEVELexcellent 103.1034 18.4021 5.603 2.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
Q: Comment on the print-out. How do we interpret the intercept estimate?
Effect coding aka sum-zero-contrast:
This is an equally useful and popular coding - and it is the coding preferred when working with analysis of variance in general. The effect coding requires that the effects for the levels of the factor sum to zero, which is achieved with the following coding scheme (Model 3 with the original location and Model 4 with the releveled version).
X3 = model.matrix(~area + location, data = ds, contrasts = list(location = "contr.sum"))
X3[c(1, 3, 69), ]
## (Intercept) area location1 location2
## 1 1 26 0 1
## 3 1 30 1 0
## 69 1 55 -1 -1
X4 = model.matrix(~area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
X4[c(1, 3, 69), ]
## (Intercept) area locationRELEVEL1 locationRELEVEL2
## 1 1 26 1 0
## 3 1 30 0 1
## 69 1 55 -1 -1
Observe the coding scheme. It means that we find “the missing location level estimate” as the negative of the sum of the parameter estimates for the other levels.
So, what does this mean in practice?
fit3 = lm(rent ~ area + location, data = ds, contrasts = list(location = "contr.sum"))
summary(fit3)
##
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181.1238 10.6383 17.026 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## location1 -53.0372 6.6428 -7.984 1.98e-15 ***
## location2 -25.0331 6.7710 -3.697 0.000222 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
fit4 = lm(rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
summary(fit4)
##
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 181.1238 10.6383 17.026 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationRELEVEL1 -25.0331 6.7710 -3.697 0.000222 ***
## locationRELEVEL2 -53.0372 6.6428 -7.984 1.98e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
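The sum-to-zero constraint can be verified in a minimal sketch (simulated data with hypothetical group means, not the rent data): the effect of the omitted level is minus the sum of the printed effects.

```r
# With contr.sum, the omitted level's effect is minus the sum of the others.
set.seed(1)
loc <- factor(rep(c("average", "good", "excellent"), each = 30),
              levels = c("average", "good", "excellent"))
y <- rep(c(0, 1, 3), each = 30) + rnorm(90)  # hypothetical group means 0, 1, 3

fit <- lm(y ~ loc, contrasts = list(loc = "contr.sum"))
eff <- coef(fit)[c("loc1", "loc2")]   # effects of "average" and "good"
eff_excellent <- -sum(eff)            # effect of the omitted level "excellent"

# intercept + effect reproduces each group mean (saturated one-way model)
group_means <- tapply(y, loc, mean)
stopifnot(all.equal(unname(coef(fit)[1] + c(eff, eff_excellent)),
                    as.vector(group_means)))
```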
Q: Comment on the print-out. How do we now interpret the intercept estimate?
Simple vs. multiple regression: why (and when) do estimated regression effects change from simple to multiple regression?
The Aquarium
(story originally told by Kathrine Frey Frøslie, look up her blog at http://statistrikk.no)
Imagine an aquarium - maybe 50 cm wide (our x-axis), 1 meter long (our y-axis) and 1 meter high (our z-axis). Inside the aquarium many small fish are swimming together as a shoal. Suddenly we are able to freeze time, and all the fish are fixed in place in the water.
We move to the short side of the aquarium, bend down and see the fish in two dimensions (x and z) - we can use a ruler to draw a line through the fish; they look as if they lie on a straight line (we do a simple linear regression). Then we move over to the long side of the aquarium, bend down and see the fish in two dimensions (now y and z) - again we can draw a line through the fish, and again they look as if they lie on a straight line (so we can do another simple linear regression).
Then I hand you a fictive glass plate and ask you to fit it to the shoal of fish in three dimensions - you do a multiple linear regression.
Q:
- Design matrix: orthogonal columns give a diagonal \({\bf X}^T {\bf X}\). What does that mean? How can we get orthogonal columns?
- If we have orthogonal columns, will the estimated regression coefficients from simple and multiple regression then differ?
- What is multicollinearity?
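The orthogonality question can be answered numerically. A sketch with columns constructed to be exactly orthogonal (simulated data, our own names):

```r
# Orthogonal columns make X^T X diagonal, and simple and multiple
# regression then agree on the slope estimates.
set.seed(1)
n  <- 100
x1 <- drop(scale(rnorm(n), scale = FALSE))  # centered: orthogonal to intercept
x2 <- unname(residuals(lm(rnorm(n) ~ x1)))  # orthogonal to both 1 and x1
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X <- cbind(1, x1, x2)
XtX <- t(X) %*% X
stopifnot(all.equal(XtX, diag(diag(XtX)), check.attributes = FALSE))  # diagonal

# slope from each simple regression equals the multiple regression slope
b1 <- coef(lm(y ~ x1))["x1"]
b2 <- coef(lm(y ~ x2))["x2"]
bm <- coef(lm(y ~ x1 + x2))[c("x1", "x2")]
stopifnot(all.equal(unname(c(b1, b2)), unname(bm)))
```

With correlated (non-orthogonal) columns the off-diagonal elements of \({\bf X}^T{\bf X}\) are non-zero, and the simple and multiple estimates will in general differ.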
Checking model assumptions
In the normal linear model we have made the following assumptions.
1. Linearity of covariates: \({\bf Y}={\bf X}\beta+\varepsilon\).
2. Homoscedastic error variance: \(\text{Cov}(\varepsilon)=\sigma^2 {\bf I}\).
3. Uncorrelated errors: \(\text{Cov}(\varepsilon_i,\varepsilon_j)=0\) for \(i \neq j\).
4. Additivity of errors: \({\bf Y}={\bf X}\beta+\varepsilon\).
5. Normality: \(\varepsilon \sim N_n({\bf 0},\sigma^2{\bf I})\).
General theory on QQ-plots
Go to separate R Markdown or html document: QQ–plot as Rmd or QQ–plot as html
Residuals
If we assume the normal linear model then we know that the residuals (\(n\times 1\) vector) \[\hat{\varepsilon}={\bf Y}-{\bf \hat{Y}}=({\bf I}-{\bf H}){\bf Y}\] has a normal (singular) distribution with mean \(\text{E}(\hat{\varepsilon})={\bf 0}\) and covariance matrix \(\text{Cov}(\hat{ \varepsilon})=\sigma^2({\bf I}-{\bf H})\) where \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\).
This means that the residuals (possibly) have different variance, and may also be correlated.
Our best guess for the error \(\varepsilon\) is the residual vector \(\hat{\varepsilon}\), and we may think of the residuals as predictions of the errors. Be aware: don’t mix errors (the unobserved) with the residuals (“observed”).
But, we see that the residuals are not independent and may have different variance, therefore we will soon define variants of the residuals that we may use to assess model assumptions after a data set is fitted.
Q: How can we say that the residuals can have different variance and may be correlated? Why is that a problem?
We would like to check the model assumptions - we see that they are all connected to the error terms. But we have not observed the error terms \(\varepsilon\), so they cannot be used for this. However, we have made “predictions” of the errors - our residuals - and we want to use our residuals to check the model assumptions.
That is, we want to check that our errors are independent, homoscedastic (same variance for each observation) and not dependent on our covariates - using the residuals (observed) in place of the errors (unobserved). It would then have been great if the residuals had these properties whenever the underlying errors do. Since they do not, we try to fix the residuals so that they at least have equal variances. We do that by working with standardized or studentized residuals.
Standardized residuals:
\[r_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\] where \(h_{ii}\) is the \(i\)th diagonal element of the hat matrix \({\bf H}\).
In R you can get the standardized residuals from an lm-object (named fit) by rstandard(fit).
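A sketch, on simulated data, computing the standardized residuals by hand and comparing with rstandard:

```r
# Standardized residuals by hand vs rstandard.
set.seed(1)
n <- 40
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

X <- model.matrix(fit)
h <- diag(X %*% solve(t(X) %*% X) %*% t(X))      # leverages h_ii
r <- residuals(fit) / (summary(fit)$sigma * sqrt(1 - h))

stopifnot(all.equal(unname(r), unname(rstandard(fit))))
```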
Studentized residuals:
\[r^*_i=\frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}\] where \(\hat{\sigma}_{(i)}\) is the estimated error standard deviation in a model with observation number \(i\) omitted. This seems like a lot of work, but it can be shown that the studentized residuals can be calculated directly from the standardized residuals: \[r^*_i=r_i (\frac{n-p-1}{n-p-r_i^2})^{1/2}\]
In R you can get the studentized residuals from an lm-object (named fit) by rstudent(fit).
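A sketch, on simulated data, verifying the shortcut formula against rstudent:

```r
# Shortcut from standardized to studentized residuals vs rstudent.
set.seed(1)
n <- 40
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
p <- length(coef(fit))                       # p = 2 here

r     <- rstandard(fit)
rstar <- r * sqrt((n - p - 1) / (n - p - r^2))

stopifnot(all.equal(unname(rstar), unname(rstudent(fit))))
```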
Plotting residuals - and what to do when assumptions are violated?
Some important plots
- Plot the residuals, \(r^*_i\), against the predicted values, \(\hat{y}_i\).
  - Dependence of the residuals on the predicted values: wrong regression model?
  - Nonconstant variance: transformation or weighted least squares is needed?
- Plot the residuals, \(r^*_i\), against a predictor variable or functions of predictor variables. A trend suggests that a transformation of the predictors or more terms are needed in the regression.
- Assess normality of errors with QQ-plots and histograms of residuals. As an additional aid a test for normality can be used, but it must be interpreted with caution, since for small sample sizes the test is not very powerful and for large sample sizes even very small deviations from normality will be labelled as significant.
- Plot the residuals, \(r^*_i\), versus time or collection order (if possible). Look for dependence or autocorrelation.
Residuals can be used to check model assumptions, and also to discover outliers.
Diagnostic plots in R
More information on the plots here: http://data.library.virginia.edu/diagnostic-plots/ and http://ggplot2.tidyverse.org/reference/fortify.lm.html
You can use the function fortify.lm in ggplot2 to create a data frame from an lm-object; ggplot uses it automatically when given an lm-object. This can be used to make diagnostic plots.
We will use the data set women (one of R’s built-in data sets) as illustration. The information fortify provides is:
data(women) # Load a built-in dataframe called women
fit <- lm(weight ~ height, data = women) # Run a regression analysis
format(head(fortify(fit)), digits = 4L)
## weight height .hat .sigma .cooksd .fitted .resid .stdresid
## 1 115 58 0.24167 1.370 0.5276638 112.6 2.41667 1.8198
## 2 117 59 0.19524 1.556 0.0605634 116.0 0.96667 0.7066
## 3 120 60 0.15595 1.579 0.0125634 119.5 0.51667 0.3688
## 4 123 61 0.12381 1.587 0.0001541 122.9 0.06667 0.0467
## 5 126 62 0.09881 1.583 0.0038437 126.4 -0.38333 -0.2648
## 6 129 63 0.08095 1.567 0.0143093 129.8 -0.83333 -0.5700
# to show what fortify implicitly does in ggplot
Residuals vs fitted values
A plot with the fitted values of the model on the x-axis and the residuals on the y-axis shows whether the residuals have non-linear patterns. The plot can be used to check the assumption of a linear relationship between the response and the covariates. If the residuals are spread around a horizontal line with no distinct patterns, that is a good indication of no non-linear relationships, and hence a good model.
Does this look like a good plot for this data set?
ggplot(fit, aes(.fitted, .resid)) + geom_point(pch = 21) + geom_hline(yintercept = 0,
linetype = "dashed") + geom_smooth(se = FALSE, col = "red", size = 0.5,
method = "loess") + labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted values",
subtitle = deparse(fit$call))
Normal Q-Q
This plot shows if the residuals are Gaussian (normally) distributed. If they follow a straight line it is an indication that they are, otherwise they are probably not.
ggplot(fit, aes(sample = .stdresid)) + stat_qq(pch = 19) + geom_abline(intercept = 0,
slope = 1, linetype = "dotted") + labs(x = "Theoretical quantiles", y = "Standardized residuals",
title = "Normal Q-Q", subtitle = deparse(fit$call))
Scale-location
This is also called a spread-location plot. It shows whether the residuals are spread equally along the range of the predictors, and can be used to check the assumption of equal variance (homoscedasticity). A good plot is one with a horizontal line and randomly spread points.
Is this plot good for your data?
ggplot(fit, aes(.fitted, sqrt(abs(.stdresid)))) + geom_point() + geom_smooth(se = FALSE,
col = "red", size = 0.5, method = "loess") + labs(x = "Fitted values", y = expression(sqrt("Standardized residuals")),
title = "Scale-location", subtitle = deparse(fit$call))
Residual vs Leverage
This plot can reveal influential outliers. Not all outliers are influential in linear regression; even though observations have extreme values, they might not be influential in determining the regression line (the results don’t differ much if they are removed from the data set). Influential outliers can be seen as observations that do not follow the trend of the majority of the observations. In plot.lm, dashed lines are used to indicate Cook’s distance, instead of using the size of the dots as is done here.
Cook’s distance is the Euclidean distance between \(\mathbf{\hat{y}}\) (the fitted values) and \(\mathbf{\hat{y}}_{(i)}\) (the fitted values calculated when the \(i\)-th observation is omitted from the regression). It is thus a measure of how much the model is influenced by observation \(i\). The distance is scaled, and a rule of thumb is to examine observations with Cook’s distance larger than 1, and to give some attention to those with Cook’s distance above 0.5.
Leverage is defined as the diagonal elements of the hat matrix, i.e., the leverage of the \(i\)-th data point is \(h_{ii}\) on the diagonal of \(\mathbf{H = X(X^TX)^{-1}X^T}\). A large leverage indicates that observation \(i\) has a large influence on the estimation results, and that the covariate values \(\mathbf{x}_i\) are unusual.
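As a quick numerical check of this definition, here is a small sketch (on simulated data, not the rent data) computing the hat matrix by hand and comparing its diagonal with R’s built-in hatvalues:

```r
# Sketch: leverage as the diagonal of H = X (X^T X)^{-1} X^T
set.seed(1)
n <- 20
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
X <- model.matrix(fit)                   # design matrix with intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix
lev <- diag(H)                           # leverages h_ii
# agrees with R's built-in computation
all.equal(unname(lev), unname(hatvalues(fit)))
# the leverages sum to p (here p = 2: intercept and slope)
sum(lev)
```

Note that the leverages always sum to \(p\), the number of estimated regression parameters, since \(\text{tr}({\bf H})=p\).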
ggplot(fit, aes(.hat, .stdresid)) + geom_smooth(se = FALSE, col = "red", size = 0.5,
method = "loess") + geom_point(aes(size = .cooksd)) + scale_size_continuous("Cook's dist.") +
labs(x = "Leverage", y = "Standardized residuals", title = "Residuals vs Leverage",
subtitle = deparse(fit$call))
Interactive tasks for the first week (of this module)
Theoretical questions
Check what has been covered so far in this module - and stop and answer the questions (marked with Q) in the text. Discuss among yourselves if there are common theoretical or practical questions.
Using the normal linear regression model write down the likelihood. Then define the score vector and observed Fisher information matrix. What is the set of equations we solve to find parameter estimates? What if we could not find a closed form solution to our set of equations? What if we have so many observations that they don’t even fit in the computer memory?
A core finding is \(\hat \beta\). \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
Show that \(\hat{\beta}\) has this distribution with the given mean and covariance matrix. What does this imply for the distribution of the \(j\)th element of \(\hat{\beta}\)? In particular, how can we calculate the variance of \(\hat{\beta}_j\)?
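To make the covariance formula concrete, here is a small numerical sketch (on simulated data, not any of the course data sets) computing \(\hat{\sigma}^2({\bf X}^T{\bf X})^{-1}\) by hand and comparing with vcov from lm; the standard error of \(\hat{\beta}_j\) is the square root of the \(j\)-th diagonal element.

```r
# Sketch: estimated covariance matrix of beta-hat, sigma-hat^2 (X^T X)^{-1}
set.seed(2)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
p <- ncol(X)
sigma2hat <- sum(residuals(fit)^2) / (n - p)   # SSE/(n-p)
covhat <- sigma2hat * solve(t(X) %*% X)        # estimated Cov(beta-hat)
all.equal(covhat, vcov(fit), check.attributes = FALSE)
# standard errors: sqrt of the diagonal, matching the summary print-out
sqrt(diag(covhat))
```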
Explain the difference between error and residual. What are the properties of the raw residuals? Why don’t we want to use the raw residuals for model check? What is our solution to this?
What is the theoretical intercept and slope of a QQ-plot based on a normal sample?
Munich Rent Index data
Fit the regression model first with rent and then with rentsqm as response, with the following covariates: area, location (dummy variable coding using location2 and location3), bath, kitchen and cheating (central heating).
- Look at diagnostic plots for the two fits. Which response do you prefer?
Concentrate on the response-model you choose for the rest of the tasks.
Explain what the parameter estimates mean in practice. In particular, what is the interpretation of the intercept?
What if you wanted to recode the location variable (average-good-excellent) so that the good location is the reference level? How can you do that, and what would that mean for your fitted model? Hint: first define a location factor, and then use relevel.
Go through the summary printout and explain the parts you know now, and also observe the parts you don’t know yet (on the agenda for next week?).
If you wanted to tell the common man about your fitted regression model - preferably using graphics, e.g. in ggplot - how could you do that?
Next week: more on inference on this data set.
Dummy vs. effect coding in MLR
We will study a dataset where we want to model income as response with two unordered categorical covariates, gender and place (location).
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410, 300,
320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female", "Male")),
place = factor(place, levels = c("A", "B", "C")))
- First, describe the data set. How can you present the data? ggpairs?
- Check out interaction.plot(data$gender, data$place, data$income). What does it show? Do we need an interaction term if we want to model a MLR with income as response? Is there a ggplot version of the interaction plot?
- Check out plot.design(income ~ place + gender, data = data). What does it show?
- First, use treatment contrast (dummy variable coding) and fit a MLR with income as response and gender and place as covariates. Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?
- Now, turn to sum-to-zero contrast (effect coding). Explain what your model estimates mean. Now, what is the intercept estimate? Calculate the estimate for place=C.
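If you want a starting point for the contrast tasks, here is one possible sketch (fitting the same model with treatment and with sum-to-zero contrasts; work out the interpretations yourself):

```r
# Sketch: same additive model under two factor codings
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410, 300,
            320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- factor(c(rep("Male", 12), rep("Female", 12)), levels = c("Female", "Male"))
place <- factor(rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2))
# treatment contrasts (dummy variable coding) - R's default
fit.dummy <- lm(income ~ gender + place)
# sum-to-zero contrasts (effect coding)
fit.effect <- lm(income ~ gender + place,
                 contrasts = list(gender = "contr.sum", place = "contr.sum"))
coef(fit.dummy)     # intercept: fitted mean for the reference cell (Female, A)
coef(fit.effect)    # intercept: overall level; effects sum to zero per factor
# the effect for place=C is minus the sum of the printed place effects
-sum(coef(fit.effect)[c("place1", "place2")])
```

The two codings give the same fitted values; only the parameterization (and hence the interpretation of the estimates) differs.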
Next week we connect this to linear hypotheses and ANOVA.
Simulations in R
- For simple linear regression, simulate a data set with homoscedastic errors and one with heteroscedastic errors. Here is a suggestion of one solution - not using ggplot. You use ggplot. Why do this? To see how things look when the model is correct and when it is wrong.
# Homoscedastic errors
n = 1000
x = seq(-3, 3, length = n)
beta0 = -1
beta1 = 2
xbeta = beta0 + beta1 * x
sigma = 1
e1 = rnorm(n, mean = 0, sd = sigma)
y1 = xbeta + e1
ehat1 = residuals(lm(y1 ~ x))
plot(x, y1, pch = 20)
abline(beta0, beta1, col = 1)
plot(x, e1, pch = 20)
abline(h = 0, col = 2)
# Heteroscedastic errors
sigma = (0.1 + 0.3 * (x + 3))^2
e2 = rnorm(n, 0, sd = sigma)
y2 = xbeta + e2
ehat2 = residuals(lm(y2 ~ x))
plot(x, y2, pch = 20)
abline(beta0, beta1, col = 2)
plot(x, e2, pch = 20)
abline(h = 0, col = 2)
- All this fuss about raw, standardized and studentized residuals - does it really matter in practice? Below is one example where the raw residuals are rather different from the standardized, but the standardized are identical to the studentized. Can you come up with a simulation model where the standardized and studentized are very different? Hint: what about a smaller sample size?
n = 1000
beta = matrix(c(0, 1, 1/2, 1/3), ncol = 1)
set.seed(123)
x1 = rnorm(n, 0, 1)
x2 = rnorm(n, 0, 2)
x3 = rnorm(n, 0, 3)
X = cbind(rep(1, n), x1, x2, x3)
y = X %*% beta + rnorm(n, 0, 2)
fit = lm(y ~ x1 + x2 + x3)
yhat = predict(fit)
summary(fit)
ehat = residuals(fit)
estand = rstandard(fit)
estud = rstudent(fit)
plot(yhat, ehat, pch = 20)
points(yhat, estand, pch = 20, col = 2)
Demonstrations
- Using github (Jarl) - fork, update, pull request, merge. You need to have made your own account on github to benefit the most from this demo.
You may then send me a pull request to add your Rmd and html version of your work (for the interactive session) to the module folders at https://github.com/mettelang/TMA4315H2017
Kristin showed that there are many group tools for your compulsory exercises - blog and file upload are two.
SECOND WEEK
\(\oplus\) What to remember?
Model: \[{\bf Y=X \beta}+{\bf\varepsilon}\] with full rank design matrix. And classical normal linear regression model when \[\varepsilon\sim N_n(\bf{0},\sigma^2\bf{I}).\]
Parameter of interest is \(\beta\) and \(\sigma^2\) is a nuisance. Maximum likelihood estimator \[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] has distribution: \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
Restricted maximum likelihood estimator for \({\bf \sigma}^2\): \[ \hat{\sigma}^2=\frac{1}{n-p}({\bf Y}-{\bf X}\hat{\beta})^T({\bf Y}-{\bf X}\hat{\beta})=\frac{\text{SSE}}{n-p}\] with \(\frac{(n-p)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p}\).
Statistic for inference about \(\beta_j\), \(c_{jj}\) is diagonal element \(j\) of \(({\bf X}^T{\bf X})^{-1}\). \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] This requires that \(\hat{\beta}_j\) and \(\hat{\sigma}\) are independent.
Inference
We will consider confidence intervals and prediction intervals, and then test single and linear hypotheses.
Confidence intervals (CI)
In addition to providing a parameter estimate for each element of our parameter vector \(\beta\) we should also report a \((1-\alpha)100\)% confidence interval (CI) for each element. (We will not consider simultaneous confidence regions in this course.)
We focus on element \(j\) of \(\beta\), called \(\beta_j\). It is known that \(T_j =\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\) follows a \(t\)-distribution with \(n-p\) degrees of freedom. Let \(t_{\alpha/2,n-p}\) be such that \(P(T_j>t_{\alpha/2,n-p})=\alpha/2\). REMARK: our textbook would here look at area to the left instead of to the right - but we stick with this notation. Since the \(t\)-distribution is symmetric around 0, then \(P(T_j< -t_{\alpha/2,n-p})=\alpha/2\). We may then write \[ P(-t_{\alpha/2,n-p}\le T_j \le t_{\alpha/2,n-p})=1-\alpha\]
require(ggplot2)
df=10
ggplot(data.frame(x=c(-4,4)), aes(x))+
theme(axis.title.x=element_blank(),axis.title.y=element_blank(),axis.text.x=element_blank(),axis.text.y=element_blank(),axis.ticks = element_blank())+
stat_function(fun=function(x) dt(x,df), geom="line", colour="red")+
geom_vline(xintercept=qt(0.975,df),color="blue")+
geom_vline(xintercept=-qt(0.975,df),color="blue")+
geom_hline(yintercept=0)+
annotate("text",x=qt(0.975,df)+0.6,y=0.03,label="alpha/2",parse=TRUE,size=5)+
annotate("text",x=-qt(0.975,df)-0.6,y=0.03,label="alpha/2",parse=TRUE,size=5)+
annotate("text",x=0,y=0.03,label="1-alpha",parse=TRUE,size=5)
(Blue lines at \(\pm t_{\alpha/2,n-p}\).)
Inserting \(T_j =\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\) and solving so \(\beta_j\) is in the middle gives: \[ P(\hat{\beta}_j-t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma} \le \beta_j \le \hat{\beta}_j+t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma})=1-\alpha\]
A \((1-\alpha)100\)% CI for \(\beta_j\) is obtained when we insert numerical values for the upper and lower limits: \([\hat{\beta}_j-t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma},\hat{\beta}_j+t_{\alpha/2,n-p}\sqrt{c_{jj}}\hat{\sigma}]\).
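As a check of the formula, here is a small sketch (simulated data) computing the interval by hand and comparing with confint:

```r
# Sketch: 95% CI for a slope from the formula, compared with confint()
set.seed(3)
n <- 40
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)
fit <- lm(y ~ x)
p <- length(coef(fit))
betahat <- coef(fit)["x"]
se <- sqrt(diag(vcov(fit)))["x"]   # sqrt(c_jj) * sigma-hat
tq <- qt(0.975, df = n - p)        # t_{alpha/2, n-p} with alpha = 0.05
c(betahat - tq * se, betahat + tq * se)
confint(fit)["x", ]                # same interval
```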
CIs can be found in R using confint on an lm object. (Here dummy variable coding is used for location, with average as the reference location.)
require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
confint(fit)
## 2.5 % 97.5 %
## (Intercept) -44.825534 0.8788739
## area 4.354674 4.8029443
## location2 28.579849 49.9405909
## location3 92.970636 159.1443278
## bath1 52.076412 96.0311030
## kitchen1 94.907671 145.9621578
## cheating1 144.427555 178.4000215
Q (and A):
What is the interpretation of a 95% confidence interval? A: see Classnotes 04.09.2017
Does the CI for \(\hat{\beta}_{\text{area}}\) change if we change the regression model (e.g. not include cheating)? A: Yes!
How can we in practice find a CI for location1 (average location) - when that is not printed above? (Yes, we may use the formula, but in R without maths?) A: relevel to give good as reference.
Prediction intervals
Remember, one aim for regression was to “construct a model to predict the response from a set of (one or several) explanatory variables - more or less black box”.
Assume we want to make a prediction (of the response - often called \(Y_0\)) given specific values for the covariates - often called \({\bf x}_0\). An intuitive point estimate is \(\hat{Y}_0={\bf x}_0^T \hat{\beta}\) - but to give a hint of the uncertainty in this prediction we also want to present a prediction interval for \(Y_0\).
To arrive at such an interval we start with the difference between the unobserved response \(Y_0\) (for a given covariate vector \({\bf x}_0\)) and the point prediction \(\hat{Y}_0\), \(Y_0-\hat{Y}_0\). First, we assume that the unobserved response at covariate \({\bf x}_0\) is independent of our previous observations and follows the same distribution, that is \(Y_0 \sim N({\bf x}_0^T \beta,\sigma^2)\). Further,
\[\hat{Y}_0={\bf x}_0^T \hat{\beta} \sim N({\bf x}_0^T \beta,\sigma^2 {\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0).\]
Then, for \(Y_0-{\bf x}_0^T \hat{\beta}\) we have \[\text{E}(Y_0-{\bf x}_0^T \hat{\beta})=0 \text{ and } \text{Var}(Y_0-{\bf x}_0^T \hat{\beta})=\text{Var}(Y_0)+\text{Var}({\bf x}_0^T \hat{\beta})=\sigma^2+\sigma^2{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0\] so that \[Y_0-{\bf x}_0^T \hat{\beta}\sim N(0,\sigma^2 (1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0)) \] Inserting our REML-estimate for \(\sigma^2\) gives
\[T=\frac{Y_0-{\bf x}_0^T \hat{\beta}}{\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}}\sim t_{n-p}.\]
Then, we start with \[ P(-t_{\alpha/2,n-p}\le \frac{Y_0-{\bf x}_0^T \hat{\beta}}{\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}} \le t_{\alpha/2,n-p})=1-\alpha\] and solve so that \(Y_0\) is in the middle, which gives \[P({\bf x}_0^T \hat{\beta}-t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0} \le Y_0 \le {\bf x}_0^T \hat{\beta}+t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0})=1-\alpha\]
A \((1-\alpha)100\)% PI for \(Y_0\) is obtained when we insert numerical values for the upper and lower limits: \([{\bf x}_0^T \hat{\beta}-t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}, {\bf x}_0^T \hat{\beta}+t_{\alpha/2,n-p}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}]\).
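Again a small sketch (simulated data) checking the formula against R’s predict:

```r
# Sketch: 95% prediction interval from the formula, compared with predict()
set.seed(4)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
X <- model.matrix(fit)
p <- ncol(X)
x0 <- c(1, 0.5)                    # covariate vector x_0 (with intercept term)
pred <- drop(x0 %*% coef(fit))     # point prediction x_0^T beta-hat
sigmahat <- summary(fit)$sigma
half <- qt(0.975, n - p) * sigmahat *
  sqrt(1 + drop(t(x0) %*% solve(t(X) %*% X) %*% x0))
c(pred - half, pred + half)
predict(fit, newdata = data.frame(x = 0.5), interval = "prediction")
```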
PIs can be found in R using predict on an lm object, but make sure that newdata is a data.frame with the same names as the original data. We want to predict the rent - with PI - for an apartment with area 50, location 2 (“good”), nice bath and kitchen and with central heating.
require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
newobs = rent99[1, ]
newobs[1, ] = c(NA, NA, 50, NA, 2, 1, 1, 1, NA)
predict(fit, newdata = newobs, interval = "prediction", type = "response")
## fit lwr upr
## 1 602.1298 315.5353 888.7243
Q (and A):
When is a prediction interval of interest? A: Always? It gives useful information in addition to a point prediction.
- Explain the result from predict above. A: fit is the point prediction, lwr is the lower and upr the upper limit of the 95% PI (95% is the default; use the level argument for other levels).
- What is the interpretation of a 95% prediction interval? A: see Classnotes 04.09.2017
Single hypothesis testing set-up
In single hypothesis testing we are interested in testing one null hypothesis against an alternative hypothesis. In MLR the hypothesis is often about a regression parameter \(\beta_j\): \[ H_0: \beta_j=0 \text{ vs. } H_1: \beta_j\neq 0\] Remark: we implicitly say that our test is done given that the other variables are present in the model, that is, the other \(\beta_i\)s (\(i\neq j\)) are not zero.
Two types of errors:
“Reject \(H_0\) when \(H_0\) is true” = “false positive” = “type I error” = “miscarriage of justice”. These are our fake news, which are very important for us to avoid.
“Fail to reject \(H_0\) when \(H_1\) is true (and \(H_0\) is false)”=“false negatives” = “type II error”= “guilty criminal go free”.
We choose to reject \(H_0\) at some significance level \(\alpha\) if the \(p\)-value of the test (see below) is smaller than the chosen significance level. We say that : Type I error is “controlled” at significance level \(\alpha\), which means that the probability of miscarriage of justice (Type I error) does not exceed \(\alpha\).
Q (and A):
Draw a 2 by 2 table showing the connection between “truth” (\(H_0\) true or \(H_0\) false) and “action” (reject “\(H_0\)” and accept “\(H_0\)”) - and place the two types of errors in the correct position within the table. What else should be written within the last two cells?
Hypothesis test on \(\beta_j\)
In our MLR the test statistic for testing \(H_0: \beta_j=0\) is \[ T_0=\frac{\hat{\beta}_j-0}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p}\] Inserting observed values (and estimates) we get the numerical value \(t_0\).
We would in a two-sided setting reject \(H_0\) for large values of \(\text{abs}(t_0)\). We may rely on calculating a \(p\)-value.
The \(p\)-value
A \(p\)-value \(p({\bf Y})\) is a test statistic satisfying \(0 \leq p({\bf Y}) \leq 1\) for every vector of observations \({\bf Y}\).
Small values give evidence that \(H_1\) is true.
In single hypothesis testing, if the \(p\)-value is less than the chosen significance level (chosen upper limit for the probability of committing a type I error), then we reject the null hypothesis, \(H_0\). The chosen significance level is often referred to as \(\alpha\).
A \(p\)-value is valid if \[ P(p({\bf Y}) \leq \alpha) \leq \alpha\] for all \(\alpha\), \(0 \leq \alpha \leq 1\), whenever \(H_0\) is true, that is, if the \(p\)-value is valid, rejection on the basis of the \(p\)-value ensures that the probability of type I error does not exceed \(\alpha\).
If \(P(p({\bf Y}) \leq \alpha) = \alpha\) for all \(\alpha\), \(0 \leq \alpha \leq 1\), the \(p\)-value is called an exact \(p\)-value.
In our MLR we use the \(t\)-distribution to calculate \(p\)-values for our two-sided test situation \(H_0: \beta_j=0\) vs. \(H_1: \beta_j \neq 0\). Assume we have observed that our test statistic \(T_0\) takes the numerical value \(t_0\). Since the \(t\)-distribution is symmetric around \(0\) we have
\[p\text{-value}=P(T_0>\text{abs}(t_0))+P(T_0<-\text{abs}(t_0))=2\cdot P(T_0>\text{abs}(t_0)).\]
We reject \(H_0\) if our calculated \(p\)-value is below our chosen significance level. We often choose the significance level \(\alpha=0.05\).
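In R the two-sided \(p\)-value can be computed directly from the \(t\)-statistic; a sketch on simulated data (compare the last two lines):

```r
# Sketch: two-sided p-value from the t-statistic, compared with summary()
set.seed(5)
n <- 25
x <- rnorm(n)
y <- 1 + 0.4 * x + rnorm(n)
fit <- lm(y ~ x)
p <- length(coef(fit))
t0 <- summary(fit)$coefficients["x", "t value"]
pval <- 2 * pt(abs(t0), df = n - p, lower.tail = FALSE)   # 2 * P(T > |t0|)
pval
summary(fit)$coefficients["x", "Pr(>|t|)"]                # same value
```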
Munich rent index hypothesis test
We look at the print-out from using summary on the fitted lm object.
require(gamlss.data)
colnames(rent99)
## [1] "rent" "rentsqm" "area" "yearc" "location" "bath"
## [7] "kitchen" "cheating" "district"
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
summary(fit)
##
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.41 -89.17 -6.26 82.96 1000.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.9733 11.6549 -1.885 0.0595 .
## area 4.5788 0.1143 40.055 < 2e-16 ***
## location2 39.2602 5.4471 7.208 7.14e-13 ***
## location3 126.0575 16.8747 7.470 1.04e-13 ***
## bath1 74.0538 11.2087 6.607 4.61e-11 ***
## kitchen1 120.4349 13.0192 9.251 < 2e-16 ***
## cheating1 161.4138 8.6632 18.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4494
## F-statistic: 420 on 6 and 3075 DF, p-value: < 2.2e-16
Q (and A):
Where is hypothesis testing performed here, and which hypotheses are rejected at level \(0.01\)? A: The column named t value gives the numerical value of the \(t\)-statistic, and the column Pr(>|t|) gives the \(p\)-value for the two-sided test.
Will the test statistics and \(p\)-values change if we change the regression model? A: Yes, unless our design matrix has orthogonal columns.
What is the relationship between performing a hypothesis test and constructing a CI? Remember:
require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
confint(fit)
## 2.5 % 97.5 %
## (Intercept) -44.825534 0.8788739
## area 4.354674 4.8029443
## location2 28.579849 49.9405909
## location3 92.970636 159.1443278
## bath1 52.076412 96.0311030
## kitchen1 94.907671 145.9621578
## cheating1 144.427555 178.4000215
A: If a \((1-\alpha)100\)% CI covers the hypothesized value (here 0), then this is a sensible value for the parameter, and it can be shown that the \(p\)-value then will be larger than \(\alpha\). And if a \((1-\alpha)100\)% CI does not cover the hypothesized value (here 0), then this is not a sensible value for the parameter, and it can be shown that the \(p\)-value then will be smaller than \(\alpha\).
Testing linear hypotheses in regression
We study a normal linear regression model with \(p=k+1\) covariates, and refer to this as model A (the larger model). We then want to investigate null and alternative hypotheses of the following type(s): \[\begin{eqnarray*} H_0: \beta_{j}&=&0 \text{ vs. } H_1:\beta_j\neq 0\\ H_0: \beta_{1}&=&\beta_{2}=\beta_{3}=0 \text{ vs. } H_1:\text{ at least one of these }\neq 0\\ H_0: \beta_{1}&=&\beta_{2}=\cdots=\beta_{k}=0 \text{ vs. } H_1:\text{ at least one of these }\neq 0\\ \end{eqnarray*}\] We call the restricted model (when the null hypothesis is true) model B, or the smaller model.
These null hypotheses and alternative hypotheses can all be rewritten as a linear hypothesis \[H_0: {\bf C}{\bf \beta}={\bf d} \text{ vs. } {\bf C}{\bf \beta}\neq {\bf d} \] by specifying \({\bf C}\) to be an \(r \times p\) matrix and \({\bf d}\) to be a column vector of length \(r\).
The test statistic for performing the test is called \(F_{obs}\) and can be formulated in two ways: \[\begin{eqnarray} F_{obs}&=&\frac{\frac{1}{r} (SSE_{H_0}-SSE)}{\frac{SSE}{n-p}} \label{Fobsnested}\\ F_{obs}&=&\frac{1}{r}({\bf C}\hat{{\bf \beta}}-{\bf d})^{\text T}[\hat{\sigma}^2 {\bf C}({\bf X}^{\text T}{\bf X})^{-1}{\bf C}^{\text T}]^{-1}({\bf C}\hat{{\bf \beta}}-{\bf d}) \label{FobsCbeta} \end{eqnarray}\]where \(SSE\) is from the larger model A, \(SSE_{H_0}\) from the smaller model B, and \(\hat{{\bf \beta}}\) and \(\hat{\sigma}^2\) are estimators from the larger model A.
Testing a set of parameters - what is \({\bf C}\) and \({\bf d}\)?
We consider a regression model with intercept and five covariates, \(x_1, \ldots, x_5\). Assume that we want to know if the covariates \(x_3\), \(x_4\), and \(x_5\) can be dropped (due to the fact that none of the corresponding \(\beta_j\)s are different from zero). This means that we want to test:
\[H_0: \beta_{3}=\beta_{4}=\beta_{5}=0 \text{ vs. } H_1:\text{ at least one of these }\neq 0\] This means that our \({\bf C}\) is a \(3\times 6\) matrix and \({\bf d}\) a \(3 \times 1\) column vector \[ {\bf C}=\begin{pmatrix} 0 &0 &0 &1 &0&0 \\ 0&0&0&0&1&0\\ 0&0&0&0&0& 1\\ \end{pmatrix} \text{ and } {\bf d}=\begin{pmatrix} 0\\0\\0\\ \end{pmatrix}\]
Testing one regression parameter
If we set \({\bf C}=(0, 1, 0, \cdots, 0)\), a row vector with 1 in position 2 and 0 elsewhere, and \({\bf d}=0\) (a scalar, since \(r=1\)), then we test \[ H_0: \beta_1=0 \text{ vs. } H_1: \beta_1\neq 0.\] Now \({\bf C}\hat{\beta}=\hat{\beta}_1\) and \({\bf C}({\bf X}^{\text T}{\bf X})^{-1}{\bf C}^{\text T}=c_{11}\), so that \(F_{obs}\) is equal to the square of the \(t\)-statistic for testing a single regression parameter: \[F_{obs}=(\hat{\beta}_1-0)^T[\hat{\sigma}^2 c_{11}]^{-1}(\hat{\beta}_1-0)=T_1^2\]
Repeat the argument with \(\beta_j\) instead of \(\beta_1\).
Remark: Remember that \(T_{\nu}^2=F_{1,\nu}\).
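This remark can be checked numerically (a sketch on simulated data): both the quantiles and the test statistics obey the relation.

```r
# Sketch: T_nu^2 = F_{1,nu}, in quantiles and in test statistics
nu <- 17
all.equal(qt(0.975, nu)^2, qf(0.95, 1, nu))    # quantile relation
set.seed(6)
n <- 20
x <- rnorm(n)
y <- rnorm(n)
fit <- lm(y ~ x)
tstat <- summary(fit)$coefficients["x", "t value"]
Fstat <- anova(lm(y ~ 1), fit)$F[2]            # F-test for dropping x
all.equal(tstat^2, Fstat)
```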
Testing “significance of the regression”
If we set \({\bf C}=({\bf 0}\;\; {\bf I}_k)\), a \(k \times p\) matrix with a first column of 0s (for the intercept) followed by the \(k\times k\) identity matrix, and \({\bf d}={\bf 0}\), a column vector of \(k\) 0s, then we test \[ H_0: \beta_1=\beta_2=\cdots= \beta_k =0 \text{ vs. } H_1: \text{at least one different from zero}.\] This means we test if at least one of the regression parameters (in addition to the intercept) is different from 0. The smaller model is then the model with only the intercept, and for this model SSE\(_{H_0}\) is equal to SST (sums of squares total, see below). Let SSE be the sums-of-squares of errors for the full model. If we have \(k\) regression parameters (in addition to the intercept) then the \(F\)-statistic becomes \[ F_{obs}=\frac{\frac{1}{k}(\text{SST}-\text{SSE})}{\frac{\text{SSE}}{n-p}}\] with \(k\) and \(n-p\) degrees of freedom under \(H_0\).
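A sketch (simulated data, not the rent data) computing this \(F\)-statistic from SST and SSE and comparing with the value reported by summary:

```r
# Sketch: "significance of the regression" F-statistic from SST and SSE
set.seed(7)
n <- 30
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + x1 + 0.5 * x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)
k <- 2; p <- k + 1
SSE <- sum(residuals(fit)^2)
SST <- sum((y - mean(y))^2)
Fobs <- ((SST - SSE) / k) / (SSE / (n - p))
Fobs
summary(fit)$fstatistic["value"]   # same value
```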
require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
summary(fit)
##
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.41 -89.17 -6.26 82.96 1000.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.9733 11.6549 -1.885 0.0595 .
## area 4.5788 0.1143 40.055 < 2e-16 ***
## location2 39.2602 5.4471 7.208 7.14e-13 ***
## location3 126.0575 16.8747 7.470 1.04e-13 ***
## bath1 74.0538 11.2087 6.607 4.61e-11 ***
## kitchen1 120.4349 13.0192 9.251 < 2e-16 ***
## cheating1 161.4138 8.6632 18.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4494
## F-statistic: 420 on 6 and 3075 DF, p-value: < 2.2e-16
Q (and A): Is the regression significant? A: Yes, the F-test is given in the last row of the summary
print-out and gives a \(p\)-value clearly below any sensible significance level.
Relation to Wald test
Since \(\text{Cov}(\hat{\beta})=\sigma^2 ({\bf X}^T{\bf X})^{-1}\), then \(\text{Cov}({\bf C}\hat{\beta})={\bf C}\sigma^2({\bf X}^T{\bf X})^{-1}{\bf C}^T\), so that \({\bf C}\hat{\sigma}^2({\bf X}^T{\bf X})^{-1}{\bf C}^T\) can be seen as an estimate of \(\text{Cov}({\bf C}\hat{\beta})\). Therefore, \(F_{obs}\) can be written
\[F_{obs}=\frac{1}{r}({\bf C}\hat{{\bf \beta}}-{\bf d})^{\text T}[\widehat{\text{Cov}}({\bf C}\hat{\beta})]^{-1}({\bf C}\hat{{\bf \beta}}-{\bf d})=\frac{1}{r}W\] where \(W\) is a so-called Wald test statistic. It is known that \(W\sim \chi^2_r\) asymptotically as \(n\) becomes large. We will study the Wald test in more detail later in this course.
Asymptotic result
It can in general be shown that \[r F_{r,n-p}\stackrel{n\rightarrow \infty}{\longrightarrow} \chi^2_r.\] That is, if we have a random variable \(F\) that is \(F\)-distributed with \(r\) (numerator) and \(n-p\) (denominator) degrees of freedom, then, as \(n\) goes to infinity (\(p\) kept fixed), \(rF\) is approximately \(\chi^2\)-distributed with \(r\) degrees of freedom.
Also, if our error terms are not normally distributed, we can assume that when the number of observations becomes very large, \(rF_{r,n-p}\) is approximately \(\chi^2_r\)-distributed.
Analysis of variance decomposition and coefficient of determination, \(R^2\)
It is possible to decompose the total variability in the data, called SST (sums-of-squares total), into a part that is explained by the regression SSR (sums-of-squares regression), and a part that is not explained by the regression SSE (sums-of-squares error, or really residual).
Let \(\bar{Y}=\frac{1}{n}\sum_{i=1}^n Y_i\), and \(\hat{Y}_i={\bf x}_i^T\hat{\beta}\). Then, \[\begin{align*} \text{SST}&=\text{SSR}+\text{SSE}\\ \text{SST}&=\sum_{i=1}^n (Y_i-\bar{Y})^2={\bf Y}^T({\bf I}-\frac{1}{n}{\bf 1 1}^T){\bf Y}\\ \text{SSR}&=\sum_{i=1}^n (\hat{Y}_i-\bar{Y})^2={\bf Y}^T({\bf H}-\frac{1}{n}{\bf 1 1}^T){\bf Y}\\ \text{SSE}&=\sum_{i=1}^n (Y_i-\hat{Y}_i)^2=\sum_{i=1}^n \hat{\varepsilon}_i^2={\bf Y}^T({\bf I}-{\bf H}){\bf Y}.\\ \end{align*}\]The proof can be found in Section 3.5 in our text book Regression.
Based on this decomposition we may define the coefficient of determination (\(R^2\)) as the ratio between SSR and SST, that is \[R^2=\text{SSR}/\text{SST}=1-\text{SSE}/\text{SST}\] The interpretation of this coefficient is that the closer it is to 1 the better the fit to the data. If \(R^2=1\) then all residuals are zero - that is, perfect fit to the data.
In a simple linear regression \(R^2\) equals the squared correlation coefficient between the response and the predictor. In multiple linear regression \(R^2\) is the squared correlation coefficient between the observed and predicted response.
If we have two models M1 and M2, where model M2 is a submodel of model M1, then \[ R^2_{M_1}\ge R^2_{M_2}.\] This can be explained from the fact that \(\text{SSE}_{M_1}\le \text{SSE}_{M_2}\). (More in the Theoretical questions.)
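A small sketch (simulated data) checking the decomposition and the two expressions for \(R^2\) numerically:

```r
# Sketch: numerical check that SST = SSR + SSE, and R^2 two ways
set.seed(8)
n <- 25
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
SSR <- sum((fitted(fit) - mean(y))^2)
SSE <- sum(residuals(fit)^2)
all.equal(SST, SSR + SSE)                      # the decomposition
all.equal(SSR / SST, summary(fit)$r.squared)   # R^2 = SSR/SST
all.equal(1 - SSE / SST, summary(fit)$r.squared)
```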
Analysis of variance tables - with emphasis on sequential Type I ANOVA
It is possible to call the function anova on an lm object. What does that function do?
require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
anova(fit)
## Analysis of Variance Table
##
## Response: rent
## Df Sum Sq Mean Sq F value Pr(>F)
## area 1 40299098 40299098 1911.765 < 2.2e-16 ***
## location 2 1635047 817524 38.783 < 2.2e-16 ***
## bath 1 1676825 1676825 79.547 < 2.2e-16 ***
## kitchen 1 2196952 2196952 104.222 < 2.2e-16 ***
## cheating 1 7317894 7317894 347.156 < 2.2e-16 ***
## Residuals 3075 64819547 21080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What is produced is a sequential table of the reductions in residual sum of squares (SSE) as each term in the regression formula is added in turn. This type of ANOVA is often referred to as “Type 1” (not to be confused with type I errors).
We can produce the same table by fitting larger and larger regression models.
require(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
fit0 <- lm(rent ~ 1, data = rent99)
fit1 <- update(fit0, . ~ . + area)
fit2 <- update(fit1, . ~ . + location)
fit3 <- update(fit2, . ~ . + bath)
fit4 <- update(fit3, . ~ . + kitchen)
fit5 <- update(fit4, . ~ . + cheating)
anova(fit0, fit1, fit2, fit3, fit4, fit5, test = "F")
## Analysis of Variance Table
##
## Model 1: rent ~ 1
## Model 2: rent ~ area
## Model 3: rent ~ area + location
## Model 4: rent ~ area + location + bath
## Model 5: rent ~ area + location + bath + kitchen
## Model 6: rent ~ area + location + bath + kitchen + cheating
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3081 117945363
## 2 3080 77646265 1 40299098 1911.765 < 2.2e-16 ***
## 3 3078 76011217 2 1635047 38.783 < 2.2e-16 ***
## 4 3077 74334393 1 1676825 79.547 < 2.2e-16 ***
## 5 3076 72137441 1 2196952 104.222 < 2.2e-16 ***
## 6 3075 64819547 1 7317894 347.156 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# anova(fit0,fit1) # compare model 0 and 1 - NOT sequential anova(fit0,fit5)
# # compare model 0 and 5 - NOT sequential
If we had changed the order of adding the covariates to the model, then our anova table would also change. You might check that if you want.
See the last page of the classnotes 04.09.2017 for mathematical notation on the sequential test in anova; details on the print-out come next - NEW: now with formulas!
Details on the test anova(fit)
When running anova on one fitted regression, the \(F\)-test is calculated as for “testing linear hypotheses” - but with a slight twist. Our large model is still the full regression model (from the fitted object), but the smaller model is replaced by the change from one model to the next.
Let SSE be the sums-of-squares-error (residual sums of squares) from the full (large, called A) model - this will be our denominator (as always). For our rent example the denominator will be SSE/(n-p)=64819547/3075 (see print-out above).
However, for the numerator we are not comparing one small model with the full (large) one; we are instead looking at the change in SSE between two (smaller) models (called models B1 and B2). So, in the numerator we have the difference in SSE between models B1 and B2, scaled with the difference in the number of parameters estimated in models B1 and B2 (“number in B2 minus number in B1”, which is the same as the difference in degrees of freedom for the two models).
So, B1 could be the model with only an intercept, and B2 the model with intercept and area. Then we calculate the SSE for model B1 and for model B2, and keep the difference (here 40299098). Then we count the number of parameters in models B1 and B2 and compute “number in B2 minus number in B1” (here 1). We could instead have calculated the number of degrees of freedom in the smaller model minus the number of degrees of freedom in the larger model (which here is also 1). Our numerator is then 40299098/1.
This means that the test statistics we use are:
\[ F_0=\frac{\frac{\text{SSE}_{B1}-\text{SSE}_{B2}}{\text{df}_{B1}-\text{df}_{B2}}}{\frac{\text{SSE}_A}{\text{df}_A}}\] Remark: notice that the denominator is just \(\hat{\sigma}^2\) from the larger model A.
This makes our \(F\)-test statistic: \(f_0=\frac{40299098/1}{64819547/3075}=1911.765\) (remember that we swap from capital to small letters when we insert numerical values).
To produce a \(p\)-value for the test \[H_0: \text{“Model B1 and B2 are equally good” vs. }H_1:\text{“Model B2 is better than B1”}\] we use that \(F_0\sim F_{\text{df}_{B1}-\text{df}_{B2},\,\text{df}_A}\) under \(H_0\). In our example we compare to an \(F\)-distribution with 1 and 3075 degrees of freedom. The \(p\)-value is the “probability of observing a test statistic at least as extreme as we have”, so we calculate the \(p\)-value as \(P(F>f_0)\). This gives a \(p\)-value that is practically 0.
If you then want to use the asymptotic version (relating to a chi-square instead of the F), multiply your \(F\)-statistic by \(\text{df}_{B1}-\text{df}_{B2}\) and compare to a \(\chi^2\) distribution with \(\text{df}_{B1}-\text{df}_{B2}\) degrees of freedom, where \(\text{df}_{B1}-\text{df}_{B2}\) is the difference in the number of parameters in models B1 and B2. In our example \(\text{df}_{B1}-\text{df}_{B2}=1\).
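This by-hand calculation can be sketched in R. Since the rent data fit is only shown as a print-out above, here is a minimal, self-contained version on the built-in trees data instead (Volume as response), so the model names and numbers differ from the rent example:

```r
# By-hand F-test comparing nested models B1 and B2, with the denominator
# taken from the full model A (illustrated on the built-in trees data)
fitA  <- lm(Volume ~ Girth + Height, data = trees)  # full (large) model A
fitB1 <- lm(Volume ~ 1, data = trees)               # intercept only
fitB2 <- lm(Volume ~ Girth, data = trees)           # intercept + one covariate
sse <- function(fit) sum(residuals(fit)^2)          # residual sum of squares
ddf <- fitB1$df.residual - fitB2$df.residual        # df_B1 - df_B2
f0  <- ((sse(fitB1) - sse(fitB2)) / ddf) / (sse(fitA) / fitA$df.residual)
pval <- 1 - pf(f0, ddf, fitA$df.residual)           # exact F version
pchi <- 1 - pchisq(ddf * f0, ddf)                   # asymptotic chi-square version
```

The \(f_0\) computed here matches the first row of `anova(fitA)`, since the sequential table uses exactly this construction.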
For the anova table we do this sequentially for all models, starting with only the intercept and adding one covariate at a time until we reach the full model A. This means you need to calculate SSE and df for models of all sizes to compute all of these \(F_0\)s. Assume that we have 4 covariates that are added to the model, and call the 5 possible models (given the order of adding the covariates):
model 1: model with only intercept
model 2: model with intercept and covariate 1
model 3: model with intercept and covariate 1 and covariate 2
model 4: model with intercept and covariate 1 and covariate 2 and covariate 3
model 5: model with intercept and covariate 1 and covariate 2 and covariate 3 and covariate 4
Fit a linear model (lm) for each of the models 1-5, and store the SSE and the degrees of freedom df (number of observations minus number of parameters estimated) for each model. Call these SSE\(_1\) to SSE\(_5\) and df\(_1\) to df\(_5\).
The anova output has columns: Df Sum Sq Mean Sq F value Pr(>F) and one row for each covariate added to the model.
model 2 vs model 1: Df=df\(_1\)-df\(_2\), Sum Sq=SSE\(_1\)-SSE\(_2\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).
model 3 vs model 2: Df=df\(_2\)-df\(_3\), Sum Sq=SSE\(_2\)-SSE\(_3\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).
model 4 vs model 3: Df=df\(_3\)-df\(_4\), Sum Sq=SSE\(_3\)-SSE\(_4\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).
model 5 vs model 4: Df=df\(_4\)-df\(_5\), Sum Sq=SSE\(_4\)-SSE\(_5\), Mean Sq=Sum Sq/Df, F value=(Mean Sq)/(SSE\(_5\)/df\(_5\))=\(f_0\), Pr(>F)=pvalue=\(P(F>f_0)\).
In R the p-value is calculated as 1-pf(f0, Df, df\(_5\)) (the F-distribution needs both the numerator and the denominator degrees of freedom), or as 1-pchisq(Df*f0, Df) if the asymptotic chi-square distribution is used.
So, this is what is presented - a sequential record of the effect of adding a new covariate.
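The whole sequential table can be reproduced by hand. A sketch, again on the built-in trees data (with two covariates rather than four, so only models 1-3):

```r
# Reproduce the sequential (type 1) anova table by hand
fit1 <- lm(Volume ~ 1, data = trees)               # model 1: intercept only
fit2 <- lm(Volume ~ Girth, data = trees)           # model 2: + covariate 1
fit3 <- lm(Volume ~ Girth + Height, data = trees)  # model 3: + covariate 2 (full)
fits <- list(fit1, fit2, fit3)
sse  <- sapply(fits, function(f) sum(residuals(f)^2))
dfs  <- sapply(fits, function(f) f$df.residual)
Df     <- -diff(dfs)                               # df drops per covariate added
SumSq  <- -diff(sse)                               # change in SSE per covariate
MeanSq <- SumSq / Df
Fval   <- MeanSq / (sse[3] / dfs[3])               # denominator from full model
pval   <- 1 - pf(Fval, Df, dfs[3])
cbind(Df, SumSq, MeanSq, Fval, pval)               # compare with anova(fit3)
```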
**Q**: what if you change the order in which the covariates enter the model? Yes, then everything changes. That is the drawback of type 1 (sequential) thinking.
A competing way of thinking is called type 3 ANOVA: instead of looking sequentially at adding terms, we (as in summary) calculate the contribution of a covariate (or factor) given that all other covariates are present in the regression model. Type 3 ANOVA is available from library car as the function Anova (where the type of anova can be given as input).
Check: Take a look at the print-outs from summary and anova, and observe that for our rent data the \(p\)-values for each covariate are different, due to the different nature of the \(H_0\)s tested (sequential vs. "all others present").
If we had orthogonal columns for our different covariates the type 1 and type 3 ANOVA tables would have been equal.
Optional (beyond the scope of this course)
There is also something called a type 2 ANOVA table, but that is mainly important if we have interactions in our model, so we do not consider it here. If you want to read more, this blog post http://goanna.cs.rmit.edu.au/~fscholer/anova.php is a good read. And, in combination with different variants of dummy and effect coding, read this: http://rstudio-pubs-static.s3.amazonaws.com/65059_586f394d8eb84f84b1baaf56ffb6b47f.html
Model choice
Quality measures
To assess the quality of the regression we can report the coefficient of determination \(R^2\). However, since adding covariates to the linear regression cannot make the SSE larger, adding covariates cannot make \(R^2\) smaller. This means that SSE and \(R^2\) are only useful measures for comparing models with the same number of estimated regression parameters.
If we consider two models with the same model complexity then SSE can be used to choose between (or compare) these models.
But, if we want to compare models with different model complexity we need to look at other measures of quality for the regression.
\(R^2\) adjusted (corrected)
\[R^2_{\text{adj}}=1-\frac{\frac{SSE}{n-p}}{\frac{SST}{n-1}}=1-\frac{n-1}{n-p}(1-R^2)\] Choose the model with the largest \(R^2_{\text{adj}}\).
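A quick check of the formula, on the built-in trees data:

```r
# R^2 and R^2_adj computed from the formulas above, checked against summary()
fit <- lm(Volume ~ Girth + Height, data = trees)
n <- nrow(trees)
p <- length(coef(fit))                      # number of regression parameters
sse <- sum(residuals(fit)^2)
sst <- sum((trees$Volume - mean(trees$Volume))^2)
r2    <- 1 - sse / sst
r2adj <- 1 - (sse / (n - p)) / (sst / (n - 1))
```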
AIC Akaike information criterion
AIC is one of the most widely used criteria, and is designed for likelihood-based inference. Let \(l(\hat{\beta}_M,\tilde{\sigma}^2)\) be the log-likelihood of the data evaluated at the maximum likelihood estimates of the regression and nuisance parameters. Further, let \(\lvert M \rvert\) be the number of estimated regression parameters in our model. \[\text{AIC} =-2 \cdot l(\hat{\beta}_M,\tilde{\sigma}^2)+2(\lvert M\rvert +1)\]
For a normal regression model: \[\text{AIC} =n\ln(\tilde{\sigma}^2)+2(\lvert M\rvert +1)+C\] where \(C\) is a function of \(n\) (and will be the same for two models fitted to the same data set). Note that \(\tilde{\sigma}^2=SSE/n\) is our ML estimator (not the unbiased REML estimator), so the first term in the AIC is just a function of the SSE. For MLR, AIC and Mallows' Cp give the same result when comparing models.
Choose the model with the minimum AIC.
BIC Bayesian information criterion.
The BIC is also based on the likelihood (see notation above). \[\text{BIC} =-2 \cdot l(\hat{\beta}_M,\tilde{\sigma}^2)+\ln(n)\cdot (\lvert M\rvert +1)\]
For a normal regression model: \[ \text{BIC}= n\ln(\tilde{\sigma}^2)+\ln(n)(\lvert M\rvert +1)\] Choose the model with the minimum BIC.
AIC and BIC are motivated in very different ways, but the final result for the normal regression model is very similar. BIC has a larger penalty than AIC (\(\ln(n)\) vs. \(2\)), and will therefore often give a smaller (more parsimonious) model than AIC. In general we would not like a model that is too complex.
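A sketch verifying both formulas against R's own AIC() and BIC() for a normal regression model (built-in trees data; \(\tilde{\sigma}^2=SSE/n\) is the ML estimate):

```r
# AIC and BIC from the likelihood formulas, checked against AIC()/BIC()
fit <- lm(Volume ~ Girth + Height, data = trees)
n <- nrow(trees)
sigma2 <- sum(residuals(fit)^2) / n            # ML estimate SSE/n, not SSE/(n-p)
M <- length(coef(fit))                         # |M|: estimated regression parameters
loglik <- -n / 2 * (log(2 * pi * sigma2) + 1)  # maximized log-likelihood
aic <- -2 * loglik + 2 * (M + 1)               # the +1 counts the variance parameter
bic <- -2 * loglik + log(n) * (M + 1)
```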
Model selection strategies
All subset selection: use smart “leaps and bounds” algorithm, works fine for number of covariates in the order of 40.
Forward selection: choose a starting model (only intercept), then add one variable at each step, selected to give the best improvement in the model selection criterion. Stop when no further improvement is made.
Backward elimination: choose a starting model (the full model), then remove one variable at each step, selected to give the best improvement in the model selection criterion. Stop when no further improvement is made.
Stepwise selection: combine forward and backward.
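Backward elimination with AIC can be run with stepAIC from the (standard) MASS library; a minimal sketch on the built-in mtcars data:

```r
# Backward elimination based on AIC, using stepAIC from MASS
library(MASS)
full <- lm(mpg ~ ., data = mtcars)                    # start from the full model
best <- stepAIC(full, direction = "backward", trace = FALSE)
formula(best)   # the selected submodel; its AIC is never worse than the full model's
```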
Interactive tasks for the second week (of this module)
Theoretical questions
What is the interpretation of a 95% confidence interval? Hint: repeat experiment (on \(Y\)), on average how many CIs cover the true \(\beta_j\)?
What is the interpretation of a 95% prediction interval? Hint: repeat experiment (on \(Y\)) for a given \({\bf x}_0\).
Construct a 95% CI for \({\bf x}_0^T \beta\). Explain the connections between a CI for \(\beta_j\), a CI for \({\bf x}_0^T \beta\) and a PI for \(Y\) at \({\bf x}_0\).
Explain why using summary and anova on a fitted lm gives different \(p\)-values listed for each covariate. Maybe test out Anova in library car with type 3 ANOVA to compare?
Consider a MLR model \(A\) and a submodel \(B\) (all parameters in \(B\) are also in \(A\)). We say that \(B\) is nested within \(A\). Assume that regression parameters are estimated using maximum likelihood. Why is the following true: the SSE for model \(A\) will always be smaller than or equal to the SSE for model \(B\), and thus \(R^2\) for model \(A\) can never be worse than \(R^2\) for model \(B\)?
Take a look at the details of getting from the general AIC criterion to the version used for MLR. We will partly be using AIC in this course, but with likelihoods from the binomial, Poisson, etc.
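For the CI and PI questions above, predict() gives both intervals directly; a sketch on the built-in trees data, with a hypothetical new covariate vector x0:

```r
# CI for x0^T beta vs. PI for a new response Y at x0, via predict()
fit <- lm(Volume ~ Girth + Height, data = trees)
x0 <- data.frame(Girth = 14, Height = 75)   # hypothetical new observation
cint <- predict(fit, newdata = x0, interval = "confidence")  # CI for x0^T beta
pint <- predict(fit, newdata = x0, interval = "prediction")  # PI for new Y at x0
rbind(cint, pint)   # same fitted value; the PI is wider than the CI
```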
Munich Rent index
Last week all groups decided on using rent or rentsqm as the response, and in short, there was not really a big difference. So, now use rent as the response.
- We now want to use model selection to arrive at a good model. Start by defining which covariates you want to include and how to code them (location as dummy or effect coding). What about year of construction: is that a linear covariate? Maybe you want to make intervals in time instead? Linear or categorical for the time? What about the district? We leave that out, since we have not talked about how to use spatial covariates.
Hint: if you want to test out interval versions of year of construction, the function mutate (from dplyr) is useful:
rent99 <- rent99 %>% mutate(yearc.cat = cut(yearc, breaks = c(-Inf, seq(1920,
2000, 10)), labels = 10 * 1:9))
More on dplyr
: Tutorial: http://genomicsclass.github.io/book/pages/dplyr_tutorial.html and Cheat sheet (data wrangling): https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf and dplyr in particular: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf
There are many ways to perform model selection in MLR. One possibility is best subsets, which can be done using the regsubsets function from library leaps. You may define x from model.matrix(fit)[,-1] (not including the intercept term), and then run best=regsubsets(x=model.matrix(fit)[,-1],y=rent99$rent) and look at summary(best). Explain the print-out (with all the stars). Using Mallows' Cp (named cp in the list from summary(best)) will give the same result as using AIC (which is not available in this function). What is your preferred model? Hint: look at the R-code in Problem 2 (Figure 3) from the TMA4267V2017 exam: pdf, and maybe the solutions for the interpretation: pdf.
Check what is done if you use stepAIC. Do you get the same model choice as with best subset selection based on AIC? Why, or why not?
Dummy vs. effect coding in MLR (continued)
We have studied the data set with income, place and gender - with focus on dummy variable coding (with different reference levels) and effect coding and the interpretation of parameter estimates. Now we continue with the same data set, but with focus on hypothesis testing (linear hypotheses) and analysis of variance decomposition.
- Previously, we have read in the data and fitted linear models - look back to see what we found.
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410, 300,
320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female", "Male")),
place = factor(place, levels = c("A", "B", "C")))
- Fit the following model: model = lm(income~place-1,data=data,x=TRUE). Here x=TRUE tells the function to calculate the design matrix X, which is stored as model$x.
Examine the results with summary and anova. What parametrization is used? What is the interpretation of the parameters? Which null hypothesis is tested in the anova-call? What is the result of the hypothesis test?
- Fit the models:
model1 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.treatment"))
head(model1$x)
summary(model1)
model2 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.sum"))
head(model2$x)
summary(model2)
We have talked about dummy and effect coding of categorical covariates. What are the parametrizations used here? What is the interpretation of the parameters, and how do the parameter interpretations differ between model1 and model2?
We want to test the (one-way ANOVA) null hypothesis that there is no effect of place. Use \(F_{obs}\) to do this both with the dummy-variable and with the effect coding of the place-factor. Compare the results from the two coding strategies. Hint: in R this can be done using anova(model0,model1), where model0 is the model with only an intercept.
Suppose now that there are two factors: place and gender.
model3 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.treatment",
gender = "contr.treatment"))
summary(model3)
##
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE,
## contrasts = list(place = "contr.treatment", gender = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.500 -6.250 0.000 9.687 25.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 306.250 6.896 44.411 < 2e-16 ***
## placeB 47.500 8.446 5.624 1.67e-05 ***
## placeC 65.000 8.446 7.696 2.11e-07 ***
## genderMale 41.250 6.896 5.982 7.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared: 0.8322, Adjusted R-squared: 0.8071
## F-statistic: 33.07 on 3 and 20 DF, p-value: 6.012e-08
anova(model3)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## place 2 18100.0 9050.0 31.720 6.260e-07 ***
## gender 1 10209.4 10209.4 35.783 7.537e-06 ***
## Residuals 20 5706.2 285.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.sum",
gender = "contr.sum"))
summary(model4)
##
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE,
## contrasts = list(place = "contr.sum", gender = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.500 -6.250 0.000 9.687 25.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 364.375 3.448 105.680 < 2e-16 ***
## place1 -37.500 4.876 -7.691 2.13e-07 ***
## place2 10.000 4.876 2.051 0.0536 .
## gender1 -20.625 3.448 -5.982 7.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared: 0.8322, Adjusted R-squared: 0.8071
## F-statistic: 33.07 on 3 and 20 DF, p-value: 6.012e-08
anova(model4)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## place 2 18100.0 9050.0 31.720 6.260e-07 ***
## gender 1 10209.4 10209.4 35.783 7.537e-06 ***
## Residuals 20 5706.2 285.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What are the parameterizations? What is the interpretation of the parameters? Does the ANOVA table look different for the two parametrizations? Hint: orthogonality of design matrix for this balanced design?
- Finally, fit a model with interactions (the model formula is place*gender) for both contrasts, and check if the interaction effect is significant.
Simulations in R
Make R code that shows the interpretation of a 95% CI for \(\beta_j\). Hint: Theoretical question 1.
Make R code that shows the interpretation of a 95% PI for a new response at \({\bf x}_0\). Hint: Theoretical question 2.
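A possible sketch for the first simulation task (CI coverage), with simulated data and a hypothetical true slope \(\beta_1=2\):

```r
# Simulate the coverage of a 95% CI for beta_1: repeat the experiment on Y
# many times (x held fixed) and count how often the CI covers the true value
set.seed(1)
n <- 50; beta0 <- 1; beta1 <- 2; sigma <- 1
x <- runif(n)
covers <- replicate(1000, {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  ci <- confint(lm(y ~ x))["x", ]          # 95% CI for the slope
  ci[1] < beta1 && beta1 < ci[2]
})
mean(covers)   # should be close to 0.95
```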
Quiz with Kahoot!
Around 13.45 in the IL on Wednesday:
One person from each group goes to https://kahoot.it on a mobile device or a laptop. (The lecturer will hijack the screen for showing questions, so it is difficult to use the PC.)
Give the pin (shown soon) and then give the team nick name “Group1”-“Group8” or make your own personalized group name. Then - if you want - add nicks for all group members. Work together and only provide one answer to each question for each group. In team mode there is a short “team talk” period before you can provide the answer - so you have some time. 1000 points if you answer correctly immediately, 500 if you answer when the time is up, 0 for wrong answers.
There is a prize for the winning team!
\(\odot\) Further reading
- Slightly different presentation (more focus on multivariate normal theory): Slides and written material from TMA4267 Linear Statistical Models in 2017, Part 2: Regression (by Mette Langaas).
- And, same source, but now: [Slides and written material from TMA4267 Linear Statistical Models in 2017, Part 3: Hypothesis testing and ANOVA](http://www.math.ntnu.no/emner/TMA4267/2017v/TMA4267V2017Part3.pdf)