Last changes: (21.01: classnotes, where to start Part B. 18.01: typos corrected)
We need more statistical theory than is presented in the textbook; you find it on this module page.
Part A: Simple linear regression and introduction to multiple linear regression
Construct a model to help understand the relationship between one response and one or several explanatory variables. [Correlation, or cause and effect?]
Construct a model to predict the response from a set of (one or several) explanatory variables. [More or less “black box”]
Is linear regression dull? Maybe, but it is very useful and widely used. It is important to understand because many learning methods can be seen as generalizations of linear regression.
Linear regression is a supervised and parametric method.
Munich, 1999: 3082 observations on 9 variables.
- `rent`: the net rent per month (in Euro).
- `rentsqm`: the net rent per month per square meter (in Euro).
- `area`: living area in square meters.
- `yearc`: year of construction.
- `location`: quality of location: a factor indicating whether the location is average (1), good (2) or top (3).
- `bath`: quality of bathroom: a factor indicating whether the bath facilities are standard (0) or premium (1).
- `kitchen`: quality of kitchen: 0 standard, 1 premium.
- `cheating`: central heating: a factor, 0 without central heating, 1 with central heating.
- `district`: district in Munich.

More information in Fahrmeir et al. (2013), page 5 (textbook used in TMA4267 Linear statistical models).
Interesting questions:

- Is there a relationship between `rent` and `area`?
- Is the effect of `area` on `rent` the same for apartments at average, good and top location? (interaction)

The simple linear regression model is \[Y = f(x)+ \varepsilon= \beta_0 + \beta_1 x + \varepsilon\]
Here the model parameters \(\beta_0\), \(\beta_1\) and \(\sigma^2\) are unknown and have to be estimated from observed data.
We use the \(\hat{}\) symbol for estimated or predicted values.
Assume a linear relationship between the monthly net rent (`rent`) and the living area (`area`) in square meters.

In R: fit the simple linear regression model using the `lm` function to the data, and plot the data together with the fitted model.
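A minimal sketch of how this can be done (assuming the Munich data are available as `rent99` from the `gamlss.data` package, which is loaded later in these notes):

library(gamlss.data)  # provides the rent99 data set
munich1.lm = lm(rent ~ area, data = rent99)  # fit the simple linear regression
plot(rent99$area, rent99$rent)  # the data
abline(munich1.lm, col = "blue")  # the fitted line
coef(munich1.lm)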
## (Intercept) area
## 134.592194 4.821464
We see that the model fits the data quite well. It captures the essence. It looks like a linear relationship between the response `rent` and the covariate `area` is a good approximation.
Q: Is the fitted (blue) line the same as the true relationship between `area` and `rent`?

A: To see the true relationship between `area` and `rent` we would need to plot the relationship between all units in the population - and if we assume the relationship is linear, that line might deviate from our blue line.

Assume that we have a data set consisting of \(n\) observation pairs \((x_i, Y_i)\) for \(i = 1,2,..., n\). We usually assume the pairs are independent.
Further, assume we fit a linear model to our data. For a single pair of observations \((x_i, Y_i)\) it can be written as \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i\).
The \(i\)-th residual of the model is the difference between the \(i\)-th observed response value and the \(i\)-th predicted value, and is written as: \[e_i = Y_i - \hat{Y}_i.\]
We may regard the residuals as predictions (not estimates) of the error terms \(\varepsilon_i\).
Remark: the error terms are random variables and can not be estimated - they can be predicted. (It is only for parameters we find parameter estimates.)
The residual sum of squares (RSS) is the squared sum of all residuals \[\begin{aligned} \text{RSS} &= e_1^2+e_2^2+...+e_n^2 \\ &= (Y_1 - \hat{\beta}_0 - \hat{\beta}_1 x_1)^2 + (Y_2 -\hat{\beta}_0 - \hat{\beta}_1 x_2)^2+ ... + (Y_n -\hat{\beta}_0 - \hat{\beta}_1 x_n)^2 \end{aligned}\]
We find the parameter estimates for \(\beta_0\) and \(\beta_1\) by minimizing the RSS.
Least squares estimators:
\[\hat{\beta}_0 = \bar{Y}-\hat{\beta}_1 \bar{x}\] and \[\hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n(x_i-\bar{x})^2},\] where \(\bar{Y} = \frac{1}{n} \sum_{i=1}^n Y_i\) and \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\) are the sample means.
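A small sketch checking these formulas numerically against `lm` (assuming `rent99` is available as above):

x = rent99$area
y = rent99$rent
beta1 = sum((x - mean(x)) * (y - mean(y)))/sum((x - mean(x))^2)  # slope estimate
beta0 = mean(y) - beta1 * mean(x)  # intercept estimate
c(beta0, beta1)  # should equal coef(lm(rent ~ area, data = rent99))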
The coefficient estimates from our simple linear model fit to the Munich rent index data can be printed using the `coef` function in R:
coef(munich1.lm)
## (Intercept) area
## 134.592194 4.821464
Q: Explain what these two values mean.
(We will derive a general version of these formulas for multiple linear regression, since without matrix notation this is very cumbersome - see e.g. “Egenskaper til estimatorene” (properties of the estimators) on the TMA4240/TMA4245 Thematic pages.)
The standard errors of the estimates are given by the following formulas: \[\text{Var}(\hat{\beta}_0)=\text{SE}(\hat{\beta}_0)^2 = \sigma^2 \Big [ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i -\bar{x})^2} \Big]\] and \[\text{Var}(\hat{\beta}_1)=\text{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}.\] In addition \(\text{Cov}(\hat{\beta_0},\hat{\beta_1})\) is in general different from zero.
We see how the estimated coefficient will vary when our experiment is repeated.
Remark: SE = standard error vs SD = standard deviation. Since we write SE of a random variable, this means the standard deviation of this variable (the square root of its variance) - so for us SD and SE will be the same.

In general there is much confusion about SD and SE in articles: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1255808/
Q: The best design
Assume that we know the value of \(\sigma^2\). Now, observe that
\[\text{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i-\bar{x})^2}\] depends only on the design of the \(x_i\)'s. Which of the three designs below gives the smallest standard error of \(\hat{\beta}_1\)?
x1 = seq(1:10)
x2 = c(rep(1, 5), rep(10, 5))
x3 = runif(10, 1, 10)
sd1 = sqrt(1/sum((x1 - mean(x1))^2))
sd2 = sqrt(1/sum((x2 - mean(x2))^2))
sd3 = sqrt(1/sum((x3 - mean(x3))^2))
print(c(sd1, sd2, sd3))
## [1] 0.11009638 0.07027284 0.11512750
A: the second design - all observations at extremes.
Remember - residual sum of squares: \(\text{RSS}=\sum_{i=1}^n (Y_i-\hat{\beta}_0-\hat{\beta_1}x_{i})^2\).
The residual standard error, RSE, is given by \[\text{RSE} =\sqrt{\frac{1}{n-2} \text{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n (Y_i -\hat{Y}_i)^2}\] and is an estimate of \(\sigma\), that is, the standard deviation of the error term \(\varepsilon\). This is the so-called restricted maximum likelihood estimator, and it is unbiased.

RSE is related to the amount the response values deviate from the estimated regression line. Recall that we will always have observations with noise.

RSE expresses the lack of fit of the model to the data. It is measured in units of \(Y\), hence its value may be hard to interpret.
If we assume that the simple linear regression is a good model, observation pairs \((x_i,Y_i)\) are independent for \(i=1,\ldots,n\) and that \(\varepsilon_i\sim N(0,\sigma^2)\) then it can be shown that
\[\frac{\text{RSE}^2(n-2)}{\sigma^2}= \frac{\sum_{i=1}^n (Y_i -\hat{Y}_i)^2}{\sigma^2}\sim \chi^2_{n-2}\]
In R we can get a summary of our fitted linear model by calling the `summary` function. In this output we see the estimated coefficient values in the first column and the estimated standard errors in the second column. Thus \(\hat{\text{SE}}(\hat{\beta}_0) = 8.6135\) and \(\hat{\text{SE}}(\hat{\beta}_1) = 0.1206\).
summary(munich1.lm)
##
## Call:
## lm(formula = rent ~ area, data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -786.63 -104.88 -5.69 95.93 1009.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.5922 8.6135 15.63 <2e-16 ***
## area 4.8215 0.1206 39.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 158.8 on 3080 degrees of freedom
## Multiple R-squared: 0.3417, Adjusted R-squared: 0.3415
## F-statistic: 1599 on 1 and 3080 DF, p-value: < 2.2e-16
To illustrate this point further, we fit four models to our Munich rent index data set. Each of the models has been fit using a random sample of a fraction of our observations (1/4). We see how our fitted line changes given a “new” set of observations.
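A sketch of how such repeated fits could be produced (the exact sampling used for the figure is not shown here; the seed and grouping below are arbitrary choices):

set.seed(1)  # arbitrary seed, for reproducibility only
groups = sample(rep(1:4, length.out = nrow(rent99)))  # split the data into 4 random quarters
sapply(1:4, function(g) coef(lm(rent ~ area, data = rent99[groups == g, ])))  # 4 sets of estimates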
(We will derive a general version for multiple linear regression.)
For \(j=0,1\) we have \(\hat{\beta}_j \sim N(\beta_j,\text{Var}(\hat{\beta}_j))\) and \[T_j =\frac{\hat{\beta}_j-\beta_j}{\sqrt{\widehat{\text{Var}}(\hat{\beta}_j)}} \sim t_{n-2},\] the \(t\)-distribution with \(n-2\) degrees of freedom.
See Rintermediate-sol.html for more on the \(t\)-distribution in R.
For the slope in the simple linear regression this is shown, together with inference for \(\beta_1\), in this video from TMA4240/TMA4245 Statistics (in Norwegian - but formulas should still be understood in English).
The figure shows the \(t_{10}\) distribution, where the marked points are where the area under the curve to the right and to the left is \(\alpha/2=0.025\).

The \(t\)-distribution can be used to create confidence intervals for the regression parameters. The lower and upper limits of a 95% confidence interval for \(\beta_j\) are \[\hat{\beta}_j \pm t_{\alpha/2,n-2} \cdot\text{SE} (\hat{\beta}_j) \quad j=0, 1.\] The interpretation of this confidence interval is that (before we have constructed the interval) there is a 95% chance that the interval will contain the true value of \(\beta_j\).
If \(n\) is large, the normal approximation to the \(t\)-distribution can be used (and is used in the textbook).
Q: Calculate the confidence interval for the slope parameter in the `munich1.lm` model by finding the numbers you need from the `summary` output. Here \(t_{0.025,n-2}=1.961\).
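A quick numerical check of this calculation (a sketch, using the estimate and standard error from the `summary` output above):

4.8215 + c(-1, 1) * 1.961 * 0.1206  # approx (4.585, 5.058), matching confint below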
We can find confidence intervals in R using the `confint` function on an `lm` object:
confint(munich1.lm)
## 2.5 % 97.5 %
## (Intercept) 117.703417 151.480972
## area 4.585017 5.057912
Remark: the argument `level = 0.99` will give a 99% CI.
Based on the joint distribution of the intercept and slope it is possible to find the distribution for the linear predictor \(\hat{\beta}_0+\hat{\beta}_1 x\), and then confidence intervals for \(\beta_0+\beta_1 x\).
The figures show the fitted line and the confidence interval for the (true) regression line based on all (left) and the first 100 observations (right) in the Munich rent index data. Observe the effect of the sample size on the width of the CIs.
In single hypothesis testing we are interested in testing one null hypothesis against an alternative hypothesis. In linear regression the hypothesis is often about a regression parameter \(\beta_j\): \[H_0: \beta_j=0 \text{ vs. } H_1: \beta_j\neq 0\]
“Reject \(H_0\) when \(H_0\) is true” = “false positive” = “type I error” = “miscarriage of justice”. These are our fake news, which are very important for us to avoid.

“Fail to reject \(H_0\) when \(H_1\) is true (and \(H_0\) is false)” = “false negative” = “type II error” = “guilty criminal goes free”.
We choose to reject \(H_0\) at some significance level \(\alpha\) if the \(p\)-value of the test (see below) is smaller than the chosen significance level. We say that : Type I error is “controlled” at significance level \(\alpha\), which means that the probability of miscarriage of justice (Type I error) does not exceed \(\alpha\).
Q: Draw a 2 by 2 table showing the connection between the truth (\(H_0\) true or \(H_1\) true) and the conclusion of the test (reject \(H_0\) or not), and place the two types of errors in the correct positions within the table.

What else should be written in the last two cells?
In linear regression models our test statistic for testing \(H_0: \beta_j=0\) is \[T_0=\frac{\hat{\beta}_j-0}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-2}\] where \(c_{jj}\hat{\sigma}^2=\widehat{\text{Var}}(\hat{\beta}_j)\).
Inserting the observed values (and estimates) we get the observed value of the test statistic, \(t_0\).

In a two-sided setting we would reject \(H_0\) for large values of \(\text{abs}(t_0)\). We may rely on calculating a \(p\)-value.
A p-value is a test statistic satisfying \(0 \leq p({\bf Y}) \leq 1\) for every vector of observations \({\bf Y}\).
Small values give evidence that \(H_1\) is true.
In single hypothesis testing, if the p-value is less than the chosen significance level (chosen upper limit for the probability of committing a type I error), then we reject the null hypothesis, \(H_0\). The chosen significance level is often referred to as \(\alpha\).
A p-value is valid if \[ P(p({\bf Y}) \leq \alpha) \leq \alpha\] for all \(\alpha\), \(0 \leq \alpha \leq 1\), whenever \(H_0\) is true, that is, if the \(p\)-value is valid, rejection on the basis of the \(p\)-value ensures that the probability of type I error does not exceed \(\alpha\).
If \(P(p({\bf Y}) \leq \alpha) = \alpha\) for all \(\alpha\), \(0 \leq \alpha \leq 1\), the \(p\)-value is called an exact p-value.
In our linear regression we use the \(t\)-distribution to calculate p-values for our two-sided test situation \(H_0: \beta_j=0\) vs. \(H_1: \beta_j \neq 0\). Assume we have observed that our test statistic \(T_0\) takes the numerical value \(t_0\). Since the \(t\)-distribution is symmetric around \(0\) we have
\[p\text{-value}=P(T_0>\text{abs}(t_0))+P(T_0<-\text{abs}(t_0))=2\cdot P(T_0>\text{abs}(t_0)).\]
We reject \(H_0\) if our calculated \(p\)-value is below our chosen significance level. We often choose the significance level \(\alpha=0.05\).
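As a small worked example, the \(p\)-value for the slope in `munich1.lm` can be computed directly from the \(t\)-statistic (a sketch, using the numbers from the `summary` output):

t0 = 4.8215/0.1206  # estimate divided by standard error, approx 39.98
2 * pt(abs(t0), df = 3080, lower.tail = FALSE)  # two-sided p-value, here essentially 0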
Q: Comment on the \(p\)-values listed in the summary output from fitting the simple linear regression of `rent` on `area`. Then, pinpoint \(\hat{\sigma}=\text{RSE}\). Are there entries in the output that you do not (yet) know what are?
##
## Call:
## lm(formula = rent ~ area, data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -786.63 -104.88 -5.69 95.93 1009.68
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 134.5922 8.6135 15.63 <2e-16 ***
## area 4.8215 0.1206 39.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 158.8 on 3080 degrees of freedom
## Multiple R-squared: 0.3417, Adjusted R-squared: 0.3415
## F-statistic: 1599 on 1 and 3080 DF, p-value: < 2.2e-16
The total sum of squares is defined as \[\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2,\] and is proportional to the estimated variance of the response variable.

The \(R^2\) statistic is the fraction of variance explained by the model and is given by \[R^2 = \frac{\text{TSS}-\text{RSS}}{\text{TSS}}= 1-\frac{\text{RSS}}{\text{TSS}}=1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\bar{y})^2}.\] The value is between 0 and 1, and we want the \(R^2\) statistic to be as high as possible. This statistic is independent of the scale of \(Y\).
For a simple linear regression model the squared correlation \(r^2\) is equal to \(R^2\).
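A sketch verifying this numerically for `munich1.lm` (assuming `rent99` as before):

rss = sum(residuals(munich1.lm)^2)
tss = sum((rent99$rent - mean(rent99$rent))^2)
c(1 - rss/tss, cor(rent99$area, rent99$rent)^2, summary(munich1.lm)$r.squared)  # all three agree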
Q: Look back at the `summary` output for the `munich1.lm` model. What is the value of the \(R^2\) statistic? Based on this value, would you conclude that this model gives a good fit to the data?
Back to the Munich rent index data, but now we want to include more than one covariate. Suggestions?
- `rent`: the net rent per month (in Euro).
- `rentsqm`: the net rent per month per square meter (in Euro).
- `area`: living area in square meters.
- `yearc`: year of construction.
- `location`: quality of location: a factor indicating whether the location is average (1), good (2) or top (3).
- `bath`: quality of bathroom: a factor indicating whether the bath facilities are standard (0) or premium (1).
- `kitchen`: quality of kitchen: 0 standard, 1 premium.
- `cheating`: central heating: a factor, 0 without central heating, 1 with central heating.
- `district`: district in Munich.

We assume, for observation \(i\): \[Y_i= \beta_0 + \beta_{1} x_{i1} + \beta_2 x_{i2} + ... + \beta_p x_{ip} + \varepsilon_i,\] where \(i=1,2,...,n\).
Here: what is \(x_{ij}\)?
The model can be written in matrix form: \[{\bf Y}={\bf X} \boldsymbol{\beta}+{\boldsymbol{\varepsilon}}.\]
Q: write out in detail the dimension!
\({\bf Y}: (n \times 1)\) vector of responses [e.g. one of the following: rent, weight of baby, ph of lake, volume of tree]
\({\bf X}: (n \times (p+1))\) design matrix [e.g. location of flat, gestational age of baby, chemical measurement of the lake, height of tree], and \({\bf x}_i^T\) is a \((p+1)\)-dimensional row vector for observation \(i\).
\({\boldsymbol \beta}: ((p+1) \times 1)\) vector of regression parameters (intercept included)
\({\boldsymbol \varepsilon}: (n\times 1)\) vector of random errors.
We assume that the pairs \(({\bf x}_i^T,y_i)\), \(i=1,...,n\), are measured from independent sampling units. That is, the observation pair \(({\bf x}_1^T,y_1)\) is independent of \(({\bf x}_2^T,y_2)\), and so on.
Remark: other books, including the book in TMA4267 and TMA4315 define \(p\) to include the intercept. This may lead to some confusion about \(p\) or \(p+1\) in formulas…
Assume that we have `rent` as response and `area` and `bath` as covariates.

Remember: \(n=3082\), `bath=0` gives standard quality and `bath=1` premium, and the model is \[{\bf Y=X \boldsymbol\beta}+{\boldsymbol \varepsilon}.\]
fit = lm(rent ~ area + bath, data = rent99)
head(model.matrix(fit))
## (Intercept) area bath1
## 1 1 26 0
## 2 1 28 0
## 3 1 30 0
## 4 1 30 0
## 5 1 30 0
## 6 1 30 0
head(rent99$rent)
## [1] 109.9487 243.2820 261.6410 106.4103 133.3846 339.0256
Assume that \({\bf Y=X \boldsymbol\beta}+{\boldsymbol \varepsilon}\) and \(\boldsymbol\varepsilon\sim N_n({\bf 0},\sigma^2 {\bf I})\).
Q: What is then the distribution of \({\bf Y}\)?

A: \[ {\bf Y} \sim N_{n}({\bf X} {\boldsymbol\beta},\sigma^2 {\bf I})\]
Munich, 1999: 3082 observations on 9 variables.
- `rent`: the net rent per month (in Euro).
- `rentsqm`: the net rent per month per square meter (in Euro).
- `area`: living area in square meters.
- `yearc`: year of construction.
- `location`: quality of location: a factor indicating whether the location is average (1), good (2) or top (3).
- `bath`: quality of bathroom: a factor indicating whether the bath facilities are standard (0) or premium (1).
- `kitchen`: quality of kitchen: 0 standard, 1 premium.
- `cheating`: central heating: a factor, 0 without central heating, 1 with central heating.
- `district`: district in Munich.

New York, 1973: 111 observations of

- `ozone`: ozone concentration (ppm)
- `radiation`: solar radiation (langleys)
- `temperature`: daily maximum temperature (F)
- `wind`: wind speed (mph)

We use `ozone` as our response variable and `temperature`, `wind` and `radiation` as covariates.
First 6 observations in data set printed.
## ozone radiation temperature wind
## 1 41 190 67 7.4
## 2 36 118 72 8.0
## 3 12 149 74 12.6
## 4 18 313 62 11.5
## 5 23 299 65 8.6
## 6 19 99 59 13.8
\({\bf Y=X \boldsymbol\beta}+{\boldsymbol \varepsilon}\)
Assumptions:

1. \(\text{E}(\boldsymbol{\varepsilon})={\bf 0}\).
2. \(\text{Cov}(\boldsymbol{\varepsilon})=\text{E}(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T)=\sigma^2{\bf I}\).
3. The design matrix \({\bf X}\) has full rank, \(\text{rank}({\bf X})=p+1\).

The classical normal linear regression model is obtained if additionally

4. \(\boldsymbol{\varepsilon}\sim N_n({\bf 0},\sigma^2{\bf I})\) holds.

For random covariates these assumptions are to be understood conditionally on \({\bf X}\).
The interpretation of the coefficient \(\beta_j\) is now as follows: holding all other covariates fixed, \(\beta_j\) is the average effect on \(Y\) of a one-unit increase in the \(j\)th covariate.
In multiple linear regression parameters in \(\beta\) are estimated with maximum likelihood and least squares. These two methods give the same estimator when we assume the normal linear regression model.
Least squares and maximum likelihood estimator for \({\bf \beta}\): \[ \hat{\boldsymbol\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]
The estimator is found by minimizing the RSS for a multiple linear regression model: \[\begin{aligned} \text{RSS} &=\sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} -...-\hat \beta_p x_{ip} )^2 \\ &= \sum_{i=1}^n (y_i-{\bf x}_i^T \hat{\boldsymbol \beta})^2=({\bf Y}-{\bf X}\hat{\boldsymbol{\beta}})^T({\bf Y}-{\bf X}\hat{\boldsymbol{\beta}})\end{aligned}\] The estimator is found by solving the system of \(p+1\) equations \(\frac{\partial \text{RSS}}{\partial \boldsymbol \beta}={\bf 0}\), see LeastSquaresMLR.pdf for a derivation (from TMA4267V2017).
Write down the model and explain what the values under `Estimate` mean in practice.
munich3.lm = lm(rentsqm ~ area + yearc + location + bath + kitchen +
cheating, data = rent99)
summary(munich3.lm)
##
## Call:
## lm(formula = rentsqm ~ area + yearc + location + bath + kitchen +
## cheating, data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4303 -1.4131 -0.1073 1.3244 8.6452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.475484 3.603775 -12.619 < 2e-16 ***
## area -0.032330 0.001648 -19.618 < 2e-16 ***
## yearc 0.026959 0.001846 14.606 < 2e-16 ***
## location2 0.777133 0.076870 10.110 < 2e-16 ***
## location3 1.725068 0.236062 7.308 3.45e-13 ***
## bath1 0.762808 0.157559 4.841 1.35e-06 ***
## kitchen1 1.136908 0.183088 6.210 6.02e-10 ***
## cheating1 1.765261 0.129068 13.677 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.031 on 3074 degrees of freedom
## Multiple R-squared: 0.3065, Adjusted R-squared: 0.3049
## F-statistic: 194.1 on 7 and 3074 DF, p-value: < 2.2e-16
Reproduce the values under `Estimate` by calculating them without the use of `lm`.
X = model.matrix(rentsqm ~ area + yearc + location + bath + kitchen +
cheating, data = rent99)
Y = rent99$rentsqm
betahat = solve(t(X) %*% X) %*% t(X) %*% Y
print(betahat)
## [,1]
## (Intercept) -45.47548356
## area -0.03233033
## yearc 0.02695857
## location2 0.77713297
## location3 1.72506792
## bath1 0.76280784
## kitchen1 1.13690814
## cheating1 1.76526110
Q:
Model: \[{\bf Y}=\bf X \boldsymbol{\beta}+{\boldsymbol{\varepsilon}}\] with full rank design matrix \({\bf X}\).
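The `ozone.lm` object used below is not defined in the chunks shown; judging from its coefficients further down, it could be fitted as in this sketch (assuming the `ozone` data frame with the four variables listed above):

ozone.lm = lm(ozone ~ temperature + wind + radiation, data = ozone)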
head(model.matrix(ozone.lm))
## (Intercept) temperature wind radiation
## 1 1 67 7.4 190
## 2 1 72 8.0 118
## 3 1 74 12.6 149
## 4 1 62 11.5 313
## 5 1 65 8.6 299
## 6 1 59 13.8 99
head(ozone$ozone)
## [1] 41 36 12 18 23 19
Classical normal linear regression model when \[\boldsymbol{\varepsilon}\sim N_n({\bf 0},\sigma^2{\bf I}).\]
In Part A we found that : \[ {\bf Y} \sim N_{n}({\bf X} {\boldsymbol\beta},\sigma^2 {\bf I})\]
Parameter of interest is \(\boldsymbol{\beta}\) and \(\sigma^2\) is a nuisance (=parameter not of interest).
Using the least squares (and maximum likelihood) method the estimator for \(\boldsymbol\beta\) is \[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]
How does this compare to simple linear regression? Not so easy to see a connection!
\[\hat{\beta}_0 = \bar{Y}-\hat{\beta}_1 \bar{x} \text{ and } \hat{\beta}_1 = \frac{\sum_{i=1}^n(x_i-\bar{x})(Y_i-\bar{Y})}{\sum_{i=1}^n(x_i-\bar{x})^2},\]
\[ \hat\beta=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]
Often we use centered (and also scaled) data to ease interpretation. In design of experiments, orthogonal columns of the design matrix are often chosen to make \({\bf X}^T{\bf X}\) diagonal, which leads to easier interpretation and identifiability.
\[ \hat{\boldsymbol\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\]
This can be written as \(\hat{\boldsymbol\beta}={\bf C}{\bf Y}\) where \({\bf C}=({\bf X}^T{\bf X})^{-1}{\bf X}^T\).

Therefore \(\text{E}(\hat{\boldsymbol\beta})={\bf C}\text{E}({\bf Y})={\bf C}{\bf X}\boldsymbol\beta=\boldsymbol\beta\) and \(\text{Cov}(\hat{\boldsymbol\beta})={\bf C}\text{Cov}({\bf Y}){\bf C}^T=\sigma^2{\bf C}{\bf C}^T=\sigma^2({\bf X}^T{\bf X})^{-1}\).

So: \(\hat{\boldsymbol\beta}\sim N_{p+1}(\boldsymbol{\beta},\sigma^2({\bf X}^T{\bf X})^{-1})\).
See PropertiesBetahatMLR.pdf for a derivation.
\(\hat{\beta}\sim N_{p+1}(\boldsymbol{\beta},\sigma^2({\bf X}^T{\bf X})^{-1})\).
coefficients(ozone.lm)
## (Intercept) temperature wind radiation
## -64.23208116 1.65120780 -3.33759763 0.05979717
vcov(ozone.lm)
## (Intercept) temperature wind radiation
## (Intercept) 530.93558002 -5.503192281 -1.043562e+01 0.0266688733
## temperature -5.50319228 0.064218138 8.034556e-02 -0.0015749279
## wind -10.43562350 0.080345561 4.275126e-01 -0.0003442514
## radiation 0.02666887 -0.001574928 -3.442514e-04 0.0005371733
Q: Explain what all these numbers are!
\[\begin{aligned} \text{RSS} &=\sum_{i=1}^n (y_i - \hat y_i)^2 = \sum_{i=1}^n (y_i - \hat \beta_0 - \hat \beta_1 x_{i1} - \hat \beta_2 x_{i2} -...-\hat \beta_p x_{ip} )^2 \\ &= \sum_{i=1}^n (y_i-{\bf x}_i^T \hat{\boldsymbol \beta})^2=({\bf Y}-{\bf X}\hat{\boldsymbol{\beta}})^T({\bf Y}-{\bf X}\hat{\boldsymbol{\beta}})\end{aligned}\]
Restricted maximum likelihood estimator for \({\bf \sigma}^2\): \[ \hat{\sigma}^2=\frac{1}{n-p-1}({\bf Y}-{\bf X}\hat{\boldsymbol\beta})^T({\bf Y}-{\bf X}\hat{\boldsymbol\beta})=\frac{\text{RSS}}{n-p-1}\] with \(\frac{(n-p-1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p-1}\).
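A sketch checking this estimator against the “Residual standard error” reported by `summary` (using the `ozone.lm` fit from above, with \(n=111\) and \(p=3\)):

n = nrow(ozone)
p = 3
sqrt(sum(residuals(ozone.lm)^2)/(n - p - 1))  # should equal summary(ozone.lm)$sigma, approx 21.17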
\(\hat{\beta}\sim N_{p+1}(\boldsymbol{\beta},\sigma^2({\bf X}^T{\bf X})^{-1})\).
Multicollinearity: columns in design matrix (that is, the covariates) are correlated, which may lead to difficulty in “identifying” the effect of each covariate on the response, and thus large variances (and covariances) for the elements of \(\hat{\boldsymbol\beta}\).
The variance inflation factor (VIF) is the ratio of the variance of \(\hat{\beta}_j\) when fitting the model with all the chosen covariates, divided by the variance of \(\hat{\beta}_j\) in a simple linear regression with only covariate \(x_j\).
library(car)  # vif() is assumed to come from the car package
library(corrplot)  # for corrplot() used below
oz1 = as.data.frame(apply(ozone, 2, scale, scale = FALSE))
fitoz = lm(ozone ~ temperature + wind + radiation, data = oz1)
vif(fitoz)
## temperature wind radiation
## 1.431201 1.328979 1.095241
corrplot(cov2cor(vcov(fitoz)), cex.lab = 0.7)
\[\hat{\beta}\sim N_{p+1}(\boldsymbol{\beta},\sigma^2({\bf X}^T{\bf X})^{-1})\]
Statistic for inference about \(\beta_j\), where \(c_{jj}\) is the diagonal element of \(({\bf X}^T{\bf X})^{-1}\) corresponding to \(\hat{\beta}_j\): \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p-1}\] \[ P(\hat{\beta}_j-t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma} \le \beta_j \le \hat{\beta}_j+t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma})=1-\alpha\]
A \((1-\alpha)\cdot 100\%\) CI for \(\beta_j\) is obtained when we insert numerical values for the upper and lower limits: \([\hat{\beta}_j-t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma},\hat{\beta}_j+t_{\alpha/2,n-p-1}\sqrt{c_{jj}}\hat{\sigma}]\).

When we work with large samples, \(n-p-1\) becomes large and the \(t\)-distribution approaches a normal distribution, so we may use the standard normal in place of \(t_{n-p-1}\). This is done in our textbook.
fitoz = lm(ozone ~ temperature + wind + radiation, data = ozone)
confint(fitoz)
## 2.5 % 97.5 %
## (Intercept) -109.91023689 -18.5539254
## temperature 1.14884613 2.1535695
## wind -4.63376808 -2.0414272
## radiation 0.01385147 0.1057429
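As a check, a sketch of the manual computation for the `temperature` coefficient, using `qt` for the \(t\)-quantile:

est = coef(fitoz)["temperature"]
se = sqrt(diag(vcov(fitoz)))["temperature"]
est + c(-1, 1) * qt(0.975, df = fitoz$df.residual) * se  # matches the confint row above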
\[H_0: \beta_j=0 \text{ vs } H_1: \beta_j\neq 0\]
Nothing new: using \(t\)-tests based on \[ T_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{c_{jj}}\hat{\sigma}}\sim t_{n-p-1}\]
Again, \(c_{jj}\) is the diagonal element of \(({\bf X}^T{\bf X})^{-1}\) corresponding to \(\hat{\beta}_j\).
fitoz = lm(ozone ~ temperature + wind + radiation, data = ozone)
summary(fitoz)
##
## Call:
## lm(formula = ozone ~ temperature + wind + radiation, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.210 -3.556 10.124 95.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.23208 23.04204 -2.788 0.00628 **
## temperature 1.65121 0.25341 6.516 2.43e-09 ***
## wind -3.33760 0.65384 -5.105 1.45e-06 ***
## radiation 0.05980 0.02318 2.580 0.01124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.17 on 107 degrees of freedom
## Multiple R-squared: 0.6062, Adjusted R-squared: 0.5952
## F-statistic: 54.91 on 3 and 107 DF, p-value: < 2.2e-16
Q: Where in this output do we find the test statistics and \(p\)-values for the tests of \(H_0: \beta_j=0\) vs. \(H_1: \beta_j\neq 0\)?

A: The column `t value` gives the numerical value of the \(t\)-statistic and the column `Pr(>|t|)` gives the \(p\)-value for the two-sided test.

Consider the hypotheses: \[ H_0: \beta_1=\beta_2=\cdots= \beta_k =0 \text{ vs. } H_1: \text{at least one different from zero}.\]
This means that we test whether a set of regression parameters differs from zero. This is used to check whether at least one of the covariates is useful for predicting the response, and to compare two nested models (see below). To do this, \(F\)-tests are used.
\[ H_0: \beta_1=\beta_2=\cdots= \beta_p =0 \text{ vs. } H_1: \text{at least one different from zero}.\]
F-statistic
\[F=\frac{(\text{TSS-RSS})/p}{\text{RSS}/(n-p-1)} \sim F_{p,n-p-1}\]
When \(H_0\) is false we expect the numerator to be larger than the denominator, and thus \(F\) greater than 1. A \(p\)-value is calculated from the upper tail of the \(F\)-distribution.

We find \(\text{p-value}=P(F_{p,n-p-1}>f_0)\), where \(f_0\) is the numerical value of \(F\) with RSS and TSS computed from the data.
fitoz = lm(ozone ~ temperature + wind + radiation, data = ozone)
summary(fitoz)
##
## Call:
## lm(formula = ozone ~ temperature + wind + radiation, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.485 -14.210 -3.556 10.124 95.600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -64.23208 23.04204 -2.788 0.00628 **
## temperature 1.65121 0.25341 6.516 2.43e-09 ***
## wind -3.33760 0.65384 -5.105 1.45e-06 ***
## radiation 0.05980 0.02318 2.580 0.01124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.17 on 107 degrees of freedom
## Multiple R-squared: 0.6062, Adjusted R-squared: 0.5952
## F-statistic: 54.91 on 3 and 107 DF, p-value: < 2.2e-16
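As a check, the overall \(F\)-statistic can be computed directly from TSS and RSS (a sketch, using `fitoz` and the `ozone` data from above):

n = nrow(ozone)
p = 3
rss = sum(residuals(fitoz)^2)
tss = sum((ozone$ozone - mean(ozone$ozone))^2)
f0 = ((tss - rss)/p)/(rss/(n - p - 1))
c(f0, pf(f0, p, n - p - 1, lower.tail = FALSE))  # F approx 54.91 and its p-value, as in the summary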
\(H_0:\) the additional \(p-q\) coefficients in the large model are all zero vs. \(H_1\): at least one of them is different from zero.

\[F=\frac{(\text{RSS}_0-\text{RSS})/(p-q)}{\text{RSS}/(n-p-1)} \sim F_{p-q,n-p-1}\] where \(\text{RSS}_0\) is the residual sum of squares of the small model and RSS that of the large model.

When \(H_0\) is false we expect the numerator to be larger than the denominator, and thus \(F\) greater than 1. A \(p\)-value is calculated from the upper tail of the \(F\)-distribution.

We find \(\text{p-value}=P(F_{p-q,n-p-1}>f_0)\), where \(f_0\) is the numerical value of \(F\) with \(\text{RSS}_0\) and RSS computed from the data.
In R we perform the test by fitting the two models `fit.large` and `fit.small` and using `anova(fit.small, fit.large)`.
fit.large = lm(ozone ~ temperature + wind + radiation, data = ozone)
fit.small = lm(ozone ~ temperature, data = ozone)
anova(fit.small, fit.large)
## Analysis of Variance Table
##
## Model 1: ozone ~ temperature
## Model 2: ozone ~ temperature + wind + radiation
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 109 62367
## 2 107 47964 2 14403 16.066 7.921e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Once we have estimated the coefficients \(\hat\beta_0\), \(\hat\beta_1\), ..., \(\hat\beta_p\), we can make a prediction of the response \(Y_0\) for a new observation \(\mathbf x_0 = (x_1, x_2, ..., x_p)\) as before:
\[\hat Y_0 = \hat \beta_0 + \hat \beta_1 x_1 + \hat \beta_2 x_2 + ... + \hat \beta_p x_p ={\bf x}_0^T \hat{\boldsymbol\beta}.\] This is an intuitive point estimate.
Remember, one aim for regression was to “construct a model to predict the response from a set of (one or several) explanatory variables- more or less black box”.
To assess the uncertainty in this prediction we present a prediction interval (PI) for \(Y_0\).
After some work (see “details on the derivation” below):
\(P({\bf x}_0^T \hat{\beta}-t_{\alpha/2,n-p-1}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0} \le Y_0 \le {\bf x}_0^T \hat{\beta}+t_{\alpha/2,n-p-1}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0})=1-\alpha\)
A \((1-\alpha)\cdot 100\%\) PI for \(Y_0\) is obtained when we insert numerical values for the upper and lower limits: \([{\bf x}_0^T \hat{\beta}-t_{\alpha/2,n-p-1}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}, {\bf x}_0^T \hat{\beta}+t_{\alpha/2,n-p-1}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}]\).
PIs can be found in R using `predict` on an `lm` object, but make sure that `newdata` is a `data.frame` with the same column names as the original data.
Example: Using the Munich rent index data
We want to predict the rent - with a prediction interval - for an apartment with area 50, location 2 (“good”), nice bath and kitchen and with central heating.
library(gamlss.data)
fit = lm(rent ~ area + location + bath + kitchen + cheating, data = gamlss.data::rent99)
newobs = gamlss.data::rent99[1, ]
newobs[1, ] = c(NA, NA, 50, NA, 2, 1, 1, 1, NA)
predict(fit, newdata = newobs, interval = "prediction", type = "response")
## fit lwr upr
## 1 602.1298 315.5353 888.7243
Q: When should we use a prediction interval?

A: Always? It gives useful information in addition to a point prediction.

Q: Explain the output from `predict` above.

A: `fit` is the point prediction, `lwr` is the lower and `upr` the upper limit of the 95% PI (95% is the default; `level = 0.99` gives a 99% PI).
We start to look at the difference between the unobserved response \(Y_0\) (for a given covariate vector \({\bf x}_0\)) and the point prediction \(\hat{Y}_0\), \(Y_0-\hat{Y}_0\).
First, we assume that the unobserved response at covariate \({\bf x}_0\) is independent of our previous observations and follows the same distribution, that is \(Y_0 \sim N({\bf x}_0^T \beta,\sigma^2)\). Further,
\[\hat{Y}_0={\bf x}_0^T \hat{\beta} \sim N({\bf x}_0^T \beta,\sigma^2 {\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0).\]
Then, for \(Y_0-{\bf x}_0^T \hat{\beta}\) we have
\(\text{E}(Y_0-{\bf x}_0^T \hat{\beta})=0 \text{ and } \text{Var}(Y_0-{\bf x}_0^T \hat{\beta})=\text{Var}(Y_0)+\text{Var}({\bf x}_0^T \hat{\beta})=\sigma^2+\sigma^2{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0\)
so that \[Y_0-{\bf x}_0^T \hat{\beta}\sim N(0,\sigma^2 (1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0)) \]
Inserting our REML-estimate for \(\sigma^2\) gives
\[T=\frac{Y_0-{\bf x}_0^T \hat{\beta}}{\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}}\sim t_{n-p-1}.\]
Then, we start with \[ P(-t_{\alpha/2,n-p-1}\le \frac{Y_0-{\bf x}_0^T \hat{\beta}}{\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0}} \le t_{\alpha/2,n-p-1})=1-\alpha\] and solve so that \(Y_0\) is in the middle, which gives
\(P({\bf x}_0^T \hat{\beta}-t_{\alpha/2,n-p-1}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0} \le Y_0 \le {\bf x}_0^T \hat{\beta}+t_{\alpha/2,n-p-1}\hat{\sigma}\sqrt{1+{\bf x}_0^T({\bf X}^T{\bf X})^{-1}{\bf x}_0})=1-\alpha\)
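A sketch of this computation for the new apartment above, to compare with the `predict` output (the covariate vector `x0` is written out by hand here and is assumed to match the columns of the model matrix of `fit`):

X = model.matrix(fit)
x0 = c(1, 50, 1, 0, 1, 1, 1)  # intercept, area=50, location2, location3, bath1, kitchen1, cheating1
sigmahat = summary(fit)$sigma
pred = sum(x0 * coef(fit))
halfwidth = qt(0.975, df = fit$df.residual) * sigmahat * sqrt(1 + t(x0) %*% solve(t(X) %*% X) %*% x0)
c(pred, pred - halfwidth, pred + halfwidth)  # approx 602.1, 315.5, 888.7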
The coefficient of determination, \(R^2\): no change from simple linear regression.

\[R^2 = \frac{\text{TSS}-\text{RSS}}{\text{TSS}}= 1-\frac{\text{RSS}}{\text{TSS}}=1-\frac{\sum_{i=1}^n(y_i-\hat{y}_i)^2}{\sum_{i=1}^n(y_i-\bar{y})^2}.\]
The interpretation of this coefficient is that the closer it is to 1 the better the fit to the data. If \(R^2=1\) then all residuals are zero - that is, perfect fit to the data.
In a simple linear regression the \(R^2\) equals the squared correlation coefficient between the response and the predictor. In multiple linear regression \(R^2\) is the squared correlation coefficient between the observed and predicted response.
To assess the quality of the regression we can report the \(R^2\) coefficient of determination. However, since adding covariates to the linear regression cannot make the RSS larger, adding covariates cannot make \(R^2\) smaller. This means that RSS and \(R^2\) are only useful measures for comparing models with the same number of estimated regression parameters.
If we consider two models with the same model complexity then RSS can be used to choose between (or compare) these models.
But, if we want to compare models with different model complexity we need to look at other measures of quality for the regression.
\[R^2_{\text{adj}}=1-\frac{\frac{RSS}{n-p-1}}{\frac{TSS}{n-1}}=1-\frac{n-1}{n-p-1}(1-R^2)\] Choose the model with the largest \(R^2_{\text{adj}}\).
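A small sketch comparing \(R^2_{\text{adj}}\) for two models for the ozone data (the values match the summaries shown elsewhere in these notes):

summary(lm(ozone ~ temperature + wind, data = ozone))$adj.r.squared  # approx 0.574
summary(lm(ozone ~ temperature + wind + radiation, data = ozone))$adj.r.squared  # approx 0.595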
AIC is one of the most widely used criteria, and is designed for likelihood-based inference. Let \(l(\hat{\beta}_M,\tilde{\sigma}^2)\) be the maximum of the log-likelihood of the data, with the maximum likelihood estimates of the regression parameters inserted. Further, let \(\lvert M \rvert\) be the number of estimated regression parameters in our model (and then we add 1 for the estimated variance). \[\text{AIC} =-2 \cdot l(\hat{\beta}_M,\tilde{\sigma}^2)+2(\lvert M\rvert +1)\]
Choose the model with the minimum AIC.
For a normal regression model: \[\text{AIC} \propto n\ln(\tilde{\sigma}^2)+2(\lvert M\rvert +1)+C\] where C is a function of \(n\) (will be the same for two models for the same data set).
The likelihood - for the normal linear regression model: \[l(\hat{\boldsymbol\beta},{\tilde{\sigma}^2})=\text{ln}(L(\hat{\boldsymbol\beta},{\sigma^2}))=-\frac{n}{2}\text{ln}(2\pi)-\frac{n}{2}\text{ln}\tilde{\sigma}^2-\frac{1}{2\tilde{\sigma}^2} ({\bf y}-{\bf X}\hat{\boldsymbol\beta})^T({\bf y}- {\bf X}\hat{\boldsymbol\beta})\]
Note that \(\tilde{\sigma}^2=\text{RSS}/n\) - our ML estimator of the variance - so that the first term in the AIC is just a function of the RSS.
Remark: for us here we have \(2(\lvert M\rvert +1)=2(p+2)\)
The BIC is also based on the likelihood (see notation above). \[\text{BIC} =-2 \cdot l(\hat{\beta}_M,\tilde{\sigma}^2)+\ln(n)\cdot (\lvert M\rvert +1)\]
For a normal regression model: \[ \text{BIC} \propto n\ln(\tilde{\sigma}^2)+\ln(n)(\lvert M\rvert +1)\] Choose the model with the minimum BIC.
AIC and BIC are motivated in very different ways, but the final result for the normal regression model is very similar. BIC has a larger penalty than AIC (\(\log(n)\) vs. \(2\)), and will often give a smaller model (=more parsimonious models) than AIC. In general we would not like a model that is too complex.
Remark: for us here we have \(\ln(n)\cdot (\lvert M\rvert +1)=\ln(n)\cdot (p+2)\)
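In R, AIC and BIC for fitted `lm` objects can be compared directly with the `AIC` and `BIC` functions; a sketch for two candidate models for the ozone data (the object names `m1` and `m2` are arbitrary):

m1 = lm(ozone ~ temperature + wind, data = ozone)
m2 = lm(ozone ~ temperature + wind + radiation, data = ozone)
AIC(m1, m2)  # choose the model with the smallest AIC
BIC(m1, m2)  # choose the model with the smallest BIC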
Given a full multiple linear regression model, it is often the case that not all of the covariates are of equal importance for predicting the response. Some of the covariates could be removed from the model without affecting the fit. At the same time one would gain a more interpretable model with fewer parameters. We will here briefly discuss some subset selection procedures. These will be discussed in greater detail in Module 6.

Forward selection: The forward selection procedure starts with the null model (no covariates, only \(\beta_0\)). In a step-wise procedure, additional covariates are added one at a time. The inclusion of a covariate is based on a quality-of-fit measure (e.g. \(R^2_{\text{adj}}\) or AIC), and the covariate giving the greatest improvement of the fit is added to the model at each step.

Backward selection: The backward selection procedure goes the opposite way: here the starting point is the full multiple linear regression model (with all covariates included). At each step of the procedure, one covariate is removed from the model. The removal of this covariate is based on a quality-of-fit measure, where the covariate corresponding to the smallest decrease in fit is removed at each step. (A sketch using `step` follows below.)

Mixed selection: Combines forward and backward selection.

All subset selection: A procedure where all possible combinations of covariates are tested. This approach finds the best subset according to the chosen criterion, but is computationally expensive.
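A minimal sketch of backward selection based on AIC, using the `step` function and starting from the full ozone model (AIC is the criterion used by `step` by default; Module 6 covers more options):

full = lm(ozone ~ temperature + wind + radiation, data = ozone)
step(full, direction = "backward")  # removes covariates one at a time as long as AIC improves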
Model selection is a major point in Module 6.
Residuals can be used to check model assumptions, and also to discover outliers.
It can be shown that the vector of residuals, \({\bf e}=(e_1,e_2,\ldots,e_n)^T\), has a (singular) normal distribution with mean \(\text{E}({\bf e})={\bf 0}\) and covariance matrix \(\text{Cov}({\bf e})=\sigma^2({\bf I}-{\bf H})\), where \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\).

This means that the residuals may have different variances and may also be correlated.
Q: How can we say that the residuals can have different variance and may be correlated? Why is that a problem?
A:
We would like to check the model assumptions - we see that they are all connected to the error terms. But we have not observed the error terms \(\varepsilon\), so they cannot be used for this. However, we have made “predictions” of the errors - our residuals - and we want to use our residuals to check the model assumptions.

That is, we want to check that our errors are independent, homoscedastic (same variance for each observation) and not dependent on our covariates - and we want to use the residuals (observed) in place of the errors (unobserved). It would therefore be good if the residuals had these properties when the underlying errors do. To amend our problem we try to fix the residuals so that they at least have equal variances. We do that by working with standardized or studentized residuals.
Standardized residuals:
\[r_i=\frac{e_i}{\hat{\sigma}\sqrt{1-h_{ii}}}\] where \(h_{ii}\) is the \(i\)th diagonal element of the hat matrix \({\bf H}\).
In R you can get the standardized residuals from an `lm` object (named `fit`) by `rstandard(fit)`.
Studentized residuals:
\[r^*_i=\frac{e_i}{\hat{\sigma}_{(i)}\sqrt{1-h_{ii}}}\] where \(\hat{\sigma}_{(i)}\) is the estimate of \(\sigma\) in a model where observation number \(i\) is omitted. This seems like a lot of work, but it can be shown that the studentized residuals can be calculated directly from the standardized residuals.
In R you can get the studentized residuals from an `lm` object (named `fit`) by `rstudent(fit)`.
More information on the diagnostic plots in R: http://data.library.virginia.edu/diagnostic-plots/ and http://ggplot2.tidyverse.org/reference/fortify.lm.html

You can use the function `fortify.lm` in `ggplot2` to create a data frame from an `lm` object, which `ggplot` uses automatically when given an `lm` object. This can be used to produce the diagnostic plots.
For simplicity we use the Munich rent index data with `rent` as response and `area` as the only covariate. (You may change the model to a more complex one and rerun the code chunks.)
fit <- lm(rent ~ area, data = rent99) # Run a regression analysis
format(head(fortify(fit)), digits = 4L)
## rent area .hat .sigma .cooksd .fitted .resid .stdresid
## 1 109.9 26 0.001312 158.8 5.870e-04 260.0 -150.00 -0.9454
## 2 243.3 28 0.001219 158.8 1.678e-05 269.6 -26.31 -0.1658
## 3 261.6 30 0.001130 158.8 6.956e-06 279.2 -17.60 -0.1109
## 4 106.4 30 0.001130 158.8 6.711e-04 279.2 -172.83 -1.0891
## 5 133.4 30 0.001130 158.8 4.779e-04 279.2 -145.85 -0.9191
## 6 339.0 30 0.001130 158.8 8.032e-05 279.2 59.79 0.3768
# to see what fortify implicitly does in ggplot, see
# ?ggplot2::fortify.lm
A plot with the fitted values of the model on the x-axis and the residuals on the y-axis shows whether the residuals exhibit non-linear patterns. The plot can be used to check the assumption of a linear relationship between the response and the covariates. If the residuals are spread around a horizontal line with no distinct patterns, this is a good indication of no non-linear relationships, and hence a good model.
Does this look like a good plot for this data set?
ggplot(fit, aes(.fitted, .stdresid)) + geom_point(pch = 21) + geom_hline(yintercept = 0,
linetype = "dashed") + geom_smooth(se = FALSE, col = "red", size = 0.5,
method = "loess") + labs(x = "Fitted values", y = "Standardized residuals",
title = "Fitted values vs standardized residuals", subtitle = deparse(fit$call)) +
theme_minimal()
A: Ok linear assumption, but not constant spread.
This plot shows whether the residuals are Gaussian (normally) distributed. If they follow a straight line it is an indication that they are, otherwise they are probably not.
ggplot(fit, aes(sample = .stdresid)) + stat_qq(pch = 19) + geom_abline(intercept = 0,
slope = 1, linetype = "dotted") + labs(x = "Theoretical quantiles",
y = "Standardized residuals", title = "Normal Q-Q", subtitle = deparse(fit$call)) +
theme_minimal()
library(nortest)
ad.test(rstudent(fit))
##
## Anderson-Darling normality test
##
## data: rstudent(fit)
## A = 6.4123, p-value = 9.809e-16
A: Not normal.
This is also called a spread-location plot. It shows whether the residuals are spread equally along the ranges of the predictors. It can be used to check the assumption of equal variance (homoscedasticity). A good plot is one with a horizontal line and randomly spread points.
Is this plot good for your data?
ggplot(fit, aes(.fitted, sqrt(abs(.stdresid)))) + geom_point() + geom_smooth(se = FALSE,
col = "red", size = 0.5, method = "loess") + labs(x = "Fitted values",
y = expression(sqrt("Standardized residuals")), title = "Scale-location",
subtitle = deparse(fit$call)) + theme_minimal()
A: Confirms our observation of not constant variance.
This plot can reveal influential outliers. Not all outliers are influential in linear regression; even though data have extreme values, they might not be influential in determining the regression line (the results do not differ much if they are removed from the data set). Influential outliers can be seen as observations that do not follow the trend of the majority of the observations. In `plot.lm`, dashed lines are used to indicate the Cook’s distance, instead of using the size of the dots as is done here.

Cook’s distance is based on the Euclidean distance between \(\mathbf{\hat{y}}\) (the fitted values) and \(\mathbf{\hat{y}}_{(i)}\) (the fitted values calculated when the \(i\)-th observation is omitted from the regression). It is thus a measure of how much the model is influenced by observation \(i\). The distance is scaled, and a rule of thumb is to examine observations with Cook’s distance larger than 1, and to give some attention to those with Cook’s distance above 0.5.

Leverage is defined by the diagonal elements of the hat matrix, i.e., the leverage of the \(i\)-th data point is \(h_{ii}\), the \(i\)-th diagonal element of \(\mathbf{H = X(X^TX)^{-1}X^T}\). A large leverage indicates that observation \(i\) has a large influence on the estimation results, and that the covariate values \(\mathbf{x}_i\) are unusual.
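The quantities shown in the plot below can also be extracted numerically; a small sketch for the current `fit`:

head(cooks.distance(fit))  # Cook's distance for the first observations
head(hatvalues(fit))  # leverages h_ii, the diagonal of the hat matrix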
ggplot(fit, aes(.hat, .stdresid)) + geom_smooth(se = FALSE, col = "red",
size = 0.5, method = "loess") + geom_point(aes(size = .cooksd)) +
scale_size_continuous("Cook's dist.") + labs(x = "Leverage", y = "Standardized residuals",
title = "Residuals vs Leverage", subtitle = deparse(fit$call)) +
theme_minimal()
A: Some observations do not fit our model well, but if we fit a more complex model this may change.
This section is a self-study section, where the dummy-variable part is the most important and will be used in this course.

See Rob Tibshirani explain this (from ca. 9 minutes into the video).

Qualitative predictors can be included in a linear regression model by introducing dummy variables.
Example: consider our rent data set with `rent` as response, continuous covariate `area` and categorical covariate `location`. Let `location` be a factor with levels `average`, `good`, `top`.
library(gamlss.data)
library(dplyr)
library(GGally)
ds = dplyr::select(rent99, location, area, rent)
levels(ds$location)
# change to meaningful names
levels(ds$location) = c("average", "good", "top")
ggpairs(ds)
Q: comment on what you see in the `ggpairs` plot.
Categorical covariates may either be ordered or unordered. We will only consider unordered categories here. In general, we would like to estimate regression coefficients for all levels of a categorical covariate. However, if we want to include an intercept in our model we can only include dummy variables for one less than the number of levels - or else our design matrix will not have full rank.
Q: Assume you have a categorical variable with three levels. Check for yourself that making a design matrix with one intercept and three columns with dummy (0-1) variable coding will result in a matrix that is singular.
# make 'wrong' dummy variable coding with 3 columns
n = length(ds$location)
X = cbind(rep(1, n), ds$area, rep(0, n), rep(0, n), rep(0, n))
X[ds$location == "average", 3] = 1
X[ds$location == "good", 4] = 1
X[ds$location == "top", 5] = 1
X[c(1, 3, 69), ]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 26 0 1 0
## [2,] 1 30 1 0 0
## [3,] 1 55 0 0 1
require(Matrix)
dim(X)
## [1] 3082 5
rankMatrix(X)
## [1] 4
## attr(,"method")
## [1] "tolNorm2"
## attr(,"useGrad")
## [1] FALSE
## attr(,"tol")
## [1] 6.843415e-13
This is why we instead need to work with different ways of coding categorical variables. One solution is to not include an intercept in the model, but that is often not what we want. We will look at two other solutions - one where we decide on a reference category (which is not included in the coding and is therefore, in a sense, absorbed into the intercept - this is called “treatment coding”), and one where we require that the sum of the coefficients is zero (called “effect coding”). This mainly affects how we interpret parameter estimates and communicate our findings to the world. We will here restrict our discussion to “treatment coding”.
If we fit a regression model with `lm` to the data with `rent` as response and `area` and `location` as covariates, a model matrix is made - and how to handle the categorical variable is either specified in the call to `lm` (or to `model.matrix`) via `contrasts = list(location = "contr.treatment")`, or globally for all categorical variables with `options(contrasts = c("contr.treatment", "contr.poly"))` - where the first element gives the choice for unordered factors (treatment contrasts is the default) and the second for ordered factors (polynomial contrasts is the default). We will only work with unordered factors now.
This is the default coding. The reference level is automatically chosen as the “lowest” level (sorted alphabetically). For our example this means that the reference category for location is “average”.
\[x_{i \text{locationgood}} = \begin{cases} 1 \text{ if } i \text{ -th location}=\text{"good"} \\ 0 \text{ if } i \text{ -th location }\neq\text{ "good"} \end{cases}\] \[x_{i \text{locationtop}} = \begin{cases} 1 \text{ if } i \text{ -th location}=\text{"top"} \\ 0 \text{ if } i \text{ -th location }\neq\text{ "top"} \end{cases}\] \[\begin{aligned} y_i &= \beta_0 + \beta_1 x_{i \text{area}} + \beta_2 x_{i \text{locationgood}}+\beta_3 x_{i \text{locationtop}} + \varepsilon_i\\\\ &= \begin{cases} \beta_0 + \beta_1 x_{i \text{area}}+ \beta_2 + \varepsilon_i &\text{ if } i \text{ -th location}=\text{"good"} \\ \beta_0 + \beta_1 x_{i \text{area}}+ \beta_3 + \varepsilon_i &\text{ if } i \text{ -th location}=\text{"top"} \\ \beta_0 + \varepsilon_i &\text{ if } i \text{ -th location}=\text{"average"}\end{cases}\end{aligned}\]
If we instead wanted “good” to be the reference category we could relevel the factor.
X1 = model.matrix(~area + location, data = ds)
X1[c(1, 3, 69), ]
## (Intercept) area locationgood locationtop
## 1 1 26 1 0
## 3 1 30 0 0
## 69 1 55 0 1
ds$locationRELEVEL = relevel(ds$location, ref = "good")
X2 = model.matrix(~area + locationRELEVEL, data = ds)
X2[c(1, 3, 69), ]
## (Intercept) area locationRELEVELaverage locationRELEVELtop
## 1 1 26 0 0
## 3 1 30 1 0
## 69 1 55 0 1
So, what does this mean in practice? Model 1 has `average` as reference category and model 2 has `good`.
fit1 = lm(rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
summary(fit1)
##
## Call:
## lm(formula = rent ~ area + location, data = ds, contrasts = list(location = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.0867 8.6947 14.732 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationgood 28.0040 5.8662 4.774 1.89e-06 ***
## locationtop 131.1075 18.2614 7.180 8.73e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
fit2 = lm(rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
summary(fit2)
##
## Call:
## lm(formula = rent ~ area + locationRELEVEL, data = ds, contrasts = list(locationRELEVEL = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -790.98 -100.89 -4.87 94.47 1004.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.0907 9.4950 16.439 < 2e-16 ***
## area 4.7056 0.1202 39.142 < 2e-16 ***
## locationRELEVELaverage -28.0040 5.8662 -4.774 1.89e-06 ***
## locationRELEVELtop 103.1034 18.4021 5.603 2.30e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 157.1 on 3078 degrees of freedom
## Multiple R-squared: 0.3555, Adjusted R-squared: 0.3549
## F-statistic: 566 on 3 and 3078 DF, p-value: < 2.2e-16
Q: Comment on the print-out. How do we interpret the intercept estimate?
To illustrate how interactions between covariates can be included we use the `ozone` data set from the `ElemStatLearn` library. This data set consists of measurements from New York in 1973 and contains 111 observations of the following variables:

- `ozone`: ozone concentration (ppm)
- `radiation`: solar radiation (langleys)
- `temperature`: daily maximum temperature (F)
- `wind`: wind speed (mph)

We start by fitting a multiple linear regression model to the data, with `ozone` as our response variable and `temperature` and `wind` as covariates.
## ozone radiation temperature wind
## 1 41 190 67 7.4
## 2 36 118 72 8.0
## 3 12 149 74 12.6
## 4 18 313 62 11.5
## 5 23 299 65 8.6
## 6 19 99 59 13.8
##
## Call:
## lm(formula = ozone ~ temperature + wind, data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.160 -13.209 -3.089 10.588 98.470
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -67.2008 23.6083 -2.846 0.00529 **
## temperature 1.8265 0.2504 7.293 5.32e-11 ***
## wind -3.2993 0.6706 -4.920 3.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.72 on 108 degrees of freedom
## Multiple R-squared: 0.5817, Adjusted R-squared: 0.574
## F-statistic: 75.1 on 2 and 108 DF, p-value: < 2.2e-16
The model can be written as: \[Y = \beta_0 + \beta_1 x_t + \beta_2 x_w + \varepsilon\] In this model we have assumed that the effect of increasing one covariate does not depend on the values of the other covariates. For example: increasing the `temperature` by one unit always increases the predicted response value by \(\hat\beta_1 \approx 1.83\), regardless of the value of `wind`.
However, one might think that the covariate `wind` (wind speed) might act differently upon `ozone` for different values of `temperature`, and vice versa. \[\begin{aligned} Y &= \beta_0 + \beta_1 x_t + \beta_2 x_w + \beta_3\cdot(x_t \cdot x_w) +\varepsilon \\ &= \beta_0 + (\beta_1 + \beta_3 x_w) \cdot x_t + \beta_2 x_w + \varepsilon \\ &= \beta_0 + \beta_1 x_t + (\beta_2 + \beta_3 x_t) \cdot x_w + \varepsilon \end{aligned}.\] We fit this model in R. An interaction term can be included in the model using the `*` symbol.
Q: Look at the `summary` below. Is this a better model than the one without the interaction term? Is the term significant?
ozone.int = lm(ozone ~ temperature + wind + temperature * wind, data = ozone)
summary(ozone.int)
##
## Call:
## lm(formula = ozone ~ temperature + wind + temperature * wind,
## data = ozone)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.929 -11.190 -3.037 8.209 97.440
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -239.94146 48.59004 -4.938 2.92e-06 ***
## temperature 4.00151 0.59311 6.747 8.02e-10 ***
## wind 13.60882 4.28070 3.179 0.00193 **
## temperature:wind -0.21747 0.05446 -3.993 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.36 on 107 degrees of freedom
## Multiple R-squared: 0.636, Adjusted R-squared: 0.6258
## F-statistic: 62.31 on 3 and 107 DF, p-value: < 2.2e-16
We see that the interaction term is highly significant. The \(p\)-value is very small, so there is strong evidence that \(\beta_3 \neq 0\). Furthermore, \(R^2_{\text{adj}}\) has increased, indicating that more of the variability in the data has been explained by the model (than without the interaction).
Interpretation of the interaction term:
If we now increase the `temperature` by \(10^{\circ}\) F, the change in the predicted `ozone` concentration is \[(\hat \beta_1+\hat \beta_3 \cdot x_w) \cdot 10 = (4.0 -0.22 \cdot x_w) \cdot 10 = 40-2.2 x_w \text{ units}.\]

If we increase the `wind` speed by 10 mph, the change in the predicted `ozone` concentration is \[(\hat \beta_2 + \hat \beta_3 \cdot x_t) \cdot 10 = (13.6 -0.22 \cdot x_t) \cdot 10 = 136-2.2 x_t \text{ units}.\]
The hierarchical principle
It is possible that the interaction term is highly significant, but the main effects are not.
In our `ozone.int` model above, the main effects are `temperature` and `wind`. The hierarchical principle states that if we include an interaction term in our model, the main effects must also be included, even if they are not significant. This means that if the coefficient \(\hat \beta_1\) or \(\hat \beta_2\) were insignificant while the coefficient \(\hat \beta_3\) is significant, \(\hat \beta_1\) and \(\hat \beta_2\) should still be included in the model.
The reason for this is that a model with interaction terms but without the main effects is hard to interpret.
We create a new variable `temp.cat`, which is `temperature` as a qualitative covariate with two levels, and fit the model: \[\begin{aligned}y&=\beta_0 + \beta_1 x_w + \begin{cases} \beta_2 + \beta_3 x_w &\text{ if temperature="low"}\\ 0 &\text{ if temperature = "high"}\end{cases} \\\\ &= \begin{cases} (\beta_0 + \beta_2) + (\beta_1 + \beta_3) \cdot x_w &\text{ if temperature="low"}\\ \beta_0 + \beta_1 x_w &\text{ if temperature="high"} \end{cases} \end{aligned}\]
temp.cat = ifelse(ozone$temperature < mean(ozone$temperature), "low",
"high")
ozone2 = cbind(ozone, temp.cat)
print(head(ozone2))
## ozone radiation temperature wind temp.cat
## 1 41 190 67 7.4 low
## 2 36 118 72 8.0 low
## 3 12 149 74 12.6 low
## 4 18 313 62 11.5 low
## 5 23 299 65 8.6 low
## 6 19 99 59 13.8 low
ozone.int2 = lm(ozone ~ wind + temp.cat + temp.cat * wind, data = ozone2)
summary(ozone.int2)
##
## Call:
## lm(formula = ozone ~ wind + temp.cat + temp.cat * wind, data = ozone2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.291 -9.091 -1.307 11.227 71.815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.0450 7.5004 15.872 < 2e-16 ***
## wind -6.7235 0.8195 -8.204 5.61e-13 ***
## temp.catlow -92.6316 12.9466 -7.155 1.09e-10 ***
## wind:temp.catlow 6.0544 1.1999 5.046 1.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.26 on 107 degrees of freedom
## Multiple R-squared: 0.6393, Adjusted R-squared: 0.6291
## F-statistic: 63.2 on 3 and 107 DF, p-value: < 2.2e-16
interceptlow = coef(ozone.int2)[1] + coef(ozone.int2)[3]
slopelow = coef(ozone.int2)[2] + coef(ozone.int2)[4]
intercepthigh = coef(ozone.int2)[1]
slopehigh = coef(ozone.int2)[2]
ggplot(ozone) + geom_line(aes(y = interceptlow + slopelow * wind, x = wind),
col = "blue") + geom_line(aes(y = intercepthigh + slopehigh * wind,
x = wind), col = "red") + ylab("ozone") + theme_minimal()
We may extend the linear model to handle non-linear relationships by using polynomial regression - as we did in Module 2 in our bias-variance trade-off example.
More on non-linearity in Module 7.
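As a small illustration (a sketch, again using the ozone data from above), a quadratic effect of wind can be included with poly:
# quadratic polynomial in wind (poly uses orthogonal polynomials by default)
fitpoly = lm(ozone ~ poly(wind, 2), data = ozone)
summary(fitpoly)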
First, we define predictions as \(\hat{\bf Y}={\bf X}\hat{\boldsymbol\beta}\), and inserting the ML (and LS) estimate we get \(\hat{\bf Y}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf Y}\).
We define the projection matrix \[ {\bf H}={\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T\] called the hat matrix. This simplifies the notation for the predictions, \[\hat{\bf Y}={\bf H}{\bf Y}\] so the hat matrix is putting the hat on the response \({\bf Y}\).
In addition we define residuals as \[\begin{align*} {\bf e}&={\bf Y}-\hat{\bf Y} \\ {\bf e}&={\bf Y}-{\bf HY}=({\bf I-H}){\bf Y}\\ \end{align*}\]so we have a second projection matrix \[ {\bf I-H}={\bf I}-{\bf X}({\bf X}^T{\bf X})^{-1} {\bf X}^T \]
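A small numerical check of the hat matrix (a sketch; the model below is just an example fitted to the ozone data from above):
ozonefit = lm(ozone ~ temperature + wind, data = ozone)
X = model.matrix(ozonefit)
H = X %*% solve(t(X) %*% X) %*% t(X)  # the hat matrix
all.equal(as.vector(H %*% ozone$ozone), unname(fitted(ozonefit)))  # TRUE: H puts the hat on Y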
Transformations of the response variable or of a covariate can be useful if the relationship is not linear. A linear model can then be fit to the transformed variables.
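For example (a sketch, assuming the rent99 data from gamlss.data used elsewhere in this module), a log-log model for rent versus living area:
# fit a linear model on log-transformed response and covariate
fitlog = lm(log(rent) ~ log(area), data = rent99)
summary(fitlog)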
Multiple linear regression might require subset selection if the number of covariates is high.
There will be a very similar regression problem in the compulsory exercise 1 in 2019!
The Framingham Heart Study is a study of the etiology (i.e. the underlying causes) of cardiovascular disease, with participants from the community of Framingham in Massachusetts, USA. For more information about the Framingham Heart Study visit https://www.framinghamheartstudy.org/. The dataset used here is a subset of a teaching version of the Framingham data, used with permission from the Framingham Heart Study.
We will focus on modelling systolic blood pressure using data from n = 2600 persons. For each person in the data set we have measurements of the seven variables
SYSBP: systolic blood pressure,
SEX: 1=male, 2=female,
AGE: age in years at examination,
CURSMOKE: current cigarette smoking at examination: 0=not current smoker, 1=current smoker,
BMI: body mass index,
TOTCHOL: serum total cholesterol, and
BPMEDS: use of anti-hypertensive medication at examination: 0=not currently using, 1=currently using.
A multiple normal linear regression model was fitted to the data set with -1/sqrt(SYSBP) as response and all the other variables as covariates.
library(ggplot2)
data = read.table("https://www.math.ntnu.no/emner/TMA4268/2018v/data/SYSBPreg3uid.txt")
dim(data)
colnames(data)
modelA = lm(-1/sqrt(SYSBP) ~ ., data = data)
summary(modelA)
We name the model fitted above modelA.
- Explain what the modelA summary-output means, in particular:
  - Estimate - in particular the interpretation of the Intercept
  - Std.Error
  - t value
  - Pr(>|t|)
  - Residual standard error
  - F-statistic
- What is the proportion of variability explained by the fitted modelA? Comment.
- Fit a model, modelB, with SYSBP as response, and the same covariates as for modelA. Would you prefer to use modelA or modelB when the aim is to make inference about the systolic blood pressure?
# residuals vs fitted
ggplot(modelA, aes(.fitted, .resid)) + geom_point(pch = 21) + geom_hline(yintercept = 0,
linetype = "dashed") + geom_smooth(se = FALSE, col = "red", size = 0.5,
method = "loess") + labs(x = "Fitted values", y = "Residuals", title = "Fitted values vs. residuals",
subtitle = deparse(modelA$call))
# qq-plot of residuals
ggplot(modelA, aes(sample = .stdresid)) + stat_qq(pch = 19) + geom_abline(intercept = 0,
slope = 1, linetype = "dotted") + labs(x = "Theoretical quantiles",
y = "Standardized residuals", title = "Normal Q-Q", subtitle = deparse(modelA$call))
# normality test
library(nortest)
ad.test(rstudent(modelA))
We use modelA
and focus on addressing the association between BMI and the response.
Consider a 56 year old man who is smoking. He is 1.75 meters tall and his weight is 89 kilograms. His serum total cholesterol is 200 mg/dl and he is not using anti-hypertensive medication.
names(data)
new = data.frame(SEX = 1, AGE = 56, CURSMOKE = 1, BMI = 89/1.75^2, TOTCHOL = 200,
BPMEDS = 0)
What is your best guess for his -1/sqrt(SYSBP)? To get a best guess for his SYSBP you may take the inverse function of -1/sqrt. (Comment: Is that allowed - to only do the inverse? Yes, that could be the result of a first order Taylor expansion approximation. If you think this is interesting you can read more from page 36 here: https://www.math.ntnu.no/emner/TMA4267/2017v/L12.pdf - but that is not on the reading list of this course. You have probably used this result in your physics courses.)
Construct a 95% prediction interval for his SYSBP. Comment. Hint: first construct values on the scale of the response -1/sqrt(SYSBP) and then transform the upper and lower limits of the prediction interval.
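A minimal sketch of how this could be done (assuming modelA and the data frame new defined above):
pred = predict(modelA, newdata = new, interval = "prediction", level = 0.95)
pred  # fit, lwr and upr on the -1/sqrt(SYSBP) scale
1/pred^2  # back-transformed to the SYSBP scale (the transformation is monotone, so the limits carry over)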
A core finding is \(\hat{\boldsymbol\beta}\). \[ \hat{\boldsymbol\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf Y}\] with \(\hat{\boldsymbol\beta}\sim N_{p}(\boldsymbol\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
a. What is the interpretation of a 95% confidence interval? Hint: repeat the experiment (on \(Y\)); on average, how many of the CIs cover the true \(\beta_j\)?
b. What is the interpretation of a 95% prediction interval? Hint: repeat the experiment (on \(Y\)) for a given \({\bf x}_0\).
c. Construct a 95% CI for \({\bf x}_0^T \beta\). Explain the connections between a CI for \(\beta_j\), a CI for \({\bf x}_0^T \beta\) and a PI for \(Y\) at \({\bf x}_0\).
d. Explain the difference between error and residual. What are the properties of the raw residuals? Why don’t we want to use the raw residuals for model checking? What is our solution to this?
e. Consider a multiple linear regression model \(A\) and a submodel \(B\) (all parameters in \(B\) are also in \(A\)). We say that \(B\) is nested within \(A\). Assume that the regression parameters are estimated using least squares. Why is the following then true: RSS for model \(A\) will always be smaller than or equal to RSS for model \(B\), and thus \(R^2\) for model \(A\) can never be worse than \(R^2\) for model \(B\). (See also Problem 3d below.)
Fit the regression model with first rent and then rentsqm as response, and the following covariates: area, location (dummy variable coding using location2 and location3, just write as.factor(location)), bath, kitchen and cheating (central heating).
Look at diagnostic plots for the two fits. Which response do you prefer?
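A possible starting point (a sketch; one fit for each choice of response):
library(gamlss.data)
fit.rent = lm(rent ~ area + as.factor(location) + bath + kitchen + cheating, data = rent99)
fit.rentsqm = lm(rentsqm ~ area + as.factor(location) + bath + kitchen + cheating, data = rent99)
# standard diagnostic plots (residuals vs fitted, QQ, scale-location, leverage)
par(mfrow = c(2, 2))
plot(fit.rent)
par(mfrow = c(2, 2))
plot(fit.rentsqm)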
Concentrate on the response-model you choose for the rest of the tasks.
Explain what the parameter estimates mean in practice. In particular, what is the interpretation of the intercept?
Go through the summary printout and explain all parts.
Now we add random noise as a covariate, by simulating the IQ of the landlord of each apartment. Observe that \(R^2\) increases (or stays unchanged) and RSS decreases (or stays the same) if we add IQ as a covariate, but \(R^2_{\text{adj}}\) decreases. What does this tell you about model selection and overfitting?
For the code - what is the connection between sigma
and RSS?
orgfit = lm(rent ~ area + as.factor(location) + bath + kitchen + cheating,
data = rent99)
summary(orgfit)
set.seed(1) #to be able to reproduce results
n = dim(rent99)[1]
IQ = rnorm(n, 100, 16)
fitIQ = lm(rent ~ area + as.factor(location) + bath + kitchen + cheating +
IQ, data = rent99)
summary(fitIQ)
summary(orgfit)$sigma
summary(fitIQ)$sigma
summary(orgfit)$r.squared
summary(fitIQ)$r.squared
summary(orgfit)$adj.r.squared
summary(fitIQ)$adj.r.squared
##
## Call:
## lm(formula = rent ~ area + as.factor(location) + bath + kitchen +
## cheating, data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.41 -89.17 -6.26 82.96 1000.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.9733 11.6549 -1.885 0.0595 .
## area 4.5788 0.1143 40.055 < 2e-16 ***
## as.factor(location)2 39.2602 5.4471 7.208 7.14e-13 ***
## as.factor(location)3 126.0575 16.8747 7.470 1.04e-13 ***
## bath1 74.0538 11.2087 6.607 4.61e-11 ***
## kitchen1 120.4349 13.0192 9.251 < 2e-16 ***
## cheating1 161.4138 8.6632 18.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4494
## F-statistic: 420 on 6 and 3075 DF, p-value: < 2.2e-16
##
##
## Call:
## lm(formula = rent ~ area + as.factor(location) + bath + kitchen +
## cheating + IQ, data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -630.95 -89.50 -6.12 82.62 995.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -41.3879 19.5957 -2.112 0.0348 *
## area 4.5785 0.1143 40.056 < 2e-16 ***
## as.factor(location)2 39.2830 5.4467 7.212 6.90e-13 ***
## as.factor(location)3 126.3356 16.8748 7.487 9.18e-14 ***
## bath1 74.1979 11.2084 6.620 4.23e-11 ***
## kitchen1 120.0756 13.0214 9.221 < 2e-16 ***
## cheating1 161.4450 8.6625 18.637 < 2e-16 ***
## IQ 0.1940 0.1574 1.232 0.2179
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3074 degrees of freedom
## Multiple R-squared: 0.4507, Adjusted R-squared: 0.4494
## F-statistic: 360.3 on 7 and 3074 DF, p-value: < 2.2e-16
##
## [1] 145.1879
## [1] 145.1757
## [1] 0.4504273
## [1] 0.4506987
## [1] 0.449355
## [1] 0.4494479
We now want to use model selection to arrive at a good model. Start by defining which covariates you want to include and how to code them (use dummy variable coding for location). What about year of construction - is that a linear covariate? Maybe you want to make intervals in time instead? Linear or categorical for time? What about the district? We leave that out, since we have not talked about how to use spatial covariates.
Hint: if you want to test out interval versions of year of construction the function mutate
(from dplyr
) is useful:
library(dplyr)
rent99 <- rent99 %>% mutate(yearc.cat = cut(yearc, breaks = c(-Inf, seq(1920,
    2000, 10)), labels = 10 * 1:9))
More on dplyr
: Tutorial: http://genomicsclass.github.io/book/pages/dplyr_tutorial.html and Cheat sheet (data wrangling): https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf and dplyr in particular: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf
(More on this in Module 6.)
There are many ways to perform model selection for multiple linear regression. One possibility is best subset selection, which can be done using the regsubsets function from the library leaps. Assume your “full” model fitted by lm is called fit. You may define x from model.matrix(fit)[,-1] (not including the intercept term), and then run best=regsubsets(x=model.matrix(fit)[,-1],y=rent99$rent) and look at summary(best). Explain the print-out (with all the stars). Using the Mallows Cp (named cp in the list from summary(best)) will give the same result as using AIC (which is not available in this function). What is your preferred model? Hint: look at the R-code in Problem 2 (Figure 3) from the TMA4267V2017 exam: pdf, and maybe the solutions for the interpretation: pdf
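A minimal sketch of how this could look (assuming the rent model with area, location, bath, kitchen and cheating as the “full” model):
library(leaps)
fit = lm(rent ~ area + as.factor(location) + bath + kitchen + cheating, data = rent99)
best = regsubsets(x = model.matrix(fit)[, -1], y = rent99$rent)
sumbest = summary(best)
sumbest  # the stars show which covariates enter the best model of each size
which.min(sumbest$cp)  # size of the model preferred by Mallows Cp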
Make R code that shows the interpretation of a 95% CI for \(\beta_j\). Hint: Theoretical question a.
Make R code that shows the interpretation of a 95% PI for a new response at \({\bf x}_0\). Hint: Theoretical question b.
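A minimal sketch covering both points above, using simulated data from a simple linear regression (all parameter values below are illustrative choices):
set.seed(123)
truebeta0 = 1
truebeta1 = 2
sigma = 1
n = 50
x = runif(n, 0, 10)
x0 = data.frame(x = 5)
nrep = 1000
ci.covered = 0
pi.covered = 0
for (i in 1:nrep) {
    y = truebeta0 + truebeta1 * x + rnorm(n, 0, sigma)
    fit = lm(y ~ x)
    ci = confint(fit)["x", ]  # 95% CI for the slope
    if (ci[1] <= truebeta1 && truebeta1 <= ci[2]) ci.covered = ci.covered + 1
    predint = predict(fit, newdata = x0, interval = "prediction")  # 95% PI at x0
    y0 = truebeta0 + truebeta1 * x0$x + rnorm(1, 0, sigma)  # a new response at x0
    if (predint[1, "lwr"] <= y0 && y0 <= predint[1, "upr"]) pi.covered = pi.covered + 1
}
ci.covered/nrep  # should be close to 0.95: proportion of CIs covering the true slope
pi.covered/nrep  # should be close to 0.95: proportion of PIs covering the new response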
For simple linear regression, simulate a data set with homoscedastic errors and one with heteroscedastic errors. Here is a suggestion for one solution - not using ggplot; you should redo it using ggplot. Why do this? To see how things look when the model is correct and when it is wrong.
# Homoscedastic errors
n = 1000
x = seq(-3, 3, length = n)
beta0 = -1
beta1 = 2
xbeta = beta0 + beta1 * x
sigma = 1
e1 = rnorm(n, mean = 0, sd = sigma)
y1 = xbeta + e1
ehat1 = residuals(lm(y1 ~ x))
plot(x, y1, pch = 20)
abline(beta0, beta1, col = 1)
plot(x, e1, pch = 20)
abline(h = 0, col = 2)
# Heteroscedastic errors
sigma = (0.1 + 0.3 * (x + 3))^2
e2 = rnorm(n, 0, sd = sigma)
y2 = xbeta + e2
ehat2 = residuals(lm(y2 ~ x))
plot(x, y2, pch = 20)
abline(beta0, beta1, col = 2)
plot(x, e2, pch = 20)
abline(h = 0, col = 2)
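Since the exercise asks for ggplot, here is one possible way to plot the residuals for the two cases (a sketch building on the objects defined above):
library(ggplot2)
df = data.frame(x = x, ehat1 = ehat1, ehat2 = ehat2)
ggplot(df, aes(x, ehat1)) + geom_point(pch = 20) + geom_hline(yintercept = 0, col = "red") + labs(title = "Homoscedastic errors") + theme_minimal()
ggplot(df, aes(x, ehat2)) + geom_point(pch = 20) + geom_hline(yintercept = 0, col = "red") + labs(title = "Heteroscedastic errors") + theme_minimal()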
All this fuss about raw, standardized and studentized residuals - does it really matter in practice? Below is one example where the raw residuals are rather different from the standardized ones, but the standardized are practically identical to the studentized. Can you come up with a simulation model where the standardized and studentized residuals are very different? Hint: what about a smaller sample size?
n = 1000
beta = matrix(c(0, 1, 1/2, 1/3), ncol = 1)
set.seed(123)
x1 = rnorm(n, 0, 1)
x2 = rnorm(n, 0, 2)
x3 = rnorm(n, 0, 3)
X = cbind(rep(1, n), x1, x2, x3)
y = X %*% beta + rnorm(n, 0, 2)
fit = lm(y ~ x1 + x2 + x3)
yhat = predict(fit)
summary(fit)
ehat = residuals(fit)
estand = rstandard(fit)
estud = rstudent(fit)
plot(yhat, ehat, pch = 20)  # raw residuals (black)
points(yhat, estand, pch = 20, col = 2)  # standardized residuals (red)
If you want to look at the .Rmd file and knit
it, you need to first install the following packages (only once).
# packages to install before knitting this R Markdown file to knit
# the Rmd
install.packages("knitr")
install.packages("rmarkdown")
# nice tables in Rmd
install.packages("kableExtra")
# cool layout for the Rmd
install.packages("prettydoc") # alternative to github
# plotting
install.packages("ggplot2") # cool plotting
install.packages("ggpubr") # for many ggplots
install.packages("GGally") # for ggpairs
# datasets
install.packages("ElemStatLearn") # for ozone data set
install.packages("gamlss.data") #rent index data set here
# methods
install.packages("nortest") #test for normality - e.g. Anderson-Darling
install.packages("car") # vif
library(Matrix)  # Matrix ships with R, so no install needed
install.packages("reshape")
install.packages("corrplot")
install.packages("tidyverse")
Thanks to Julia Debik for contributing to this module page.