Independent pairs \((Y_i, {\bf x}_i)\) for \(i=1,\ldots,n\).
The model is the same, but it is expressed differently, and there are several ways of looking at this:
This helps us get to the generalisation to GLMs.
Likelihood:
\[ L({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = \frac{1}{(2\pi \sigma^2)^{n/2}} e^{-\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta})} \]
log-likelihood:
\[ l({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = -\frac{n}{2}\log(2\pi \sigma^2) -\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta}) \] The score vector is
\[ s(\beta)=\frac{\partial l(\beta)}{\partial \beta}= \sum_{i=1}^n \frac{\partial l_i(\beta)}{\partial \beta}= \sum_{i=1}^n s_i(\beta) \]
Writing \(\psi_i = y_i - x_i^T \mathbf{\beta}\),
\[ \begin{align} s_i(\beta) &= \frac{\partial l_i(\beta)}{\partial \psi_i} \frac{\partial \psi_i}{\partial \beta} \\ &= -\frac{1}{2} 2 \sigma^{-2} \psi_i \, (-x_i) \\ &= \sigma^{-2} (y_i - x_i^T \mathbf{\beta}) x_i \\ &= \sigma^{-2} (x_i y_i - x_i x_i^T \mathbf{\beta}) \end{align} \]
Start with \(s(\beta)=0\)…
\[ \begin{align} s(\beta) = \sigma^{-2} (X^T \mathbf{y} - X^TX \mathbf{\beta}) &= 0 \\ X^TX \mathbf{\beta} &= X^T \mathbf{y}\\ (X^TX)^{-1}X^TX \mathbf{\beta} &= (X^TX)^{-1} X^T \mathbf{y}\\ \mathbf{\hat{\beta}} &= (X^TX)^{-1} X^T \mathbf{y} \end{align} \]
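As a quick numerical sanity check (a small illustration with simulated data, not part of the course notes), we can solve the normal equations directly and compare with lm():

```r
# Solve the normal equations (X^T X) beta = X^T y by hand and compare with lm()
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))           # design matrix with intercept column
beta <- c(1, 2, -0.5)
y <- drop(X %*% beta + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y
fit <- lm(y ~ X - 1)                        # same model fitted by lm()
cbind(beta_hat, coef(fit))                  # the two columns agree
```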
If there is no closed-form solution, we have to resort to numerical methods, of which there are many (e.g. Newton-Raphson or Fisher scoring).
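For the linear model the score is linear in \(\beta\), so one Fisher-scoring step \(\beta^{(t+1)} = \beta^{(t)} + F(\beta^{(t)})^{-1} s(\beta^{(t)})\) already lands on the MLE; a sketch with simulated data (illustrative only):

```r
# Fisher scoring for the linear model converges in a single step
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2) + rnorm(n))

beta_t <- c(0, 0)                               # arbitrary starting value
score  <- t(X) %*% (y - X %*% beta_t)           # s(beta), dropping the sigma^-2 factor
F_info <- t(X) %*% X                            # F(beta), dropping the same factor
beta_1 <- drop(beta_t + solve(F_info, score))   # one update: equals (X^T X)^{-1} X^T y
```

The \(\sigma^{-2}\) factor cancels in \(F^{-1}s\), which is why it can be dropped from both.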
Observed Fisher information matrix:
\[ H(\beta) = -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} = -\frac{\partial s(\beta)}{\partial\beta^T} \]
Expected Fisher information matrix
\[ F(\beta) = E\left(-\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} \right) \]
They are both square matrices of dimension \(p \times p\), where \(p\) is the length of \(\beta\).
(NOTE: these next two questions have been added at the last minute, so may not be complete)
Calculate the Fisher information for multiple regression
You can go via the likelihood or the score.
The log-likelihood, up to an additive constant, is
\[ l(\beta) = -\frac{1}{2} (\mathbf{y} - X \mathbf{\beta})^T \Sigma^{-1} (\mathbf{y} - X \mathbf{\beta}) \]
… (to be continued. Basically, use Rule 3, and substitute \(\beta = (\mathbf{y} - X \mathbf{\beta})\), with apologies for using \(\beta\) to mean different things.)
Or we can go via the score \[ s(\beta) = X^T \Sigma^{-1} (\mathbf{y} - X \mathbf{\beta}), \] where \(\Sigma^{-1} = \sigma^{-2} I\), and use Rule 1.
How does it relate to the covariance of \(\hat{\beta}\)?
It’s the inverse: \(\text{Cov}(\hat{\beta}) = \sigma^2({\bf X}^T{\bf X})^{-1} = F(\beta)^{-1}\).
(we will see later that this is a special case of a more general result, which is why we care about it now)
\[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf y} \]
with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
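We can verify this relationship numerically (an illustration with simulated data; vcov() returns the estimated covariance matrix of \(\hat{\beta}\)):

```r
# vcov(fit) equals hat(sigma)^2 (X^T X)^{-1}: the inverse Fisher information
# with sigma^2 replaced by its unbiased estimate
set.seed(2)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
sigma2_hat <- summary(fit)$sigma^2          # RSS / (n - p)
V <- sigma2_hat * solve(t(X) %*% X)         # sigma^2 (X^T X)^{-1}
sqrt(diag(V))                               # the std. errors reported by summary(fit)
```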
Show that \(\hat{\beta}\) has this distribution with the given mean and covariance matrix.
What does this imply for the distribution of the \(j\)th element of \(\hat{\beta}\)? In particular, how can we calculate the variance of \(\hat{\beta}_j\)?
The \(j\)th element of \(\hat{\beta}\) is normally distributed, with mean \(\beta_j\) and variance \(\sigma^2 \left[({\bf X}^T{\bf X})^{-1}\right]_{jj}\), the \(j\)th diagonal element of the covariance matrix.
The difference between an estimand and an estimate.
The error is the difference from the true (or “true”) line: the residual is from the fitted line.
The raw residuals (i.e. \(y_i - x_i^T \mathbf{\hat{\beta}}\)) are \[\hat{\varepsilon}={\bf Y}-{\bf \hat{Y}}=({\bf I}-{\bf H}){\bf Y},\] where \({\bf H} = X (X^T X)^{-1} X^T\) is the hat matrix. Their covariance is \(\sigma^2({\bf I}-{\bf H})\), so they can be heteroscedastic, and correlated (unless \({\bf H}\) behaves)
The solution is to studentise them.
Hint: QQ–plot as html
Because we’re assuming a 1:1 relationship, the intercept should be 0, and slope 1.
Fit the regression model with first rent and then rentsqm as response and the following covariates: area, location (dummy variable coding using location2 and location3), bath, kitchen and cheating (central heating).
library(gamlss.data)
library(ggfortify)
?rent99
mod1 <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
mod2 <- lm(rentsqm ~ area + location + bath + kitchen + cheating, data = rent99)
autoplot(mod1, label.size = 2)
autoplot(mod2, label.size = 2)
Concentrate on the response model you choose for the rest of the tasks.
summary(mod1)
##
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.41 -89.17 -6.26 82.96 1000.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.9733 11.6549 -1.885 0.0595 .
## area 4.5788 0.1143 40.055 < 2e-16 ***
## location2 39.2602 5.4471 7.208 7.14e-13 ***
## location3 126.0575 16.8747 7.470 1.04e-13 ***
## bath1 74.0538 11.2087 6.607 4.61e-11 ***
## kitchen1 120.4349 13.0192 9.251 < 2e-16 ***
## cheating1 161.4138 8.6632 18.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4494
## F-statistic: 420 on 6 and 3075 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = rentsqm ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4959 -1.4084 -0.0733 1.3847 9.4400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.108319 0.168567 42.169 < 2e-16 ***
## area -0.038154 0.001653 -23.077 < 2e-16 ***
## location2 0.628698 0.078782 7.980 2.04e-15 ***
## location3 1.686099 0.244061 6.909 5.93e-12 ***
## bath1 0.989898 0.162113 6.106 1.15e-09 ***
## kitchen1 1.412113 0.188299 7.499 8.34e-14 ***
## cheating1 2.414101 0.125297 19.267 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.1 on 3075 degrees of freedom
## Multiple R-squared: 0.2584, Adjusted R-squared: 0.2569
## F-statistic: 178.6 on 6 and 3075 DF, p-value: < 2.2e-16
Next week: more on inference on this data set.
We look at a regression problem where both the response and the covariates are centered - that is, the mean of the response and the mean of each covariate is zero. We do this to avoid the intercept term, which would otherwise make things a bit more complicated.
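As a small illustration (simulated data, not from the notes): centring the response and the covariate lets us drop the intercept without changing the slope estimate.

```r
# Centring removes the need for an intercept: the slope estimate is unchanged
set.seed(3)
x <- rnorm(40, mean = 5)
y <- 2 + 3 * x + rnorm(40)

fit_full <- lm(y ~ x)                                 # with intercept
fit_cent <- lm(I(y - mean(y)) ~ I(x - mean(x)) - 1)   # centred, no intercept
c(coef(fit_full)["x"], coef(fit_cent))                # identical slope estimates
```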
In a design matrix (without an intercept column), orthogonal columns give a diagonal \({\bf X}^T {\bf X}\). What does that mean? How can we get orthogonal columns?
If we have orthogonal columns, will the simple (only one covariate) and multiple regression coefficient estimates then differ? Explain.
What is multicollinearity? Is that a problem? Why (not)?
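These questions can be explored numerically (an illustrative sketch with constructed data): with exactly orthogonal columns \({\bf X}^T{\bf X}\) is diagonal and the simple and multiple regression estimates coincide, while a nearly collinear column inflates the standard errors.

```r
# Orthogonal columns: diagonal X^T X, and simple = multiple regression estimates
set.seed(4)
n <- 100
Z <- qr.Q(qr(matrix(rnorm(2 * n), n, 2)))       # two exactly orthogonal columns
x1 <- Z[, 1]; x2 <- Z[, 2]
y <- x1 + 2 * x2 + rnorm(n, sd = 0.5)

crossprod(cbind(x1, x2))                        # off-diagonals are (numerically) zero
coef(lm(y ~ x1 + x2 - 1))                       # multiple regression
c(coef(lm(y ~ x1 - 1)), coef(lm(y ~ x2 - 1)))   # simple regressions: same values

# Multicollinearity: a near-copy of x1 makes the standard errors explode
x3 <- x1 + rnorm(n, sd = 0.01)
summary(lm(y ~ x1 + x3 - 1))$coefficients
```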
Background material for this task: [Categorical covariates - dummy and effect coding](#categorical)
We will study a dataset where we want to model income as response with two unordered categorical covariates, gender and place (location).
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female",
"Male")), place = factor(place, levels = c("A", "B", "C")))
library(GGally)
GGally::ggpairs(data)
Run interaction.plot(data$gender, data$place, data$income). What does it show? Do we need an interaction term if we want to model an MLR with income as response?
interaction.plot(x.factor = data$gender, trace.factor = data$place, response = data$income,
    type = "l")
Run plot.design(income ~ place + gender, data = data). What does it show?
plot.design(income ~ place + gender, data = data)
Fit an MLR with income as response and gender and place as covariates. Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?
mod3 <- lm(income ~ place + gender, data = data)
mod3
##
## Call:
## lm(formula = income ~ place + gender, data = data)
##
## Coefficients:
## (Intercept) placeB placeC genderMale
## 306.25 47.50 65.00 41.25
Explain what your model estimates mean. Now, what is the intercept estimate? Calculate the estimate for place = C.
mod4 <- lm(income ~ place + gender, data = data, contrasts = list(place = "contr.sum",
gender = "contr.sum"))
mod4
##
## Call:
## lm(formula = income ~ place + gender, data = data, contrasts = list(place = "contr.sum",
## gender = "contr.sum"))
##
## Coefficients:
## (Intercept) place1 place2 gender1
## 364.38 -37.50 10.00 -20.63
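With sum-to-zero (effect) coding the place effects sum to zero over the levels, so the estimate for place = C is minus the sum of the two printed effects; a small check, rebuilding the data and model from above:

```r
# Recover the effect of place = C under contr.sum: effects sum to zero over levels
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
            300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- factor(c(rep("Male", 12), rep("Female", 12)), levels = c("Female", "Male"))
place <- factor(rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2),
                levels = c("A", "B", "C"))
m <- lm(income ~ place + gender, data = data.frame(income, gender, place),
        contrasts = list(place = "contr.sum", gender = "contr.sum"))
placeC <- -(coef(m)["place1"] + coef(m)["place2"])   # effect for place = C
unname(placeC)                                       # 27.5
```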
model.matrix(mod4)
## (Intercept) place1 place2 gender1
## 1 1 1 0 -1
## 2 1 1 0 -1
## 3 1 1 0 -1
## 4 1 1 0 -1
## 5 1 0 1 -1
## 6 1 0 1 -1
## 7 1 0 1 -1
## 8 1 0 1 -1
## 9 1 -1 -1 -1
## 10 1 -1 -1 -1
## 11 1 -1 -1 -1
## 12 1 -1 -1 -1
## 13 1 1 0 1
## 14 1 1 0 1
## 15 1 1 0 1
## 16 1 1 0 1
## 17 1 0 1 1
## 18 1 0 1 1
## 19 1 0 1 1
## 20 1 0 1 1
## 21 1 -1 -1 1
## 22 1 -1 -1 1
## 23 1 -1 -1 1
## 24 1 -1 -1 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$place
## [1] "contr.sum"
##
## attr(,"contrasts")$gender
## [1] "contr.sum"
mean(income)
## [1] 364.375
Next week we connect this to linear hypotheses and ANOVA.
This part of the module was marked “self-study”. Go through this together in the group, and make sure that you understand.
(a version of this problem was also given as a recommended exercise in TMA4268 Statistical Learning)
# Homoscedastic errors
n = 1000
x = seq(-3, 3, length = n)
beta0 = -1
beta1 = 2
xbeta = beta0 + beta1 * x
sigma = 1
e1 = rnorm(n, mean = 0, sd = sigma)
y1 = xbeta + e1
ehat1 = residuals(lm(y1 ~ x))
plot(x, y1, pch = 20)
abline(beta0, beta1, col = 1)
plot(x, e1, pch = 20)
abline(h = 0, col = 2)
plot(x, ehat1, pch = 20)
abline(h = 0, col = 2)
# Heteroscedastic errors
sigma = (0.1 + 0.3 * (x + 3))^2
e2 = rnorm(n, 0, sd = sigma)
y2 = xbeta + e2
ehat2 = residuals(lm(y2 ~ x))
plot(x, y2, pch = 20)
abline(beta0, beta1, col = 2)
plot(x, e2, pch = 20)
abline(h = 0, col = 2)
plot(x, ehat2, pch = 20)
abline(h = 0, col = 2)
n = 1000
beta = matrix(c(0, 1, 1/2, 1/3), ncol = 1)
set.seed(123)
x1 = rnorm(n, 0, 1)
x2 = rnorm(n, 0, 2)
x3 = rnorm(n, 0, 3)
X = cbind(rep(1, n), x1, x2, x3)
y = X %*% beta + rnorm(n, 0, 2)
fit = lm(y ~ x1 + x2 + x3)
yhat = predict(fit)
summary(fit)
ehat = residuals(fit)
estand = rstandard(fit)
estud = rstudent(fit)
plot(yhat, ehat, pch = 20)
points(yhat, estand, pch = 20, col = 2)
# points(yhat,estud,pch=19,col=3)
install.packages(c("formatR", "gamlss.data", "tidyverse", "ggplot2",
"GGally", "Matrix", "nortest", "lmtest", "wordcloud2", "tm"))