Interactive lectures - problem set, first week

Theoretical questions

Problem 1

  1. Write down the GLM formulation of the multiple linear regression model. Explain how it differs from the classical formulation.
Solution

Independent pairs \((Y_i, {\bf x}_i)\) for \(i=1,\ldots,n\).

  1. Random component: \(Y_i \sim N\) with \(\text{E}(Y_i)=\mu_i\) and \(\text{Var}(Y_i)=\sigma^2\).
  2. Systematic component: \(\eta_i={\bf x}_i^T \boldsymbol{\beta}\).
  3. Link function: links the random component to the systematic component (the linear predictor). Here the identity link (and response function): \(\mu_i=\eta_i\).

The model is the same, but it is expressed differently, and there are several ways of looking at this:

  • the focus is on \(y_i\) having a distribution with a mean that is modeled, rather than \(\varepsilon_i\) having a distribution.
  • the random and systematic components are explicitly separated
  • the link function is thrown in, even though it looks like overkill here (it is just the identity)

This helps us get to the generalisation to GLMs.
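A quick illustration (a minimal sketch, using the built-in cars data just for convenience): the same model fitted as an LM and as a GLM with Gaussian family and identity link gives identical coefficient estimates.

# same model, two formulations: lm() and glm() with gaussian family and identity link
fit_lm <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)
cbind(lm = coef(fit_lm), glm = coef(fit_glm))  # identical estimates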

  1. Write down the likelihood and log-likelihood. Then define the score vector.
Solution

Likelihood:

\[ L({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = \left(\frac{1}{2\pi \sigma^2}\right)^{n/2} e^{-\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta})} \]

log-likelihood:

\[ l({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = -\frac{n}{2}\log(2\pi \sigma^2) -\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta}) \] The score vector is

\[ s(\beta)=\frac{\partial l(\beta)}{\partial \beta}= \sum_{i=1}^n \frac{\partial l_i(\beta)}{\partial \beta}= \sum_{i=1}^n s_i(\beta) \]

For the normal linear model this gives \( s({\boldsymbol \beta}) = \frac{1}{\sigma^2} {\bf X}^T({\bf y} - {\bf X}{\boldsymbol \beta}) = \frac{1}{\sigma^2}\sum_{i=1}^n {\bf x}_i (y_i - {\bf x}_i^T {\boldsymbol \beta}) \).
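A minimal numerical check (a sketch with simulated data; the numbers are arbitrary): at the least-squares estimate the score is zero, i.e. \({\bf X}^T({\bf y} - {\bf X}\hat{\boldsymbol \beta}) = {\bf 0}\).

# simulate a small data set and verify that the score is zero at the OLS estimate
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), runif(n))
beta <- c(1, 2, -1)
y <- X %*% beta + rnorm(n)
betahat <- solve(crossprod(X), crossprod(X, y))  # (X^T X)^{-1} X^T y
crossprod(X, y - X %*% betahat)  # numerically zero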

  1. What is the set of equations we solve to find parameter estimates? What if we could not find a closed form solution to our set of equations - what could we do then?
Solution

Start with \(s({\boldsymbol \beta})=0\). For the normal linear model this gives the normal equations \({\bf X}^T{\bf X}{\boldsymbol \beta} = {\bf X}^T{\bf y}\), with the closed-form solution \(\hat{\boldsymbol \beta} = ({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf y}\) (when \({\bf X}\) has full column rank).

If there is no closed-form solution, we have to resort to iterative numerical methods, for example Newton-Raphson or Fisher scoring.
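A minimal sketch of the numerical route (using general-purpose optimisation via optim() rather than Newton-Raphson or Fisher scoring, just to illustrate the idea; the simulated data are arbitrary): maximise the log-likelihood numerically and compare with the closed-form solution.

# simulate data, then maximise the normal log-likelihood numerically
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2)) + rnorm(n)

negloglik <- function(theta) {
    beta <- theta[1:2]
    sigma <- exp(theta[3])  # optimise log(sigma) to keep sigma positive
    -sum(dnorm(y, mean = drop(X %*% beta), sd = sigma, log = TRUE))
}
opt <- optim(c(0, 0, 0), negloglik)
opt$par[1:2]  # numerical estimates of beta
coef(lm(y ~ X - 1))  # closed-form solution, approximately the same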

  1. Define the observed and the expected Fisher information matrix. What dimensions do these matrices have? What can these matrices tell us?
Solution

The observed Fisher information is \(H({\boldsymbol \beta}) = -\frac{\partial^2 l({\boldsymbol \beta})}{\partial {\boldsymbol \beta} \partial {\boldsymbol \beta}^T}\), and the expected Fisher information is \(F({\boldsymbol \beta}) = \text{E}[H({\boldsymbol \beta})]\). Both are \(p \times p\) matrices. They quantify how much information the data carry about \({\boldsymbol \beta}\): the inverse of the (expected) Fisher information gives the (asymptotic) covariance matrix of \(\hat{\boldsymbol \beta}\). For the normal linear model both equal \(\frac{1}{\sigma^2}{\bf X}^T{\bf X}\).
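A minimal sketch of that last point (using the built-in mtcars data just as an example): the inverse of \(\frac{1}{\hat\sigma^2}{\bf X}^T{\bf X}\) matches the estimated covariance matrix of \(\hat{\boldsymbol \beta}\) returned by vcov().

# Fisher information for beta in the normal linear model, and its inverse
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
sigma2hat <- summary(fit)$sigma^2
Fisher <- crossprod(X)/sigma2hat  # X^T X / sigma^2 (p x p)
all.equal(solve(Fisher), vcov(fit), check.attributes = FALSE)  # TRUE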


  1. A core finding is \(\hat \beta\). \[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf y} \]

with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).

Show that \(\hat{\beta}\) has this distribution with the given mean and covariance matrix.

Hint: The only element that is random is \({\bf y}\), and we know its distribution. And we know the distribution of linear functions of \({\bf y}\).
Solution Go here. But come back when you’ve finished. In short: \(\hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf y}\) is a linear function of \({\bf y} \sim N_n({\bf X}\beta, \sigma^2 {\bf I})\), so it is multivariate normal with mean \(({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf X}\beta = \beta\) and covariance \(\sigma^2 ({\bf X}^T{\bf X})^{-1}{\bf X}^T{\bf X}({\bf X}^T{\bf X})^{-1} = \sigma^2({\bf X}^T{\bf X})^{-1}\).

What does this imply for the distribution of the \(j\)th element of \(\hat{\beta}\)? In particular, how can we calculate the variance of \(\hat{\beta}_j\)?

Solution Each element is normal: \(\hat{\beta}_j \sim N(\beta_j, \sigma^2 [({\bf X}^T{\bf X})^{-1}]_{jj})\), so \(\text{Var}(\hat{\beta}_j)\) is \(\sigma^2\) times the \(j\)th diagonal element of \(({\bf X}^T{\bf X})^{-1}\); in practice \(\sigma^2\) is replaced by its estimate \(\hat{\sigma}^2\).
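A minimal sketch (again with the built-in mtcars data, purely for illustration): the standard errors reported by summary() are the square roots of the diagonal of \(\hat{\sigma}^2({\bf X}^T{\bf X})^{-1}\), i.e. of vcov().

# standard errors of the coefficients are the square roots of the diagonal of vcov()
fit <- lm(mpg ~ wt + hp, data = mtcars)
sqrt(diag(vcov(fit)))  # sigma_hat * sqrt of the diagonal of (X^T X)^{-1}
summary(fit)$coefficients[, "Std. Error"]  # the same numbers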
  1. Explain the difference between error and residual. What are the properties of the raw residuals? Why don’t we want to use the raw residuals for model checking? What is our solution to this?
Solution

The difference between an estimand and an estimate: the error \(\varepsilon_i = Y_i - {\bf x}_i^T{\boldsymbol \beta}\) is unobservable, while the residual \(\hat{\varepsilon}_i = Y_i - {\bf x}_i^T\hat{\boldsymbol \beta}\) is its estimate.

The raw residuals have mean zero, but \(\text{Var}(\hat{\varepsilon}_i) = \sigma^2(1 - h_{ii})\), where \(h_{ii}\) is the \(i\)th diagonal element of the hat matrix, and they are correlated. So even when the model is correct they are not identically distributed, which makes them awkward for model checking. Our solution is to standardize the residuals (divide by \(\hat{\sigma}\sqrt{1-h_{ii}}\)) or studentize them (use \(\hat{\sigma}_{(i)}\), estimated without observation \(i\)).
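A minimal sketch of the standardization (with the built-in mtcars data; the formula is the one above): dividing the raw residuals by \(\hat{\sigma}\sqrt{1-h_{ii}}\) reproduces rstandard().

# standardized residuals by hand, compared with rstandard()
fit <- lm(mpg ~ wt + hp, data = mtcars)
h <- hatvalues(fit)  # leverages h_ii
stand <- residuals(fit)/(summary(fit)$sigma * sqrt(1 - h))
all.equal(stand, rstandard(fit))  # TRUE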

  1. What is the theoretical intercept and slope of a QQ–plot based on a normal sample?

Hint: QQ–plot as html

Solution

If the sample comes from \(N(\mu, \sigma^2)\) and the sample quantiles are plotted against standard normal quantiles, the points fall approximately on a line with intercept \(\mu\) and slope \(\sigma\).

(“Whatever you want” is not entirely wrong, though: if you re-scale the theoretical quantiles, e.g. to \(N(\mu, \sigma^2)\) quantiles, the line becomes intercept 0 and slope 1.)
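A minimal sketch (simulated data; the values of \(\mu\) and \(\sigma\) are arbitrary): the points from qqnorm() fall around the line with intercept \(\mu\) and slope \(\sigma\).

# QQ-plot of a normal sample against standard normal quantiles
set.seed(1)
mu <- 5
sigma <- 2
z <- rnorm(1000, mean = mu, sd = sigma)
qqnorm(z)
abline(a = mu, b = sigma, col = 2)  # theoretical line: intercept mu, slope sigma
qqline(z)  # line through the first and third quartiles, for comparison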

Interpretation and understanding

Problem 2: Munich Rent Index data

Fit the regression model with first rent and then rentsqm as response and the following covariates: area, location (dummy variable coding using location2 and location3), bath, kitchen and cheating (central heating).

library(gamlss.data)
library(ggfortify)
?rent99

mod1 <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
mod2 <- lm(rentsqm ~ area + location + bath + kitchen + cheating, data = rent99)
autoplot(mod1, label.size = 2)

autoplot(mod2, label.size = 2)

  1. Look at diagnostic plots for the two fits. Which response do you prefer?

Concentrate on the model with the response you prefer for the rest of the tasks.

  1. Explain what the parameter estimates mean in practice. In particular, what is the interpretation of the intercept?
summary(mod1)
## 
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16
summary(mod2)
## 
## Call:
## lm(formula = rentsqm ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4959 -1.4084 -0.0733  1.3847  9.4400 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.108319   0.168567  42.169  < 2e-16 ***
## area        -0.038154   0.001653 -23.077  < 2e-16 ***
## location2    0.628698   0.078782   7.980 2.04e-15 ***
## location3    1.686099   0.244061   6.909 5.93e-12 ***
## bath1        0.989898   0.162113   6.106 1.15e-09 ***
## kitchen1     1.412113   0.188299   7.499 8.34e-14 ***
## cheating1    2.414101   0.125297  19.267  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.1 on 3075 degrees of freedom
## Multiple R-squared:  0.2584, Adjusted R-squared:  0.2569 
## F-statistic: 178.6 on 6 and 3075 DF,  p-value: < 2.2e-16
  1. Go through the summary printout and explain the parts you know now, and also observe the parts you don’t know yet (on the agenda for next week?).

Next week: more on inference on this data set.


Problem 3: Simple vs. multiple regression

We look at a regression problem where both the response and the covariates are centered - that is, the mean of the response and the mean of each covariate is zero. We do this to avoid the intercept term, which would otherwise make things a bit more complicated.

  1. In a design matrix (without an intercept column), orthogonal columns give a diagonal \({\bf X}^T {\bf X}\). What does that mean? How can we get orthogonal columns?

  2. If we have orthogonal columns, will the simple (only one covariate) and multiple estimated regression coefficients then be different? Explain. (See the simulation sketch after this list.)

  3. What is multicollinearity? Is that a problem? Why (not)?
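A minimal simulation sketch (all names and numbers chosen just for illustration): with correlated centered covariates the simple and multiple regression coefficients differ, but after orthogonalizing the second covariate against the first, the coefficient for the first covariate is the same in both fits.

# correlated centered covariates: simple and multiple coefficients differ
set.seed(1)
n <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n)
x1 <- x1 - mean(x1)
x2 <- x2 - mean(x2)
y <- x1 + x2 + rnorm(n)
y <- y - mean(y)
coef(lm(y ~ x1 - 1))  # simple regression on x1 alone
coef(lm(y ~ x1 + x2 - 1))["x1"]  # different from the simple-regression coefficient

# orthogonalize x2 against x1: X^T X becomes diagonal
x2o <- residuals(lm(x2 ~ x1 - 1))
coef(lm(y ~ x1 + x2o - 1))["x1"]  # now identical to the simple-regression coefficient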


Problem 4: Dummy vs. effect coding in MLR

Background material for this task: [Categorical covariates - dummy and effect coding](#categorical)

We will study a dataset where we want to model income as response and two unordered categorical covariates, gender and place (location).

income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
    300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female",
    "Male")), place = factor(place, levels = c("A", "B", "C")))
  1. First, describe the data set.
library(GGally)
GGally::ggpairs(data)

  1. Check out interaction.plot(data$gender, data$place, data$income). What does it show? Do we need an interaction term if we want to fit an MLR with income as response?
interaction.plot(x.factor = data$gender, trace.factor = data$place, response = data$income,
    type = "l")

  1. Check out plot.design(income ~ place + gender, data = data). What does it show?
plot.design(income ~ place + gender, data = data)

  1. First, use treatment contrasts (dummy variable coding) and fit an MLR with income as response and gender and place as covariates.

Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?

mod3 <- lm(income ~ place + gender, data = data)
mod3
## 
## Call:
## lm(formula = income ~ place + gender, data = data)
## 
## Coefficients:
## (Intercept)       placeB       placeC   genderMale  
##      306.25        47.50        65.00        41.25
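One way to read off the intercept (a minimal sketch using the mod3 fit above): with treatment contrasts the intercept is the prediction for the reference levels, here place = A and gender = Female.

# with treatment (dummy) coding the intercept is the fitted value for the
# reference categories place = 'A' and gender = 'Female'
predict(mod3, newdata = data.frame(place = "A", gender = "Female"))  # 306.25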
  1. Now, turn to sum-zero contrast (effect coding).

Explain what your model estimates mean. Now, what is the intercept estimate?

Calculate the estimate for place=C.

mod4 <- lm(income ~ place + gender, data = data, contrasts = list(place = "contr.sum",
    gender = "contr.sum"))
mod4
## 
## Call:
## lm(formula = income ~ place + gender, data = data, contrasts = list(place = "contr.sum", 
##     gender = "contr.sum"))
## 
## Coefficients:
## (Intercept)       place1       place2      gender1  
##      364.38       -37.50        10.00       -20.62
model.matrix(mod4)
##    (Intercept) place1 place2 gender1
## 1            1      1      0      -1
## 2            1      1      0      -1
## 3            1      1      0      -1
## 4            1      1      0      -1
## 5            1      0      1      -1
## 6            1      0      1      -1
## 7            1      0      1      -1
## 8            1      0      1      -1
## 9            1     -1     -1      -1
## 10           1     -1     -1      -1
## 11           1     -1     -1      -1
## 12           1     -1     -1      -1
## 13           1      1      0       1
## 14           1      1      0       1
## 15           1      1      0       1
## 16           1      1      0       1
## 17           1      0      1       1
## 18           1      0      1       1
## 19           1      0      1       1
## 20           1      0      1       1
## 21           1     -1     -1       1
## 22           1     -1     -1       1
## 23           1     -1     -1       1
## 24           1     -1     -1       1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$place
## [1] "contr.sum"
## 
## attr(,"contrasts")$gender
## [1] "contr.sum"
mean(income)
## [1] 364.375
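With sum-to-zero contrasts the effects of each factor sum to zero over its levels, so the estimate for place = C follows from the two printed place effects. A minimal sketch of the calculation (using the mod4 fit above):

# effect coding: effects sum to zero over the levels of each factor, so
# effect(place = C) = -(place1 + place2) = -(-37.5 + 10.0) = 27.5
-(coef(mod4)["place1"] + coef(mod4)["place2"])
# consistent with the dummy coding in mod3: C minus A is 27.5 - (-37.5) = 65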

Next week we connect this to linear hypotheses and ANOVA.


Problem 5: Interactions

This part of the module was marked “self-study”. Go through this together in the group, and make sure that you understand.

Problem 6: Simulations in R (optional)

(a version of this problem was also given as a recommended exercise in TMA4268 Statistical learning)

  1. For simple linear regression, simulate a data set with homoscedastic errors and one with heteroscedastic errors. Here is a suggestion of one solution. Why this? To see how things look when the model is correct and when it is wrong. Look at the code and discuss what is done, and relate this to the plots of the errors (which are usually unobserved) and the plots of the residuals.
# Homoscedastic errors
n = 1000
x = seq(-3, 3, length = n)
beta0 = -1
beta1 = 2
xbeta = beta0 + beta1 * x
sigma = 1
e1 = rnorm(n, mean = 0, sd = sigma)
y1 = xbeta + e1
ehat1 = residuals(lm(y1 ~ x))
plot(x, y1, pch = 20)
abline(beta0, beta1, col = 1)
plot(x, e1, pch = 20)
abline(h = 0, col = 2)
plot(x, ehat1, pch = 20)
abline(h = 0, col = 2)

# Heteroscedastic errors
sigma = (0.1 + 0.3 * (x + 3))^2
e2 = rnorm(n, 0, sd = sigma)
y2 = xbeta + e2
ehat2 = residuals(lm(y2 ~ x))
plot(x, y2, pch = 20)
abline(beta0, beta1, col = 2)
plot(x, e2, pch = 20)
abline(h = 0, col = 2)
plot(x, ehat2, pch = 20)
abline(h = 0, col = 2)
  1. All this fuss about raw, standardized and studentized residuals - does it really matter in practice? Below is one example where the raw residuals are rather different from the standardized, but the standardized are practically identical to the studentized. Can you come up with a simulation model where the standardized and studentized are very different? Hint: what about a smaller sample size? (One possible sketch follows the code below.)
n = 1000
beta = matrix(c(0, 1, 1/2, 1/3), ncol = 1)
set.seed(123)
x1 = rnorm(n, 0, 1)
x2 = rnorm(n, 0, 2)
x3 = rnorm(n, 0, 3)
X = cbind(rep(1, n), x1, x2, x3)
y = X %*% beta + rnorm(n, 0, 2)
fit = lm(y ~ x1 + x2 + x3)
yhat = predict(fit)
summary(fit)
ehat = residuals(fit)
estand = rstandard(fit)
estud = rstudent(fit)
plot(yhat, ehat, pch = 20)
points(yhat, estand, pch = 20, col = 2)
# points(yhat,estud,pch=19,col=3)
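One possible simulation following the hint (a minimal sketch; the specific numbers are arbitrary): with a small sample and one clear outlier, the studentized residual for the outlier is much more extreme than the standardized one.

# small sample with an outlier: standardized vs studentized residuals differ clearly
set.seed(123)
n = 10
x = rnorm(n)
y = 1 + 2 * x + rnorm(n)
y[1] = y[1] + 10  # make the first observation an outlier
fitsmall = lm(y ~ x)
cbind(standardized = rstandard(fitsmall), studentized = rstudent(fitsmall))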

R packages

install.packages(c("formatR", "gamlss.data", "tidyverse", "ggplot2",
    "GGally", "Matrix", "nortest", "lmtest", "wordcloud2", "tm"))

References and further reading