R packages

install.packages(c("formatR", "gamlss.data", "ggfortify", "GGally"))

Theoretical questions

Problem 1

1.

From the module page: Independent pairs \((Y_i, {\bf x}_i)\) for \(i=1,\ldots,n\).

  1. Random component: \(Y_i \sim N\) with \(\text{E}(Y_i)=\mu_i\) and \(\text{Var}(Y_i)=\sigma^2\).
  2. Systematic component: \(\eta_i={\bf x}_i^T \boldsymbol{\beta}\).
  3. Link function, linking the random and systematic components: the identity link and response function, so that \(\mu_i=\eta_i\).

2.

Likelihood: \(L(\beta) = \prod_{i=1}^n f(y_i; \beta) = \left(\frac{1}{2\pi}\right)^{n/2} \left(\frac{1}{\sigma^2}\right)^{n/2} \exp\left(- \frac{1}{2\sigma^2} (\bf{y} - \bf{X}\beta)^T(\bf{y} - \bf{X}\beta)\right)\)

Log-likelihood: \(l(\beta) = \ln L(\beta) = -\frac{n}{2}\ln(2\pi) -\frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2} (\bf{y} - \bf{X}\beta)^T(\bf{y} - \bf{X}\beta)\)

Score vector: \(s(\beta) = \frac{\partial l}{\partial \beta} = \frac{1}{\sigma^2} (\bf{X}^T\bf{Y} - \bf{X}^T\bf{X} \beta) = \frac{1}{\sigma^2} \sum_{i=1}^n ({\bf x}_i Y_i - {\bf x}_i {\bf x}_i^T \beta)\)

To find the parameter estimates we set the score vector equal to zero and solve \(s(\beta) = \bf{0}\). For MLR this has the closed-form solution \(\hat{\beta} = (\bf{X}^T\bf{X})^{-1}\bf{X}^T\bf{y}\); when there is no closed-form solution (as for most other GLMs) we use numerical optimization instead.
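A minimal simulated sketch (hypothetical data, not from the module) confirming that the closed-form solution of the score equations agrees with a numerical optimum and with lm():

set.seed(1)
n <- 100
X <- cbind(1, runif(n))  # design matrix: intercept and one covariate
beta_true <- c(2, 3)
y <- as.vector(X %*% beta_true + rnorm(n))
# closed-form solution of the score equations s(beta) = 0
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
# numerical maximization of the log-likelihood (equivalent to minimizing the RSS)
beta_num <- optim(c(0, 0), function(b) sum((y - X %*% b)^2))$par
cbind(beta_hat, beta_num, coef(lm(y ~ X[, 2])))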

3.

Use the rules on the module pages to calculate this.

Observed Fisher: \(H(\beta) = -\frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T} = \dots = \frac{1}{\sigma^2} \bf{X}^T \bf{X}\)

Expected Fisher: \(F(\beta) = E\left(-\frac{\partial^2 l(\beta)}{\partial \beta \partial \beta^T}\right) = \frac{1}{\sigma^2} \bf{X}^T \bf{X}\). Note that there are no random variables in the observed Fisher information matrix here, so taking the expectation changes nothing.

So for MLR the observed and expected Fisher information are equal! This is not the case for all GLMs (they coincide when the canonical link is used).

Both are square matrices of dimension \(k + 1 = p\) (number of covariates plus one for the intercept), i.e., \(p \times p\). They tell us something about the amount of information a random variable \(Y\) carries about an unknown parameter (here \(\beta\)). In MLR, we use the inverse of the Fisher information (the negative Hessian) to calculate the variance of the parameter estimates. For MLR this is exact; for other GLMs it holds only asymptotically.

4.

\(\hat{\beta}\) is a linear combination of normally distributed RVs \(\implies\) \(\hat{\beta}\) is itself normal. We know that \(\bf{Y} \sim \text{N}(\bf{X}\beta, \sigma^2 \bf{I})\).

\(\text{E}(\hat{\beta}) = \text{E}((\bf{X}^T \bf{X})^{-1} \bf{X}^T \bf{Y}) = (\bf{X}^T \bf{X})^{-1} \bf{X}^T \text{E}(\bf{Y}) = (\bf{X}^T \bf{X})^{-1} \bf{X}^T \bf{X}\beta = \beta\)

\(\text{Var}(\hat{\beta}) = \text{Var}((\bf{X}^T \bf{X})^{-1} \bf{X}^T \bf{Y}) = (\bf{X}^T \bf{X})^{-1} \bf{X}^T \text{Var}(\bf{Y}) ((\bf{X}^T \bf{X})^{-1} \bf{X}^T)^T = \sigma^2 (\bf{X}^T \bf{X})^{-1} \bf{X}^T \bf{X} ((\bf{X}^T \bf{X})^{-1})^T = \sigma^2 ((\bf{X}^T \bf{X})^T)^{-1} = \sigma^2(\bf{X}^T \bf{X})^{-1}\)

This means that the \(j\)th element of \(\hat{\beta}\), \(\hat{\beta}_j\), has a normal distribution with mean \(\beta_j\) and variance given by the inverse Fisher information!
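A quick numerical check (a sketch with hypothetical simulated data): \(\hat{\sigma}^2 (\bf{X}^T\bf{X})^{-1}\) is exactly the covariance matrix that vcov() reports for a fitted lm object:

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100)
mod <- lm(y ~ x)
X <- model.matrix(mod)               # design matrix with intercept column
sigma2_hat <- summary(mod)$sigma^2   # estimate of sigma^2
all.equal(sigma2_hat * solve(t(X) %*% X), vcov(mod))  # TRUE (up to numerical precision)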

5.

The error of an observation is the difference between the observed value and the true (unobservable) value (the true value of \(\varepsilon\) in MLR). The residual of an observation is the difference between the observed value and the predicted value (the estimate \(\hat{\varepsilon}\) in MLR).

Raw residuals: \(\hat{\varepsilon} = Y - \hat{Y}\). These have different variances, and may be correlated, since their variance-covariance matrix is not \(\sigma^2\bf{I}\) but \(\sigma^2(\bf{I}-\bf{H})\), where \(\bf{H} = \bf{X}(\bf{X}^T\bf{X})^{-1}\bf{X}^T\) is the hat matrix, which is not diagonal. We solve this by standardizing the residuals so they all have the same variance (but they are still correlated), dividing each raw residual by \(\hat{\sigma}\sqrt{1 - h_{ii}}\), where \(\hat{\sigma}^2\) is the estimated variance and \(h_{ii}\) the \(i\)th diagonal element of \(\bf{H}\). We can also use studentized residuals, where we omit observation \(i\) when calculating the estimated variance \(\hat{\sigma}^2_{(i)}\) used to scale residual \(i\).
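All three types of residuals are available directly in R; a minimal sketch, using the built-in cars data as a stand-in for any fitted lm object:

mod <- lm(dist ~ speed, data = cars)  # any fitted lm object
head(resid(mod))      # raw residuals y - y_hat
head(rstandard(mod))  # standardized: residual / (sigma_hat * sqrt(1 - h_ii))
head(rstudent(mod))   # studentized: sigma_hat estimated without observation i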

6.

In a normal Q-Q plot the sample quantiles are plotted against the standard normal quantiles, so for \(N(\mu, \sigma^2)\) data the points fall along the line \(\mu + \sigma z\): the intercept is thus the mean, and the slope is the standard deviation.
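A quick simulated illustration (with assumed values \(\mu = 5\) and \(\sigma = 2\)): the Q-Q points fall along the line with intercept equal to the mean and slope equal to the standard deviation:

set.seed(2)
z <- rnorm(1000, mean = 5, sd = 2)
qqnorm(z)             # sample quantiles vs standard normal quantiles
abline(a = 5, b = 2)  # line with intercept = mean and slope = sd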

Interpretation and understanding

Problem 2: Munich Rent Index data

1.

library(gamlss.data)  # the rent99 data set
library(ggfortify)    # autoplot method for lm objects
?rent99

# response: net rent per month (mod1) or net rent per square metre per month (mod2)
mod1 <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
mod2 <- lm(rentsqm ~ area + location + bath + kitchen + cheating, data = rent99)
autoplot(mod1, label.size = 2)

autoplot(mod2, label.size = 2)

2.

Discuss the meaning of each covariate estimate.

The intercept is the expected rent for an apartment of 0 square metres (which does not make sense by itself!) at location 1, without bath, kitchen or central heating: about -21.97 Euro per month in mod1, or 7.11 Euro per square metre per month in mod2. In mod1, each additional square metre increases the expected rent by 4.58 Euro, locations 2 and 3 add 39.26 and 126.06 Euro relative to location 1, and a bath, a kitchen and central heating add 74.05, 120.43 and 161.41 Euro respectively, all else held fixed.
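As a sanity check on the interpretation, the coefficients of mod1 can be combined directly to predict the rent of a hypothetical apartment, here 50 square metres in location 2 with bath, kitchen and central heating:

# hypothetical apartment: 50 sqm, location 2, bath, kitchen, central heating
new_flat <- data.frame(area = 50, location = "2", bath = "1", kitchen = "1", cheating = "1")
predict(mod1, newdata = new_flat)
# by hand: -21.97 + 50 * 4.58 + 39.26 + 74.05 + 120.43 + 161.41, about 602 Euro per month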

3.

summary(mod1)
summary(mod2)
## 
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = rentsqm ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4959 -1.4084 -0.0733  1.3847  9.4400 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.108319   0.168567  42.169  < 2e-16 ***
## area        -0.038154   0.001653 -23.077  < 2e-16 ***
## location2    0.628698   0.078782   7.980 2.04e-15 ***
## location3    1.686099   0.244061   6.909 5.93e-12 ***
## bath1        0.989898   0.162113   6.106 1.15e-09 ***
## kitchen1     1.412113   0.188299   7.499 8.34e-14 ***
## cheating1    2.414101   0.125297  19.267  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.1 on 3075 degrees of freedom
## Multiple R-squared:  0.2584, Adjusted R-squared:  0.2569 
## F-statistic: 178.6 on 6 and 3075 DF,  p-value: < 2.2e-16

Problem 3: Simple vs. multiple regression

1.

If the columns of \(\bf{X}\) are orthogonal, the matrix \(\bf{X}^T\bf{X}\) becomes a diagonal matrix, so \(\text{Var}(\hat{\beta}) = \sigma^2(\bf{X}^T\bf{X})^{-1}\) is also diagonal and the \(\hat{\beta}\)s are uncorrelated, hence independent when the data are assumed Gaussian. In a designed experiment we can choose the covariate values so that we get orthogonal columns.

2.

Without an intercept (centred data), the estimate of \(\beta_1\) in the simple regression model and in the MLR (where we also have \(\beta_2\), etc.) will not change, since the new covariate is measured orthogonal (uncorrelated) to the old one and thus does not affect the estimate of \(\beta_1\); see the sketch below.
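A small simulated sketch (hypothetical centred covariates) illustrating both points: \(\bf{X}^T\bf{X}\) is diagonal, and adding the orthogonal covariate leaves \(\hat{\beta}_1\) unchanged:

set.seed(3)
x1 <- rnorm(50); x1 <- x1 - mean(x1)  # centred covariate
x2 <- resid(lm(rnorm(50) ~ x1))       # constructed to be orthogonal to x1 (and centred)
y <- 1.5 * x1 + 0.5 * x2 + rnorm(50)
round(crossprod(cbind(x1, x2)), 10)   # X^T X is diagonal
coef(lm(y ~ x1 - 1))                  # simple regression without intercept
coef(lm(y ~ x1 + x2 - 1))             # the estimate of beta_1 is identical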

3.

Multicollinearity means that some columns of \(\bf{X}\) are close to linearly dependent, so that \(\bf{X}^T\bf{X}\) is nearly singular. The entries of \((\bf{X}^T\bf{X})^{-1}\) then become large, so the \(\hat{\beta}\)s get large variances and are highly correlated. This makes the interpretation of the individual coefficients difficult.
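A small simulated sketch (hypothetical covariates) of this effect: with two nearly collinear covariates the standard errors blow up and the two estimates are almost perfectly negatively correlated:

set.seed(4)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.05)  # nearly collinear with x1
y <- x1 + x2 + rnorm(100)
mod_col <- lm(y ~ x1 + x2)
summary(mod_col)$coefficients     # large standard errors for x1 and x2
cov2cor(vcov(mod_col))            # correlation between the two estimates close to -1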

Problem 4: Dummy vs. effect coding in MLR

income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410, 
    300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female", 
    "Male")), place = factor(place, levels = c("A", "B", "C")))

1.

GGally::ggpairs(data)

2.

interaction.plot(x.factor = data$gender, trace.factor = data$place, response = data$income, 
    type = "l")

The plots show that men have higher income than women, and that there is a difference between the places. Since the lines are close to parallel, an interaction does not seem necessary.

3.

plot.design(income ~ place + gender, data = data)

The plot shows the average income for each level of place and gender, together with the overall average income.

4.

mod3 <- lm(income ~ place + gender, data = data)
mod3
## 
## Call:
## lm(formula = income ~ place + gender, data = data)
## 
## Coefficients:
## (Intercept)       placeB       placeC   genderMale  
##      306.25        47.50        65.00        41.25

See that this matches the interaction.plot above: the combination Female and place A corresponds to the intercept alone, and the other coefficients are the increases for place B, place C and males.
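A quick check using predict() on the fitted mod3: a female in place A gets exactly the intercept, and the other combinations follow by adding the corresponding coefficients:

predict(mod3, newdata = data.frame(place = "A", gender = "Female"))  # 306.25, the intercept
predict(mod3, newdata = data.frame(place = "C", gender = "Male"))    # 306.25 + 65 + 41.25 = 412.5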

5.

mod4 <- lm(income ~ place + gender, data = data, contrasts = list(place = "contr.sum", 
    gender = "contr.sum"))
mod4
model.matrix(mod4)
mean(income)
## 
## Call:
## lm(formula = income ~ place + gender, data = data, contrasts = list(place = "contr.sum", 
##     gender = "contr.sum"))
## 
## Coefficients:
## (Intercept)       place1       place2      gender1  
##      364.38       -37.50        10.00       -20.62  
## 
##    (Intercept) place1 place2 gender1
## 1            1      1      0      -1
## 2            1      1      0      -1
## 3            1      1      0      -1
## 4            1      1      0      -1
## 5            1      0      1      -1
## 6            1      0      1      -1
## 7            1      0      1      -1
## 8            1      0      1      -1
## 9            1     -1     -1      -1
## 10           1     -1     -1      -1
## 11           1     -1     -1      -1
## 12           1     -1     -1      -1
## 13           1      1      0       1
## 14           1      1      0       1
## 15           1      1      0       1
## 16           1      1      0       1
## 17           1      0      1       1
## 18           1      0      1       1
## 19           1      0      1       1
## 20           1      0      1       1
## 21           1     -1     -1       1
## 22           1     -1     -1       1
## 23           1     -1     -1       1
## 24           1     -1     -1       1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$place
## [1] "contr.sum"
## 
## attr(,"contrasts")$gender
## [1] "contr.sum"
## 
## [1] 364.375

Now the intercept is the overall average income (364.375), and the effect-coded coefficients are deviations from this average. Look at the plot.design plot above and see that the estimates match the group averages shown there.

Problem 5: Interactions

No solution.

Problem 6: Simulations in R

No solution.