Independent pairs \((Y_i, {\bf x}_i)\) for \(i=1,\ldots,n\).
The model is the same, but it is expressed differently, and there are several ways of looking at this:
This helps us get to the generalisation to GLMs.
Likelihood:
\[ L({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = \frac{1}{(2\pi \sigma^2)^{n/2}} e^{-\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta})} \]
log-likelihood:
\[ l({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = -\frac{n}{2}\log(2\pi \sigma^2) -\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta}) \] The score vector is
\[ s(\beta)=\frac{\partial l(\beta)}{\partial \beta}= \sum_{i=1}^n \frac{\partial l_i(\beta)}{\partial \beta}= \sum_{i=1}^n s_i(\beta) \]
Writing \(\psi_i = y_i - x_i^T \mathbf{\beta}\),
\[ \begin{align} s_i(\beta) &= \frac{\partial l_i(\beta)}{\partial \psi_i} \frac{\partial \psi_i}{\partial \beta} \\ &= -\frac{1}{2} 2 \sigma^{-2} \psi_i \, (-x_i) \\ &= \sigma^{-2} (y_i - x_i^T \mathbf{\beta}) x_i \\ &= \sigma^{-2} (x_i y_i - x_i x_i^T \mathbf{\beta}) \end{align} \]
Start with \(s(\beta)=0\)…
\[ \begin{align} s(\beta) = \sigma^{-2} (X^T \mathbf{y} - X^TX \mathbf{\beta}) &= 0 \\ X^TX \mathbf{\beta} &= X^T \mathbf{y}\\ (X^TX)^{-1}X^TX \mathbf{\beta} &= (X^TX)^{-1} X^T \mathbf{y}\\ \mathbf{\hat{\beta}} &= (X^TX)^{-1} X^T \mathbf{y} \end{align} \]
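As a quick numerical sanity check (a small illustration with simulated data, not part of the course notes), we can solve the normal equations directly and compare with lm():

```r
# Solve the normal equations (X^T X) beta = X^T y by hand and compare with lm()
set.seed(1)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))           # design matrix with intercept column
beta <- c(1, 2, -0.5)
y <- drop(X %*% beta + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # (X^T X)^{-1} X^T y
fit <- lm(y ~ X - 1)                        # same model fitted by lm()
cbind(beta_hat, coef(fit))                  # the two columns agree
```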
If there is no closed-form solution, we have to resort to numerical methods, of which there are many (e.g. Newton-Raphson or Fisher scoring).
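For the linear model the score is linear in \(\beta\), so one Fisher-scoring step \(\beta^{(t+1)} = \beta^{(t)} + F(\beta^{(t)})^{-1} s(\beta^{(t)})\) already lands on the MLE; a sketch with simulated data (illustrative only):

```r
# Fisher scoring for the linear model converges in a single step
set.seed(42)
n <- 50
X <- cbind(1, rnorm(n))
y <- drop(X %*% c(1, 2) + rnorm(n))

beta_t <- c(0, 0)                               # arbitrary starting value
score  <- t(X) %*% (y - X %*% beta_t)           # s(beta), dropping the sigma^-2 factor
F_info <- t(X) %*% X                            # F(beta), dropping the same factor
beta_1 <- drop(beta_t + solve(F_info, score))   # one update: equals (X^T X)^{-1} X^T y
```

The \(\sigma^{-2}\) factor cancels in \(F^{-1}s\), which is why it can be dropped from both.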
Observed Fisher information matrix:
\[ H(\beta) = -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} = -\frac{\partial s(\beta)}{\partial\beta^T} \]
Expected Fisher information matrix
\[ F(\beta) = E\left(-\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} \right) \]
They are both square matrices of dimension \(p \times p\), where \(p\) is the length of \(\beta\).
(NOTE: these next two questions have been added at the last minute, so may not be complete)
Calculate the Fisher information for multiple regression
You can go via the likelihood or the score.
The log-likelihood, up to an additive constant, is
\[ l(\beta) = -\frac{1}{2} (\mathbf{y} - X \mathbf{\beta})^T \Sigma^{-1} (\mathbf{y} - X \mathbf{\beta}) \]
… (to be continued. Basically, use Rule 3, and substitute \(\beta = (\mathbf{y} - X \mathbf{\beta})\), with apologies for using \(\beta\) to mean different things.)
Or we can go via the score \[ s(\beta) = X^T \Sigma^{-1} (\mathbf{y} - X \mathbf{\beta}), \] where \(\Sigma^{-1} = \sigma^{-2} I\), and use Rule 1.
How does it relate to the covariance of \(\hat{\beta}\)?
It’s the inverse: \(\text{Cov}(\hat{\beta}) = \sigma^2({\bf X}^T{\bf X})^{-1} = F(\beta)^{-1}\).
(we will see later that this is a special case of a more general result, which is why we care about it now)
\[ \hat{\beta}=({\bf X}^T{\bf X})^{-1} {\bf X}^T {\bf y} \]
with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
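We can verify this relationship numerically (an illustration with simulated data; vcov() returns the estimated covariance matrix of \(\hat{\beta}\)):

```r
# vcov(fit) equals hat(sigma)^2 (X^T X)^{-1}: the inverse Fisher information
# with sigma^2 replaced by its unbiased estimate
set.seed(2)
n <- 80
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 0.5 * x1 - x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
sigma2_hat <- summary(fit)$sigma^2          # RSS / (n - p)
V <- sigma2_hat * solve(t(X) %*% X)         # sigma^2 (X^T X)^{-1}
sqrt(diag(V))                               # the std. errors reported by summary(fit)
```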
Show that \(\hat{\beta}\) has this distribution with the given mean and covariance matrix.
What does this imply for the distribution of the \(j\)th element of \(\hat{\beta}\)? In particular, how can we calculate the variance of \(\hat{\beta}_j\)?
The \(j\)th element of \(\hat{\beta}\) is normally distributed, with mean \(\beta_j\) and variance \(\sigma^2 \left[({\bf X}^T{\bf X})^{-1}\right]_{jj}\), the \(j\)th diagonal element of the covariance matrix.
The difference between an estimand and an estimate.
The error is the difference from the true (or “true”) line: the residual is from the fitted line.
The raw residuals (i.e. \(y_i - x_i^T \mathbf{\hat{\beta}}\)) are \[\hat{\varepsilon}={\bf Y}-{\bf \hat{Y}}=({\bf I}-{\bf H}){\bf Y},\] where \({\bf H} = X (X^T X)^{-1} X^T\) is the hat matrix. Their covariance is \(\sigma^2({\bf I}-{\bf H})\), so they can be heteroscedastic, and correlated (unless \({\bf H}\) behaves)
The solution is to studentise them.
Hint: QQ–plot as html
Because we’re assuming a 1:1 relationship, the intercept should be 0, and slope 1.
Fit the regression model with first rent and then rentsqm as response and the following covariates: area, location (dummy variable coding using location2 and location3), bath, kitchen and cheating (central heating).
library(gamlss.data)
library(ggfortify)
?rent99
mod1 <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
mod2 <- lm(rentsqm ~ area + location + bath + kitchen + cheating, data = rent99)
autoplot(mod1, label.size = 2)
autoplot(mod2, label.size = 2)
Concentrate on the response model you choose for the rest of the tasks.
summary(mod1)
##
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.41 -89.17 -6.26 82.96 1000.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.9733 11.6549 -1.885 0.0595 .
## area 4.5788 0.1143 40.055 < 2e-16 ***
## location2 39.2602 5.4471 7.208 7.14e-13 ***
## location3 126.0575 16.8747 7.470 1.04e-13 ***
## bath1 74.0538 11.2087 6.607 4.61e-11 ***
## kitchen1 120.4349 13.0192 9.251 < 2e-16 ***
## cheating1 161.4138 8.6632 18.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4494
## F-statistic: 420 on 6 and 3075 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = rentsqm ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4959 -1.4084 -0.0733 1.3847 9.4400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.108319 0.168567 42.169 < 2e-16 ***
## area -0.038154 0.001653 -23.077 < 2e-16 ***
## location2 0.628698 0.078782 7.980 2.04e-15 ***
## location3 1.686099 0.244061 6.909 5.93e-12 ***
## bath1 0.989898 0.162113 6.106 1.15e-09 ***
## kitchen1 1.412113 0.188299 7.499 8.34e-14 ***
## cheating1 2.414101 0.125297 19.267 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.1 on 3075 degrees of freedom
## Multiple R-squared: 0.2584, Adjusted R-squared: 0.2569
## F-statistic: 178.6 on 6 and 3075 DF, p-value: < 2.2e-16
Next week: more on inference on this data set.
We look at a regression problem where both the response and the covariates are centered - that is, the mean of the response and the mean of each covariate is zero. We do this to avoid the intercept term, which would otherwise make things a bit more complicated.
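As a small illustration (simulated data, not from the notes): centring the response and the covariate lets us drop the intercept without changing the slope estimate.

```r
# Centring removes the need for an intercept: the slope estimate is unchanged
set.seed(3)
x <- rnorm(40, mean = 5)
y <- 2 + 3 * x + rnorm(40)

fit_full <- lm(y ~ x)                                 # with intercept
fit_cent <- lm(I(y - mean(y)) ~ I(x - mean(x)) - 1)   # centred, no intercept
c(coef(fit_full)["x"], coef(fit_cent))                # identical slope estimates
```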
In a design matrix (without an intercept column), orthogonal columns give a diagonal \({\bf X}^T {\bf X}\). What does that mean? How can we get orthogonal columns?
If we have orthogonal columns, will the simple (only one covariate) and multiple regression coefficient estimates then differ? Explain.
What is multicollinearity? Is that a problem? Why (not)?
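These questions can be explored numerically (an illustrative sketch with constructed data): with exactly orthogonal columns \({\bf X}^T{\bf X}\) is diagonal and the simple and multiple regression estimates coincide, while a nearly collinear column inflates the standard errors.

```r
# Orthogonal columns: diagonal X^T X, and simple = multiple regression estimates
set.seed(4)
n <- 100
Z <- qr.Q(qr(matrix(rnorm(2 * n), n, 2)))       # two exactly orthogonal columns
x1 <- Z[, 1]; x2 <- Z[, 2]
y <- x1 + 2 * x2 + rnorm(n, sd = 0.5)

crossprod(cbind(x1, x2))                        # off-diagonals are (numerically) zero
coef(lm(y ~ x1 + x2 - 1))                       # multiple regression
c(coef(lm(y ~ x1 - 1)), coef(lm(y ~ x2 - 1)))   # simple regressions: same values

# Multicollinearity: a near-copy of x1 makes the standard errors explode
x3 <- x1 + rnorm(n, sd = 0.01)
summary(lm(y ~ x1 + x3 - 1))$coefficients
```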
Background material for this task: [Categorical covariates - dummy and effect coding](#categorical)
We will study a dataset where we want to model income as response with two unordered categorical covariates, gender and place (location).
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female",
"Male")), place = factor(place, levels = c("A", "B", "C")))
library(GGally)
GGally::ggpairs(data)
Run interaction.plot(data$gender, data$place, data$income). What does it show? Do we need an interaction term if we want to model an MLR with income as response?
interaction.plot(x.factor = data$gender, trace.factor = data$place, response = data$income,
    type = "l")
Run plot.design(income ~ place + gender, data = data). What does it show?
plot.design(income ~ place + gender, data = data)
Fit an MLR with income as response and gender and place as covariates. Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?
mod3 <- lm(income ~ place + gender, data = data)
mod3
##
## Call:
## lm(formula = income ~ place + gender, data = data)
##
## Coefficients:
## (Intercept) placeB placeC genderMale
## 306.25 47.50 65.00 41.25
Explain what your model estimates mean. Now, what is the intercept estimate? Calculate the estimate for place = C.
mod4 <- lm(income ~ place + gender, data = data, contrasts = list(place = "contr.sum",
gender = "contr.sum"))
mod4
##
## Call:
## lm(formula = income ~ place + gender, data = data, contrasts = list(place = "contr.sum",
## gender = "contr.sum"))
##
## Coefficients:
## (Intercept) place1 place2 gender1
## 364.38 -37.50 10.00 -20.63
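With sum-to-zero (effect) coding the place effects sum to zero over the levels, so the estimate for place = C is minus the sum of the two printed effects; a small check, rebuilding the data and model from above:

```r
# Recover the effect of place = C under contr.sum: effects sum to zero over levels
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
            300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- factor(c(rep("Male", 12), rep("Female", 12)), levels = c("Female", "Male"))
place <- factor(rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2),
                levels = c("A", "B", "C"))
m <- lm(income ~ place + gender, data = data.frame(income, gender, place),
        contrasts = list(place = "contr.sum", gender = "contr.sum"))
placeC <- -(coef(m)["place1"] + coef(m)["place2"])   # effect for place = C
unname(placeC)                                       # 27.5
```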
model.matrix(mod4)
## (Intercept) place1 place2 gender1
## 1 1 1 0 -1
## 2 1 1 0 -1
## 3 1 1 0 -1
## 4 1 1 0 -1
## 5 1 0 1 -1
## 6 1 0 1 -1
## 7 1 0 1 -1
## 8 1 0 1 -1
## 9 1 -1 -1 -1
## 10 1 -1 -1 -1
## 11 1 -1 -1 -1
## 12 1 -1 -1 -1
## 13 1 1 0 1
## 14 1 1 0 1
## 15 1 1 0 1
## 16 1 1 0 1
## 17 1 0 1 1
## 18 1 0 1 1
## 19 1 0 1 1
## 20 1 0 1 1
## 21 1 -1 -1 1
## 22 1 -1 -1 1
## 23 1 -1 -1 1
## 24 1 -1 -1 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$place
## [1] "contr.sum"
##
## attr(,"contrasts")$gender
## [1] "contr.sum"
mean(income)
## [1] 364.375
Next week we connect this to linear hypotheses and ANOVA.
This part of the module was marked “self-study”. Go through this together in the group, and make sure that you understand.
(a version of this problem was also given as a recommended exercise in TMA4268 Statistical Learning)
# Homoscedastic errors
n = 1000
x = seq(-3, 3, length = n)
beta0 = -1
beta1 = 2
xbeta = beta0 + beta1 * x
sigma = 1
e1 = rnorm(n, mean = 0, sd = sigma)
y1 = xbeta + e1
ehat1 = residuals(lm(y1 ~ x))
plot(x, y1, pch = 20)
abline(beta0, beta1, col = 1)
plot(x, e1, pch = 20)
abline(h = 0, col = 2)
plot(x, ehat1, pch = 20)
abline(h = 0, col = 2)
# Heteroscedastic errors
sigma = (0.1 + 0.3 * (x + 3))^2
e2 = rnorm(n, 0, sd = sigma)
y2 = xbeta + e2
ehat2 = residuals(lm(y2 ~ x))
plot(x, y2, pch = 20)
abline(beta0, beta1, col = 2)
plot(x, e2, pch = 20)
abline(h = 0, col = 2)
plot(x, ehat2, pch = 20)
abline(h = 0, col = 2)
n = 1000
beta = matrix(c(0, 1, 1/2, 1/3), ncol = 1)
set.seed(123)
x1 = rnorm(n, 0, 1)
x2 = rnorm(n, 0, 2)
x3 = rnorm(n, 0, 3)
X = cbind(rep(1, n), x1, x2, x3)
y = X %*% beta + rnorm(n, 0, 2)
fit = lm(y ~ x1 + x2 + x3)
yhat = predict(fit)
summary(fit)
ehat = residuals(fit)
estand = rstandard(fit)
estud = rstudent(fit)
plot(yhat, ehat, pch = 20)
points(yhat, estand, pch = 20, col = 2)
# points(yhat,estud,pch=19,col=3)
install.packages(c("formatR", "gamlss.data", "tidyverse", "ggplot2",
"GGally", "Matrix", "nortest", "lmtest", "wordcloud2", "tm"))