Independent pairs \((Y_i, {\bf x}_i)\) for \(i=1,\ldots,n\).
The model is the same, but it is expressed differently, and there are several ways of looking at this:
This helps us get to the generalisation to GLMs.
Likelihood:
\[ L({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = \left(\frac{1}{2\pi \sigma^2}\right)^{n/2} e^{-\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta})} \]
log-likelihood:
\[ l({\boldsymbol \beta}, \sigma^2 ; X, {\bf y}) = -\frac{n}{2}\log(2\pi \sigma^2) -\frac{1}{2 \sigma^2}({\bf y} - X {\boldsymbol \beta})^T({\bf y} - X {\boldsymbol \beta}) \] The score vector is
\[ s(\beta)=\frac{\partial l(\beta)}{\partial \beta}= \sum_{i=1}^n \frac{\partial l_i(\beta)}{\partial \beta}= \sum_{i=1}^n s_i(\beta) \]
For the normal linear model each term is \(s_i(\beta)=\frac{1}{\sigma^2}{\bf x}_i(Y_i-{\bf x}_i^T\beta)\), so that \(s(\beta)=\frac{1}{\sigma^2}X^T({\bf y}-X\beta)\).
Setting \(s(\beta)=0\) gives the normal equations \(X^TX\hat{\beta}=X^T{\bf y}\), and hence \(\hat{\beta}=(X^TX)^{-1}X^T{\bf y}\) when \(X\) has full column rank.
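A quick numerical check (my addition, on simulated data): the closed-form solution of the normal equations agrees with `lm`.

set.seed(2)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))  # design matrix with intercept column
beta_true <- c(1, 2, -0.5)
y <- drop(X %*% beta_true + rnorm(n))
betahat <- solve(t(X) %*% X, t(X) %*% y)  # solves X^T X beta = X^T y
cbind(normal_eq = betahat, lm = coef(lm(y ~ X - 1)))  # identical columns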
If there is no closed-form solution, we have to resort to numerical methods, of which there are many.
The matrices can tell us the location of the Holy Grail?
with \(\hat{\beta}\sim N_{p}(\beta,\sigma^2({\bf X}^T{\bf X})^{-1})\).
Show that \(\hat{\beta}\) has this distribution with the given mean and covariance matrix.
What does this imply for the distribution of the \(j\)th element of \(\hat{\beta}\)? In particular, how can we calculate the variance of \(\hat{\beta}_j\)?
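One way to check the answer numerically (my addition, with simulated data): estimate \(\sigma^2\) by \(\text{RSS}/(n-p)\), form \(\hat{\sigma}^2({\bf X}^T{\bf X})^{-1}\), and compare its diagonal with `vcov` from `lm`.

set.seed(3)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + x1 - x2 + rnorm(n, sd = 2)
fit <- lm(y ~ x1 + x2)
X <- model.matrix(fit)
sigma2hat <- sum(residuals(fit)^2)/(n - ncol(X))  # unbiased estimate of sigma^2
V <- sigma2hat * solve(t(X) %*% X)                # estimated Cov(betahat)
cbind(manual = sqrt(diag(V)), lm = sqrt(diag(vcov(fit))))  # standard errors agree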
The difference between an estimand and an estimate.
To be explained later…
Hint: QQ–plot as html
Whatever you want: you can re-scale the predicted quantiles.
(not the best answer, of course)

Fit the regression model with first `rent` and then `rentsqm` as response, and the following covariates: `area`, `location` (dummy variable coding using `location2` and `location3`), `bath`, `kitchen` and `cheating` (central heating).
library(gamlss.data)
library(ggfortify)
?rent99
# Fit the two candidate models: rent, and rent per square metre, as response
mod1 <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
mod2 <- lm(rentsqm ~ area + location + bath + kitchen + cheating, data = rent99)
# Standard diagnostic plots for each fit
autoplot(mod1, label.size = 2)
autoplot(mod2, label.size = 2)
Concentrate on the response model you chose for the rest of the tasks.
summary(mod1)
##
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.41 -89.17 -6.26 82.96 1000.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.9733 11.6549 -1.885 0.0595 .
## area 4.5788 0.1143 40.055 < 2e-16 ***
## location2 39.2602 5.4471 7.208 7.14e-13 ***
## location3 126.0575 16.8747 7.470 1.04e-13 ***
## bath1 74.0538 11.2087 6.607 4.61e-11 ***
## kitchen1 120.4349 13.0192 9.251 < 2e-16 ***
## cheating1 161.4138 8.6632 18.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4494
## F-statistic: 420 on 6 and 3075 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = rentsqm ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4959 -1.4084 -0.0733 1.3847 9.4400
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.108319 0.168567 42.169 < 2e-16 ***
## area -0.038154 0.001653 -23.077 < 2e-16 ***
## location2 0.628698 0.078782 7.980 2.04e-15 ***
## location3 1.686099 0.244061 6.909 5.93e-12 ***
## bath1 0.989898 0.162113 6.106 1.15e-09 ***
## kitchen1 1.412113 0.188299 7.499 8.34e-14 ***
## cheating1 2.414101 0.125297 19.267 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.1 on 3075 degrees of freedom
## Multiple R-squared: 0.2584, Adjusted R-squared: 0.2569
## F-statistic: 178.6 on 6 and 3075 DF, p-value: < 2.2e-16
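To make the dummy-variable interpretation concrete, here is a small illustration (my addition; the two flats are hypothetical): flats that are identical except for central heating should differ in predicted rent by exactly the `cheating1` estimate.

# Hypothetical flats: 50 sqm, location 2, no bath/kitchen upgrade,
# differing only in central heating (factor levels as in rent99)
newflats <- data.frame(area = c(50, 50),
    location = factor(c(2, 2), levels = c(1, 2, 3)),
    bath = factor(c(0, 0), levels = c(0, 1)),
    kitchen = factor(c(0, 0), levels = c(0, 1)),
    cheating = factor(c(0, 1), levels = c(0, 1)))
predict(mod1, newdata = newflats)
diff(predict(mod1, newdata = newflats))  # equals the cheating1 estimate, 161.41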
Next week: more on inference on this data set.
We look at a regression problem where both the response and the covariates are centered, that is, the mean of the response and the mean of each covariate is zero. We do this to avoid the intercept term, which would otherwise complicate the formulas.
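A small illustration (my addition, simulated data): after centering, the slope estimate is unchanged and no intercept is needed.

set.seed(4)
n <- 100
x <- rnorm(n, mean = 5)
y <- 3 + 2 * x + rnorm(n)
xc <- x - mean(x)  # centered covariate
yc <- y - mean(y)  # centered response
coef(lm(y ~ x))        # intercept and slope
coef(lm(yc ~ xc - 1))  # same slope, fitted without an intercept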
In a design matrix (without an intercept column), orthogonal columns give a diagonal \({\bf X}^T {\bf X}\). What does that mean? How can we get orthogonal columns?
If we have orthogonal columns, will the simple (only one covariate) and multiple estimated regression coefficients then be different? Explain. (A small numerical sketch follows these questions.)
What is multicollinearity? Is that a problem? Why (not)?
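Here is the sketch referred to above (my addition): with centered, orthogonal columns, \({\bf X}^T {\bf X}\) is diagonal and the simple and multiple regression estimates coincide.

set.seed(5)
n <- 100
x1 <- rnorm(n)
x2 <- residuals(lm(rnorm(n) ~ x1))  # orthogonalize a second covariate against x1
x1 <- x1 - mean(x1)                 # centering keeps orthogonality (x2 already has mean zero)
y <- x1 + 2 * x2 + rnorm(n)
y <- y - mean(y)
crossprod(cbind(x1, x2))    # off-diagonal entries are (numerically) zero
coef(lm(y ~ x1 - 1))        # simple regression on x1 alone
coef(lm(y ~ x1 + x2 - 1))   # same x1 coefficient in the multiple regression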
Background material for this task: [Categorical covariates - dummy and effect coding](#categorical)
We will study a dataset where we want to model `income` as response, with two unordered categorical covariates, `gender` and `place` (location).
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female",
"Male")), place = factor(place, levels = c("A", "B", "C")))
library(GGally)
GGally::ggpairs(data)
Look at the data using an interaction plot. What does it show? Do we need an interaction term if we want to model an MLR with `income` as response?

interaction.plot(x.factor = data$gender, trace.factor = data$place, response = data$income,
    type = "l")

Look at the data using `plot.design`. What does it show?

plot.design(income ~ place + gender, data = data)
Fit an MLR with `income` as response and `gender` and `place` as covariates. Explain what your model estimates mean. In particular, what is the interpretation of the intercept estimate?
mod3 <- lm(income ~ place + gender, data = data)
mod3
##
## Call:
## lm(formula = income ~ place + gender, data = data)
##
## Coefficients:
## (Intercept) placeB placeC genderMale
## 306.25 47.50 65.00 41.25
Now fit the same model using effect ("sum-to-zero") coding. Explain what your model estimates mean. What is the intercept estimate now? Calculate the estimate for `place = C`.
mod4 <- lm(income ~ place + gender, data = data, contrasts = list(place = "contr.sum",
gender = "contr.sum"))
mod4
##
## Call:
## lm(formula = income ~ place + gender, data = data, contrasts = list(place = "contr.sum",
## gender = "contr.sum"))
##
## Coefficients:
## (Intercept) place1 place2 gender1
## 364.38 -37.50 10.00 -20.62
model.matrix(mod4)
## (Intercept) place1 place2 gender1
## 1 1 1 0 -1
## 2 1 1 0 -1
## 3 1 1 0 -1
## 4 1 1 0 -1
## 5 1 0 1 -1
## 6 1 0 1 -1
## 7 1 0 1 -1
## 8 1 0 1 -1
## 9 1 -1 -1 -1
## 10 1 -1 -1 -1
## 11 1 -1 -1 -1
## 12 1 -1 -1 -1
## 13 1 1 0 1
## 14 1 1 0 1
## 15 1 1 0 1
## 16 1 1 0 1
## 17 1 0 1 1
## 18 1 0 1 1
## 19 1 0 1 1
## 20 1 0 1 1
## 21 1 -1 -1 1
## 22 1 -1 -1 1
## 23 1 -1 -1 1
## 24 1 -1 -1 1
## attr(,"assign")
## [1] 0 1 1 2
## attr(,"contrasts")
## attr(,"contrasts")$place
## [1] "contr.sum"
##
## attr(,"contrasts")$gender
## [1] "contr.sum"
mean(income)
## [1] 364.375
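Under sum-to-zero coding the effects for a factor sum to zero, so the `place = C` effect can be recovered from the two printed ones. A sketch (my addition):

effC <- -(coef(mod4)["place1"] + coef(mod4)["place2"])  # effects sum to zero within place
unname(effC)                       # estimated effect of place = C: 27.5
coef(mod4)["(Intercept)"] + effC   # mean income for place C, averaged over gender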
Next week we connect this to linear hypotheses and ANOVA.
This part of the module was marked “self-study”. Go through this together in the group, and make sure that you understand.
(a version of this problem was also given as a recommended exercise in TMA4268 Statistical Learning)
# Homoscedastic errors
n = 1000
x = seq(-3, 3, length = n)
beta0 = -1
beta1 = 2
xbeta = beta0 + beta1 * x  # true regression line
sigma = 1
e1 = rnorm(n, mean = 0, sd = sigma)  # constant-variance errors
y1 = xbeta + e1
ehat1 = residuals(lm(y1 ~ x))
plot(x, y1, pch = 20)
abline(beta0, beta1, col = 1)  # true line
plot(x, e1, pch = 20)  # true errors
abline(h = 0, col = 2)
plot(x, ehat1, pch = 20)  # estimated residuals
abline(h = 0, col = 2)
# Heteroscedastic errors
sigma = (0.1 + 0.3 * (x + 3))^2  # error sd now grows with x
e2 = rnorm(n, 0, sd = sigma)
y2 = xbeta + e2
ehat2 = residuals(lm(y2 ~ x))
plot(x, y2, pch = 20)
abline(beta0, beta1, col = 2)
plot(x, e2, pch = 20)
abline(h = 0, col = 2)
plot(x, ehat2, pch = 20)
abline(h = 0, col = 2)
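The plots show the variance pattern visually; as a supplement (my addition), the Breusch-Pagan test from the `lmtest` package (installed below) gives a formal check:

library(lmtest)
bptest(lm(y1 ~ x))  # homoscedastic data: expect a large p-value
bptest(lm(y2 ~ x))  # heteroscedastic data: expect a very small p-value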
n = 1000
beta = matrix(c(0, 1, 1/2, 1/3), ncol = 1)  # true coefficients
set.seed(123)
x1 = rnorm(n, 0, 1)
x2 = rnorm(n, 0, 2)
x3 = rnorm(n, 0, 3)
X = cbind(rep(1, n), x1, x2, x3)  # design matrix with intercept
y = X %*% beta + rnorm(n, 0, 2)
fit = lm(y ~ x1 + x2 + x3)
yhat = predict(fit)
summary(fit)
ehat = residuals(fit)    # raw residuals
estand = rstandard(fit)  # standardized residuals
estud = rstudent(fit)    # studentized residuals
plot(yhat, ehat, pch = 20)
points(yhat, estand, pch = 20, col = 2)
# points(yhat,estud,pch=19,col=3)
install.packages(c("formatR", "gamlss.data", "tidyverse", "ggplot2",
"GGally", "Matrix", "nortest", "lmtest", "wordcloud2", "tm"))