Munich, 1999: 3082 observations on 9 variables.
rent
: the net rent per month (in Euro).rentsqm
: the net rent per month per square meter (in
Euro).area
: living area in square meters.yearc
: year of construction.location
: quality of location: a factor indicating
whether the location is average location, 1, good location, 2, and top
location, 3.bath
: quality of bathroom: a a factor indicating
whether the bath facilities are standard, 0, or premium, 1.kitchen
: Quality of kitchen: 0 standard 1 premium.cheating
: central heating: a factor 0 without central
heating, 1 with central heating.district
: District in Munich.More information in Fahrmeir et. al., (2013) page 5.
library(gamlss.data)
library(ggfortify)
`?`(rent99)
mod1 <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
mod2 <- lm(rentsqm ~ area + location + bath + kitchen + cheating, data = rent99)
autoplot(mod1, label.size = 2)
autoplot(mod2, label.size = 2)
Independent pairs \((Y_i, {\bf x}_i)\) for \(i=1,\ldots,n\).
Q: How is this done in
lm
?
Hint: repeat experiment (on \(Y\)), on average how many CIs cover the true \(\beta_j\)?
summary
from lm
.fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
summary(fit)
##
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating,
## data = rent99)
##
## Residuals:
## Min 1Q Median 3Q Max
## -633.41 -89.17 -6.26 82.96 1000.76
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.9733 11.6549 -1.885 0.0595 .
## area 4.5788 0.1143 40.055 < 2e-16 ***
## location2 39.2602 5.4471 7.208 7.14e-13 ***
## location3 126.0575 16.8747 7.470 1.04e-13 ***
## bath1 74.0538 11.2087 6.607 4.61e-11 ***
## kitchen1 120.4349 13.0192 9.251 < 2e-16 ***
## cheating1 161.4138 8.6632 18.632 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared: 0.4504, Adjusted R-squared: 0.4494
## F-statistic: 420 on 6 and 3075 DF, p-value: < 2.2e-16
anova
from lm
.anova(fit)
## Analysis of Variance Table
##
## Response: rent
## Df Sum Sq Mean Sq F value Pr(>F)
## area 1 40299098 40299098 1911.765 < 2.2e-16 ***
## location 2 1635047 817524 38.783 < 2.2e-16 ***
## bath 1 1676825 1676825 79.547 < 2.2e-16 ***
## kitchen 1 2196952 2196952 104.222 < 2.2e-16 ***
## cheating 1 7317894 7317894 347.156 < 2.2e-16 ***
## Residuals 3075 64819547 21080
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary
and anova
on a fitted lm
give different test statistics and different \(p\)-values listed for each covariate. And,
why is summary
listing location2
and
location3
while anova
is listing
location
?Optional: Maybe test out Anova
in library
car
with type 3 ANOVA to compare?
Consider a MLR model \(A\) and a submodel \(B\) (all parameters in \(B\) are in \(A\) also). We say that \(B\) is nested within \(A\). Assume that regression parameters are estimated using maximum likelihood.
Why is the following true: the likelihood for model \(A\) will always be larger or equal to the likelihood for model \(B\).
How do we define the deviance of model \(A\)? What is a saturated model in our MLR setting? What does our finding in 5. imply for the deviance (can the deviance both be positive and negative)?
We have studied the data set with income, place and gender - with focus on dummy variable coding (with different reference levels) and effect coding and the interpretation of parameter estimates. Now we continue with the same data set, but with focus on hypothesis testing (linear hypotheses) and analysis of variance decomposition.
income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female",
"Male")), place = factor(place, levels = c("A", "B", "C")))
model = lm(income~place-1,data=data,x=TRUE)
. Here
x=TRUE
tells the function to calculate the design matrix X,
which is stored as model$x
.model = lm(income ~ place - 1, data = data, x = TRUE)
model$x
## placeA placeB placeC
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 0 1 0
## 6 0 1 0
## 7 0 1 0
## 8 0 1 0
## 9 0 0 1
## 10 0 0 1
## 11 0 0 1
## 12 0 0 1
## 13 1 0 0
## 14 1 0 0
## 15 1 0 0
## 16 1 0 0
## 17 0 1 0
## 18 0 1 0
## 19 0 1 0
## 20 0 1 0
## 21 0 0 1
## 22 0 0 1
## 23 0 0 1
## 24 0 0 1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$place
## [1] "contr.treatment"
Examine the results with summary
and
anova
.
What parametrization is used?
What is the interpretation of the parameters?/span>
Which null hypothesis is tested in the anova-call?
What is the result of the hypothesis test?
summary(model)
##
## Call:
## lm(formula = income ~ place - 1, data = data, x = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.375 -22.500 -5.625 23.750 45.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## placeA 326.875 9.733 33.58 <2e-16 ***
## placeB 374.375 9.733 38.46 <2e-16 ***
## placeC 391.875 9.733 40.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.53 on 21 degrees of freedom
## Multiple R-squared: 0.9951, Adjusted R-squared: 0.9944
## F-statistic: 1409 on 3 and 21 DF, p-value: < 2.2e-16
anova(model)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## place 3 3204559 1068186 1409.4 < 2.2e-16 ***
## Residuals 21 15916 758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model1 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.treatment"))
head(model1$x)
## (Intercept) placeB placeC
## 1 1 0 0
## 2 1 0 0
## 3 1 0 0
## 4 1 0 0
## 5 1 1 0
## 6 1 1 0
summary(model1)
##
## Call:
## lm(formula = income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.375 -22.500 -5.625 23.750 45.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 326.875 9.733 33.583 < 2e-16 ***
## placeB 47.500 13.765 3.451 0.002394 **
## placeC 65.000 13.765 4.722 0.000116 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.53 on 21 degrees of freedom
## Multiple R-squared: 0.5321, Adjusted R-squared: 0.4875
## F-statistic: 11.94 on 2 and 21 DF, p-value: 0.000344
model2 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.sum"))
head(model2$x)
## (Intercept) place1 place2
## 1 1 1 0
## 2 1 1 0
## 3 1 1 0
## 4 1 1 0
## 5 1 0 1
## 6 1 0 1
summary(model2)
##
## Call:
## lm(formula = income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.375 -22.500 -5.625 23.750 45.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 364.375 5.619 64.841 < 2e-16 ***
## place1 -37.500 7.947 -4.719 0.000117 ***
## place2 10.000 7.947 1.258 0.222090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.53 on 21 degrees of freedom
## Multiple R-squared: 0.5321, Adjusted R-squared: 0.4875
## F-statistic: 11.94 on 2 and 21 DF, p-value: 0.000344
We have talked about dummy- and effect encoding of categorical covariates.
What are the parametrizations used here? What
is the interpretation of the parameters and how do the parameter
interpretations differ between model1
and
model2
?
Compare the results from the two coding strategies.
model0 = lm(income ~ 1, data = data)
anova(model0, model1)
## Analysis of Variance Table
##
## Model 1: income ~ 1
## Model 2: income ~ place
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 34016
## 2 21 15916 2 18100 11.941 0.000344 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model0, model2)
## Analysis of Variance Table
##
## Model 1: income ~ 1
## Model 2: income ~ place
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 34016
## 2 21 15916 2 18100 11.941 0.000344 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
place
and
gender
.model3 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.treatment",
gender = "contr.treatment"))
summary(model3)
##
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE,
## contrasts = list(place = "contr.treatment", gender = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.500 -6.250 0.000 9.687 25.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 306.250 6.896 44.411 < 2e-16 ***
## placeB 47.500 8.446 5.624 1.67e-05 ***
## placeC 65.000 8.446 7.696 2.11e-07 ***
## genderMale 41.250 6.896 5.982 7.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared: 0.8322, Adjusted R-squared: 0.8071
## F-statistic: 33.07 on 3 and 20 DF, p-value: 6.012e-08
anova(model3)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## place 2 18100.0 9050.0 31.720 6.260e-07 ***
## gender 1 10209.4 10209.4 35.783 7.537e-06 ***
## Residuals 20 5706.2 285.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model4 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.sum",
gender = "contr.sum"))
summary(model4)
##
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE,
## contrasts = list(place = "contr.sum", gender = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.500 -6.250 0.000 9.687 25.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 364.375 3.448 105.680 < 2e-16 ***
## place1 -37.500 4.876 -7.691 2.13e-07 ***
## place2 10.000 4.876 2.051 0.0536 .
## gender1 -20.625 3.448 -5.982 7.54e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared: 0.8322, Adjusted R-squared: 0.8071
## F-statistic: 33.07 on 3 and 20 DF, p-value: 6.012e-08
anova(model4)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## place 2 18100.0 9050.0 31.720 6.260e-07 ***
## gender 1 10209.4 10209.4 35.783 7.537e-06 ***
## Residuals 20 5706.2 285.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
What are the parameterizations? What is the interpretation of the parameters? Does the ANOVA table look different for the two parametrizations?
Hint: orthogonality of design matrix for this balanced design?
model5 = lm(income ~ place * gender, data = data, x = TRUE, contrasts = list(place = "contr.treatment",
gender = "contr.treatment"))
summary(model5)
##
## Call:
## lm(formula = income ~ place * gender, data = data, x = TRUE,
## contrasts = list(place = "contr.treatment", gender = "contr.treatment"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.000 -5.938 1.250 11.250 25.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 308.750 8.824 34.989 < 2e-16 ***
## placeB 45.000 12.479 3.606 0.002020 **
## placeC 60.000 12.479 4.808 0.000141 ***
## genderMale 36.250 12.479 2.905 0.009446 **
## placeB:genderMale 5.000 17.648 0.283 0.780168
## placeC:genderMale 10.000 17.648 0.567 0.577963
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 18 degrees of freedom
## Multiple R-squared: 0.8352, Adjusted R-squared: 0.7894
## F-statistic: 18.24 on 5 and 18 DF, p-value: 1.74e-06
anova(model5)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## place 2 18100.0 9050.0 29.0569 2.314e-06 ***
## gender 1 10209.4 10209.4 32.7793 1.988e-05 ***
## place:gender 2 100.0 50.0 0.1605 0.8529
## Residuals 18 5606.2 311.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model6 = lm(income ~ place * gender, data = data, x = TRUE, contrasts = list(place = "contr.sum",
gender = "contr.sum"))
summary(model6)
##
## Call:
## lm(formula = income ~ place * gender, data = data, x = TRUE,
## contrasts = list(place = "contr.sum", gender = "contr.sum"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.000 -5.938 1.250 11.250 25.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.644e+02 3.602e+00 101.147 < 2e-16 ***
## place1 -3.750e+01 5.095e+00 -7.361 7.86e-07 ***
## place2 1.000e+01 5.095e+00 1.963 0.0653 .
## gender1 -2.062e+01 3.602e+00 -5.725 1.99e-05 ***
## place1:gender1 2.500e+00 5.095e+00 0.491 0.6296
## place2:gender1 1.743e-14 5.095e+00 0.000 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 18 degrees of freedom
## Multiple R-squared: 0.8352, Adjusted R-squared: 0.7894
## F-statistic: 18.24 on 5 and 18 DF, p-value: 1.74e-06
anova(model6)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## place 2 18100.0 9050.0 29.0569 2.314e-06 ***
## gender 1 10209.4 10209.4 32.7793 1.988e-05 ***
## place:gender 2 100.0 50.0 0.1605 0.8529
## Residuals 18 5606.2 311.5
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Introduction to the first compulsory exercise by TA Ingeborg - and an
introduction to packages and classes in R
.
The exercise: https://www.math.ntnu.no/emner/TMA4315/2018h/project1_h18.html
Packages: ggplot2
, gamlss.data
, and so on.
Some are already loaded when R
starts (like
stats
), others must be loaded (like MASS
).
You are going to make your own package, called mylm
,
which performs multiple linear regression and is a smaller version of
lm
.
Show how to create package in R Studio
Classes in R: Something we do not have to think much about, but we use all the time. We are now going to make a new class in R, that we call “test”.
# takes a word, and returns the index in the alphabet of each
# letter in an object with class 'test'
test <- function(word) {
x <- 1:nchar(word)
y <- match(c(strsplit(tolower(word), "")[[1]]), letters[1:26])
res <- list(x = x, y = y, word = word) # if you are not familiar with lists, you should read up on this
class(res) <- "test"
return(res)
}
Now we make an object of this class and try to look at it.
myname <- test("Ingeborg")
# 'print(myname)' and 'myname' returns the same thing in a script,
# so to simplify we just write 'myname' here
myname # prints everything
## $x
## [1] 1 2 3 4 5 6 7 8
##
## $y
## [1] 9 14 7 5 2 15 18 7
##
## $word
## [1] "Ingeborg"
##
## attr(,"class")
## [1] "test"
# lets make a print function that only prints the word
print.test <- function(obj) cat(obj$word)
myname # and now we get only the name
## Ingeborg
Now we want a function that plots objects of this class in a particular way.
# important that it is called plot.test with .test at the end!!!
plot.test <- function(obj) plot(obj$x, obj$y, xlab = "Letter", ylab = "Index",
main = obj$word, col = rainbow(length(obj$x)), pch = 19, cex = 2)
plot(myname) # we do not have to specify that this is plot.test, because 'myname' is already of class 'test'!
# And a summary function
summary.test <- function(obj) {
cat("Word: ", obj$word, "\n")
cat("Length of word: ", length(obj$x), " letters\n")
cat("Occurrence of each letter:")
print(table(strsplit(tolower(obj$word), "")))
}
summary(myname)
## Word: Ingeborg
## Length of word: 8 letters
## Occurrence of each letter:
## b e g i n o r
## 1 1 2 1 1 1 1
Now we have made a class with a plot, print and summary function, and this is what you do in the exercise! But a bit more advanced…
Let us look at what happens when we use the plotting function on
objects with different classes: The function called plot
.
First we make two new objects that can be plotted:
data <- data.frame(x = rnorm(10), y = rnorm(10))
mod <- lm(y ~ x, data = data)
And then we plot them:
plot(data)
plot(mod)
plot(myname)
What is happening? R
reads the class of the objects and
uses the plot-function made for that specific class. The user does not
have to specify the class as this is already stored in the object!
The different objects we have declared earlier have the following classes:
class(data)
class(mod)
class(myname)
## [1] "data.frame"
## [1] "lm"
## [1] "test"
You will make a new class in R
called mylm
,
and R
will also then understand which plot-function to use
based on the class.
Note that an object can have more than one class.
You write the report using R Markdown, use this template: https://www.math.ntnu.no/emner/TMA4315/2018h/template_glm.Rmd
Discuss with the group to get a feeling on what to do in the exercise.
mylm
function. Which formulas are used
to
print.mylm
, plot.mylm
and
summary.mylm
. What should these
functions contain?model.frame
,
model.matrix
and model.response
.Last week all groups decided on using rent
or
rentsqm
as response, and in short - there was not really a
big difference. So, now use rent
as the response.
location
as dummy or effect coding). What about year of
construction - is that a linear covariate? Maybe you want to make
intervals in time instead? Linear or categorical for the time? What
about the district
? We leave that since we have not talked
about how to use spatial covariates.Hint: if you want to test out interval versions of year of
construction the function mutate
(from dplyr
)
is useful:
rent99 <- rent99 %>%
mutate(yearc.cat = cut(yearc, breaks = c(-Inf, seq(1920, 2000, 10)),
labels = 10 * 1:9))
More on dplyr
: Tutorial: http://genomicsclass.github.io/book/pages/dplyr_tutorial.html
and Cheat sheet (data wrangling): https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
and dplyr in particular: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf
regsubsets
function from library leaps
. You
may define x
from model.matrix(fit)[,-1]
(not
including the intercept term), and then run
best=regsubsets(x=model.matrix(fit)[,-1],y=rent99$rent)
and
look at summary(best)
. Explain the print-out (with all the
stars). Using the Mallows Cp (named cp
in the list from
summary(best)
) will give the same result at using AIC
(which is not available in this function).What is your preferred model? Are there other models worth considering?
Hint: look at the R-code in Problem 2 (Figure 3) from the TMA4267V2017 exam: pdf, and maybe the solutions for the interprtation pdf
stepAIC
. Do you get the same model choice as with best
subset selection based on AIC? Why, or why not?Run the following code to make the wordcloud. The code can not be run
by knit
for a pdf because of how the graphics are made - in
that case you need to run it and then you need to save the resulting
figure as a file (I choose png: the code to do this has been commented
out). Maybe you want to run the code on another document? Please mail us
if you do cool stuff for others to see!
library(wordcloud2)
library(tm)
all = scan("https://www.math.ntnu.no/emner/TMA4315/2018h/2MLR.Rmd", what = "s")
corpus = Corpus(VectorSource(all))
corpus[[1]][1]
## $content
## [1] "---"
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removeWords, c("---", "bf", "boldsymbol", "will",
"include", "use", "can", "follow", "provide", "using"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
# corpus=tm_map(corpus,stemDocument)
tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
dim(d)
## [1] 1969 2
d[1:10, ]
## word freq
## model model 206
## beta beta 81
## regression regression 75
## linear linear 57
## data data 54
## likelihood likelihood 52
## test test 52
## residuals residuals 46
## covariates covariates 45
## matrix matrix 43
# png('M2wordcloud.png') # send graphics output to a png file
wordcloud2(d, shape = "cardioid", maxRotation = pi/10, minRotation = -pi/10)
# dev.off() # stop sending graphics output to a png file
install.packages(c("formatR", "gamlss.data", "tidyverse", "ggplot2",
"GGally", "Matrix", "nortest", "lmtest", "wordcloud2", "tm"))