Interactive lectures- problem set second week

What to remember from the first week?

Munich rent index

Munich, 1999: 3082 observations on 9 variables.

rent: the net rent per month (in Euro).
rentsqm: the net rent per month per square meter (in Euro).
area: living area in square meters.
yearc: year of construction.
location: quality of location: a factor indicating whether the location is average location, 1, good location, 2, and top location, 3.
bath: quality of bathroom: a a factor indicating whether the bath facilities are standard, 0, or premium, 1.
kitchen: Quality of kitchen: 0 standard 1 premium.
cheating: central heating: a factor 0 without central heating, 1 with central heating.
district: District in Munich.

More information in Fahrmeir et. al., (2013) page 5.

library(gamlss.data)
library(ggfortify)
`?`(rent99)

mod1 <- lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
mod2 <- lm(rentsqm ~ area + location + bath + kitchen + cheating, data = rent99)
autoplot(mod1, label.size = 2)

autoplot(mod2, label.size = 2)

The GLM way

Independent pairs $(Y_i, {\bf x}_i)$ for $i=1,\ldots,n$.

Random component: $Y_i \sim N$ with $\text{E}(Y_i)=\mu_i$ and $\text{Var}(Y_i)=\sigma^2$.
Systematic component: $\eta_i={\bf x}_i^T \boldsymbol{\beta}$.
Link function: linking the random and systematic component (linear predictor): Identity link and response function. $\mu_i=\eta_i$.

Likelihood, loglikelihood, score function, observed and expected Fisher information matrix

Likelihood $L(\beta)=\prod_{i=1}^n f(y_i; \beta)$.
Loglikelihood $l(\beta)=\ln L(\beta)$.
Score function $s(\beta)=\frac{\partial l(\beta)}{\partial \beta}$. Find ML estimates by solving $s(\hat{\boldsymbol \beta})={\bf 0}$.
Observed $H(\boldsymbol{\beta}) = -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T}$ and expected Fisher information $F(\beta) =\text{E}(H(\boldsymbol{\beta}))$

Q: How is this done in lm?

Interactive tasks for the second week

Problem 1: Theory

What is the interpretation of a 95% confidence interval?

Hint: repeat experiment (on $Y$), on average how many CIs cover the true $\beta_j$?

Explain in words and with formulas the $p$-values printed in a summary from lm.

fit = lm(rent ~ area + location + bath + kitchen + cheating, data = rent99)
summary(fit)

## 
## Call:
## lm(formula = rent ~ area + location + bath + kitchen + cheating, 
##     data = rent99)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -633.41  -89.17   -6.26   82.96 1000.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -21.9733    11.6549  -1.885   0.0595 .  
## area          4.5788     0.1143  40.055  < 2e-16 ***
## location2    39.2602     5.4471   7.208 7.14e-13 ***
## location3   126.0575    16.8747   7.470 1.04e-13 ***
## bath1        74.0538    11.2087   6.607 4.61e-11 ***
## kitchen1    120.4349    13.0192   9.251  < 2e-16 ***
## cheating1   161.4138     8.6632  18.632  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 145.2 on 3075 degrees of freedom
## Multiple R-squared:  0.4504, Adjusted R-squared:  0.4494 
## F-statistic:   420 on 6 and 3075 DF,  p-value: < 2.2e-16

Explain in words and with formulas the full output (with $p$-values) printed in an anova from lm.

anova(fit)

## Analysis of Variance Table
## 
## Response: rent
##             Df   Sum Sq  Mean Sq  F value    Pr(>F)    
## area         1 40299098 40299098 1911.765 < 2.2e-16 ***
## location     2  1635047   817524   38.783 < 2.2e-16 ***
## bath         1  1676825  1676825   79.547 < 2.2e-16 ***
## kitchen      1  2196952  2196952  104.222 < 2.2e-16 ***
## cheating     1  7317894  7317894  347.156 < 2.2e-16 ***
## Residuals 3075 64819547    21080                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In particular: why does using summary and anova on a fitted lm give different test statistics and different $p$-values listed for each covariate. And, why is summary listing location2 and location3 while anova is listing location?

Optional: Maybe test out Anova in library car with type 3 ANOVA to compare?

Consider a MLR model $A$ and a submodel $B$ (all parameters in $B$ are in $A$ also). We say that $B$ is nested within $A$. Assume that regression parameters are estimated using maximum likelihood.

Why is the following true: the likelihood for model $A$ will always be larger or equal to the likelihood for model $B$.
How do we define the deviance of model $A$? What is a saturated model in our MLR setting? What does our finding in 5. imply for the deviance (can the deviance both be positive and negative)?

Problem 2: Dummy vs. effect coding in MLR (continued)

We have studied the data set with income, place and gender - with focus on dummy variable coding (with different reference levels) and effect coding and the interpretation of parameter estimates. Now we continue with the same data set, but with focus on hypothesis testing (linear hypotheses) and analysis of variance decomposition.

Previously, we have read in the data and fitted linear models - look back to see what we found.

income <- c(300, 350, 370, 360, 400, 370, 420, 390, 400, 430, 420, 410,
    300, 320, 310, 305, 350, 370, 340, 355, 370, 380, 360, 365)
gender <- c(rep("Male", 12), rep("Female", 12))
place <- rep(c(rep("A", 4), rep("B", 4), rep("C", 4)), 2)
data <- data.frame(income, gender = factor(gender, levels = c("Female",
    "Male")), place = factor(place, levels = c("A", "B", "C")))

Fit the following model model = lm(income~place-1,data=data,x=TRUE). Here x=TRUE tells the function to calculate the design matrix X, which is stored as model$x.

model = lm(income ~ place - 1, data = data, x = TRUE)
model$x

##    placeA placeB placeC
## 1       1      0      0
## 2       1      0      0
## 3       1      0      0
## 4       1      0      0
## 5       0      1      0
## 6       0      1      0
## 7       0      1      0
## 8       0      1      0
## 9       0      0      1
## 10      0      0      1
## 11      0      0      1
## 12      0      0      1
## 13      1      0      0
## 14      1      0      0
## 15      1      0      0
## 16      1      0      0
## 17      0      1      0
## 18      0      1      0
## 19      0      1      0
## 20      0      1      0
## 21      0      0      1
## 22      0      0      1
## 23      0      0      1
## 24      0      0      1
## attr(,"assign")
## [1] 1 1 1
## attr(,"contrasts")
## attr(,"contrasts")$place
## [1] "contr.treatment"

Examine the results with summary and anova.

What parametrization is used?

What is the interpretation of the parameters?/span>

Which null hypothesis is tested in the anova-call?

What is the result of the hypothesis test?

summary(model)

## 
## Call:
## lm(formula = income ~ place - 1, data = data, x = TRUE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.375 -22.500  -5.625  23.750  45.625 
## 
## Coefficients:
##        Estimate Std. Error t value Pr(>|t|)    
## placeA  326.875      9.733   33.58   <2e-16 ***
## placeB  374.375      9.733   38.46   <2e-16 ***
## placeC  391.875      9.733   40.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.53 on 21 degrees of freedom
## Multiple R-squared:  0.9951, Adjusted R-squared:  0.9944 
## F-statistic:  1409 on 3 and 21 DF,  p-value: < 2.2e-16

anova(model)

## Analysis of Variance Table
## 
## Response: income
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## place      3 3204559 1068186  1409.4 < 2.2e-16 ***
## Residuals 21   15916     758                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Fit the models:

model1 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.treatment"))
head(model1$x)

##   (Intercept) placeB placeC
## 1           1      0      0
## 2           1      0      0
## 3           1      0      0
## 4           1      0      0
## 5           1      1      0
## 6           1      1      0

summary(model1)

## 
## Call:
## lm(formula = income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.375 -22.500  -5.625  23.750  45.625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  326.875      9.733  33.583  < 2e-16 ***
## placeB        47.500     13.765   3.451 0.002394 ** 
## placeC        65.000     13.765   4.722 0.000116 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.53 on 21 degrees of freedom
## Multiple R-squared:  0.5321, Adjusted R-squared:  0.4875 
## F-statistic: 11.94 on 2 and 21 DF,  p-value: 0.000344

model2 = lm(income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.sum"))
head(model2$x)

##   (Intercept) place1 place2
## 1           1      1      0
## 2           1      1      0
## 3           1      1      0
## 4           1      1      0
## 5           1      0      1
## 6           1      0      1

summary(model2)

## 
## Call:
## lm(formula = income ~ place, data = data, x = TRUE, contrasts = list(place = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.375 -22.500  -5.625  23.750  45.625 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  364.375      5.619  64.841  < 2e-16 ***
## place1       -37.500      7.947  -4.719 0.000117 ***
## place2        10.000      7.947   1.258 0.222090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27.53 on 21 degrees of freedom
## Multiple R-squared:  0.5321, Adjusted R-squared:  0.4875 
## F-statistic: 11.94 on 2 and 21 DF,  p-value: 0.000344

We have talked about dummy- and effect encoding of categorical covariates.

What are the parametrizations used here? What is the interpretation of the parameters and how do the parameter interpretations differ between model1 and model2?

We want to test the (one-way ANOVA) null hypothesis that there is no effect of place. Use the $F_{obs}$ to do this both using the dummy-variable and the effect coding of the place-factor.

Compare the results from the two coding strategies.

model0 = lm(income ~ 1, data = data)
anova(model0, model1)

## Analysis of Variance Table
## 
## Model 1: income ~ 1
## Model 2: income ~ place
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
## 1     23 34016                                 
## 2     21 15916  2     18100 11.941 0.000344 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model0, model2)

## Analysis of Variance Table
## 
## Model 1: income ~ 1
## Model 2: income ~ place
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)    
## 1     23 34016                                 
## 2     21 15916  2     18100 11.941 0.000344 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Suppose now that there are two factors placeand gender.

model3 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.treatment",
    gender = "contr.treatment"))
summary(model3)

## 
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE, 
##     contrasts = list(place = "contr.treatment", gender = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.500  -6.250   0.000   9.687  25.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  306.250      6.896  44.411  < 2e-16 ***
## placeB        47.500      8.446   5.624 1.67e-05 ***
## placeC        65.000      8.446   7.696 2.11e-07 ***
## genderMale    41.250      6.896   5.982 7.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared:  0.8322, Adjusted R-squared:  0.8071 
## F-statistic: 33.07 on 3 and 20 DF,  p-value: 6.012e-08

anova(model3)

## Analysis of Variance Table
## 
## Response: income
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## place      2 18100.0  9050.0  31.720 6.260e-07 ***
## gender     1 10209.4 10209.4  35.783 7.537e-06 ***
## Residuals 20  5706.2   285.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model4 = lm(income ~ place + gender, data = data, x = TRUE, contrasts = list(place = "contr.sum",
    gender = "contr.sum"))
summary(model4)

## 
## Call:
## lm(formula = income ~ place + gender, data = data, x = TRUE, 
##     contrasts = list(place = "contr.sum", gender = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.500  -6.250   0.000   9.687  25.000 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  364.375      3.448 105.680  < 2e-16 ***
## place1       -37.500      4.876  -7.691 2.13e-07 ***
## place2        10.000      4.876   2.051   0.0536 .  
## gender1      -20.625      3.448  -5.982 7.54e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.89 on 20 degrees of freedom
## Multiple R-squared:  0.8322, Adjusted R-squared:  0.8071 
## F-statistic: 33.07 on 3 and 20 DF,  p-value: 6.012e-08

anova(model4)

## Analysis of Variance Table
## 
## Response: income
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## place      2 18100.0  9050.0  31.720 6.260e-07 ***
## gender     1 10209.4 10209.4  35.783 7.537e-06 ***
## Residuals 20  5706.2   285.3                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

What are the parameterizations? What is the interpretation of the parameters? Does the ANOVA table look different for the two parametrizations?

Hint: orthogonality of design matrix for this balanced design?

Finally, fit a model with interactions (model formula is place*gender for both the contrasts and check if the interaction effect is significant.

model5 = lm(income ~ place * gender, data = data, x = TRUE, contrasts = list(place = "contr.treatment",
    gender = "contr.treatment"))
summary(model5)

## 
## Call:
## lm(formula = income ~ place * gender, data = data, x = TRUE, 
##     contrasts = list(place = "contr.treatment", gender = "contr.treatment"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.000  -5.938   1.250  11.250  25.000 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        308.750      8.824  34.989  < 2e-16 ***
## placeB              45.000     12.479   3.606 0.002020 ** 
## placeC              60.000     12.479   4.808 0.000141 ***
## genderMale          36.250     12.479   2.905 0.009446 ** 
## placeB:genderMale    5.000     17.648   0.283 0.780168    
## placeC:genderMale   10.000     17.648   0.567 0.577963    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 18 degrees of freedom
## Multiple R-squared:  0.8352, Adjusted R-squared:  0.7894 
## F-statistic: 18.24 on 5 and 18 DF,  p-value: 1.74e-06

anova(model5)

## Analysis of Variance Table
## 
## Response: income
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## place         2 18100.0  9050.0 29.0569 2.314e-06 ***
## gender        1 10209.4 10209.4 32.7793 1.988e-05 ***
## place:gender  2   100.0    50.0  0.1605    0.8529    
## Residuals    18  5606.2   311.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

model6 = lm(income ~ place * gender, data = data, x = TRUE, contrasts = list(place = "contr.sum",
    gender = "contr.sum"))
summary(model6)

## 
## Call:
## lm(formula = income ~ place * gender, data = data, x = TRUE, 
##     contrasts = list(place = "contr.sum", gender = "contr.sum"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.000  -5.938   1.250  11.250  25.000 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     3.644e+02  3.602e+00 101.147  < 2e-16 ***
## place1         -3.750e+01  5.095e+00  -7.361 7.86e-07 ***
## place2          1.000e+01  5.095e+00   1.963   0.0653 .  
## gender1        -2.062e+01  3.602e+00  -5.725 1.99e-05 ***
## place1:gender1  2.500e+00  5.095e+00   0.491   0.6296    
## place2:gender1  1.743e-14  5.095e+00   0.000   1.0000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 18 degrees of freedom
## Multiple R-squared:  0.8352, Adjusted R-squared:  0.7894 
## F-statistic: 18.24 on 5 and 18 DF,  p-value: 1.74e-06

anova(model6)

## Analysis of Variance Table
## 
## Response: income
##              Df  Sum Sq Mean Sq F value    Pr(>F)    
## place         2 18100.0  9050.0 29.0569 2.314e-06 ***
## gender        1 10209.4 10209.4 32.7793 1.988e-05 ***
## place:gender  2   100.0    50.0  0.1605    0.8529    
## Residuals    18  5606.2   311.5                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Problem 3: Compulsory exercise 1

Introduction to the first compulsory exercise by TA Ingeborg - and an introduction to packages and classes in R.

The exercise: https://www.math.ntnu.no/emner/TMA4315/2018h/project1_h18.html

Packages: ggplot2, gamlss.data, and so on. Some are already loaded when R starts (like stats), others must be loaded (like MASS).

You are going to make your own package, called mylm, which performs multiple linear regression and is a smaller version of lm.

Show how to create package in R Studio

Classes in R: Something we do not have to think much about, but we use all the time. We are now going to make a new class in R, that we call “test”.

# takes a word, and returns the index in the alphabet of each
# letter in an object with class 'test'
test <- function(word) {

    x <- 1:nchar(word)
    y <- match(c(strsplit(tolower(word), "")[[1]]), letters[1:26])

    res <- list(x = x, y = y, word = word)  # if you are not familiar with lists, you should read up on this
    class(res) <- "test"

    return(res)

}

Now we make an object of this class and try to look at it.

myname <- test("Ingeborg")
# 'print(myname)' and 'myname' returns the same thing in a script,
# so to simplify we just write 'myname' here
myname  # prints everything

## $x
## [1] 1 2 3 4 5 6 7 8
## 
## $y
## [1]  9 14  7  5  2 15 18  7
## 
## $word
## [1] "Ingeborg"
## 
## attr(,"class")
## [1] "test"

# lets make a print function that only prints the word
print.test <- function(obj) cat(obj$word)

myname  # and now we get only the name

## Ingeborg

Now we want a function that plots objects of this class in a particular way.

# important that it is called plot.test with .test at the end!!!
plot.test <- function(obj) plot(obj$x, obj$y, xlab = "Letter", ylab = "Index",
    main = obj$word, col = rainbow(length(obj$x)), pch = 19, cex = 2)

plot(myname)  # we do not have to specify that this is plot.test, because 'myname' is already of class 'test'!

# And a summary function
summary.test <- function(obj) {

    cat("Word: ", obj$word, "\n")
    cat("Length of word: ", length(obj$x), " letters\n")
    cat("Occurrence of each letter:")
    print(table(strsplit(tolower(obj$word), "")))

}

summary(myname)

## Word:  Ingeborg 
## Length of word:  8  letters
## Occurrence of each letter:
## b e g i n o r 
## 1 1 2 1 1 1 1

Now we have made a class with a plot, print and summary function, and this is what you do in the exercise! But a bit more advanced…

Let us look at what happens when we use the plotting function on objects with different classes: The function called plot. First we make two new objects that can be plotted:

data <- data.frame(x = rnorm(10), y = rnorm(10))
mod <- lm(y ~ x, data = data)

And then we plot them:

plot(data)

plot(mod)

plot(myname)

What is happening? R reads the class of the objects and uses the plot-function made for that specific class. The user does not have to specify the class as this is already stored in the object!

The different objects we have declared earlier have the following classes:

class(data)
class(mod)
class(myname)

## [1] "data.frame"
## [1] "lm"
## [1] "test"

You will make a new class in R called mylm, and R will also then understand which plot-function to use based on the class.

Note that an object can have more than one class.

You write the report using R Markdown, use this template: https://www.math.ntnu.no/emner/TMA4315/2018h/template_glm.Rmd

Exercises:

Discuss with the group to get a feeling on what to do in the exercise.

Go through how to make an R package together in the group, and make the mylm-package.
The core is the mylm function. Which formulas are used to
- calculate parameters estimates?
- calculate covariance matrix of the estimated regression coefficients?
- perform type 3 hypothesis tests (remember you need to do the asymptotic normal - so no t-distributions)?
You will make print.mylm, plot.mylm and summary.mylm. What should these functions contain?
Look at the mylm-template (https://www.math.ntnu.no/emner/TMA4315/2018h/mylm.R) and see if you understand it, or if you have questions about some of the parts. In particular, explore the functions model.frame, model.matrix and model.response.

Problem 4: Munich Rent index (optional)

Last week all groups decided on using rent or rentsqm as response, and in short - there was not really a big difference. So, now use rent as the response.

We now want to use model selection to arrive at a good model. Start by defining which covariates you want to include and how to code them (location as dummy or effect coding). What about year of construction - is that a linear covariate? Maybe you want to make intervals in time instead? Linear or categorical for the time? What about the district? We leave that since we have not talked about how to use spatial covariates.

Hint: if you want to test out interval versions of year of construction the function mutate (from dplyr) is useful:

rent99 <- rent99 %>%
    mutate(yearc.cat = cut(yearc, breaks = c(-Inf, seq(1920, 2000, 10)),
        labels = 10 * 1:9))

More on dplyr: Tutorial: http://genomicsclass.github.io/book/pages/dplyr_tutorial.html and Cheat sheet (data wrangling): https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf and dplyr in particular: https://github.com/rstudio/cheatsheets/raw/master/source/pdfs/data-transformation-cheatsheet.pdf

There are many ways to perform model selection in MLR. One possibility is best subsets, which can be done using the regsubsets function from library leaps. You may define x from model.matrix(fit)[,-1] (not including the intercept term), and then run best=regsubsets(x=model.matrix(fit)[,-1],y=rent99$rent) and look at summary(best). Explain the print-out (with all the stars). Using the Mallows Cp (named cp in the list from summary(best)) will give the same result at using AIC (which is not available in this function).

What is your preferred model? Are there other models worth considering?

Hint: look at the R-code in Problem 2 (Figure 3) from the TMA4267V2017 exam: pdf, and maybe the solutions for the interprtation pdf

Check what is done if you use stepAIC. Do you get the same model choice as with best subset selection based on AIC? Why, or why not?

Wordclouds are cool?

Run the following code to make the wordcloud. The code can not be run by knit for a pdf because of how the graphics are made - in that case you need to run it and then you need to save the resulting figure as a file (I choose png: the code to do this has been commented out). Maybe you want to run the code on another document? Please mail us if you do cool stuff for others to see!

library(wordcloud2)
library(tm)
all = scan("https://www.math.ntnu.no/emner/TMA4315/2018h/2MLR.Rmd", what = "s")

corpus = Corpus(VectorSource(all))
corpus[[1]][1]

## $content
## [1] "---"

corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, removeWords, c("---", "bf", "boldsymbol", "will",
    "include", "use", "can", "follow", "provide", "using"))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
# corpus=tm_map(corpus,stemDocument)

tdm = TermDocumentMatrix(corpus)
m = as.matrix(tdm)
v = sort(rowSums(m), decreasing = TRUE)
d = data.frame(word = names(v), freq = v)
dim(d)

## [1] 1969    2

d[1:10, ]

##                  word freq
## model           model  206
## beta             beta   81
## regression regression   75
## linear         linear   57
## data             data   54
## likelihood likelihood   52
## test             test   52
## residuals   residuals   46
## covariates covariates   45
## matrix         matrix   43

# png('M2wordcloud.png') # send graphics output to a png file
wordcloud2(d, shape = "cardioid", maxRotation = pi/10, minRotation = -pi/10)

# dev.off() # stop sending graphics output to a png file

R packages

install.packages(c("formatR", "gamlss.data", "tidyverse", "ggplot2",
    "GGally", "Matrix", "nortest", "lmtest", "wordcloud2", "tm"))

References and further reading

Slightly different presentation (more focus on multivariate normal theory): Slides and written material from TMA4267 Linear Statistical Models in 2017, Part 2: Regression (by Mette Langaas).
And, same source, but now [Slides and written material from TMA4267 Linear Statistical Models in 2017, Part 3: Hypothesis testing and ANOVA] (http://www.math.ntnu.no/emner/TMA4267/2017v/TMA4267V2017Part3.pdf)

TMA4315 Generalized linear models H2018

Module 2: MULTIPLE LINEAR REGRESSION, interactive sessions, Week 2

Mette Langaas, Department of Mathematical Sciences, NTNU – with contributions from Oyvind Bakke and Ingeborg Hem

30.08 and 06.09 [PL], 31.08 and 07.09 [IL]