TMA4315 Generalized Linear Models

Version 22.11.2017:

1. added \({\bf 1}\) to the equation for \({\bf Y}_i\) to get the correct dimension for adding the random term \(\gamma_{0i}\), and added comment to explain why we ask you to calculate \(p\)-value
1. print-out from anova in d) and from the random slope and intercept model i d) were both missing. Comment on the print-out from anova also added (yes, we have talked about likelihood ratio test). End of d) explanation of plots from random slopes and intercepts added. This explanation should also apply to the catepillar plot in c.

In this exercise, we will focus on (generalized) linear mixed models. Each problem (there are 5 problems) has a maximum score of 2 points. Please provide short answers (preferably one sentence) to all questions. We have included bullet points in each problem to make it easier to see what we ask about, but the points are not necessarily weighted equally.

As for the previous two compulsory exercises this exercise should be handed in on Blackboard and please write the names of the group members and the group number you have on Blackboard on top of your document!

Students may work on the assignment in groups of two or three, and are also encouraged to collaborate across groups. Each group needs to hand in an R Markdown document with answers to the questions as well as all source code. Hand in the file as an R Markdown file (.rmd) and a pdf file (.pdf). Include all code you have written for the new functions (or library) in the end of your file (hint: use eval = FALSE in the chunk option).

Guidance by the teaching assistant (Ingeborg) will be given Tuesdays after 9 in her office (Central Building 2, 12th floor, office 1224). You can either just show up, or send an email to Ingeborg at ingeborg.hem@ntnu.no to make sure she is available. And remember that we have a discussion board.

We will use a dataset with clustered data. The dataset consists of 9 variables:

school: 50 schools, with code 1-50
(class: a factor with levels 1, 2, 3, 4)
gender: a factor with levels boy, girl
social: social class of the father, 1 to 9
raven: test score
(id: student id, 1-1402)
(english: english score)
math: math score
(year: year of school)

We will look at a subset of this dataset, using only the first year of school (year = 0). We will use math as response, and group the data by school (and not by class). That means that the variables in parenthesis will not be used in this exercise.

After installing the package faraway, run the following commands to load and modify the dataset into R.

# first install package faraway
# install.packages("faraway")
library(faraway)
data(jsp, package = "faraway")
sjsp <- jsp[jsp$year == 0,]
sjsp <- sjsp[, -c(6,7,9)] # removing class, id, english and year as they are not used

Note that the number of schools in the subset of the dataset is 49, as shcool number 43 has no measurements for year 0.

a)

First we want to get to know the dataset. Use the ggpairs function from GGally on the data, but you should group the data using gender (mapping = aes(col = gender)), and you should only include the covariates social, raven and math (column = vector of columns you want). The legend can be added using legend = 1.

Comment briefly on the plot you have created.

First fit a linear model with math as response, and raven and gender as covariates. Model for the \(k\)th student:

\[ Y_k={\bf X}_k \beta + \varepsilon_k\] where we assume that the \(\varepsilon_k\)s are independent (between students), and have mean 0 and variance \(\sigma^2\) for all students.

Explain what the different parts of this model are called.
Comment briefly on the parameter estimates you have found.
What are we investigating with this model?

b)

We think that the math grades might be different between schools and want to add school as a factor to our model. However, with 49 schools we don’t want to add school as a fixed effect. We will instead fit a random intercept model with school as random intercept. For school \(i\) we study the measurement model:

\[{\bf Y}_{i} = {\bf X}_i\beta + {\bf 1} \gamma_{0i} + \varepsilon_{i}\] where \({\bf 1}\), \({\bf Y}_i\) and \(\varepsilon_i\) have dimension \(n_i\times 1\).

Explain what the different parts of this model are called and what dimensions the model components have.
Write down distributional assumptions for \(\gamma_{0i}\) and \(\varepsilon_{i}\).
What do we assume about the dependency between the responses at school \(i\), \({\bf Y}_i\), and school \(k\), \({\bf Y}_k\)?

Fit the model using lmer with the following code:

# install.packages("lme4") #install the package
library(lme4)
fitRIa <- lmer(math ~ raven + gender + (1 | school), data = sjsp)
summary(fitRIa)

Compare the parameter estimates for raven and gender with the estimates from the linear model in a), and discuss.
How do gender and raven score affect the math scores?
In the print-out from summary(fitRIa) there are no \(p\)-values. Why is this?
Test the null-hypothesis \(H_0: \beta_{\text{raven}}=0\) against \(H_0: \beta_{\text{raven}}\neq 0\) and provide a \(p\)-value for the test. (Yes, we have many observations and believe that we can calculate a \(p\)-value even though the lmer package not by default want to report such a number.)
Also provide a 95% confidence interval for the effect of the female gender on the math score.

c)

We now continue with a random intercept model (school) with only raven as fixed effect (remove gender from our model).

fitRI <- lmer(math ~ raven + (1 | school), data = sjsp)

Write down the mathematical formula for the covariance and correlation between response \(Y_{ij}\) and \(Y_{il}\) from school \(i\).
What is this correlation for our fitted model fitRI? Comment.

We want to study the predicted value for the random intercept for each school.

Write down the mathematical formula for \(\hat{\gamma}_{0i}\) for your random intercept model and explain what the different elements in the formula means.
Explain what each of the six plots produced and displayed below can be used for (that is, why are we asking you to make these plots).
Comment on your findings.

# install.packages("sjPlot") # first install package sjPlot
library(ggplot2)
library(sjPlot)
gg1 <- plot_model(fitRI, type = "re", sort.est = "(Intercept)", y.offset = 0.4, dot.size = 1.5) + theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
gg2 <- sjp.lmer(fitRI, type = "ri.slope", vars = "raven", prnt.plot = FALSE)
gg3 <- sjp.lmer(fitRI, type = "re.qq", prnt.plot = FALSE)
gg4 <- ggplot() + geom_density(aes(x = ranef(fitRI)$school[[1]])) + labs(x = "x", y = "y", title = "Density")
df <- data.frame(fitted = fitted(fitRI), resid = residuals(fitRI, scaled = TRUE))
gg5 <- ggplot(df, aes(fitted,resid)) + geom_point(pch = 21) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(se = FALSE, col = "red", size = 0.5, method = "loess") +
  labs(x = "Fitted values", y = "Residuals", title = "Residuals vs Fitted values")
gg6 <- ggplot(df, aes(sample=resid)) + stat_qq(pch = 19) +
  geom_abline(intercept = 0, slope = 1, linetype = "dotted") +
  labs(x = "Theoretical quantiles", y = "Standardized residuals", title = "Normal Q-Q")

library(ggpubr)
ggarrange(gg1, gg3$plot, gg4, gg5, gg6)

gg2$plot[[1]]

d)

Now include the social status of the father in the model, as a fixed effect.

Compare the model with and without the social status of the father using hypothesis test from the anova below (which is a likelihood ratio test - no, you need not look at the column called deviance since we have not talked about that). Which of the two models do you prefer?
Also comment on the AIC and BIC of the two models (automatically added in the print-out from anova).
Why does the print-out say “refitting model(s) with ML (instead of REML)” (i.e. why do we not want REML when comparing models with the same random terms but with different fixed terms)?

fitRIb <- lmer(math ~ raven + social + (1 | school), data = sjsp)
anova(fitRI, fitRIb)

## Data: sjsp
## Models:
## fitRI: math ~ raven + (1 | school)
## fitRIb: math ~ raven + social + (1 | school)
##        Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
## fitRI   4 7326.3 7346.5 -3659.1   7318.3                            
## fitRIb 12 7320.9 7381.5 -3648.5   7296.9 21.372      8   0.006221 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The last model we want to consider is a model with a random intercept and a random slope for the raven score at each school. It can be fit as follows:

fitRIS <- lmer(math ~ raven + (1 + raven | school), data = sjsp)
summary(fitRIS)

## Linear mixed model fit by REML ['lmerMod']
## Formula: math ~ raven + (1 + raven | school)
##    Data: sjsp
## 
## REML criterion at convergence: 7311.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9967 -0.6428  0.0806  0.6619  2.6632 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  school   (Intercept) 25.5896  5.0586        
##           raven        0.0349  0.1868   -0.95
##  Residual             30.8284  5.5523        
## Number of obs: 1154, groups:  school, 49
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  7.17043    1.07733   6.656
## raven        0.70909    0.04089  17.342
## 
## Correlation of Fixed Effects:
##       (Intr)
## raven -0.965

gg1 <- sjp.lmer(fitRIS, type = "rs.ri", prnt.plot = FALSE)
gg2 <- sjp.lmer(fitRIS, sort.est = "(Intercept)", prnt.plot = FALSE, facet.grid = FALSE, title = names(ranef(fitRIS)$school), geom.size = 1.5)
ggarrange(gg1$plot[[1]], gg2$plot.list[[1]], gg2$plot.list[[2]] + labs(x = ""), ncol = 3, widths = c(2,1,1))

Plots:

Left panel (gg1) gives the fitted model- one line for each school (x-axis: score for “raven test” and yaxis: \(\hat{\beta}_0+\hat{\beta}_1{\text raven}+\hat{\gamma}_{i0}+\hat{\gamma}_{1i}{\text raven}\) where \(i\) denotes the school, \(i=1,\ldots,49\).
Right panels (gg2) show the predicted values for the random intercepts and slopes for the 49 schools. X-axis: predicted value with interval from 2.5% to 97.5% quantile of the distribution for the predicted value. Y-axis is only school, sorted by the random intercept, red lines for predicted values below 0 and blue above.

Finally:

Write the mathematical formula for the random intercept and slope model and comment on what you see from fitting the model.

e)

Now imagine that we want to model the probability for a student to fail maths instead of the the individual grades given in maths.

Why is it not suitable to use a linear mixed effects model?
What type of model would be more suitable? (hint: IL module 7)
How would we add a random school intercept into this model (in which part of the model)?
What is the main challenge with this type of models? (hint: marginal model)

TMA4315 Generalized Linear Models

Compulsory exercise 3: (Generalized) Linear Mixed Models

Deadline: Friday, November 24, 2017 at 16.00

a)

b)

c)

d)

e)