Exercise Module 10: Generalised linear models

Instructions:

This module accompanies the lectures from week 10.

This document contains information, questions, R code, and plots.

Hints and reminders are in bold

Questions appear in blue.

# R code looks like this.
# Comments are in grey and explain the code

2+2 # add two and two

## [1] 4

# outputs from R appear like the above ^^^^

Contents of module

Part A = Is a linear model appropriate?
Part B = Fit your first glm
Part C = Which distribution?
Part D = Pheonix data
Part E = Model checking

New R skills

Using glm() with link and family arguments

Part A: Is a linear model appropriate?

Below you have three examples of datasets where a linear model has been fitted. Each is shown by three plots.

Plot 1 = the raw data with a straight line fitted.
Plot 2 = the residuals vs fitted plot for the linear model.
Plot 3 = the normal QQ plot for the linear model.

For each, answer:

1. If the linear model was appropriate.

2. Explain why/why not.

3. How could you improve it?

Example 1: Sparrow survival

Question: How does body weight influence survival probability in sparrows?

Data: Response = whether the bird survived (1), or not (0). Explanatory = body weight in grams

plot of chunk unnamed-chunk-2

Example 2: Length and weight in sparrows

Question: How does body weight influence total length of the sparrows?

Data: Response = total length in mm. Explanatory = body weight in grams

plot of chunk unnamed-chunk-3

Example 3: Fledge success in blue tits

Question: How does lay date influence the number of chicks that leave the nest?

Data: Response = number of chicks that fledge (leave nest alive). Explanatory = lay date (day since 1st April)

plot of chunk unnamed-chunk-4

Part B: Fit your first glm

Take on the 100m Olympic running times data you used in the regression week, found here: https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv

Use the code below to fit a glm() and an lm() for WomenTimes. Stick with gaussian family and identity link.

Compare the results from the two different models.

Use coef() and confint.lm() or summary(). Note, we have added something to the confint function!

# code for the glm - update to match your data
glm(Y ~ X, data,  family = gaussian(link=identity))  

# lm() is the same as always

Part C: Which distribution?

Look back at the examples in Part A.

Which distribution should you use for each?

Part D: Pheonix data

Take the data for the phoenix clutch size from https://www.math.ntnu.no/emner/ST2304/2019v/Week11/Phoenix.csv. It is a csv with a header.

Fit a GLM with a Poisson family and log link to look at whether location of nest influences number of eggs.

Basic code is below, you will need to edit it to match your data AND the distribution you need.

Look at results using coef(). How easy is it to interpret them?

glm(Y ~ X, data,  family = gaussian(link=identity))

The output of the glm() looks very similar to an lm(). But slightly more has been going on here. In particular, we have used a link function to link the systematic part of our model to the random part. This needs to be considered when interpreting. We need to think about what scale our parameters are working on!

Remember the link function! The parameters (coefficients) are for the linear predictor, which sits inside the link. Our link here was log(), the inverse is exp().

We want to interpret on the original scale of our data. This works differently for the different parameters we estimate.

\(\alpha\), the intercept, this represents the value of y when x = 0. Therefore, to be meaningful to us, we want this on the scale of y. To interpret it, and for plotting, we want to take the inverse of the coefficient estimate given by the glm(). The unit of the intercept = y.
\(\beta\), slopes, these represent the slope of a continuous relationship or the move from x=0 to x=1. Generally, it represents the change in y for a change of 1 in x. It represents this change on the link scale. We do NOT want to take the inverse of the \(\beta\) coefficient estimate to interpret it. What we might want to do, for prediction or plotting, is to take the inverse of the actual prediction. We can do this by predicting first using a linear equation, then take the inverse of the link. Or, we can take the inverse of the linear equation itself e.g. exp(alpha + betaX) or y = alpha + betaX, exp(y). You can interpret the direct and type of effect without transforming the coefficient at all. This works the same as for an lm().

Have a go at interpretting the results. Remember that Location is categorical.

Part E: Model checking

Now you have a model, it is important to check it too. You can see from the lecture that this is similar to the checking for an lm() but we need to use different residuals. This is because our residuals are be definition, not normal for most glm()s.

Check the fit of the model using Pearson and Deviance residuals. Check linearity, normality, and outliers.

Code to help is here:

resid(model, type="pearson")
resid(model, type="deviance")

fitted(model)

plot(fitted, residuals)
qqnorm(residuals)
qqline(residuals)
plot(model, which=4) # cook’s distance

What do you think? How easy is it do?

Are the two types of residuals different?