Exercise Module 10: Generalised linear models

Instructions:

This module accompanies the lectures from week 10.

This document contains information, questions, R code, and plots.

Hints and reminders are in bold

Questions appear in blue.

# R code looks like this.
# Comments are in grey and explain the code

2+2 # add two and two
## [1] 4
# outputs from R appear like the above ^^^^

Contents of module


New R skills


Part A: Is a linear model appropriate?

Below you have three examples of datasets where a linear model has been fitted. Each is shown by three plots.

For each, answer:

1. If the linear model was appropriate.

2. Explain why/why not.

3. How could you improve it?

Example 1: Sparrow survival

Question: How does body weight influence survival probability in sparrows?

Data: Response = whether the bird survived (1), or not (0). Explanatory = body weight in grams

plot of chunk unnamed-chunk-2

Example 2: Length and weight in sparrows

Question: How does body weight influence total length of the sparrows?

Data: Response = total length in mm. Explanatory = body weight in grams

plot of chunk unnamed-chunk-3

Example 3: Fledge success in blue tits

Question: How does lay date influence the number of chicks that leave the nest?

Data: Response = number of chicks that fledge (leave nest alive). Explanatory = lay date (day since 1st April)

plot of chunk unnamed-chunk-4


Part B: Fit your first glm

Take on the 100m Olympic running times data you used in the regression week, found here: https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv

Use the code below to fit a glm() and an lm() for WomenTimes. Stick with gaussian family and identity link.

Compare the results from the two different models.

Use coef() and confint.lm() or summary(). Note, we have added something to the confint function!

# code for the glm - update to match your data
glm(Y ~ X, data,  family = gaussian(link=identity))  

# lm() is the same as always

Part C: Which distribution?

Look back at the examples in Part A.

Which distribution should you use for each?


Part D: Pheonix data

Take the data for the phoenix clutch size from https://www.math.ntnu.no/emner/ST2304/2019v/Week11/Phoenix.csv. It is a csv with a header.

Fit a GLM with a Poisson family and log link to look at whether location of nest influences number of eggs.

Basic code is below, you will need to edit it to match your data AND the distribution you need.

Look at results using coef(). How easy is it to interpret them?

glm(Y ~ X, data,  family = gaussian(link=identity))  

The output of the glm() looks very similar to an lm(). But slightly more has been going on here. In particular, we have used a link function to link the systematic part of our model to the random part. This needs to be considered when interpreting. We need to think about what scale our parameters are working on!

Remember the link function! The parameters (coefficients) are for the linear predictor, which sits inside the link. Our link here was log(), the inverse is exp().

We want to interpret on the original scale of our data. This works differently for the different parameters we estimate.

Have a go at interpretting the results. Remember that Location is categorical.


Part E: Model checking

Now you have a model, it is important to check it too. You can see from the lecture that this is similar to the checking for an lm() but we need to use different residuals. This is because our residuals are be definition, not normal for most glm()s.

Check the fit of the model using Pearson and Deviance residuals. Check linearity, normality, and outliers.

Code to help is here:

resid(model, type="pearson")
resid(model, type="deviance")

fitted(model)

plot(fitted, residuals)
qqnorm(residuals)
qqline(residuals)
plot(model, which=4) # cook’s distance

What do you think? How easy is it do?

Are the two types of residuals different?