This module accompanies the lectures from week 10.
This document contains information, questions, R code, and plots.
Hints and reminders are in bold
Questions appear in blue.
# R code looks like this.
# Comments are in grey and explain the code
2+2 # add two and two
## [1] 4
# outputs from R appear like the above ^^^^
glm()
with link
and family
argumentsBelow you have three examples of datasets where a linear model has been fitted. Each is shown by three plots.
For each, answer:
1. If the linear model was appropriate.
2. Explain why/why not.
3. How could you improve it?
Example 1: Sparrow survival
Question: How does body weight influence survival probability in sparrows?
Data: Response = whether the bird survived (1), or not (0). Explanatory = body weight in grams
Example 2: Length and weight in sparrows
Question: How does body weight influence total length of the sparrows?
Data: Response = total length in mm. Explanatory = body weight in grams
Example 3: Fledge success in blue tits
Question: How does lay date influence the number of chicks that leave the nest?
Data: Response = number of chicks that fledge (leave nest alive). Explanatory = lay date (day since 1st April)
Take on the 100m Olympic running times data you used in the regression week, found here: https://www.math.ntnu.no/emner/ST2304/2019v/Week5/Times.csv
Use the code below to fit a glm()
and an lm()
for WomenTimes.
Stick with gaussian family and identity link.
Compare the results from the two different models.
Use coef()
and confint.lm()
or summary()
. Note, we have added something to
the confint function!
# code for the glm - update to match your data
glm(Y ~ X, data, family = gaussian(link=identity))
# lm() is the same as always
Look back at the examples in Part A.
Which distribution should you use for each?
Take the data for the phoenix clutch size from https://www.math.ntnu.no/emner/ST2304/2019v/Week11/Phoenix.csv. It is a csv with a header.
Fit a GLM with a Poisson family and log link to look at whether location of nest influences number of eggs.
Basic code is below, you will need to edit it to match your data AND the distribution you need.
Look at results using coef()
. How easy is it to interpret them?
glm(Y ~ X, data, family = gaussian(link=identity))
The output of the glm()
looks very similar to an lm()
. But slightly more
has been going on here. In particular, we have used a link function to link the
systematic part of our model to the random part. This needs to be considered when
interpreting. We need to think about what scale our parameters are working on!
Remember the link function! The parameters (coefficients) are for the linear predictor,
which sits inside the link. Our link here was log()
, the inverse is exp()
.
We want to interpret on the original scale of our data. This works differently for the different parameters we estimate.
\(\alpha\), the intercept, this represents the value of y when x = 0. Therefore, to
be meaningful to us, we want this on the scale of y. To interpret it, and
for plotting, we want to take the inverse of the coefficient estimate given by the glm()
.
The unit of the intercept = y.
\(\beta\), slopes, these represent the slope of a continuous relationship or the
move from x=0 to x=1. Generally, it represents the change in y for a change of 1 in x. It
represents this change on the link scale. We do NOT want to take the inverse of the
\(\beta\) coefficient estimate to interpret it. What we might want to do, for prediction or
plotting, is to take the inverse of the actual prediction. We can do this by predicting first
using a linear equation, then take the inverse of the link. Or, we can take the inverse of
the linear equation itself e.g. exp(alpha + betaX) or y = alpha + betaX, exp(y). You can interpret
the direct and type of effect without transforming the coefficient at all. This
works the same as for an lm()
.
Have a go at interpretting the results. Remember that Location is categorical.
Now you have a model, it is important to check it too. You can see from the lecture
that this is similar to the checking for an lm()
but we need to use different residuals.
This is because our residuals are be definition, not normal for most glm()
s.
Check the fit of the model using Pearson and Deviance residuals. Check linearity, normality, and outliers.
Code to help is here:
resid(model, type="pearson")
resid(model, type="deviance")
fitted(model)
plot(fitted, residuals)
qqnorm(residuals)
qqline(residuals)
plot(model, which=4) # cook’s distance
What do you think? How easy is it do?
Are the two types of residuals different?