Part 1: GLMs in Theory

0. Why do we use GLMs?

This is the log-likelihood for a binomial distribution (from Week 2):

\[ log(Pr(n = r | N, p)) = \log \left( \frac{N !}{r! (N-r)!} \right) + r \log{p} + (N-r) \log(1-p) \]

We want to model \(p\), but is it a GLM? In other words, does the likelihood have this form?

\[ l(\theta | y) = \frac{y \theta -b(\theta)}{a(\phi)} + c(y, \phi) \]

It’s not obvious, is it? But we can re-arrange things:

\[ \begin{aligned} l(p | r, N) &= \log \left( \begin{array}{c} N \\ r \end{array} \right) + r \left( \log{p} - \log{(1-p)} \right) + N \log{(1-p)} \\ &= \log \left( \begin{array}{c} N \\ r \end{array} \right) + r \log \left(\frac{p}{1-p} \right) + N \log{(1-p)} \\ &= \log \left( \begin{array}{c} N \\ r \end{array} \right) + \frac{ \frac{r}{N} \log \left(\frac{p}{1-p} \right) + \log{(1-p)}}{1/N} \end{aligned} \] This is makes things a bit easier to pull apart.

1a. Have a go at identifying \(y\), \(\theta\), \(b(\theta)\), \(a(\phi)\), \(c(y, \phi)\).

Remember, \(y\) is the data, and \(\theta\) is the linear predictor. Also, \(\phi\) is the dispersion parameter, but here it is a constant.

Answer below

\[ \begin{aligned} y &= \frac{r}{N} \\ \theta &= \log \left(\frac{p}{1-p} \right) \\ b(\theta) &= -\log (1-p) \\ a(\phi) &= {1/N} \\ c(y, \phi) &= \log \left( \begin{array}{c} N \\ r \end{array} \right) \end{aligned} \]

1b. What is the link function in this model? Look at the Poisson example in the lecture to see where you find it

Because we have \(\theta = \log \frac{p}{1-p}\), the link function must be \(\log \frac{p}{1-p}\). This is called the logit link.

Part 2: Bumpus’ Sparrows

Sparrow * Photo credit Wikipedia

In 1898 Bumpus (or, probably, one of his technicians) collected some sparrows that had been blown out of their trees during a snow storm. He wanted his team (you) to look at which sparrows survived. It is your job to take the data and try to do this!

The data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week11/BumpusSparrows.csv

There are many variables in this dataset. Today, we want to focus on just a few - Status (whether they are alive or dead) and Total length (total length of sparrow in mm).

and a scan of the paper is here:

https://biodiversitylibrary.org/page/5071528

Below your team wants to take some summaries of the different variables. You use the function tapply() which creates a summary (defined in the third argument) of the first argument, split by the second argument. e.g. to get the mean height per year you could do tapply(height, year, mean).

Bumpus <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week11/BumpusSparrows.csv")
Bumpus$Survived <- as.numeric(Bumpus$Status=="Alive")

# Calculate the mean survival for each size
MeanSurv <- tapply(Bumpus$Survived, list(Bumpus$Total.Length), mean)
# Count the number who survived & died for each size
TotSurv <- tapply(Bumpus$Survived, list(Bumpus$Total.Length), sum)
TotDied <- tapply(1-Bumpus$Survived, list(Bumpus$Total.Length), sum)

The calculations for the mean survival and numbers who survived/died for each size are going to be used to plot the data.

plot(jitter(Bumpus$Total.Length), jitter(Bumpus$Survived, factor=0.4), col="grey70",
     xlab="Total Length", ylab="Proportion Survived")
# add the empirical proportion who survived
lines(as.numeric(names(MeanSurv)), MeanSurv, col="darkblue")
# add the counts of the numbers who survived and died
text(as.numeric(names(TotSurv)), 1, TotSurv)
text(as.numeric(names(TotDied)), 0, TotDied)

2. Use the plot to describe the main features of the relationship between total length and survival. i.e. look at the plot and describe the patterns you can see, relate these patterns back to total length and survival

This answer only needs information from the plot above! From the plot you can see that while survival probability seems to increase initially from the smallest lengths to aroun 153mm, it then decreases steading as individuals get bigger than 155mm. The relationship appears to be curved, in an ‘n’ shape, this suggests a quadratic (X to the power of two) relationship. You could also notice there is more data in the middle i.e. from 155mm to 165mm.

Now your team decides to fit a model (lm) with Total Length and Total Length squared as covariates, and survival as a response (i.e. lm(Survived ~Total.Length + I(Total.Length^2), data=Bumpus)). Previous work has shown that the quadratic term improves the model fit, which is why we do it here.

3. How well does the model fit the data? How much of the variance it does it explain, and how do the residuals look?

The model does not fit the data very well. The R squared is 0.10 so only 10% of the variation in Y is explained by our X values. The residuals also look quite bad. They are just in two lines! The variance might be equal across the fitted values but this is hard to see. The model also gives an intercept that is impossible. It is below 0 (no link in the LM, so already on original scale) and a survival probability of less than 0 cannot exist.

4. Explain why the plot of the residuals against the fitted values (or against total length) looks like it does?

We have the two lines of residuals because our data are only 0s and 1s, so they form lines as they only take two values. As the fitted value (probability of survival for different lengths) increases, the residuals become lower. This is because we are substracting 0s and 1s from 0.5+ giving residuals of -0.5 and +0.5 towards the more extreme fitted values.

mod.lm <- lm(Survived ~Total.Length + I(Total.Length^2), data=Bumpus)
# look at variance explained from LM
summary(mod.lm)$r.squared

## [1] 0.1002291

residuals <- resid(mod.lm)
fitted <- fitted(mod.lm)
plot(fitted,residuals)

Based on the results of the linear model, your team now decides to fit a GLM assuming a binomial distribution and a logit link:

mod.glm <- glm(Survived ~ Total.Length + I(Total.Length^2), data=Bumpus, 
               family = binomial(link="logit"))

We can’t plot the fitted line with abline(), so we need to predict the proportion for each size. Here is some code: you should plot it yourself.

# create data to predict for: this goes a bit beyond the data, to make things clearer
# create new vector of total length from 5 lower than min to 5 more than max
Total.Length.pred <- data.frame(Total.Length=(min(Bumpus$Total.Length)-5):(max(Bumpus$Total.Length)+5))
# predict survival probability based on your glm model
Total.Length.pred$PredSurv <- predict(mod.glm, newdata = Total.Length.pred, type="response")
# also predict based on the lm
Total.Length.pred$PredSurv.lm <- predict(mod.lm, newdata = Total.Length.pred, type="response")

# plot the data
plot(jitter(Bumpus$Total.Length), jitter(Bumpus$Survived, factor=0.4), col="grey70",
     xlab="Total Length", ylab="Proportion Survived")
# add the empirical proportion who survived
lines(as.numeric(names(MeanSurv)), MeanSurv, col="lightblue")
# now add a line for the glm predictions
lines(Total.Length.pred$Total.Length, Total.Length.pred$PredSurv, lty=2, col="darkred")
# and one for the lm predictions
lines(Total.Length.pred$Total.Length, Total.Length.pred$PredSurv.lm, lty=3, col="darkblue")

5. How good do the fitted models look? i.e. how do the glm (red) and lm (blue) predicted lines look?

Both fitted models look much better than the up and down line we had before. The lm and glm are actually quite similar. In the middle, especially between 155 and 160mm, the two fitted lines are the almost the same. But at the ends they are different. The lm goes below 0, which is not good. But the glm does not. The glm fits slightly better and both capture the general shape of the data. I think the glm is ok!

6. Are the predictions from each model reasonable for large birds? What would happen if we extrapolated just a bit beyond the largest bird? Think about each model separately and compare

For large birds the lm predictions are not reasonable, they go below 0 already and would go considerably negative if we extrapolated. This is impossible, so very bad. The glm however seems to level off so might predict around 0 or very low survival probability. For both models we have less data here so harder to know if the predictions will be good but at least the glm predicts something that is possible.

7. What conclusions can you draw about length of sparrows and survival probability in snow storms? What could this mean for the future of the population?

The conclusions we can draw at that low and high lengths have lower survival probability than mid lengths e.g. 155 to 160 mm. This makes biological sense as smaller birds maybe have fewer reserves so are less able to survive disturbance and bigger birds have higher energy requirement so also might have problems. Mid sized birds could also be prime age vs younger or older birds that are small or big. For the future of the population more snow storms could mean a shift in body length distribution to closer to middle values, the mean could remain unchanged but variance would shift.

Exercises Week 11: GLMs

Instructions:

Part 1: GLMs in Theory

Part 2: Bumpus’ Sparrows