More House Sparrows in North America

The North American Breeding Bird Survey is conducted across North America to estimate the abundances of birds. Bird watchers go to sites across North America and count the number of birds they observe for a set time. This data can be used to ask a wide range of questions. Here we can look at the effects of climate on abundance of the house sparrow.

The data set consists of 1714 observations at different sites across North America. For each site the presence of the house sparrow was recorded. Mean temperature (temp.mean) and total precipitation (perc.mean) were extracted from a standard database (CRU). The mean temperaure and precipitation are both scaled, so that a temperature value of 0.3 would mean that the temperature is 0.3 standard deviations above the mean in North-America.

Data <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week12/HouseSparrowPresence.csv")

plot(Data$Longitude, Data$Latitude, col=1+Data$Present, pch=18)
legend(-125, 31, c("Absent", "Present"), col=1:2, pch=18)

First, fit a generalised linear model for Present assuming a binomial response, with mean temperature and total precipitation as explanatory variables. Decide which link function and model to use.

form1 <- formula(Present ~ prec.mean.sc + temp.mean.sc)
model1.logit <- glm(form1, data=Data, family = binomial("logit"))
model1.cloglog <- glm(form1, data=Data, family = binomial("cloglog"))

# These are the models with the quadratic effects in
form2 <- formula(Present ~ prec.mean.sc + temp.mean.sc + I(prec.mean.sc^2) + I(temp.mean.sc^2))
model2.logit <- glm(form2, data=Data, family = binomial("logit"))
model2.cloglog <- glm(form2, data=Data, family = binomial("cloglog"))

1. Which link function do you want to use? Justify your choice.

There is no single correct answer here. So, a choice is ok, but the justification is important. Here are the three reasonable answers:

cloglog: This is porbably the best link, because the presences and absences come from counts, and this is then a model for whether the counts are zero or larger. Therefore, this could be considered most appropriate to represent the process that generated the data.

logit: This is the number two choice. It can be justified by saying that the counting processes is not the most important focus of the analysis, that you want to focus more on the presence-absence. Sometimes this makes sense ecologically: processes controlling abundance (i.e. counts) can sometimes be very different from those controlling presence/absence, in which case a logit link might make more sense.

probit: This choice isn’t wrong, but we can’t think of a good reason for it. It is a threshold type model. If anyone can justify it, then please tell us how you do it!

2. Write down the assumptions of this model and an equation for the expected number of sparrows at a site.

Assumptions should focus around: Lack of outliers, Correct distribution used, Correct link function is used, Correct variance function is used, Dispersion parameter is constant, Independence of y. I would also especially expect a mention of variance assumptions. I.e. that variance is controlled by mean in binomial GLM and therefore not overdispersed or underdispersed.

This part is quite a difficult question. Read the whole answer slowly and carefully! I’ve tried to go through it in steps. The simpliest and most familiar way to do this would be to use the Logit link, this is valid as discussed above. The equation for expected number of sparrows at site i refers to the equation you would need to use to predict, from the GLM coefficients, the value of a specific Y (response) given a specific X (explanatory).

In this case the X values (explanatory variables) are the precipitation (rainfall) and temperature values for site i. If this was a simple linear model we could predict the Y value (response variable) for site i using just a linear equation and the coefficients of intercept (alpha) and slopes (betas) and the X values of rainfall and temperature at site i (see below). Here T = temperature and R = precipitation.

\[ \begin{aligned} Y_i &= \alpha + \beta_TT_i + \beta_RR_i \end{aligned} \] All questions asking for an equation for the expected value of the response (Y) for a specific X, or asking for prediction, will have answers based around this general format. The key thing is to include the i for the Y and X values (here R and T) to indicate that you are predicting a specific Y value from specific X values.

But as this is a GLM, we need to remember the link funtion. To predict back on the original scale, we need to place this linear part into the inverse of the link. An example of this for the Logit link is shown below. Plugging your coefficient values for alpha and betas and chosen values of Xs into this equation would allow you predict Y values manually without the predict() function.

Here is the equation for the Logit link, it gives the expected presence not expected number. To get number rather than presence we need the cloglog link. If you used Logit being aware that cloglog would give number and Logit does not, should be ok!. Here T = temperature and R = precipitation.

For logistic link (probability of presence in Site i):

\[ \begin{aligned} Y_i &= \frac{1}{1+e^{-(\alpha + \beta_TT_i + \beta_RR_i)}} \end{aligned} \]

For cloglog link things are slightly different. On the link scale, we predict the log(Yi), if we take the exp() of that, we will get the number of sparrows at site i. This is because the log(Yi) is the same link as we have for a Poisson GLM. Again, this is towards the most advanced end of this course. So, to predict the number of sparrows at site i from a cloglog model we can predict on the link scale. This basically means we can predict using the linear equation we would for a linear model. The inverse link we then use is actually exp(), we assume that our data were generated from a Poisson process. The cloglog link made our binary data comparable to count data so when we take inverse, we can just use the one we would have for Poisson.

For cloglog link (number of sparrows at Site i):

\[ \begin{aligned} Y_i &= e^{(\alpha + \beta_TT_i + \beta_RR_i)} \end{aligned} \]

3. What is the estimated coefficient for the effect of mean temperature (prec.temp.sc)? And what is the 95% confidence interval for this coefficient? Does this suggest an effect of temperature?

Estimated coefficient = 0.085 (logit), CI = -0.04 to 0.21. This means that we cannot be certain about the direction of the effect of temperature. When uncertainty is included, we cannot say a direction for this effect. Interpretation is the same for cloglog model. (coef = 0.035, CI = -0.02 to 0.094)

# Logit model
coef(model1.logit)

##  (Intercept) prec.mean.sc temp.mean.sc 
##   1.41556014   0.11396537   0.08526596

confint(model1.logit)

## Waiting for profiling to be done...

##                    2.5 %    97.5 %
## (Intercept)   1.29706668 1.5370466
## prec.mean.sc -0.01273306 0.2402357
## temp.mean.sc -0.04038918 0.2127383

# cloglog
coef(model1.cloglog)

##  (Intercept) prec.mean.sc temp.mean.sc 
##   0.48807229   0.05306602   0.03518324

confint(model1.cloglog)

## Waiting for profiling to be done...

##                    2.5 %     97.5 %
## (Intercept)   0.42880454 0.54667667
## prec.mean.sc -0.00778442 0.11417569
## temp.mean.sc -0.02381297 0.09409312

4. What is the predicted probability of sparrows being found in the following sites:

1. at a site near Seattle where the mean temperature is 0.3 standard deviations below the mean, and total precipitation is 0.3 standard deviations above the mean?

Predicted probability from cloglog = 0.81. From logit = 0.81.

2. at a site just west of Seattle, in a temperate rainforest, where the mean temperature is 0.4 standard deviations below the average, but the total precipitation is 5.8 standard deviations above the average.

Predicted probability from cloglog = 0.89. From logit = 0.89. Both of these are called responses, basically they are back on probability scale. R has used the inverse link function.

predData <- data.frame(temp.mean.sc=c(-0.3, -0.4), prec.mean.sc=c(0.3, 5.8))
predict(model1.logit, newdata=predData, se.fit=TRUE, type="response")

## $fit
##         1         2 
## 0.8059913 0.8851828 
## 
## $se.fit
##          1          2 
## 0.01078668 0.03976396 
## 
## $residual.scale
## [1] 1

predict(model1.cloglog, newdata=predData, se.fit=TRUE, type="response")

## $fit
##         1         2 
## 0.8056192 0.8875631 
## 
## $se.fit
##          1          2 
## 0.01076428 0.04695680 
## 
## $residual.scale
## [1] 1

As well as using the predict function, you can also predict manually using the equations in question 2. But because this question asks for ‘probability’ we would need to use the cloglog inverse not log() for that model. Examples for each model below

# First LOGIT model
# begin with looking at the coefficient values
coef(model1.logit) # alpha(intercept) = 1.42, Rain beta = 0.11, Temp beta = 0.09

##  (Intercept) prec.mean.sc temp.mean.sc 
##   1.41556014   0.11396537   0.08526596

# then you plug them and the X values for site 1 (Rain = 0.3, Temp = -0.3) into the equation from Q2.
prediction <- 1/(1+exp(-(1.42+(0.11*0.3)+(0.09*-0.3)))) 
round(prediction,2)

## [1] 0.81

# should also round to 0.81! YAY

# Now for CLOGLOG
# again, begin with coefficient values
coef(model1.cloglog) # alpha(intercept) = 0.49, Rain beta = 0.05, Temp beta = 0.04

##  (Intercept) prec.mean.sc temp.mean.sc 
##   0.48807229   0.05306602   0.03518324

# then you plug into the inverse cloglog link equation
# that is 1-exp(-exp(alpha + beta*x))
prediction2 <- 1-exp(-exp(0.49+(0.05*0.3)+(0.03*-0.4)))
round(prediction2,2)

## [1] 0.81

# agains should round to 0.81! YAY

From ecological theory, we might expect that there is an optimum temperature and precipitation for the sparrows, and abundance declines when the conditions are further away from the optimum. We can model this using a quadratic curve.

Fit a model with quadratic effects for temperature and precipitation.

coef(model2.cloglog)

##       (Intercept)      prec.mean.sc      temp.mean.sc I(prec.mean.sc^2) 
##        0.70796868        0.03620322        0.08790546       -0.03559948 
## I(temp.mean.sc^2) 
##       -0.17372846

coef(model2.logit)

##       (Intercept)      prec.mean.sc      temp.mean.sc I(prec.mean.sc^2) 
##        1.82686960        0.06576767        0.19244368       -0.05587003 
## I(temp.mean.sc^2) 
##       -0.31733508

confint(model2.cloglog)

## Waiting for profiling to be done...

##                         2.5 %       97.5 %
## (Intercept)        0.61767816  0.798260195
## prec.mean.sc      -0.02736616  0.099583285
## temp.mean.sc       0.02200101  0.154439759
## I(prec.mean.sc^2) -0.08250291  0.006793856
## I(temp.mean.sc^2) -0.23102075 -0.117827285

confint(model2.logit)

## Waiting for profiling to be done...

##                         2.5 %      97.5 %
## (Intercept)        1.64456440  2.01445039
## prec.mean.sc      -0.05708296  0.18812957
## temp.mean.sc       0.07139517  0.31385592
## I(prec.mean.sc^2) -0.12840839  0.02192178
## I(temp.mean.sc^2) -0.41706711 -0.21764286

5. Does the ecological theory seem reasonable for this data? What parameter value(s) tell you about this?

This is asking whether the idea that there is a quadratic (n shaped) effect of temperature and precipitation is reasonable. We can test this by adding a quadratic effect to our model and seeing if the estimates have confidence intervals that span 0. The parameter values that tell us about this are those for I(prec.mean.sc^2) and I(temp.mean.sc^2). The confidence interval for the quadratic effect of precipitation crosses 0 so suggests that with uncertainty we are not sure of the direction of this effect, so ecological theory doesn’t hold. But for temperature it looks like it does. We would be unlikely to see the temperature effect if there was no quadratic effect.

6. What is the predicted number of sparrows for the two sites mentioned above?

predData <- data.frame(temp.mean.sc=c(-0.3, -0.4), prec.mean.sc=c(0.3, 5.8))
exp(predict(model2.cloglog, newdata=predData, type ="link"))

##         1         2 
## 1.9613215 0.7099379

predict(model2.logit, newdata=predData, type="response")

##         1         2 
## 0.8526173 0.5501071

# Manually from Q2
# original X values: Rain = 0.3, Tmep = -0.3
# squared: Rain2 = 0.09, Temp = 0.09
# alpha(intercept) = 0.71, beta Rain = 0.04, beta Rain2 = -0.04, beta Temp = 0.09, beta Temp2 = -0.17
prediction <- exp(0.71+(0.04*0.3) + (-0.04*0.09) + (0.09*-0.3) + (-0.17*0.09)) 
prediction

## [1] 1.966195

# should be 1.97, rounds to 2 individuals, same as predict() function!

This question is a bit more complicated and relates to the cloglog link. If you used the Logit link, you could just answer this question in theory i.e. how you could get to the abundance. To do this from the cloglog link you need to predict on the link scale i.e. without taking in inverse. This is just the same as the equation in Q2. Then we take exp() of prediction, again just like Q2. If you do this you predict abundance of 2 and 0.7. From the Logit model you jsut predict probability of presence of 0.85 and 0.55 respectively.

As well as using the predict function you could also plug the coefficient and X (now also including X squared) values into the equations from Q2 to get the same prediction. Shown above in R code.

7. Comment briefly on the differences in the predictions from the linear and quadratic model.

This asks you to compare the predictions from the model without the quadratic terms to the model with them. The predictions for site 1 go up with the quadratic model but go down for site 2. The quadratic model is clearly showing a different pattern to the linear one. It has reversed the relative magnitude of the two sites.

Exercise 10: Binomial Models

Instructions:

More House Sparrows in North America