Exercise 11: Poisson GLM

Instructions:

Hints and reminders are in italics.

Questions appear in blue.

Answers are in red or given as R code.


General tips for exam style questions

This exercise contains questions in a more exam-like style. More complex R code is provided at the bottom.


Resources:

You might need to go back over some of your previous lectures in order to do this.

R-code at end


The challenge: Investigating fraud in the bird world.

Back in 1962, some British birders (people who spend a lot of time looking at rare birds) suspected that many observations from around Hastings between 1890 and 1930 were frauds. John Nelder (co-inventor of GLMs) took a look at the data and compared Hastings with two neighbouring areas.

This 'scandal' even has its own Wikipedia page. The link is below, but try to solve the mystery yourselves before looking at the answer.

Wikipedia

https://en.wikipedia.org/wiki/Hastings_Rarities

Your job is to find out whether more rare birds were seen in Hastings, relative to the surrounding areas, before 1925 than after 1925.


The data

You have counts of the rarest species, nationally categorised as “Very Rare”, observed in the focus area (Hastings) and in nearby areas over two time periods. The 'Pre' period (before 1925) is when we suspect fraud might have occurred; the 'Post' period is after.

Data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv

HastingsData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv", header = TRUE)
head(HastingsData)
##   Year     Area Count Era
## 1 1895 Hastings     2 Pre
## 2 1896 Hastings     4 Pre
## 3 1897 Hastings     2 Pre
## 4 1898 Hastings     0 Pre
## 5 1899 Hastings     1 Pre
## 6 1900 Hastings     4 Pre

Questions

Before you begin any analyses, you need to think about what steps you will need to conduct in order to reach a conclusion. This relates to the modelling process.

1. What steps do we need to take to model this data?

This question is asking for the general steps you would take for any data, including this one: look at (plot) the data; choose a model and distribution based on how the data were generated; fit the model; test hypotheses / select a model; check the model assumptions and fit; interpret the results and draw conclusions.

2. What kind of data do you have as response and explanatory variable(s)?

Response: number of records (count data). Explanatory: Era and Area (both categorical).

3a. What model would you use for this data and why? 3b. What distribution would you use? 3c. What link function would you use?

3a. A Poisson GLM, because the response is count data. We can also see the mean counts are low, so a normal approximation is not appropriate; choosing the Poisson is important here. 3b. The Poisson distribution, because it reflects how this count data was generated. 3c. The log link, as it is the canonical link for the Poisson and keeps the predicted mean counts positive.
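As a quick illustration of why a low mean matters (a hedged sketch, not part of the original exercise; the values are made up and not taken from the Hastings data): simulated Poisson counts with a small mean have variance equal to that mean and are clearly non-normal.

```r
# Simulate Poisson counts with a small mean, similar in size to the bird counts
# (illustrative values only, not estimates from the Hastings data)
set.seed(42)
lambda <- exp(0.5)  # with a log link, the linear predictor lives on the log scale
y <- rpois(1e5, lambda)

c(mean = mean(y), var = var(y))  # for a Poisson, both should be close to exp(0.5) = 1.65
table(y)[1:4]                    # heavily skewed: mostly 0s and 1s, so not normal
```

This is why a Poisson GLM with a log link is a better starting point here than a normal linear model.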

4. How would you fit this model in R, include a line of code. (also actually do it)

HastingsModel <- glm(Count ~ Area * Era, family = poisson(link = "log"), data = HastingsData)

5. How can you test the hypothesis that there is an interaction between Era and Area, i.e. is the effect of Area different before and after 1925? (also actually do it) Once you have, decide on the best model based on the result of your hypothesis test.

Conduct model selection. In this case it is confirmatory, because we have a specific hypothesis: that there is an interaction between Era and Area. As the model is a GLM, we do an analysis of deviance (a likelihood ratio test).

NullModel <- glm(Count ~ Area + Era, family = poisson(link = "log"), data = HastingsData)
anova(NullModel, HastingsModel, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: Count ~ Area + Era
## Model 2: Count ~ Area * Era
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       176     428.04                          
## 2       174     372.82  2   55.226 1.018e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The test was significant so we reject the null and choose a model with an interaction

6. What assumptions do we have for this model?

Assumptions should focus on: lack of outliers, correct distribution used, correct link function used, correct variance function used, constant dispersion parameter, and independence of y. I would especially expect a mention of the variance assumption, i.e. that variance = mean in a Poisson GLM, so no over- or under-dispersion.

7. How can we test these assumptions? (do this)

Answers in the R code

# We can test lack of outliers with Cook's distance plot
plot(HastingsModel, which=4)

[Figure: Cook's distance plot for HastingsModel]

# We can test correct distribution and link from our knowledge of data
# We can also look at whether the deviance residuals have equal variance and seem normal
# use residuals vs fitted
# this also checks linearity
residuals <- residuals(HastingsModel)
fitted <- fitted(HastingsModel)
plot(fitted,residuals)

[Figure: residuals vs fitted values for HastingsModel]

# and normal QQ
qqnorm(residuals)
qqline(residuals)

[Figure: normal Q-Q plot of the residuals]

# We can check the variance assumption by looking for overdispersion via the deviance ratio
deviance(HastingsModel)/df.residual(HastingsModel)
## [1] 2.142622
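A deviance ratio of about 2.1 indicates overdispersion. Besides the manual correction used below, R's built-in quasipoisson family estimates the dispersion for you. Here is a hedged sketch on simulated overdispersed counts (the data, variable names, and parameter values are made up for illustration, not taken from the Hastings data):

```r
set.seed(1)
# Simulate overdispersed counts: negative binomial draws have variance mu + mu^2/size,
# which is larger than the mean, so a Poisson fit would be overdispersed
y <- rnbinom(200, mu = 3, size = 1.5)
x <- gl(2, 100, labels = c("A", "B"))

# quasipoisson keeps the Poisson mean structure but estimates a free dispersion parameter
qp <- glm(y ~ x, family = quasipoisson(link = "log"))
summary(qp)$dispersion  # well above 1 for these data
```

The estimated dispersion automatically inflates the standard errors in `summary(qp)`, which is equivalent to the manual `summary(Model, dispersion = ...)` correction used in this exercise.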

8. How good is the fit of the model resulting from question 5?

Normality, linearity, outliers, and variance equality all look OK, but the model is overdispersed (deviance ratio ≈ 2.1). We should look at ways to deal with this, such as correcting the likelihood (scaling by the dispersion) or fitting a negative binomial GLM.

HastingsModelNB <- MASS::glm.nb(Count ~ Area*Era, data = HastingsData)
summary(HastingsModelNB)
## 
## Call:
## MASS::glm.nb(formula = Count ~ Area * Era, data = HastingsData, 
##     init.theta = 1.608068033, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3863  -1.0484  -0.8008   0.5106   2.0726  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1999   3.110 0.001871 ** 
## AreaKent           -0.7309     0.3160  -2.313 0.020733 *  
## AreaSussex         -1.0445     0.3377  -3.093 0.001980 ** 
## EraPre              1.4374     0.2533   5.676 1.38e-08 ***
## AreaKent:EraPre    -2.3643     0.4848  -4.877 1.08e-06 ***
## AreaSussex:EraPre  -1.7405     0.4743  -3.670 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.6081) family taken to be 1)
## 
##     Null deviance: 372.60  on 179  degrees of freedom
## Residual deviance: 178.17  on 174  degrees of freedom
## AIC: 557.96
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.608 
##           Std. Err.:  0.404 
## 
##  2 x log-likelihood:  -543.956
# Correct the likelihood: scale the variance by the estimated dispersion
dispersion <- deviance(HastingsModel)/df.residual(HastingsModel)
summary(HastingsModel, dispersion = dispersion)
## 
## Call:
## glm(formula = Count ~ Area * Era, family = poisson(link = log), 
##     data = HastingsData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9595  -1.1447  -0.8424   0.6478   5.3350  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1992   3.121  0.00180 ** 
## AreaKent           -0.7309     0.3494  -2.092  0.03645 *  
## AreaSussex         -1.0445     0.3904  -2.676  0.00746 ** 
## EraPre              1.4374     0.2202   6.527 6.70e-11 ***
## AreaKent:EraPre    -2.3643     0.5706  -4.143 3.43e-05 ***
## AreaSussex:EraPre  -1.7405     0.5514  -3.156  0.00160 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 2.142622)
## 
##     Null deviance: 848.02  on 179  degrees of freedom
## Residual deviance: 372.82  on 174  degrees of freedom
## AIC: 650.03
## 
## Number of Fisher Scoring iterations: 5

9. Interpret the output (coefficient values) for the model resulting from question 5

When interpreting any GLM, remember the link function! We can see that the mean for post-1925 Hastings is exp(0.62) = 1.86. The means for the other areas in the same period are: Kent = exp(0.62 - 0.73) ≈ 0.90 and Sussex = exp(0.62 - 1.04) ≈ 0.66. So there are differences between the areas after 1925 (none of the confidence intervals cross 0, so we would be unlikely to see these differences if nothing was going on). Remember there was overdispersion, so we need to multiply the standard errors by the square root of the dispersion to get better confidence estimates (correcting the likelihood), or use a negative binomial model.

Accounting for overdispersion, the confidence intervals still don't cross 0. So we would be unlikely to see any of these effects if there was no true effect.

If we look at the effect of Era for Hastings, we can see that pre-1925 had a higher number of rare birds. This effect is 1.44 on the log scale, giving a mean of exp(0.62 + 1.44) ≈ 7.84 for pre-1925 Hastings. This is MUCH higher than post-1925. If we then look at the interaction, we can see that both Kent and Sussex have negative values pre-1925, meaning they are lower than would be expected from an additive model. We can work out the pre-1925 means: Kent = exp(0.62 - 0.73 + 1.44 - 2.36) ≈ 0.35 and Sussex = exp(0.62 - 1.04 + 1.44 - 1.74) ≈ 0.48. From this we can see that pre-1925, both Kent and Sussex had lower mean counts than post-1925, whereas Hastings had much higher counts. Hastings behaves differently pre- and post-1925 compared to Sussex and Kent, which is what causes the interaction between Era and Area. You can give results from either the negative binomial or the corrected Poisson; the conclusions are the same. The correction only alters the standard errors/confidence intervals, and they still don't cross 0.

coef(HastingsModel)
##       (Intercept)          AreaKent        AreaSussex            EraPre 
##         0.6216882        -0.7308875        -1.0445450         1.4373860 
##   AreaKent:EraPre AreaSussex:EraPre 
##        -2.3642786        -1.7404662
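The group means quoted in the interpretation can be reproduced directly from these coefficients by summing the relevant terms and inverting the log link with exp(). A sketch using the coefficient values printed above (rounded to 4 decimal places; the vector names are just shorthand labels):

```r
# Coefficients from coef(HastingsModel), rounded
b <- c(int = 0.6217, kent = -0.7309, sussex = -1.0445,
       pre = 1.4374, kent_pre = -2.3643, sussex_pre = -1.7405)

# Mean counts on the original scale: exp() undoes the log link
hastings_post <- exp(b[["int"]])                                                   # ~1.86
hastings_pre  <- exp(b[["int"]] + b[["pre"]])                                      # ~7.84
kent_pre      <- exp(b[["int"]] + b[["kent"]] + b[["pre"]] + b[["kent_pre"]])      # ~0.35
sussex_pre    <- exp(b[["int"]] + b[["sussex"]] + b[["pre"]] + b[["sussex_pre"]])  # ~0.48
```

Alternatively, `predict(HastingsModel, newdata, type = "response")` does the same sums and back-transformation for you.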
# Results - corrected the likelihood
summary(HastingsModel, dispersion = dispersion)
## 
## Call:
## glm(formula = Count ~ Area * Era, family = poisson(link = log), 
##     data = HastingsData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9595  -1.1447  -0.8424   0.6478   5.3350  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1992   3.121  0.00180 ** 
## AreaKent           -0.7309     0.3494  -2.092  0.03645 *  
## AreaSussex         -1.0445     0.3904  -2.676  0.00746 ** 
## EraPre              1.4374     0.2202   6.527 6.70e-11 ***
## AreaKent:EraPre    -2.3643     0.5706  -4.143 3.43e-05 ***
## AreaSussex:EraPre  -1.7405     0.5514  -3.156  0.00160 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 2.142622)
## 
##     Null deviance: 848.02  on 179  degrees of freedom
## Residual deviance: 372.82  on 174  degrees of freedom
## AIC: 650.03
## 
## Number of Fisher Scoring iterations: 5
# Results - negative binomial glm
summary(HastingsModelNB)
## 
## Call:
## MASS::glm.nb(formula = Count ~ Area * Era, data = HastingsData, 
##     init.theta = 1.608068033, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3863  -1.0484  -0.8008   0.5106   2.0726  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1999   3.110 0.001871 ** 
## AreaKent           -0.7309     0.3160  -2.313 0.020733 *  
## AreaSussex         -1.0445     0.3377  -3.093 0.001980 ** 
## EraPre              1.4374     0.2533   5.676 1.38e-08 ***
## AreaKent:EraPre    -2.3643     0.4848  -4.877 1.08e-06 ***
## AreaSussex:EraPre  -1.7405     0.4743  -3.670 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.6081) family taken to be 1)
## 
##     Null deviance: 372.60  on 179  degrees of freedom
## Residual deviance: 178.17  on 174  degrees of freedom
## AIC: 557.96
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.608 
##           Std. Err.:  0.404 
## 
##  2 x log-likelihood:  -543.956

Code to create a plot of the data

# Might want to plot
# colour code by Area (wrap in factor() so this works whether Area was read
# in as a factor or as character)
HastingsData$Colour <- c("red2", "blue", "orange")[as.numeric(factor(HastingsData$Area))]

plot(HastingsData$Year, 
     HastingsData$Count, type="n", xlab="Year", ylab="Frequency")
# put a rectangle below 1925
rect(1850, -5, 1925, 59, col="grey90", border=NA)
# plot counts colour coded by area
points(HastingsData$Year, 
       HastingsData$Count, col = HastingsData$Colour, pch=18, cex=1.5)
# add a legend to show the different colours
legend(1935, 20, legend = levels(factor(HastingsData$Area)),
       col = c("red2", "blue", "orange"), pch = 18, cex = 1)

[Figure: counts by year, colour coded by area, with the pre-1925 era shaded grey]

10. Do you think there was fraud going on in Hastings pre-1925? Explain your answer

It is very suspicious that Hastings follows the opposite pattern to the other areas, particularly that it had roughly four times as many rare bird records pre-1925 as after (exp(1.44) ≈ 4.2). There could be biological reasons for this, such as a decline of rare species over time, but fraud seems more likely given the patterns in the other areas.


R-code

# glm(Y ~ X1*X2, family = poisson(link = "log"), data = Data)  # choose family/link as needed
# anova(NullModel, AltModel, test = "LRT")
# AIC(Model1), BIC(Model1)

# plot(HastingsModel, which=4) # Cook's distance

# residuals <- residuals(Model)
# fitted <- fitted(Model)
# plot(fitted,residuals) # residuals vs fitted

# qqnorm(residuals) # normal QQ plot
# qqline(residuals)

# dispersion <- deviance(Model)/df.residual(Model)
# summary(Model, dispersion = dispersion)

# coef(Model)
# confint(Model)

# A way to colour code by Area
# HastingsData$Colour <- c("red2", "blue", "orange")[as.numeric(factor(HastingsData$Area))]

# plot counts colour coded by area
# points(X,Y, col=by area), pch = shape, cex = size
# points(HastingsData$Year, HastingsData$Count, col = HastingsData$Colour, pch=18, cex=1.5)

# MASS::glm.nb(Y ~ X)  # negative binomial GLM