Exercise 11: Poisson GLM

Instructions:

Hints and reminders are in italics.

Questions appear in blue.

This exercise needs to be completed and handed in by 11th April.


General tips for exam style questions

The questions in this exercise are written more in the style of exam questions. All of the more complex R code is provided at the bottom.


Resources:

You might need to go back over some of your previous lectures in order to do this.

R-code at end


The challenge: Investigating fraud in the bird world.

Back in 1962, some British birders (people who spend a lot of time looking for rare birds) suspected that many of the observations reported from around Hastings between 1890 and 1930 were fraudulent. John Nelder (co-inventor of GLMs) took a look at the data and compared Hastings with two neighbouring areas.

This 'scandal' even has its own Wikipedia page. The link is hidden; try to solve the mystery yourselves before looking at the answer.

Wikipedia

https://en.wikipedia.org/wiki/Hastings_Rarities

Your job is to find out whether more rare birds were seen in Hastings than in the surrounding areas before 1925, compared with after 1925.


The data

You have data on four variables: Year, Area, Count and Era. The first rows are shown at the end of this section.

We are looking only at the rarest species, nationally categorised as “Very Rare”. We have numbers observed in nearby areas and the focus area (Hastings) from two time periods. The 'pre' time period is when we suspect fraud might have occurred, the 'post' time period is after.

Data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv

##   Year     Area Count Era
## 1 1895 Hastings     2 Pre
## 2 1896 Hastings     4 Pre
## 3 1897 Hastings     2 Pre
## 4 1898 Hastings     0 Pre
## 5 1899 Hastings     1 Pre
## 6 1900 Hastings     4 Pre
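
One possible way to load and inspect the data is sketched below. It assumes the CSV has a header row with the column names shown above; load_hastings is a hypothetical helper name, not part of the course code. If the file also stores row numbers, read.csv will add an extra column (typically called X) that you can ignore.

```r
# URL taken from the exercise text above
hastings_url <- "https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv"

# Hypothetical helper (an assumption, not part of the course code):
# read the CSV and store Area and Era as factors so that R treats
# them as categorical variables.
load_hastings <- function(path = hastings_url) {
  dat <- read.csv(path, header = TRUE)
  dat$Area <- factor(dat$Area)
  dat$Era <- factor(dat$Era)
  dat
}

# Usage (needs an internet connection):
# HastingsData <- load_hastings()
# head(HastingsData)                           # first six rows
# str(HastingsData)                            # variable types
# table(HastingsData$Area, HastingsData$Era)   # rows per Area and Era
```

Converting Area and Era to factors explicitly is a deliberate choice here: since R 4.0, read.csv leaves text columns as character vectors by default.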

Questions

Before you begin any analyses, you need to think about what steps you will need to conduct in order to reach a conclusion. This relates to the modelling process.

1. What steps do we need to take to model this data?

2. What kind of data do you have as response and explanatory variable(s)?

3a. What model would you use for this data and why?

3b. What distribution would you use?

3c. What link function would you use?

4. How would you fit this model in R? Include a line of code (and actually fit the model).

5. How can you test the hypothesis that there is an interaction between Era and Area, i.e. is the effect of Area different before and after 1925? (Actually run the test.) Once you have, decide on the best model based on the result of your hypothesis test.

6. What assumptions do we have for this model?

7. How can we test the assumptions? (Do this.)

8. How good is the fit of the model resulting from question 5? Can you improve it?

9. Interpret the output (coefficient values) for the model resulting from question 5. What does each of the values mean?

10. Do you think there was fraud going on in Hastings pre-1925? Explain your answer.


R-code

# glm(Y ~ X1*X2, family=" "(link= ), data=Data)
# anova(NullModel, AltModel, test = "LRT")
# AIC(Model1); BIC(Model1)

# plot(Model, which=4) # Cook's distance

# resids <- residuals(Model)
# fits <- fitted(Model)
# plot(fits, resids) # residuals vs fitted

# qqnorm(resids) # normal QQ plot
# qqline(resids)

# dispersion <- deviance(Model)/df.residual(Model)
# summary(Model, dispersion = dispersion)

# coef(Model)
# confint(Model)

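# A way to colour code by Area
# (factor() is needed if Area was read in as text; colours follow the alphabetical order of the areas)
# HastingsData$Colour <- c("red2", "blue", "orange")[as.numeric(factor(HastingsData$Area))]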

# plot counts colour coded by area (points() adds to an existing plot, so call plot() first)
# points(X, Y, col = colour by area, pch = shape, cex = size)
# points(HastingsData$Year, HastingsData$Count, col = HastingsData$Colour, pch=18, cex=1.5)

# MASS::glm.nb(Y~X) # negative binomial GLM