Exercise 11: Poisson GLM

Instructions:

Hints and reminders are in italics.

Questions appear in blue.

Needs to be completed and handed in by 11th April

General tips for exam style questions

This exercise has questions that are closer to the exam style. All of the more complex R code is provided at the bottom.

Read all questions carefully.
Take your time, there is no need to rush.
Try to think about when you have done similar things. We will ask you in different contexts but the ideas are the same.
Some answers will be very short/simple.
Some questions have no single correct answer; just make sure to justify your choices.
When asked for an equation it is likely to be an equation you would use to predict a value of Y for a given value of X.
When asked to explain something, or asked why, give reasons.
If we have not given you a plot, or if we ask you specifically to interpret coefficients, then you do not need a plot to answer. We won't trick you: take hints from the information we give with a question, because it tells you what sort of answer we expect.

Resources:

You might need to go back over some of your previous lectures in order to do this.

R-code at end

The challenge: Investigating fraud in the bird world.

Back in 1962, some British birders (people who spend a lot of time looking at rare birds) suspected that many of the observations from around Hastings between 1890 and 1930 were frauds. John Nelder (co-inventor of GLMs) took a look at the data and compared Hastings with two neighbouring areas.

This 'scandal' even has its own Wikipedia page. The link is hidden; try to solve the mystery yourselves before looking at the answer.

Wikipedia

https://en.wikipedia.org/wiki/Hastings_Rarities

Your job is to find out whether more rare birds were seen in Hastings than in the surrounding areas before 1925, compared with after 1925.

The data

You have data on:

Year (1895 to 1954)
Era (pre-1925 and post-1925)
Area (Hastings, Sussex, Kent): three nearby places in the UK
Count: number of records (number of reports of a rare species: could be the same species at different times)

We are looking only at the rarest species, nationally categorised as “Very Rare”. We have numbers observed in nearby areas and the focus area (Hastings) from two time periods. The 'pre' time period is when we suspect fraud might have occurred, the 'post' time period is after.

Data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv

##   Year     Area Count Era
## 1 1895 Hastings     2 Pre
## 2 1896 Hastings     4 Pre
## 3 1897 Hastings     2 Pre
## 4 1898 Hastings     0 Pre
## 5 1899 Hastings     1 Pre
## 6 1900 Hastings     4 Pre
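
As a minimal sketch of getting the data into R, the block below reads only the six preview rows shown above from inline text, so that it runs without a network connection. The object names preview_text and HastingsPreview are assumptions chosen for illustration, and the column names are taken from the preview. For the full data set you would pass the URL above to read.csv() instead, as indicated in the commented-out lines.

```r
# The six preview rows shown above, typed in as inline CSV text
# (illustration only; the full data set covers all years and all three areas).
preview_text <- "Year,Area,Count,Era
1895,Hastings,2,Pre
1896,Hastings,4,Pre
1897,Hastings,2,Pre
1898,Hastings,0,Pre
1899,Hastings,1,Pre
1900,Hastings,4,Pre"

HastingsPreview <- read.csv(text = preview_text, stringsAsFactors = TRUE)

# For the full data set, read directly from the course URL instead:
# HastingsData <- read.csv(
#   "https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv",
#   stringsAsFactors = TRUE)

str(HastingsPreview)        # check the type of each variable
head(HastingsPreview)       # first rows, as in the preview above
table(HastingsPreview$Era)  # number of rows in each era
```

Setting stringsAsFactors = TRUE makes Area and Era factors in any version of R. From R 4.0.0 onwards the default is FALSE, so text columns stay as character vectors unless you ask for factors.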

Questions

Before you begin any analyses, you need to think about what steps you will need to conduct in order to reach a conclusion. This relates to the modelling process.

1. What steps do we need to take to model this data?

2. What kind of data do you have as response and explanatory variable(s)?

3a. What model would you use for this data and why? 3b. What distribution would you use? 3c. What link function would you use?

4. How would you fit this model in R? Include a line of code. (Also actually do it.)

5. How can you test the hypothesis that there is an interaction between Era and Area, i.e. is the effect of Area different before and after 1925? (Also actually do it.) Once you have, decide on the best model based on the results of your hypothesis test.

6. What assumptions do we have for this model?

7. How can we test the assumptions? (do this)

8. How good is the fit of the model resulting from question 5? Can you improve it?

9. Interpret the output (coefficient values) for the model resulting from question 5. What does each of the values mean?

10. Do you think there was fraud going on in Hastings pre-1925? Explain your answer.

R-code

# glm(Y ~ X1*X2, family=" "(link= ), data=Data)
# anova(NullModel, AltModel, test = "LRT")
# AIC(Model1), BIC(Model1)

# plot(HastingsModel, which=4) # Cook's distance

# residuals <- residuals(Model)
# fitted <- fitted(Model)
# plot(fitted,residuals) # residuals vs fitted

# qqnorm(residuals) # normal QQ plot
# qqline(residuals)

# dispersion <- deviance(Model)/df.residual(Model)
# summary(Model, dispersion = dispersion)

# coef(Model)
# confint(Model)

# A way to colour code by Area
# (factor() makes this work whether Area was read in as text or as a factor)
# HastingsData$Colour <- c("red2", "blue", "orange")[as.numeric(factor(HastingsData$Area))]

# plot counts colour coded by area
# points(X, Y, col = colour by area, pch = shape, cex = size)
# points(HastingsData$Year, HastingsData$Count, col = HastingsData$Colour, pch=18, cex=1.5)

# MASS::glm.nb(Y~X) # negative binomial GLM
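
To see how some of the helper lines above fit together, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the counts are made up with rpois(), the names SimData, Group, Period, NullModel and AltModel are hypothetical, and the Poisson family with a log link is simply what the simulation uses. It is not a model answer for the Hastings data.

```r
set.seed(2304)  # make the simulated example reproducible

# Simulated (made-up) design: 2 groups x 2 periods, 30 observations each
SimData <- expand.grid(Rep = 1:30,
                       Group = c("A", "B"),
                       Period = c("Early", "Late"))
SimData$Group <- factor(SimData$Group)
SimData$Period <- factor(SimData$Period)

# True means used for the simulation: group B is higher only in "Early"
true_mean <- ifelse(SimData$Group == "B" & SimData$Period == "Early", 6, 2)
SimData$Count <- rpois(nrow(SimData), lambda = true_mean)

# Model without and with the interaction
NullModel <- glm(Count ~ Group + Period, family = poisson(link = "log"),
                 data = SimData)
AltModel  <- glm(Count ~ Group * Period, family = poisson(link = "log"),
                 data = SimData)

# Likelihood ratio test of the interaction term
anova(NullModel, AltModel, test = "LRT")

# Rough check for overdispersion: residual deviance per residual
# degree of freedom; values well above 1 suggest more variation
# than a Poisson model expects
dispersion <- deviance(AltModel) / df.residual(AltModel)
dispersion
```

Because the simulated counts really are Poisson, the dispersion value here should be close to 1. With real data it can be much larger, which is where the dispersion argument of summary() or MASS::glm.nb() becomes relevant.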