Exercise 11: Investigating fraud

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are bold

Questions appear in blue.

In this exercise, the aim is to practice all of the modelling tools you learnt in this course. The questions indicate how they might relate to an exam and how long an answer is expected in each case. You can use the solution to grade your own work and then direct your revision.

You can decide how much help you want. For each question there is a general hint and an R hint (if applicable).

In the exam you will not get hints. So, now is the time to practice without!

R this week:

Things to remember:

glm(Y ~ X1+X2, family=YOURFAMILY(link=YOURLINK), data=Data)
anova(NullModel, AltModel, test = "LRT") for confirmatory model selection
AIC(Model1), BIC(Model1) for exploratory model selection
plot(HastingsModel, which=4) # Cook’s distance
plot(fitted,residuals) # residuals vs fitted
qqnorm(residuals) # normal QQ plot

New this week:

dispersion <- deviance(Model)/df.residual(Model) to check for overdispersion
summary(Model, dispersion = dispersion) to correct for overdispersion
MASS::glm.nb(Y ~ X) to run a negative binomial GLM
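
As a rough sketch of how the three "New this week" lines fit together, the example below uses simulated overdispersed counts (not the Hastings data). The names SimData, PoisModel and NegBinModel, and all the simulated values, are made up for illustration.

```r
library(MASS)  # provides glm.nb()

set.seed(42)
# Simulated, overdispersed counts (NOT the Hastings data):
# negative binomial draws with a different mean in each group
X <- rep(c("a", "b"), each = 100)
Y <- rnbinom(200, mu = ifelse(X == "a", 2, 6), size = 1)
SimData <- data.frame(X = X, Y = Y)

# Start from an ordinary Poisson GLM
PoisModel <- glm(Y ~ X, family = poisson(link = "log"), data = SimData)

# Rough overdispersion check: values well above 1 suggest overdispersion
dispersion <- deviance(PoisModel) / df.residual(PoisModel)
dispersion

# Option 1: keep the Poisson estimates but scale up the standard errors
summary(PoisModel, dispersion = dispersion)

# Option 2: fit a negative binomial GLM instead
NegBinModel <- MASS::glm.nb(Y ~ X, data = SimData)
summary(NegBinModel)
```

Option 1 leaves the coefficient estimates unchanged and only widens the uncertainty. Option 2 changes the assumed distribution, so the estimates themselves can shift slightly.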

—–

The challenge: Investigating fraud in the bird world.

Back in 1962 some British birders (people who spend a lot of time looking at rare birds) suspected that a lot of observations from around Hastings between 1890 and 1930 were frauds. John Nelder (co-inventor of GLMs) took a look at the data, and compared Hastings with two nearby areas.

This ‘scandal’ even has its own Wikipedia page. The link is hidden; try to solve the mystery yourselves before looking at the answer.

Wikipedia

https://en.wikipedia.org/wiki/Hastings_Rarities

Your job is to find out whether more rare birds were seen in Hastings than in the surrounding areas before 1925, compared with after 1925.

—–

The data

You have data on:

Year (1895 to 1954)
Era (pre-1925 and after-1925)
Area (Hastings, Sussex, Kent): three regions in the UK
Count: number of records (number of reports of a rare species: could be the same species at different times)

We are only looking at the rarest species, nationally categorised as “Very Rare”. We have numbers observed in nearby areas and the focus area (Hastings) from two time periods. The ‘pre’ time period is when we suspect fraud might have occurred, the ‘post’ time period is after.

The data are in a .csv file with column headers, available at: https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv
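
A minimal sketch of importing and inspecting data with this layout is shown below. To stay runnable offline it reads a few made-up placeholder rows rather than the real file. The column names follow the list above, but the Era labels "pre" and "post" and the object names are assumptions, so check them against the actual file.

```r
# Address taken from the exercise text
DataURL <- "https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv"

# Made-up placeholder rows that only mimic the described columns.
# For the real analysis, use read.csv(DataURL, header = TRUE) instead.
ExampleCSV <- "Year,Era,Area,Count
1900,pre,Hastings,3
1900,pre,Kent,1
1940,post,Hastings,0
1940,post,Sussex,2"

ExampleData <- read.csv(text = ExampleCSV, header = TRUE)

# Treat the grouping variables as factors before modelling
ExampleData$Era <- factor(ExampleData$Era)
ExampleData$Area <- factor(ExampleData$Area)

str(ExampleData)                          # check column names and types
table(ExampleData$Area, ExampleData$Era)  # rows per Area and Era combination
```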

—–

Part A: Choosing an appropriate model

In this section, the aim is to practice explaining and justifying different modelling concepts.

Before you begin any analyses, you should think about what steps you will need to take in order to reach a conclusion. This relates to the modelling process.

A1. What steps would you want to take to model the Hastings data? (at least 4 steps)

General hint

Think or look back to the first few weeks or the summary of maximum likelihood concepts.

A2. What are the response and explanatory variable(s), and what data type is each? Be specific.

General hint

Remember we have been introduced to several different data types. The main distinction is categorical versus numeric, and numeric data can be discrete or fully continuous.

A3. What model would you use for this data and why? Include which distribution you would choose for the random part and if you need a link function (say which if you do).

General hint

Think about all of the models we have covered in the course: distributions, linear models, t-tests, generalised linear models. Which works best here?

Then think about what distribution it is using for error.

—–

Part B: Running the model

B1. How would you fit the model from A3 in R? Write one line of code.

General hint: if you are not sure about the model

If you were not sure you got A3 correct: the key thing is that the response variable is count data. Its errors will not be normally distributed, so it cannot be modelled with a linear model. We will need a GLM. A Poisson GLM with a log link is the appropriate choice for count data.

R hint

The code you need, with slight edits, is in the "R this week" section at the top of this document.

No I really can’t work it out

You need to fill in your own data, and remember to save the result as an object.

HastingsModel <- glm(Count ~ Area + Era, family = poisson(link = "log"), data = YourData)

You can fit either an additive (+) or an interaction (*) model.
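
To make the difference between the two formulas concrete, here is a small sketch fitted to simulated stand-in data (not the real records). SimData, AdditiveModel and InteractionModel are illustrative names, and the Era labels "pre" and "post" are an assumption about the coding.

```r
set.seed(1)
# Simulated stand-in data with the exercise's column names (NOT the real records)
SimData <- expand.grid(Year = 1895:1954,
                       Area = c("Hastings", "Kent", "Sussex"))
SimData$Era <- ifelse(SimData$Year < 1925, "pre", "post")
SimData$Count <- rpois(nrow(SimData), lambda = 2)

# Additive model: Area and Era effects add up on the log scale
AdditiveModel <- glm(Count ~ Area + Era,
                     family = poisson(link = "log"), data = SimData)

# Interaction model: the Area effect is allowed to differ between eras
# (Area * Era is shorthand for Area + Era + Area:Era)
InteractionModel <- glm(Count ~ Area * Era,
                        family = poisson(link = "log"), data = SimData)

coef(AdditiveModel)     # intercept, 2 Area terms, 1 Era term
coef(InteractionModel)  # the same 4 plus 2 Area:Era terms
```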

B2. Run the model in R (this would not be in the exam but it is helpful here).

—–

Part C: Model selection

C1. How can you test the hypothesis ‘there is an interaction between Era and Area’, i.e. is the effect of area different before and after 1925? Include the name of the method you would use and say why you picked that method.

General hint

The hypothesis mentioned above is asking if there is an interaction. This is the same as asking which variables to include in a model. We have a specific idea of which variables we think are important; we are not testing lots of different ones. We need to find out whether the interaction is needed, and select the model that balances explanation and complexity.

R hint

The code you need, with slight edits, is in the "R this week" section at the top of this document. You won’t need all the code listed, so pick the right part! It should take 2 lines.
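
If you want to see the shape of a confirmatory comparison before running your own, the sketch below applies a likelihood ratio test to simulated stand-in data. The data are simulated without any interaction, so the result says nothing about Hastings, and the object names are illustrative only.

```r
set.seed(1)
# Simulated stand-in data (NOT the real records), generated with no interaction
SimData <- expand.grid(Year = 1895:1954,
                       Area = c("Hastings", "Kent", "Sussex"))
SimData$Era <- ifelse(SimData$Year < 1925, "pre", "post")
SimData$Count <- rpois(nrow(SimData), lambda = 2)

# Null model: no interaction. Alternative model: with interaction.
NullModel <- glm(Count ~ Area + Era,
                 family = poisson(link = "log"), data = SimData)
AltModel <- glm(Count ~ Area * Era,
                family = poisson(link = "log"), data = SimData)

# Likelihood ratio test: is the drop in deviance larger than expected by chance?
LRTResult <- anova(NullModel, AltModel, test = "LRT")
LRTResult

# The p-value for the added interaction terms is in the second row
LRTResult[2, "Pr(>Chi)"]
```

The second row of the table compares the two models: Df is the number of extra parameters, Deviance is the drop in residual deviance, and the last column is the p-value.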

C2. Run the method you chose for C1. What is your conclusion regarding the hypothesis? (Again, in an exam you wouldn’t run this yourself in R.)

General hint

Make sure to include support for your conclusion. Which part of the output did you use to make the choice?

—–

Part D: Model checking

D1. What are the assumptions of this model? (I count 6)

General hint

These were listed in the Week 10 lectures but can also be found on Google. There are several different ways to write them, but I am looking for approximately 6.

D2. How can we test these assumptions? (I expect 4 checks)

General hint

Just list the different methods here with the assumptions they test. You run them in the next part.

They are not all plots.

D3. Run the checks in R for your preferred GLM model. This should be decided by the outcome of C2. (Again, you wouldn’t need to do this in an exam.)

R hint

All the necessary code is in the "R this week" section at the top of this document.
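
The checks could be strung together roughly as follows. This sketch uses simulated stand-in data, so the plots will look better behaved than the real ones may. Model, fitted_vals and resids are illustrative names, and deviance residuals (R's default for a GLM) are an assumption about which residual type to plot.

```r
set.seed(1)
# Simulated stand-in data (NOT the real records)
SimData <- expand.grid(Year = 1895:1954,
                       Area = c("Hastings", "Kent", "Sussex"))
SimData$Era <- ifelse(SimData$Year < 1925, "pre", "post")
SimData$Count <- rpois(nrow(SimData), lambda = 2)

Model <- glm(Count ~ Area * Era,
             family = poisson(link = "log"), data = SimData)

fitted_vals <- fitted(Model)
resids <- residuals(Model)  # deviance residuals by default for a glm

# Residuals vs fitted: look for structure or changing spread
plot(fitted_vals, resids, xlab = "Fitted values", ylab = "Deviance residuals")
abline(h = 0, lty = 2)

# Normal QQ plot of the residuals
qqnorm(resids)
qqline(resids)

# Cook's distance: look for unusually influential observations
plot(Model, which = 4)

# Not a plot: ratio of residual deviance to residual degrees of freedom
deviance(Model) / df.residual(Model)
```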

D4. How good is the fit of the model based on your checks?

General hint

Go through each check and determine if the assumption is met.

D5. Would you want to improve it? If so, how?

General hint

Think about which assumption wasn’t met. How do you fix it?

D1, D2, D4, and D5 could be in an exam. Running the checks in R would not be, but interpreting the R output would.

D6. Try an improvement.

R hint

The code for possible corrections is also included in the "R this week" section at the start of this document.

—–

Part E: Conclusions

E1. Interpret the output (coefficient values) for your final model (the one you decided on in C2 and used in D).

General hint

Focus on the size of the effects and whether they have statistical support, i.e. when we include uncertainty, are we still clear about the direction? Remember to include the correct uncertainty numbers in your answer.
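
One possible way to pull out effect sizes with their uncertainty is sketched below, again on simulated stand-in data. It uses Wald intervals (estimate plus or minus about 1.96 standard errors) from confint.default(), which is a simplifying assumption. If your model was overdispersed, take the standard errors from your corrected summary instead.

```r
set.seed(1)
# Simulated stand-in data (NOT the real records)
SimData <- expand.grid(Year = 1895:1954,
                       Area = c("Hastings", "Kent", "Sussex"))
SimData$Era <- ifelse(SimData$Year < 1925, "pre", "post")
SimData$Count <- rpois(nrow(SimData), lambda = 2)

Model <- glm(Count ~ Area * Era,
             family = poisson(link = "log"), data = SimData)

# Coefficients and 95% Wald intervals are on the log (link) scale
Estimates <- coef(Model)
WaldCI <- confint.default(Model)
cbind(Estimate = Estimates, WaldCI)

# Exponentiate to get multiplicative effects on the expected count
exp(cbind(Estimate = Estimates, WaldCI))
```

On the log scale, an interval that excludes 0 means the direction of the effect is clear. On the exponentiated scale the equivalent reference value is 1.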

Deciding what this means for bird fraud is the next part.

E2. Do you think there was fraud going on in Hastings pre-1925? Explain your answer.

We have plotted the data to help with this too. You can use this as well as the other information to support your conclusion.

General hint

Now use the interpretation you had before to draw a conclusion.

[Figure: plot of the data]

—–

Part F: Reflection

F1. Use the solution to correct your answers.

F2. Think about how this went. Were you still using most of the hints? What were you still unsure of? Are there some areas you want to prioritise for revision? Were any of the answers surprising?

F3. Email Emily if there are any particular areas you would like covered in a summing-up lecture (probably virtual and prerecorded).

General hint

These have no right answer but are up to you.

Emily G Simmonds
