Exercise 11: Investigating fraud

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are in bold

Questions appear in blue.

Answers are in purple.




In this exercise, the aim is to practice all of the modelling tools you learnt in this course. For each question we indicate how it might relate to an exam and how long an answer is expected. You can use the solution to grade your own work and then direct your revision.

You can decide how much help you want. For each question there is a general hint and an R hint (if applicable).

In the exam you will not get hints. So now is the time to practice without them!




R this week:

Things to remember:

New this week:







The challenge: Investigating fraud in the bird world.

Back in 1962 some British birders (people who spend a lot of time looking at rare birds) suspected that many observations from around Hastings between 1890 and 1930 were frauds. John Nelder (co-inventor of GLMs) took a look at the data and compared Hastings with two nearby areas.

This 'scandal' even has its own Wikipedia page. The link is hidden; try to solve the mystery yourselves before looking at the answer.

Wikipedia

https://en.wikipedia.org/wiki/Hastings_Rarities

Your job is to find out whether more rare birds were seen in Hastings than in the surrounding areas before 1925, compared to after 1925.







The data

You have data on:

We are only looking at the rarest species, nationally categorised as “Very Rare”. We have numbers observed in nearby areas and the focus area (Hastings) from two time periods. The 'pre' time period is when we suspect fraud might have occurred, the 'post' time period is after.

Data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv. It is a .csv file with column headers.

Remember to add stringsAsFactors = TRUE

HastingsData <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv", 
                         header=TRUE, stringsAsFactors = TRUE)
head(HastingsData)
##   Year     Area Count Era
## 1 1895 Hastings     2 Pre
## 2 1896 Hastings     4 Pre
## 3 1897 Hastings     2 Pre
## 4 1898 Hastings     0 Pre
## 5 1899 Hastings     1 Pre
## 6 1900 Hastings     4 Pre







Part A: Choosing an appropriate model

In this section, the aim is to practice explaining and justifying different modelling concepts.

Before you begin any analyses, you should think about what steps you will need to take in order to reach a conclusion. This relates to the modelling process.

A1. What steps would you want to take to model the Hastings data? (at least 4 steps) (4 marks)

General hint

Think back to steps you've taken in previous analyses, or look at Section G of last week's module for some prompts.

This question is asking for the general steps, that you would take for any data, including this one.

One mark for each step up to 4 max

Some of these steps are mixed together; there are 6 steps in total here.

A2. What are the response and explanatory variable(s), and what data type is each? Be specific. (4 marks)

General hint

Remember we have been introduced to several different data types. The main distinction is categorical/continuous. But continuous can be discrete or fully continuous.

The response is the number of records (count data: discrete and numeric, not categorical). The explanatory variables are Era and Area (both categorical).

1 mark for response and explanatory (2 total), 1 for the data type (of each variable - 2 total)

A3. What model would you use for this data and why? Include which distribution you would choose for the random part and whether you need a link function (say which, if you do). (3 marks)

General hint

Think about all of the models we have covered in the course: distributions, linear models, T-tests, generalised linear models. Which works best here?

Then think about what distribution it is using for error.

A Poisson GLM, because the response is count data. We also know the mean value is low, so the counts cannot be approximated by a normal distribution; choosing the Poisson is important here. The Poisson distribution reflects how these count data were generated. Use the log link: it is the canonical link for the Poisson and ensures predicted means are positive.

1 mark for correct GLM type, 1 mark for reason for Poisson, 1 for correct link
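The point about the log link can be illustrated numerically: whatever value the linear predictor takes, the inverse link exp() always returns a strictly positive mean, which is exactly what a count requires. A minimal sketch (not part of the exercise):

```r
# Linear predictor values on the log scale, including negative ones
eta <- c(-2, -0.5, 0, 0.5, 2)

# Inverse of the log link: back-transformed means
mu <- exp(eta)
print(mu)

# A count mean can never be negative, so this is what we need
all(mu > 0)  # TRUE
```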







Part B: Running the model

B1. How would you fit the model from A3 in R? Write one line of code. (3 marks)

General hint: if you are not sure about the model

If you were not sure you got A3 correct: the key thing is that the response variable is count data. Its errors will not be normally distributed, so it cannot be modelled with a linear model. We need a GLM. A Poisson GLM with a log link is the appropriate choice for count data.

R hint

The code you need to edit slightly is in the R section of this document.

1 mark for correct function (glm), 1 for correct formula, 1 for correct family and link

HastingsModel <- glm(Count ~ Area * Era, family = poisson(link = "log"), data = HastingsData)

B2. Run the model in R (this would not be in the exam but it is helpful here).

coef(HastingsModel)
##       (Intercept)          AreaKent        AreaSussex            EraPre   AreaKent:EraPre 
##         0.6216882        -0.7308875        -1.0445450         1.4373860        -2.3642786 
## AreaSussex:EraPre 
##        -1.7404662
confint(HastingsModel)
## Waiting for profiling to be done...
##                        2.5 %     97.5 %
## (Intercept)        0.3425666  0.8770545
## AreaKent          -1.2138196 -0.2736571
## AreaSussex        -1.5925425 -0.5405303
## EraPre             1.1511621  1.7421101
## AreaKent:EraPre   -3.1651615 -1.6242250
## AreaSussex:EraPre -2.4930707 -1.0060774







Part C: Model selection

C1. How can you test the hypothesis: 'there is an interaction between Era and Area' i.e. Is the effect of area different before and after 1925? Include the name of the method you would use and say why you picked that method. (3 marks)

General hint

The hypothesis mentioned above is asking if there is an interaction. This is the same as asking which variables to include in a model. We have a specific idea of which variables we think are important, we are not testing lots of different ones. We need to find if including the interaction is needed and select the model that balances explanation and complexity.

R hint

The code you need to edit slightly is in the R section of this document. You won't need all the code listed, so pick the right part! It should take 2 lines.

Conduct model selection. In this case it is confirmatory because we have a specific hypothesis of whether there is an interaction between era and area. As the model is a GLM we do an analysis of deviance.

1 mark = confirmatory model selection, 1 mark = why, 1 mark = analysis of deviance

C2. Run the method you chose in C1. What is your conclusion regarding the hypothesis? (Again, in an exam you would not run this in R yourself.)

General hint

Make sure to include support for your conclusion. Which part of the output did you use to make the choice?

NullModel <- glm(Count ~ Area + Era, family = poisson(link = "log"), data = HastingsData)
anova(NullModel, HastingsModel, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: Count ~ Area + Era
## Model 2: Count ~ Area * Era
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       176     428.04                          
## 2       174     372.82  2   55.226 1.018e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test was significant (p < 0.05), so we reject the null hypothesis and choose the model with an interaction.







Part D: Model checking

D1. What are the assumptions of this model? (I count 6) (so 6 marks)

General hint

These were listed in the GLM module but can also be found online. There are several different ways to write them, but I am looking for approximately 6.

GLM assumptions should focus around: Lack of outliers, Correct distribution used, Correct link function is used, Correct variance function is used, Dispersion parameter is constant, Independence of y. You could also say more on the variance assumptions here as we know we have a Poisson GLM i.e. that variance = mean in Poisson GLM, or no over- or under-dispersion. This last bit would get an extra point.

1 mark for each assumption up to 6

D2. How can we test these assumptions? (I expect 4 checks) (again 4 marks)

General hint

Just list the different methods here with the assumptions they test. You run them in the next part.

They are not all plots.

Plots such as residuals vs fitted, Cook's distance, and the normal Q-Q plot. We also need to check for overdispersion using the deviance ratio.

1 mark for each plot/test

D3. Run the checks in R for your preferred GLM. This should be decided by the outcome of C2. (Again, you would not need to do this in an exam.)

R hint

All the necessary code is at the top of this document.

Answers in the R code

# We can test lack of outliers with Cook's distance plot
plot(HastingsModel, which=4)

[Plot: Cook's distance for HastingsModel]

# We can test correct distribution and link from our knowledge of data
# We can also look at whether the deviance residuals have equal variance and seem normal
# use residuals vs fitted
# this also checks linearity
residuals <- residuals(HastingsModel)
fitted <- fitted(HastingsModel)
plot(fitted,residuals)

[Plot: residuals vs fitted values]

# and normal QQ
qqnorm(residuals)
qqline(residuals)

[Plot: normal Q-Q plot of the residuals]

# We can test variance with overdispersion and deviance ratio
deviance(HastingsModel)/df.residual(HastingsModel)
## [1] 2.142622
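As a reference point, for data genuinely generated by a Poisson model the deviance ratio should sit close to 1, rather than the 2.14 we see here. A quick simulation sketch (the lambda value is an arbitrary choice for illustration):

```r
set.seed(42)

# Simulate counts from a true Poisson model with the same sample size
# as the Hastings data (180 observations)
y <- rpois(180, lambda = 2)

# Fit an intercept-only Poisson GLM and compute the deviance ratio
m <- glm(y ~ 1, family = poisson(link = "log"))
deviance(m) / df.residual(m)
```

With truly Poisson data, repeating this gives values close to 1; the Hastings model's ratio of about 2.14 therefore signals overdispersion.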

D4. How good is the fit of the model based on your checks? (5 marks)

General hint

Go through each check and determine if the assumption is met.

The model fit seems OK but overdispersed. Normality (normal Q-Q), linearity (residuals vs fitted), outliers (Cook's distance), and equality of variance (residuals vs fitted) all look OK. But the model is overdispersed (deviance ratio ≈ 2.14).

1 mark for an overall answer (ok, bad, good), then 1 mark for each assumption checked together with the method used.

D5. Would you want to improve it? If so, how? (2 marks)

General hint

Think about which assumption wasn't met. How do you fix it?

We should look at ways to deal with the overdispersion

1 mark for yes/no, 1 mark for improvement suggestion

D6. Try an improvement. (would not be in exam)

R hint

The code for possible corrections is also included at the start of this document.

HastingsModelNB <- MASS::glm.nb(Count ~ Area*Era, data = HastingsData)
summary(HastingsModelNB)
## 
## Call:
## MASS::glm.nb(formula = Count ~ Area * Era, data = HastingsData, 
##     init.theta = 1.608068033, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3863  -1.0484  -0.8008   0.5106   2.0726  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1999   3.110 0.001871 ** 
## AreaKent           -0.7309     0.3160  -2.313 0.020733 *  
## AreaSussex         -1.0445     0.3377  -3.093 0.001980 ** 
## EraPre              1.4374     0.2533   5.676 1.38e-08 ***
## AreaKent:EraPre    -2.3643     0.4848  -4.877 1.08e-06 ***
## AreaSussex:EraPre  -1.7405     0.4743  -3.670 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.6081) family taken to be 1)
## 
##     Null deviance: 372.60  on 179  degrees of freedom
## Residual deviance: 178.17  on 174  degrees of freedom
## AIC: 557.96
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.608 
##           Std. Err.:  0.404 
## 
##  2 x log-likelihood:  -543.956
# Correct the likelihood: scale the standard errors using the estimated dispersion
dispersion <- deviance(HastingsModel)/df.residual(HastingsModel)
summary(HastingsModel, dispersion = dispersion)
## 
## Call:
## glm(formula = Count ~ Area * Era, family = poisson(link = log), 
##     data = HastingsData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9595  -1.1447  -0.8424   0.6478   5.3350  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1992   3.121  0.00180 ** 
## AreaKent           -0.7309     0.3494  -2.092  0.03645 *  
## AreaSussex         -1.0445     0.3904  -2.676  0.00746 ** 
## EraPre              1.4374     0.2202   6.527 6.70e-11 ***
## AreaKent:EraPre    -2.3643     0.5706  -4.143 3.43e-05 ***
## AreaSussex:EraPre  -1.7405     0.5514  -3.156  0.00160 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 2.142622)
## 
##     Null deviance: 848.02  on 179  degrees of freedom
## Residual deviance: 372.82  on 174  degrees of freedom
## AIC: 650.03
## 
## Number of Fisher Scoring iterations: 5
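A third option, not shown above, is the quasipoisson family, which estimates the dispersion parameter from the data and inflates the standard errors accordingly, much like the manual correction. A sketch on simulated overdispersed counts (the toy data are invented for illustration; with the real data you would use the same formula as HastingsModel):

```r
set.seed(1)

# Toy data: overdispersed counts (negative binomial, so variance > mean)
toy <- data.frame(
  Count = rnbinom(180, mu = 2, size = 1.5),
  Area  = rep(c("Hastings", "Kent", "Sussex"), each = 60),
  Era   = rep(c("Pre", "Post"), 90)
)

# Quasipoisson GLM: same mean model as the Poisson, dispersion estimated
m_qp <- glm(Count ~ Area * Era, family = quasipoisson(link = "log"), data = toy)

# Estimated dispersion parameter (noticeably above 1 for these data)
summary(m_qp)$dispersion
```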







Part E: Conclusions

E1. Interpret the output (coefficient values) for your final model (the one you decided on in C2 and used in D). (14 marks)

General hint

Focus on the size of the effects and whether they have statistical support, i.e. when we include uncertainty, are we still clear about the direction? Remember to include the correct uncertainty numbers in your answer.

Deciding what this means for bird fraud is the next part.

In all interpretation from GLMs, remember the link function! To answer this question, look at the coefficient estimates AND the confidence intervals. You will need to predict the mean number of birds seen for each area back on the original scale, using the inverse of the log link, exp() (it will be easier to compare). You can also see whether we are confident about the direction of each difference by looking at the confidence intervals.

We can see that the mean for Hastings after 1925 is exp(0.62) = 1.86; this is the intercept of the model (our reference group). To find the means of the other areas for the same time period we need to predict them using the linear equation and the inverse of the link, \(Y_i = e^{(\alpha + \beta X_i)}\). We get: Kent = exp(0.62 - 0.73) = 0.90 and Sussex = exp(0.62 - 1.04) = 0.66. So there do seem to be differences between the areas after 1925. None of the confidence intervals cross 0 (-1.43 to -0.03 for Kent and -1.82 to -0.26 for Sussex), so even with uncertainty the estimated direction of these differences is the same. Remember there was overdispersion, so we need to scale the standard errors by the square root of the dispersion to get better confidence estimates (correcting the likelihood), or use a negative binomial model. The results here are for the corrected likelihood; the standard errors from the negative binomial model lead to the same conclusions.

If we look at the effect of Era for Hastings, we can see that pre-1925 had a higher number of rare birds. This effect is 1.44 (SE 0.22) on the log scale, giving a mean of exp(0.62 + 1.44) ≈ 7.84 for pre-1925. This is MUCH higher than post-1925. If we then look at the interaction, we can see that both Kent and Sussex have negative interaction terms for pre-1925, meaning they are lower than would be expected from an additive model. We can work out the means: Kent pre-1925 = exp(0.62 - 0.73 + 1.44 - 2.36) ≈ 0.35 and Sussex pre-1925 = exp(0.62 - 1.04 + 1.44 - 1.74) ≈ 0.48. From this we can see that pre-1925, both Kent and Sussex had lower mean counts than post-1925 (the interaction standard errors were 0.57 and 0.55, both smaller than half the size of the estimated differences, so we can tell that the confidence intervals would not cross 0). Hastings, on the other hand, had much higher counts. Hastings behaves differently pre- and post-1925 compared to Sussex and Kent, which causes the interaction between Era and Area. You can give results from either the negative binomial or the corrected Poisson; the interpretation is the same.
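The back-transformed group means can be reproduced directly from the printed coefficients. A sketch using the rounded estimates from the Poisson model output (the vector b is just those numbers typed in by hand, so tiny rounding differences from predict() are expected):

```r
# Coefficients copied (rounded) from the Poisson GLM output
b <- c(Intercept  = 0.6217, AreaKent = -0.7309, AreaSussex = -1.0445,
       EraPre     = 1.4374, KentPre  = -2.3643, SussexPre  = -1.7405)

# Predicted mean counts on the original scale via the inverse log link
means <- c(
  Hastings_post = exp(b[["Intercept"]]),
  Kent_post     = exp(b[["Intercept"]] + b[["AreaKent"]]),
  Sussex_post   = exp(b[["Intercept"]] + b[["AreaSussex"]]),
  Hastings_pre  = exp(b[["Intercept"]] + b[["EraPre"]]),
  Kent_pre      = exp(b[["Intercept"]] + b[["AreaKent"]] +
                      b[["EraPre"]] + b[["KentPre"]]),
  Sussex_pre    = exp(b[["Intercept"]] + b[["AreaSussex"]] +
                      b[["EraPre"]] + b[["SussexPre"]])
)
round(means, 2)
```

Hastings is the only area whose mean count rises pre-1925; Kent and Sussex both fall.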

coef(HastingsModel)
##       (Intercept)          AreaKent        AreaSussex            EraPre   AreaKent:EraPre 
##         0.6216882        -0.7308875        -1.0445450         1.4373860        -2.3642786 
## AreaSussex:EraPre 
##        -1.7404662
# Results - corrected the likelihood
summary(HastingsModel, dispersion = dispersion)
## 
## Call:
## glm(formula = Count ~ Area * Era, family = poisson(link = log), 
##     data = HastingsData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9595  -1.1447  -0.8424   0.6478   5.3350  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1992   3.121  0.00180 ** 
## AreaKent           -0.7309     0.3494  -2.092  0.03645 *  
## AreaSussex         -1.0445     0.3904  -2.676  0.00746 ** 
## EraPre              1.4374     0.2202   6.527 6.70e-11 ***
## AreaKent:EraPre    -2.3643     0.5706  -4.143 3.43e-05 ***
## AreaSussex:EraPre  -1.7405     0.5514  -3.156  0.00160 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 2.142622)
## 
##     Null deviance: 848.02  on 179  degrees of freedom
## Residual deviance: 372.82  on 174  degrees of freedom
## AIC: 650.03
## 
## Number of Fisher Scoring iterations: 5
# Results - negative binomial glm
summary(HastingsModelNB)
## 
## Call:
## MASS::glm.nb(formula = Count ~ Area * Era, data = HastingsData, 
##     init.theta = 1.608068033, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3863  -1.0484  -0.8008   0.5106   2.0726  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.6217     0.1999   3.110 0.001871 ** 
## AreaKent           -0.7309     0.3160  -2.313 0.020733 *  
## AreaSussex         -1.0445     0.3377  -3.093 0.001980 ** 
## EraPre              1.4374     0.2533   5.676 1.38e-08 ***
## AreaKent:EraPre    -2.3643     0.4848  -4.877 1.08e-06 ***
## AreaSussex:EraPre  -1.7405     0.4743  -3.670 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(1.6081) family taken to be 1)
## 
##     Null deviance: 372.60  on 179  degrees of freedom
## Residual deviance: 178.17  on 174  degrees of freedom
## AIC: 557.96
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  1.608 
##           Std. Err.:  0.404 
## 
##  2 x log-likelihood:  -543.956

2 marks for each result discussed (I count 6: each location before 1925 and each location after, plus the interaction and whether it made a difference). You MUST include confidence intervals and a correct understanding of which scale you interpret on to get the marks.


E2. Do you think there was fraud going on in Hastings pre-1925? Explain your answer (5 marks)

We have plotted the data to help with this too. You can use this as well as the other information to support your conclusion.

General hint

Now use the interpretation you had before to draw a conclusion.

[Plot of the data: counts by Area and Era]

It is very suspicious that Hastings follows the opposite pattern to the other areas, particularly because its mean count of rare birds was over four times higher pre-1925 than after (exp(1.44) ≈ 4.2). There could be biological reasons for this, such as a decline of rare species over time, but fraud seems more likely given the patterns in the other areas.

1 mark for a yes/no answer. 4 marks for giving reasons for how you made that decision







Part F: Reflection

F1. Think about how this went. Were you still using most of the hints? What were you still unsure of?

F2. Are there some areas you want to prioritise for revision?

F3. Is there anything you are very proud of from your group?

F4. Anything specific for your final feedback?