This document contains information, questions, R code, and plots.
Hints and reminders are in bold.
Questions appear in blue.
In this exercise, the aim is to practise all of the modelling tools you have learnt in this course. We indicate how these questions might relate to an exam and how long an answer is required in each case. You can use the solution to grade your own work and then direct your revision.
You can decide how much help you want. For each question there is a general hint and an R hint (if applicable).
In the exam you will not get hints. So, now is the time to practice without!
Things to remember:
glm(Y ~ X1+X2, family=YOURFAMILY(link=YOURLINK), data=Data)
anova(NullModel, AltModel, test = "LRT") # for confirmatory model selection
AIC(Model1) # for exploratory model selection
BIC(Model1) # for exploratory model selection
plot(HastingsModel, which=4) # Cook’s distance
plot(fitted, residuals) # residuals vs fitted
qqnorm(residuals) # normal QQ plot
New this week:
dispersion <- deviance(Model)/df.residual(Model) # to check for overdispersion
summary(Model, dispersion = dispersion) # to correct for overdispersion
MASS::glm.nb(Y ~ X) # to run a negative binomial GLM
—–
Back in 1962 some British birders (people who spend a lot of time looking at rare birds) suspected that a lot of observations made around Hastings between 1890 and 1930 were fraudulent. John Nelder (co-inventor of GLMs) took a look at the data, and compared Hastings with two nearby areas.
This ‘scandal’ even has its own Wikipedia page. The link is hidden; try to solve the mystery yourselves before looking at the answer.
Your job is to find out whether more rare birds were seen in Hastings than in surrounding areas before 1925, compared with after 1925.
—–
You have data on:
We are only looking at the rarest species, nationally categorised as “Very Rare”. We have numbers observed in nearby areas and the focus area (Hastings) from two time periods. The ‘pre’ time period is when we suspect fraud might have occurred, the ‘post’ time period is after.
Data can be found at: https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv
It is a .csv file with column headers.
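A minimal sketch of how the file could be read into R, assuming an internet connection. The helper name `load_hastings` and the object name `HastingsData` are made up for illustration; only the URL comes from the text above. The column names are not assumed here, so check them with `str()` before writing a model formula.

```r
# URL taken from the exercise text.
hastings_url <- "https://www.math.ntnu.no/emner/ST2304/2019v/Week13/HastingsData.csv"

# Hypothetical helper: read a .csv file that has column headers.
load_hastings <- function(path = hastings_url) {
  read.csv(path, header = TRUE)
}

# Usage (needs an internet connection):
# HastingsData <- load_hastings()
# str(HastingsData)    # column names and data types
# head(HastingsData)   # first few rows
```

Looking at `str()` first is also a quick way to start on question A2, because it shows how R has stored each column.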
—–
In this section, the aim is to practice explaining and justifying different modelling concepts.
Before you begin any analyses, you should think about what steps you will need to take in order to reach a conclusion. This relates to the modelling process.
A1. What steps would you want to take to model the Hastings data? (at least 4 steps)
General hint
Think or look back to the first few weeks or the summary of maximum likelihood concepts.
A2. What are the response and explanatory variable(s), and what data type is each? Be specific.
General hint
Remember we have been introduced to several different data types. The main distinction is categorical/continuous. But continuous can be discrete or fully continuous. Click here for more on data.
A3. What model would you use for this data and why? Include which distribution you would choose for the random part and if you need a link function (say which if you do).
General hint
Think about all of the models we have covered in the course: distributions, linear models, T-tests, generalised linear models. Which works best here?
Then think about what distribution it is using for error.
—–
B1. How would you fit the model from A3 in R? Write one line of code.
General hint: if you are not sure about the model
If you were not sure you got A3 correct: the key thing is that the response variable is count data. Its errors will not be normally distributed, so it cannot be modelled with a linear model. We will need a GLM. A Poisson GLM with a log link is the appropriate choice for count data.
R hint
The code you need to edit slightly is in the R section of this document. Here
No I really can’t work it out
You need to fill in your own data and remember to save as an object.
glm(Count ~ Area+Era, family=poisson(link=log), data=YourData)
You can have either an additive (+) or an interaction (*) model.
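To make the difference between `+` and `*` concrete, here is a small sketch on simulated toy data. The data are made up and are NOT the Hastings data; the column names `Count`, `Area` and `Era` simply follow the template line above and are an assumption about how your own data are named.

```r
# Simulated toy data (made up for illustration, not the Hastings data).
set.seed(42)
toy <- expand.grid(Area = c("Hastings", "Nearby"),
                   Era  = c("post", "pre"),
                   Rep  = 1:10)
toy$Count <- rpois(nrow(toy), lambda = 4)

# Additive model: one effect of Area and one effect of Era.
AdditiveModel <- glm(Count ~ Area + Era,
                     family = poisson(link = "log"), data = toy)

# Interaction model: the effect of Area is allowed to differ between eras.
InteractionModel <- glm(Count ~ Area * Era,
                        family = poisson(link = "log"), data = toy)

# The interaction model estimates one extra coefficient (Area:Era).
names(coef(AdditiveModel))
names(coef(InteractionModel))
```

With two levels in each factor, the only difference between the two fits is the single `Area:Era` term.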
B2. Run the model in R (this would not be in the exam but it is helpful here).
—–
C1. How can you test the hypothesis ‘there is an interaction between Era and Area’, i.e. is the effect of area different before and after 1925? Include the name of the method you would use and say why you picked that method.
General hint
The hypothesis mentioned above asks whether there is an interaction. This is the same as asking which variables to include in a model. We have a specific idea of which variables we think are important; we are not testing lots of different ones. We need to find out whether the interaction is needed, and select the model that balances explanation and complexity.
R hint
The code you need to edit slightly is in the R section of this document. Here. You won’t need all the code listed, so pick the right part! It should take 2 lines.
C2. Run the method you chose for C1. What is your conclusion regarding the hypothesis? (Again, in an exam you wouldn’t run this yourself in R.)
General hint
Make sure to include support for your conclusion. Which part of the output did you use to make the choice?
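If the method you chose was a likelihood ratio test, the sketch below shows where the supporting numbers sit in the output. It uses simulated toy data (made up, NOT the Hastings data), and the column names `Count`, `Area` and `Era` are assumptions.

```r
# Simulated toy data (made up for illustration, not the Hastings data).
set.seed(42)
toy <- expand.grid(Area = c("Hastings", "Nearby"),
                   Era  = c("post", "pre"),
                   Rep  = 1:10)
toy$Count <- rpois(nrow(toy), lambda = 4)

# Null model: no interaction. Alternative model: with interaction.
NullModel <- glm(Count ~ Area + Era,
                 family = poisson(link = "log"), data = toy)
AltModel  <- glm(Count ~ Area * Era,
                 family = poisson(link = "log"), data = toy)

# Likelihood ratio test comparing the two nested models.
LRT <- anova(NullModel, AltModel, test = "LRT")
print(LRT)

# The second row holds the comparison: change in deviance, df, and p-value.
LRT[2, c("Df", "Deviance", "Pr(>Chi)")]
```

The row for the alternative model is the part to quote when you justify your conclusion.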
—–
D1. What are the assumptions of this model? (I count 6)
General hint
These were listed in the Week 10 lectures but can also be found on Google. There are several different ways to write them, but I am looking for approx. 6.
D2. How can we test these assumptions? (I expect 4 checks)
General hint
Just list the different methods here with the assumptions they test. You run them in the next part.
They are not all plots.
R hint
All the necessary code is at the top of this document. Here
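As a rough sketch of how the reminder code fits together once a model object exists, the block below runs the checks on simulated toy data (made up, NOT the Hastings data). The object names `Model`, `fitted_vals` and `resids` are illustrative, and the column names are assumptions.

```r
# Simulated toy data (made up for illustration, not the Hastings data).
set.seed(42)
toy <- expand.grid(Area = c("Hastings", "Nearby"),
                   Era  = c("post", "pre"),
                   Rep  = 1:10)
toy$Count <- rpois(nrow(toy), lambda = 4)

Model <- glm(Count ~ Area * Era,
             family = poisson(link = "log"), data = toy)

# Extract fitted values and residuals (deviance residuals by default for a GLM).
fitted_vals <- fitted(Model)
resids      <- residuals(Model)

plot(Model, which = 4)          # Cook's distance
plot(fitted_vals, resids)       # residuals vs fitted
qqnorm(resids); qqline(resids)  # normal QQ plot

# Residual deviance divided by residual degrees of freedom.
dispersion <- deviance(Model) / df.residual(Model)
dispersion
```

The last value is not a plot: it is the number used to check for overdispersion.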
D4. How good is the fit of the model based on your checks?
General hint
Go through each check and determine if the assumption is met.
D5. Would you want to improve it? If so, how?
General hint
Think about which assumption wasn’t met. How do you fix it?
D1, D2, D4, and D5 could be in an exam. The part where you run the checks in R would not be, but interpretation of the R output would.
D6. Try an improvement.
R hint
The code for possible corrections is also included at the start of this document. Here
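A sketch of the two corrections from the reminders, run on simulated counts that are overdispersed by construction (made up, NOT the Hastings data). The column names and object names are assumptions. Which correction suits the real data is for you to decide.

```r
# Simulated, deliberately overdispersed toy data (not the Hastings data).
set.seed(42)
toy <- expand.grid(Area = c("Hastings", "Nearby"),
                   Era  = c("post", "pre"),
                   Rep  = 1:10)
toy$Count <- rnbinom(nrow(toy), mu = 5, size = 1)

PoisModel <- glm(Count ~ Area + Era,
                 family = poisson(link = "log"), data = toy)

# Option 1: keep the Poisson fit but rescale the standard errors.
dispersion <- deviance(PoisModel) / df.residual(PoisModel)
summary(PoisModel, dispersion = dispersion)

# Option 2: refit with a negative binomial GLM (MASS ships with R).
NegBinModel <- MASS::glm.nb(Count ~ Area + Era, data = toy)
summary(NegBinModel)
```

Option 1 leaves the coefficient estimates unchanged and multiplies each standard error by the square root of the dispersion; option 2 changes the assumed distribution, so the estimates themselves can shift.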
—–
E1. Interpret the output (coefficient values) for your final model (the one you decided on in C2 and used in D).
General hint
Focus on the size of effects and whether they have statistical support, i.e. when we include uncertainty, are we still clear about the direction? Remember to include the correct uncertainty numbers in your answer.
Deciding what this means for bird fraud is the next part.
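With a log link the coefficients are on the log scale, so exponentiating them gives multiplicative effects on the expected count. The sketch below shows one way to get estimates with uncertainty on that scale, using simulated toy data (made up, NOT the Hastings data) and Wald intervals from `confint.default()`. Profile-likelihood intervals from `confint()` are an alternative, and the intervals would need adjusting if you corrected for overdispersion.

```r
# Simulated toy data (made up for illustration, not the Hastings data).
set.seed(42)
toy <- expand.grid(Area = c("Hastings", "Nearby"),
                   Era  = c("post", "pre"),
                   Rep  = 1:10)
toy$Count <- rpois(nrow(toy), lambda = 4)

Model <- glm(Count ~ Area * Era,
             family = poisson(link = "log"), data = toy)

coefs <- coef(Model)              # estimates on the log scale
ci    <- confint.default(Model)   # 95% Wald intervals on the log scale

# Back-transform: values are now multiplicative effects on the expected count.
exp(cbind(Estimate = coefs, ci))
```

On the back-transformed scale, an interval that spans 1 means the direction of the effect is unclear.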
E2. Do you think there was fraud going on in Hastings pre-1925? Explain your answer.
We have plotted the data to help with this too. You can use this as well as the other information to support your conclusion.
General hint
Now use the interpretation you had before to draw a conclusion.
—–
F1. Use the solution to correct your answers.
F2. Think about how this went. Were you still using most of the hints? What were you still unsure of? Are there some areas you want to prioritise for revision? Were any of the answers surprising?
F3. Email Emily if there are any particular areas you would like covering in a summing up lecture (probably virtual and prerecorded)
General hint
These have no right answer but are up to you.