Exercise 4: Model checking

SOLUTION

Instructions:

Hints and reminders are in italics.

Questions appear in blue.

Answers appear in red or as R code.

This exercise needs to be completed and handed in by 21st February.


Resources:


The challenge: How many police officers do you need?

You are part of the police headquarters team in Chicago. It is predicted to be a very cold weekend, with an average temperature of -10ºC. Obviously most of the police officers would like to make the most of the cold weather by going skiing and ice skating. But you also need to keep the public safe. Previous research has shown that approximately 2 police officers (they work in pairs) are needed for every 20 daily crimes. The Chief of Police has asked your team to provide a recommendation for how many officers they need for this cold weekend.

Luckily (as always) you already have data on temperature and crime numbers that you can use.

It is your job to find out how many police officers you would recommend to be on duty on Saturday.



The data from last week can be found at https://www.math.ntnu.no/emner/ST2304/2019v/Week5/NoFreezeTempCrimeChicago.csv

The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

NoFreeze <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week5/NoFreezeTempCrimeChicago.csv", 
                     header=T)

The next step is to plot the data; it's good to remind ourselves what it looks like. The regression line from last week's linear model is also included.

plot(NoFreeze$TempC, NoFreeze$Crime, pch=16, xlab="Temp (ºC)",
     ylab="Daily crime number", las=1)
abline(lm(Crime~TempC, data=NoFreeze), col=2)

[Plot: daily crime number vs temperature, with the fitted regression line]

1. Using the linear regression model from last week, predict the number of daily crimes for an average temperature of -10ºC. Report the answer with a prediction interval.

Hint 1: you will need to use the predict() function, create newdata, and predict with a prediction interval

Hint 2: for lm() you will need to use the format lm(y ~ x, data = YourData)

# run the model
model1 <- lm(Crime~TempC, data=NoFreeze)

# create newdata
newdata <- data.frame(TempC=-10)

# create the predictions
predictions <- predict(model1, newdata, interval="prediction")

2. Based on this result, how many police officers would you recommend to be on duty? Explain the reasons behind your answer, and include the prediction interval.

The lower prediction interval is 760 crimes and the upper is 1021, so you could recommend between 76 and about 102 police officers. The exact number is a choice: you might choose 104 in case there are more crimes, or 76 to save money. As long as the prediction interval is correctly interpreted, any choice is fine, but it should be justified in the context of the interval. A 95% prediction interval means: if you took a sample and calculated a prediction interval many, many times, and then collected a new observation at a given X value, you would expect the new observation to fall inside the interval for 95% of the samples.
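As a sketch of the arithmetic, using the 2-officers-per-20-crimes rule from the brief and the interval bounds quoted above (the officers() helper is made up for illustration, not part of the exercise):

```r
# Sketch: convert predicted crime numbers to officer numbers.
# Rule from the brief: 2 officers per 20 daily crimes, and officers
# work in pairs, so round up to the nearest even number.
# 760 and 1021 are the prediction interval bounds quoted above.
officers <- function(crimes) 2 * ceiling(crimes / 20)
officers(760)   # 76 officers at the lower bound
officers(1021)  # 104 officers at the upper bound (102.1 rounded up to a full pair)
```

Rounding up to a pair is why 1021 crimes gives 104 officers rather than 102.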


You are speaking to a colleague at lunch, discussing this project (because you are very excited about it). Your colleague tells you they have some extra data. They have mean daily crime numbers from temperatures under 0ºC. This is just what you needed!

You can find the new complete dataset here:

https://www.math.ntnu.no/emner/ST2304/2019v/Week6/TempCrimeChicago.csv

As before, the first step is to import the data and plot it.

WithFreeze <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week6/TempCrimeChicago.csv",
                       header=T)

plot(WithFreeze$TempC, WithFreeze$Crime, pch=16, xlab="Temp (ºC)",
     ylab="Daily crime number", las=1)

[Plot: daily crime number vs temperature, now including days below 0ºC]

3. Fit a linear regression to the complete dataset including days <0ºC and **plot** the regression line

Remember: lm(y ~ x, data = YourData) and abline

# Create a new model
model2 <- lm(Crime ~ TempC, data = WithFreeze)

# Repeat the plot above
plot(WithFreeze$TempC, WithFreeze$Crime, pch=16, xlab="Temp (ºC)",
     ylab="Daily crime number", las=1)

# Add the model line
abline(model2, col=2)

[Plot: complete dataset with the fitted regression line]

4. Look at your plot of the data and regression line. What do you think of the fit? Are there any problems with using a straight line for these data? Do this by eye, not with formal analyses.

The extra data has added a curve: crime now increases steeply up to 0ºC and then levels off to a shallower slope. There are also some outliers at around -17ºC. A straight line is a very bad fit to these data; the linearity assumption does not seem to be met. As a result, the equal-variance assumption is also violated, because there is higher variance at low X values (temperatures) than in the middle or at high X values. Overall, the fit is not very good.

You have had a go at checking this model just by looking at the data and the regression line. But there are some more thorough ways we can explicitly check whether our model meets the assumptions of a linear regression.

The diagnostic plots that statisticians typically use are: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage. Here we will look at Residuals vs Fitted, Normal Q-Q, and the closely related Cook's distance plot.

5. Create a residuals vs fitted plot for the linear model on all the data. Interpret the plot in terms of model fit.

Think about which assumption this plot assesses, what you would expect it to look like if the assumption is met, and how your data differ from this

The Residuals vs Fitted plot helps us assess the variance of the residuals: we can see whether it is equal across the fitted values. It also helps us check linearity; if there is no structure in the residuals, the linearity assumption holds. Here the residuals follow a clear curve, which looks quadratic (we will come back to that later), so the relationship is not linear. The linearity assumption is violated at the moment.

# I have called my model model2, replace this with the name 
# of your model

# create a vector of rounded residuals
CrimeResiduals <- round(residuals(model2),2)

# create a vector of rounded fit
CrimeFitted <- round(fitted(model2),2)

# plot the fitted and residuals
plot(CrimeFitted, CrimeResiduals)
# add a horizontal line at 0
# line is grey and dashed (lty=2)
abline(h=0, lty=2, col="grey")

[Plot: residuals vs fitted values for model2]

6. Create a Normal Q-Q plot for the linear model on all the data. Interpret the plot in terms of model fit.

Again: think about which assumption this plot assesses, what you would expect it to look like if the assumption is met, and how your data differ from this

The Normal Q-Q plot assesses normality of the residuals. It plots the observed residuals against theoretical quantiles, i.e. the values expected from a normal distribution. If the residuals are normal, we expect them to sit close to the Q-Q line; not exactly on it, but close. For our data this looks reasonable: most points sit along the line, with some deviation at the high and low ends, but deviation in the tails is common and here it is not too large. I would say normality is sufficiently met.

# use the residuals you calculated above 
# now create the Normal Q-Q plot 
qqnorm(CrimeResiduals)
# add the ideal line
qqline(CrimeResiduals)

[Plot: Normal Q-Q plot of the residuals]

7. Interpret the Cook's distance plot below. What does it tell you that the Residuals vs fitted and Normal Q-Q plots did not?

Cook's distance looks at the influence of data points: it measures how much the fitted values change if you remove one data point. Points with a high Cook's distance (labelled on the plot) have high influence and could cause problems. Points 1, 2, and 7 are marked on the plot, i.e. the 1st, 2nd, and 7th rows of the data set. Number 7 is particularly worrying; it turns out to be an outlier, and is one of the points we remove at question 9.

# Then create the last plot for model checking.
# This is done by using the plot function (which would
# create 4 plots of the model) and choosing just the
# Cook's distance plot (which=4). A bit of a cheat to get a nice plot.
plot(model2, which=4)

[Plot: Cook's distance for model2, with points 1, 2, and 7 labelled]
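To see how a single extreme point dominates Cook's distance, here is a small self-contained sketch with made-up data (not the crime data): ten points on a perfect straight line plus one deliberately extreme point.

```r
# Made-up data: ten points on the line y = 2x, plus one extreme point
x <- c(1:10, 30)
y <- c(2 * (1:10), 100)   # the 11th point sits far from the line, at extreme x
m <- lm(y ~ x)

# Cook's distance for each point; the 11th has both a large residual
# and high leverage, so it dominates
cd <- cooks.distance(m)
which.max(cd)             # the extreme 11th point
```

Removing that one point would change all the fitted values noticeably, which is exactly what a high Cook's distance flags.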

8. Based on your assessments of the model checking plots, how could you improve the model fit for these data? (suggest at least 2 things you could do - try not to cheat)

The main problem is linearity, but possibly also influential points. To address linearity you could transform the response (log, square root, or a power such as a square); the data look very curved, so you could also include a quadratic term (squared temperature) as an explanatory variable. For the influential points, you could look at removing high-leverage points and outliers. You do not need to know which of these WILL work, just suggest options that might. You could also suggest modelling temperatures below and above 0ºC separately; this is not what we would recommend here, but it can be suggested. All reasonable ideas are welcome.

Exercise 4: CONTINUED - improving models

SOLUTION


The colleague that gave you the extra data has come back to see how you are getting on. They suggest that the main assumption not being met is linearity. A straight line does not seem to capture the data because it is curved. There are also some outliers.

Your team decides to remove the outliers. You have reason to believe they might be typos.

9. What are some positives and negatives of removing outliers? What things should you consider when removing them?

Positives: removing outliers can help improve the fit of a model. Some outliers come from typos or data errors, so removing them makes the data more representative of the real process, rather than keeping incorrect data in.

Negatives: outliers could be true data that are just a bit different. If they are removed simply because they don't fit an expectation, data are lost and patterns might be missed. An outlier could be showing something important, e.g. a threshold or the influence of another variable, and could hold biological meaning.

When removing outliers you should always consider why a point is an outlier. Do not remove it just because you don't like it and want neat data. Removal should be justified, e.g. by a reason to believe the data point is incorrect, such as a typo or failed equipment.

# Remove the outliers (my data was called WithFreeze).
# This is done by using indexing brackets [,].
# These work by searching inside a data set
# by row first, then column, e.g. [row, column].
# Here we take ALL columns, which is why the column
# position is blank, but only the rows whose residuals
# are less than 200. Basically we remove the data points
# that created the highest residuals.

NoOutliers <- WithFreeze[residuals(model2) < 200,]

plot(NoOutliers$TempC, NoOutliers$Crime, pch=16, xlab="Temp (ºC)",
     ylab="Daily crime number", las=1)

[Plot: daily crime number vs temperature with the outliers removed]

The data are still curved, so you will need to use a transformation of the response variable or a polynomial term (square, cube, etc.). But which one?

You can use Box-Cox to indicate what kind of transformation might help improve the linear regression. The plot shows the likelihood for different powers of transformation, e.g. 2 is a squared transformation, 3 is cubic, and so on.

# You might need to install the package MASS
# install.packages("MASS")

# Run the Box-Cox analysis
MASS::boxcox(model2, lambda = seq(1,4, length=30))

[Plot: Box-Cox profile likelihood across powers of transformation]

Box-Cox suggests a quadratic (x²) transformation. You could either transform the response variable OR add a quadratic term as an explanatory variable. You choose the second option and add the quadratic term.

# Create a linear model.
# To add a quadratic (or any power) term you must write the
# explanatory variable as I(variable^2) AND keep the original
# explanatory variable. See the example below.
# We need both the linear and quadratic components.

model3 <- lm(Crime ~ TempC + I(TempC^2), data = NoOutliers)

# Also plot the Residuals vs fitted
# create a vector of rounded residuals
CrimeResiduals2 <- round(residuals(model3),2)

# create a vector of rounded fit
CrimeFitted2 <- round(fitted(model3),2)

# plot the fitted and residuals
plot(CrimeFitted2, CrimeResiduals2)
# add a horizontal line at 0
# line is grey and dashed (lty=2)
abline(h=0, lty=2, col="grey")

[Plot: residuals vs fitted values for the quadratic model, model3]
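Note that abline() can only draw straight lines, so it cannot show a quadratic fit. To draw the curve you can predict over a grid of temperatures and use lines(). Here is a self-contained sketch with simulated data standing in for the crime data (the real code would use model3 and NoOutliers):

```r
# Simulated curved data, standing in for the crime data
set.seed(1)
TempC <- seq(-20, 30, length.out = 80)
Crime <- 800 + 10 * TempC - 0.5 * TempC^2 + rnorm(80, sd = 30)
sim <- data.frame(TempC, Crime)

# Quadratic model, same structure as model3
m <- lm(Crime ~ TempC + I(TempC^2), data = sim)

# Predict over a fine grid of temperatures, then draw the curve
grid <- data.frame(TempC = seq(-20, 30, length.out = 200))
plot(sim$TempC, sim$Crime, pch = 16, xlab = "Temp (ºC)",
     ylab = "Daily crime number", las = 1)
lines(grid$TempC, predict(m, grid), col = 2)
```

The same predict-then-lines pattern works for any fitted curve, not just quadratics.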

10. Look at the new Residuals vs Fitted plot. What do you think of this new model? Has the fit improved?

Yes, it generally looks better: there is less structure in the residuals now. Being very critical, there is some bunching at high fitted values, and perhaps a little more variance at the very highest fitted values, but generally I would now be happy with the fit. You could also include other plots, e.g. Cook's distance and Normal Q-Q, to further check the model, but you don't have to.

11. What about usefulness? How much of the variance in daily crime numbers does this model explain? Work this out using R squared and explain what it measures.

Hint(basically the whole code): summary(model)$r.squared

# First calculate the R squared.
summary(model3)$r.squared
## [1] 0.8592709
# The R squared is 0.86, so temperature explains 86% of the variance
# in daily crime numbers. This is pretty high! So this model seems to
# be doing pretty well at capturing what is going on with just one
# explanatory variable (temperature and its quadratic term).

# If you compare to model2, the R squared there was 0.65.
# So adding the quadratic term and removing the two outliers led to
# an extra 20% of the variance being explained.

12. Now predict again from the new model. Does this change your recommendation for the number of police officers needed? If so, how?

# create newdata (or use the same one you made at the beginning)
newdata <- data.frame(TempC=-10)

# create the predictions
predictions2 <- predict(model3, newdata, interval="prediction")

The lower prediction interval is now 649 crimes and the upper is 908, quite a lot lower than before. You could recommend between about 64 and 92 police officers (rounding to the nearest pair). The exact number is a choice, as before; you could choose higher or lower depending on how cautious you want to be. As long as the prediction interval is correctly interpreted, any choice is fine. The main point is that the number needed should be lower; you might have made officers needlessly miss skiing if you hadn't corrected the model!

13. Think about the biological context of the results. Why could there be a quadratic relationship between daily crime numbers and temperature? How could you try to find out what the reasons are, e.g. with new studies or data?

Why the quadratic? There could be many reasons, for example: fewer people are outside at very low temperatures, so there are fewer opportunities for crime; the pattern could reflect seasonal differences; or different types of crime could respond to temperature in different ways.

Anything like the above is OK, and there might be other ideas too. To try and find out, you could collect data relating to the idea. For example, you could collect data on the number of people outside at certain temperatures and see if that explains crime numbers. You could experiment (this would be hard), e.g. increase temperature in winter and see if you get more crime (testing the seasonal idea). Or you could find data on the type of crime and see if the patterns change by crime type.