SOLUTION

Answers are in purple.

Question 1: Dragonfire

As part of our investigation of the re-introduction of dragons into Trondheim, we have been looking at fire precautions. We sent a student out to measure the lengths of 50 dragons, and (from a safe distance) recorded how far they could breathe fire after being poked with a very long stick. Each dragon was only tested once. Note that no major cities were destroyed during this experiment.

This is the data:

DragonFire <- read.csv(file="https://www.math.ntnu.no/emner/ST2304/2021v/Week05/DragonFire.csv")

plot(DragonFire$Size, DragonFire$Distance, xlab="Length (m)", ylab="Distance (m)")


It looks like there is a linear relationship between distance and the length of a dragon. We can fit a linear model that looks like:

\[ y = \alpha + \beta x + \epsilon \]

Remember, we are using a maximum likelihood approach to calculate estimates for the parameters \(\alpha\) (the intercept) and \(\beta\) (the slope). That means we are looking for estimates that have the highest probability of having produced the data (maximum likelihood estimators; MLEs).

1. Fit a linear model to the Dragon Fire data using R and get the coefficient estimates (MLEs for the intercept and slope)

mod <- lm(Distance~Size, data=DragonFire)
coef(mod)
## (Intercept)        Size 
##   5.0556929   0.6804651

How certain can we be about the estimates? We can answer this by calculating confidence intervals. Remember the definition of a confidence interval:

If we repeat our sample many times, 95% of the confidence intervals drawn would contain the true population value

So, if we assume the MLEs represent the true parameter values, then every time we repeat the sampling of our data we will calculate slightly different MLEs. Let's inspect the fit of the model visually by adding the mean line to the scatter plot:

plot(DragonFire$Size, DragonFire$Distance, xlab="Length (m)", ylab="Distance (m)")
abline(lm(Distance~Size, data=DragonFire), col=2) # add mean line

If we re-estimate the model every time we sample new data, we get a distribution for the slope estimator. We can mimic this by simulation, which requires us to assume that we know the true values of the parameters. The function below takes the "true" parameter values as input, simulates a new data set, fits the model to it, and returns the estimate of the slope.

alpha.hat <- coef(mod)[1]
beta.hat <- coef(mod)[2]
sigma.hat <- summary(mod)$sigma # residual standard deviation

# function to simulate the data, fit the model & return the slope estimate
SimData <- function(alpha, beta, sigma, X) {
  E.y <- alpha + beta*X # calculate the line
  y.sim <- rnorm(length(X), E.y, sigma) # simulate the data
  model <- lm(y.sim ~ X) # fit the model
  coef(model)[2] # return the estimated slope
}

# Run one simulation, to check the function works
SimData(alpha=alpha.hat, 
        beta=beta.hat, 
        sigma=sigma.hat, 
        X=DragonFire$Size)
##         X 
## 0.7526991

You should read the function to check that you understand what each line does, and what the function produces. For example, the code coef(model)[2] takes the second element of the vector of coefficients from the model: this is the estimate of the slope of the linear regression.
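For instance, you can check this on the dragon model itself; the two indexing lines below (a small illustration) extract the same slope estimate in different ways:

coef(mod)         # both coefficients, named "(Intercept)" and "Size"
coef(mod)[2]      # the second element: the slope
coef(mod)["Size"] # the same value, extracted by name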

With this function, we can “sample” the data lots of times, and get a distribution for the estimator of the slope:

1.1. Run the code, plot the histogram, and use quantile() to find the 95% confidence interval for the MLE of the slope parameter.

ANSWER in code below

Rep.beta <- replicate(1000, SimData(alpha=alpha.hat, 
                                    beta=beta.hat, 
                                    sigma=sigma.hat, 
                                    X=DragonFire$Size))

hist(Rep.beta)

# Calculate the 95% confidence interval
quantile(Rep.beta, c(0.025, 0.975))
##      2.5%     97.5% 
## 0.5903728 0.7736831

This is the distribution of the estimates of the slope that would come if the data were repeatedly sampled. The mean of this distribution should be similar to the value of the “true” parameter:

Rep.beta <- replicate(1000, SimData(alpha=alpha.hat, 
                                    beta=beta.hat, 
                                    sigma=sigma.hat, 
                                    X=DragonFire$Size))

mean(Rep.beta) # Mean of sampling distribution
## [1] 0.6807926
beta.hat # "True" parameter used in simulation
##      Size 
## 0.6804651

The confidence intervals are a way of summarising the whole distribution: they put limits on the most likely values.

R can also calculate the confidence interval itself using the confint() function.

1.2. Use confint() to calculate the confidence interval for your Dragon Fire model. Is it the same/similar/different to the one you got by simulation? Why?

confint(mod)
##                 2.5 %    97.5 %
## (Intercept) 3.5104891 6.6008967
## Size        0.5861248 0.7748054

The confidence interval is (0.59, 0.77). It is similar to the confidence interval from simulation, but not exactly the same. This is because the simulated version has simulation (Monte Carlo) error: we take a new random sample each time, so each run of the simulation gives a slightly different interval.
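If you want your simulated interval to be reproducible, you can fix the random seed before running the simulation (a small sketch; the seed value 42 is an arbitrary choice):

set.seed(42) # fixing the seed makes the random simulation repeatable
Rep.beta.fixed <- replicate(1000, SimData(alpha=alpha.hat, 
                                          beta=beta.hat, 
                                          sigma=sigma.hat, 
                                          X=DragonFire$Size))
quantile(Rep.beta.fixed, c(0.025, 0.975)) # still close to confint(mod), but now repeatable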

1.3 Discuss the interpretation of the confidence interval. Explain exactly what it means.

If we repeat our sampling many times and each time construct a confidence interval, 95% of the intervals drawn would contain the true population value. This says nothing about how likely our particular interval is to contain the true population value, but it does give us a range of plausible values, and helps us quantify the uncertainty in our estimate. The wider the confidence interval, the higher the uncertainty.
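We can even check this definition by simulation: treat the MLEs as the "true" values, simulate many data sets, compute confint() for each, and count how often the interval contains the "true" slope. A minimal sketch (it reuses alpha.hat, beta.hat and sigma.hat from above, and the coverage should come out close to 95%):

# simulate one data set, fit the model, and check whether the 95% CI covers the true slope
CoverSim <- function(alpha, beta, sigma, X) {
  y.sim <- rnorm(length(X), alpha + beta*X, sigma) # simulate the data
  ci <- confint(lm(y.sim ~ X))["X", ]              # confidence interval for the slope
  ci[1] < beta & beta < ci[2]                      # TRUE if the interval contains the true slope
}
mean(replicate(1000, CoverSim(alpha.hat, beta.hat, sigma.hat, DragonFire$Size)))
# should be close to 0.95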

Now we can do some prediction. One released dragon has taken residence in a small lake, and it will sometimes sit at the water's edge and breathe fire. We want to know what distance away from the lake is safe, for example where trees should be cut down so that they are not burnt. We know this dragon is 12m long, so we want to predict how far it can breathe fire.
We do this using the model:

\[ y_{pred} = \hat{\alpha} + \hat{\beta} x \] where \(\hat{\alpha}\) and \(\hat{\beta}\) are the MLEs, and \(x\) is the value we want to predict for.

1.4. Calculate the predicted value of \(y\) at \(x=12\) using the MLEs for the parameters. Use the equation above and fill in the MLEs you calculated in qu 1.

To predict, you take the equation above and replace alpha hat and beta hat with the coefficients estimated above (5.056 and 0.68), and x with 12, e.g. y = 5.056 + (0.68 * 12) = 13.22m.
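You can check this arithmetic in R, either directly from the coefficients or with the predict() function that is used later on (a quick sketch; both lines should give the same answer of about 13.2 m):

coef(mod)[1] + coef(mod)[2]*12                # prediction by hand from the MLEs
predict(mod, newdata = data.frame(Size = 12)) # the same prediction using predict()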

We can summarise our uncertainty in the prediction with more confidence intervals.

The function below is a modification of the function above: this one simulates a prediction. By simulate, we mean that it mimics what would happen if we repeated our sampling (to get new data) many, many times and made a prediction each time. In reality we do not do this: instead we use the mathematics behind maximum likelihood estimation to represent it, and R does that calculation for us.

1.5. Simulate 1000 predictions for \(x=12\) and calculate the confidence interval for the prediction. You can use the code below, but you will need to edit it to predict for the right \(x\) value.

# function to simulate the data, fit the model & return a prediction
SimPrediction <- function(alpha, beta, sigma, X, x.pred) {
  E.y <- alpha + beta*X 
  y.sim <- rnorm(length(X), E.y, sigma) # simulates values of y
  model <- lm(y.sim ~ X) # run model on this simulated data
  pred.mu <- coef(model)[1] + coef(model)[2]*x.pred # mean prediction
  pred.mu
}

# Run one simulation, to check the function works
SimPrediction(alpha=alpha.hat, 
              beta=beta.hat, 
              sigma=sigma.hat, 
              X=DragonFire$Size,
              x.pred=1)
## (Intercept) 
##    5.774576

ANSWER 1.5 below

# use the SimPrediction function to predict for 1000 different data sets

# make sure to change x.pred to the value of x you want to predict for
Rep.pred <- replicate(1000, SimPrediction(alpha=alpha.hat, 
                                          beta=beta.hat, 
                                          sigma=sigma.hat, 
                                          X=DragonFire$Size, 
                                          x.pred=12))

hist(Rep.pred)

# Calculate the 95% confidence interval of the predictions
quantile(Rep.pred, c(0.025, 0.975))
##     2.5%    97.5% 
## 12.76937 13.69002

Above, you simulated repeated sampling of the dragon fire data, fitting a model to each new sample and using that model to make a prediction each time.

Below is a very similar function, but with one extra step: new <- rnorm(length(x.pred), pred.mu, sigma). This second function more closely represents what would happen if we repeated the dragon fire sampling many times, fitted the model each time, and then went out to collect one more observation for a given size of dragon, e.g. \(x\) = 12.

1.6. In this second function we use rnorm() twice. Why is it used a second time (in new <- rnorm(length(x.pred), pred.mu, sigma))? What are we calculating if we don't do that, i.e. in the first function? (Hint: what are the different sources of uncertainty in predictions?)

If we don't include the second rnorm(), we are predicting the mean \(y\) value for each \(x\), i.e. a point on the fitted line. In reality our observations don't sit exactly on the line, so we need to include the residual variance as well; that is what the second rnorm() adds.

1.7. Simulate 1000 predictions for \(x=12\) using the SimPrediction2 function and calculate the confidence interval for these predictions. How is this different from the confidence interval in 1.5?

Code answer shown below. As this is a simulation, the exact numbers will differ from run to run, but it should be clear that the intervals from SimPrediction2 are wider than those from qu 1.5 (see the comparison sketch after the code below). This is because they include the variability of the data around the fitted line (the residual variance) as well as the uncertainty in the line itself (uncertainty in the intercept and slope).

1.8. The confidence interval from 1.7 more accurately represents a 95% prediction interval (the confidence interval for a prediction). How would you interpret these intervals for the prediction? I.e. give a definition of what they mean/represent.

The 95% prediction interval represents a range within which we would expect a future observation to fall 95% of the time. If we predict a new value and draw a 95% prediction interval many times, 95% of those intervals will contain the actual new observation of the response variable. It accounts for the uncertainty in the parameters as well as the variability of the data.

# function to simulate the data, fit the model & return a prediction
SimPrediction2 <- function(alpha, beta, sigma, X, x.pred) {
  E.y <- alpha + beta*X 
  y.sim <- rnorm(length(X), E.y, sigma) # simulates values of y
  model <- lm(y.sim ~ X) # run model on this simulated data
  pred.mu <- coef(model)[1] + coef(model)[2]*x.pred # mean prediction
  new <- rnorm(length(x.pred), pred.mu, sigma) # simulated new observations
  new
}

# Run one simulation, to check the function works
SimPrediction2(alpha=alpha.hat, 
              beta=beta.hat, 
              sigma=sigma.hat, 
              X=DragonFire$Size,
              x.pred=1)
## [1] 5.338002
Rep.pred2 <- replicate(1000, SimPrediction2(alpha=alpha.hat, 
                                          beta=beta.hat, 
                                          sigma=sigma.hat, 
                                          X=DragonFire$Size, 
                                          x.pred=20)) # NB: the question asks about a 12 m dragon; this example uses x.pred=20

hist(Rep.pred2)

# Calculate the 95% interval of the data
quantile(Rep.pred2, c(0.025, 0.975))
##     2.5%    97.5% 
## 16.60435 20.60073
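Because the simulations above used different dragon sizes (12 m in qu 1.5, 20 m in the code here), a direct comparison of the interval widths is easier if both functions are run at the same value. A sketch at x.pred = 12 (the exact numbers will vary from run to run, but the second interval should be clearly wider):

Rep.line12 <- replicate(1000, SimPrediction(alpha=alpha.hat, beta=beta.hat,
                                            sigma=sigma.hat, X=DragonFire$Size,
                                            x.pred=12))
Rep.new12 <- replicate(1000, SimPrediction2(alpha=alpha.hat, beta=beta.hat,
                                            sigma=sigma.hat, X=DragonFire$Size,
                                            x.pred=12))
quantile(Rep.line12, c(0.025, 0.975)) # uncertainty in the fitted line only
quantile(Rep.new12, c(0.025, 0.975))  # also includes residual variation, so wider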

R can also calculate the prediction itself using the predict() function:

# create new data to predict for
PredictionData <- data.frame(Size=20) # predicting for a 20 m dragon here
# get a point prediction
predict(mod, newdata = PredictionData)
##      1 
## 18.665
# get a point prediction plus prediction interval
predict(mod, newdata = PredictionData, interval="prediction")
##      fit      lwr     upr
## 1 18.665 16.55859 20.7714

1.9. Try out predict(mod, newdata = PredictionData, interval="confidence") instead of predict(mod, newdata = PredictionData, interval="prediction"). What is different about these two types of interval? (Hint: compare them to the simulations in qu 1.5-1.8.)

If you use confidence rather than prediction intervals, you get an interval that accounts for the uncertainty in alpha (the intercept) and beta (the slope), but not for the variance of the residuals. It therefore does not correctly represent the uncertainty of a prediction for a new observation: it only shows the uncertainty in the estimated mean.
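For reference, the two calls side by side (output not shown here; the interval="confidence" version will be the narrower of the two):

predict(mod, newdata = PredictionData, interval="confidence") # uncertainty in the fitted line only
predict(mod, newdata = PredictionData, interval="prediction") # also includes the residual variance, so wider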

1.10. What distance away from the lake would YOU consider safe from dragon fire? Why is the predicted distance not a good suggestion? Explain your answer.

The predicted distance is a Bad Idea because that is only the average distance the dragon can breathe fire, so it has a good chance of being able to breathe further. The precise safe distance depends on how cautious we want to be, and you should use the prediction interval, not the point prediction, to work it out. For the 20 m dragon used in the predict() code above, the point prediction is 18.7 m but the 95% prediction interval says fire could reach up to 20.77 m; for the 12 m dragon in the question, use the interval from your own prediction at \(x\) = 12 in the same way. You could decide that anything beyond the upper bound is safe, or be even more cautious, perhaps using a 99% prediction interval or adding an extra 5 m to the upper bound. Or maybe you like risks and think you could get closer. Whatever you choose, base the decision on the prediction interval.
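If you want to be more cautious, predict() can also give a wider interval through its level argument, for example a 99% prediction interval (a sketch; output not shown):

predict(mod, newdata = PredictionData, interval="prediction", level=0.99)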

Question 2: How long will the Deathly Hallows be?

Here we have the lengths of the Harry Potter books plotted against their order in the series. This analysis was originally done in 2006, when the question was how long the next book would be.

First, you can just run the following code, to create the data and plot it.

# Make the data frame using function data.frame()
# each name before the = is the name of a column
# the elements after the = are the values for that column
HPbooks <- data.frame(
  Book = c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
          "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince", 
          "Deathly Hallows"),
  Length = c(223, 251, 317, 636, 766, 607, NA), # Deathly Hallows is NA here; its actual length (607 pages) is used in qu 2.4
  Date = c(1997, 1998, 1999, 2000, 2003, 2005, 2007)
)

HPbooks$BookOrder <- 1:nrow(HPbooks) # nrow() = count number of rows

# This will make the book titles easier to read on the plot
# Feel free to ignore this: you don't need to know about gsub() for this course
# (the Book column above already omits the "Harry Potter and the " prefix, so gsub() leaves it unchanged)
HPbooks$BookLabels <- gsub("Harry Potter and the ", "", HPbooks$Book)

# set margins of the plot
par(mar=c(6.1,4.1,1,1))

plot(HPbooks$BookOrder, HPbooks$Length, 
     xlab="", 
     ylab="Number of pages", 
     xaxt="n") # xaxt = "n" tells R not to include an x-axis
# instead we add it manually
axis(1, at=HPbooks$BookOrder, padj=0.8, cex.axis=0.8, 
     labels = gsub(" ", "\n", HPbooks$BookLabels))
text(7, 600, "?", col=2, cex=5)

Your task is to look at the relationship between the length of a book and the order it is in the series. In particular, do later books tend to be longer than earlier ones? You should use lm() to regress the number of pages (Length) against where the book is in the series (BookOrder). Question 1 should help you with using the right code (and, yes, you can use the easy built in R functions).

2. Run the linear regression model in R for Length as a response to BookOrder. Look at the results from the model.

HPModel <- lm(Length ~ BookOrder, data = HPbooks)

coef(HPModel)
## (Intercept)   BookOrder 
##    88.26667   108.11429
confint(HPModel)
##                  2.5 %   97.5 %
## (Intercept) -232.75298 409.2863
## BookOrder     25.68405 190.5445

2.1. Is there evidence that the books are getting longer?

Yes: the results of the lm() show an increase of 108 pages per book, on average. The confidence interval for this effect does not span 0, so we would be unlikely to see an effect this large if there were no trend.

2.2. how much longer is each book getting (on average)? What is the uncertainty around this?

108 pages on average, with a 95% confidence interval of 25.7 to 190.5 pages. ALWAYS include the uncertainty!

2.3. how long do you predict Book 7 should be? What is the prediction interval around that?

predict(HPModel, newdata = data.frame(BookOrder = 7), interval="prediction"). The answer is 845 pages, with a prediction interval of 373.9 to 1316.2 pages.

2.4. The final Harry Potter book was actually 607 pages long. How good was the prediction? What does this tell us about making predictions from linear regression analyses?

It was good in the sense that the actual value fell within the prediction interval, but the interval was so broad that it would include most modern fantasy novels, as well as War and Peace and all of the A Song of Ice and Fire books. So, in general, the uncertainty was too large to make a useful prediction.
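If you want to let R do the check, one small sketch (it assumes HPModel from above) is to compare the actual length to the prediction interval:

pred7 <- predict(HPModel, newdata = data.frame(BookOrder = 7), interval="prediction")
pred7[, "lwr"] < 607 & 607 < pred7[, "upr"] # TRUE: the actual length is inside the interval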

Question 3: The Climate Change of Crime

There are many effects of the changing climate, with species moving further north, and great tits (Parus major) becoming more murderous. But even humans are not immune. There have been suggestions that people also become more violent when the temperature gets hotter. We can explore some of the evidence about this, thanks to the Chicago Police Department.

Our data come from their online resources.

We have removed the days where the mean temperature was below freezing (the full data will appear later in the course). We can read in the data from the web:

ChicagoCrime <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week5/NoFreezeTempCrimeChicago.csv")
str(ChicagoCrime)

3.1. Plot the data, and describe the general patterns you see.

We suggest a scatter plot.

plot(ChicagoCrime$TempC, ChicagoCrime$Crime, type="p",
     ylab = "Daily crime number", 
     xlab = "Daily mean temp in degrees C")

3.2. What seems to be the overall effect of temperature on crime? (use the plot to answer this question)

Higher temperatures are associated with more crime: the daily number of crimes increases with the mean temperature.

3.3. are there any features of the data that stand out (either from looking at the plots, or the numbers themselves)?

Looking at the plot, there are three points with noticeably more crime than the overall trend, especially above 25 degrees C. There also looks to be some curvature in the trend. And even though the crime numbers are averages, they are integers, so they have been rounded.

3.4. does a regression seem like a reasonable idea? Does it make sense to fit a straight line?

Yes, it does: the relationship is fairly straight, and the response is (roughly) continuous.

Now you should use lm() to fit a regression line.

3.5. what is the coefficient for the temperature effect? What are the confidence intervals?

3.6. Interpret the fitted model: what is the effect of warmer temperatures on the crime numbers? How strong is the effect?

3.7. The prediction is that Chicago will be about 3°C warmer in 2050. What effect is this likely to have on crime numbers?

3.8. From this model, predict how much crime there will be if the temperature is -10°C

All of the code is shown below.

But we also need words! The effect of temperature on crime numbers is estimated as 5.57 extra daily crimes (on average) for every 1 degree increase in temperature, with a confidence interval of 3.78 to 7.36 extra daily crimes. This means that the warmer it is, the more daily crimes there are on average, with a strength of between 3.78 and 7.36 crimes per day per degree. So a shift of 3 degrees warmer corresponds to roughly 16.7 extra crimes per day (between 11.34 and 22.08, using the confidence interval), which is quite a lot.

# run model 
reg <- lm(Crime~TempC, data=ChicagoCrime)

coef(reg)
## (Intercept)       TempC 
##  946.218443    5.569831
confint(reg)
##                  2.5 %     97.5 %
## (Intercept) 913.541968 978.894917
## TempC         3.778051   7.361611
# Effect of increasing temp. by 3C
# calculate by multiplying the MLE for the slope by 3
# slope shows change in y for 1 degree change in x
3*coef(reg)["TempC"]
##    TempC 
## 16.70949
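# Confidence interval for the 3 degree shift: multiply the slope CI by 3
# (an extra illustrative line; the values follow from confint(reg) above)
3*confint(reg)["TempC", ]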
# Predict -10C crime
predict(reg, newdata=data.frame(TempC=-10), interval = "pred")
##        fit      lwr      upr
## 1 890.5201 760.0736 1020.967