Question 1: Simulating regression

We will look at regression and how it works with some simulated data. First we have to simulate some data:

alpha <- 5 # intercept
beta <- 5 # slope
sigma <- 1 # standard deviation
x <- runif(50, 0, 1) # 50 values from a uniform distribution on [0, 1]
mu <- alpha + beta*x # expected value of y for each x
y <- rnorm(length(mu), mu, sigma) # simulate y: normal noise around the line
plot(x, y)
lines(x, mu, col=2) # add the true mean line (col=2 = red)

We can fit the model. Remember, we are using a maximum likelihood approach, so the ‘best’ parameters are those with the highest probability of having produced the data. We know the true parameter values, but the estimates will, of course, be slightly different:

mod <- lm(y ~ x)
coef(mod)
## (Intercept)           x 
##    5.439873    4.388110

How certain can we be about the estimates? We can answer this by calculating confidence intervals. Remember the definition of a confidence interval:

If we repeated our sampling many times, 95% of the confidence intervals drawn would contain the true population value.

So, we assume the MLEs are the true values. From these and the model, we know what would happen every time we repeated the sampling: we would get another data set, and from it we could calculate another slope estimate. If we do this lots of times, we get a distribution of slope estimates. We can do this virtually by simulation. First we write a function to simulate the data and calculate an estimate of the slope:

alpha.hat <- coef(mod)[1]
beta.hat <- coef(mod)[2]
sigma.hat <- summary(mod)$sigma # estimated residual standard deviation

# function to simulate the data, fit the model & return the slope estimate
SimData <- function(alpha, beta, sigma, X) {
  E.y <- alpha + beta*X # calculate the line
  y.sim <- rnorm(length(X), E.y, sigma) # simulate
  m <- lm(y.sim ~ X) # fit the model
  coef(m)[2] # return the estimated slope
}
# Run one simulation, to check the function works
SimData(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=x)
##        X 
## 3.912114

You should read the function to check that you understand what each line does, and what the function produces.

With this function, we can simulate the data lots of times, and get a sampling distribution for the slope:

Rep.beta <- replicate(1e3, SimData(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=x))
hist(Rep.beta)
# Calculate the 95% confidence interval
quantile(Rep.beta, c(0.025, 0.975))

So, you should run this and plot the histogram. This is the distribution of slope estimates we would get if the data were repeatedly sampled. The confidence interval is a way of summarising the whole distribution: it puts limits on the likely values.

Your tasks:

- Discuss the interpretation of the confidence interval. Explain exactly what it means.
- R will calculate the confidence interval itself using the confint() function.
- Use confint() to calculate the confidence interval for this model (a sketch of the call is given below). Is it the same as, similar to, or different from the one you got by simulation?
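For example, a minimal sketch using the model fitted above:

confint(mod, level = 0.95) # 95% confidence intervals for the intercept and slope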

Now we can do some prediction. Suppose we want to predict \(y\) when \(x=1\). We do this using the model:

\[ y_{pred} = \hat{\alpha} + \hat{\beta} x \] where \(\hat{\alpha}\) and \(\hat{\beta}\) are the MLEs, and \(x\) is the value we want to predict for. If we used the true values, we would have \(y_{pred} = 5 + 5 \times 1 = 10\), but the MLEs are slightly different.
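To make the formula concrete, the point prediction can be computed by plugging the MLEs into the line. A small sketch, where x.pred is just an illustrative name for the value we want to predict at:

x.pred <- 1
y.pred <- coef(mod)[1] + coef(mod)[2]*x.pred # alpha.hat + beta.hat * x.pred
y.pred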

We can summarise our uncertainty in the prediction with another interval, calculated by simulation in the same way as before.

The function below is a modification of the function above: this one simulates the prediction.

# function to simulate the data, fit the model & return a prediction
SimPrediction <- function(alpha, beta, sigma, X, x.pred) {
  E.y <- alpha + beta*X
  y.sim <- rnorm(length(X), E.y, sigma)
  m <- lm(y.sim ~ X)
  pred.mu <- coef(m)[1] + coef(m)[2]*x.pred # mean prediction
  prediction <- rnorm(length(x.pred), pred.mu, sigma) # add observation noise to get a predicted observation
  prediction
}
# Run one simulation, to check the function works
SimPrediction(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=x, x.pred=1)
## [1] 9.979607
Rep.pred <- replicate(1e3, SimPrediction(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=x, x.pred=1))
hist(Rep.pred)
# Calculate a 95% interval for the prediction (a prediction interval)
quantile(Rep.pred, c(0.025, 0.975))

R will calculate the prediction itself using the predict() function:

# create new data to predict for
PredictionData <- data.frame(x=1)
# get a point prediction
predict(mod, newdata = PredictionData)
# get a point prediction plus a 95% prediction interval
predict(mod, newdata = PredictionData, interval="prediction")
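For comparison, predict() can also give an interval for the mean of \(y\) at \(x=1\) (interval="confidence"), which only reflects uncertainty in the fitted line. The prediction interval above should be wider, because it also includes the residual variation around the line. A small sketch:

# interval for the mean of y at x = 1 (uncertainty in the fitted line only)
predict(mod, newdata = PredictionData, interval="confidence")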

Question 2: How long will the Deathly Hallows be?

Here we have the lengths of the Harry Potter books plotted against their order in the series. This analysis was originally done in 2006, when the question was how long the next book would be.

# Make the data frame 
HPbooks <- data.frame(
  Book = c("Harry Potter and the Philosopher's Stone", "Harry Potter and the Chamber of Secrets",
           "Harry Potter and the Prisoner of Azkaban", "Harry Potter and the Goblet of Fire",
           "Harry Potter and the Order of the Phoenix", "Harry Potter and the Half-Blood Prince",
           "Harry Potter and the Deathly Hallows"),
  Length = c(223, 251, 317, 636, 766, 607, NA), # 607
  Date = c(1997, 1998, 1999, 2000, 2003, 2005, 2007)
)
HPbooks$BookOrder <- 1:nrow(HPbooks) # nrow() = count number of rows
# This will make the book titles easier to read by removing the common prefix
#  Feel free to ignore this: you don't need to know about gsub() for this course
HPbooks$BookLabels <- gsub("Harry Potter and the ", "", HPbooks$Book)

par(mar=c(6.1,4.1,1,1))
plot(HPbooks$BookOrder, HPbooks$Length, xlab="", ylab="Number of pages", xaxt="n")
axis(1, at=HPbooks$BookOrder, padj=0.8, cex.axis=0.8, labels = gsub(" ", "\n", HPbooks$BookLabels))
text(7, 600, "?", col=2, cex=5)

Your task is to look at the relationship between the length of a book and its order in the series. In particular, do later books tend to be longer than earlier ones? You should use lm() to regress the number of pages (Length) against where the book is in the series (BookOrder). Question 1 should help you with the right code (and, yes, you can use the easy built-in R functions); a sketch of the sort of call you might use follows below.
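As a starting point, a minimal sketch of the sort of call you might use (hp.mod is just an illustrative name, and lm() will drop the book with an unknown length):

hp.mod <- lm(Length ~ BookOrder, data = HPbooks) # regress pages on book order
summary(hp.mod)
# predict the length of book 7, with a 95% prediction interval
predict(hp.mod, newdata = data.frame(BookOrder = 7), interval="prediction")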

Question 3: The Climate Change of Crime

There are many effects of the changing climate, with species moving further north, and great tits (Parus major) becoming more murderous. But even humans are not immune. There have been suggestions that people also become more violent when the temperature gets hotter. We can explore some of the evidence about this, thanks to the Chicago Police Department. Our data come from their online resources: these are the daily mean temperatures and numbers of crimes between January 1st 2001 and January 23rd 2019. We have removed data where the mean temperature was below freezing (the full data will appear later in the course). We can read in the data from the web (check that the link is correct!):

ChicagoCrime <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week5/NoFreezeTempCrimeChicago.csv")
str(ChicagoCrime)
## 'data.frame':    57 obs. of  2 variables:
##  $ TempC: num  0.556 1.111 1.667 2.222 2.778 ...
##  $ Crime: num  888 969 950 995 891 ...

The data consist of the temperature in Celsius and, for each temperature, the average number of crimes reported (over the whole period).

Now you should use lm() to fit a regression line; a sketch of the kind of call you might use is below.
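A minimal sketch, assuming the ChicagoCrime data frame read in above (crime.mod is just an illustrative name):

crime.mod <- lm(Crime ~ TempC, data = ChicagoCrime) # regress crime counts on temperature
summary(crime.mod)
confint(crime.mod) # 95% confidence intervals for the intercept and slope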