answers are in purple.
As part of our investigation of the re-introduction of dragons into Trondheim, we have been looking at fire precautions. We sent a student out to measure the lengths of 50 dragons, and (from a safe distance) recorded how far they could breath fire after being poked with a very long stick. Each dragon was only tested once. Note that no major cities were destroyed during this experiment.
This is the data:
FileFlames <- "https://www.math.ntnu.no/emner/ST2304/2020v/Week05/DragonFire.csv"
DragonFire <- read.csv(file=FileFlames)
plot(DragonFire$Size, DragonFire$Distance, xlab="Length (m)", ylab="Distance (m)")
abline(lm(Distance~Size, data=DragonFire), col=2) # add mean line
We can fit the model. Remember, we are using a maximum likelihood approach, so the ‘best’ parameters are those which have the highest probability of having produced the data.
mod <- lm(Distance~Size, data=DragonFire)
coef(mod)
## (Intercept) Size
## 6.1453222 0.6359166
How certain can we be about the estimates? We can answer this by calculating confidence intervals. Remember the definition of a confidence interval:
If we repeat our sample many times, 95% of the confidence intervals drawn would contain the true population value
So, we assume the MLEs are the true values, and from these and the model, we know that every time we repeated sampling to get the data, we would get another sample, and from this we could calculate the slope. If we do this lots of times, we get a distribution of slope estimates. We can do this virtually by simulation. First we have a function to simulate the data, and calculate an estimate of the slope.
alpha.hat <- coef(mod)[1]
beta.hat <- coef(mod)[2]
sigma.hat <- summary(mod)$sigma # std deviation
# function to simulate the data, fit the model & return the slope estimate
SimData <- function(alpha, beta, sigma, X) {
E.y <- alpha + beta*X # calculate the line
y.sim <- rnorm(length(X), E.y, sigma) # simulate the data
m <- lm(y.sim ~ X) # fit the model
coef(m)[2] # return the estimated slope
}
# Run one simulation, to check the function works
SimData(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=DragonFire$Size)
## X
## 0.6144988
You should read the function to check that you understand what each line does, and what the function produces.
With this function, we can simulate the data lots of times, and get a sampling distribution for the slope:
Rep.beta <- replicate(1e3, SimData(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=DragonFire$Distance))
hist(Rep.beta)
# Calculate the 95% confidence interval
quantile(Rep.beta, c(0.025, 0.975))
This is the distribution of the estimates of the slope that would come if the data were repeatedly sampled. The confidence intervals are a way of summarising the whole distribution: they put limits on the likely values).
quantile()
to find the 95% confidence interval. Rep.beta <- replicate(1e3, SimData(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=DragonFire$Distance))
hist(Rep.beta)
# Calculate the 95% confidence interval
quantile(Rep.beta, c(0.025, 0.975))
## 2.5% 97.5%
## 0.5078049 0.7610609
R will calculate the confidence interval itself using the confint()
function.
confint()
to calculate the confidence interval for this model. Is it the same/similar/different to the one you got by simulation?The confidence interval is 0.55, 0.72. It is similar to the confidence interval from simulation (here 0.5078049, 0.7610609), but not exactly the same. This is because the simulated version has simulation error because we take a random sample each time, this means each version is slightly different.
If we repeat our sampling many times and each time make a confidence interval, 95% of the confidence intervals drawn would contain the true population value. This says nothing about how likely our current confidence interval is to contain the true population value. But it does give us a range of plausible values. It helps us measure uncertainty in our estimate. The wider the confidence intervals the higher the uncertainty.
Now we can do some prediction. One released dragon has taken residence in a small lake, and it will sometimes sit at the water’s edge and breath fire. We want to know what distance away from the lake is safe, and for example where should trees be cut down so they aren’t burnt down. We know this dragon is 12m long, so we want to predict how far it can breathe fire. We do this using the model:
\[ y_{pred} = \hat{\alpha} + \hat{\beta} x \] where \(\hat{\alpha}\) and \(\hat{\beta}\) are the MLEs, and \(x\) is the value we want to predict for.
To predict you need to use the equation above and replace alpha hat and beta hat with the coefficients you estimated above, 6.145 and 0.636, and x with 12. e.g. y = 0.636 + (0.636 *12) = 13.78m.
We can summarise our uncertainty in the prediction with more confidence intervals.
The 95% prediction interval represents a range within which we would expect a future observation to occur 95% of the time. If we predict a new value and draw a 95% prediction interval many times, 95% of the those intervals will contain the actual new observation for the specified response variable. It takes account of uncertainty in the parameters and also the data.
The function below is a modification of the function above: this one simulates the prediction.
# function to simulate the data, fit the model & return a prediction
SimPrediction <- function(alpha, beta, sigma, X, x.pred) {
E.y <- alpha + beta*X
y.sim <- rnorm(length(X), E.y, sigma)
m <- lm(y.sim ~ X)
pred.mu <- coef(m)[1] + coef(m)[2]*x.pred # mean prediction
prediction <- rnorm(length(x.pred), pred.mu, sigma) # prediction
prediction
}
# Run one simulation, to check the function works
SimPrediction(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=DragonFire$Size, x.pred=1)
## [1] 4.832281
Rep.pred <- replicate(1e3, SimPrediction(alpha=alpha.hat, beta=beta.hat, sigma=sigma.hat, X=DragonFire$Size, x.pred=20))
hist(Rep.pred)
# Calculate the 95% confidence interval
quantile(Rep.pred, c(0.025, 0.975))
rnorm()
a second time (in prediction <- rnorm(length(x.pred), pred.mu, sigma)
). What are we calculating if we don’t do that? (hint: what are the different sources of uncertainty?) If we don’t have the second rnorm()
then we would predict the mean Y value for each X. This would be predicting on the fitted line, but in reality our points don’t always sit on the line. We need the variance too, this is what the second rnorm()
does.
See above for code.
R will calculate the prediction itself using the predict()
function:
# create new data to predict for
PredictionData <- data.frame(Size=20)
# get a point prediction
predict(mod, newdata = PredictionData)
## 1
## 18.86365
# get a point prediction plus confidence interval
predict(mod, newdata = PredictionData, interval="prediction")
## fit lwr upr
## 1 18.86365 16.92599 20.80132
confint()
to calculate the confidence interval for this model. Is it the same/similar/different to the one you got by simulation?Code above, should be very close to simulated ones.
predict(mod, newdata = PredictionData, interval="confidence")
instead of predict(mod, newdata = PredictionData, interval="prediction")
(Hint: compare these to the simulations).If you had used confidence not prediction intervals, you would get a confidence interval accounting for uncertainty in alpha and beta but not accounting for variance of residuals. Therefore, it would not correctly estimate uncertainty of the prediction. It would just show uncertainty in the mean estimate.
The predicted distance (18.9 m) is a Bad Idea because that is the average distance the dragon can breathe fire. So, it has a good chance of being able to breathe further. The precise distance depends on how safe we want to be. We could decide to make the lakeside safe from 99% of breaths, in which case we would need to go beyond the upper 1% quantile: round(predict(mod, newdata = PredictionData, interval="prediction", level=0.98), 2)[3]
= 21.18 m. But we could argue for other quantiles. You could be even more cautious, or riskier.
Here we have the lengths of the Harry Potter book plotted against their order. This analysis was originally done in 2006, where the question was how long the next book would be. First, you can just run the following code, to create the data and plot it.
# Make the data frame
HPbooks <- data.frame(
Book = c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
"Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
"Deathly Hallows"),
Length = c(223, 251, 317, 636, 766, 607, NA), # 607
Date = c(1997, 1998, 1999, 2000, 2003, 2005, 2007)
)
HPbooks$BookOrder <- 1:nrow(HPbooks) # nrow() = count number of rows
# This will make the book titles easier to read
# Feel free to ignore this: you don't need to know about gsub() for this course
HPbooks$BookLabels <- gsub("Harry Potter and the ", "", HPbooks$Book)
par(mar=c(6.1,4.1,1,1))
plot(HPbooks$BookOrder, HPbooks$Length, xlab="", ylab="Number of pages", xaxt="n")
axis(1, at=HPbooks$BookOrder, padj=0.8, cex.axis=0.8, labels = gsub(" ", "\n", HPbooks$BookLabels))
text(7, 600, "?", col=2, cex=5)
Your task is to look at the relationship between the length of a book and the order it is in the series. In particular, do later books tend to be longer than earlier ones? You should use lm()
to regress the number of pages (Length
) against where the book is in the series (BookOrder
). Question 1 should help you with using the right code (and, yes, you can use the easy built in R functions).
Yes, the results of the lm()
show an increase of 108 pages per book order. The confidence intervals for this effect do not span 0, so we would be unlikely to get see this level of effect if there was no trend.
108 pages on average, with uncertainty of 25.68 to 190.54. ALWAYS include uncertainty!
predict(mod, newdata= data.frame(BookOrder = 7), interval="prediction")
. The answer is 845 pages with an interval of 373.94 and 1316.195.
It was good in that the actual value was within the CI, but the interval was so broad that it would include most modern fantasies, as well as War and Peace and all of the A Song of Ice and Fire books. So, generally the uncertainty was too large to make a good prediction.
There are many effects of the changing climate, with species moving futher north, and great tits (Parus major) becoming more murderous. But even humans are not immune. There have been suggestions that people also become more violent when the temperature get hotter. We can explore some of the evidence about this, thanks to the Chicago Police Department. Our data comes from their online resources: these are the daily mean temperatures and number of crimes between January 1st 2001 and January 23rd 2019. We have removed data where the man temperature was below freezing (the full data will appear later in the course). We can read in the data from the web (check that the link is correct!):
ChicagoCrime <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week5/NoFreezeTempCrimeChicago.csv")
str(ChicagoCrime)
## 'data.frame': 57 obs. of 2 variables:
## $ TempC: num 0.556 1.111 1.667 2.222 2.778 ...
## $ Crime: num 888 969 950 995 891 ...
The data consists of the temperature in Celcius, and for that temperature, the average the number of crimes reported (over the whole period).
A higher temperature means more crime.
Looking at the plot, there are 3 points with more crime than the trend: row numbers 25, 52, 56. Looking at the plot, there might be some cuvrature in the trend. Even though they are averages, the number of crimes are integers, so have been rounded.
Yes it does: the relationship is fairly straight, and is continuous.
Now you should use lm()
to fit a regression line.
All code below.
reg <- lm(Crime~TempC, data=ChicagoCrime)
summary(reg)
##
## Call:
## lm(formula = Crime ~ TempC, data = ChicagoCrime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.658 -41.537 -3.748 29.867 227.275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 946.2184 16.2985 58.056 < 2e-16 ***
## TempC 5.5698 0.8937 6.232 7.22e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.29 on 54 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.4184, Adjusted R-squared: 0.4076
## F-statistic: 38.84 on 1 and 54 DF, p-value: 7.218e-08
# Temperature coefficient:
coef(reg)["TempC"]
## TempC
## 5.569831
# Confidence interval
confint(reg)["TempC",]
## 2.5 % 97.5 %
## 3.778051 7.361611
# Effect of increasing temp. by 3C
3*coef(reg)["TempC"]
## TempC
## 16.70949
# Predict -10C crime
predict(reg, newdata=data.frame(TempC=-10), interval = "pred")
## fit lwr upr
## 1 890.5201 760.0736 1020.967