Exercise 2: Can you outrun a zombie?

Instructions:

This document contains information, questions, R code, and plots.

Hints and reminders are bold

Questions appear in blue.

Exercise to be handed in by the end of 12th February

Rationale:

To complete this exercise you will need to use:

This week the exercise looks at how we can use maximum likelihood estimation for distributions that use two parameters to make a life saving (or not) decision. These exercises will get you to use all steps in statistical modelling (choosing models, estimating parameters, estimating uncertainty, and interpretation of results.)


R hints and general help

Question vocabulary:

Resources:

R this week:

New functions:


The challenge: Should you try to outrun a zombie?

It is day 40 after the event. An infection has been spreading through the world. Its symptoms produce a zombie-like state of reduced cognitive function coupled with increased aggression. 50% of the population have already been infected. It is not known how many are immune.

You are a group of statisticians who have managed to remain safe inside your office. You now need to decide whether to stay here or try and run to a different location. But, can you outrun the zombies? From TV and film you have pieced together mixed information on zombie abilities. Some sources suggest zombies are slow, only walking or shuffling. Others show that zombies can move incredibly fast, even running. The only way we can find out how these zombies behave is with data.

Luckily! One of your team has a dataset that might help.

One infection outbreak began at the 2018 athletics tournament in Paris. The 100m final had been run two days before an outbreak occurred during the 100m relay. For some reason the newly infected athletes still decided to run the relay event (we don't know why). So you have data on the 100m times for the same runners both before and after infection.

[Zombie picture]

The data can be found here: https://www.math.ntnu.no/emner/ST2304/2021v/Week04/Ztimes100.csv

It is your job to analyse the data and decide whether to run or stay.


Part A: Should you stay or go? Using the t-distribution

The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.
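As a sketch of how the import works: `read.csv()` with `header = TRUE` reads a csv file with column names. The example below is demonstrated on a small temporary file with made-up values so it runs anywhere; for the exercise you would pass the course URL above to `read.csv()` instead.

```r
# Sketch: read.csv() with header = TRUE reads a csv with column names.
# Demonstrated on a temporary file with made-up values; for the real data,
# pass the course URL instead of 'tmp'. The column names here are hypothetical.
tmp <- tempfile(fileext = ".csv")
writeLines(c("before,after", "10.1,11.3", "9.9,11.0"), tmp)
Ztimes100 <- read.csv(tmp, header = TRUE)
head(Ztimes100)
```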

Once you have imported the data it is important to look at it.

We have made two graphs for you (you do not need to know all the code used to do it). In Figure a the grey lines show which times are from the same athlete. The means are marked in red. You can also look at the raw data in R to get familiar with the column names and layout.

Figure b is a histogram of the difference in time before and after being infected. The x-axis is the difference in seconds and the y-axis is the frequency that difference was observed in the data.

[Figure a: 100m times before and after infection; Figure b: histogram of the differences]

The first step when deciding how to analyse the data is to choose a model to represent how the data were generated.

Remember: When deciding how to model any data, you need to consider the characteristics of the data and match these to an equation or distribution.

In this case you have data that are differences in times, time is continuous (it can take any value, even negative here). You can also see in Figure b above that the data are almost bell-shaped. Our data seem to be normally distributed - they follow a normal distribution at the population level.

The normal distribution is characterised by two parameters: the mean (\(\mu\)) and the standard deviation (\(\sigma\)).

In this case you are only interested in the mean. You want to estimate the value of the mean (\(\hat{\mu}\)) using maximum likelihood estimation. Remember that the distribution of those estimates of the mean (\(\hat{\mu}\)) will be a t-distribution. More on this later in the exercise.

In this example the distribution of the data and the distribution of the parameter estimate are not the same!

The t-distribution has a single parameter: the degrees of freedom (calculated as sample size - 1), which determines its variance. The mean of the t-distribution is equal to 0.
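As a quick illustration (not required for the exercise), you can compare the two distributions directly in R and see that with few degrees of freedom the t-distribution has heavier tails than the normal:

```r
# Compare the standard normal density with a t-distribution density.
# With few degrees of freedom the t-distribution has heavier tails,
# reflecting the extra uncertainty from estimating the variance.
x <- seq(-4, 4, by = 0.1)
plot(x, dnorm(x), type = "l", ylab = "Density", xlab = "x")
lines(x, dt(x, df = 5), lty = 2)

# The t-distribution puts more probability in the tails than the normal
dt(3, df = 5) > dnorm(3)  # TRUE
```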

In this example you want to use your model to answer a specific question: Are zombies faster than humans?

In order to do this you will be doing the equivalent of a paired t-test. You should have covered this in ST0103 during 'hypothesis testing (in practice)'. But here we will look at this from a modelling perspective and put it in the framework of the rest of this course.

To pair the data you will need to create an object which contains the difference in the times before and after infection. To do this you need to subtract one column from the other.

A1. Create the object of differences in time. Why does the data need to be paired like this?

R hint

To look at your column names, either type the name of the object into R (e.g. Ztimes100), use the str() function to see the structure of the data (e.g. str(Ztimes100)), or use the specific function colnames(OBJECT). Remember to replace OBJECT with your data name.

To access a specific column you use $ in the format:

DATANAME$COLUMNNAME e.g. x$column1
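Putting the hint together, the subtraction looks like the sketch below. The column names `before` and `after` and the values are made up for illustration; check your real column names with `colnames()` first.

```r
# Made-up mini data set with hypothetical column names 'before' and 'after';
# check your real column names with colnames(Ztimes100)
Ztimes100 <- data.frame(before = c(10.1, 9.9, 10.3),
                        after  = c(11.3, 11.0, 11.6))

# Subtract one column from the other to get the paired differences,
# and remember to assign the result to an object
differences <- Ztimes100$before - Ztimes100$after
differences
```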


A2. What are you trying to find out by modelling these differences? (What do you want to know from the data and how can you find that out? i.e. what criteria would you use to decide on the answer to the question?)

You can model the data in R using the lm() function.

The lm() function in R is equivalent to using maximum likelihood estimation but R does all of the estimating for you. It then gives you a summary that contains the maximum likelihood estimate for each parameter (the mean) and a measure of uncertainty (the standard error or confidence intervals - using the t-distribution).

You will need to give two arguments to lm(): these are x and y.

y is the object of differences you made in A1. This will make the data paired. These times will be argument y, which is the response variable.

Which column you subtracted from the other will influence how you interpret the answer.

Remember: assign the differences to an object.

Argument x here is 1. This is because you are looking only at the mean difference. There is no explanatory variable. By taking the difference in times rather than the raw times themselves, you only model the differences.
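With x = 1, the model formula is written y ~ 1 (an intercept-only model). A minimal sketch, using made-up differences in place of the object you created in A1:

```r
# A minimal sketch with made-up differences; replace these with
# the differences object you created in A1
differences <- c(-1.2, -0.9, -1.4, -1.1)

# Intercept-only model: the response is modelled by its mean alone (y ~ 1)
lm_output <- lm(differences ~ 1)

# The estimated intercept is the maximum likelihood estimate of the mean
coef(lm_output)
```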

A3. Run the model using the lm() function.

Look at the results using summary(lm_output). Remember to replace 'lm_output' with the name of your own lm() model!

The output of lm() shown by summary() displays:

You can also find the confidence intervals for your parameter estimate based on the t-distribution using the confint() function.
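For example (with the same hypothetical made-up differences as above; use your own fitted model from A3):

```r
# Hypothetical example: refit an intercept-only model to made-up differences
differences <- c(-1.2, -0.9, -1.4, -1.1)
lm_output <- lm(differences ~ 1)

# 95% confidence interval for the mean difference, based on the t-distribution
confint(lm_output)
```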

A4. What are the confidence intervals for your estimate of the mean difference (what values are the upper and lower bound)? Describe what they mean statistically.

A5. Your team must decide whether to stay or go. Based on the data you have analysed, do you decide to try and outrun the zombies? Include reasons and the results in your answer (MLE and confidence intervals). hint: think about what the estimated mean represents in terms of 100m times


Part B: Was there a better way to model uncertainty?

B1. Do you think you could have used a normal distribution here to represent the parameter distribution instead of a t-distribution? Say why/why not. Think about what the difference between a t-distribution and a normal distribution is.

One way to test whether using a normal distribution would differ from using a t-distribution is to try it!

B2. How many unknown parameters does the normal distribution have?

To get a likelihood curve for the estimate of the mean of the differences (\(\hat{\mu}\)) in 100m times using a normal distribution, you can use the dnorm() function (similar to what you did last week with the Binomial and Poisson distributions). This allows you to calculate the likelihood, which for a continuous distribution is the density.

Here you create a distribution for the estimated parameter.

Remember: as you have many data points, you will need to sum the log likelihood for all data points.

You have code to do this in the week 4 module.

R hints

You used dnorm() in the exercise part of this week's lecture.

There is also a function there to sum log likelihoods; you can copy and paste it to use here. It is CalcNormlogLh(), which takes the arguments mu, sigma, and data: mu = mean, sigma = standard deviation, and data = observed data.

The function is in this script https://www.math.ntnu.no/emner/ST2304/2021v/Week04/NormalDistFunctions.R
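If you want to see the idea without opening the script, the sketch below is our reconstruction of what such a function does, based only on the argument description above; it is not necessarily the exact course code.

```r
# Sketch of a summed log-likelihood function for the normal distribution
# (reconstructed from the argument description; the official version is
# in the course script linked above)
CalcNormlogLh <- function(mu, sigma, data) {
  # log = TRUE returns log densities; summing them gives the
  # log likelihood over all data points
  sum(dnorm(data, mean = mu, sd = sigma, log = TRUE))
}

# Example: log likelihood of some made-up differences at mu = -1, sigma = 0.5
CalcNormlogLh(mu = -1, sigma = 0.5, data = c(-1.2, -0.9, -1.4, -1.1))
```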


The estimate of the mean of the differences stays exactly the same as what you got before, because this part has not changed. It is still the mean of the sample. What does change is how we quantify uncertainty. So, how does the estimate of uncertainty change? To find out you need to calculate the confidence intervals.

B3. Calculate the confidence intervals for the estimate of \(\hat{\mu}\) using the normal distribution (code below).

# Calculate the confidence intervals using a normal distribution 

# First save an object that is the standard error
standard_error_norm <- sd(differences)/sqrt(length(differences))

# Then add or subtract 1.96 times the standard error from the 
# maximum likelihood estimate
CI_upper <- mean(differences) + 1.96*standard_error_norm
CI_lower <- mean(differences) - 1.96*standard_error_norm

# display the confidence intervals to 2 decimal places
round(c(CI_lower, CI_upper), 2)

## [1] -1.31 -0.98

B4. Compare these confidence intervals to those calculated from the t-distribution.

You should have those written down from question A4.

You can also plot the different confidence intervals. Here we plot them both on the Normal distribution density plot.

We have given you the code to do this - it isn't something you need to memorise but do read the comments if you want to understand what it does.

# First create a plot with the Normal distribution density.
# We use xlim and ylim to set the limits of our axes to make
# the plot look nicer and be easier to read. 
plot(means, exp(likelihoods.norm-max(likelihoods.norm)), 
     type="l", ylab="Density", xlab="Mean", xlim = c(-3,3), ylim=c(0,1.1))
# We then add a point for the maximum likelihood estimate of the mean.
points(mean(differences), 1, col = "red", pch=16)

# Then add the t-distribution confidence intervals in red
abline(v = c(confint(t_test1)[1], confint(t_test1)[2]), col = "red")
# and the Normal distribution confidence intervals in blue
abline(v = c(CI_lower, CI_upper), col = "blue")

# The final step is a legend to explain the colour scheme
# bty = 'n' just means there will be no box around the legend
# the first argument gives the location of the legend
# then the labels
# lty = line type to display
# col = colours
legend("topright", c("Normal CI", "t CI"), bty = "n",
       lty= c(1,1), col = c("blue", "red"))

[Figure: Normal distribution density with the Normal (blue) and t-distribution (red) confidence intervals]

B5. Now you have tried the Normal distribution, do you think the Normal distribution is a good model for the uncertainty in your parameter estimate? Why/why not? In what situation might it become a better choice?

B6. Consider the Normal distribution confidence intervals. If you used these alone, would it change your decision to run or stay (question A5)? Explain your answer.


Part C: Reflection

When modelling, it is always important to consider how good our model and our data are.

C1. What other data might you want to improve your decision making? Could there be any problems with the current study design?


Part D: Feedback

D1. How do you think this exercise went? What do you think your group did well, what was a challenge? (2 examples of each)

D2. What do you think you improved from last week?

D3. What would you like feedback on this week?