Exercise 2: Normal distribution / t-tests

Instructions:

The same as Exercise 1.

Hints and reminders are in italics.

Questions appear in blue.

Answers are in red or as R code!

Update! Remember: when downloading the data, do not open it in Excel, modify it, and re-save it. This messes with the formatting.


The challenge: Should you try to outrun a zombie?

It is day 40 after the event. An infection has been spreading through the world. Its symptoms produce a zombie-like state of reduced cognitive function coupled with increased aggression. 50% of the population has already been infected. It is not known how many are immune.

You are a group of statisticians who have managed to remain safe inside your office. You now need to decide whether to stay here or try to run to a different location. But can you outrun the zombies? From TV and film you have pieced together mixed information on zombie abilities. Some sources suggest zombies are slow, only walking or shuffling. Others show that zombies can move incredibly fast, even running. The only way to find out how these zombies behave is with data.

Luckily! One of your team has a dataset that might help.

One infection outbreak began at the 2018 athletics tournament in Paris. The 100m final had been run two days before an outbreak occurred during the 100m relay. For some reason, the newly infected athletes still decided to run the relay event. So you have data on the 100m times for the same runners both before and after infection.


The data can be found here: https://www.math.ntnu.no/emner/ST2304/2019v/Week4/Ztimes100.csv

It is your job to analyse the data and decide whether to run or stay.


The first step is to import the data and assign it to an object. You can use the whole web link above to import the data. It is a csv file with column names (header) included.

Ztimes100 <- read.csv("https://www.math.ntnu.no/emner/ST2304/2019v/Week4/Ztimes100.csv", header=T)
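
Before going further, it is worth a quick look at what you have imported (an optional check using the base R functions str() and head()):

# Check the structure and first few rows of the data
str(Ztimes100)
head(Ztimes100)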

The next step is to plot the data; we always want to look at our data before we start doing anything else.

We have made a graph for you (you do not need to know all the code used to do it). The grey lines show which times are from the same athlete. The means are marked in red.

# To make the plot it is easier if we re-organise our data
# we want to have a column indicating if the time was before or after
# infection and a column with all the times.

# We do this by making a data frame.
# We will code before and after as 0 and 1 to make plotting easier.
# 0 = before, 1 = after. We use rep to repeat 0 and 1 ten times each.
# We join together the before and after times into new column Times.
# Athlete is created by repeating 1 to 10 twice.
Ztimes100_2 <- data.frame(Group = rep(c(0,1), each = 10),
                          Times = c(Ztimes100$times_before, 
                                    Ztimes100$times_after),
                          Athlete = rep(1:10, 2))

# Then plot all of the data
# xaxt = 'n' tells R not to draw an x axis, we will do this ourselves
plot(Ztimes100_2$Group, Ztimes100_2$Times, pch = 16, xaxt = 'n',
     xlim = c(-0.5,1.5), ylim = c(5,15), xlab = "Group", ylab = "Times (s)")
# draw the x axis (side = 1) and give appropriate labels
axis(side = 1, at = c(0,1), labels = c("Before", "After"))

# Then mark on the means in red
points(c(0,1), c(mean(Ztimes100$times_before),
                 mean(Ztimes100$times_after)), col = "red", pch = 4, lwd = 3)

# Finally, draw a grey line joining each athlete's before and after times
# by looping through the 10 athletes
for(i in 1:10){
  lines(c(0,1), c(Ztimes100$times_before[i], Ztimes100$times_after[i]),
        col = "grey50")
}

[Figure: 100m times before and after infection, with the group means marked in red]

To analyse this data you will be using a paired t-test.

1. Why does the t-test need to be paired?

It needs to be paired because there is dependence in the data: the same athletes appear in both groups. We need to compare each athlete's time post-infection to their own time pre-infection.

2. What are you trying to find out using a t-test? (What does it model?)

A t-test compares the means of two groups. It allows us to model whether two groups have different means. In a paired t-test we ask whether the mean difference (the mean of the paired differences) between the two groups is different from zero.

A t-test can be conducted in several ways in R:

The first way is to use the lm() function. Take a look at the function using ?lm. You give lm() a formula of the form lm(y ~ x).

You will need to create a vector of the differences between the times before and after infection; using these differences is what makes the test paired. This will be y.

Here x is simply 1, as we are looking only at the mean. This makes it an intercept-only model.

Remember: which column you subtract from the other will influence how you interpret the answer.

# Create a vector of paired differences
# the - symbol automatically subtracts each element of a vector from the
# same location in another vector (creates paired differences)
differences <- Ztimes100$times_before - Ztimes100$times_after

# run a paired t-test on the differences between before and after 100m times.
t_test1 <- lm(differences ~ 1)

# show the output of the lm, this displays:
# - model you have run
# - the quantiles of the residuals (distance between data and estimate)
# - values of the coefficient (here:
# * estimate = the mean difference
# * Std. Error = the standard error of the mean
# * t value = the calculated t statistic
# * Pr(>|t|) = the p-value)
# - Significance codes (ignore these)
# - Residual error and degrees of freedom
summary(t_test1)
## 
## Call:
## lm(formula = differences ~ 1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7138 -0.3520  0.0018  0.3205  0.6097 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1707     0.1393   15.58  8.1e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4405 on 9 degrees of freedom
# We can show the rounded 95% confidence intervals like this
round(confint(t_test1),2)
##             2.5 % 97.5 %
## (Intercept)  1.86   2.49

3. What are the confidence intervals for this t-test (what values are the upper and lower bound)? Describe what they mean.

The lower bound is 1.86 seconds and the upper is 2.49 seconds.
The definition of the confidence interval is: if we repeated our sampling many times and took a confidence interval each time, 95% of those intervals would contain the true population parameter. In this case that is the true difference between zombie and human 100m times. Here we can use the confidence interval to indicate uncertainty around our estimate. The interval is not too wide (< 1 second), so we do not have high uncertainty.

When drawing conclusions, we would want to take this uncertainty into account. For example, we might choose not to run because the difference could plausibly be as large as 2.49 seconds (over 100m that is a lot!), even if we might have felt happy about running with a 1.86 second difference. Note also that the interval does not span zero: a difference of zero is not among the plausible values, so we would have been unlikely to see these results if there were truly no difference.

The second way is by using R as a calculator.

The formula for a t-test is:

t = mean of the data / standard error of the mean

The formula for the standard error is:

standard error = standard deviation / √n (where n is the sample size)

4. Using the formulas above, calculate the T statistic in R. What value do you get? (include a copy and paste of your code)

R code below.

# calculate the standard error
standard_error <- sd(differences)/sqrt(length(differences))

# calculate the t statistic
t_test2 <- mean(differences)/standard_error

# display it: it should match the t value from summary(t_test1), about 15.58
t_test2

Have a look at your results from question 4 and compare them to the results from lm(). Hopefully they will be the same; this is one way to check you did question 4 correctly.
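
As a further cross-check (optional, not required by the exercise), base R's t.test() function can run the same paired test directly; its t statistic, p-value, and confidence interval should match the lm() output above.

# Optional: the same paired t-test using t.test()
# paired = TRUE matches each athlete's before and after times
t.test(Ztimes100$times_before, Ztimes100$times_after, paired = TRUE)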

5. Your team must decide whether to stay or go. Based on the data you have analysed, do you decide to try and outrun the zombies? Include reasons and the results in your answer (mean and CIs). hint: think about what the mean difference you calculated means in terms of 100m times

Based on the data you have, you should choose to stay. Zombies are faster by an average of 2.17 seconds, with 95% confidence intervals between 1.86 and 2.49 seconds. Even when we include uncertainty, we still estimate a difference of greater than zero. This is an instance where we should probably take a very cautious attitude to our results: even if our confidence intervals did span zero, so that no difference was included in our uncertainty about the effect, we would probably still choose not to run. This interplay between uncertainty and effect size (how big the difference is) is discussed nicely on p77 in The New Statistics with R.

Make sure the direction is correct! Zombies are faster.


6. Do you think you could have used a normal distribution here instead of a t-distribution? Say why/why not. After all, our y variable (the differences) should be normally distributed.

It would not be the most appropriate distribution here because of our low sample size of 10 points. Once you get above about 30 data points, the t distribution becomes almost identical to the normal. Below that it is flatter, with heavier tails, which accounts for the higher uncertainty in small sample sizes. This is something you should know, but you could also test it, as we do below, or by plotting the two distributions (t and normal) for the correct degrees of freedom and comparing them, as in the sketch below.
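
For example, a minimal sketch (not part of the exercise) comparing the two densities for 9 degrees of freedom (our sample size minus one):

# Compare the standard normal density with the t density for df = 9
x <- seq(-4, 4, 0.01)
plot(x, dnorm(x), type = "l", xlab = "x", ylab = "Density",
     main = "Normal vs t (df = 9)")
lines(x, dt(x, df = 9), col = "red")
legend("topright", c("Normal", "t (df = 9)"),
       lty = 1, col = c("black", "red"), bty = "n")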

One way to test whether using a normal distribution instead of a T distribution would work is to try it!

7. How many parameters does the normal distribution have?

Two, the mean and the standard deviation.

To model the differences in 100m times as a normal distribution you can use the dnorm() function (similar to what we did last week with the Binomial and Poisson distributions). This allows us to calculate the likelihood, which for a continuous distribution is the density.
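
For a single data point this looks like the following (the numbers here are purely illustrative):

# Density (likelihood) of observing a difference of 2.2
# under a normal distribution with mean 2 and sd 0.5
dnorm(2.2, mean = 2, sd = 0.5)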

# First we write a short function that lets us sum the 
# log likelihoods for all data points to give one
# likelihood measure for the whole dataset

# this line sets up the function
Likelihood_calculation <- function(data, mean, sd) { 
  # this line sums the log likelihoods
  likelihood <- sum(dnorm(data, mean, sd, log=TRUE))
  # this line returns the likelihood for the whole dataset
  return(likelihood)
}
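
# For example, we can evaluate this function at a single candidate mean
# (just an illustration; the value returned depends on the data)
Likelihood_calculation(data = differences, mean = 2, sd = sd(differences))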

# next set up a vector of different mean values to 
# get the likelihood for
means <- seq(1,3,0.01)

# Now use the 'Likelihood_calculation' function we made earlier.
# Use it inside a function called sapply.
# You don't need to know much about this function,
# but it runs another function for all values of a vector (here 'means')
likelihoods.norm <- sapply(means, Likelihood_calculation, 
                           data = differences, sd = sd(differences))

# We can look at a few of the likelihoods (exp back transforms the log)
head(exp(likelihoods.norm))
## [1] 1.899876e-18 3.464348e-18 6.284629e-18 1.134226e-17 2.036486e-17
## [6] 3.637684e-17
# WOW! Those are really small numbers. 
# So when we plot we subtract the maximum likelihood to scale the 
# numbers and make them a bit more relatable. 
# plot the outcome
plot(means, exp(likelihoods.norm-max(likelihoods.norm)), 
     type="l", xlab="Mean", ylab="Density", main = "Normal likelihood")

[Figure: normal likelihood curve for the mean of the differences]

# Find the maximum likelihood estimate for the normal distribution
# this code works the same way as optimise did in Exercise 1
MLE_norm <- optimise(Likelihood_calculation, lower = 0, upper = 10,
                     data = differences,
                     sd = sd(differences), maximum = TRUE)$maximum

# show the rounded result
round(MLE_norm,2)
## [1] 2.17

You have found that the maximum likelihood estimate for the mean of the differences is 2.17. It is not surprising that this is the same as the mean difference calculated by the t-test above. But is the estimate of uncertainty the same? To find out we need to calculate the confidence intervals.
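
As a quick sanity check (optional): for a normal likelihood, the maximum likelihood estimate of the mean is simply the sample mean, so the two should agree.

# The sample mean should match the optimise() result of 2.17
round(mean(differences), 2)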

# Calculate the confidence intervals for our normal maximum likelihood
# estimate. 

# First save an object that is the standard error
standard_error_norm <- sd(differences)/sqrt(length(differences))

# Then add and subtract 1.96 times the standard error
# from the maximum likelihood estimate
CI_upper <- MLE_norm + 1.96*standard_error_norm
CI_lower <- MLE_norm - 1.96*standard_error_norm
# display the confidence intervals to 2 decimal places
round(c(CI_lower, CI_upper),2)
## [1] 1.90 2.44

The confidence interval calculated from the Normal distribution is 1.90 to 2.44.

8. How do these confidence intervals compare to those calculated from the t distribution?

You should have those written down from question 3. But we will print them below too.

The Normal confidence intervals are narrower than those calculated from the T distribution.

# To extract only the confidence intervals from the lm() output
# we can use the code below.
round(confint(t_test1),2)
##             2.5 % 97.5 %
## (Intercept)  1.86   2.49
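
The difference comes from the critical value used to build each interval. A short sketch (not part of the exercise): with 9 degrees of freedom, the t distribution's 97.5% quantile is larger than the standard normal's, so t intervals are wider.

# Critical values for a 95% confidence interval
qt(0.975, df = 9)  # t distribution with 9 degrees of freedom: about 2.26
qnorm(0.975)       # standard normal: about 1.96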

You can also plot the different confidence intervals. Here we plot them both on the Normal distribution density plot.

# First create a plot with the Normal distribution density.
# We use xlim and ylim to set the limits of our axes to make
# the plot look nicer and be easier to read. 
plot(means, exp(likelihoods.norm-max(likelihoods.norm)), 
     type="l", ylab="Density", xlab="Mean", xlim = c(1.5,3), ylim=c(0,1.1))
# We then add a point for the maximum likelihood estimate of the mean.
points(MLE_norm, 1, col = "red", pch=16)

# Then add the T distribution confidence intervals in red
abline(v = c(confint(t_test1)[1], confint(t_test1)[2]), col = "red")
# and the Normal distribution confidence intervals in blue
abline(v = c(CI_lower, CI_upper), col = "blue")

# The final step is a legend to explain the colour scheme
# bty = 'n' just means there will be no box around the legend
legend("topright", c("Normal CI", "T CI"), bty = "n",
       lty= c(1,1), col = c("blue", "red"))

[Figure: normal likelihood density with the t (red) and Normal (blue) confidence intervals marked]

9. Now you have tried the Normal distribution, do you think the Normal distribution is a good model for this data? Why/why not? In what situation might it become better?

It is not a great model here because the confidence intervals are too tight. They underestimate the uncertainty because we have only a few data points. The t distribution captures the uncertainty more accurately.

The Normal distribution will be more appropriate with more data. You can always test this out by creating more data and running all the code again, as in the sketch below.
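
For example, a hypothetical sketch (the simulated values are made up for illustration, using a mean of 2.17 and a standard deviation close to our residual standard error of 0.44):

# Simulate a larger sample of differences
set.seed(42)
big_differences <- rnorm(100, mean = 2.17, sd = 0.44)
n <- length(big_differences)
se <- sd(big_differences)/sqrt(n)

# Normal-based 95% confidence interval
mean(big_differences) + c(-1, 1) * qnorm(0.975) * se
# t-based 95% confidence interval: with n = 100 it is almost identical
mean(big_differences) + c(-1, 1) * qt(0.975, df = n - 1) * se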

10. Consider the Normal distribution confidence intervals. If you use these only, would it change your decision to run or stay (question 5)? Explain your answer

Using the Normal distribution confidence intervals should not change your decision. Neither set of CIs crosses zero, and both indicate a difference of greater than 1.5 seconds. The direction of the difference is that zombies are faster than humans. But there could be scenarios where using the Normal distribution, when you should use a t-distribution, would lead to overconfidence in the estimate, which could lead to a wrong (or at least bad) decision. E.g. if the Normal distribution confidence interval was -1.5 to -0.2 but the t-distribution confidence interval was -2 to 0.5, from the Normal distribution you might conclude that humans are faster than zombies, but from the t-distribution you can see this is actually more uncertain: being equal speed, or just slower, should also be considered plausible if the uncertainty is fully considered.

11. What other data might you want to improve your decision making? Could there be any problems with the current study design (are the zombies in our data representative)?

Any answer with a good reason provided is acceptable. Some examples: the zombies in this dataset were elite sprinters before infection, so they are probably much faster than a typical zombie and not representative of the wider infected population; times for infected non-athletes would help. Data over longer distances would also help, since a 100m time tells you little about zombie endurance.