Continuation Exam Solution

Problem 1: Olympic Records

TOTAL MARKS QU 1 = 45

A few years ago some researchers published an analysis of the winning times in the 100m sprint at the Olympics. They suggested that in about 2156 a woman would win in a faster time than the men’s champion.

Winning times in 100m for men and women at Olympic games

The women’s winning time across years was modelled by linear regression with year as a covariate and winning time (measured in seconds) as the response, see Figure 2.

mod.women <- lm(WinningTime ~ Year, data=Women100m)
summary(mod.women)

## 
## Call:
## lm(formula = WinningTime ~ Year, data = Women100m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37579 -0.08460  0.00929  0.08285  0.32234 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 44.347049   4.284251   10.35 1.70e-08 ***
## Year        -0.016822   0.002176   -7.73 8.63e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2104 on 16 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7756 
## F-statistic: 59.76 on 1 and 16 DF,  p-value: 8.626e-07

Write down the model for the women’s times in mathematical notation.

ANSWER (total = 4 marks)

4 marks for: T = alpha + beta * Year + e, where T = winning time (in seconds), alpha = intercept, beta = slope, Year = year of the games, e = error.

What assumptions are we making when using the model?

ANSWER (total = 10 marks)

2 marks each for: linear response, Gaussian/normal residuals, equal variance for all fitted values, independence of residuals, no outliers.

Which unknown parameters does the model contain?

ANSWER (total = 3 marks)

1 mark each for: intercept, slope, and variance of the error. We don’t usually care so much about the error variance, but it still needs to be estimated.
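The model, its assumptions and its three unknown parameters can be collected in one line. This is only a summary sketch: the index i (one winning time per games) and the symbol \(\sigma^2\) for the error variance are notation introduced here, not taken from the question.

\[ \begin{aligned} T_i &= \alpha + \beta \, \mathrm{Year}_i + e_i, \qquad e_i \sim N(0, \sigma^2) \text{ independently} \end{aligned} \]

The unknown parameters are \(\alpha\), \(\beta\) and \(\sigma^2\).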

What is the predicted change in times from one Olympic games to the next (i.e. from one games to the next, 4 years later)?

ANSWER (total = 2 marks)

1 mark each for: slope = -0.017 s per year, so the change over the 4 years between games is a reduction of 0.017*4 = 0.068s.
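The same arithmetic can be checked in R. This is a minimal sketch that types in the slope from the printed summary rather than refitting the model; the variable names are illustrative only.

```r
# Slope of the women's model as printed in the summary output (seconds per year)
slope_women <- -0.016822

# Olympic games are 4 years apart, so scale the per-year change by 4
change_per_games <- 4 * slope_women
round(change_per_games, 3)
```

With the unrounded slope the reduction is about 0.067s; rounding the slope to -0.017 first gives 0.068s. Both are acceptable. If the fitted object is available, `4 * coef(mod.women)["Year"]` gives the same quantity.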

The equivalent model for the change in men’s time was fitted (Figure 3):

mod.men <- lm(WinningTime ~ Year, data=Men100m)
summary(mod.men)

## 
## Call:
## lm(formula = WinningTime ~ Year, data = Men100m)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.263708 -0.052702  0.007381  0.080048  0.214559 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31.8264525  1.6796428   18.95 4.11e-15 ***
## Year        -0.0110056  0.0008593  -12.81 1.13e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1347 on 22 degrees of freedom
##   (3 observations deleted due to missingness)
## Multiple R-squared:  0.8817, Adjusted R-squared:  0.8764 
## F-statistic:   164 on 1 and 22 DF,  p-value: 1.128e-11

Write down the estimates for the regression parameters from the men’s and women’s models, and comment on the values.

ANSWER (total = 4 marks)

0.5 marks each for stating the parameters: slopes Women: -0.016822 and Men: -0.0110056. Intercepts Women: 44.35 and Men: 31.83.

1 mark for comments: e.g. Women’s slope is steeper and intercept higher than men’s.

Question does not ask to interpret further so no extra marks for that.

If we want to test whether the women’s times are changing at a different rate to the men’s, we need to fit a new model to the full data, with both the men’s and women’s races included. The code for this model fitting and the analysis of variance table are shown in Figure 4. Sex is a factor that codes whether the race was run by men (the reference level, which is absorbed into the intercept) or women.

mod.full <- lm(WinningTime ~ Sex*Year, data=Olymp100m)
anova(mod.full)

## Analysis of Variance Table
## 
## Response: WinningTime
##           Df Sum Sq Mean Sq  F value    Pr(>F)    
## Sex        1 8.5566  8.5566 293.5207 < 2.2e-16 ***
## Year       1 5.3937  5.3937 185.0198 3.507e-16 ***
## Sex:Year   1 0.2292  0.2292   7.8615  0.007911 ** 
## Residuals 38 1.1078  0.0292                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which test in the ANOVA table tests whether the women’s times are changing at a different rate to the men’s? What distribution is used in the test?

ANSWER (total = 2 marks)

1 mark for the Sex:Year line. 1 mark for F-distribution.

Does the model suggest that the women’s times are changing at a different rate to the men’s times? What are the test statistic and p-value?

ANSWER (total = 3 marks)

1 mark each for: Yes, the model suggests a difference. Test statistic: (F-value) 7.86, p-value: 0.0079 (if previous question is wrong, give marks if extracted the correct values for the wrong hypothesis).
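As a check on how the p-value relates to the F-distribution, it can be recomputed from the test statistic and the degrees of freedom in the ANOVA table. This is a sketch using the printed (rounded) values, so the result should agree with the table only to about four decimal places.

```r
# Values copied from the Sex:Year row and the Residuals row of the ANOVA table
f_value <- 7.8615
df_term <- 1
df_resid <- 38

# Upper-tail probability of the F distribution with (1, 38) degrees of freedom
p_value <- pf(f_value, df_term, df_resid, lower.tail = FALSE)
signif(p_value, 3)
```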

The summary of the model is in Figure 5.

summary(mod.full)

## 
## Call:
## lm(formula = WinningTime ~ Sex * Year, data = Olymp100m)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37579 -0.05460  0.00738  0.08276  0.32234 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   31.826453   2.128910  14.950  < 2e-16 ***
## SexWomen      12.520596   4.076141   3.072  0.00392 ** 
## Year          -0.011006   0.001089 -10.104 2.56e-12 ***
## SexWomen:Year -0.005817   0.002074  -2.804  0.00791 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1707 on 38 degrees of freedom
##   (12 observations deleted due to missingness)
## Multiple R-squared:  0.9275, Adjusted R-squared:  0.9218 
## F-statistic: 162.1 on 3 and 38 DF,  p-value: < 2.2e-16

What are the equations for the men’s times and for the women’s times?

ANSWER (total = 4 marks)

2 marks each for: Men’s: Time = 31.83 - 0.011 * Year, Women’s: Time = (31.83 + 12.52) + (-0.011 - 0.0058) * Year = 44.35 - 0.0168 * Year
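The two equations can be assembled from the coefficient table in R. This is a sketch with the estimates typed in from the printed summary; the names in `b` are illustrative labels, not the names R gives the coefficients.

```r
# Coefficients copied from the summary of mod.full (men are the reference level)
b <- c(intercept = 31.826453, sex_women = 12.520596,
       year = -0.011006, sex_women_year = -0.005817)

# Men: reference-level intercept and slope
men <- c(intercept = b[["intercept"]], slope = b[["year"]])

# Women: add the SexWomen offset to the intercept and the interaction to the slope
women <- c(intercept = b[["intercept"]] + b[["sex_women"]],
           slope = b[["year"]] + b[["sex_women_year"]])

round(rbind(men, women), 4)
```

The women’s intercept and slope recovered this way match the separate women’s fit (44.35 and -0.0168), as they should, because the interaction model gives each sex its own line.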

By how much are the women’s times changing compared to the men’s?

ANSWER (total = 2 marks)

2 marks for: The women’s times decrease 0.0058s per year faster than the men’s, i.e. 0.023s faster per Olympic games.

In a response to this model of winning times, a statistician suggested that the race in 2636 would, according to this analysis, be “far more interesting”. Why?

ANSWER (total = 2 marks)

2 marks for: Because that is about when the model predicts a woman would first record a negative time.
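Both extrapolated years can be recovered from the fitted lines. This is a rough sketch using the rounded coefficients printed in the two separate summaries, so the years are only approximate; extrapolating this far is very sensitive to small changes in the slopes.

```r
# Women's line from the separate fit: time = intercept + slope * year
intercept_women <- 44.347049
slope_women <- -0.016822
# Men's line from the separate fit
intercept_men <- 31.8264525
slope_men <- -0.0110056

# Year at which the women's fitted line reaches zero seconds
year_zero <- -intercept_women / slope_women

# Year at which the two fitted lines cross
year_cross <- (intercept_men - intercept_women) / (slope_women - slope_men)

round(c(zero = year_zero, cross = year_cross), 1)
```

The women’s line reaches zero in about 2636, and the two lines cross a few years before the 2156 games, which would then be the first games where the model predicts a faster women’s time.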

Do you think this is a reasonable model? Explain (briefly!) your thinking.

ANSWER (total = 4 marks)

2 marks for a biological interpretation: no, the model is not reasonable, because if we predict too far ahead it makes unrealistic predictions (e.g. negative times).

2 marks for a statistical interpretation: the model is appropriate for the data, and the predictions it generates make sense within the range of the current data. Here we do not consider what the predictions mean, and focus only on the type of data and the mathematics used to represent it. It is a good description of the current data.

We check the model in Figure 6. The normal probability plots are shown, along with a plot of the residuals against the fitted values.

Do the residuals look normally distributed? Explain your answer (in 1 or 2 sentences)

ANSWER (total = 2 marks)

2 marks for No. The plot suggests that the tails are too thick.

There is a concern that the effect of time is not linear. Do the residual plots support this?

ANSWER (total = 2 marks)

2 marks for: Either: Yes, there doesn’t seem to be much pattern OR: No, it seems heteroscedastic (the variance increases with the fitted value, i.e. the mean). This is slightly subjective. Whichever answer is chosen must be justified.

If you wanted to test whether there was a non-linear effect of time, how could you do it?

ANSWER (total = 1 mark)

1 mark for either: Fit a quadratic term, or use a Box-Cox transformation (or something else sensible).
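The quadratic option could look something like the sketch below. The exam data are not reproduced here, so the code runs on a simulated stand-in (`sim`) whose values are invented purely so the example is runnable. The centred variable `YearC` is also an assumption, added to keep the squared term numerically well behaved.

```r
# Simulated stand-in for Olymp100m; these are NOT the real winning times
set.seed(1)
sim <- data.frame(
  Year = rep(seq(1928, 2008, by = 4), times = 2),
  Sex = factor(rep(c("Men", "Women"), each = 21))
)
sim$WinningTime <- ifelse(sim$Sex == "Men",
                          31.8 - 0.011 * sim$Year,
                          44.3 - 0.0168 * sim$Year) +
  rnorm(nrow(sim), sd = 0.15)

# Centre the year before squaring (assumption, for numerical stability)
sim$YearC <- sim$Year - 1968

# Linear model as in the exam, and the same model with an added quadratic term
mod.lin <- lm(WinningTime ~ Sex * Year, data = sim)
mod.quad <- lm(WinningTime ~ Sex * Year + I(YearC^2), data = sim)

# F-test of the nested models: a small p-value would suggest curvature
tab <- anova(mod.lin, mod.quad)
tab
```

On the real data, the same comparison with `Olymp100m` in place of `sim` would test whether the quadratic term improves the fit.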

Problem 2

TOTAL MARKS QU 2 = 30

The North American Breeding Bird Survey is conducted across North America to estimate the abundances of birds. Bird watchers go to sites across North America and count the number of birds they observe for a set time. This data can be used to ask a wide range of questions. Here we can look at the effects of climate on abundance of the house sparrow.

The data set consists of 1714 observations at different sites across North America. For each site the number of birds observed (Count) is recorded. Mean temperature (temp.mean.sc) and total precipitation (prec.mean.sc) are extracted from a standard database. Both the temperature and precipitation were centred and scaled, so that a temperature value of 0.3 would mean that the temperature is 0.3 standard deviations above the mean.

First, a generalised linear model (Poisson regression) was fitted to the count of the number of sparrows, with mean temperature and total precipitation (both centred and scaled) as explanatory variables (Figure 7).

mod.sparrows <- glm(Count ~ prec.mean.sc + temp.mean.sc, 
               family=poisson, data=Data)
summary(mod.sparrows)

## 
## Call:
## glm(formula = Count ~ prec.mean.sc + temp.mean.sc, family = poisson, 
##     data = Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0784  -2.0569  -0.9564   0.9732   8.9266  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.854271   0.009564 193.889  < 2e-16 ***
## prec.mean.sc  0.040401   0.010177   3.970 7.19e-05 ***
## temp.mean.sc -0.045191   0.010200  -4.431 9.39e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 8452.4  on 1713  degrees of freedom
## Residual deviance: 8425.5  on 1711  degrees of freedom
## AIC: 14112
## 
## Number of Fisher Scoring iterations: 5

Write down the assumptions of this model and an equation for the expected number of sparrows at a site.

ANSWER (total = 7 marks)

1 mark each for the assumptions: Poisson distribution of error, linear response on link scale, variance controlled by mean (equal in this case), no outliers, independence of residuals.

2 marks for: the equation below, where sites are indexed by i. Y = number of sparrows, E(Y) = its expected value, T = temperature, R = precipitation, alpha = intercept, and beta_T, beta_R = slopes for T and R.

\[ \begin{aligned} E(Y_i) &= \exp(\alpha + \beta_T T_i + \beta_R R_i) \end{aligned} \]

What is the estimated coefficient for the effect of mean precipitation (prec.mean.sc)? And what is the 95% confidence interval for this estimate?

ANSWER (total = 5 marks)

1 mark for the estimate and 2 each for the lower and upper confidence limits. Estimate: 0.040401 (1 mark), 95% CI: 0.040401 +/- 1.96*0.010177 = (0.0205, 0.0603)
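The interval can be reproduced in R from the printed estimate and standard error. This is a sketch of the Wald interval used in the answer, with the two numbers typed in from the summary output.

```r
# Estimate and standard error for prec.mean.sc, copied from the summary output
est <- 0.040401
se <- 0.010177

# Wald 95% interval on the log (link) scale, using the normal quantile
z <- qnorm(0.975)
ci <- est + c(-1, 1) * z * se
round(ci, 4)
```

The interval is on the log scale. Taking `exp(ci)` turns it into a multiplicative effect on the expected count, roughly a 2% to 6% increase per standard deviation of precipitation. With the fitted object available, `confint.default(mod.sparrows)` computes the same kind of Wald interval.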

Are there any signs of over- or under-dispersion in the data?

ANSWER (total = 2 marks)

1 mark for “overdispersed”, 2nd for reporting stats too. Overdispersion: the residual deviance is 8425.5 on 1711 df, so the deviance ratio is 4.9. We assume it should be about 1 for Poisson.
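The deviance ratio is a one-line calculation. The sketch below uses the values printed in the summary; with the fitted object it would be `mod.sparrows$deviance / mod.sparrows$df.residual`.

```r
# Residual deviance and its degrees of freedom, copied from the summary output
resid_dev <- 8425.5
resid_df <- 1711

# Ratio near 1 is expected under the Poisson assumption; much larger means overdispersion
dispersion_ratio <- resid_dev / resid_df
round(dispersion_ratio, 2)

# If this ratio is used as the dispersion, standard errors scale by its square root
se_inflation <- sqrt(dispersion_ratio)
round(se_inflation, 2)
```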

What is the predicted number of sparrows in the following sites:
1. at a site near Seattle where the mean temperature is 0.3 standard deviations below the mean, and mean precipitation is 0.3 standard deviations above the mean?
2. at a site just west of Seattle, in a temperate rainforest, where the mean temperature is 0.4 standard deviations below the average, but the precipitation is 5.8 standard deviations above the average.

ANSWER (total = 4 marks)

2 marks for (one for answer, one for showing work): exp(1.85 + 0.3*0.040 + (-0.3*-0.045)) = exp(1.88) = 6.55

2 marks for (one for answer, one for showing work): exp(1.85 + 5.8*0.040 + (-0.4*-0.045)) = exp(2.1) = 8.2
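These two predictions can be checked by evaluating the fitted equation in R. This is a sketch with the coefficients typed in from the summary; `predict_linear` is a hypothetical helper written for this check, not a function from any package.

```r
# Coefficients copied from the summary of mod.sparrows
b0 <- 1.854271
b_prec <- 0.040401
b_temp <- -0.045191

# Hypothetical helper: expected count at given scaled covariate values
predict_linear <- function(prec, temp) {
  exp(b0 + b_prec * prec + b_temp * temp)
}

# Site 1: precipitation +0.3 SD, temperature -0.3 SD
site1 <- predict_linear(prec = 0.3, temp = -0.3)
# Site 2: precipitation +5.8 SD, temperature -0.4 SD
site2 <- predict_linear(prec = 5.8, temp = -0.4)

round(c(site1 = site1, site2 = site2), 2)
```

With the fitted object, `predict(mod.sparrows, newdata = data.frame(prec.mean.sc = c(0.3, 5.8), temp.mean.sc = c(-0.3, -0.4)), type = "response")` would give the same values up to rounding of the coefficients.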

From ecological theory, we might expect that there is an optimum temperature and precipitation for the sparrows, and abundance declines when the conditions are further away from the optimum. We can model this using a quadratic curve. A model for this was fitted (Figure 8).

mod.sparrows2 <- glm(Count ~ prec.mean.sc + temp.mean.sc + I(prec.mean.sc^2) + I(temp.mean.sc^2), 
               family=poisson, data=Data)
disp <- mod.sparrows2$deviance/mod.sparrows2$df.residual
summary(mod.sparrows2, dispersion=disp)

## 
## Call:
## glm(formula = Count ~ prec.mean.sc + temp.mean.sc + I(prec.mean.sc^2) + 
##     I(temp.mean.sc^2), family = poisson, data = Data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1822  -1.9344  -0.8248   0.9953  10.7981  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        2.09946    0.03028  69.332  < 2e-16 ***
## prec.mean.sc      -0.00121    0.02534  -0.048    0.962    
## temp.mean.sc       0.03841    0.02605   1.475    0.140    
## I(prec.mean.sc^2) -0.10552    0.02085  -5.061 4.17e-07 ***
## I(temp.mean.sc^2) -0.16822    0.02181  -7.713 1.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 4.614599)
## 
##     Null deviance: 8452.4  on 1713  degrees of freedom
## Residual deviance: 7886.4  on 1709  degrees of freedom
## AIC: 13577
## 
## Number of Fisher Scoring iterations: 5

Does the ecological theory seem reasonable for this data? What parameter values tell you about this?

ANSWER (total = 4 marks)

2 marks for: Yes it does. The theory would suggest that the quadratic terms should be negative.

2 marks for I(prec.mean.sc^2) and I(temp.mean.sc^2) and they are both negative.

What is the predicted number of sparrows for the two sites mentioned above?

ANSWER (total = 4 marks). The predictions should now include the quadratic terms too.

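2 marks for (one for answer, one for showing work): exp(2.1 - 0.0012*0.3 + 0.0384*(-0.3) - 0.1055*0.3² - 0.1682*(-0.3)²) = exp(2.06) = 7.87.

2 marks for (one for answer, one for showing work): exp(2.1 - 0.0012*5.8 + 0.0384*(-0.4) - 0.1055*5.8² - 0.1682*(-0.4)²) = exp(-1.50) = 0.22.

Note that the squared covariates are positive even when the covariate is negative, so both quadratic terms lower the prediction.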
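Because the signs in this calculation are easy to get wrong by hand, it is worth evaluating the quadratic model directly. The sketch below uses the coefficients typed in from the summary; `predict_quadratic` is a hypothetical helper written for this check.

```r
# Coefficients copied from the summary of mod.sparrows2
b <- c(int = 2.09946, prec = -0.00121, temp = 0.03841,
       prec2 = -0.10552, temp2 = -0.16822)

# Hypothetical helper: expected count under the quadratic model
predict_quadratic <- function(prec, temp) {
  # Squaring a negative covariate gives a positive value, so with negative
  # quadratic coefficients both squared terms reduce the linear predictor
  eta <- b[["int"]] + b[["prec"]] * prec + b[["temp"]] * temp +
    b[["prec2"]] * prec^2 + b[["temp2"]] * temp^2
  exp(eta)
}

site1 <- predict_quadratic(prec = 0.3, temp = -0.3)
site2 <- predict_quadratic(prec = 5.8, temp = -0.4)

round(c(site1 = site1, site2 = site2), 2)
```

With the unrounded coefficients this gives roughly 7.9 sparrows at site 1 and 0.22 at site 2.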

Comment briefly on the differences in the predictions from the linear and quadratic model.

ANSWER (total = 4 marks)

1 mark for noting each difference in prediction: The wet site (site 2) has a very different prediction, because it is extremely wet. The drier site (site 1) also changes slightly: its prediction is higher under the quadratic model.

2 marks for explaining why: in the linear model the predicted abundance has to keep increasing with rainfall, whereas the quadratic model is more flexible and allows abundance to decrease at extreme rainfall.

TOTAL MARKS FOR WHOLE PAPER = 75