TOTAL MARKS QU 1 = 45
A few years ago some researchers published an analysis of the winning times in the 100m sprint at the Olympics. They suggested that in about 2156 a woman would win in a faster time than the men’s champion.
Winning times in 100m for men and women at Olympic games
The women’s winning time across years was modelled by linear regression with year as a covariate and winning time (measured in seconds) as the response, see Figure 2.
mod.women <- lm(WinningTime ~ Year, data=Women100m)
summary(mod.women)
##
## Call:
## lm(formula = WinningTime ~ Year, data = Women100m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37579 -0.08460 0.00929 0.08285 0.32234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 44.347049 4.284251 10.35 1.70e-08 ***
## Year -0.016822 0.002176 -7.73 8.63e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2104 on 16 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.7888, Adjusted R-squared: 0.7756
## F-statistic: 59.76 on 1 and 16 DF, p-value: 8.626e-07
ANSWER (total = 4 marks)
4 marks for : T = alpha + beta * Year + e, where T = time, alpha = intercept, beta = slope, Year = year, e = error.
ANSWER (total = 10 marks)
2 marks each for: linear response, Gaussian/normal residuals, equal variance for all fitted values, independence of residuals, no outliers
ANSWER (total = 3 marks)
1 mark each for: intercept, slope, and variance of error. We don’t usually care so much about the error variance, but it still needs to be estimated.
ANSWER (total = 2 marks)
1 mark each for: slope = -0.017, so change is a reduction of 0.017*4 = 0.068s.
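The arithmetic in this answer can be checked directly from the fitted slope. A minimal sketch, with the coefficient copied by hand from the summary(mod.women) output (the variable names are illustrative, not part of the original analysis):

```r
# Slope for Year copied from summary(mod.women), in seconds per year
slope_women <- -0.016822

# The Olympic Games are held every 4 years
years_between_games <- 4

# Expected change in winning time from one Games to the next
change_per_games <- slope_women * years_between_games
round(change_per_games, 3)
```

Using the unrounded slope gives a reduction of about 0.067s; the 0.068s in the answer comes from rounding the slope to 0.017 first. Either should be acceptable.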
The equivalent model for the change in men’s time was fitted (Figure 3):
mod.men <- lm(WinningTime ~ Year, data=Men100m)
summary(mod.men)
##
## Call:
## lm(formula = WinningTime ~ Year, data = Men100m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.263708 -0.052702 0.007381 0.080048 0.214559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.8264525 1.6796428 18.95 4.11e-15 ***
## Year -0.0110056 0.0008593 -12.81 1.13e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1347 on 22 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.8817, Adjusted R-squared: 0.8764
## F-statistic: 164 on 1 and 22 DF, p-value: 1.128e-11
ANSWER (total = 4 marks)
0.5 marks each for stating the parameters: slopes Women: -0.016822 and Men: -0.0110056. Intercepts Women: 44.35 and Men: 31.83.
1 mark for comments: e.g. Women’s slope is steeper and intercept higher than men’s.
Question does not ask to interpret further so no extra marks for that.
If we want to test whether the women’s times are changing at a different rate to the men’s, we need to fit a new model to the full data, with both the men’s and women’s races included. The code for this model fitting and the analysis of variance table are shown in Figure 4. Sex is a factor that codes whether the race was run by men (the intercept level) or women.
mod.full <- lm(WinningTime ~ Sex*Year, data=Olymp100m)
anova(mod.full)
## Analysis of Variance Table
##
## Response: WinningTime
## Df Sum Sq Mean Sq F value Pr(>F)
## Sex 1 8.5566 8.5566 293.5207 < 2.2e-16 ***
## Year 1 5.3937 5.3937 185.0198 3.507e-16 ***
## Sex:Year 1 0.2292 0.2292 7.8615 0.007911 **
## Residuals 38 1.1078 0.0292
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANSWER (total = 2 marks)
1 mark for the Sex:Year line. 1 mark for F-distribution.
ANSWER (total = 3 marks)
1 mark each for: Yes, the model suggests a difference. Test statistic: (F-value) 7.86, p-value: 0.0079 (if previous question is wrong, give marks if extracted the correct values for the wrong hypothesis).
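As a check on the values quoted in this answer, the p-value can be recomputed from the F statistic and its degrees of freedom. This is only a sketch using numbers copied by hand from the Sex:Year row of the anova table; the t value is the SexWomen:Year entry reported in summary(mod.full).

```r
# Values copied from the Sex:Year row of anova(mod.full)
f_value <- 7.8615
df_num  <- 1    # degrees of freedom for the interaction
df_den  <- 38   # residual degrees of freedom

# Upper-tail probability under the F distribution
p_value <- pf(f_value, df_num, df_den, lower.tail = FALSE)
p_value

# With 1 numerator df, the F statistic is the square of the t statistic
# for the same term (SexWomen:Year in summary(mod.full))
t_value <- -2.804
t_value^2
```

Because the interaction has a single degree of freedom, the F test here and the t test for SexWomen:Year are the same test, which is why both report p = 0.0079.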
The summary of the model is in Figure 5.
summary(mod.full)
##
## Call:
## lm(formula = WinningTime ~ Sex * Year, data = Olymp100m)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37579 -0.05460 0.00738 0.08276 0.32234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.826453 2.128910 14.950 < 2e-16 ***
## SexWomen 12.520596 4.076141 3.072 0.00392 **
## Year -0.011006 0.001089 -10.104 2.56e-12 ***
## SexWomen:Year -0.005817 0.002074 -2.804 0.00791 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1707 on 38 degrees of freedom
## (12 observations deleted due to missingness)
## Multiple R-squared: 0.9275, Adjusted R-squared: 0.9218
## F-statistic: 162.1 on 3 and 38 DF, p-value: < 2.2e-16
ANSWER (total = 4 marks)
2 marks each for: Men: Time = 31.83 - 0.011 * Year, Women: Time = 31.83 + 12.52 + (-0.011 - 0.0058) * Year = 44.35 - 0.0168 * Year
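The two fitted lines can be assembled from the interaction model's coefficients as follows. A minimal sketch, assuming the estimates are copied by hand from summary(mod.full); the object names are illustrative.

```r
# Coefficients copied from summary(mod.full)
b0      <- 31.826453   # (Intercept): men are the reference level
b_women <- 12.520596   # SexWomen: shift in intercept for women
b_year  <- -0.011006   # Year: slope for men
b_int   <- -0.005817   # SexWomen:Year: extra slope for women

# Intercept and slope of each fitted line
men   <- c(intercept = b0,           slope = b_year)
women <- c(intercept = b0 + b_women, slope = b_year + b_int)
men
women

# Year at which the two fitted lines cross:
# b0 + b_year * x = b0 + b_women + (b_year + b_int) * x
crossing_year <- -b_women / b_int
crossing_year
```

The women's line reproduces the coefficients of mod.women up to rounding. With these rounded estimates the lines cross shortly after 2152, in the same region as the 2156 quoted at the start of the question.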
ANSWER (total = 2 marks)
2 marks for: The women’s times decrease 0.0058s per year faster than the men’s, i.e. 0.023s faster per Olympic Games (every 4 years)
ANSWER (total = 2 marks)
2 marks for: Because that will be when a woman first records a negative time.
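The question text is not reproduced here, so the following is only a sketch under the assumption that the answer refers to the year in which the fitted women's line reaches zero seconds. The coefficients are copied from summary(mod.women).

```r
# Coefficients copied from summary(mod.women)
intercept_women <- 44.347049
slope_women     <- -0.016822

# Solve intercept + slope * year = 0 for year
zero_year <- -intercept_women / slope_women
zero_year
```

Any prediction after this year would be a negative winning time, which is physically impossible.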
ANSWER (total = 4 marks)
2 marks for a biological interpretation: a statement that no, it is not reasonable, because predicting too far beyond the observed years gives unrealistic predictions.
2 marks for a statistical interpretation: the model is appropriate for the data, and the predictions it generates make sense given the current data. Here you do not consider what the predictions mean, and focus only on the type of data and the mathematics used to represent it. It is a good description of the current data.
We check the model in Figure 6. The normal probability plots are shown, along with a plot of the residuals against the fitted values.
ANSWER (total = 2 marks)
2 marks for No. The plot suggests that the tails are too thick.
ANSWER (total = 2 marks)
2 marks for: Either: Yes, there doesn’t seem to be much pattern OR: No, it seems heteroscedastic (the variance increases with the fitted value). This is slightly subjective. Whichever answer is chosen must be justified.
ANSWER (total = 1 mark)
1 mark for either: Fit a quadratic term, or use a Box-Cox transformation (or something else sensible).
TOTAL MARKS QU 2 = 30
The North American Breeding Bird Survey is conducted across North America to estimate the abundances of birds. Bird watchers go to sites across North America and count the number of birds they observe for a set time. This data can be used to ask a wide range of questions. Here we can look at the effects of climate on abundance of the house sparrow.
The data set consists of 1714 observations at different sites across North America. For each site the number of birds observed (Count) is recorded. Mean temperature (temp.mean.sc) and total precipitation (prec.mean.sc) are extracted from a standard database. Both the temperature and precipitation were centred and scaled, so that a temperature value of 0.3 would mean that the temperature is 0.3 standard deviations above the mean.
First, a generalised linear model (Poisson regression) was fitted to the count of the number of sparrows, with mean temperature and total precipitation (both centred and scaled) as explanatory variables (Figure 7).
mod.sparrows <- glm(Count ~ prec.mean.sc + temp.mean.sc,
family=poisson, data=Data)
summary(mod.sparrows)
##
## Call:
## glm(formula = Count ~ prec.mean.sc + temp.mean.sc, family = poisson,
## data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0784 -2.0569 -0.9564 0.9732 8.9266
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.854271 0.009564 193.889 < 2e-16 ***
## prec.mean.sc 0.040401 0.010177 3.970 7.19e-05 ***
## temp.mean.sc -0.045191 0.010200 -4.431 9.39e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 8452.4 on 1713 degrees of freedom
## Residual deviance: 8425.5 on 1711 degrees of freedom
## AIC: 14112
##
## Number of Fisher Scoring iterations: 5
ANSWER (total = 7 marks)
1 mark each for the assumptions: Poisson distribution of error, linear response on link scale, variance controlled by mean (equal in this case), no outliers, independence of residuals.
2 marks for: the below equation where site is indicated by i. Y = number of sparrows, T = temperature and R = precipitation, beta = slope values of R and T.
\[ \begin{aligned} \mathrm{E}(Y_i) &= \exp(\alpha + \beta_T T_i + \beta_R R_i) \end{aligned} \]
ANSWER (total = 5 marks)
1 mark for the estimate and 2 each for confidence intervals. Estimate: 0.040401 (1 mark), 95% CI: 0.040401 +/- 1.96*0.010177= (0.0205, 0.0603)
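The interval above is a Wald interval on the link (log) scale. A minimal sketch of the calculation, with the estimate and standard error copied from the prec.mean.sc row of summary(mod.sparrows):

```r
# Estimate and standard error for prec.mean.sc from summary(mod.sparrows)
est <- 0.040401
se  <- 0.010177

# 97.5% quantile of the standard normal (approximately 1.96)
z <- qnorm(0.975)

# 95% Wald confidence interval on the log scale
ci <- est + c(-1, 1) * z * se
round(ci, 4)

# The same interval as a multiplicative change in expected count
# per standard deviation of precipitation
exp(ci)
```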
ANSWER (total = 2 marks)
1 mark for “overdispersed”, 2nd for reporting stats too. Overdispersion: the residual deviance is 8425.5 on 1711 df, so the deviance ratio is 4.9. We assume it should be about 1 for a Poisson model.
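The deviance ratio quoted here is simply the residual deviance divided by its degrees of freedom. A short sketch with the two values copied from summary(mod.sparrows):

```r
# Residual deviance and degrees of freedom from summary(mod.sparrows)
resid_dev <- 8425.5
resid_df  <- 1711

# Ratio should be close to 1 if the Poisson assumption holds
dispersion <- resid_dev / resid_df
round(dispersion, 1)
```

With a fitted model object the same quantity is mod.sparrows$deviance / mod.sparrows$df.residual, which is how disp is computed for the quadratic model in Figure 8.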
What is the predicted number of sparrows in the following sites:
at a site near Seattle where the mean temperature is 0.3 standard deviations below the mean, and mean precipitation is 0.3 standard deviations above the mean?
at a site just west of Seattle, in a temperate rainforest, where the mean temperature is 0.4 standard deviations below the average, but the precipitation is 5.8 standard deviations above the average.
ANSWER (total = 4 marks)
2 marks for (one for answer, one for showing work): exp(1.854 + 0.3*0.0404 + (-0.3*-0.0452)) = exp(1.88) = 6.55
2 marks for (one for answer, one for showing work): exp(1.854 + 5.8*0.0404 + (-0.4*-0.0452)) = exp(2.107) = 8.22
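These two predictions can be reproduced by hand from the coefficient table. A minimal sketch, assuming the estimates are copied from summary(mod.sparrows); predict_count is a hypothetical helper written for this illustration, not a function from the original analysis.

```r
# Coefficients copied from summary(mod.sparrows)
coefs <- c(intercept = 1.854271, prec = 0.040401, temp = -0.045191)

# Hypothetical helper: expected count for scaled precipitation and temperature
predict_count <- function(prec, temp) {
  eta <- coefs[["intercept"]] + coefs[["prec"]] * prec + coefs[["temp"]] * temp
  exp(eta)  # inverse of the log link
}

# Site 1: temperature 0.3 SD below the mean, precipitation 0.3 SD above
site1 <- predict_count(prec = 0.3, temp = -0.3)
# Site 2: temperature 0.4 SD below the mean, precipitation 5.8 SD above
site2 <- predict_count(prec = 5.8, temp = -0.4)

c(site1 = site1, site2 = site2)
```

With the fitted object available, predict(mod.sparrows, newdata, type = "response") should give the same values, where newdata is a data frame with columns prec.mean.sc and temp.mean.sc.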
From ecological theory, we might expect that there is an optimum temperature and precipitation for the sparrows, and abundance declines when the conditions are further away from the optimum. We can model this using a quadratic curve. A model for this was fitted (Figure 8).
mod.sparrows2 <- glm(Count ~ prec.mean.sc + temp.mean.sc + I(prec.mean.sc^2) + I(temp.mean.sc^2),
family=poisson, data=Data)
disp <- mod.sparrows2$deviance/mod.sparrows2$df.residual
summary(mod.sparrows2, dispersion=disp)
##
## Call:
## glm(formula = Count ~ prec.mean.sc + temp.mean.sc + I(prec.mean.sc^2) +
## I(temp.mean.sc^2), family = poisson, data = Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1822 -1.9344 -0.8248 0.9953 10.7981
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.09946 0.03028 69.332 < 2e-16 ***
## prec.mean.sc -0.00121 0.02534 -0.048 0.962
## temp.mean.sc 0.03841 0.02605 1.475 0.140
## I(prec.mean.sc^2) -0.10552 0.02085 -5.061 4.17e-07 ***
## I(temp.mean.sc^2) -0.16822 0.02181 -7.713 1.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 4.614599)
##
## Null deviance: 8452.4 on 1713 degrees of freedom
## Residual deviance: 7886.4 on 1709 degrees of freedom
## AIC: 13577
##
## Number of Fisher Scoring iterations: 5
ANSWER (total = 4 marks)
2 marks for Yes it does. The theory would suggest that the quadratic terms should be negative.
2 marks for I(prec.mean.sc^2) and I(temp.mean.sc^2) and they are both negative.
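A negative quadratic coefficient means the curve on the log scale has a maximum, located at minus the linear coefficient divided by twice the quadratic coefficient. The sketch below computes where the fitted optimum lies; the coefficients are copied from the summary of mod.sparrows2 and the variable names are illustrative.

```r
# Coefficients copied from summary(mod.sparrows2)
b_prec  <- -0.00121   # prec.mean.sc
b_prec2 <- -0.10552   # I(prec.mean.sc^2)
b_temp  <-  0.03841   # temp.mean.sc
b_temp2 <- -0.16822   # I(temp.mean.sc^2)

# Vertex of a quadratic b1 * x + b2 * x^2 is at x = -b1 / (2 * b2);
# it is a maximum when b2 < 0
opt_prec <- -b_prec / (2 * b_prec2)
opt_temp <- -b_temp / (2 * b_temp2)

# Optima in standard deviations from the mean
c(prec = opt_prec, temp = opt_temp)
```

Both fitted optima are close to zero, i.e. close to the average conditions across the surveyed sites.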
ANSWER (total = 4 marks): the predictions should now include the quadratic terms too
2 marks for (one for answer, one for showing work): exp(2.0995 - 0.0012*0.3 + 0.0384*(-0.3) - 0.1055*0.3^2 - 0.1682*(-0.3)^2) = exp(2.063) = 7.87.
2 marks for (one for answer, one for showing work): exp(2.0995 - 0.0012*5.8 + 0.0384*(-0.4) - 0.1055*5.8^2 - 0.1682*(-0.4)^2) = exp(-1.50) = 0.22.
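The quadratic predictions can be checked in the same way as the linear ones. This is a sketch with coefficients copied from the summary of mod.sparrows2; predict_count2 is a hypothetical helper. Note that in R, -0.3^2 evaluates to -0.09 because ^ binds more tightly than unary minus, so a negative covariate must be squared as (-0.3)^2 or via a variable, as below.

```r
# Coefficients copied from summary(mod.sparrows2)
coefs2 <- c(intercept = 2.09946,
            prec = -0.00121, temp = 0.03841,
            prec2 = -0.10552, temp2 = -0.16822)

# Hypothetical helper: expected count under the quadratic model
predict_count2 <- function(prec, temp) {
  eta <- coefs2[["intercept"]] +
    coefs2[["prec"]]  * prec   + coefs2[["temp"]]  * temp +
    coefs2[["prec2"]] * prec^2 + coefs2[["temp2"]] * temp^2
  exp(eta)  # inverse of the log link
}

# Squaring the variable avoids the precedence problem with -0.3^2
site1 <- predict_count2(prec = 0.3, temp = -0.3)
site2 <- predict_count2(prec = 5.8, temp = -0.4)

c(site1 = site1, site2 = site2)
```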
ANSWER (total = 4 marks)
1 mark for noting each difference in prediction: The wet site (site 2) has a very different prediction, because it is extremely wet. The drier site (site 1) also changed slightly; the prediction is higher from the quadratic model.
2 marks for explaining why: in the linear model the predicted abundance can only keep increasing with rainfall, whereas the quadratic model is more flexible and allows it to decrease at extreme values.
TOTAL MARKS FOR WHOLE PAPER = 75