Hints and reminders are in bold
Questions appear in blue.
The exam style questions are optional, but will be helpful for, well, the exam. They will also help you now to see if you understand the content.
(Only a subset of 20 horseshoe crabs in the original data was used in the exam problems, but we will use the full data set of 173 horseshoe crabs - so results will not be the same. Permitted aids for the exam was “all printed and handwritten material and all calculators” - NB that is not the case for 2018!)
We assume that the number of satellites for each female horseshoe
crab follows a Poisson distribution and want to perform a Poisson
regression using a log-link to find if there is a connection between the
expected number of satellites Sa and the width
W and colour C of the carapace of the female
horseshoe crab.
C: the color of the female horseshoe crab (1=light
medium, 2=medium, 3=dark medium, 4=dark)W: width of carapace (cm)The following model was fitted.
crab <- read.table("https://www.math.ntnu.no/emner/TMA4315/2018h/crab.txt")
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
crab <- crab[,-1] #remove column with Obs
crab$C <- as.factor(crab$C)
modelEXAM <- glm(Sa ~ W + C, family = poisson(link = log), data = crab, contrasts = list(C = "contr.sum"))
summary(modelEXAM)
library(car)
Anova(modelEXAM, type = "III", test.statistic = "Wald")
##
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab,
## contrasts = list(C = "contr.sum"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92089 0.56010 -5.215 1.84e-07 ***
## W 0.14934 0.02084 7.166 7.73e-13 ***
## C1 0.27085 0.11784 2.298 0.0215 *
## C2 0.07117 0.07296 0.975 0.3294
## C3 -0.16551 0.09316 -1.777 0.0756 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
##
## Analysis of Deviance Table (Type III tests)
##
## Response: Sa
## Df Chisq Pr(>Chisq)
## (Intercept) 1 27.196 1.839e-07 ***
## W 1 51.350 7.726e-13 ***
## C 3 8.518 0.03644 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a) Write down the estimated model and evaluate the significance of the estimated coefficients.
The estimated model is
\[ \begin{aligned} Y_i &\sim \text{Poisson}(\lambda_i) \\ \log{\lambda_i} &= -2.92 + 0.15 \text{W} + 0.27 X_1 + 0.07 X_2 - 0.17 X_3 + \xi X_4 \end{aligned} \]
where \(W\) is the width, and \(X_j\) is an indicator of whether the female is in colour category \(j\). \(\xi\) is the effect of \(X_4\), to be calculated later. Note that we use a sum to zero contrast.
The summary() gives Wald statistics and their associated
p-values. The effects of width (W) is significant:
p<<0. For the colour effects, C2 is clearly not significantly
different from C4: C1 and C3 are both just either side of the critical
value at 5%. It is notlcealr why the contrast to C4 is important,
though.
How would you proceed to evaluate the significance of the coefficient for the colour C4. Explain (but only do the calculation if you really feel like it).
First, work out what the coefficient is. Then explain what test you would use, and where you would get the components of it.
If you want to do the calculation, you will need to do a bit of matrix manipulation in R.The coefficient is \(-\sum_{i=1}^3{\beta_i} = -(0.27 + 0.07 - 0.17) = -0.17\)
We can calculate the standard error using the contrast, \(C\) and can plug this into
\[ H_0: {\bf C}{\bf \beta}={\bf d} \text{ vs. } H_1: {\bf C}{\bf \beta}\neq {\bf d} \]
i.e. \(C=\{0,0-1,-1,-1 \}\) (the 0s
are for the intercept and W term), and \(\beta = \{-2.92, 0.15, 0.27, 0.07,
-0.17\}\) so we test if \(-(0.27 + 0.07
- 0.17) = -0.17\) is different from 0. We can use the Wald
statistic:
\[ w=({\bf C} \hat{\boldsymbol{\beta}}-{\bf d})^{\text T}[{\bf C}F^{-1}(\hat{\beta}){\bf C}^{\text T}]^{-1}({\bf C}\hat{{\boldsymbol \beta}}-{\bf d}) \]
where \(F^{-1}(\hat{\beta})\) is the
covariance matrix, which we can extract with vcov():
vcov(modelEXAM)
## (Intercept) W C1 C2
## (Intercept) 0.313712405 -0.0116075311 0.0090846333 5.635922e-03
## W -0.011607531 0.0004343333 -0.0002800538 -3.120317e-04
## C1 0.009084633 -0.0002800538 0.0138864871 -2.197738e-03
## C2 0.005635922 -0.0003120317 -0.0021977383 5.323442e-03
## C3 -0.005792980 0.0001811855 -0.0042680583 2.192067e-05
## C3
## (Intercept) -5.792980e-03
## W 1.811855e-04
## C1 -4.268058e-03
## C2 2.192067e-05
## C3 8.679453e-03
C = t(as.vector(c(0, 0, -1, -1, -1)))
betahat <- coef(modelEXAM)
d <- 0
invF <- vcov(modelEXAM)
# solve() calculates the inverse of the matrix
(w <- t(C %*% betahat - d) %*% solve(C %*% invF %*% t(C)) %*% (C %*% betahat - d))
## [,1]
## [1,] 2.076764
(the calculation of w involves a bit of over-kill and
matrix operations)
We are testing one hypothesis , so \(r=1\), so we compare this to a \(\chi^1_1\) distribution:
pchisq(w, 1, lower.tail = FALSE)
## [,1]
## [1,] 0.1495569
So this is not significant at 5%.
Perform a test to evaluate the fit of the model. What is your conclusion? Use a 5% significance level for your evaluations.
We are asking if any variation not explained by the model could just be random Poisson noise.
We want to test if the residual deviance follows a \(\chi^2\) distribution. Luckily
summary() gives us what we need:
summary(modelEXAM)
##
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab,
## contrasts = list(C = "contr.sum"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92089 0.56010 -5.215 1.84e-07 ***
## W 0.14934 0.02084 7.166 7.73e-13 ***
## C1 0.27085 0.11784 2.298 0.0215 *
## C2 0.07117 0.07296 0.975 0.3294
## C3 -0.16551 0.09316 -1.777 0.0756 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
The residual deviance is 559.344785, with 168 degrees of freedom. The p-value is thus:
pchisq(modelEXAM$deviance, modelEXAM$df.residual, lower = FALSE)
## [1] 1.471037e-43
which is horribly significant.
(In Problem 3d we derive the mathematical formula for the test statistic this test is based on.)
b) Make a sketch by hand illustrating the connection between the expected number of satellites and the width of the carapace for each of the four colours of the horseshoe crab.
These are the coefficients:
coef(modelEXAM)
## (Intercept) W C1 C2 C3
## -2.92089322 0.14934271 0.27085229 0.07116581 -0.16551073
Start by drawing on the log scale, so we get straight lines. The C4 effect is about -0.17, so C3 and C4 cross the x axis at about -3.1, and the y-axis at about 3.1/0.15=~20. C2 crosses the x axis at about -2.85, and C1 at -2.65.
But in the data W ranges between about 21 and 33, so the values of \(\log{\lambda}\) are all positive: for C3 \(\log{\lambda}\) is between about \(-3.1+21\times 0.15 = 0.05\) and \(-3.1+33\times 0.15 = 1.85\).
On the log scale these are all parallel. But on the data sale, they will diverge. So we end up with something like this mess:
The estimated multiplicative change in the expected number of satellites when the width of the carapace is changed by 1 cm is, according to the model, independent of the colour.
Explain why.
Also find a 95% confidence interval for this change.
exp(confint(modelEXAM)["W", ])
## 2.5 % 97.5 %
## 1.114412 1.209282
It is, asymptotically at least, a linear combination of normal random variables (the estimates of the parameters), so must be Normal.
Which value of \(x_W\) would give estimated mean number of satellites equal to 5?
We need to plug in the slope for the W effect, and the
“light medium” (=C1) effect:
\[ \ln(5) = -2.92 + 0.15\times W_5 + 0.27 = -2.65 + 0.15\times W_5 \]
i.e.
\[ W_5 = \frac{\ln(5) + 2.65}{0.15} = 28.4 \]
Or, in R:
(W5 = (log(5) - (coef(modelEXAM)["(Intercept)"] + coef(modelEXAM)["C1"]))/coef(modelEXAM)["W"])
## (Intercept)
## 28.5215
The slight difference is due to rounding.
predict function: ?predict.glm tells
you more.
We do this be predicting the value.
newd <- data.frame(C = factor(1, levels = levels(crab$C)), W = W5)
(pred <- predict(modelEXAM, newdata = newd, type = "link", se.fit = TRUE))
# 95% CI:
(c(pred$fit + 1.96 * pred$se.fit, pred$fit - 1.96 * pred$se.fit))
## $fit
## (Intercept)
## 1.609438
##
## $se.fit
## [1] 0.1448597
##
## $residual.scale
## [1] 1
##
## (Intercept) (Intercept)
## 1.893363 1.325513
Optional: Use ggplot to create
(and improve) the sketch from b).
Consider a random variable \(Y\). In our course we have considered the univariate exponential family having distribution (probability density function for continuous variables and probability mass function for discrete variables) \[ f(y)=\exp \left( \frac{y \theta-b(\theta)}{\phi}w + c(y,\phi,w) \right)\] where \(\theta\) is called the natural parameter (or parameter of interest) and \(\phi\) the dispersion parameter.
The Poisson distribution is a discrete distribution with probability mass function \[ f(y)=\frac{\lambda^{y}}{y!}\exp(- \lambda), \text{ for } y=0,1,\ldots\] where \(\lambda>0\).
a) (10 points)
The log likelihood for a single datum, \(y\), can be written as
\[ \begin{aligned} l(\theta)&=\ln L(\theta)=[y \ln(\lambda)-\lambda-\ln(y!)] \\ &= \frac{y \ln(\lambda)-\lambda}{1}1 + \ln(y!) \end{aligned} \]
So \(\theta=\ln(\lambda)\), \(=b(\theta)=\lambda\), \(w=\phi=1\) and \(c(y,\phi,w)=\ln(y!)\).b) (15 points)
We consider a Poisson regression with log link \(\eta_i=g(\mu_i)=\ln(\mu_i)\), and linear predictor equal to \(\eta_i={\bf x}_i^T \boldsymbol{\beta}\). Further, let \(p\) be the number of regression parameters in \(\boldsymbol{\beta}\) (intercept included). The response–covariate pairs \((Y_i, {\bf x}_i)\) are independent for \(i=1,\ldots, n\).
Yes it does. The Poisson distribution is a member of the exponential family, indeed the log link it the cononical link for the Poisson. The data are independent pairs, and the covaraites have a linear effect on \(\eta_i\).
\[ \log(f(y)) =\log \left(\frac{\lambda^{y}}{y!}\exp(- \lambda)\right) \]
Strictly this would be correct, but a bit naughty. This is better:
\[ l(y) = y \log \lambda - \lambda - \log(y!) \]
And if you forget the \(- \log(y!)\), it’s still right (if not exact).Above we wrote “let \(p\) be the number of regression parameters in \(\boldsymbol{\beta}\) (intercept included)”, so
The MLE for \(\boldsymbol{\beta}\) is the solution to \({\bf s}(\boldsymbol{\beta}) = 0\). We can solve this using Fisher scoring, which is also a Newton-Raphson appraoch (because the observed and expected Fisher information are the same). i.e. we iterate
\[ \beta^{(t+1)} = \beta^{(t)} + F(\beta^{(t)})^{-1} s(\beta^{(t)}) \] until \(\beta^{(t+1)} - \beta^{(t)} < \epsilon\), for some small \(\epsilon\).
c) (10 points)
We now look at a data set giving the number of species of tortoise on the various Galapagos Islands (Data taken from the book “Practical Regression and Anova using R” by Julian J. Faraway.).
The data set contains measurements on 30 islands, and we study the following variables:
Species: The number of species of tortoise found on the
island.Area: The area of the island (km\(^2\)).Elevation: The highest elevation of the island
(m).Nearest: The distance from the nearest island
(km).Scruz: The distance from Santa Cruz island (km).Adjacent: The area of the adjacent island (km\(^2\)).We have fitted a Poisson regression with log link to
Species as response, and the other five variables are used
as continuous covariates. Print-out from the fitted model is given in
below.
library(faraway)
fit <- glm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala,
family = poisson(link = log))
summary(fit)
##
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest + Scruz +
## Adjacent, family = poisson(link = log), data = gala)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
## Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
## Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
## Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
## Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
## Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
Let \(\boldsymbol{\beta}\) be a \(6 \times 1\) column vector with the regression coefficients (intercept included), and let \(\boldsymbol{\hat{\beta}}\) be the maximum likelihood estimator for \(\boldsymbol{\beta}\).
Assymptotically \(\boldsymbol{\hat{\beta}} \sim N(\boldsymbol{\beta}), F(\beta))\) where \(F(\beta) = \text{Cov}(\beta)\), i.e. the expected Fisher information matrix.
We can approximate \(F(\beta)\) with the observed Fisher information calculated at \(\boldsymbol{\hat{\beta}}\), i.e.
\[ H(\boldsymbol{\hat{\beta}}) = -\frac{\partial s(\boldsymbol{\hat{\beta}})}{\partial \boldsymbol{\hat{\beta}}^T} \]
We will focus on the effect of Elevation, and denote the
corresponding regression coefficient \(\beta_2\).
3.541e-03
(orcoef(fit)["Elevation"] = 0.0035406 if you want to
extract it).
Elevation on the
number of species of tortoise found on the islands? It say that the number of species increase \(e^{0.00335} = 1.00035\) times for each meter higher an island is.
This is small, but so is a 1m change in elevation. So we can talk about a change in 100m: this would increase the number of species \(e^{100*0.00335} = 1.42\) times.
In fact, a change of 200m doubles the numer of species.
8.741e-05
(or sqrt(vcov(fit)["Elevation", "Elevation"]) =
8.7407037^{-5} if you want to extract it).
Approximately this is \({\hat \beta}_2 \pm 1.96 \hat{\sigma}_2\),
i.e. 3.541e-03 \(\pm 1.96 \times\) 8.741e-05 = (3.37e-03, 3.71e-03).
(written out - with small changes to fit the notation we use - from https://www.uio.no/studier/emner/matnat/math/STK3100/h17/stk3100-4100_2017_2eng.pdf)
Assume that the random variable \(Y\) is Poisson distributed with probability mass function (pmf)
\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]
a) Show that the distribution of \(Y\) is an exponential family, that is, show that the pmf can be written in the form \[\exp\left\{\frac{y \theta - b(\theta)}{\phi}w + c(y, \phi, w)\right\},\] and determine \(\theta\), \(\phi\), \(w\), \(b(\theta)\) and \(c(y, \phi, w)\).
The log likelihood for a single datum, \(y\), can be written as
\[ \begin{aligned} l(\theta)&=\ln L(\theta)=[y \ln(\lambda)-\lambda-\ln(y!)] \\ &= \frac{y \ln(\lambda)-\lambda}{1}1 + \ln(y!) \end{aligned} \]
So \(\theta=\ln(\lambda)\), \(=b(\theta)=\lambda\), \(w=\phi=1\) and \(c(y,\phi,w)=\ln(y!)\).
(seem familiar? I just copied and pasted from Q2)We then assume that \(Y_1, Y_2, \dots, Y_n\) are independent with the pmf from a), and let \(\mu_i = \text{E}(Y_i)\), \(i = 1, \dots, n\).
b) Explain what we mean by a generalized linear model (GLM) for \(Y_1, Y_2, \dots, Y_n\) with link function \(g(\cdot)\), and determine the canonical link function.
A GLM is a model for the data, \(Y_1, Y_2, \dots, Y_n\), where we assume \(g(\text{E}(Y_i)) = \bf{x}_i' \beta\), where \(g()\) is the link function.
For the Poisson model we see \(\theta=\ln(\lambda)\), where \(\lambda\) is \(\text{E}(Y))\), so \(\log\) must be the canonical link.c) Derive an expression for the log-likelihood function \(l(\mathbf{\mu}; \mathbf{y})\), where \(\mathbf{y} = (y_1, \dots, y_n)^T\) is the observed value of \(\mathbf{Y} = (Y_1, Y_2, \dots, Y_n)^T\) and \(\mathbf{\mu} = (\mu_1, \dots, \mu_n)^T\).
First, \(l(\mathbf{\mu}; \mathbf{y}) = \sum_{i=1}^n{\log \text{P}(Y_i = y_i | \mu_i)}\), so
\(l(\mathbf{\mu}; \mathbf{y}) = \sum_{i=1}^n{\log \text{P}(Y_i = y_i | \mu_i)}\), so from answer (a) we see \(l(\mu_i; y_i)=y_i \ln(\mu_i)-\lambda_i-\ln(y_i!)\), and thus
\[ l(\mathbf{\mu}; \mathbf{y}) = \sum_{i=1}^n{y_i \ln(\mu_i)} - \sum_{i=1}^n{\mu_i}- \sum_{i=1}^n{\ln(y_i!)} \]
d) Explain what we mean by a saturated model and determine the maximum of \(l(\mathbf{\mu}; \mathbf{y})\) for the saturated model.
A saturated model is one where there is a parameter for every data point, i.e. \(p=n\)
We can find \(l(\mathbf{\mu}; \mathbf{y})\) for the saturated model by solving \(\partial l(\mu_i; y_i)/\partial \mu_i =0\)
\[ \begin{aligned} \frac{\partial l(\mu_i; y_i)}{\partial \mu_i}&=\frac{\partial \left(y \ln(\mu_i)-\mu_i-\ln(y_i!)\right)}{\partial \mu_i} =0\\ &= \frac{y_i}{\mu_i} -1=0 \end{aligned} \]
So \(y_i = \mu_i\)
e) Explain what we mean by the deviance \(D(\mathbf{y}; \mathbf{\hat{\mu}})\) of a Poisson GLM, find an expression for the deviance, and discuss how it may be used.
The deviance of a model is -2 times the difference between the log likelihood for the model and the saturated likelihood, i.e. \(- 2 (l(\mathbf{\beta}; \mathbf{y}) - l(\mathbf{\mu}; \mathbf{y}))\).
So for the Poisson distribution, using \(\tilde{\mu}_i\) for the estimate of \(\mu_i\) at the model of interest, and substituting \(\mu_i = y_i\) for the saturated model, we get
\[ \begin{aligned} - 2 (l(\mathbf{\beta}; y_i) - l(\mu_i; y_i)) &= -2\left(\left(y_i \ln(\tilde{\mu}_i)-\tilde{\mu}_i-\ln(y_i!)\right) - \left(y_i \ln(y_i)-y_i-\ln(y_i!) \right)\right)\\ &= -2\left( y_i (\ln \tilde{\mu}_i -\ln y_i) - (\tilde{\mu}_i - y_i) \right) \\ &= 2y_i \ln \frac{y_i}{\tilde{\mu}_i} - 2( y_i - \tilde{\mu}_i) \end{aligned} \]
(there is some flexibility on where to stop fiddling to make this look the most elegant)
install.packages(c("tidyverse", "ggplot2", "statmod", "corrplot", "ggplot2", "GGally",
"boot"))