Hints and reminders are in bold
Questions appear in blue.
The exam style questions are optional, but will be helpful for, well, the exam. They will also help you now to see if you understand the content.
(Only a subset of 20 crabs in the original data was used in the exam problems, but we will use the full data set of 173 crabs - so results will not be the same. Permitted aids for the exam was “all printed and handwritten material and all calculators” - NB that is not the case for 2018!)
We assume that the number of satellites for each female crab follows
a Poisson distribution and want to perform a Poisson regression using a
log-link to find if there is a connection between the expected number of
satellites Sa
and the width W
and colour
C
of the carapace of the female crab.
C
: the color of the female crab (1=light medium,
2=medium, 3=dark medium, 4=dark)W
: width of carapace (cm)The following model was fitted.
crab <- read.table("https://www.math.ntnu.no/emner/TMA4315/2018h/crab.txt")
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
crab <- crab[,-1] #remove column with Obs
crab$C <- as.factor(crab$C)
modelEXAM <- glm(Sa ~ W + C, family = poisson(link = log), data = crab, contrasts = list(C = "contr.sum"))
summary(modelEXAM)
library(car)
Anova(modelEXAM, type = "III", test.statistic = "Wald")
##
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab,
## contrasts = list(C = "contr.sum"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92089 0.56010 -5.215 1.84e-07 ***
## W 0.14934 0.02084 7.166 7.73e-13 ***
## C1 0.27085 0.11784 2.298 0.0215 *
## C2 0.07117 0.07296 0.975 0.3294
## C3 -0.16551 0.09316 -1.777 0.0756 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
##
## Analysis of Deviance Table (Type III tests)
##
## Response: Sa
## Df Chisq Pr(>Chisq)
## (Intercept) 1 27.196 1.839e-07 ***
## W 1 51.350 7.726e-13 ***
## C 3 8.518 0.03644 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
a) Write down the estimated model and evaluate the significance of the estimated coefficients.
The estimated model is
\[ \begin{aligned} Y_i &\sim \text{Poisson}(\lambda_i) \\ \log{\lambda_i} &= -2.92 + 0.15 \text{W} + 0.27 X_1 + 0.07 X_2 - 0.17 X_3 + \xi X_4 \end{aligned} \]
where \(W\) is the width, and \(X_j\) is an indicator of whether the female is in colour category \(j\). \(\xi\) is the effect of \(X_4\), to be calculated later. Note that we use a sum to zero contrast.
The summary()
gives Wald statistics and their associated
p-values. The effects of width (W
) is significant:
p<<0. For the colour effects, C2 is clearly not significantly
different from C4: C1 and C3 are both just either side of the critical
value at 5%. It is notlcealr why the contrast to C4 is important,
though.
How would you proceed to evaluate the significance of the coefficient for the colour C4. Explain (but only do the calculation if you really feel like it).
First, work out what the coefficient is. Then explain what test you would use, and where you would get the components of it.
If you want to do the calculation, you will need to do a bit of matrix manipulation in R.The coefficient is \(-\sum_{i=1}^3{\beta_i} = -(0.27 + 0.07 - 0.17) = -0.17\)
We can calculate the standard error using the contrast, \(C\) and can plug this into
\[ H_0: {\bf C}{\bf \beta}={\bf d} \text{ vs. } H_1: {\bf C}{\bf \beta}\neq {\bf d} \]
i.e. \(C=\{0,0-1,-1,-1 \}\) (the 0s
are for the intercept and W
term), and \(\beta = \{-2.92, 0.15, 0.27, 0.07,
-0.17\}\) so we test if \(-(0.27 + 0.07
- 0.17) = -0.17\) is different from 0. We can use the Wald
statistic:
\[ w=({\bf C} \hat{\boldsymbol{\beta}}-{\bf d})^{\text T}[{\bf C}F^{-1}(\hat{\beta}){\bf C}^{\text T}]^{-1}({\bf C}\hat{{\boldsymbol \beta}}-{\bf d}) \]
where \(F^{-1}(\hat{\beta})\) is the
covariance matrix, which we can extract with vcov()
:
vcov(modelEXAM)
## (Intercept) W C1 C2
## (Intercept) 0.313712405 -0.0116075311 0.0090846333 5.635922e-03
## W -0.011607531 0.0004343333 -0.0002800538 -3.120317e-04
## C1 0.009084633 -0.0002800538 0.0138864871 -2.197738e-03
## C2 0.005635922 -0.0003120317 -0.0021977383 5.323442e-03
## C3 -0.005792980 0.0001811855 -0.0042680583 2.192067e-05
## C3
## (Intercept) -5.792980e-03
## W 1.811855e-04
## C1 -4.268058e-03
## C2 2.192067e-05
## C3 8.679453e-03
C = t(as.vector(c(0, 0, -1, -1, -1)))
betahat <- coef(modelEXAM)
d <- 0
invF <- vcov(modelEXAM)
# solve() calculates the inverse of the matrix
(w <- t(C %*% betahat - d) %*% solve(C %*% invF %*% t(C)) %*% (C %*% betahat - d))
## [,1]
## [1,] 2.076764
(the calculation of w
involves a bit of over-kill and
matrix operations)
We are testing one hypothesis , so \(r=1\), so we compare this to a \(\chi^1_1\) distribution:
pchisq(w, 1, lower.tail = FALSE)
## [,1]
## [1,] 0.1495569
So this is not significant at 5%.
Perform a test to evaluate the fit of the model. What is your conclusion? Use a 5% significance level for your evaluations.
We are asking if any variation not explained by the model could just be random Poisson noise.
We want to test if the residual deviance follows a \(\chi^2\) distribution. Luckily
summary()
gives us what we need:
summary(modelEXAM)
##
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab,
## contrasts = list(C = "contr.sum"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.92089 0.56010 -5.215 1.84e-07 ***
## W 0.14934 0.02084 7.166 7.73e-13 ***
## C1 0.27085 0.11784 2.298 0.0215 *
## C2 0.07117 0.07296 0.975 0.3294
## C3 -0.16551 0.09316 -1.777 0.0756 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 559.34 on 168 degrees of freedom
## AIC: 924.64
##
## Number of Fisher Scoring iterations: 6
The residual deviance is 559.344785, with 168 degrees of freedom. The p-value is thus:
pchisq(modelEXAM$deviance, modelEXAM$df.residual, lower = FALSE)
## [1] 1.471037e-43
which is horribly significant.
(In Problem 3d we derive the mathematical formula for the test statistic this test is based on.)
b) Make a sketch by hand illustrating the connection between the expected number of satellites and the width of the carapace for each of the four colours of the crab.
These are the coefficients:
coef(modelEXAM)
## (Intercept) W C1 C2 C3
## -2.92089322 0.14934271 0.27085229 0.07116581 -0.16551073
Start by drawing on the log scale, so we get straight lines. The C4 effect is about -0.17, so C3 and C4 cross the x axis at about -3.1, and the y-axis at about 3.1/0.15=~20. C2 crosses the x axis at about -2.85, and C1 at -2.65.
But in the data W ranges between about 21 and 33, so the values of \(\log{\lambda}\) are all positive: for C3 \(\log{\lambda}\) is between about \(-3.1+21\times 0.15 = 0.05\) and \(-3.1+33\times 0.15 = 1.85\).
On the log scale these are all parallel. But on the data sale, they will diverge. So we end up with something like this mess:
The estimated multiplicative change in the expected number of satellites when the width of the carapace is changed by 1 cm is, according to the model, independent of the colour.
Explain why.
Also find a 95% confidence interval for this change.
exp(confint(modelEXAM)["W", ])
## 2.5 % 97.5 %
## 1.114412 1.209282
It is a linear combination of normal random variables (the estiamtes of the parameters), so must be Normal.
Which value of \(x_W\) would give estimated mean number of satellites equal to 5?
We need to plug in the slope for the W
effect, and the
“light medium” (=C1) effect:
\[ \ln(5) = -2.92 + 0.15\times W_5 + 0.27 = -2.65 + 0.15\times W_5 \]
i.e.
\[ W_5 = \frac{\ln(5) + 2.65}{0.15} = 28.4 \]
Or, in R:
(W5 = (log(5) - (coef(modelEXAM)["(Intercept)"] + coef(modelEXAM)["C1"]))/coef(modelEXAM)["W"])
## (Intercept)
## 28.5215
The slight difference is due to rounding.
predict
function: ?predict.glm
tells
you more.
We do this be predicting the value.
newd <- data.frame(C = factor(1, levels = levels(crab$C)), W = W5)
(pred <- predict(modelEXAM, newdata = newd, type = "link", se.fit = TRUE))
# 95% CI:
(c(pred$fit + 1.96 * pred$se.fit, pred$fit - 1.96 * pred$se.fit))
## $fit
## (Intercept)
## 1.609438
##
## $se.fit
## [1] 0.1448597
##
## $residual.scale
## [1] 1
##
## (Intercept) (Intercept)
## 1.893363 1.325513
Optional: Use ggplot
to create
(and improve) the sketch from b).
Consider a random variable \(Y\). In our course we have considered the univariate exponential family having distribution (probability density function for continuous variables and probability mass function for discrete variables) \[ f(y)=\exp \left( \frac{y \theta-b(\theta)}{\phi}w + c(y,\phi,w) \right)\] where \(\theta\) is called the natural parameter (or parameter of interest) and \(\phi\) the dispersion parameter.
The Poisson distribution is a discrete distribution with probability mass function \[ f(y)=\frac{\lambda^{y}}{y!}\exp(- \lambda), \text{ for } y=0,1,\ldots\] where \(\lambda>0\).
a) (10 points)
b) (15 points)
We consider a Poisson regression with log link \(\eta_i=g(\mu_i)=\ln(\mu_i)\), and linear predictor equal to \(\eta_i={\bf x}_i^T \boldsymbol{\beta}\). Further, let \(p\) be the number of regression parameters in \(\boldsymbol{\beta}\) (intercept included). The response–covariate pairs \((Y_i, {\bf x}_i)\) are independent for \(i=1,\ldots, n\).
c) (10 points)
We now look at a data set giving the number of species of tortoise on the various Galapagos Islands (Data taken from the book “Practical Regression and Anova using R” by Julian J. Faraway.).
The data set contains measurements on 30 islands, and we study the following variables:
Species
: The number of species of tortoise found on the
island.Area
: The area of the island (km\(^2\)).Elevation
: The highest elevation of the island
(m).Nearest
: The distance from the nearest island
(km).Scruz
: The distance from Santa Cruz island (km).Adjacent
: The area of the adjacent island (km\(^2\)).We have fitted a Poisson regression with log link to
Species
as response, and the other five variables are used
as continuous covariates. Print-out from the fitted model is given in
below.
library(faraway)
fit <- glm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala,
family = poisson(link = log))
summary(fit)
##
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest + Scruz +
## Adjacent, family = poisson(link = log), data = gala)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
## Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
## Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
## Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
## Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
## Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
Let \(\boldsymbol{\beta}\) be a \(6 \times 1\) column vector with the regression coefficients (intercept included), and let \(\hat{\boldsymbol{\beta}}\) be the maximum likelihood estimator for \(\boldsymbol{\beta}\).
We will focus on the effect of Elevation
, and denote the
corresponding regression coefficient \(\beta_2\).
Elevation
on the
number of species of tortoise found on the islands? (written out - with small changes to fit the notation we use - from https://www.uio.no/studier/emner/matnat/math/STK3100/h17/stk3100-4100_2017_2eng.pdf)
Assume that the random variable \(Y\) is Poisson distributed with probability mass function (pmf)
\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]
a) Show that the distribution of \(Y\) is an exponential family, that is, show that the pmf can be written in the form \[\exp\left\{\frac{y \theta - b(\theta)}{\phi}w + c(y, \phi, w)\right\},\] and determine \(\theta\), \(\phi\), \(w\), \(b(\theta)\) and \(c(y, \phi, w)\).
We then assume that \(Y_1, Y_2, \dots, Y_n\) are independent with the pmf from a), and let \(\mu_i = \text{E}(Y_i)\), \(i = 1, \dots, n\).
b) Explain what we mean by a generalized linear model (GLM) for \(Y_1, Y_2, \dots, Y_n\) with link function \(g(\cdot)\), and determine the canonical link function.
c) Derive an expression for the log-likelihood function \(l(\mathbf{\mu}; \mathbf{y})\), where \(\mathbf{y} = (y_1, \dots, y_n)^T\) is the observed value of \(\mathbf{Y} = (Y_1, Y_2, \dots, Y_n)^T\) and \(\mathbf{\mu} = (\mu_1, \dots, \mu_n)^T\).
d) Explain what we mean by a saturated model and determine the maximum of \(l(\mathbf{\mu}; \mathbf{y})\) for the saturated model.
e) Explain what we mean by the deviance \(D(\mathbf{y}; \mathbf{\hat{\mu}})\) of a Poisson GLM, find an expression for the deviance, and discuss how it may be used.
install.packages(c("tidyverse", "ggplot2", "statmod", "corrplot", "ggplot2", "GGally",
"boot"))