Hints and reminders are in bold

Questions appear in blue.

Some hints and answers are hidden away in a fold like this Well done, you found me!

The exam style questions are optional, but will be helpful for, well, the exam. They will also help you now to see if you understand the content.

Interactive session - first week

Problem 1: Exam 2005 (Problem 1d-f - slightly modified) - Female crabs and satellites

(Only a subset of 20 crabs in the original data was used in the exam problems, but we will use the full data set of 173 crabs - so results will not be the same. Permitted aids for the exam was “all printed and handwritten material and all calculators” - NB that is not the case for 2018!)

We assume that the number of satellites for each female crab follows a Poisson distribution and want to perform a Poisson regression using a log-link to find if there is a connection between the expected number of satellites Sa and the width W and colour C of the carapace of the female crab.

  • C: the color of the female crab (1=light medium, 2=medium, 3=dark medium, 4=dark)
  • W: width of carapace (cm)

The following model was fitted.

crab <- read.table("https://www.math.ntnu.no/emner/TMA4315/2018h/crab.txt")
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
crab <- crab[,-1] #remove column with Obs
crab$C <- as.factor(crab$C)
modelEXAM <- glm(Sa ~ W + C, family = poisson(link = log), data = crab, contrasts = list(C = "contr.sum"))
summary(modelEXAM)
library(car)
Anova(modelEXAM, type = "III", test.statistic = "Wald")
## 
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab, 
##     contrasts = list(C = "contr.sum"))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.92089    0.56010  -5.215 1.84e-07 ***
## W            0.14934    0.02084   7.166 7.73e-13 ***
## C1           0.27085    0.11784   2.298   0.0215 *  
## C2           0.07117    0.07296   0.975   0.3294    
## C3          -0.16551    0.09316  -1.777   0.0756 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6
## 
## Analysis of Deviance Table (Type III tests)
## 
## Response: Sa
##             Df  Chisq Pr(>Chisq)    
## (Intercept)  1 27.196  1.839e-07 ***
## W            1 51.350  7.726e-13 ***
## C            3  8.518    0.03644 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a) Write down the estimated model and evaluate the significance of the estimated coefficients.

Answer

The estimated model is

\[ \begin{aligned} Y_i &\sim \text{Poisson}(\lambda_i) \\ \log{\lambda_i} &= -2.92 + 0.15 \text{W} + 0.27 X_1 + 0.07 X_2 - 0.17 X_3 + \xi X_4 \end{aligned} \]

where \(W\) is the width, and \(X_j\) is an indicator of whether the female is in colour category \(j\). \(\xi\) is the effect of \(X_4\), to be calculated later. Note that we use a sum to zero contrast.

The summary() gives Wald statistics and their associated p-values. The effects of width (W) is significant: p<<0. For the colour effects, C2 is clearly not significantly different from C4: C1 and C3 are both just either side of the critical value at 5%. It is notlcealr why the contrast to C4 is important, though.

How would you proceed to evaluate the significance of the coefficient for the colour C4. Explain (but only do the calculation if you really feel like it).

Hint

First, work out what the coefficient is. Then explain what test you would use, and where you would get the components of it.

If you want to do the calculation, you will need to do a bit of matrix manipulation in R.
Answer

The coefficient is \(-\sum_{i=1}^3{\beta_i} = -(0.27 + 0.07 - 0.17) = -0.17\)

We can calculate the standard error using the contrast, \(C\) and can plug this into

\[ H_0: {\bf C}{\bf \beta}={\bf d} \text{ vs. } H_1: {\bf C}{\bf \beta}\neq {\bf d} \]

i.e. \(C=\{0,0-1,-1,-1 \}\) (the 0s are for the intercept and W term), and \(\beta = \{-2.92, 0.15, 0.27, 0.07, -0.17\}\) so we test if \(-(0.27 + 0.07 - 0.17) = -0.17\) is different from 0. We can use the Wald statistic:

\[ w=({\bf C} \hat{\boldsymbol{\beta}}-{\bf d})^{\text T}[{\bf C}F^{-1}(\hat{\beta}){\bf C}^{\text T}]^{-1}({\bf C}\hat{{\boldsymbol \beta}}-{\bf d}) \]

where \(F^{-1}(\hat{\beta})\) is the covariance matrix, which we can extract with vcov():

vcov(modelEXAM)
##              (Intercept)             W            C1            C2
## (Intercept)  0.313712405 -0.0116075311  0.0090846333  5.635922e-03
## W           -0.011607531  0.0004343333 -0.0002800538 -3.120317e-04
## C1           0.009084633 -0.0002800538  0.0138864871 -2.197738e-03
## C2           0.005635922 -0.0003120317 -0.0021977383  5.323442e-03
## C3          -0.005792980  0.0001811855 -0.0042680583  2.192067e-05
##                        C3
## (Intercept) -5.792980e-03
## W            1.811855e-04
## C1          -4.268058e-03
## C2           2.192067e-05
## C3           8.679453e-03
C = t(as.vector(c(0, 0, -1, -1, -1)))
betahat <- coef(modelEXAM)
d <- 0
invF <- vcov(modelEXAM)

# solve() calculates the inverse of the matrix
(w <- t(C %*% betahat - d) %*% solve(C %*% invF %*% t(C)) %*% (C %*% betahat - d))
##          [,1]
## [1,] 2.076764

(the calculation of w involves a bit of over-kill and matrix operations)

We are testing one hypothesis , so \(r=1\), so we compare this to a \(\chi^1_1\) distribution:

pchisq(w, 1, lower.tail = FALSE)
##           [,1]
## [1,] 0.1495569
So this is not significant at 5%.

Perform a test to evaluate the fit of the model. What is your conclusion? Use a 5% significance level for your evaluations.

Hint

We are asking if any variation not explained by the model could just be random Poisson noise.

Answer

We want to test if the residual deviance follows a \(\chi^2\) distribution. Luckily summary() gives us what we need:

summary(modelEXAM)
## 
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab, 
##     contrasts = list(C = "contr.sum"))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.92089    0.56010  -5.215 1.84e-07 ***
## W            0.14934    0.02084   7.166 7.73e-13 ***
## C1           0.27085    0.11784   2.298   0.0215 *  
## C2           0.07117    0.07296   0.975   0.3294    
## C3          -0.16551    0.09316  -1.777   0.0756 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6

The residual deviance is 559.344785, with 168 degrees of freedom. The p-value is thus:

pchisq(modelEXAM$deviance, modelEXAM$df.residual, lower = FALSE)
## [1] 1.471037e-43

which is horribly significant.

(In Problem 3d we derive the mathematical formula for the test statistic this test is based on.)

b) Make a sketch by hand illustrating the connection between the expected number of satellites and the width of the carapace for each of the four colours of the crab.

Answer

These are the coefficients:

coef(modelEXAM)
## (Intercept)           W          C1          C2          C3 
## -2.92089322  0.14934271  0.27085229  0.07116581 -0.16551073

Start by drawing on the log scale, so we get straight lines. The C4 effect is about -0.17, so C3 and C4 cross the x axis at about -3.1, and the y-axis at about 3.1/0.15=~20. C2 crosses the x axis at about -2.85, and C1 at -2.65.

But in the data W ranges between about 21 and 33, so the values of \(\log{\lambda}\) are all positive: for C3 \(\log{\lambda}\) is between about \(-3.1+21\times 0.15 = 0.05\) and \(-3.1+33\times 0.15 = 1.85\).

On the log scale these are all parallel. But on the data sale, they will diverge. So we end up with something like this mess:

Two hand drawn graphs of the effect of W on expected number of satellites
Two hand drawn graphs of the effect of W on expected number of satellites

The estimated multiplicative change in the expected number of satellites when the width of the carapace is changed by 1 cm is, according to the model, independent of the colour.

Explain why.

Answer Because the model does not include any interaction between W and C.

Also find a 95% confidence interval for this change.

Answer
exp(confint(modelEXAM)["W", ])
##    2.5 %   97.5 % 
## 1.114412 1.209282
  1. Let \(\hat{\eta}(x_{W})\) be the estimated linear predictor when the width of the carapace is \(x_{W}\) and the crab is “light medium”.
What is the distribution of \(\hat{\eta}(x_{W})\)?
Answer

It is a linear combination of normal random variables (the estiamtes of the parameters), so must be Normal.

Which value of \(x_W\) would give estimated mean number of satellites equal to 5?

Answer

We need to plug in the slope for the W effect, and the “light medium” (=C1) effect:

\[ \ln(5) = -2.92 + 0.15\times W_5 + 0.27 = -2.65 + 0.15\times W_5 \]

i.e. 

\[ W_5 = \frac{\ln(5) + 2.65}{0.15} = 28.4 \]

Or, in R:

(W5 = (log(5) - (coef(modelEXAM)["(Intercept)"] + coef(modelEXAM)["C1"]))/coef(modelEXAM)["W"])
## (Intercept) 
##     28.5215
The slight difference is due to rounding.
Optional: Using R – for this value of \(x_W\), construct a 95% confidence interval for the mean number of satellites.
Hint use the predict function: ?predict.glm tells you more.
Answer

We do this be predicting the value.

newd <- data.frame(C = factor(1, levels = levels(crab$C)), W = W5)
(pred <- predict(modelEXAM, newdata = newd, type = "link", se.fit = TRUE))

# 95% CI:

(c(pred$fit + 1.96 * pred$se.fit, pred$fit - 1.96 * pred$se.fit))
## $fit
## (Intercept) 
##    1.609438 
## 
## $se.fit
## [1] 0.1448597
## 
## $residual.scale
## [1] 1
## 
## (Intercept) (Intercept) 
##    1.893363    1.325513

Optional: Use ggplot to create (and improve) the sketch from b).

Answer/Hint (not yet added) 42

Problem 2: Exam 2017 (Problem 1) - Poisson regression

Consider a random variable \(Y\). In our course we have considered the univariate exponential family having distribution (probability density function for continuous variables and probability mass function for discrete variables) \[ f(y)=\exp \left( \frac{y \theta-b(\theta)}{\phi}w + c(y,\phi,w) \right)\] where \(\theta\) is called the natural parameter (or parameter of interest) and \(\phi\) the dispersion parameter.

The Poisson distribution is a discrete distribution with probability mass function \[ f(y)=\frac{\lambda^{y}}{y!}\exp(- \lambda), \text{ for } y=0,1,\ldots\] where \(\lambda>0\).

a) (10 points)

  • Show that the Poisson distribution is a univariate exponential family, and specify what the elements of the exponential family (\(\theta\), \(\phi\), \(b(\theta)\), \(w\), \(c(y,\phi,w)\)) are.
Answer/Hint (not yet added) 42
  • What is the connections between \(\text{E}(Y)\) and elements of the exponential family?
Answer/Hint (not yet added) 42
  • What is the connections between \(\text{Var}(Y)\) and elements of the exponential family?
Answer/Hint (not yet added) 42
  • Use these connections to derive the mean and variance for the Poisson distribution.
Answer/Hint (not yet added) 42
  • If the Poisson distribution is used as the distribution for the response in a generalized linear model, what is the canonical link function?
Answer/Hint (not yet added) 42

b) (15 points)

We consider a Poisson regression with log link \(\eta_i=g(\mu_i)=\ln(\mu_i)\), and linear predictor equal to \(\eta_i={\bf x}_i^T \boldsymbol{\beta}\). Further, let \(p\) be the number of regression parameters in \(\boldsymbol{\beta}\) (intercept included). The response–covariate pairs \((Y_i, {\bf x}_i)\) are independent for \(i=1,\ldots, n\).

  • Does this set-up satisfy the requirements of a GLM model? Explain.
Answer/Hint (not yet added) 42
  • Write down the log-likelihood.
Answer/Hint (not yet added) 42
  • From the log-likelihood, derive the formula for the score function \({\bf s}(\boldsymbol{\beta})\) and the expected Fisher information matrix, \({\bf F}(\boldsymbol{\beta})\).
  • What are the dimensions of \({\bf s}(\boldsymbol{\beta})\) and \({\bf F}(\boldsymbol{\beta})\)?
Answer/Hint (not yet added) 42
  • How can \({\bf s}(\boldsymbol{\beta})\) and \({\bf F}(\boldsymbol{\beta})\) be used to arrive at a maximum likelihood estimate for \(\boldsymbol{\beta}\)?
Answer/Hint (not yet added) 42

c) (10 points)

We now look at a data set giving the number of species of tortoise on the various Galapagos Islands (Data taken from the book “Practical Regression and Anova using R” by Julian J. Faraway.).

The data set contains measurements on 30 islands, and we study the following variables:

  • Species: The number of species of tortoise found on the island.
  • Area: The area of the island (km\(^2\)).
  • Elevation: The highest elevation of the island (m).
  • Nearest: The distance from the nearest island (km).
  • Scruz: The distance from Santa Cruz island (km).
  • Adjacent: The area of the adjacent island (km\(^2\)).

We have fitted a Poisson regression with log link to Species as response, and the other five variables are used as continuous covariates. Print-out from the fitted model is given in below.

library(faraway)
fit <- glm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala,
    family = poisson(link = log))
summary(fit)
## 
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest + Scruz + 
##     Adjacent, family = poisson(link = log), data = gala)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.155e+00  5.175e-02  60.963  < 2e-16 ***
## Area        -5.799e-04  2.627e-05 -22.074  < 2e-16 ***
## Elevation    3.541e-03  8.741e-05  40.507  < 2e-16 ***
## Nearest      8.826e-03  1.821e-03   4.846 1.26e-06 ***
## Scruz       -5.709e-03  6.256e-04  -9.126  < 2e-16 ***
## Adjacent    -6.630e-04  2.933e-05 -22.608  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 3510.73  on 29  degrees of freedom
## Residual deviance:  716.85  on 24  degrees of freedom
## AIC: 889.68
## 
## Number of Fisher Scoring iterations: 5

Let \(\boldsymbol{\beta}\) be a \(6 \times 1\) column vector with the regression coefficients (intercept included), and let \(\hat{\boldsymbol{\beta}}\) be the maximum likelihood estimator for \(\boldsymbol{\beta}\).

  • Write down the asymptotic distribution for \(\hat{\boldsymbol{\beta}}\), and specify how the covariance matrix for \(\hat{\boldsymbol{\beta}}\) is estimated.
Answer/Hint (not yet added) 42

We will focus on the effect of Elevation, and denote the corresponding regression coefficient \(\beta_2\).

  • Write down the maximum likelihood estimate for \(\beta_2\) in the print-out above.
Answer/Hint (not yet added) 42
  • How can you explain this value to a biologist interested in understanding the effect of Elevation on the number of species of tortoise found on the islands?
Answer/Hint (not yet added) 42
  • What is numerical value for the estimated standard deviation of \(\hat{\beta}_2\) given in above?
Answer/Hint (not yet added) 42
  • Construct an approximate 95% confidence interval for \(\beta_2\).
Answer/Hint (not yet added) 42

Problem 3: Exam December 2017 from UiO, Problem 1.

(written out - with small changes to fit the notation we use - from https://www.uio.no/studier/emner/matnat/math/STK3100/h17/stk3100-4100_2017_2eng.pdf)

Assume that the random variable \(Y\) is Poisson distributed with probability mass function (pmf)

\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]

a) Show that the distribution of \(Y\) is an exponential family, that is, show that the pmf can be written in the form \[\exp\left\{\frac{y \theta - b(\theta)}{\phi}w + c(y, \phi, w)\right\},\] and determine \(\theta\), \(\phi\), \(w\), \(b(\theta)\) and \(c(y, \phi, w)\).

Answer/Hint (not yet added) 42

We then assume that \(Y_1, Y_2, \dots, Y_n\) are independent with the pmf from a), and let \(\mu_i = \text{E}(Y_i)\), \(i = 1, \dots, n\).

b) Explain what we mean by a generalized linear model (GLM) for \(Y_1, Y_2, \dots, Y_n\) with link function \(g(\cdot)\), and determine the canonical link function.

Answer/Hint (not yet added) 42

c) Derive an expression for the log-likelihood function \(l(\mathbf{\mu}; \mathbf{y})\), where \(\mathbf{y} = (y_1, \dots, y_n)^T\) is the observed value of \(\mathbf{Y} = (Y_1, Y_2, \dots, Y_n)^T\) and \(\mathbf{\mu} = (\mu_1, \dots, \mu_n)^T\).

Answer/Hint (not yet added) 42

d) Explain what we mean by a saturated model and determine the maximum of \(l(\mathbf{\mu}; \mathbf{y})\) for the saturated model.

Answer/Hint (not yet added) 42

e) Explain what we mean by the deviance \(D(\mathbf{y}; \mathbf{\hat{\mu}})\) of a Poisson GLM, find an expression for the deviance, and discuss how it may be used.

Answer/Hint (not yet added) 42

R packages

install.packages(c("tidyverse", "ggplot2", "statmod", "corrplot", "ggplot2", "GGally",
    "boot"))

Further reading

  • A. Agresti (1996): “An Introduction to Categorical Data Analysis”.
  • A. Agresti (2015): “Foundations of Linear and Generalized Linear Models.” Wiley.
  • A. J. Dobson and A. G. Barnett (2008): “An Introduction to Generalized Linear Models”, Third edition.
  • J. Faraway (2015): “Extending the Linear Model with R”, Second Edition. http://www.maths.bath.ac.uk/~jjf23/ELM/
  • P. McCullagh and J. A. Nelder (1989): “Generalized Linear Models”. Second edition.