TMA4315 Generalized linear models H2017

Module 4: Count and continuous positive response data (Poisson- and gamma regression)

Mette Langaas, Department of Mathematical Sciences, NTNU – with contributions from Ingeborg Hem

02.10.2017 and 09.10.2017 [PL=plenary lecture in F3], 04.10.2017 and 11.10.2017 [IL=interactive lecture in Smia] (Version 10.10.2017)

#libraries needed to run this module page
install.packages("tidyverse")
install.packages("ggplot2")
install.packages("GGally")
install.packages("car")
install.packages("ggpubr")
install.packages("statmod")
install.packages("corrplot")

Plan for this module

present regression models for count data and positive/skewed continuous response data

Textbook: Fahrmeir et al (2013): Chapter 5.2, 5.3. Slides on analysis of \(r\times c\) contingency tables.

First week

Classnotes from first week (02.10.2017)

  • examples of count data
  • the Poisson distribution
  • regression with count data
  • Poisson regression with log-link
  • parameter estimation (ML): log-likelihood, score vector, information matrix to give iterative calculations
  • asymptotic MLE properties
  • confidence intervals and hypothesis tests (Wald, score and LRT)
  • deviance, model fit and model choice
  • overdispersion
  • rate models and offset

Jump to interactive (week 1)


Second week

Classnotes from second week (09.10.2017)

  • modelling continuous response data: lognormal and gamma
  • the gamma distribution
  • the gamma GLM model
  • gamma likelihood and derivations thereof
  • dispersion parameter: scaled and unscaled deviance
  • summing up - elements in common for normal, binomial, Poisson and gamma
  • analysing \(r\times c\) contingency tables

Jump to interactive (week 2)

Modelling count data

Examples of count data

  • the number of automobile thefts per city worldwide
  • the number of UFO sightings around the world
  • the number of visits at web pages
  • the number of male crabs (satellites) residing nearby a female crab
  • the number of goals by the home team and the number of goals for the away team in soccer
  • the number of newspapers sold at newsagents

Ex: Modelling sale of newspapers

This is a short description of a project run at the Norwegian Computing Centre a few years ago.

The aim of the project was to provide a statistical model to predict the number of newspapers to be sold at each of 11 thousand outlets all over Norway for a given day (“tomorrow” - maybe scaled to match front page issues), and then, based on the prediction, to decide how many newspapers to deliver to each outlet in order to optimize the overall profit (lost sales if an outlet is sold out, return costs for unsold papers).

Response data: number of newspapers (delivered) sold at each outlet. Covariate data: type of outlet, but mainly calendar information: weekday, month, season, public holidays, winter/autumn/easter/xmas, …


Example: female crabs with satellites

The example is taken from Agresti (1996): “An Introduction to Categorical Data Analysis”, and the data is from a study of nesting horseshoe crabs (J. Brockmann, Ethology 1996)

The study objects were female crabs (horseshoe crabs). Each female crab had a male crab attached to her in her nest. The objective of the study was to investigate factors that affect whether the female crab had any other males, called satellites, residing near her. The following covariates were collected for 173 female crabs:

  • C: the color of the female crab (1=light medium, 2=medium, 3=dark medium, 4=dark)
  • S: spine condition (1=both good, 2=one worn or broken, 3=both worn or broken)
  • W: width of carapace (cm)
  • Wt: weight (kg)

The response was the number of satellites, Sa= male crabs residing nearby.

library(ggplot2)
library(GGally)
crab=read.table("https://www.math.ntnu.no/emner/TMA4315/2017h/crab.txt")
colnames(crab)=c("Obs","C","S","W","Wt","Sa")
crab=crab[,-1] #remove column with Obs
crab$C=as.factor(crab$C)
crab$S=as.factor(crab$S)
ggpairs(crab)

Q: Discuss what you see. Any potential covariates to influence Sa? Which distribution can Sa have? A: Seems that C, W and Wt might be correlated with Sa. Also W and Wt are very correlated. Difficult to see the distribution of Sa since plotting a Poisson variable with different means might not be easy to recognize.


The Poisson process

We observe events that may occur within a time interval or a region.

  1. The number of events occurring within a time interval or a region is independent of the number of events that occur in any other disjoint (non-overlapping) time interval or region.
  2. The probability that a single event occurs within a small time interval or region is proportional to the length of the interval or the size of the region.
  3. The probability that more than one event occurs within a small time interval or region is negligible.

When all of these three properties are fulfilled we have a Poisson process. This leads to three distributions:

  • The number of events in a Poisson process follows a Poisson distribution.
  • Time between two events in a Poisson process follows an exponential distribution.
  • The waiting time until \(k\) events have occurred in a Poisson process follows a gamma distribution.

We will first study the Poisson distribution - and link it to a regression setting (and later meet data that might come from a gamma distribution).
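A minimal simulation sketch of the last distribution above (rate and number of events chosen arbitrarily): the waiting time until the 3rd event in a Poisson process with rate 2 should follow a gamma distribution with mean \(3/2\).

set.seed(5)
# waiting time until the 3rd event = sum of 3 exponential inter-event times
w3 = replicate(10000, sum(rexp(3, rate = 2)))
c(mean(w3), 3/2) # empirical mean should be close to the gamma mean shape/rate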


The Poisson distribution

We study a Poisson process within a time interval or a region of specified size. Then, the number of events, \(Y\), will follow a Poisson distribution with parameter \(\lambda\) \[ f(y)=\frac{\lambda^y}{y!}e^{-\lambda} \text{ for } y=0,1,2,… \] Here the parameter \(\lambda\) is the proportionality factor in requirement 2 (above) for the Poisson process. Another popular parameterization is \(\mu\), or given some interval \(\lambda t\), but we will stick with \(\lambda\). In R we calculate the Poisson point probabilities using dpois.

If you want to see how this distribution function is derived from the binomial distribution you may watch this video: Poisson process and distribution

Cumulative distribution (cdf): The cumulative distribution is calculated by summing \(F(y)=\sum_{t\le y} f(t)\), and we might calculate the Poisson cdf in R with ppois.
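A minimal sketch of dpois and ppois (the value of \(\lambda\) is chosen arbitrarily):

lambda = 2
dpois(0:4, lambda) # point probabilities P(Y=0),...,P(Y=4)
ppois(3, lambda) # cdf P(Y<=3)
sum(dpois(0:3, lambda)) # the same, by summing point probabilities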

Expected value and variance: Let \(Y\) follow a Poisson distribution with parameter \(\lambda\). Then \[\text{E}(Y)=\lambda \text{ and } \text{Var}(Y)=\lambda\] Proof: \[ \text{E}(Y)=\sum_{y=0}^{\infty} y \frac{\lambda^y}{y!}e^{-\lambda}= \sum_{y=1}^{\infty} y \frac{\lambda^y}{y!}e^{-\lambda}=\sum_{y=1}^{\infty} \frac{\lambda \lambda^{y-1}}{(y-1)!}e^{-\lambda}=\lambda \sum_{z=0}^{\infty} \frac{\lambda^{z}}{z!}e^{-\lambda}=\lambda \] In the first transition we use that the term \(y=0\) gives no contribution to the sum. Then we cancel out \(y\) in the numerator with the first term of \(y!\) in the denominator. Then we let \(z=y-1\), and finally we use that the sum of the Poisson distribution over all possible outcomes equals 1.

To calculate the variance we first use that \[ \text{Var}(Y)=\text{E}(Y^2)-(\text{E}(Y))^2\] and then that \[Y^2=Y(Y-1)+Y\] so that \[\text{Var}(Y)=\text{E}(Y(Y-1))+\text{E}(Y)-(\text{E}(Y))^2\] The reason for this is that we may use the same type of trick as for \(\text{E}(Y)\) since the sum is over all values of the distribution function. \[ \text{E}(Y(Y-1))=\sum_{y=0}^{\infty} y(y-1) \frac{\lambda^y}{y!}e^{-\lambda}=\sum_{y=2}^{\infty} y(y-1) \frac{\lambda^y}{y!}e^{-\lambda}\] \[=\sum_{y=2}^{\infty} \frac{\lambda^{y-2}\lambda^2}{(y-2)!}e^{-\lambda}=\lambda^2\sum_{y=0}^{\infty}\frac{\lambda^{y}}{y!}e^{-\lambda}=\lambda^2\]

Putting it all together \[\text{Var}(Y)=\lambda^2 +\lambda - \lambda^2=\lambda\] which was to be shown.
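A quick numerical sanity check of \(\text{E}(Y)=\text{Var}(Y)=\lambda\) by simulation (the value of \(\lambda\) is chosen arbitrarily):

set.seed(1)
y = rpois(100000, lambda = 4)
c(mean(y), var(y)) # both should be close to 4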

Properties of the Poisson distribution

  • A sum of \(n\) independent Poisson distributed random variables \(Y_i\), with means \(\lambda_i\), is Poisson distributed with mean \(\sum_{i=1}^n \lambda_i\) (see the sketch below).
  • When the mean increases the Poisson distribution becomes more and more symmetric, and for large \(\lambda\) the Poisson distribution can be approximated by a normal distribution.
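A minimal simulation sketch of the first property (the two means are chosen arbitrarily):

set.seed(2)
s = rpois(100000, 1.5) + rpois(100000, 2.5) # should be Poisson(4)
c(mean(s), var(s)) # both should be close to 4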

Exponential family

In Module 1 we introduced distributions of the \(Y_i\), that could be written in the form of a univariate exponential family \[ f(y_i\mid \theta_i)=\exp \left( \frac{y_i \theta_i-b(\theta_i)}{\phi}\cdot w_i + c(y_i, \phi, w_i) \right) \] where we said that

  • \(\theta_i\) is called the canonical parameter and is a parameter of interest

  • \(\phi\) is called a nuisance parameter (and is not of interest to us, therefore a nuisance)

  • \(w_i\) is a weight function, in most cases \(w_i=1\)

  • b and c are known functions.

It can be shown that \(\text{E}(Y_i)=b'(\theta_i)\) and \(\text{Var}(Y_i)=b''(\theta_i)\cdot \frac{\phi}{w_i}\).

In Module 1 we found that the Poisson distribution \(Y_i\sim \text{Poisson}(\lambda_i)\) is an exponential family (see the derivation in Module 1), and that

  • \(\theta_i=\ln(\lambda_i)\) is the canonical parameter
  • \(\phi=1\), no nuisance
  • \(w_i=1\)
  • \(b(\theta_i)=\exp(\theta_i)\)
  • \(\mu_i=\text{E}(Y_i)=\lambda_i\)

For later (GLM with linear predictor \(\eta_i\)) - to have a canonical link we need \[\theta_i=\eta_i\] Since \(\eta_i=g(\mu_i)=g(\lambda_i)\) this means that we need \[ g(\mu_i)=g(\lambda_i)=\theta_i=\ln(\lambda_i)\] saying that for the Poisson the canonical link is the log.

Regression with count data

  1. Construct a model to help understand the relationship between a count variable and one or many possible explanatory variables. The response measurements are counts.

  2. Use the model for understanding what can explain count, and for prediction of counts.

When modelling the counts the normal approximation might sometimes be used, especially when the counts are high, but in general we use a Poisson model.

It is possible to construct a linear Poisson model with the direct relationship \[\lambda_i=\eta_i\] between the mean of the Poisson distribution and the linear predictor. However, since the rate \(\lambda_i\) cannot be negative, restrictions on the parameter space of \(\beta\) are needed to use this model. We will not use the linear Poisson model, but instead the log-linear Poisson model.


The log-linear Poisson model

Assumptions:

  1. \(Y_i \sim \text{Poisson}(\lambda_i)\), with \(\text{E}(Y_i)=\lambda_i\), and \(\text{Var}(Y_i)=\lambda_i\).

  2. Linear predictor: \(\eta_i={\bf x}_i^T \beta\).

  3. Log link \[\eta_i=\ln(\lambda_i)=g(\lambda_i)\] and (inverse thereof) response function \[\lambda_i=\exp(\eta_i)\]

Assumptions 1 and 3 above can be written as \[Y_i \sim \text{Poisson}(\exp(\eta_i)), \text{ }i=1,\ldots,n\]


Interpreting parameters in the log-linear Poisson model

In the log-linear model the mean, \(\text{E}(Y_i)=\lambda_i\) satisfy an exponential relationship to covariates \[ \lambda_i=\exp(\eta_i)=\exp({\bf x}_i^T \beta)=\exp(\beta_0)\cdot \exp(\beta_1)^{x_{i1}} \cdots \exp(\beta_k)^{x_{ik}}.\]

Let us look in detail at \(\beta_1\) with covariate \(x_{i1}\) for observation \(i\).

  1. If \(x_{i1}\) increases by one unit to \(x_{i1}+1\) then the mean \(\text{E}(Y_i)\) will in our model change by a factor \(\exp(\beta_1)\).
  2. If \(\beta_1\)=0 then \(\exp(\beta_1)=1\), so that a change in \(x_{i1}\) does not change \(\text{E}(Y_i)\).
  3. If \(\beta_1<0\) then \(\exp(\beta_1)<1\), so if \(x_{i1}\) increases then \(\text{E}(Y_i)\) decreases.
  4. If \(\beta_1>0\) then \(\exp(\beta_1)>1\), so if \(x_{i1}\) increases then \(\text{E}(Y_i)\) increases.

Example: interpreting parameters for the female crabs with satellites

We fit a log-linear model to Sa, assuming the number of satellites follows a Poisson distribution with log-link.

Q:

  1. First the model is fitted with intercept only. What do we assume then? Interpret the fit.
  2. Then width W is added as a covariate in the log-linear model. What happens if the width increase by one unit (cm)?
  3. What is the predicted number of satellites for the average width?

model1=glm(Sa~1, family=poisson(link=log),data=crab)
cat("Intercept only\n")
print(model1$coefficients)
cat("Intercept only, exp\n")
exp(model1$coefficients)
model2=glm(Sa~W,family=poisson(link=log),data=crab)
cat("Intercept + W\n")
print(model2$coefficients)
cat("Intercept+W, exp\n")
exp(model2$coefficients)
cat("summary of width\n")
summary(crab$W)
cat("what is this?\n")
print(exp(model2$coefficients[1]+model2$coefficients[2]*mean(crab$W)))
## Intercept only
## (Intercept) 
##    1.071267 
## Intercept only, exp
## (Intercept) 
##    2.919075 
## Intercept + W
## (Intercept)           W 
##  -3.3047572   0.1640451 
## Intercept+W, exp
## (Intercept)           W 
##  0.03670812  1.17826744 
## summary of width
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    21.0    24.9    26.1    26.3    27.7    33.5 
## what is this?
## (Intercept) 
##    2.744061

Parameter estimation with maximum likelihood

Our parameter of interest is the vector \(\beta\) of regression coefficients, and we have no nuisance parameters. We would like to estimate \(\beta\) by maximizing the likelihood - the presentation here is essentially the same as for Module 3: Binary regression - with Poisson and log instead of Bernoulli and logit. And, also here we will not have a closed form solution for \(\hat{\beta}\).

Likelihood \(L(\beta)\)

We assume that pairs of covariates and response are measured independently of each other: \(({\bf x}_i,Y_i)\), and \(Y_i\) follows the distribution specified above, and \({\bf x}_i\) is fixed.

\[L(\beta)=\prod_{i=1}^n L_i(\beta)=\prod_{i=1}^n f(y_i; \beta)=\prod_{i=1}^n\frac{\lambda_i^{y_i}}{y_i!}\exp(-\lambda_i)\]

Note: still a slight misuse of notation - where is \(\beta\)? A: \(\eta_i=\ln(\lambda_i)={\bf x}_i^T\beta\), so replace \(\lambda_i\) with \(\exp({\bf x}_i^T \beta)\).


Loglikelihood \(l(\beta)\)

The log-likelihood is just the natural log of the likelihood, and we work with the log-likelihood because this makes the mathematics simpler - since we work with exponential families. The main aim with the likelihood is to maximize it to find the maximum likelihood estimate, and since the log is a monotone function the maximum of the log-likelihood will be in the same place as the maximum of the likelihood.

\[l(\beta)=\ln L(\beta)=\sum_{i=1}^n \ln L_i(\beta)=\sum_{i=1}^n l_i(\beta)=\sum_{i=1}^n [y_i \ln(\lambda_i)-\lambda_i-\ln(y_i!)]\] Observe that the log-likelihood is a sum of individual contributions for each observation pair \(i\). We often omit the last term since it is not a function of the model parameters, only of the data.

If we want a function of \(\eta_i=\ln(\lambda_i)\) or \(\beta\): \[l(\beta)=\sum_{i=1}^n[y_i \eta_i-\exp(\eta_i)+C_i]=\sum_{i=1}^n y_i{\bf x}_i^T\beta-\sum_{i=1}^n \exp({\bf x}_i^T\beta) +C\]
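A minimal sketch evaluating this log-likelihood at the ML estimate for the crab model with width as covariate (assumes crab and model2, fitted above, are available); the result should agree with R's logLik:

X = model.matrix(~W, data = crab)
eta = X %*% coef(model2)
sum(crab$Sa * eta - exp(eta) - lfactorial(crab$Sa)) # compare with logLik(model2)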


Score function \(s(\beta)\)

The score function is a \(p\times 1\) vector, \(s(\beta)\), with the partial derivatives of the log-likelihood with respect to the \(p\) elements of the \(\beta\) vector. Remember, the score function is linear in the individual contributions: \[s(\beta)=\frac{\partial l(\beta)}{\partial \beta}= \sum_{i=1}^n \frac{\partial l_i(\beta)}{\partial \beta}= \sum_{i=1}^n s_i(\beta)\]

We work with \(l_i(\beta)=l_i(\lambda_i(\eta_i(\beta)))\) and use the chain rule to find \(s_i(\beta)\).

\[s_i(\beta)=\frac{\partial l_i(\beta)}{\partial \beta}=\frac{\partial l_i(\beta)}{\partial \eta_i}\cdot \frac{\partial \eta_i}{\partial \beta}=\frac{\partial [y_i\eta_i-\exp(\eta_i)+C_i]}{\partial \eta_i}\cdot \frac{\partial [{\bf x}_i^T\beta ]}{\partial \beta}=[y_i-\exp(\eta_i)]\cdot {\bf x}_i=(y_i-\lambda_i){\bf x}_i \] See Module 3 for rules for partial derivatives of scalar wrt vector.

The score function is given as:

\[s(\beta)=\sum_{i=1}^n s_i(\beta)=\sum_{i=1}^n (y_i-\lambda_i){\bf x}_i\] Again, observe that \(\text{E}(s_i(\beta))=\text{E}((Y_i-\lambda_i){\bf x}_i)=0\) because \(\text{E}(Y_i)=\lambda_i\), and thus also \(\text{E}(s(\beta))=0\).

To find the maximum likelihood estimate \(\hat{\beta}\) we solve the set of \(p\) non-linear equations: \[s(\hat{\beta})=0\] And, as before we do that using the Newton-Raphson or Fisher Scoring iterative methods, so we need the derivative of the score vector (our Fisher information).


The expected Fisher information matrix \(F(\beta)\)

The covariance of \(s(\beta)\) is called the expected Fisher information matrix, \(F(\beta)\) and is given by

\[\begin{align} F(\beta) &= \text{Cov}(s(\beta)) = \sum_{i=1}^n \text{Cov}(s_i(\beta)) \\ &= \sum_{i=1}^n E\left[\Big(s_i(\beta) - E(s_i(\beta))\Big)\Big(s_i(\beta)-E(s_i(\beta))\Big)^T\right] \\ &= \sum_{i=1}^n E(s_i(\beta)s_i(\beta)^T) = \sum_{i=1}^n F_i(\beta) \end{align}\]

where it is used that the responses \(Y_i\) and \(Y_j\) are independent, and that \(E(s_i(\beta)) = 0 \ \forall i\).

Remember that \(s_i(\beta)=(Y_i-\lambda_i){\bf x}_i\), then:

\[ F_i(\beta) = E(s_i(\beta)s_i(\beta)^T) = E((Y_i-\lambda_i){\bf x}_i(Y_i-\lambda_i){\bf x}_i^T) = {\bf x}_i{\bf x}_i^T E((Y_i-\lambda_i)^2) = {\bf x}_i{\bf x}_i^T \lambda_i \] where \(E((Y_i-\lambda_i)^2)=\text{Var}(Y_i)=\lambda_i\) is the variance of \(Y_i\). Thus

\[F(\beta) = \sum_{i=1}^n {\bf x}_i{\bf x}_i^T \lambda_i.\]

Observed Fisher information matrix \(H(\beta)\)

We do not really need the observed version of the Fisher information matrix here: since we use the canonical link, \(H(\beta)=F(\beta)\) - so we already have it.

But, for completeness, we add the direct derivation of \(H(\beta)\).

\[H(\beta) = -\frac{\partial^2l(\beta)}{\partial\beta\partial\beta^T} = -\frac{\partial s(\beta)}{\partial\beta^T} = \frac{\partial}{\partial\beta^T}\left[\sum_{i=1}^n (\lambda_i-y_i){\bf x}_i \right] =\frac{\partial}{\partial\beta^T}\left[\sum_{i=1}^n (\exp(\eta_i)-y_i){\bf x}_i \right]\]

because \(s(\beta) = \sum_{i=1}^n (y_i-\lambda_i){\bf x}_i\) and hence \(-s(\beta) = \sum_{i=1}^n (\lambda_i-y_i){\bf x}_i\). Note that \(\lambda_i = \exp(\eta_i)\).

\[H(\beta) = \sum_{i=1}^n \frac{\partial}{\partial\beta^T}[{\bf x}_i\lambda_i-{\bf x}_iy_i] = \sum_{i=1}^n \frac{\partial}{\partial\beta^T}{\bf x}_i\lambda_i = \sum_{i=1}^n {\bf x}_i \frac{\partial \lambda_i}{\partial \eta_i} \frac{\partial \eta_i}{\partial \beta^T} \]

Use that

\[ \frac{\partial \eta_i}{\partial \beta^T}=\frac{\partial {\bf x}_i^T\beta}{\partial \beta^T} = \left(\frac{\partial {\bf x}_i^T\beta}{\partial \beta}\right)^T = {\bf x}_i^T \]

and

\[ \frac{\partial \lambda_i}{\partial \eta_i} = \frac{\partial\exp(\eta_i)}{\partial \eta_i} = \exp(\eta_i)=\lambda_i\]

And thus

\[H(\beta) =\sum_{i=1}^n {\bf x}_i {\bf x}_i^T\lambda_i.\] Note that the observed and the expected Fisher information matrix are equal (see below - canonical link - that this is a general finding).

Parameter estimation - in practice

To find the ML estimate \(\hat{\beta}\) we need to solve \[s(\hat{\beta})=0\] We have that the score function for the log-linear model is: \[s(\beta)=\sum_{i=1}^n {\bf x}_i (y_i-\lambda_i)=\sum_{i=1}^n {\bf x}_i (y_i-\exp({\bf x}_i^T\beta)).\] Observe that this is a non-linear function in \(\beta\), and has no closed form solution.

Fisher scoring

To solve this we use the Fisher scoring algorithm, where at iteration \(t+1\) we have

\[\beta^{(t+1)}=\beta^{(t)} + F(\beta^{(t)})^{-1} s(\beta^{(t)})\]
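A minimal sketch of these iterations for the crab model Sa~W (assumes crab is loaded as above; the starting value is an arbitrary but convenient choice); the result should agree with coef(model2):

X = model.matrix(~W, data = crab)
y = crab$Sa
beta = c(log(mean(y)), 0) # simple starting value
for (iter in 1:20) {
  lambda = as.vector(exp(X %*% beta)) # lambda_i = exp(x_i^T beta)
  score = t(X) %*% (y - lambda) # s(beta)
  Fmat = t(X) %*% (X * lambda) # F(beta) = X^T W X with W = diag(lambda_i)
  beta = beta + as.vector(solve(Fmat, score))
}
beta # compare with coef(model2)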

Requirements for convergence

For the Fisher scoring algorithm the expected Fisher information matrix \(F\) needs to be invertible, and analogous to what we saw in Module 3 this is possible if \(\lambda_i>0\) for all \(i\) and the design matrix has full rank (\(p\)). Since we have the log-link, \(\lambda_i=\exp({\bf x}_i^T\beta)\) is always positive, so all good. Note, with the linear link \(\lambda_i=\eta_i\) this might be a challenge, and restrictions on \(\beta\) must be set.

Again, it is possible that the algorithm does not converge. This may happen for “unfavorable” data configurations (especially for small samples). According to our textbook, Fahrmeir et al (2013), page 284, the conditions for uniqueness and existence of ML estimators are very complex, and the authors suggest that the GLM user instead checks for convergence in practice by performing the iterations - also for the Poisson log-linear model.

Statistical Inference

Asymptotic properties of ML estimates

We repeat what we found in Module 3: Under some (weak) regularity conditions:

Let \(\hat{\beta}\) be the maximum likelihood (ML) estimate in the GLM model. As the total sample size increases, \(n\rightarrow \infty\):

  1. \(\hat{\beta}\) exists
  2. \(\hat{\beta}\) is consistent (convergence in probability, yielding an asymptotically unbiased estimator whose variance goes towards 0)
  3. \(\hat{\beta} \approx N_p(\beta,F^{-1}(\hat{\beta}))\)

Observe that this means that asymptotically \(\text{Cov}(\hat{\beta})=F^{-1}(\hat{\beta})\): the inverse of the expected Fisher information matrix evaluated at the ML estimate.

In our case we have \[F(\beta)=\sum_{i=1}^n {\bf x}_i{\bf x}_i^T \lambda_i={\bf X}^T {\bf W} {\bf X},\] where \({\bf W}=\text{diag}(\lambda_i)\). This means \(\text{Cov}(\hat{\beta})=({\bf X}^T {\bf W} {\bf X})^{-1}\) (remember that \(\hat{\beta}\) comes in with \(\hat{\lambda}_i\) in \({\bf W}\)).
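A small check of this for model2 (assumes crab and model2 from above):

X = model.matrix(~W, data = crab)
Wmat = diag(model2$fitted.values) # diag(lambda_hat_i)
solve(t(X) %*% Wmat %*% X) # compare with vcov(model2)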

Let \({\bf A}(\beta)=F^{-1}(\beta)\), and let \(a_{jj}(\beta)\) be diagonal element number \(j\).

For one element of the parameter vector: \[ Z_j=\frac{\hat{\beta}_j-\beta_j}{\sqrt{a_{jj}(\hat{\beta})}}\] is asymptotically standard normal, which can be used to make confidence intervals - and test hypotheses.


Confidence intervals

In addition to providing a parameter estimate for each element of our parameter vector \(\beta\) we should also report a \((1-\alpha)100\)% confidence interval (CI) for each element.

Let \(z_{\alpha/2}\) be such that \(P(Z_j>z_{\alpha/2})=\alpha/2\). We then use \[ P(-z_{\alpha/2}\le Z_j \le z_{\alpha/2})=1-\alpha\] insert \(Z_j\) and solve for \(\beta_j\) to get \[ P(\hat{\beta}_j-z_{\alpha/2}\sqrt{a_{jj}(\hat{\beta})} \le \beta_j \le \hat{\beta}_j+z_{\alpha/2}\sqrt{a_{jj}(\hat{\beta})})=1-\alpha\]

A \((1-\alpha)100\)% CI for \(\beta_j\) is obtained when we insert numerical values for the upper and lower limits.


Example: Female crabs with satellites

model2=glm(Sa~W,family=poisson(link=log),data=crab)
summary(model2)
cat("lower\n")
lower=model2$coefficients-qnorm(0.975)*sqrt(diag(vcov(model2)))
lower
cat("upper\n")
upper=model2$coefficients+qnorm(0.975)*sqrt(diag(vcov(model2)))
upper
confint(model2)
## 
## Call:
## glm(formula = Sa ~ W, family = poisson(link = log), data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## W            0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6
## 
## lower
## (Intercept)           W 
##  -4.3675312   0.1249137 
## upper
## (Intercept)           W 
##  -2.2419833   0.2031764 
##                  2.5 %     97.5 %
## (Intercept) -4.3662326 -2.2406858
## W            0.1247244  0.2029871

Q:

  1. Explain what is done in the R-print-out.
  2. What if we instead want a CI for \(\beta_0+\beta_1 x_{i1}\)?
  3. What if we instead want a CI for \(\lambda_i=\exp(\beta_0+\beta_1 x_{i1})\)?

Hypothesis testing

There are three methods that are mainly used for testing hypotheses in GLMs - these are called Wald test, likelihood ratio test and score test. In Module 3 we looked at the Wald and likelihood ratio test - and what we found there still applies for the Poisson log-linear model.

We only repeat the previous findings, and then add an example with the crab data.

We will look at null and alternative hypotheses that can be written \[H_0: {\bf C}{\bf \beta}={\bf d} \text{ vs. } H_1: {\bf C}{\bf \beta}\neq {\bf d} \] by specifying \({\bf C}\) to be a \(r \times p\) matrix and \({\bf d}\) to be a column vector of length \(r\).

The Wald test

The Wald test statistic is given as: \[w=({\bf C}\hat{{\bf \beta}}-{\bf d})^{\text T}[{\bf C}F^{-1}(\hat{\beta}){\bf C}^{\text T}]^{-1}({\bf C}\hat{{\bf \beta}}-{\bf d}) \] and measures the distance between the estimate \({\bf C}\hat{\beta}\) and the value under the null hypothesis \({\bf d}\), weighted by the asymptotic covariance matrix of \({\bf C}\hat{\beta}\), and under the null follows a \(\chi^2\) distribution with \(r\) degrees of freedom (where \(r\) is the number of hypotheses tested). \(P\)-values are calculated in the upper tail of the \(\chi^2\)-distribution.
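A minimal sketch of the Wald test of \(H_0: \beta_W=0\) in model2 (assumes model2 from above), with \({\bf C}=(0 \; 1)\) and \({\bf d}=0\); the statistic should equal the squared \(z\)-value for W in summary(model2):

C = matrix(c(0, 1), nrow = 1) # picks out beta_W
betahat = coef(model2)
w = t(C %*% betahat) %*% solve(C %*% vcov(model2) %*% t(C)) %*% (C %*% betahat) # d = 0
c(w, pchisq(w, df = 1, lower.tail = FALSE)) # statistic and p-value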

The likelihood ratio test

Notation:

  • A: the larger model and
  • B: the smaller model (under \(H_0\)), and the smaller model is nested within the larger model (that is, B is a submodel of A).

The likelihood ratio statistic is defined as \[- 2\ln \lambda=-2(\ln L(\hat{\beta}_B)-\ln L(\hat{\beta}_A)) \] which under the null is asymptotically \(\chi^2\)-distributed with degrees of freedom equal the difference in the number of parameters in the large and the small model. Again, \(p\)-values are calculated in the upper tail of the \(\chi^2\)-distribution.
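For model1 (intercept only, the small model B) nested within model2 (the larger model A), a minimal sketch (assumes both models from above); the statistic equals the difference between the two deviances:

lrt = -2 * (as.numeric(logLik(model1)) - as.numeric(logLik(model2)))
c(lrt, pchisq(lrt, df = 1, lower.tail = FALSE)) # statistic and p-value
# equivalently: anova(model1, model2, test = "LRT")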


Example: Female crabs with satellites - the different tests.

We fit a model with two covariates, one is categorical and we use effect coding.

model3=glm(Sa~W+C,family=poisson(link=log),data=crab,contrasts=list(C="contr.sum"))
summary(model3)
## 
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab, 
##     contrasts = list(C = "contr.sum"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0415  -1.9581  -0.5575   0.9830   4.7523  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.92089    0.56010  -5.215 1.84e-07 ***
## W            0.14934    0.02084   7.166 7.73e-13 ***
## C1           0.27085    0.11784   2.298   0.0215 *  
## C2           0.07117    0.07296   0.975   0.3294    
## C3          -0.16551    0.09316  -1.777   0.0756 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6
# possible to use type III in Anova
library(car)
Anova(model3,type="III",test.statistic = "Wald") #same as summary?
## Analysis of Deviance Table (Type III tests)
## 
## Response: Sa
##             Df  Chisq Pr(>Chisq)    
## (Intercept)  1 27.196  1.839e-07 ***
## W            1 51.350  7.726e-13 ***
## C            3  8.518    0.03644 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova(model3,type="III",test.statistic = "LR") #same as comparing deviances
## Analysis of Deviance Table (Type III tests)
## 
## Response: Sa
##   LR Chisq Df Pr(>Chisq)    
## W   49.794  1  1.707e-12 ***
## C    8.534  3    0.03618 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# compare to type I with anova
anova(model3,test="Chisq")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Sa
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   172     632.79              
## W     1   64.913       171     567.88 7.828e-16 ***
## C     3    8.534       168     559.34   0.03618 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model3,test="LRT")
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: Sa
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   172     632.79              
## W     1   64.913       171     567.88 7.828e-16 ***
## C     3    8.534       168     559.34   0.03618 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q: Comment on what you see in the print-out.

Residuals

Two types of residuals are popular: deviance and Pearson.

Deviance residuals

Recall that the deviance was defined by relating a candidate model to a saturated model and calculating the likelihood ratio statistic with these two models.

Saturated model: If we were to provide a perfect fit to our data then we would estimate the mean \(\lambda_i\) by the observed count for observation \(i\). That is, \(\tilde{\lambda}_i=y_i\). Then \(\tilde{\lambda}\) is an \(n\)-dimensional column vector with the elements \(\tilde{\lambda}_i\).

Candidate model: The model that we are investigating can be thought of as a candidate model. We maximize the likelihood and get \(\hat{\beta}\), which through our linear predictor and link function we turn into \(\hat{\lambda}_i\), also called \(\hat{y}_i\). Then \(\hat{\lambda}\) is an \(n\)-dimensional column vector with the elements \(\hat{\lambda}_i\), also called \(\hat{y}\).

\[D=-2(\ln L(\text{candidate model})-\ln L(\text{saturated model}))=-2(l(\hat{\lambda})-l(\tilde{\lambda}))= -2\sum_{i=1}^n(l_i(\hat{\lambda}_i)-l_i(\tilde{\lambda}_i))\] Inserting the Poisson likelihood and \(\lambda_i\) estimates this gives:

\[ D=2 \sum_{i=1}^n [y_i\ln(\frac{y_i}{\hat{y}_i})-(y_i-\hat{y}_i)]\] Where \(\hat{y}_i=\exp({\bf x}_i^T \hat{\beta})\). Verify this by yourself.

The deviance residuals are given by a signed version of each element in the sum for the deviance, that is \[d_i=\text{sign}(y_i-\hat{y}_i)\cdot \left\{ 2[y_i\ln(\frac{y_i}{\hat{y}_i})-(y_i-\hat{y}_i)]\right\}^{1/2}\] where the term \(\text{sign}(y_i-\hat{y}_i)\) makes negative residuals possible - and we get the same sign as for the Pearson residuals.
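A minimal sketch computing the deviance residuals by hand for model3 (assumes crab and model3 from above; the ifelse handles \(y_i=0\), where the term \(y_i\ln(y_i/\hat{y}_i)\) is taken to be 0):

y = crab$Sa
yhat = model3$fitted.values
term = ifelse(y == 0, 0, y * log(y/yhat))
di = sign(y - yhat) * sqrt(2 * (term - (y - yhat)))
all.equal(as.vector(di), as.vector(residuals(model3, type = "deviance")))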


Pearson residuals

The Pearson residuals are given as \[ r_i=\frac{o_i-e_i}{\sqrt{e_i}}\] where \(o_i\) is the observed count for observation \(i\) and \(e_i\) is the estimated expected count for observation \(i\). We have that \(o_i=y_i\) and \(e_i=\hat{y}_i=\hat{\lambda}_i=\exp({\bf x}_i^T\hat{\beta})\).

Remark: A standardized version scales the Pearson residuals with \(\sqrt{1-h_{ii}}\) similar to the standardized residuals for the normal model. Here \(h_{ii}\) is the diagonal element number \(i\) in the hat matrix \({\bf H}={\bf X}({\bf X}^T{\bf X})^{-1}{\bf X}^T\).
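And correspondingly for the Pearson residuals (assumes crab and model3 from above):

ri = (crab$Sa - model3$fitted.values)/sqrt(model3$fitted.values)
all.equal(as.vector(ri), as.vector(residuals(model3, type = "pearson")))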


Plotting residuals

Deviance and Pearson residuals can be used for checking the fit of the model, by plotting the residuals against fitted values and covariates. Normality of the residuals is also assumed, and can be checked using qq-plots as for the MLR in Module 2.

Below - notice the trend in the residuals; this is due to the discrete nature of the response. The plot with different shades of blue shows that the curved structures correspond to equal values of \(y\).

library(ggplot2)
model3=glm(Sa~W+C,family=poisson(link=log),data=crab,contrasts=list(C="contr.sum",S="contr.sum"))
df=data.frame("Sa"=crab$Sa,"fitted"=model3$fitted.values,"dres"=residuals(model3,type="deviance"),"pres"=residuals(model3,type="pearson"))
gg1=ggplot(df)+geom_point(aes(x=fitted,y=dres,color="deviance"))+geom_point(aes(x=fitted,y=pres,color="pearson"))
gg1    

gg2=ggplot(df)+geom_point(aes(x=fitted,y=pres,color=Sa))
gg2

qqnorm(residuals(model3,type="deviance"))
qqline(residuals(model3,type="deviance"))

qqnorm(residuals(model3,type="pearson"))
qqline(residuals(model3,type="pearson"))

Model Assessment and Model Choice

The fit of the model can be assessed based on goodness of fit statistics (and related tests) and by residual plots. Model choice can be made from analysis of deviance, or by comparing the AIC for different models.

Deviance test

We may use the deviance test presented in Module 3 to test if the model under study is preferred compared to the saturated model.

We may write the deviance statistic as a sum of the squared deviance residuals.

\[ D=2 \sum_{i=1}^n [y_i\ln(\frac{y_i}{\hat{y}_i})-(y_i-\hat{y}_i)]\]

Remark: if \(\sum_{i=1}^n y_i=\sum_{i=1}^n \hat{y}_i\) then the deviance will be equal to \[ D=2 \sum_{i=1}^n y_i\ln(\frac{y_i}{\hat{y}_i})\] Q: is this the case for the log-linear model?

The deviance statistic might be approximately \(\chi^2_{n-p}\), at least when the counts are high.

Note: see below for remark.

Remember that the likelihood ratio test can be performed using the difference between two deviances.


Pearson test

The Pearson \(\chi^2\)-goodness of fit statistic is given as the sum of the squared Pearson residuals

\[ X_P^2=\sum_{i=1}^n r_i^2=\sum_{i=1}^n \frac{(y_i-\hat{y}_i)^2}{\hat{y}_i}\] where \(\hat{y}_i=\hat{\lambda}_i=\exp({\bf x}_i^T\hat{\beta})\). The Pearson \(\chi^2\) statistic is asymptotically equivalent to the deviance statistic and thus is asymptotically \(\chi^2_{n-p}\).

Remark: the asymptotic distribution of both statistics (deviance and Pearson) is questionable when there are many low counts. Agresti (1996, page 990) suggests analysing grouped data, for example by grouping by width in the crab example.

Remark: The Pearson statistic is also used for testing independence in contingency tables - next week.


Example: goodness of fit with female crabs

model3=glm(Sa~W+C,family=poisson(link=log),data=crab,contrasts=list(C="contr.sum"))
summary(model3)
## 
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab, 
##     contrasts = list(C = "contr.sum"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0415  -1.9581  -0.5575   0.9830   4.7523  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.92089    0.56010  -5.215 1.84e-07 ***
## W            0.14934    0.02084   7.166 7.73e-13 ***
## C1           0.27085    0.11784   2.298   0.0215 *  
## C2           0.07117    0.07296   0.975   0.3294    
## C3          -0.16551    0.09316  -1.777   0.0756 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6
1-pchisq(model3$deviance,model3$df.residual)
## [1] 0
Xp=sum(residuals(model3,type="pearson")^2)
Xp
## [1] 543.249
1-pchisq(Xp,model3$df.residual)
## [1] 0

Q: Comment on the print-out. Is this a good fit? What might a bad fit be due to? A: wrong model or missing covariates.


AIC

Identical to Module 3 - we may use the Akaike information criterion. Let \(p\) be the number of regression parameters in our model. \[\text{AIC} =-2 \cdot l(\hat{\beta})+2p\] A scaled version of AIC, standardizing for sample size, is sometimes preferred. And, we may also use the BIC, where \(2p\) is replaced by \(\log(n)\cdot p\).
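A small check of the AIC formula for model2 (assumes model2 from above):

p = length(coef(model2))
-2 * as.numeric(logLik(model2)) + 2 * p # compare with AIC(model2)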

Analysis of deviance

Identical to Module 3 we may also sequentially compare models, and use analysis of deviance for this.

Overdispersion

Count data might show greater variability in the response counts than we would expect if the response followed a Poisson distribution. This is called overdispersion.

Our model states that the variance is \(\text{Var}(Y_i)=\lambda_i\). If we change the model to \(\text{Var}(Y_i)=\phi\lambda_i\) we may allow for an increased variance due to heterogeneity among subjects.

Agresti (1996, page 92) explains: for our crab data set, what if width, weight, colour and spine condition all affect the number of satellites for a female crab, but we only fitted a model with width as covariate? Then the crabs with a certain width are a mixture of crabs of various weights, colours and spine conditions - that is, a mixture of several Poisson populations, each with its own mean for the response. This heterogeneity may give an overall response distribution where the variance is greater than the standard Poisson variance.

The overdispersion parameter can be estimated as the average Pearson statistic or average deviance

\[\hat{\phi}_D = \frac{1}{n-p} D\]

where \(D\) is the deviance. Note the similarity to \(\hat{\sigma}^2 = \frac{1}{n-p}\text{SSE}\) in the MLR. The \(\text{Cov}(\hat{\beta})\) can then be changed to \(\hat{\phi}F^{-1}(\hat{\beta})\), so we multiply the standard errors by the square root of \(\hat{\phi}_D\).

Remark: We are now moving from likelihood to quasi-likelihood theory, where only \(\text{E}(Y_i)\) and \(\text{Var}(Y_i)\) - and not the distribution of \(Y_i\) - are used in the estimation.

model.disp=glm(Sa~W, family=quasipoisson(link=log), data=crab)
summary.glm(model.disp)
## 
## Call:
## glm(formula = Sa ~ W, family = quasipoisson(link = log), data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.30476    0.96729  -3.417 0.000793 ***
## W            0.16405    0.03562   4.606 7.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.182205)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6
summary.glm(model.disp)$dispersion
## [1] 3.182205
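The dispersion reported by quasipoisson is the Pearson-based estimate, i.e. the average Pearson statistic; a small check (assumes model2, the Poisson fit of Sa~W, from above):

sum(residuals(model2, type = "pearson")^2)/model2$df.residual # should match the dispersion above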

Rate models and offset

In the Poisson process we might analyse an event that occurs within a time interval or region in space, and therefore it is often of interest to model the rate at which events occur.

Agresti (1996, page 86): what if we want to model the number of auto thefts for a year in a sample of cities. We would make a rate for each city by dividing the number of auto thefts by the population size of the city. The model could then describe how this rate depends on unemployment rate, median income, percentage of residents having completed high school.

Now we don’t want a model for \(Y_i\) but for \(Y_i/t_i\), where \(t_i\) denotes the index (population size in the example) associated with observation \(i\). The expected value of \(Y_i/t_i\) would then be \(\text{E}(Y_i)/t_i=\lambda_i/t_i\), and a log-linear model would be \[ \log(\lambda_i/t_i)={\bf x}_i^T \beta\] We may equivalently write the model as \[ \log(\lambda_i)-\log(t_i)={\bf x}_i^T \beta\] This adjustment term is called an offset and is a known quantity.

The expected number of outcomes will then satisfy \[ \text{E}(Y_i)=t_i \exp({\bf x}_i^T \beta).\] We will look more at models with offset in the interactive session.
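A minimal simulated sketch of a rate model with an offset (all values chosen arbitrarily); the regression coefficients should be recovered even though the exposures \(t_i\) vary:

set.seed(3)
ti = runif(100, 50, 150) # exposures t_i, e.g. population sizes
x = rnorm(100)
y = rpois(100, lambda = ti * exp(0.5 + 0.3 * x))
fitoff = glm(y ~ x, family = poisson, offset = log(ti))
coef(fitoff) # should be close to (0.5, 0.3)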

Interactive session - first week

We will test out a new strategy for the interactive session. We keep the randomly assigned groups, but all groups will

  • work with Problem 1 in 12.15-12.50,
  • and then summarize findings in plenum (groups may volunteer to answer) 12.50-13.00.
  • Then 15 minutes break.
  • Then work with Problem 2 in 13.15-13.50, and
  • then summarize the findings in plenum 13.50-14.00.
  • Go to the additional exam questions if you need more work!

Problem 1: Exam 2005 (Problem 1d-f - slightly modified) - Female crabs and satellites

(Only a subset of 20 crabs from the original data was used in the exam problems, but we will use the full data set of 173 crabs - so results will not be the same. Permitted aids for the exam were “all printed and handwritten material and all calculators”.)

We assume that the number of satellites for each female crab follows a Poisson distribution and want to perform a Poisson regression using a log-link to find if there is a connection between the expected number of satellites Sa and the width W and colour C of the carapace of the female crab.

  • C: the color of the female crab (1=light medium, 2=medium, 3=dark medium, 4=dark)
  • W: width of carapace (cm)

The following model was fitted.

crab <- read.table("https://www.math.ntnu.no/emner/TMA4315/2017h/crab.txt")
colnames(crab) <- c("Obs", "C", "S", "W", "Wt", "Sa")
crab <- crab[,-1] #remove column with Obs
crab$C <- as.factor(crab$C)
modelEXAM <- glm(Sa ~ W + C, family = poisson(link = log), data = crab, contrasts = list(C = "contr.sum"))
summary(modelEXAM)
## 
## Call:
## glm(formula = Sa ~ W + C, family = poisson(link = log), data = crab, 
##     contrasts = list(C = "contr.sum"))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0415  -1.9581  -0.5575   0.9830   4.7523  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.92089    0.56010  -5.215 1.84e-07 ***
## W            0.14934    0.02084   7.166 7.73e-13 ***
## C1           0.27085    0.11784   2.298   0.0215 *  
## C2           0.07117    0.07296   0.975   0.3294    
## C3          -0.16551    0.09316  -1.777   0.0756 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 559.34  on 168  degrees of freedom
## AIC: 924.64
## 
## Number of Fisher Scoring iterations: 6
library(car)
Anova(modelEXAM, type = "III", test.statistic = "Wald")
## Analysis of Deviance Table (Type III tests)
## 
## Response: Sa
##             Df  Chisq Pr(>Chisq)    
## (Intercept)  1 27.196  1.839e-07 ***
## W            1 51.350  7.726e-13 ***
## C            3  8.518    0.03644 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

a) Write down the estimated model and evaluate the significance of the estimated coefficients. Perform a test to evaluate the fit of the model. Use a 5% significance level for your evaluations. What is your conclusion?

b) Make a sketch by hand illustrating the connection between the expected number of satellites and the width of the carapace for each of the four colours of the crab. The estimated multiplicative change in the expected number of satellites when the width of the carapace is changed by 1 cm is, according to the model, independent of the colour. Explain why. Also find a 95% confidence interval for this change.

c) Let \(\hat{\eta}(x_{W})\) be the estimated linear predictor when the width of the carapace is \(x_{W}\) and the crab is “light medium”. What is the distribution of \(\hat{\eta}(x_{W})\)? Which value of \(x_W\) would give estimated mean number of satellites equal to 5?

New: Using R – for this value of \(x_W\), construct a 95% confidence interval for the mean number of satellites. New: Use ggplot to create (and improve) the sketch from b).

Problem 2: Exam 2007 (Problem 1, a bit modified) - Smoking and lung cancer

(Permitted aids for the exam were “Tabeller og formler i statistikk”, Matematisk formelsamling (Rottmann), one A5 sheet with your own handwritten notes, and a simple calculator.)

The dataset given in smoking.txt consists of four variables:

  • deaths: number of lung cancer deaths over a period of six years [remark: incorrectly 1 year in exam question]
  • population: the number of people [remark: incorrectly in 100 000 people in exam question]
  • age: in five-year age groups (40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+)
  • ageLevel: age group as numbers from 1 to 9 (1 corresponds to 40-44, 2 to 45-59, and so on)
  • smoking status: doesn’t smoke (no), smokes cigars or pipe only (cigarPipeOnly), smokes cigarettes and cigar or pipe (cigarettePlus), and smokes cigarettes only (cigaretteOnly)

You can look at the dataset here: https://www.math.ntnu.no/emner/TMA4315/2017h/smoking.txt. The smoking dataset can be found here as well: http://data.princeton.edu/wws509/datasets/#smoking

Remark: the data set is probably taken from https://www.jstor.org/stable/41983444?seq=1#page_scan_tab_contents, where it is said that the persons under study are males who had served in wars before 1956 and who answered a questionnaire. The authors point out that the dataset is not representative of the whole population.

We are interested in studying if the mortality rate due to lung cancer (the number of deaths due to lung cancer per individual during six years), controlled for age group, varies with smoking status. Assume that the number of deaths for each set of covariate values, \(Y_i\), can be considered Poisson distributed, \(Y_i \sim \text{Poisson}(\lambda_i)\). We fit the following model:

# load data:
smoking <- read.table(file = "https://www.math.ntnu.no/emner/TMA4315/2017h/smoking.txt")
head(smoking)
##   dead  pop   age ageLevel smoke
## 1   18  656 40-44        1    no
## 2   22  359 45-59        2    no
## 3   19  249 50-54        3    no
## 4   55  632 55-59        4    no
## 5  117 1067 60-64        5    no
## 6  170  897 65-69        6    no
nrow(smoking)
## [1] 36
model1 <- glm(dead ~ age + smoke, family = "poisson", data = smoking, offset = log(pop)) 
# note that the size of the population for each combination of the covariates is the offset here
summary(model1)
## 
## Call:
## glm(formula = dead ~ age + smoke, family = "poisson", data = smoking, 
##     offset = log(pop))
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06055  -0.54773   0.06431   0.29963   1.48348  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.63222    0.06783 -53.552  < 2e-16 ***
## age45-59             0.55388    0.07999   6.924 4.38e-12 ***
## age50-54             0.98039    0.07682  12.762  < 2e-16 ***
## age55-59             1.37946    0.06526  21.138  < 2e-16 ***
## age60-64             1.65423    0.06257  26.439  < 2e-16 ***
## age65-69             1.99817    0.06279  31.824  < 2e-16 ***
## age70-74             2.27141    0.06435  35.296  < 2e-16 ***
## age75-79             2.55858    0.06778  37.746  < 2e-16 ***
## age80+               2.84692    0.07242  39.310  < 2e-16 ***
## smokecigarretteOnly  0.36915    0.03791   9.737  < 2e-16 ***
## smokecigarrettePlus  0.17015    0.03643   4.671 3.00e-06 ***
## smokeno             -0.04781    0.04699  -1.017    0.309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4055.984  on 35  degrees of freedom
## Residual deviance:   21.487  on 24  degrees of freedom
## AIC: 285.51
## 
## Number of Fisher Scoring iterations: 4

a) What is an offset?
For 53-year old non-smokers, what is the estimated number of deaths due to lung cancer (per person over 6 years)?
Why is the number of degrees of freedom for the deviance of this model 24? Does the model give a good fit?

b) Let \(\lambda(a,s)\) denote the expected number of lung cancer deaths per person in age group \(a\) with smoking status \(s\). For two different smoking statuses \(s_1\) and \(s_2\), define

\[r(a, s_1, s_2) = \frac{\lambda(a, s_1)}{\lambda(a, s_2)}.\]

Explain why \(r(a, s_1, s_2)\) does not vary as a function of \(a\) in model1.
For \(s_1 =\) cigarPipeOnly and \(s_2 =\) cigaretteOnly, find an estimate of \(r(a, s_1, s_2)\) and an approximate 90 % confidence interval. Is there a significant difference in the expected number of lung cancer deaths for individuals that smoke cigarettes only (cigaretteOnly) versus those that smoke cigar/pipe only (cigarPipeOnly)?

c) We will now consider two alternative models, model2 and model3:

model2 <- glm(dead ~ smoke, family = "poisson", data = smoking, offset = log(pop))
model3 <- glm(dead ~ ageLevel + smoke, family = "poisson", data = smoking, offset = log(pop))

summary(model2)
## 
## Call:
## glm(formula = dead ~ smoke, family = "poisson", data = smoking, 
##     offset = log(pop))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -24.546   -5.892   -2.310    8.343   15.612  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.47319    0.03099 -47.532  < 2e-16 ***
## smokecigarretteOnly -0.31219    0.03576  -8.729  < 2e-16 ***
## smokecigarrettePlus -0.43013    0.03468 -12.402  < 2e-16 ***
## smokeno             -0.36678    0.04669  -7.855 3.98e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4056.0  on 35  degrees of freedom
## Residual deviance: 3910.7  on 32  degrees of freedom
## AIC: 4158.7
## 
## Number of Fisher Scoring iterations: 5
summary(model3)
## 
## Call:
## glm(formula = dead ~ ageLevel + smoke, family = "poisson", data = smoking, 
##     offset = log(pop))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0918  -1.1673  -0.2755   0.7803   2.6364  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.705950   0.050717 -73.071  < 2e-16 ***
## ageLevel             0.333006   0.005591  59.559  < 2e-16 ***
## smokecigarretteOnly  0.405019   0.037463  10.811  < 2e-16 ***
## smokecigarrettePlus  0.203426   0.035996   5.651 1.59e-08 ***
## smokeno             -0.032927   0.046894  -0.702    0.483    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4055.984  on 35  degrees of freedom
## Residual deviance:   75.734  on 31  degrees of freedom
## AIC: 325.76
## 
## Number of Fisher Scoring iterations: 4

Why do model2 and model3 have 32 and 31 degrees of freedom, respectively? Which model would you choose as the best? Justify your answer by formulating relevant hypotheses and performing the hypothesis tests, based on the print-outs above. Hint: remember that to compare two models, one model needs to be nested within the other.

New: For model3: Plot the regression line for the expected number of lung cancer deaths per person as a function of age for each of the four different smoking levels in the same plot. First use all coefficients, and then only the ones that are significant. Use ggplot! Discuss what you see.

Modelling continuous response data

Examples of continuous positive responses

  • Insurance: Claim sizes
  • Medicine: Time to blood coagulation (main example)
  • Biology: Time in various development stages for fruit fly
  • Meteorology: Amount of precipitation (interactive session - exam question 2012)

Models for continuous positive responses

  • Lognormal distribution on response
  • Gamma distribution on response
  • Inverse Gaussian distribution on response (we will not consider this here)

Example: Time to blood coagulation

This data is described in McCullagh and Nelder (1989, page 300). The data represent clotting times of blood (in seconds), y, for normal plasma diluted to nine different percentage concentrations u with so-called prothrombin-free plasma. To induce the clotting a chemical called thromboplastin was used, and in the experiment two different lots of the chemical were used - denoted lot. Our aim is to investigate the relationship between the clotting time and the dilution percentage, and to look at differences between the lots.

clot = read.table("https://www.math.ntnu.no/emner/TMA4315/2017h/clot.txt",header=T)
clot$lot = as.factor(clot$lot)
summary(clot)
##        u            time       lot  
##  Min.   :  5   Min.   : 12.0   1:9  
##  1st Qu.: 15   1st Qu.: 18.0   2:9  
##  Median : 30   Median : 23.0        
##  Mean   : 40   Mean   : 32.5        
##  3rd Qu.: 60   3rd Qu.: 35.0        
##  Max.   :100   Max.   :118.0

Lognormal distribution on response

Let \(Y_i\) be the response on the original scale, where \(Y_i>0\).

Transform the response to a logarithmic scale: \(Y^*_i=\ln(Y_i)\). Then, assume that the transformed responses follow a normal distribution (at least approximately) and use ordinary MLR. This means we have a GLM with normal response and identity link (on the logarithmic scale of the response).

  1. \(Y^*_i \sim N(\mu^{*}_i,\sigma^{*2})\)
  2. \(\eta_i={\bf x}_i^T\beta\)
  3. \(\mu^{*}_i=\eta_i\) (identity link)

There are two ways of looking at this,

  1. either this is just a transformation to achieve approximate normality, or
  2. we assume that the original data follows a lognormal distribution.

In genomics one usually assumes the former, and reports results back on the exponential scale - simply saying that the mean of the original data is \(\exp(\mu^{*}_i)\).

However, if one instead assumes that the original data really come from a lognormal distribution, then it can be shown that \[\text{E}(Y_i)=\exp(\mu^{*}_i)\cdot \exp(\sigma^{*2}/2)\] \[\text{Var}(Y_i) =(\exp(\sigma^{*2})-1)\cdot (\text{E}(Y_i))^2\] i.e. the standard deviation is proportional to the expectation. That is in general a put-off for using the lognormal model.
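A quick simulation check of these formulas (parameter values chosen arbitrarily):

mustar = 1
sigmastar = 0.3
set.seed(4)
y = rlnorm(1000000, mustar, sigmastar)
c(mean(y), exp(mustar) * exp(sigmastar^2/2)) # E(Y)
c(var(y), (exp(sigmastar^2) - 1) * (exp(mustar + sigmastar^2/2))^2) # Var(Y)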

orgmu=1
orgsd=0.3
library(ggplot2)
xrange=range(rlnorm(1000,orgmu,orgsd))
ggplot(data.frame(x=xrange),aes(xrange))+
   xlab(expression(x))+ 
    stat_function(fun=dlnorm, args=list(meanlog=orgmu,sdlog=orgsd), geom="line", colour="red")


Gamma regression

The gamma distribution

We have seen that a gamma distributed variable may arise as the waiting time between several events in a Poisson process. The well known \(\chi^2_{\delta}\)-distribution is a special case of the gamma distribution (\(\mu_i=\delta\), \(\nu=\frac{\delta}{2}\), i.e. scale \(\frac{\mu_i}{\nu}=2\)).

There are many parameterizations of the gamma distribution, but we will stick with the one used in our textbook (page 643):

\(Y_i \sim Ga(\mu_i,\nu)\) with density \[ f(y_i)=\frac{1}{\Gamma(\nu)} (\frac{\nu}{\mu_i})^{\nu} y_i^{\nu-1}\exp(-\frac{\nu}{\mu_i}y_i) \text{ for }y_i>0\]

mu=1
nu=0.3
library(ggplot2)
# in the textbook parameterization Ga(mu,nu): shape=nu, scale=mu/nu
xrange=range(rgamma(1000,shape=nu,scale=mu/nu))
ggplot(data.frame(x=xrange),aes(xrange))+
   xlab(expression(x))+ 
    stat_function(fun=dgamma, args=list(shape=nu,scale=mu/nu), geom="line", colour="red")


We found in Module 1 (see the exponential family part) that the gamma distribution is an exponential family, with

  • \(\theta_i=-\frac{1}{\mu_i}\) is the canonical parameter
  • \(\phi=\frac{1}{\nu}\),
  • \(w_i=1\)
  • \(b(\theta_i)=-\ln(-\theta_i)\)
  • \(\text{E}(Y_i)=b'(\theta_i)=-\frac{1}{\theta_i}=\mu_i\)
  • \(\text{Var}(Y_i)=b''(\theta_i)\frac{\phi}{w_i}=\frac{\mu_i^2}{\nu}\)

For a GLM we have a canonical link if \[\theta_i=\eta_i\] Since \(\eta_i=g(\mu_i)\) this means that we need \[\theta_i=g(\mu_i)=-\frac{1}{\mu_i}\] saying that the canonical link is \(-\frac{1}{\mu_i}\). However, the most commonly used link is \(g(\mu_i)=\ln(\mu_i)\), and the identity link is also used.

Q: Discuss the implications on \(\eta_i\) when using the canonical link. What about using log-link?

Remark: often the inverse and not the negative inverse is used, and since \[g(\mu_i)=-\frac{1}{\mu_i}={\bf x}_i^T\beta\] then \[\frac{1}{\mu_i}=-{\bf x}_i^T\beta={\bf x}_i^T\beta^*\] where \(\beta^*=-\beta\).


Gamma GLM model

  1. \(Y_i \sim Ga(\mu_i,\nu)\)
  2. \(\eta_i={\bf x}_i^T\beta\)

  3. Popular link functions:
  • \(\eta_i=\mu_i\) (identity link)
  • \(\eta_i=\frac{1}{\mu_i}\) (inverse link)
  • \(\eta_i=\ln(\mu_i)\) (log-link)

Remark: In our model the parameter \(\mu_i\) varies with \(i\) but \(\nu\) is the same for all observations.


Example: Time to blood coagulation

A simple model to start with is as follows (dosages often analysed on log scale):

fit1 = glm(time~lot+log(u),data=clot,family=Gamma(link=log))
summary(fit1)
## 
## Call:
## glm(formula = time ~ lot + log(u), family = Gamma(link = log), 
##     data = clot)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -0.17470  -0.11596  -0.04281   0.06919   0.27749  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.44660    0.13453   40.48  < 2e-16 ***
## lot2        -0.47034    0.07095   -6.63 8.02e-06 ***
## log(u)      -0.58476    0.03772  -15.50 1.22e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.02265072)
## 
##     Null deviance: 7.7087  on 17  degrees of freedom
## Residual deviance: 0.3211  on 15  degrees of freedom
## AIC: 104.28
## 
## Number of Fisher Scoring iterations: 5

Q: describe what you see in the print-out.

Likelihood and derivations thereof

Likelihood: \[L(\beta)=\prod_{i=1}^n \exp(-\frac{\nu y_i}{\mu_i}-\nu \ln \mu_i +\nu \ln \nu +(\nu-1)\ln y_i -\ln(\Gamma(\nu)))\]

Log-likelihood: \[l(\beta)=\sum_{i=1}^n [-\frac{\nu y_i}{\mu_i}-\nu \ln \mu_i +\nu \ln \nu +(\nu-1)\ln y_i -\ln(\Gamma(\nu))]\] Observe that we now - for the first time - have a nuisance parameter \(\nu\) here.

To produce numerical estimates for the parameter of interest \(\beta\) we may proceed to the score function, and solve \(s(\hat{\beta})=0\) using Newton-Raphson or Fisher scoring. If we do not have the canonical link, the observed and expected Fisher information matrices may not be equal.

What about \(\phi\)? It must also be estimated, and this can be done by maximum likelihood. (Note that R's summary.glm by default reports a moment-based Pearson estimate of the dispersion, not the ML estimate.)
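
A sketch for fit1 from the coagulation example (gamma.dispersion and gamma.shape are from the MASS package):

summary(fit1)$dispersion   # moment (Pearson) estimate of phi
library(MASS)
gamma.dispersion(fit1)     # maximum likelihood estimate of phi=1/nu
gamma.shape(fit1)          # ML estimate of nu, with standard error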

Scaled and unscaled deviance

We have defined the deviance as \[D=-2(\ln L(\text{candidate model})-\ln L(\text{saturated model}))\]

This is often called the scaled deviance.

The unscaled deviance is then defined as \(\phi D\), but is sadly sometimes also called the deviance - for example by R.

  1. For the normal model the
  • scaled deviance is \(D=\frac{1}{\sigma^2}\sum_{i=1}^n (y_i-\mu_i)^2\), while the
  • unscaled deviance is \(\phi D=\sum_{i=1}^n (y_i-\mu_i)^2\)
  2. For the binomial and Poisson models \(\phi=1\), so the scaled and unscaled deviance are equal.

  3. What about the Gamma model?

\(\oplus\) some calculations - see IL week 2, problem 1b.

The scaled deviance is \[D=-\frac{2}{\phi} \sum_{i=1}^n [\ln (\frac{y_i}{\hat{\mu}_i})-\frac{y_i-\hat{\mu}_i}{\hat{\mu}_i}]\] and the unscaled deviance is \(\phi D=-2 \sum_{i=1}^n [\ln (\frac{y_i}{\hat{\mu}_i})-\frac{y_i-\hat{\mu}_i}{\hat{\mu}_i}]\).

Compare to print-out from R:

deviance(fit1)
## [1] 0.3210963
nu1=1/summary(fit1)$dispersion
nu1
## [1] 44.14871
D=-2*nu1*sum(log(fit1$y/fit1$fitted.values)-((fit1$y-fit1$fitted.values)/fit1$fitted.values))
D
## [1] 14.17599
deviance(fit1)*nu1
## [1] 14.17599

So, the deviance in R is the unscaled deviance.

Comparing models

Comparing models based on deviance

fit2=glm(time~lot+log(u)+lot:log(u),data=clot,family=Gamma(link=log))
anova(fit1,fit2)
## Analysis of Deviance Table
## 
## Model 1: time ~ lot + log(u)
## Model 2: time ~ lot + log(u) + lot:log(u)
##   Resid. Df Resid. Dev Df  Deviance
## 1        15    0.32110             
## 2        14    0.31576  1 0.0053352

The deviance table does not include \(\phi\), so the unscaled deviance is reported. If significance testing is done, the estimated \(\phi\) from the largest model is used, and \(p\)-values are based on the scaled deviance.

anova(fit1,fit2,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: time ~ lot + log(u)
## Model 2: time ~ lot + log(u) + lot:log(u)
##   Resid. Df Resid. Dev Df  Deviance Pr(>Chi)
## 1        15    0.32110                      
## 2        14    0.31576  1 0.0053352   0.6355
1-pchisq((deviance(fit1)-deviance(fit2))/summary(fit2)$dispersion,fit1$df.residual-fit2$df.residual)
## [1] 0.6355477
anova(fit1,fit2,test="F")
## Analysis of Deviance Table
## 
## Model 1: time ~ lot + log(u)
## Model 2: time ~ lot + log(u) + lot:log(u)
##   Resid. Df Resid. Dev Df  Deviance      F Pr(>F)
## 1        15    0.32110                           
## 2        14    0.31576  1 0.0053352 0.2246 0.6429
1-pf((deviance(fit1)-deviance(fit2))/summary(fit2)$dispersion,fit1$df.residual-fit2$df.residual,fit2$df.residual)
## [1] 0.642854

Comparing models based on AIC

AIC(fit1,fit2)
##      df      AIC
## fit1  4 104.2763
## fit2  5 105.9738

Q: would you prefer fit1 or fit2?

AIC can also be used when we compare models with different link functions (models that are not nested).

The literature suggests plotting \(y_i\) against each covariate to get a hint about which link function could be appropriate:

  • Identity: Plot of \(y_i\) vs \(x_i\) should be close to linear
  • \(\ln:\) Plot of \(\ln(y_i)\) vs \(x_i\) should be close to linear
  • Inverse (reciprocal): Plot of \(1/y_i\) vs \(x_i\) should be close to linear
library(ggplot2)
library(ggpubr)
y=clot$time
x=clot$u   # the covariate enters the models through log(u)

df=data.frame(y=y,x=x)
gg1=ggplot(df)+geom_point(aes(x=log(x),y=y))          # identity link
gg2=ggplot(df)+geom_point(aes(x=log(x),y=log(y)))     # log-link
gg3=ggplot(df)+geom_point(aes(x=log(x),y=1/y))        # inverse link
gg4=ggplot(df)+geom_point(aes(x=sqrt(1/x),y=log(y)))  # log-link with sqrt(1/u) as covariate
ggarrange(gg1,gg2,gg3,gg4)

fit4=glm(time~lot+sqrt(1/u),data=clot,family=Gamma(link=log))
AIC(fit1,fit4)
##      df       AIC
## fit1  4 104.27633
## fit4  4  45.01688
# deviance residuals vs fitted values for fit1 and fit4
df1 = data.frame(fitted = fit1$fitted.values, dres = residuals(fit1, type = "deviance"))
gg1 = ggplot(df1) + geom_point(aes(x = fitted, y = dres))
df4 = data.frame(fitted = fit4$fitted.values, dres = residuals(fit4, type = "deviance"))
gg4 = ggplot(df4) + geom_point(aes(x = fitted, y = dres))
ggarrange(gg1, gg4)

Analysing categorical data – in contingency tables

Lecture slides


Breast cancer example

# 2x2 table: breast cancer status vs age at first birth
ds=matrix(c(683,2537,1498,8747),2,2,byrow=TRUE)
colnames(ds)=c("Age>=30","Age<30")
rownames(ds)=c("Case","Control")
res=chisq.test(ds,correct=FALSE)   # Pearson chi-squared test of independence
res$observed
round(res$expected,0)
(res$observed-res$expected)^2/res$expected   # cellwise contributions to X-squared
res$residuals^2                              # the same, via the Pearson residuals
res$statistic
res$p.value
fit=glm(t(ds)~1,family=binomial)   # binomial null model for the two age groups
contrib <- 100*res$residuals^2/res$statistic
library(corrplot)
corrplot(contrib, is.cor = FALSE)  # visualise the cellwise contributions

library(statmod)
glm.scoretest(fit,c(0,1))^2   # squared score test statistic, compare to X-squared above
##         Age>=30 Age<30
## Case        683   2537
## Control    1498   8747
##         Age>=30 Age<30
## Case        522   2698
## Control    1659   8586
##          Age>=30   Age<30
## Case    49.97022 9.658371
## Control 15.70562 3.035623
##          Age>=30   Age<30
## Case    49.97022 9.658371
## Control 15.70562 3.035623
## X-squared 
##  78.36984 
## [1] 8.544684e-19
## [1] 78.37033
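
Remark: the independence hypothesis can equivalently be tested with a Poisson log-linear model, where the residual deviance of the main effects model is the likelihood ratio statistic \(G^2\). A sketch (for this table \(G^2\) is close to, but not equal to, the Pearson \(X^2\) above):

dslong=as.data.frame(as.table(ds))   # columns Var1 (row), Var2 (column), Freq
fitind=glm(Freq~Var1+Var2,family=poisson,data=dslong)
deviance(fitind)                     # G^2, compare to X-squared above
1-pchisq(deviance(fitind),df=fitind$df.residual)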

Football data from the Lee (1997) article

see Lee (1997) - to be used in the work with Compulsory 2.

# home and away goals in Lee 1997
ds=matrix(c(27,29,10,8,2,59,53,14,12,4,28,32,14,12,4,19,14,7,4,1,7,8,10,2,0),ncol=5,nrow=5,byrow=TRUE)
colnames(ds)=c("H=0","H=1","H=2","H=3","H=4+")
rownames(ds)=c("A=0","A=1","A=2","A=3","A=4+")
coltots=apply(ds,2,sum)
rowtots=apply(ds,1,sum)
cbind(rbind(ds,coltots),c(rowtots,sum(rowtots)))
res=chisq.test(ds,correct=FALSE)
round(res$expected,0)
(res$observed-res$expected)^2/res$expected
res$residuals^2
res$statistic
res$p.value
res
contrib <- 100*res$residuals^2/res$statistic
library(corrplot)
corrplot(contrib, is.cor = FALSE)

##         H=0 H=1 H=2 H=3 H=4+    
## A=0      27  29  10   8    2  76
## A=1      59  53  14  12    4 142
## A=2      28  32  14  12    4  90
## A=3      19  14   7   4    1  45
## A=4+      7   8  10   2    0  27
## coltots 140 136  55  38   11 380
##      H=0 H=1 H=2 H=3 H=4+
## A=0   28  27  11   8    2
## A=1   52  51  21  14    4
## A=2   33  32  13   9    3
## A=3   17  16   7   4    1
## A=4+  10  10   4   3    1
##             H=0         H=1        H=2        H=3        H=4+
## A=0  0.03571429 0.119117647 0.09090909 0.02105263 0.018181818
## A=1  0.85401885 0.093422143 2.08912326 0.34084507 0.002971898
## A=2  0.80233918 0.001375989 0.07278044 1.00000000 0.746677299
## A=3  0.35355054 0.275197798 0.03639022 0.05555556 0.070308347
## A=4+ 0.87329435 0.286251577 9.49712033 0.18148148 0.781578947
##             H=0         H=1        H=2        H=3        H=4+
## A=0  0.03571429 0.119117647 0.09090909 0.02105263 0.018181818
## A=1  0.85401885 0.093422143 2.08912326 0.34084507 0.002971898
## A=2  0.80233918 0.001375989 0.07278044 1.00000000 0.746677299
## A=3  0.35355054 0.275197798 0.03639022 0.05555556 0.070308347
## A=4+ 0.87329435 0.286251577 9.49712033 0.18148148 0.781578947
## X-squared 
##  18.69926 
## [1] 0.2845667
## 
##  Pearson's Chi-squared test
## 
## data:  ds
## X-squared = 18.699, df = 16, p-value = 0.2846

Q: what do you see, and what can you conclude?


Fisher’s exact test

Salt and CVD example

ds=matrix(c(2,23,5,30),ncol=2,nrow=2,byrow=TRUE)
chisq.test(ds,correct=FALSE)
fisher.test(ds)
## 
##  Pearson's Chi-squared test
## 
## data:  ds
## X-squared = 0.55911, df = 1, p-value = 0.4546
## 
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ds
## p-value = 0.6882
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.04625243 3.58478157
## sample estimates:
## odds ratio 
##   0.527113

Lady tasting tea example

ds1=matrix(c(4,0,0,4),2,2)
fisher.test(ds1,alternative="greater")
ds2=matrix(c(3,1,0,4),2,2)
fisher.test(ds2,alternative="greater")
ds3=matrix(c(4,0,1,3),2,2)
fisher.test(ds3,alternative="greater")
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ds1
## p-value = 0.01429
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  2.003768      Inf
## sample estimates:
## odds ratio 
##        Inf 
## 
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ds2
## p-value = 0.07143
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.7957678       Inf
## sample estimates:
## odds ratio 
##        Inf 
## 
## 
##  Fisher's Exact Test for Count Data
## 
## data:  ds3
## p-value = 0.07143
## alternative hypothesis: true odds ratio is greater than 1
## 95 percent confidence interval:
##  0.7957678       Inf
## sample estimates:
## odds ratio 
##        Inf

Interactive session - second week

This week we again try out the same structure as in week 1:

  • 12.15: waffles and fruit - and short intro to Problem 1,
  • work with Problem 1 in 12.15-12.50,
  • and then summarize findings in plenum (groups may volunteer to answer) 12.50-13.00.
  • Then 15 minutes break.
  • Then work with Problem 2 in 13.15-13.50, and
  • then summarize the findings in plenum 13.50-14.00.
  • Go to the additional exam questions if you need more work!

This means that we postpone the team Kahoot! until Module 5 (next week).


Problem 1: TMA4315 Exam 2012, Problem 3: Precipitation in Trondheim, amount

Remark: the text is slightly modified from the original exam since we parameterized the gamma as in our textbook.

Remark: Problems 1 (binomial) and 2 (multinomial - not on our reading list) at the 2012 exam also asked about precipitation.

We want to model the amount of daily precipitation given that there is precipitation, and denote this quantity \(Y\). It is common to model \(Y\) as a gamma distributed random variable, \(Y \sim Ga(\mu,\nu)\), with density

\[ f_Y(y) = \frac{1}{\Gamma(\nu)} \left(\frac{\nu}{\mu}\right)^{\nu} y^{\nu-1}\exp\left(-\frac{\nu}{\mu} y \right) \]

In this problem we consider \(N\) observations, each gamma distributed with \(Y_i \sim Ga(\mu_i,\nu)\) (remark: common \(\nu\)). Here \(\nu\) is considered to be a known nuisance parameter, and the \(\mu_i\)s are unknown.

a) Show that the gamma distribution is a member of the exponential family when \(\mu_i\) is the parameter of interest.
Use this to find expressions for the expected value and the variance of \(Y_i\), in terms of \((\nu,\mu_i)\), and interpret \(\nu\).
New: What is the canonical link for the Gamma distribution?

b) Explain what a saturated model is.
Set up the log-likelihood function expressed by \(\mu_i\), and use it to find the maximum likelihood estimators for \(\mu_i\)-s of the saturated model.
Find the deviance (based on all \(N\) observations).

c) We now want to construct a model for the amount of precipitation (given that precipitation occurs) with the precipitation forecast as explanatory variable.
Let \(Y_i\) be amount of precipitation for day \(i\), and let \(x_i\) be the precipitation forecast valid for day \(i\). Set up a GLM for this, and argue for your choice of link function and linear predictor.
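
One possible choice is sketched below (hypothetical data frame precip with columns amount and forecast; the log-link guarantees a positive mean) - you should still argue for your own choice:

fitp=glm(amount~forecast,data=precip,family=Gamma(link="log"))
summary(fitp)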


Problem 2: Taken from UiO, STK3100, 2015, problem 2

(For reference: here is the original exam).

Do not look at the dataset before the end of the exercise! You should solve the exercise without using R, just as if you were at an exam.

In this problem you shall consider data of survivals from a study of treatment for breast cancer. The response is the number that survived for three years. The covariates were the four factors

  • app: appearance of tumor, two levels, 1 = malignant, 2 = benign
  • infl: inflammatory reaction, two levels, 1 = minimal, 2 = moderate or severe
  • age: age of patients, three levels, 1 = under 50, 2 = 50 to 69, 3 = 70 or older
  • country: hospital of treatment, three levels, 1 = Japan, 2 = US, 3 = UK

The dataset we have used differs slightly from the one they used at UiO (the number of survivals and non-survivals differ).

The number of survivors is modelled as a binomially distributed variable using a canonical logit link. Dummy variable coding is used for all factors, with the level coded as “1” as the reference category.

a) The output from fitting the model where only appearance and country are used as covariates, i.e., a model with linear predictor of the form

\[ \eta = \beta_0 + \beta_1 \texttt{ fapp} + \beta_2 \texttt{ fcountry2} + \beta_3 \texttt{ fcountry3} \]

is displayed below. What is the interpretation of the estimate of the coefficient of appearance, fapp (f means factor)? Explain also how this coefficient can be expressed in terms of an odds ratio.

Hint: first explain what the predicted odds of survival are for country \(j\) with a benign tumor, then the same with a malignant tumor - and then form the odds ratio.

New: The main difference between our dataset and UiO’s dataset is the number of degrees of freedom for the null model; we have 34, they have 35. How many observations do we have in the dataset? And how many does UiO have? Hint: All the covariates in the dataset are categorical, two with 3 levels, and two with 2 levels. How many possible combinations of observations does that make?

Call:
glm(formula = cbind(surv, nsurv) ~ fapp + fcountry, family = binomial, 
    data = brc)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8142  -0.7279   0.2147   0.7576   1.8715  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0803     0.1656   6.522 6.93e-11 ***
fapp2         0.5157     0.1662   3.103 0.001913 ** 
fcountry2    -0.6589     0.1998  -3.298 0.000972 ***
fcountry3    -0.4945     0.2071  -2.388 0.016946 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57.588  on 34  degrees of freedom
Residual deviance: 36.625  on 31  degrees of freedom

b) The output below is an analysis of deviance for comparing various model specifications. Fill out the positions indicated i-iv.
Note: y ~ x1*x2 gives both the linear components and the interaction, i.e., it is the same as y ~ x1 + x2 + x1:x2 (x1:x2 gives interaction only).

Remark: remember that anova gives the sequential comparison of deviances - that is - comparing each model to the previous model.

Analysis of Deviance Table

Model 1: cbind(surv, nsurv) ~ fapp + fage + fcountry
Model 2: cbind(surv, nsurv) ~ fapp + fage + finfl + fcountry
Model 3: cbind(surv, nsurv) ~ fapp + finfl + fage * fcountry
Model 4: cbind(surv, nsurv) ~ fapp * finfl + fage * fcountry
Model 5: cbind(surv, nsurv) ~ fapp * finfl + fapp * fage + fage * fcountry
Model 6: cbind(surv, nsurv) ~ fapp * finfl * fage * fcountry
  Resid. Df Resid. Dev Df Deviance
1        29     33.102            
2         i     33.095  1   0.0065
3        24     25.674 ii   7.4210
4        23     25.504  1      iii
5        21     22.021  2   3.4823
6         0      0.000 iv  22.0214

In the remaining parts of this problem we return to the model in part a) and consider the hypothesis

\[ H_0 : \beta_2 + \beta_3 = -1 \text{ versus } H_1 : \beta_2 + \beta_3 \neq -1 \]

Note: what are you testing now?

c) The estimated covariance matrix between the estimators of the coefficients \(\beta_2\) and \(\beta_3\) above is \(\left(\begin{matrix} 0.040 & 0.021 \\ 0.021 & 0.043\end{matrix} \right)\). Use a Wald test to test the null hypothesis above.

Note: remember - this was a written exam so you do this by hand!
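
If you want to check your hand calculation afterwards, here is a sketch in R (the estimates are read off the print-out in a)):

b=c(-0.6589,-0.4945)   # estimates of beta2 and beta3 from a)
V=matrix(c(0.040,0.021,0.021,0.043),2,2)   # their estimated covariance matrix
a=c(1,1)   # H0: t(a)%*%beta = -1
w=(sum(a*b)+1)^2/drop(t(a)%*%V%*%a)   # Wald statistic, approx chisq(1) under H0
w
1-pchisq(w,df=1)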

d) Explain how the null hypothesis can be tested with a likelihood ratio test by fitting two suitable models. No numerical calculations are necessary, but it must be specified how the predictors should be defined.
Remark: rather technical - and hint: offset term?

New: Do this test in R. Here is the data set!
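
A sketch of the two fits (assuming the data set is loaded as a data frame brc with factors fapp and fcountry coded as in a)); under \(H_0\) we substitute \(\beta_3=-1-\beta_2\), so the known part \(-x_3\) becomes an offset:

# unconstrained model, as in a)
fitA=glm(cbind(surv,nsurv)~fapp+fcountry,family=binomial,data=brc)
# constrained model under H0: beta3 = -1 - beta2, so that
# eta = beta0 + beta1*fapp2 + beta2*(x2-x3) - x3, where xj = I(fcountry=j)
x2=as.numeric(brc$fcountry=="2")
x3=as.numeric(brc$fcountry=="3")
fit0=glm(cbind(surv,nsurv)~fapp+I(x2-x3)+offset(-x3),family=binomial,data=brc)
anova(fit0,fitA,test="Chisq")   # likelihood ratio test with 1 df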

Work on your own: Exam questions

December 2013 (Essay exam)

We will consider the following Poisson regression \[Y_i \sim \text{Poisson}(\exp(\eta_i)), \text{ }i=1,\ldots,n\] where the linear predictor is \(\eta_i={\bf x}_i^T\beta\). Here \({\bf x}_i\) is a vector of the \(p\) covariates for the \(i\)th observation \(Y_i\), and \(\beta\) is an unknown \(p\)-dimensional column vector of regression coefficients.

Write an introduction to Poisson regression and its practical usage, for a student with a good background in statistics, but no knowledge about Generalized Linear Models (GLM). Topics you may want to consider, are

  • When to use it? Underlying assumptions.
  • Parameter estimation, limiting results for the MLE, Fisher information and observed Fisher information, confidence intervals and hypothesis testing.
  • Output analysis, residual plots and interpretation of results.
  • Deviance and its usage.
  • What do we do when a covariate is a factor, and how should the results then be interpreted?
  • The use of Poisson regression in the analysis of contingency tables.

Further reading

  • Agresti (1996): “An Introduction to Categorical Data Analysis”,