TMA4315 Generalized linear models H2017

Module 5: Generalized linear models - common core

Mette Langaas, Department of Mathematical Sciences, NTNU - with contributions from Ingeborg Hem

16.10.2017 [PL=plenary lecture in F3], 20.10.2017 [IL=interactive lecture in Smia] (Version 18.10.2017)

Plan for this module

Textbook: Fahrmeir et al (2013): Chapter 5.4, 5.8.2.

  • random component: exponential family
    • elements: \(\theta\), \(\phi\), \(w\), \(b(\theta)\)
    • elements for normal, binomial, Poisson and gamma
    • properties: \(\text{E}(Y)=b'(\theta)\) and \(\text{Var}(Y)=b''(\theta)\frac{\phi}{w}\) (and proof)
  • systematic component= linear predictor

  • link function and response function
    • link examples for normal, binomial, Poisson and gamma
    • requirements: one-to-one and twice differentiable
    • canonical link

  • likelihood inference set-up: \(\theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \mathbf{\beta}\)
  • likelihood equations for the GLM (score function)
  • expected Fisher information matrix for the GLM and covariance for \(\hat{\beta}\)
    • what about covariance of \(\hat{\beta}\) when \(\phi\) needs to be estimated?
    • estimator for dispersion parameter
  • Fisher scoring and iterated reweighted least squares (IRWLS)
  • Pearson and deviance statistic
  • AIC

– so, for the first time: no practical examples or data sets to be analysed!

Jump to interactive.

Random component - exponential family

In Module 1 we introduced distributions of \(Y_i\) that can be written in the form of a univariate exponential family \[ f(y_i\mid \theta_i)=\exp \left( \frac{y_i \theta_i-b(\theta_i)}{\phi}\cdot w_i + c(y_i, \phi, w_i) \right) \] where we said that

  • \(\theta_i\) is called the canonical parameter and is a parameter of interest

  • \(\phi\) is called a nuisance parameter (it is not of interest to us, and is therefore a nuisance)

  • \(w_i\) is a weight function, in most cases \(w_i=1\)

  • \(b\) and \(c\) are known functions.


Elements - for normal, binomial, Poisson and gamma

We have seen:

Distribution | \(\theta\) | \(b(\theta)\) | \(\phi\) | \(w\)
-------------|------------|---------------|----------|------
normal | \(\mu\) | \(\frac{1}{2} \theta^2\) | \(\sigma^2\) | \(1\)
binomial | \(\ln \left( \frac{p}{1-p} \right)\) | \(n \ln (1+\exp(\theta))\) | \(1\) | \(1\)
Poisson | \(\ln \mu\) | \(\exp(\theta)\) | \(1\) | \(1\)
gamma | \(-\frac{1}{\mu}\) | \(-\ln (-\theta)\) | \(\frac{1}{\nu}\) | \(1\)
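
As a worked check of the Poisson row: writing the Poisson point probability function on the exponential family form gives \[ f(y_i\mid \mu_i)=\frac{\mu_i^{y_i}}{y_i!}e^{-\mu_i}=\exp\left( y_i \ln \mu_i - \mu_i - \ln(y_i!) \right) \] so that \(\theta_i=\ln \mu_i\), \(b(\theta_i)=\exp(\theta_i)\), \(\phi=1\), \(w_i=1\) and \(c(y_i,\phi,w_i)=-\ln(y_i!)\), in agreement with the table.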

Properties

\(\text{E}(Y_i)=b'(\theta_i)\) and \(\text{Var}(Y_i)=b''(\theta_i)\frac{\phi}{w_i}\)

\(\oplus\): Proof (classnotes 16.10.2017)

\(b''(\theta_i)\) is often called the variance function \(v(\mu_i)\).
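
For example, for the gamma distribution with \(\theta_i=-\frac{1}{\mu_i}\) and \(b(\theta_i)=-\ln(-\theta_i)\) we get \[ \text{E}(Y_i)=b'(\theta_i)=-\frac{1}{\theta_i}=\mu_i \quad \text{and} \quad \text{Var}(Y_i)=b''(\theta_i)\frac{\phi}{w_i}=\frac{1}{\theta_i^2}\cdot \frac{1}{\nu}=\frac{\mu_i^2}{\nu}, \] so the variance function is \(v(\mu_i)=\mu_i^2\).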


Systematic component - linear predictor

Nothing new - as always in this course: \(\eta_i={\bf x}_i^T \mathbf{\beta}\)

Likelihood inference set-up:

Classnotes on the likelihood inference (16.10.2017)

\(\theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \mathbf{\beta}\)

Likelihood equations for the GLM (score function)

(see proof in classnotes for likelihood inference)

\[s(\beta)=\sum_{i=1}^n \frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)}={\bf X}^T {\bf D} \Sigma^{-1}({\bf y}-{\mathbf \mu})\] where \(\Sigma=\text{diag}(\text{Var}(Y_i))\) and \({\bf D}=\text{diag}(h'(\eta_i))\) (derivative wrt \(\eta_i\)).
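
As a quick numerical check (a minimal sketch with simulated Poisson data and the canonical log link, for which \(h'(\eta_i)=\text{Var}(Y_i)=\mu_i\) so the score reduces to \({\bf X}^T({\bf y}-{\mathbf \mu})\)), the score evaluated at the MLE returned by glm should be numerically zero:

set.seed(1)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))  # simulated responses (hypothetical example)
fit <- glm(y ~ x, family = poisson)
X <- model.matrix(fit)
t(X) %*% (y - fitted(fit))                  # score at the MLE - numerically zero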

Expected Fisher information matrix for the GLM and covariance for \(\hat{\beta}\)

(see proof in classnotes for likelihood inference)

\[F_{[h,l]}(\beta)=\sum_{i=1}^n \frac{x_{ih}x_{il}(h'(\eta_i))^2}{\text{Var}(Y_i)}\] \[F(\beta)={\bf X}^T {\bf W} {\bf X}\] where \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).

Fisher scoring and iterated reweighted least squares (IRWLS)

Classnotes on IRWLS (16.10.2017)

\[\beta^{(t+1)}=\beta^{(t)} + F(\beta^{(t)})^{-1} s(\beta^{(t)})\] Inserting the formulas for the expected Fisher information and the score function gives

\[\beta^{(t+1)}=({\bf X}^T {\bf W}(\beta^{(t)}) {\bf X})^{-1} {\bf X}^T {\bf W}(\beta^{(t)})\tilde{\bf y}^{(t)}\] where \(\tilde{\bf y}^{(t)}={\bf X}\beta^{(t)}+{\bf D}^{-1}({\bf y}-{\mathbf \mu})\) is the working response vector, with \({\bf D}\) and \({\mathbf \mu}\) evaluated at \(\beta^{(t)}\).
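
A minimal sketch of the algorithm in R, assuming simulated Poisson data with the log link (so \(h(\eta_i)=\exp(\eta_i)\), \(h'(\eta_i)=\mu_i\) and \(\text{Var}(Y_i)=\mu_i\)); the result should match coef from glm:

set.seed(123)
n <- 100
x <- rnorm(n)
X <- cbind(1, x)                            # design matrix
y <- rpois(n, lambda = exp(0.2 + 0.5 * x))  # simulated responses (hypothetical example)
beta <- c(0, 0)                             # starting value
for (t in 1:25) {
  eta <- drop(X %*% beta)
  mu <- exp(eta)                            # h(eta) for the log link
  W <- diag(mu)                             # working weights (h'(eta))^2 / Var(Y_i) = mu_i
  ytilde <- eta + (y - mu) / mu             # working response
  beta <- solve(t(X) %*% W %*% X, t(X) %*% W %*% ytilde)
}
cbind(IRWLS = drop(beta), glm = coef(glm(y ~ x, family = poisson)))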

Distribution of the MLE

As before, the maximum likelihood estimator \(\hat{\beta}\) asymptotically follows a multivariate normal distribution with mean \(\beta\) and covariance matrix equal to the inverse of the expected Fisher information matrix. This also holds when the unknown \(\beta\) is replaced by the estimate \(\hat{\beta}\) in the expected Fisher information matrix.

\[\hat{\beta} \approx N_p(\beta,F^{-1}(\hat{\beta}))\] with \[F(\hat{\beta})={\bf X}^T \hat{{\bf W}} {\bf X}\] where \(\hat{{\bf W}}\) denotes that \(\hat{\beta}\) is used when calculating \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).
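
A small numerical check (a sketch using the grouped beetle data analysed below, assuming the investr package is installed): for the binomial \(\phi=1\), so \(({\bf X}^T \hat{{\bf W}} {\bf X})^{-1}\) built from the working weights should agree with vcov from glm.

fit <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
X <- model.matrix(fit)
W_hat <- diag(fit$weights)       # estimated working weights
solve(t(X) %*% W_hat %*% X)      # F(beta-hat)^{-1}
vcov(fit)                        # should agree, since phi = 1 for the binomial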

What about covariance of \(\hat{\beta}\) when \(\phi\) needs to be estimated?

(see classnotes for likelihood inference - not lectured: cool property of orthogonal parameters)

Estimator for dispersion parameter

(see classnotes for likelihood inference - not lectured: similar to what we saw for the normal model)


See what is output from glm - observe that the working weights are stored as weights.

fitgrouped <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
names(fitgrouped)
##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"
?glm
fitgrouped$weights
##         1         2         3         4         5         6         7 
##  3.254867  8.227383 14.321313 13.378893 10.261055  5.156671  2.653398 
##         8 
##  1.230713
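
The working weights printed above can be reproduced from \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\): for grouped binomial data with the logit link this reduces to \(n_i \hat{\pi}_i(1-\hat{\pi}_i)\). A quick check (assuming the fit above):

with(investr::beetle, n * fitted(fitgrouped) * (1 - fitted(fitgrouped)))  # reproduces fitgrouped$weights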

Hypothesis testing (unchanged)

We will look at null and alternative hypotheses that can be written \[H_0: {\bf C}{\bf \beta}={\bf d} \text{ vs. } H_1: {\bf C}{\bf \beta}\neq {\bf d} \] by specifying \({\bf C}\) to be an \(r \times p\) matrix and \({\bf d}\) to be a column vector of length \(r\).

The Wald test

–here we insert our formula for the expected Fisher information matrix for the exponential family.

The Wald test statistic is given as: \[w=({\bf C}\hat{{\bf \beta}}-{\bf d})^{\text T}[{\bf C}F^{-1}(\hat{\beta}){\bf C}^{\text T}]^{-1}({\bf C}\hat{{\bf \beta}}-{\bf d}) \] and measures the distance between the estimate \({\bf C}\hat{\beta}\) and the value under the null hypothesis \({\bf d}\), weighted by the asymptotic covariance matrix of \({\bf C}\hat{\beta}\), and under the null follows a \(\chi^2\) distribution with \(r\) degrees of freedom (where \(r\) is the number of hypotheses tested). \(P\)-values are calculated in the upper tail of the \(\chi^2\)-distribution.
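
A minimal sketch in R, using the grouped beetle fit and testing the single hypothesis \(H_0: \beta_{\text{ldose}}=0\) (so \({\bf C}=(0 \; 1)\), \({\bf d}=0\) and \(r=1\)):

fit <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
C <- matrix(c(0, 1), nrow = 1)   # C beta = d with d = 0
d <- 0
w <- t(C %*% coef(fit) - d) %*% solve(C %*% vcov(fit) %*% t(C)) %*% (C %*% coef(fit) - d)
w                                # Wald statistic
1 - pchisq(w, df = nrow(C))      # p-value from the upper tail of the chi-square with r = 1 df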


The likelihood ratio test

–here we insert our formulas for the log-likelihood for the exponential family.

Notation:

  • A: the larger model and
  • B: the smaller model (under \(H_0\)), and the smaller model is nested within the larger model (that is, B is a submodel of A).

The likelihood ratio statistic is defined as \[- 2\ln \lambda=-2(\ln L(\hat{\beta}_B)-\ln L(\hat{\beta}_A)) \] which under the null is asymptotically \(\chi^2\)-distributed with degrees of freedom equal the difference in the number of parameters in the large and the small model. Again, \(p\)-values are calculated in the upper tail of the \(\chi^2\)-distribution.
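
A minimal sketch with the grouped beetle data, where model A includes ldose and model B is the intercept-only submodel (a difference of 1 parameter):

fitA <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
fitB <- update(fitA, . ~ 1)                                 # smaller model under H0
-2 * (as.numeric(logLik(fitB)) - as.numeric(logLik(fitA)))  # LRT statistic, compare to chi-square with 1 df
anova(fitB, fitA, test = "Chisq")                           # the same test via the analysis of deviance table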

Pearson and deviance statistic

Group observations together in groups of maximal size (for example by covariate patterns, or by intervals of the covariates). Group \(i\) has \(n_i\) observations, and there are \(G\) groups. The asymptotic \(\chi^2\)-distribution is only correct if all groups have large \(n_i\); for individual (ungrouped) data the asymptotic results can not be trusted.

Deviance \[D=-2\left[\sum_{i=1}^{G}(l_i(\hat{\mu}_i)-l_i(\bar{y}_i))\right]\] with approximate \(\chi^2\)-distribution with \(G-p\) degrees of freedom.

Pearson: \[ X_P^2=\sum_{i=1}^G \frac{(y_i-\hat{\mu}_i)^2}{v(\hat{\mu}_i)/w_i}\] with approximate \(\phi \cdot \chi^2\)-distribution with \(G-p\) degrees of freedom.
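
Both statistics are easy to extract in R for the grouped beetle fit (a sketch; here \(G=8\) groups and \(p=2\), so \(G-p=6\) degrees of freedom):

fit <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
deviance(fit)                                   # deviance D
sum(residuals(fit, type = "pearson")^2)         # Pearson statistic X_P^2
fit$df.residual                                 # G - p degrees of freedom
1 - pchisq(deviance(fit), fit$df.residual)      # upper-tail goodness-of-fit p-value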

AIC

Let \(p\) be the number of regression parameters in our model. \[\text{AIC} =-2 \cdot l(\hat{\beta})+2p\] If the dispersion parameter is estimated use \((p+1)\) in place of \(p\).
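
A quick check against R (a sketch using the grouped beetle fit; the binomial dispersion is fixed, so \(p=2\) here):

fit <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
AIC(fit)
-2 * as.numeric(logLik(fit)) + 2 * length(coef(fit))  # the same value computed by hand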

Interactive session

Plan: work in randomly assigned groups.

  • 12.15: short intro to Problem 1 and 2.
  • 12.15-12.55: work with Problem 1, continue on 2 if you finish 1,
  • 12.55-13.00: summarize findings.
  • Then 15 minutes break.
  • 13.15: short intro to Problem 3.
  • 13.15-13.40: work with Problem 3, have 2 as back-up if you finish 3.
  • 13.40-13:45: summarize findings,
  • 13:45-14:00 team Kahoot!

Exercise 1: Exam 2011, problem 3

a) Define the class of generalized linear models (GLMs), and explicitly list all requirements for each part of the model.

b) Below are three likelihoods, three link functions, and three linear components listed. Explain which combinations give valid GLMs (8 in total), and also comment on these models (you do not have to mathematically prove which are valid).

Likelihoods:

  1. Gaussian, \(Y \sim N(\mu, \sigma^2)\)
  2. Binomial, \(Y \sim Bin(n, \pi)\), where \(n\) is not fixed (hence is unknown and must be estimated)
  3. Poisson, \(Y \sim Poi(\lambda)\)

Link functions:

  1. \(\eta = \cos(\mu)\)
  2. \(\eta = \mu\)
  3. \(\eta = \log(\mu)\)

Linear components:

  1. \(\eta = \beta_0 + \beta_1x_1 + \beta_2x_2\)
  2. \(\eta = \beta_0 + \beta_1x_1 + \beta_2x_1^2\)
  3. \(\eta = \beta_0 + \beta_1x_1 + \beta_1^2x_2\)

Exercise 2: Exam 2006, problem 2 (a, b, d)

Let \(Y_1,Y_2,\ldots,Y_N\) be independent and exponentially distributed random variables, where \(Y_i\) has the density

\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} \text{ for } y_i>0, \alpha_i>0, i=1,2,\ldots,N.\]

a) Show that the distribution of \(Y_i\) comes from the exponential family. Use the general formulas to find E(\(Y_i\)) and Var(\(Y_i\)) as functions of \(\alpha_i\).

b) Show that the log-likelihood for the data \(y_1, \dots, y_N\) can be written as

\[ l = \sum_{i=1}^N \{-\alpha_i y_i + \ln \alpha_i\} \]

Use this to show that the deviance for a generalized linear model with estimated expectations \(\hat{\mu_i} = \hat{y_i}\) is

\[ D = 2 \sum _{i=1}^N \left\{ \frac{y_i-\hat y_i}{\hat y_i} - \ln \left(\frac{y_i}{\hat y_i} \right) \right\}\]

d) We want to test the null hypothesis

\[H_0: \ \alpha_1 = \alpha_2 = \cdots = \alpha_N = \alpha \]

against the alternative hypothesis that at least one \(\alpha_i\) is different from the others.
Use b) to find a test statistic for this problem.
What distribution does this test statistic have under \(H_0\)?

New: could you also use the Wald test? Explain how to do that (no need to calculate the explicit test statistic mathematically).


Exercise 3: Exam 2007, problem 2 a, b, c

Assume that \(Y_1, \dots, Y_N\) are independent, continuously distributed random variables, where the density of \(Y_i\) is given by

\[ f(y_i; \gamma_i) = \begin{cases} \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} &\text{ for } y_i \geq 0 \\ 0 &\text{ else} \end{cases}\]

where \(\gamma_i\) is a scalar parameter.

a) Show that the distribution of \(Y_i\) comes from the exponential family. Hint: usually we choose to let \(\phi=\frac{1}{2}\).

Use the general formulas to show that E\((Y_i) = 2/\gamma_i\) and Var\((Y_i) = 2/\gamma_i^2\).

Assume a GLM for \(Y_1, \dots, Y_N\) where the distribution of \(Y_i\) is as above for all \(i\), with the following link function:

\[\eta = g(\mu) = \ln(\mu) = x^T\beta\]

where \(x, \ \beta \in {\rm I\!R}^p\) and \(\mu\) is the expected value of \(y\).

b) Use general formulas to find the score vector \(s(\beta) = [s_1(\beta), \dots, s_p(\beta)]^T\) and the expected Fisher information matrix \(F(\beta) = [F_{ij}(\beta)]_{i,j = 1}^p\), expressed using \(y_1, \dots, y_N, \beta, N\) and the covariates \(x\).
Write down the equation that can be used to find the MLE for \(\beta\). Note that this is a recursive equation.

c) Write down the log-likelihood for the model above. Use this to find the deviance \(D\) for the model as a function of \(y_1, \dots, y_N\) and \(\hat{y_1}, \dots, \hat{y_N}\), where \(\hat{y_i}\) is the estimated expected value of \(y_i\).
Find an expression for the deviance residuals \(d_i\) using \(y_i\) and \(\hat{y_i}\).

Exam questions

December 2005, Problem 2 (modified)

  1. Derive the formula for the (scaled) deviance for the binomial distribution.

New: is the deviance dependent on which link function we choose? Can we use the deviance to compare two models with different link functions?

  2. The covariance matrix for the estimated coefficients is given as \(\text{Cov}(\hat{\beta})=({\bf X}^T{\bf W}{\bf X})^{-1}\) where \({\bf X}\) is the design matrix.
  1. (New) The matrix \({\bf W}\) is a diagonal matrix. What is on the diagonal?

  2. Calculate the elements of \({\bf W}\) for a Poisson regression - both with log and identity link. Compare.

  3. Calculate the elements of \({\bf W}\) for a binary regression - both with logit and identity link. Compare.

  4. (New) Did this give any insight into the role of the link function and its effect on the covariance for the parameter estimates?

December 2015

One of the important concepts we have discussed in this course is deviance (for Gaussian regression, Poisson regression and logistic regression).

  1. Explain what deviance is, and how it relates to residual sum of squares (RSS) for Gaussian regression. Remark 2017: we have called this “sums of squares of errors - SSE”

  2. Discuss how it relates to a likelihood ratio test (LRT) for comparing two nested regression models.

  3. Discuss how deviance can be used to construct “ANOVA” tables for Poisson regression and logistic regression. Remark 2017: these are called analysis of deviance tables.

  4. Discuss how deviance can be used to define residuals, for Poisson regression and logistic regression.