(Latest changes: 11.010, added links to handwritten materials and dispersion formula, 07.10.2018 first version)

Overview

Learning material

Textbook: Fahrmeir et al (2013): Chapter 5.4, 5.8.2.
Classnotes 27.09.2018

Additional notes (with theoretical focus):

Topics

random component: exponential family
- elements: \(\theta\), \(\phi\), \(w\), \(b(\theta)\)
- elements for normal, binomial, Poisson and gamma
- properties: \(\text{E}(Y)=b'(\theta)\) and \(\text{Var}(Y)=b''(\theta)\frac{\phi}{w}\) (and proof)
systematic component= linear predictor
- requirements: full rank of design matrix
link function and response function
- link examples for normal, binomial, Poisson and gamma
- requirements: one-to-one and twice differentiable
- canonical link

likelihood inference set-up: \(\theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \mathbf{\beta}\)
the loglikelihood
the score function
expected Fisher information matrix for the GLM and covariance for \(\hat{\beta}\)
- what about covariance of \(\hat{\beta}\) when \(\phi\) needs to be estimated?
- estimator for dispersion parameter
Fisher scoring and iterated reweighted least squares (IRWLS)
Pearson and deviance statistic
AIC

– so, for the first time: no practical examples or data sets to be analysed!

Jump to interactive.

GLM — three ingredients

Random component - exponential family

In Module 1 we introduced distributions of the \(Y_i\), that could be written in the form of a univariate exponential family \[ f(y_i\mid \theta_i)=\exp \left( \frac{y_i \theta_i-b(\theta_i)}{\phi}\cdot w_i + c(y_i, \phi, w_i) \right) \] where we said that

\(\theta_i\) is called the canonical parameter and is a parameter of interest
\(\phi\) is called a nuisance parameter (and is not of interest to us=therefore a nuisance (plage))
\(w_i\) is a weight function, in most cases \(w_i=1\) (NB: can not contain any unknown parameters)
\(b\) and \(c\) are known functions.

Elements - for normal, Bernoulli, Poisson and gamma

We have seen:

Distribution	\(\theta\)	\(b(\theta)\)	\(\phi\)	\(w\)	\(\text{E}(Y)=b'(\theta)\)	\(b''(\theta)\)	\(\text{Var}(Y)=b''(\theta)\phi/w\)
normal	\(\mu\)	\(\frac{1}{2} \theta^2\)	\(\sigma^2\)	\(1\)	\(\mu=\theta\)	\(1\)	\(\sigma^2\)
Bernoulli	\(\ln \left( \frac{p}{1-p} \right)\)	\(\ln (1+\exp(\theta))\)	\(1\)	\(1\)	\(p=\frac{\exp(\theta)}{1+\exp(\theta)}\)	\(p(1-p)\)	\(p(1-p)\)
Poisson	\(\ln \mu\)	\(\exp(\theta)\)	\(1\)	\(1\)	\(\lambda=\exp(\theta)\)	\(\lambda\)	\(\lambda\)
gamma	\(-\frac{1}{\mu}\)	\(-\ln (-\theta)\)	\(\frac{1}{\nu}\)	\(1\)	\(\mu=-1/\theta\)	\(\mu^2\)	\(\mu^2/\nu\)

Systematic component - linear predictor

Nothing new - as always in this course: \(\eta_i={\bf x}_i^T \mathbf{\beta}\), and we require that the \(n \times p\) design matrix \({\bf X}=({\bf x}_1^T, {\bf x}_2^T,\ldots, {\bf x}_n^T)\) has full rank (which is \(p\)).

Remark: in this course we always assume that \(n>>p\).

Link function - and response function

Link function \[\eta_i=g(\mu_i)\]

Response function \[\mu_i=h(\eta_i)\]

Note that this is a link between \(\eta_i\) and \(\mu_i\), but the exponential family is written as a funciton of \(\theta_i\). For the canonical link \(\eta_i=\theta_i\), so \(g(\mu_i)=\theta_i\), so this is not a problem. We will see the more general form below.

When the canonical link is used some of the results for the GLM (to be studied in the next sections) are simplified.

Examples for normal, binomial, Poisson and gamma

random component	response function and link function
normal	\(h(\eta_i) = \eta_i\) and \(g(\mu_i) = \mu_i\), “identity link”.
binomial	\(h(\eta_i) = \frac{e^{\eta_i}}{1+e^{\eta_i}}\) and \(g(\mu_i) = \ln \left(\frac{\mu_i}{1-\mu_i} \right) = \text{logit}(p_i)\). NB: \(\mu_i = p_i\) in our set-up.
Poisson	\(h(\eta_i) = \exp(\eta_i)\) and \(g(\mu_i) = \ln(\mu_i)\), log-link.
gamma	\(h(\eta_i) = -\frac{1}{\eta_i}\) and \(g(\mu_i) = -\frac{1}{\mu_i}\), negative inverse, or \(h(\eta_i) = \exp(\eta_i)\) and \(g(\mu_i) = \ln(\mu_i)\), log-ink.

Requirements

There are a few formal requirements for the mathematics to work, in particular:

one-to-one (inverse exists)
twice differential (for score function and expected Fisher information matrix)

\(\text{E}(Y)\) and \(\text{Var}(Y)\)

These are the main results: \(\text{E}(Y_i)=b'(\theta_i)\) and \(\text{Var}(Y_i)=b''(\theta_i)\frac{\phi}{w_i}\). But how do we get there?

These are both derivatives with respect to \(\theta\). But the expectations are over the random variables, \(Y\). To derive these we differentiate once and twice w.r.t. \(\theta\), integrate over \(Y\) \(f(Y)\), and look hard enough for the useful terms. So, take a deep breath…

To start off, we show that \(\int \frac{\partial f(Y| \theta)}{\partial \theta} dy =0\), and also \(\int \frac{\partial^2 f(Y| \theta)}{\partial \theta^2} dy =0\).

We do this by reversing the integration and differentiation (which we can because the necessary conditions are met):

\[ \int \frac{\partial f(Y|\theta)}{\partial \theta} dy = \frac{\partial \int f(Y|\theta) dy}{\partial \theta} = \frac{\partial1}{\partial \theta} =0 \] Differentiating this again also gives us 0.

Now we know that these equal 0, we can evaluate them to see what equals 0: these will include \(E(Y)\) and \(Var(Y)\), so we can then solve to find the mean and variance.

\(\text{E}(Y_i)=b'(\theta_i)\)

This is a video of the proof.

To calculate \(\frac{\partial f(Y|\theta)}{\partial \theta}\) (and end up finding \(\text{E}(Y_i)\)) we start with the pdf for \(Y\):

\[ f(Y|\theta) = \exp \left( (y\theta - b(\theta))\frac{w}{\phi} + c(Y, \phi, w) \right) \]

and the score:

\[ s(\theta_i) = \frac{\partial \log f(Y|\theta)}{\partial \theta} = (y_i - b'(\theta_i))\frac{w}{\phi}. \]

Then using the chain rule

\[ \begin{aligned} \frac{\partial f(Y|\theta)}{\partial \theta} &= \frac{\partial e^{l(\theta|Y)}}{\partial \theta} \\ &= \frac{\partial e^{l(\theta|Y)}}{\partial l(\theta|Y)} \frac{\partial l(\theta|Y)}{\partial \theta} \\ &= e^{l(\theta|Y)} s(\theta) \\ &= f(y|\theta) (y - b'(\theta))\frac{w}{\phi} \end{aligned} \]

Now we integrate this over y (and noting that \(\int f(y) y dy = E(y)\) and \(\int f(y) dy=1\)):

\[ \begin{aligned} \int \frac{\partial f(Y|\theta)}{\partial \theta} dy &= \int f(y) (y - b'(\theta))\frac{w}{\phi} dy \\ &= \frac{w}{\phi} \left( \int f(y) y dy - b'(\theta)) \int f(y) dy \right) \\ &= \frac{w}{\phi} \left( E(y) - b'(\theta) \right) \\ \end{aligned} \]

This, as we showed above, equals 0, so \(\frac{w}{\phi} \left( E(y) - b'(\theta) \right) = 0\), and thus \(E(y) = b'(\theta)\)

\(\text{Var}(Y_i)=b''(\theta_i)\frac{\phi}{w_i}\)

First up, a video:

Oscars all round!

First we calculate the second derivative w.r.t \(\theta\), and again then integrate over \(Y\). To start, we plug in the first derivative:

\[ \begin{aligned} \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} &= \frac{\partial \left(f(y) (y - b'(\theta))\frac{w}{\phi}\right)}{\partial \theta} \end{aligned} \]

We can now use the product rule, \(\frac{d u v}{dx} = u \frac{d v}{dx} + v \frac{d u}{dx}\) (and where we see \(f'(\theta)\) note that we have just calculated it):

\[ \begin{aligned} \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} &= \frac{\partial \left(f(y) (y - b'(\theta))\frac{w}{\phi}\right)}{\partial \theta} \\ &= \frac{w}{\phi} \left(f(y) \frac{\partial (y - b'(\theta))}{\partial \theta} + (y - b'(\theta)) f'(\theta) \right)\\ &= \frac{w}{\phi} \left(-f(y) b''(\theta) + (y - b'(\theta)) f(y) (y - b'(\theta))\frac{w}{\phi} \right)\\ &= f(y)\frac{w}{\phi} \left( (y - b'(\theta))^2\frac{w}{\phi} -b''(\theta)\right)\\ \end{aligned} \]

We now integrate this over \(y\). We need to notice that \(b'(\theta)=E(y)\) (as we have just calculated), so \(\int f(y)(y - b'(\theta))^2dy = \int f(y)(y - E(y))^2dy = \text{Var}(y)\), and also \(\int f(y) dy=1\):

\[ \begin{aligned} \int \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} dy &= \int f(y)\frac{w}{\phi} \left((y - b'(\theta))^2\frac{w}{\phi} -b''(\theta)\right) dy\\ &= \left(\frac{w}{\phi}\right)^2 \int f(y)(y - b'(\theta))^2 dy - \frac{w}{\phi}\int f(y) b''(\theta) dy\\ &= \left(\frac{w}{\phi}\right)^2 Var(y) - \frac{w}{\phi}b''(\theta)\int f(y) dy\\ &= \left(\frac{w}{\phi}\right)^2 Var(y) - \frac{w}{\phi}b''(\theta)\\ \end{aligned} \]

Above we saw that \(\int \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} dy = 0\), and so

\[ \begin{aligned} \left(\frac{w}{\phi}\right)^2 Var(y) - \frac{w}{\phi}b''(\theta) &= 0\\ Var(y) &= \frac{\phi}{w} b''(\theta) \end{aligned} \]

Phew.

\(b''(\theta_i)\) is often called the variance function \(v(\mu_i)\).

Another Proof that \(\text{E}(s(\theta_i))=0\)

Via the score

Easy \(\text{E}(s(\theta_i))=0\) and \(s(\theta_i) = (y_i - b'(\theta_i))\frac{w_i}{\phi}\)

Likelihood inference set-up

The likelihood has to go through a few steps to get to something that can be maximised for \(\beta\).

We have

\[ \begin{aligned} l(y_i|\theta_i) &= (y_i\theta_i - b(\theta_i))\frac{w_i}{\phi} + c(y_i, w_i, \phi) \\ \mu_i &= g^{-1}(\eta_i) \\ \eta_i &= \mathbf{x}_i' \beta \end{aligned} \]

But how do we get from \(\mu_i\) to \(\theta_i\)? With a canonical link \(\mu_i=\theta_i\), but more generally we have to use \(b'(\theta_i) = \mu_i\), which we proved above. So the route between \(\theta_i\) and \(\mathbf{\beta}\) is:

\[ \theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \mathbf{\beta} \]

We need to take this route to get to an expression for the score, which we then use to get the maximum likelihood estimates of \(\beta\), i.e. \(\hat{\beta}\).

See class notes or Fahrmeir et al (2015), Section 5.8.2 for the derivation of the loglikelihood, score and expected Fisher information matrix.

Loglikelihood

Now let’s work out the log-likelihood as a function of \(\beta\)

\[l(\theta)=\sum_{i=1}^n l_i(\theta)=\sum_{i=1}^n\frac{1}{\phi}(y_i\theta_i-b(\theta_i))w_i+\sum_{i=1}^n c(y_i,\phi,w_i)\]

Getting from there to \(\boldsymbol{\beta}\) takes a few steps:

\[ \begin{aligned} f(y_i | \theta_i) &= exp \left(\frac{y_i \theta - b(\theta_i)}{\phi/w_i} + c(y_i, \phi, w_i) \right) \\ \theta_i &= b^{'-1}(\mu) (\text{from } \mu_i = b'(\theta_i) (=E(Y_i)))\\\\ \mu_i &= g^{-1}(\eta_i) \\ \eta_i &= x_i' \boldsymbol{\beta} \end{aligned} \]

Remark: the part of the loglikelihood involving both the data and the parameter of interest is for a canonical link equal to

\[\sum_{i=1}^n y_i \theta_i=\sum_{i=1}^n y_i {\bf x}_i^T\beta=\sum_{i=1}^n y_i \sum_{j=1}^p x_{ij}\beta_j=\sum_{j=1}^p \beta_j \sum_{i=1}^n y_i x_{ij}\]

Which makes things easier

Score function

What is the score function as a function of \(\beta\)? We need to use this chain:

\(\theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \boldsymbol{\beta}\)

And thus a long chain rule…

\[ s(\beta)= \frac{\partial l}{\partial \beta} = \frac{\partial l(\theta)}{\partial \theta} \frac{\partial \theta}{\partial \mu} \frac{\partial \mu}{\partial \eta} \frac{\partial \eta}{\partial \beta} \]

We already have \(\partial l/\partial \theta = (y_i - b'(\theta))\frac{w_i}{\phi}\) above ,so we need the rest. Most are easy:

\(\partial l/\partial \theta = (y_i - b'(\theta))\frac{w_i}{\phi}\) from above
\(\frac{\partial \theta_i}{\partial \mu_i} = \frac{\phi}{w_i \text{Var}(y_i)}\) see below.
\(\frac{\partial \mu_i}{\partial \eta_i} = \frac{\partial h(\eta_i)}{\partial \eta_i} = h'(\eta_i)\) and will obviously depend on the link function.
\(\frac{\partial \eta_i}{\partial \beta} = \frac{\partial \mathbf{x}_i' \beta}{\partial \beta} = \mathbf{x}_i\) is easy.

We get \(\frac{\partial \theta_i}{\partial \mu_i}\) by inverting it, i.e. calculating \(\frac{\partial \mu_i}{\partial \theta_i}\):

\[ \frac{\partial \mu_i}{\partial \theta_i} = \frac{\partial b'(\theta_i)}{\partial \theta_i} = b''(\theta_i) = \frac{w_i \text{Var}(y_i)}{\phi} \]

So we can put it all together (and cancel \(w_i\) and \(\phi\)):

\[ s_i(\beta) = (y_i - b'(\theta_i))\frac{w_i}{\phi} \frac{\phi}{w_i \text{Var}(y_i)} h'(\eta_i) \mathbf{x}_i = \frac{(y_i - b'(\theta_i))}{\text{Var}(y_i)} h'(\eta_i) \mathbf{x}_i \]

Now the total score is the sum of these components:

\[s(\beta)=\sum_{i=1}^n \frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)} = {\bf X}^T {\bf D} \Sigma^{-1}({\bf y}-{\mathbf \mu})\]

where \(\Sigma=\text{diag}(\text{Var}(Y_i))\) and \({\bf D}=\text{diag}(h'(\eta_i))\) (derivative wrt \(\eta_i\)).

Observe that \(s(\beta)=0\) only depends on the distribution of \(Y_i\) through \(\mu_i\) and \(\text{Var}(Y_i)\).

Canonical link:

\[ s(\beta)=\sum_{i=1}^n \frac{(y_i-\mu_i){\bf x}_i w_i}{\phi} \]

since \(\frac{\partial \mu_i}{\partial \eta_i}=b''(\theta_i)\).

Expected Fisher information matrix for the GLM and covariance for \({\hat\beta}\)

The exprect Fisher information matrix is \(F_{[h,l]}(\boldsymbol{\beta}) =\text{E}(\mathbf{s}_i(\boldsymbol{\beta}) \mathbf{s}_i'(\boldsymbol{\beta}))\):

\[ \begin{aligned} F_{i}(\boldsymbol{\beta}) &=\text{E}(\mathbf{s}_i(\boldsymbol{\beta}) \mathbf{s}_i'(\boldsymbol{\beta})) \\ &= E \left( \frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)} \left(\frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)}'\right)\right) \\ &= {\bf x}_i {\bf x}'_i E \left( \frac{(y_i-\mu_i)^2 (h'(\eta_i))^2}{\text{Var}^2(Y_i)}\right) \\ &= {\bf x}_i {\bf x}'_i E(y_i-\mu_i)^2 \frac{ (h'(\eta_i))^2}{\text{Var}^2(Y_i)} \\ &= {\bf x}_i {\bf x}'_i \frac{ (h'(\eta_i))^2}{\text{Var}(Y_i)} \end{aligned} \]

\[F_{[h,l]}(\beta)=\sum_{i=1}^n \frac{x_{ih}x_{il}(h'(\eta_i))^2}{\text{Var}(Y_i)}\]

\[F(\beta)={\bf X}^T {\bf W} {\bf X}\]

where \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).

Canonical link:

\[ \frac{\partial^2 l_i}{\partial \beta_j \partial \beta_l}=- \frac{x_{ij} w_i}{\phi}(\frac{\partial \mu_i}{\partial \beta_l}) \] which do not contain any random variables, so the observed must be equal to the expected Fisher information matrix.

Fisher scoring and iterated reweighted least squares (IWLS)

STILL UNDER CONSTRUCTION. WE HAVE NOT CONVERGED ON A SOLUTION YET

Details on the derivation: IWLS

IWLS uses the Newton-Raphson algorithm, via Fisher scoring. The idea is to replace the matrix of derivatives, the observed Fisher information, by the expected Fisher information. For a canonical link these are the same, of course. The reason this is called iterated reweighted least squares is that each iteration is a weighted least squares calculation. To remind you, for known weights the weighted least squares solution is \(\hat{\beta} = (X' W X)^{-1} X'W{\bf y}\), where \(W\) is a diagonal matrix of known weights (if there are off-diagonal terms, the name changes to generalised least squares). So we will be looking for something that looks like this.

In general, Newton-Raphson solves the equation \(f(x)=0\) for \(x\). It does this by starting with an initial guess, \((x^{(0)}, f(x^{(0)}\), finding the tangent and extrapolating this to the x-axis to find \(x^{(1)}\). Then from \((x^{(1)}, f(x^{(1)}))\) it does the same thing, and continues until \(f(x^{(t+1)})\) is close enough to 0.

In practice this means that at each iteration we start with \(x^{(t)}\), and update with \(x^{(t+1)} = x^{(t)} - \frac{f(x^{(t)})}{f'(x^{(t)})}\). We keep doing this until the change is sm all enough, i.e. \(|f(x_{t+1})-f(x_{t})| < \epsilon\) for some \(f()\) (e.g. the deviance).

For a GLM we need to solve \(s(\boldsymbol{\beta})=0\). We now have a vector we want to solve, but the idea is still the same. Of course, for a GLM \(f'(x)\) is the observed Fisher information. But it can be easier to use the expected Fisher information, \(F(\boldsymbol{\beta}) = {\bf X}^T {\bf W} {\bf X}\). So we get

\[ \begin{aligned} \boldsymbol{\beta}^{(t+1)}&=\boldsymbol{\beta}^{(t)} + F(\boldsymbol{\beta}^{(t)})^{-1} s(\boldsymbol{\beta}^{(t)}) \\ &= \boldsymbol{\beta}^{(t)} + ({\bf X}^T {\bf W(\boldsymbol{\beta}^{(t)})} {\bf X})^{-1} {\bf X}^T {\bf D}(\boldsymbol{\beta}^{(t)}) \Sigma^{-1}(\boldsymbol{\beta}^{(t)})({\bf y}-{\mathbf \mu(\boldsymbol{\beta}^{(t)})}) \\ \end{aligned} \]

Where now we have to write \({\bf W(\boldsymbol{\beta}^{(t)})}\), \({\bf D}(\boldsymbol{\beta}^{(t)})\) and \(\Sigma^{-1}(\boldsymbol{\beta}^{(t)})\) because they all depend on \(\boldsymbol{\beta}\), and this changes every iteration. To make it easier to read, we will drop the \((\boldsymbol{\beta}^{(t)})\) for now, and come back to it later.

From above, \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\), \(\Sigma=\text{diag}(\text{Var}(Y_i))\) and \({\bf D}=\text{diag}(h'(\eta_i))\) (derivative wrt \(\eta_i\)). We can thus write \({\bf W}= \Sigma^{-1}{\bf D}^2\), i.e. \({\bf D} \Sigma^{-1} = {\bf W}{\bf D}^{-1}\) (diagonal matrices are so cooperative), and get

\[ \begin{aligned} \boldsymbol{\beta}^{(t+1)}&=\boldsymbol{\beta}^{(t)} + F(\boldsymbol{\beta}^{(t)})^{-1} s(\boldsymbol{\beta}^{(t)}) \\ &= \boldsymbol{\beta}^{(t)} + ({\bf X}^T {\bf W} {\bf X})^{-1} {\bf X}^T {\bf W} {\bf D}^{-1}({\bf y}-{\mathbf \mu}) \\ \end{aligned} \]

We now have a \(({\bf X}^T {\bf W} {\bf X})^{-1} {\bf X}^T {\bf W}\) term, of the sort that would appear in weighted least squares. But we need to pre-multiply everything by it. Without looking down, can you see the trick to do this?

No? OK then.

We can pre-multiply \(\boldsymbol{\beta}^{(t+1)} = I \boldsymbol{\beta}^{(t+1)}\) and use \(I = ({\bf X}^T {\bf W} {\bf X})^{-1} ({\bf X}^T {\bf W} {\bf X})\):

\[ \begin{aligned} \boldsymbol{\beta}^{(t+1)}&=\boldsymbol{\beta}^{(t)} + F(\boldsymbol{\beta}^{(t)})^{-1} s(\boldsymbol{\beta}^{(t)}) \\ &= ({\bf X}^T {\bf W} {\bf X})^{-1} ({\bf X}^T {\bf W} {\bf X}) \boldsymbol{\beta}^{(t)} + ({\bf X}^T {\bf W} {\bf X})^{-1} {\bf X}^T {\bf W} {\bf D}^{-1}({\bf y}-{\mathbf \mu}) \\ &= ({\bf X}^T {\bf W} {\bf X})^{-1} ({\bf X}^T {\bf W}) ({\bf X} \boldsymbol{\beta}^{(t)} + {\bf D}^{-1}({\bf y}-{\mathbf \mu})) \end{aligned} \]

Which is the same as weighed least squares with \(W = \text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\), and \({\bf \tilde{y}} = {\bf X} \boldsymbol{\beta}^{(t)} + {\bf D}^{-1}({\bf y}-{\mathbf \mu})\)

Inserting the dependence on \(\boldsymbol{\beta}(t)\), we get

\[\beta^{(t+1)}=({\bf X}^T {\bf W}(\beta^{(t)}) {\bf X})^{-1} {\bf X}^T {\bf W}(\beta^{(t)})\bf{\tilde y}_i^{(t)}\]

where

\({\bf W^{(t)}}=\text{diag}(\frac{h'(\eta^{(t)}_i)^2}{\text{Var}(Y^{(t)}_i)})\) and the elements are called the working weights
\(\tilde{\bf y}_i^{(t)} = {\bf X} \boldsymbol{\beta}^{(t)} + {\bf D}^{-1}({\bf y}-{\mathbf \mu})\) is called the working response vector

In practice, we use our current value of \(\boldsymbol{\beta}^{(t)}\) to calculate the working weights and working response vector, put those into the weighted least squares, get a new value of \(\boldsymbol{\beta}^{(t+1)}\), and re-calculate the working weights and working response vector etc. We stop when the change in deviance between iterations is small (as a default R uses \(\frac{|\text{Dev}(t+1) - \text{Dev}(t)|}{|\text{Dev}(t+1)| +0.1}<10^{-8}\)).

With full rank of \({\bf X}\) and positive diagonal elements of \({\bf W}\) we are certain that the inverse will exist, so the algorithm will converge, but there might be that the temporary version of \({\bf W}\) can cause problems.

See what is output from glm- observe working weights as weights..

fitgrouped = glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
names(fitgrouped)
fitgrouped$weights
fitgrouped$residuals

##  [1] "coefficients"      "residuals"         "fitted.values"    
##  [4] "effects"           "R"                 "rank"             
##  [7] "qr"                "family"            "linear.predictors"
## [10] "deviance"          "aic"               "null.deviance"    
## [13] "iter"              "weights"           "prior.weights"    
## [16] "df.residual"       "df.null"           "y"                
## [19] "converged"         "boundary"          "model"            
## [22] "call"              "formula"           "terms"            
## [25] "data"              "offset"            "control"          
## [28] "method"            "contrasts"         "xlevels"          
##         1         2         3         4         5         6         7         8 
##  3.254867  8.227383 14.321313 13.378893 10.261055  5.156671  2.653398  1.230713 
##           1           2           3           4           5           6 
##  0.78115418  0.38388091 -0.31082206 -0.44081641  0.18557365 -0.05641516 
##           7           8 
##  0.67002811  1.02139898

Estimator for dispersion parameter

Let data be grouped as much as possible. With G groups (covariate pattern) with \(n_i\) observations for each group (then \(n=\sum^G n_i=n\)):

\[ \hat{\phi}=\frac{1}{G-p} \sum_{i=1}^G \frac{(y_i-{\hat{\mu}}_i)^2}{b''(\theta_i)/w_i}\]

The motivation behind this estimator is as follows: \[ \text{Var}(Y_i)=\phi b''(\theta_i)/w_i \Leftrightarrow \phi=\text{Var}(Y_i)/(b''(\theta_i)/w_i)\]

Distribution of the MLE

As before we have that maximum likelihood estimator \(\hat{\beta}\) asymptotically follows the multivariate normal distribution with mean \(\beta\) and covariance matrix equal to the inverse of the expected Fisher information matrix. This is also true when we replace the unknown \(\beta\) with the estimated \(\hat{\beta}\) for the expected Fisher information matrix.

\[\hat{\beta} \approx N_p(\beta,F^{-1}(\hat{\beta}))\] and with \[F(\hat{\beta})={\bf X}^T \hat{{\bf W}} {\bf X}\] where \(\hat{{\bf W}}\) denotes that \(\hat{\beta}\) is used then calculating \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).

What about the distribution of \(\hat{\beta}, \hat{\phi}\)?

The concept of orthogonal parameters

Hypothesis testing

Same as before - for the Wald we insert the formula for the covariance matrix of \(\hat{\beta}\), for the LRT we insert the loglikelihoods and for the score test we insert formulas for the score function and expected Fisher information matrix.

Model assessment and model choice

Pearson and deviance statistic

Group observations together in groups of maximal size (covariate patterns? interval versions thereof?). Group \(i\) has \(n_i\) observations, and there are \(G\) groups. Asymptotic distribution correct if all groups have big \(n_i\). For the non-continuous individual data asymptotic results can not be trusted.

Deviance \[D=-2[\sum_{i=1}^{g}(l_i(\hat{\mu}_i)-l_i(\bar{y}_i))]\] with approximate \(\chi^2\)-distribution with \(G-p\) degrees of freedom.

Pearson: \[ X_P^2=\sum_{i=1}^G \frac{(y_i-\hat{\mu}_i)^2}{v(\hat{\mu}_i)/w_i}\] with approximate \(\phi \cdot \chi^2\)-distribution with \(G-p\) degrees of freedom.

Remember that the variance function \(v(\hat{\mu}_i)=b''(\theta_i)\) (this is a function of \(\mu_i\) because \(\mu_i=b'(\theta_i)\)).

AIC

Let \(p\) be the number of regression parameters in our model. \[\text{AIC} =-2 \cdot l(\hat{\beta})+2p\] If the dispersion parameter is estimated use \((p+1)\) in place of \(p\).

Interactive session

Work with Problem 1 and 2 in IL, and work on Problems 3-5 by yourself.

If you have more time after Problem 1 and 2, look through the theoretical proofs and derivations listed under “Learning material” on the top of this Module page.

Problem 1: Exam 2011, problem 3

a) Define the class of generalized linear models (GLMs), and explicitly list all requirements for each part of the model.

b) Below are three likelihoods, three link functions, and three linear components listed. Explain which combinations that give valid GLMs (8 in total), and also comment on these models (you do not have to mathematically prove which are valid).

Likelihoods:

Gaussian, \(Y \sim N(\mu, \sigma^2)\)
Binomial, \(Y \sim Bin(n, \pi)\), where \(n\) is not fixed (hence is unknown and be estimated)
Poisson, \(Y \sim Poisson(\lambda)\)

Link functions:

\(\eta = \cos(\mu)\)
\(\eta = \mu\)
\(\eta = \log(\mu)\)

Linear components:

\(\eta = \beta_0 + \beta_1x_1 + \beta_2x_2\)
\(\eta = \beta_0 + \beta_1x_1 + \beta_2x_1^2\)
\(\eta = \beta_0 + \beta_1x_1 + \beta_1^2x_2\)

Problem 2: December 2005, Problem 2 (modified)

Derive the formula for the (scaled) deviance for the binomial distribution.

The covariance matix for the estimated coefficents are given as \(\text{Cov}(\hat{\beta})=({\bf X}^T{\bf W}{\bf X})^{-1}\) where \({\bf X}\) is the design matrix.

(New) The matrix \({\bf W}\) is a diagonal matrix. What is on the diagonal?
Calulate the elements of \({\bf W}\) for a Poisson regression- both with log and identity link. Compare.
Calulate the elements of \({\bf W}\) for a binary regression - both with logit and identity link. Compare.
(New) Which insight did this give you into the role of the link function and its effect on the covariance for the parameter estimates?

Problem 3: Exam 2006, problem 2 (a, b, d)

Let \(Y_1,Y_2,\ldots,Y_N\) be independent and exponentially distributed random variables, where \(Y_i\) has the density

\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} \text{ for } y_i>0, \alpha_i>0, i=1,2,\ldots,N.\]

a) Show that the distribution of \(Y_i\) comes from the exponential family. Use the general formulas to find E(\(Y_i\)) and \(\text{Var}(Y_i)\) as functions of \(\alpha_i\).

b) Show that the log-likelihood for the data \(y_1, \dots, y_n\) can be written as

\[ l = \sum_{i=1}^N \{-\alpha_i y_i + \ln \alpha_i\} \]

Use this to show that the deviance for a generalized linear model with estimated expectations \(\hat{\mu_i} = \hat{y_i}\) is

\[ D = 2 \sum _{i=1}^N \left\{ \frac{y_i-\hat y_i}{\hat y_i} - \ln \left(\frac{y_i}{\hat y_i} \right) \right\}\]

d) We want to test the null hypothesis

\[H_0: \ \alpha_1 = \alpha_2 = \cdots = \alpha_N = \alpha \]

against the alternative hypothesis that at least one \(\alpha_i\) is different from the others.
Use b) to find a test statistic for this problem.
What distribution does this test statistic have under \(H_0\)?

New: could you also use the Wald test? Explain how to do that (no need to calculated the mathematically explicit test statistic).

Problem 4: Exam 2007, problem 2 a, b, c

Assume that \(Y_1, \dots, Y_N\) are independent continuous distributed random variables, where the density of \(Y_i\) is given by

\[ f(y_i; \gamma_i) = \begin{cases} \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} &\text{ for } y_i \geq 0 \\ 0 &\text{ else} \end{cases}\]

where \(\gamma_i\) is a scalar parameter.

a) Show that the distribution of \(Y_i\) comes from the exponential family. Hint: usually we choose to let \(\phi=\frac{1}{2}\).

Use the general formulas to show that E\((Y_i) = 2/\gamma_i\) and Var\((Y_i) = 2/\gamma_i^2\).

Assume a GLM for \(Y_1, \dots, Y_N\) where the distribution of \(Y_i\) is as above for all \(i\), with the following link function:

\[\eta = g(\mu) = \ln(\mu) = x^T\beta\]

where \(x, \ \beta \in {\rm I\!R}^p\) and \(\mu\) is the expected value of \(y\).

b) Use general formulas to find the score vector \(s(\beta) = [s_1(\beta), \dots, s_p(\beta)]^T\) and the expected Fisher information matrix \(F(\beta) = [F_{ij}(\beta)]_{i,j = 1}^p\), expressed using \(y_1, \dots, y_N, \beta, N\) and the covariates \(x\).
Write down the equation that can be used to find the MLE for \(\beta\). Note that this is a recursive equation.

c) Write down the log-likelihood for the model above. Use this to find the deviance \(D\) for the model as a function of \(y_1, \dots, y_N\) and \(\hat{y_1}, \dots, \hat{y_N}\), where \(\hat{y_i}\) is the estimated expected value of \(y_i\).
Find an expression for the deviance residuals \(d_i\) using \(y_i\) and \(\hat{y_i}\).

Problem 5: Exam UiO December 2017, Problem 2

We assume that the random variable \(\Lambda\) is gamma distributed with pdf

\[f(\lambda; \nu, \mu) = \frac{(\nu/\mu)^\nu}{\Gamma(\nu)} \lambda^{\nu-1} e^{-\lambda/\mu}; \ \lambda > 0\]

and further that given \(\Lambda = \lambda\), the random variable \(Y\) is Poisson distributed with parameter \(\lambda\). Thus the conditional pmf of \(Y\) given \(\Lambda = \lambda\) takes the form

\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]

a) Show that the marginal pmf of \(Y\) is given by \[p(y; \mu, \nu) = \frac{\Gamma(y+\nu)}{\Gamma(\nu)\Gamma(y+1)} \left(\frac{\mu}{\mu+\nu}\right)^y \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y = 0, 1, 2, \dots\] This is the negative binomial distribution.

We then assume that the parameter \(\nu\) is fixed, and consider the random variable \(Y^* = Y/\nu\). Note that

\[\text{P}(Y^* = y^*) = \text{P}(Y = ky^*) \ \text{for} \ y^* = 0, \frac{1}{k}, \frac{2}{k}, \dots\]

so \(Y^*\) has pmf

\[p^*(y^*; \mu, \nu) = \frac{\Gamma(\nu y^* + \nu)}{\Gamma(\nu)\Gamma(\nu y^* + 1)} \left(\frac{\mu}{\mu+\nu}\right)^{\nu y^*} \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y^* = 0, \frac{1}{k}, \frac{2}{k}, \dots\]

b) Show that the pmf of \(Y^*\) is an exponential family \[\exp\left\{\frac{y \theta - b(\theta)}{\phi}w + c(y, \phi, w)\right\},\] with \(\theta = \log(\mu/(\mu+\nu))\), \(b(\theta) = -\log(1 - e^{\theta})\), \(w = 1\) and \(\phi = 1/\nu\)

c) Use the expressions for \(b(\theta)\) and \(\phi\) to determine \(\text{E}(Y^*)\) and \(\text{Var}(Y^*)\). Show that \(\text{E}(Y) = \mu\) and find \(\text{Var}(Y)\).

Exam questions

December 2015

One of the important concepts we have discussed in this course, is deviance (for Gaussian regression, Poisson regression and logistic regression).

Explain what deviance is, and how it relates to residual sum of squares (RSS) for Gaussian regression. Remark 2017/2018: we have called this “sums of squares of errors - SSE”
Discuss how it relates to a likelihood ratio test (LRT) for comparing two nested regression models.
Discuss how deviance can be used to construct “ANOVA” tables for Poisson regression and logistic regression. Remark 2017/2018: these are called analysis of deviance tables.
Discuss how deviance can be used to define residuals, for Poisson regression and logistic regression.

TMA4315 Generalized linear models H2018

Module 5: Generalized linear models - common core

Mette Langaas, Department of Mathematical Sciences, NTNU - with contributions from Ingeborg Hem

11.10.2017 [PL], 12.10.2017 [IL]