(Latest changes: 11.010, added links to handwritten materials and dispersion formula, 07.10.2018 first version)
Additional notes (with theoretical focus):
– so, for the first time: no practical examples or data sets to be analysed!
Jump to interactive.
In Module 1 we introduced distributions of the \(Y_i\), that could be written in the form of a univariate exponential family \[ f(y_i\mid \theta_i)=\exp \left( \frac{y_i \theta_i-b(\theta_i)}{\phi}\cdot w_i + c(y_i, \phi, w_i) \right) \] where we said that
\(\theta_i\) is called the canonical parameter and is a parameter of interest
\(\phi\) is called a nuisance parameter (and is not of interest to us=therefore a nuisance (plage))
\(w_i\) is a weight function, in most cases \(w_i=1\) (NB: can not contain any unknown parameters)
\(b\) and \(c\) are known functions.
We have seen:
| Distribution | \(\theta\) | \(b(\theta)\) | \(\phi\) | \(w\) | \(\text{E}(Y)=b'(\theta)\) | \(b''(\theta)\) | \(\text{Var}(Y)=b''(\theta)\phi/w\) |
|---|---|---|---|---|---|---|---|
| normal | \(\mu\) | \(\frac{1}{2} \theta^2\) | \(\sigma^2\) | \(1\) | \(\mu=\theta\) | \(1\) | \(\sigma^2\) |
| Bernoulli | \(\ln \left( \frac{p}{1-p} \right)\) | \(\ln (1+\exp(\theta))\) | \(1\) | \(1\) | \(p=\frac{\exp(\theta)}{1+\exp(\theta)}\) | \(p(1-p)\) | \(p(1-p)\) |
| Poisson | \(\ln \mu\) | \(\exp(\theta)\) | \(1\) | \(1\) | \(\lambda=\exp(\theta)\) | \(\lambda\) | \(\lambda\) |
| gamma | \(-\frac{1}{\mu}\) | \(-\ln (-\theta)\) | \(\frac{1}{\nu}\) | \(1\) | \(\mu=-1/\theta\) | \(\mu^2\) | \(\mu^2/\nu\) |
Nothing new - as always in this course: \(\eta_i={\bf x}_i^T \mathbf{\beta}\), and we require that the \(n \times p\) design matrix \({\bf X}=({\bf x}_1^T, {\bf x}_2^T,\ldots, {\bf x}_n^T)\) has full rank (which is \(p\)).
Remark: in this course we always assume that \(n>>p\).
Link function \[\eta_i=g(\mu_i)\]
Response function \[\mu_i=h(\eta_i)\]
Note that this is a link between \(\eta_i\) and \(\mu_i\), but the exponential family is written as a funciton of \(\theta_i\). For the canonical link \(\eta_i=\theta_i\), so \(g(\mu_i)=\theta_i\), so this is not a problem. We will see the more general form below.
When the canonical link is used some of the results for the GLM (to be studied in the next sections) are simplified.
| random component | response function and link function |
|---|---|
| normal | \(h(\eta_i) = \eta_i\) and \(g(\mu_i) = \mu_i\), “identity link”. |
| binomial | \(h(\eta_i) = \frac{e^{\eta_i}}{1+e^{\eta_i}}\) and \(g(\mu_i) = \ln \left(\frac{\mu_i}{1-\mu_i} \right) = \text{logit}(p_i)\). NB: \(\mu_i = p_i\) in our set-up. |
| Poisson | \(h(\eta_i) = \exp(\eta_i)\) and \(g(\mu_i) = \ln(\mu_i)\), log-link. |
| gamma | \(h(\eta_i) = -\frac{1}{\eta_i}\) and \(g(\mu_i) = -\frac{1}{\mu_i}\), negative inverse, or \(h(\eta_i) = \exp(\eta_i)\) and \(g(\mu_i) = \ln(\mu_i)\), log-ink. |
There are a few formal requirements for the mathematics to work, in particular:
These are the main results: \(\text{E}(Y_i)=b'(\theta_i)\) and \(\text{Var}(Y_i)=b''(\theta_i)\frac{\phi}{w_i}\). But how do we get there?
These are both derivatives with respect to \(\theta\). But the expectations are over the random variables, \(Y\). To derive these we differentiate once and twice w.r.t. \(\theta\), integrate over \(Y\) \(f(Y)\), and look hard enough for the useful terms. So, take a deep breath…
To start off, we show that \(\int \frac{\partial f(Y| \theta)}{\partial \theta} dy =0\), and also \(\int \frac{\partial^2 f(Y| \theta)}{\partial \theta^2} dy =0\).
We do this by reversing the integration and differentiation (which we can because the necessary conditions are met):
\[ \int \frac{\partial f(Y|\theta)}{\partial \theta} dy = \frac{\partial \int f(Y|\theta) dy}{\partial \theta} = \frac{\partial1}{\partial \theta} =0 \] Differentiating this again also gives us 0.
Now we know that these equal 0, we can evaluate them to see what equals 0: these will include \(E(Y)\) and \(Var(Y)\), so we can then solve to find the mean and variance.
This is a video of the proof.
To calculate \(\frac{\partial f(Y|\theta)}{\partial \theta}\) (and end up finding \(\text{E}(Y_i)\)) we start with the pdf for \(Y\):
\[ f(Y|\theta) = \exp \left( (y\theta - b(\theta))\frac{w}{\phi} + c(Y, \phi, w) \right) \]
and the score:
\[ s(\theta_i) = \frac{\partial \log f(Y|\theta)}{\partial \theta} = (y_i - b'(\theta_i))\frac{w}{\phi}. \]
Then using the chain rule
\[ \begin{aligned} \frac{\partial f(Y|\theta)}{\partial \theta} &= \frac{\partial e^{l(\theta|Y)}}{\partial \theta} \\ &= \frac{\partial e^{l(\theta|Y)}}{\partial l(\theta|Y)} \frac{\partial l(\theta|Y)}{\partial \theta} \\ &= e^{l(\theta|Y)} s(\theta) \\ &= f(y|\theta) (y - b'(\theta))\frac{w}{\phi} \end{aligned} \]
Now we integrate this over y (and noting that \(\int f(y) y dy = E(y)\) and \(\int f(y) dy=1\)):
\[ \begin{aligned} \int \frac{\partial f(Y|\theta)}{\partial \theta} dy &= \int f(y) (y - b'(\theta))\frac{w}{\phi} dy \\ &= \frac{w}{\phi} \left( \int f(y) y dy - b'(\theta)) \int f(y) dy \right) \\ &= \frac{w}{\phi} \left( E(y) - b'(\theta) \right) \\ \end{aligned} \]
This, as we showed above, equals 0, so \(\frac{w}{\phi} \left( E(y) - b'(\theta) \right) = 0\), and thus \(E(y) = b'(\theta)\)
First up, a video:
Oscars all round!
First we calculate the second derivative w.r.t \(\theta\), and again then integrate over \(Y\). To start, we plug in the first derivative:
\[ \begin{aligned} \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} &= \frac{\partial \left(f(y) (y - b'(\theta))\frac{w}{\phi}\right)}{\partial \theta} \end{aligned} \]
We can now use the product rule, \(\frac{d u v}{dx} = u \frac{d v}{dx} + v \frac{d u}{dx}\) (and where we see \(f'(\theta)\) note that we have just calculated it):
\[ \begin{aligned} \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} &= \frac{\partial \left(f(y) (y - b'(\theta))\frac{w}{\phi}\right)}{\partial \theta} \\ &= \frac{w}{\phi} \left(f(y) \frac{\partial (y - b'(\theta))}{\partial \theta} + (y - b'(\theta)) f'(\theta) \right)\\ &= \frac{w}{\phi} \left(-f(y) b''(\theta) + (y - b'(\theta)) f(y) (y - b'(\theta))\frac{w}{\phi} \right)\\ &= f(y)\frac{w}{\phi} \left( (y - b'(\theta))^2\frac{w}{\phi} -b''(\theta)\right)\\ \end{aligned} \]
We now integrate this over \(y\). We need to notice that \(b'(\theta)=E(y)\) (as we have just calculated), so \(\int f(y)(y - b'(\theta))^2dy = \int f(y)(y - E(y))^2dy = \text{Var}(y)\), and also \(\int f(y) dy=1\):
\[ \begin{aligned} \int \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} dy &= \int f(y)\frac{w}{\phi} \left((y - b'(\theta))^2\frac{w}{\phi} -b''(\theta)\right) dy\\ &= \left(\frac{w}{\phi}\right)^2 \int f(y)(y - b'(\theta))^2 dy - \frac{w}{\phi}\int f(y) b''(\theta) dy\\ &= \left(\frac{w}{\phi}\right)^2 Var(y) - \frac{w}{\phi}b''(\theta)\int f(y) dy\\ &= \left(\frac{w}{\phi}\right)^2 Var(y) - \frac{w}{\phi}b''(\theta)\\ \end{aligned} \]
Above we saw that \(\int \frac{\partial^2 f(Y|\theta)}{\partial \theta^2} dy = 0\), and so
\[ \begin{aligned} \left(\frac{w}{\phi}\right)^2 Var(y) - \frac{w}{\phi}b''(\theta) &= 0\\ Var(y) &= \frac{\phi}{w} b''(\theta) \end{aligned} \]
Phew.
\(b''(\theta_i)\) is often called the variance function \(v(\mu_i)\).
Via the score
Easy \(\text{E}(s(\theta_i))=0\) and \(s(\theta_i) = (y_i - b'(\theta_i))\frac{w_i}{\phi}\)
The likelihood has to go through a few steps to get to something that can be maximised for \(\beta\).
We have
\[ \begin{aligned} l(y_i|\theta_i) &= (y_i\theta_i - b(\theta_i))\frac{w_i}{\phi} + c(y_i, w_i, \phi) \\ \mu_i &= g^{-1}(\eta_i) \\ \eta_i &= \mathbf{x}_i' \beta \end{aligned} \]
But how do we get from \(\mu_i\) to \(\theta_i\)? With a canonical link \(\mu_i=\theta_i\), but more generally we have to use \(b'(\theta_i) = \mu_i\), which we proved above. So the route between \(\theta_i\) and \(\mathbf{\beta}\) is:
\[ \theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \mathbf{\beta} \]
We need to take this route to get to an expression for the score, which we then use to get the maximum likelihood estimates of \(\beta\), i.e. \(\hat{\beta}\).
See class notes or Fahrmeir et al (2015), Section 5.8.2 for the derivation of the loglikelihood, score and expected Fisher information matrix.
Now let’s work out the log-likelihood as a function of \(\beta\)
\[l(\theta)=\sum_{i=1}^n l_i(\theta)=\sum_{i=1}^n\frac{1}{\phi}(y_i\theta_i-b(\theta_i))w_i+\sum_{i=1}^n c(y_i,\phi,w_i)\]
Getting from there to \(\boldsymbol{\beta}\) takes a few steps:
\[ \begin{aligned} f(y_i | \theta_i) &= exp \left(\frac{y_i \theta - b(\theta_i)}{\phi/w_i} + c(y_i, \phi, w_i) \right) \\ \theta_i &= b^{'-1}(\mu) (\text{from } \mu_i = b'(\theta_i) (=E(Y_i)))\\\\ \mu_i &= g^{-1}(\eta_i) \\ \eta_i &= x_i' \boldsymbol{\beta} \end{aligned} \]
Remark: the part of the loglikelihood involving both the data and the parameter of interest is for a canonical link equal to
\[\sum_{i=1}^n y_i \theta_i=\sum_{i=1}^n y_i {\bf x}_i^T\beta=\sum_{i=1}^n y_i \sum_{j=1}^p x_{ij}\beta_j=\sum_{j=1}^p \beta_j \sum_{i=1}^n y_i x_{ij}\]
Which makes things easier
What is the score function as a function of \(\beta\)? We need to use this chain:
\(\theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \boldsymbol{\beta}\)
And thus a long chain rule…
\[ s(\beta)= \frac{\partial l}{\partial \beta} = \frac{\partial l(\theta)}{\partial \theta} \frac{\partial \theta}{\partial \mu} \frac{\partial \mu}{\partial \eta} \frac{\partial \eta}{\partial \beta} \]
We already have \(\partial l/\partial \theta = (y_i - b'(\theta))\frac{w_i}{\phi}\) above ,so we need the rest. Most are easy:
We get \(\frac{\partial \theta_i}{\partial \mu_i}\) by inverting it, i.e. calculating \(\frac{\partial \mu_i}{\partial \theta_i}\):
\[ \frac{\partial \mu_i}{\partial \theta_i} = \frac{\partial b'(\theta_i)}{\partial \theta_i} = b''(\theta_i) = \frac{w_i \text{Var}(y_i)}{\phi} \]
So we can put it all together (and cancel \(w_i\) and \(\phi\)):
\[ s_i(\beta) = (y_i - b'(\theta_i))\frac{w_i}{\phi} \frac{\phi}{w_i \text{Var}(y_i)} h'(\eta_i) \mathbf{x}_i = \frac{(y_i - b'(\theta_i))}{\text{Var}(y_i)} h'(\eta_i) \mathbf{x}_i \]
Now the total score is the sum of these components:
\[s(\beta)=\sum_{i=1}^n \frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)} = {\bf X}^T {\bf D} \Sigma^{-1}({\bf y}-{\mathbf \mu})\]
where \(\Sigma=\text{diag}(\text{Var}(Y_i))\) and \({\bf D}=\text{diag}(h'(\eta_i))\) (derivative wrt \(\eta_i\)).
Observe that \(s(\beta)=0\) only depends on the distribution of \(Y_i\) through \(\mu_i\) and \(\text{Var}(Y_i)\).
Canonical link:
\[ s(\beta)=\sum_{i=1}^n \frac{(y_i-\mu_i){\bf x}_i w_i}{\phi} \]
since \(\frac{\partial \mu_i}{\partial \eta_i}=b''(\theta_i)\).
The exprect Fisher information matrix is \(F_{[h,l]}(\boldsymbol{\beta}) =\text{E}(\mathbf{s}_i(\boldsymbol{\beta}) \mathbf{s}_i'(\boldsymbol{\beta}))\):
\[ \begin{aligned} F_{i}(\boldsymbol{\beta}) &=\text{E}(\mathbf{s}_i(\boldsymbol{\beta}) \mathbf{s}_i'(\boldsymbol{\beta})) \\ &= E \left( \frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)} \left(\frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)}'\right)\right) \\ &= {\bf x}_i {\bf x}'_i E \left( \frac{(y_i-\mu_i)^2 (h'(\eta_i))^2}{\text{Var}^2(Y_i)}\right) \\ &= {\bf x}_i {\bf x}'_i E(y_i-\mu_i)^2 \frac{ (h'(\eta_i))^2}{\text{Var}^2(Y_i)} \\ &= {\bf x}_i {\bf x}'_i \frac{ (h'(\eta_i))^2}{\text{Var}(Y_i)} \end{aligned} \]
Or
\[F_{[h,l]}(\beta)=\sum_{i=1}^n \frac{x_{ih}x_{il}(h'(\eta_i))^2}{\text{Var}(Y_i)}\]
or
\[F(\beta)={\bf X}^T {\bf W} {\bf X}\]
where \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).
Canonical link:
\[ \frac{\partial^2 l_i}{\partial \beta_j \partial \beta_l}=- \frac{x_{ij} w_i}{\phi}(\frac{\partial \mu_i}{\partial \beta_l}) \] which do not contain any random variables, so the observed must be equal to the expected Fisher information matrix.
STILL UNDER CONSTRUCTION. WE HAVE NOT CONVERGED ON A SOLUTION YET
Details on the derivation: IWLS
IWLS uses the Newton-Raphson algorithm, via Fisher scoring. The idea is to replace the matrix of derivatives, the observed Fisher information, by the expected Fisher information. For a canonical link these are the same, of course. The reason this is called iterated reweighted least squares is that each iteration is a weighted least squares calculation. To remind you, for known weights the weighted least squares solution is \(\hat{\beta} = (X' W X)^{-1} X'W{\bf y}\), where \(W\) is a diagonal matrix of known weights (if there are off-diagonal terms, the name changes to generalised least squares). So we will be looking for something that looks like this.
In general, Newton-Raphson solves the equation \(f(x)=0\) for \(x\). It does this by starting with an initial guess, \((x^{(0)}, f(x^{(0)}\), finding the tangent and extrapolating this to the x-axis to find \(x^{(1)}\). Then from \((x^{(1)}, f(x^{(1)}))\) it does the same thing, and continues until \(f(x^{(t+1)})\) is close enough to 0.
In practice this means that at each iteration we start with \(x^{(t)}\), and update with \(x^{(t+1)} = x^{(t)} - \frac{f(x^{(t)})}{f'(x^{(t)})}\). We keep doing this until the change is sm all enough, i.e. \(|f(x_{t+1})-f(x_{t})| < \epsilon\) for some \(f()\) (e.g. the deviance).
For a GLM we need to solve \(s(\boldsymbol{\beta})=0\). We now have a vector we want to solve, but the idea is still the same. Of course, for a GLM \(f'(x)\) is the observed Fisher information. But it can be easier to use the expected Fisher information, \(F(\boldsymbol{\beta}) = {\bf X}^T {\bf W} {\bf X}\). So we get
\[ \begin{aligned} \boldsymbol{\beta}^{(t+1)}&=\boldsymbol{\beta}^{(t)} + F(\boldsymbol{\beta}^{(t)})^{-1} s(\boldsymbol{\beta}^{(t)}) \\ &= \boldsymbol{\beta}^{(t)} + ({\bf X}^T {\bf W(\boldsymbol{\beta}^{(t)})} {\bf X})^{-1} {\bf X}^T {\bf D}(\boldsymbol{\beta}^{(t)}) \Sigma^{-1}(\boldsymbol{\beta}^{(t)})({\bf y}-{\mathbf \mu(\boldsymbol{\beta}^{(t)})}) \\ \end{aligned} \]
Where now we have to write \({\bf W(\boldsymbol{\beta}^{(t)})}\), \({\bf D}(\boldsymbol{\beta}^{(t)})\) and \(\Sigma^{-1}(\boldsymbol{\beta}^{(t)})\) because they all depend on \(\boldsymbol{\beta}\), and this changes every iteration. To make it easier to read, we will drop the \((\boldsymbol{\beta}^{(t)})\) for now, and come back to it later.
From above, \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\), \(\Sigma=\text{diag}(\text{Var}(Y_i))\) and \({\bf D}=\text{diag}(h'(\eta_i))\) (derivative wrt \(\eta_i\)). We can thus write \({\bf W}= \Sigma^{-1}{\bf D}^2\), i.e. \({\bf D} \Sigma^{-1} = {\bf W}{\bf D}^{-1}\) (diagonal matrices are so cooperative), and get
\[ \begin{aligned} \boldsymbol{\beta}^{(t+1)}&=\boldsymbol{\beta}^{(t)} + F(\boldsymbol{\beta}^{(t)})^{-1} s(\boldsymbol{\beta}^{(t)}) \\ &= \boldsymbol{\beta}^{(t)} + ({\bf X}^T {\bf W} {\bf X})^{-1} {\bf X}^T {\bf W} {\bf D}^{-1}({\bf y}-{\mathbf \mu}) \\ \end{aligned} \]
We now have a \(({\bf X}^T {\bf W} {\bf X})^{-1} {\bf X}^T {\bf W}\) term, of the sort that would appear in weighted least squares. But we need to pre-multiply everything by it. Without looking down, can you see the trick to do this?
No? OK then.
We can pre-multiply \(\boldsymbol{\beta}^{(t+1)} = I \boldsymbol{\beta}^{(t+1)}\) and use \(I = ({\bf X}^T {\bf W} {\bf X})^{-1} ({\bf X}^T {\bf W} {\bf X})\):
\[ \begin{aligned} \boldsymbol{\beta}^{(t+1)}&=\boldsymbol{\beta}^{(t)} + F(\boldsymbol{\beta}^{(t)})^{-1} s(\boldsymbol{\beta}^{(t)}) \\ &= ({\bf X}^T {\bf W} {\bf X})^{-1} ({\bf X}^T {\bf W} {\bf X}) \boldsymbol{\beta}^{(t)} + ({\bf X}^T {\bf W} {\bf X})^{-1} {\bf X}^T {\bf W} {\bf D}^{-1}({\bf y}-{\mathbf \mu}) \\ &= ({\bf X}^T {\bf W} {\bf X})^{-1} ({\bf X}^T {\bf W}) ({\bf X} \boldsymbol{\beta}^{(t)} + {\bf D}^{-1}({\bf y}-{\mathbf \mu})) \end{aligned} \]
Which is the same as weighed least squares with \(W = \text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\), and \({\bf \tilde{y}} = {\bf X} \boldsymbol{\beta}^{(t)} + {\bf D}^{-1}({\bf y}-{\mathbf \mu})\)
Inserting the dependence on \(\boldsymbol{\beta}(t)\), we get
\[\beta^{(t+1)}=({\bf X}^T {\bf W}(\beta^{(t)}) {\bf X})^{-1} {\bf X}^T {\bf W}(\beta^{(t)})\bf{\tilde y}_i^{(t)}\]
where
In practice, we use our current value of \(\boldsymbol{\beta}^{(t)}\) to calculate the working weights and working response vector, put those into the weighted least squares, get a new value of \(\boldsymbol{\beta}^{(t+1)}\), and re-calculate the working weights and working response vector etc. We stop when the change in deviance between iterations is small (as a default R uses \(\frac{|\text{Dev}(t+1) - \text{Dev}(t)|}{|\text{Dev}(t+1)| +0.1}<10^{-8}\)).
With full rank of \({\bf X}\) and positive diagonal elements of \({\bf W}\) we are certain that the inverse will exist, so the algorithm will converge, but there might be that the temporary version of \({\bf W}\) can cause problems.
See what is output from glm- observe working weights as
weights..
fitgrouped = glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
names(fitgrouped)
fitgrouped$weights
fitgrouped$residuals
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
## 1 2 3 4 5 6 7 8
## 3.254867 8.227383 14.321313 13.378893 10.261055 5.156671 2.653398 1.230713
## 1 2 3 4 5 6
## 0.78115418 0.38388091 -0.31082206 -0.44081641 0.18557365 -0.05641516
## 7 8
## 0.67002811 1.02139898
Let data be grouped as much as possible. With G groups (covariate pattern) with \(n_i\) observations for each group (then \(n=\sum^G n_i=n\)):
\[ \hat{\phi}=\frac{1}{G-p} \sum_{i=1}^G \frac{(y_i-{\hat{\mu}}_i)^2}{b''(\theta_i)/w_i}\]
The motivation behind this estimator is as follows: \[ \text{Var}(Y_i)=\phi b''(\theta_i)/w_i \Leftrightarrow \phi=\text{Var}(Y_i)/(b''(\theta_i)/w_i)\]
As before we have that maximum likelihood estimator \(\hat{\beta}\) asymptotically follows the multivariate normal distribution with mean \(\beta\) and covariance matrix equal to the inverse of the expected Fisher information matrix. This is also true when we replace the unknown \(\beta\) with the estimated \(\hat{\beta}\) for the expected Fisher information matrix.
\[\hat{\beta} \approx N_p(\beta,F^{-1}(\hat{\beta}))\] and with \[F(\hat{\beta})={\bf X}^T \hat{{\bf W}} {\bf X}\] where \(\hat{{\bf W}}\) denotes that \(\hat{\beta}\) is used then calculating \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).
Same as before - for the Wald we insert the formula for the covariance matrix of \(\hat{\beta}\), for the LRT we insert the loglikelihoods and for the score test we insert formulas for the score function and expected Fisher information matrix.
Group observations together in groups of maximal size (covariate patterns? interval versions thereof?). Group \(i\) has \(n_i\) observations, and there are \(G\) groups. Asymptotic distribution correct if all groups have big \(n_i\). For the non-continuous individual data asymptotic results can not be trusted.
Deviance \[D=-2[\sum_{i=1}^{g}(l_i(\hat{\mu}_i)-l_i(\bar{y}_i))]\] with approximate \(\chi^2\)-distribution with \(G-p\) degrees of freedom.
Pearson: \[ X_P^2=\sum_{i=1}^G \frac{(y_i-\hat{\mu}_i)^2}{v(\hat{\mu}_i)/w_i}\] with approximate \(\phi \cdot \chi^2\)-distribution with \(G-p\) degrees of freedom.
Remember that the variance function \(v(\hat{\mu}_i)=b''(\theta_i)\) (this is a function of \(\mu_i\) because \(\mu_i=b'(\theta_i)\)).
Let \(p\) be the number of regression parameters in our model. \[\text{AIC} =-2 \cdot l(\hat{\beta})+2p\] If the dispersion parameter is estimated use \((p+1)\) in place of \(p\).
Work with Problem 1 and 2 in IL, and work on Problems 3-5 by yourself.
If you have more time after Problem 1 and 2, look through the theoretical proofs and derivations listed under “Learning material” on the top of this Module page.
a) Define the class of generalized linear models (GLMs), and explicitly list all requirements for each part of the model.
b) Below are three likelihoods, three link functions, and three linear components listed. Explain which combinations that give valid GLMs (8 in total), and also comment on these models (you do not have to mathematically prove which are valid).
Likelihoods:
Link functions:
Linear components:
(New) The matrix \({\bf W}\) is a diagonal matrix. What is on the diagonal?
Calulate the elements of \({\bf W}\) for a Poisson regression- both with log and identity link. Compare.
Calulate the elements of \({\bf W}\) for a binary regression - both with logit and identity link. Compare.
(New) Which insight did this give you into the role of the link function and its effect on the covariance for the parameter estimates?
Let \(Y_1,Y_2,\ldots,Y_N\) be independent and exponentially distributed random variables, where \(Y_i\) has the density
\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} \text{ for } y_i>0, \alpha_i>0, i=1,2,\ldots,N.\]
a) Show that the distribution of \(Y_i\) comes from the exponential family. Use the general formulas to find E(\(Y_i\)) and \(\text{Var}(Y_i)\) as functions of \(\alpha_i\).
b) Show that the log-likelihood for the data \(y_1, \dots, y_n\) can be written as
\[ l = \sum_{i=1}^N \{-\alpha_i y_i + \ln \alpha_i\} \]
Use this to show that the deviance for a generalized linear model with estimated expectations \(\hat{\mu_i} = \hat{y_i}\) is
\[ D = 2 \sum _{i=1}^N \left\{ \frac{y_i-\hat y_i}{\hat y_i} - \ln \left(\frac{y_i}{\hat y_i} \right) \right\}\]
d) We want to test the null hypothesis
\[H_0: \ \alpha_1 = \alpha_2 = \cdots = \alpha_N = \alpha \]
against the alternative hypothesis that at least one \(\alpha_i\) is different from the
others.
Use b) to find a test statistic for this problem.
What distribution does this test statistic have under \(H_0\)?
New: could you also use the Wald test? Explain how to do that (no need to calculated the mathematically explicit test statistic).
Assume that \(Y_1, \dots, Y_N\) are independent continuous distributed random variables, where the density of \(Y_i\) is given by
\[ f(y_i; \gamma_i) = \begin{cases} \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} &\text{ for } y_i \geq 0 \\ 0 &\text{ else} \end{cases}\]
where \(\gamma_i\) is a scalar parameter.
a) Show that the distribution of \(Y_i\) comes from the exponential family. Hint: usually we choose to let \(\phi=\frac{1}{2}\).
Use the general formulas to show that E\((Y_i) = 2/\gamma_i\) and Var\((Y_i) = 2/\gamma_i^2\).
Assume a GLM for \(Y_1, \dots, Y_N\) where the distribution of \(Y_i\) is as above for all \(i\), with the following link function:
\[\eta = g(\mu) = \ln(\mu) = x^T\beta\]
where \(x, \ \beta \in {\rm I\!R}^p\) and \(\mu\) is the expected value of \(y\).
b) Use general formulas to find the score vector
\(s(\beta) = [s_1(\beta), \dots,
s_p(\beta)]^T\) and the expected Fisher information matrix \(F(\beta) = [F_{ij}(\beta)]_{i,j = 1}^p\),
expressed using \(y_1, \dots, y_N, \beta,
N\) and the covariates \(x\).
Write down the equation that can be used to find the MLE for \(\beta\). Note that this is a recursive
equation.
c) Write down the log-likelihood for the model
above. Use this to find the deviance \(D\) for the model as a function of \(y_1, \dots, y_N\) and \(\hat{y_1}, \dots, \hat{y_N}\), where \(\hat{y_i}\) is the estimated expected value
of \(y_i\).
Find an expression for the deviance residuals \(d_i\) using \(y_i\) and \(\hat{y_i}\).
We assume that the random variable \(\Lambda\) is gamma distributed with pdf
\[f(\lambda; \nu, \mu) = \frac{(\nu/\mu)^\nu}{\Gamma(\nu)} \lambda^{\nu-1} e^{-\lambda/\mu}; \ \lambda > 0\]
and further that given \(\Lambda = \lambda\), the random variable \(Y\) is Poisson distributed with parameter \(\lambda\). Thus the conditional pmf of \(Y\) given \(\Lambda = \lambda\) takes the form
\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]
a) Show that the marginal pmf of \(Y\) is given by \[p(y; \mu, \nu) = \frac{\Gamma(y+\nu)}{\Gamma(\nu)\Gamma(y+1)} \left(\frac{\mu}{\mu+\nu}\right)^y \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y = 0, 1, 2, \dots\] This is the negative binomial distribution.
We then assume that the parameter \(\nu\) is fixed, and consider the random variable \(Y^* = Y/\nu\). Note that
\[\text{P}(Y^* = y^*) = \text{P}(Y = ky^*) \ \text{for} \ y^* = 0, \frac{1}{k}, \frac{2}{k}, \dots\]
so \(Y^*\) has pmf
\[p^*(y^*; \mu, \nu) = \frac{\Gamma(\nu y^* + \nu)}{\Gamma(\nu)\Gamma(\nu y^* + 1)} \left(\frac{\mu}{\mu+\nu}\right)^{\nu y^*} \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y^* = 0, \frac{1}{k}, \frac{2}{k}, \dots\]
b) Show that the pmf of \(Y^*\) is an exponential family \[\exp\left\{\frac{y \theta - b(\theta)}{\phi}w + c(y, \phi, w)\right\},\] with \(\theta = \log(\mu/(\mu+\nu))\), \(b(\theta) = -\log(1 - e^{\theta})\), \(w = 1\) and \(\phi = 1/\nu\)
c) Use the expressions for \(b(\theta)\) and \(\phi\) to determine \(\text{E}(Y^*)\) and \(\text{Var}(Y^*)\). Show that \(\text{E}(Y) = \mu\) and find \(\text{Var}(Y)\).
One of the important concepts we have discussed in this course, is deviance (for Gaussian regression, Poisson regression and logistic regression).
Explain what deviance is, and how it relates to residual sum of squares (RSS) for Gaussian regression. Remark 2017/2018: we have called this “sums of squares of errors - SSE”
Discuss how it relates to a likelihood ratio test (LRT) for comparing two nested regression models.
Discuss how deviance can be used to construct “ANOVA” tables for Poisson regression and logistic regression. Remark 2017/2018: these are called analysis of deviance tables.
Discuss how deviance can be used to define residuals, for Poisson regression and logistic regression.