Topics

  • random component: exponential family
    • elements: \(\theta\), \(\phi\), \(w\), \(b(\theta)\)
    • elements for normal, binomial, Poisson and gamma
    • properties: \(\text{E}(Y)=b'(\theta)\) and \(\text{Var}(Y)=b''(\theta)\frac{\phi}{w}\) (and proof)
  • systematic component = linear predictor
    • requirements: full rank of design matrix
  • link function and response function
    • link examples for normal, binomial, Poisson and gamma
    • requirements: one-to-one and twice differentiable
    • canonical link
  • likelihood inference set-up: \(\theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \mathbf{\beta}\)
  • the loglikelihood
  • the score function
  • expected Fisher information matrix for the GLM and covariance for \(\hat{\beta}\)
    • what about covariance of \(\hat{\beta}\) when \(\phi\) needs to be estimated?
    • estimator for dispersion parameter
  • Fisher scoring and iteratively reweighted least squares (IRWLS)
  • Pearson and deviance statistic
  • AIC

– so, for the first time: no practical examples or data sets to be analysed!

Interactive session

Work with Problems 1 and 2 in IL, and work on Problems 3-5 by yourself.

If you have more time after Problems 1 and 2, look through the theoretical proofs and derivations listed under “Learning material” at the top of this module page.

Problem 1: Exam 2011, problem 3

  1. Define the class of generalized linear models (GLMs), and explicitly list all requirements for each part of the model.
Hint There are 3 parts to a GLM, and each has at least one requirement
Answer

Likelihoods:

  1. Independent observations
  2. Member of the exponential family, in canonical form.
  3. Should be of the same kind for all observations, but with possibly different expectation/parameter.

Link functions:

  1. Monotone
  2. Differentiable

Linear component:

  1. Linear in parameters
  1. Below, three likelihoods, three link functions, and three linear components are listed. Explain which combinations give valid GLMs (8 in total), and also comment on these models (you do not have to mathematically prove which are valid).

Likelihoods:

  1. Gaussian, \(Y \sim N(\mu, \sigma^2)\)
  2. Binomial, \(Y \sim Bin(n, \pi)\), where \(n\) is not fixed (hence is unknown and must be estimated)
  3. Poisson, \(Y \sim Poisson(\lambda)\)

Link functions:

  1. \(\eta = \cos(\mu)\)
  2. \(\eta = \mu\)
  3. \(\eta = \log(\mu)\)

Linear components:

  1. \(\eta = \beta_0 + \beta_1x_1 + \beta_2x_2\)
  2. \(\eta = \beta_0 + \beta_1x_1 + \beta_2x_1^2\)
  3. \(\eta = \beta_0 + \beta_1x_1 + \beta_1^2x_2\)
Answer
  1. (Gaussian) and 3. (Poisson) are ok, as they are from the exponential family of distributions, but 2. (binomial with unfixed \(n\)) is not: we get a problem with the binomial coefficient, which is non-linear in \(y\) (so it needs to go into \(c(y, \phi, w)\)) but includes the unknown parameter \(n\) (so it cannot go into \(c(y, \phi, w)\) after all). Note that this problem is solved by fixing \(n\), which we do in logistic regression. There are methods to estimate \(n\) in binomial distributions, but they require specific experimental designs and tend to work badly.

Exponential family: \(f(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} w + c(y, \phi, w)\right)\) (note that \(\theta\) can be a vector)

Binomial distribution: \(f(y) = \binom{n}{y} \pi^y(1-\pi)^{n-y} = \exp\left( \ln \binom{n}{y} + y \ln \frac{\pi}{1-\pi} + n \ln (1-\pi) \right)\)

Link functions: 1. (\(\cos(\mu)\)) is not monotone and thus not a valid link function. The other two, 2. (\(\mu\)) and 3. (\(\log(\mu)\)), are valid.

Linear component: The third linear component is not linear in \(\beta_1\) and thus not valid. The others are linear in \(\mathbf{\beta}\) and are valid (the predictor does not have to be linear in the covariates \(x_i\)).

Thus we have two valid linear components that can be used with any model, which leaves four alternatives for response and link (and in total 8 different model combinations); a small simulated sketch after the list below illustrates the four response/link pairs:

  • Gaussian response with identity link: Common linear regression
  • Gaussian response with log-link: gives a model where the expected value is always positive (although individual observations may still be negative); used only in special situations
  • Poisson response with identity link: Poisson requires positive expectation, and with this link we need restrictions on the parameters to ensure the positive expectation
  • Poisson response with log-link: Common model
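
Although this module has no data sets to be analysed, the four valid response/link pairs can be made concrete with a small simulated sketch. Everything below (the data, the coefficient values, and the use of the statsmodels package) is an illustration of my own, not part of the exam problem.

```python
# Sketch (my own simulated data, not part of the exam problem): fitting the four valid
# likelihood/link combinations with statsmodels. The capitalised link names assume a
# recent statsmodels version (older versions use lowercase, e.g. links.identity()).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(0, 1, n)
X = sm.add_constant(np.column_stack([x1, x2]))   # linear component beta0 + beta1*x1 + beta2*x2

mu = 2.0 + 1.0 * x1 + 0.5 * x2                   # positive mean, so identity links behave well
y_gauss = mu + rng.normal(scale=0.3, size=n)     # Gaussian response
y_pois = rng.poisson(mu)                         # Poisson response

combos = {
    "Gaussian + identity": (y_gauss, sm.families.Gaussian(sm.families.links.Identity())),
    "Gaussian + log": (y_gauss, sm.families.Gaussian(sm.families.links.Log())),
    "Poisson + identity": (y_pois, sm.families.Poisson(sm.families.links.Identity())),
    "Poisson + log": (y_pois, sm.families.Poisson(sm.families.links.Log())),
}
for name, (y, family) in combos.items():
    fit = sm.GLM(y, X, family=family).fit()
    print(name, np.round(fit.params, 2))
```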

Problem 2: December 2005, Problem 2 (modified)

  1. Derive the formula for the (scaled) deviance for the binomial distribution.
Answer

The likelihood for a single datum is

\[ f(y) = \text{P}(Y=y) = \binom{n}{y} \pi^y (1-\pi)^{n-y} \]

So the log likelihood is

\[ \log(f) = \log \binom{n}{y} + y\log(\pi) + (n-y)\log(1-\pi) = \log \binom{n}{y} + y \log\left(\frac{\pi}{1-\pi}\right) + n\log(1-\pi) \]

For \(N\) data points the log-likelihood is \(l(\mu, y) = \sum_{i=1}^N \log f(y_i)\), and the (scaled) deviance is \(D^*(y, \hat{\mu}) = -2\left(l(\hat{\mu}, y) - l(y, y)\right)\), so

\[ \begin{aligned} D^*(y, \hat{\mu}) &= 2 l(y, y) - 2 l(\hat{\mu}, y) \\ &= 2 \sum\left[y_i \log\left( \frac{y_i}{n_i-y_i}\right) + n_i \log\left( \frac{n_i-y_i}{n_i}\right) - y_i \log\left( \frac{\frac{\hat{\mu}_i}{n_i}}{\frac{n_i-\hat{\mu}_i}{n_i}}\right) - n_i \log\left( \frac{n_i - \hat{\mu}_i}{n_i}\right) \right] \\ &= 2 \sum\left[y_i \log\left( \frac{y_i}{\hat{\mu}_i}\right) + n_i \log\left( \frac{n_i-y_i}{n_i - \hat{\mu}_i}\right) - y_i \log\left( \frac{n_i - y_i}{n_i - \hat{\mu}_i}\right) \right] \end{aligned} \]
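
Collecting the last two terms gives the usual form \(D^*(y, \hat{\mu}) = 2\sum\left[y_i \log\left(\frac{y_i}{\hat{\mu}_i}\right) + (n_i - y_i)\log\left(\frac{n_i - y_i}{n_i - \hat{\mu}_i}\right)\right]\). As a quick numerical sanity check of the result (a sketch with made-up numbers, not exam material):

```python
# Sketch: numerical check that the closed-form (scaled) binomial deviance equals
# 2*(l_saturated - l_fitted). The counts y, group sizes n and fitted means mu_hat
# are made up for illustration (chosen away from 0 and n to avoid log(0)).
import numpy as np
from scipy.stats import binom

y = np.array([3.0, 7.0, 2.0, 9.0])
n = np.array([10.0, 12.0, 8.0, 15.0])
mu_hat = np.array([4.0, 6.5, 2.5, 8.0])          # fitted expectations n_i * pi_hat_i

def loglik(mu):
    return binom.logpmf(y, n, mu / n).sum()      # binomial coefficients cancel in the difference

D_direct = 2 * (loglik(y) - loglik(mu_hat))      # saturated (mu = y) minus fitted
D_formula = 2 * np.sum(y * np.log(y / mu_hat)
                       + (n - y) * np.log((n - y) / (n - mu_hat)))
print(D_direct, D_formula)                       # the two agree
```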

  1. The covariance matrix for the estimated coefficients is given as \(\text{Cov}(\hat{\beta})=({\bf X}^T{\bf W}{\bf X})^{-1}\) where \({\bf X}\) is the design matrix.
  1. (New) The matrix \({\bf W}\) is a diagonal matrix. What is on the diagonal?
Answer

\[ w_{ii} = \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2 \Big/\text{Var}(y_i) \]

  1. Calculate the elements of \({\bf W}\) for a Poisson regression, both with log and identity link. Compare.
Hint For the canonical link, \(\theta = \eta\); for the identity link, \(\mu = \eta\)
Answer

Identity link:

\[ \mu = \eta \implies \frac{\partial \mu}{\partial \eta} = 1 \implies w_{ii} = \frac{1}{\text{Var}(y_i)} = \frac{1}{\mu_i} \]

Log link:

\[ \mu = e^{\eta} \implies \frac{\partial \mu}{\partial \eta} = e^{\eta} \implies w_{ii} = \frac{e^{2 \eta_i}}{e^{\eta_i}} = e^{\eta_i} = \mu_i \]

Each is the inverse of the other.

  1. Calculate the elements of \({\bf W}\) for a binary regression - both with logit and identity link. Compare.
Answer

Identity link:

\[ \mu = \eta \implies \frac{\partial \mu}{\partial \eta} = 1 \implies w_{ii} = \frac{1}{n_i \pi_i (1-\pi_i)} \]

Logit link:

\[ \mu = n\frac{e^{\eta}}{1+e^{\eta}} \implies \frac{\partial \mu}{\partial \eta} = n\frac{e^{\eta}}{(1+e^{\eta})^2} = n \pi (1-\pi) \implies w_{ii} = \frac{(n_i \pi_i (1-\pi_i))^2}{n_i \pi_i (1-\pi_i)} = n_i \pi_i (1-\pi_i) \]

Again, each is the inverse of the other.
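
This reciprocal pattern is easy to verify numerically; the sketch below uses arbitrary illustrative values for \(\mu_i\), \(n_i\) and \(\pi_i\):

```python
# Sketch: diagonal weights w_ii = (dmu/deta)^2 / Var(y_i) for Poisson and binomial
# responses under the identity and canonical links; all parameter values are arbitrary.
import numpy as np

# Poisson: Var(y_i) = mu_i
mu = np.array([0.5, 2.0, 5.0])
w_pois_identity = 1.0 / mu                 # mu = eta    => dmu/deta = 1
w_pois_log = mu**2 / mu                    # mu = e^eta  => dmu/deta = mu
print(w_pois_identity * w_pois_log)        # elementwise product is 1 (reciprocal weights)

# Binomial: Var(y_i) = n_i * pi_i * (1 - pi_i)
n_i = np.array([10.0, 20.0, 30.0])
pi = np.array([0.2, 0.5, 0.8])
v = n_i * pi * (1 - pi)
w_bin_identity = 1.0 / v                   # mu = eta    => dmu/deta = 1
w_bin_logit = v**2 / v                     # dmu/deta = n_i*pi_i*(1-pi_i)
print(w_bin_identity * w_bin_logit)        # again, elementwise product is 1
```
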
  1. (New) What insight did this give you into the role of the link function and its effect on the covariance of the parameter estimates?
Answer The link function enters the weights through \(\left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2\), so different link functions give different weight matrices \({\bf W}\) and hence different covariance matrices for \(\hat{\beta}\): in particular, they change how the precision of the estimates depends on the mean.

Problem 3: Exam 2006, problem 2 (a, b, d)

Let \(Y_1,Y_2,\ldots,Y_N\) be independent and exponentially distributed random variables, where \(Y_i\) has the density

\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} \text{ for } y_i>0, \alpha_i>0, i=1,2,\ldots,N.\]

  1. Show that the distribution of \(Y_i\) comes from the exponential family. Use the general formulas to find E(\(Y_i\)) and \(\text{Var}(Y_i)\) as functions of \(\alpha_i\).
Answer

\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} = e^{-\alpha_i y_i + \ln \alpha_i} \]

which comes from the exponential family:

\[ f(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} w + c(y, \phi, w)\right) = \exp\Big(\theta y_i -\ln(-1/\theta)\Big) \]

with \(\phi = 1\), \(w = 1\), \(c(y, \phi, w) = 0\), \(\theta = - \alpha_i\), and \(b(\theta) = - \ln (\alpha_i) = - \ln(-\theta)\).
Answer for E(\(Y_i\)) and \(\text{Var}(Y_i)\)

Expected value: E\((Y_i) = b'(\theta) = - \frac{1}{\theta} = \frac{1}{\alpha_i}\)

The variance: Var\((Y_i) = b''(\theta) \frac{\phi}{w} = \frac{1}{\theta^2} = \frac{1}{\alpha_i^2}\)

Note that you could also use \(\theta = \alpha_i\), \(b(\theta) = \ln(\theta) = \ln(\alpha_i)\), \(c(y, \phi, w) = 0\), and either \(\phi\) or \(w\) equal to \(-1\) and the other equal to \(1\), so that \(\frac{\phi}{w} = -1\). Then E\((Y_i) = b'(\theta) = \frac{1}{\theta} = \frac{1}{\alpha_i}\) and Var\((Y_i) = b''(\theta)\frac{\phi}{w} = -b''(\theta) = \frac{1}{\theta^2} = \frac{1}{\alpha_i^2}\). These two parameterizations are equivalent, which shows that the parameterization is not necessarily unique! However, some choices are the standard ones: for the Gamma (and thus its special case, the exponential) we use the first one, where \(\phi = w = 1\).
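
If you want to check the two general formulas mechanically, a small symbolic sketch (my own addition, e.g. with sympy) does the differentiation for the standard parameterization:

```python
# Sketch: symbolic check of E(Y) = b'(theta) and Var(Y) = b''(theta)*phi/w for the
# exponential distribution, using the standard parameterization (phi = w = 1,
# theta = -alpha, b(theta) = -ln(-theta)).
import sympy as sp

theta = sp.symbols("theta", negative=True)
alpha = sp.symbols("alpha", positive=True)

b = -sp.log(-theta)
EY = sp.diff(b, theta)                     # b'(theta)  = -1/theta
VarY = sp.diff(b, theta, 2)                # b''(theta) =  1/theta^2, and phi/w = 1

print(sp.simplify(EY.subs(theta, -alpha)))    # 1/alpha
print(sp.simplify(VarY.subs(theta, -alpha)))  # alpha**(-2)
```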

  1. Show that the log-likelihood for the data \(y_1, \dots, y_N\) can be written as

\[ l = \sum_{i=1}^N \{-\alpha_i y_i + \ln \alpha_i\} \]

Answer

Per definition:

\[ l = \sum_{i=1}^N \ln f(y_i;\alpha_i) = \sum_{i=1}^N(-\alpha_iy_i + \ln\alpha_i) \]

Use this to show that the deviance for a generalized linear model with estimated expectations \(\hat{\mu_i} = \hat{y_i}\) is

\[ D = 2 \sum _{i=1}^N \left\{ \frac{y_i-\hat y_i}{\hat y_i} - \ln \left(\frac{y_i}{\hat y_i} \right) \right\} \]

Hint The deviance is \(D = 2(l_{saturated}-l_{model})\), where a saturated model has one parameter for each data point.
Answer

The deviance is

\[ D = 2(l_{saturated}-l_{model}) \]

Saturated model: All \(\alpha_i\) are different, so \(\frac{\partial l}{\partial \alpha_i} = -y_i + 1/\alpha_i\). MLE: \(-y_i + 1/\alpha_i = 0 \ \implies \ \hat{\alpha_i} = 1/y_i\) and thus

\[ l_{saturated} = \sum_{i=1}^N(-1 + \ln(1/y_i)) \]

In GLM models we estimate the mean \(\mu_i\) by \(\hat{\mu_i}\) (we do not need its explicit expression here), which we call \(\hat{y_i}\), and we have

\[ l_{model} = \sum_{i=1}^N \left( -\frac{y_i}{\hat{y_i}} +\ln \frac{1}{\hat{y_i}} \right) = \sum_{i=1}^N \left( -\frac{y_i}{\hat{y_i}} -\ln \hat{y_i} \right) \]

Then we get the deviance

\[ D = 2\sum_{i=1}^N \left(-1-\ln(y_i) + \frac{y_i}{\hat{y_i}} + \ln \hat{y_i} \right) = 2\sum_{i=1}^N \left( \frac{y_i-\hat{y_i}}{\hat{y_i}} - \ln\frac{y_i}{\hat{y_i}} \right) \]
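
A quick numerical sanity check of this closed form against \(2(l_{saturated} - l_{model})\) computed directly from the log-likelihood (a sketch with made-up numbers):

```python
# Sketch: numerical check that D = 2*sum((y-yhat)/yhat - ln(y/yhat)) equals
# 2*(l_saturated - l_model) for exponential data; the values below are made up.
import numpy as np

y = np.array([0.8, 2.3, 1.1, 4.0, 0.5])
y_hat = np.array([1.0, 2.0, 1.5, 3.0, 0.7])      # some fitted expectations mu_hat_i

def loglik(mu):
    alpha = 1.0 / mu                             # alpha_i = 1/mu_i
    return np.sum(-alpha * y + np.log(alpha))

D_direct = 2 * (loglik(y) - loglik(y_hat))       # saturated (mu = y) minus model
D_formula = 2 * np.sum((y - y_hat) / y_hat - np.log(y / y_hat))
print(D_direct, D_formula)                       # the two agree
```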

  1. We want to test the null hypothesis

\[H_0: \ \alpha_1 = \alpha_2 = \cdots = \alpha_N = \alpha \]

against the alternative hypothesis that at least one \(\alpha_i\) is different from the others.
Use b) to find a test statistic for this problem.
What distribution does this test statistic have under \(H_0\)?

Hint You will need to estimate \(\alpha\), but it does have a simple closed-form solution.
Answer

\(H_0\) corresponds to a GLM where all \(\alpha_i\) are equal to the same value, here denoted \(\alpha\). Then

\[ l = - \alpha \sum_{i=1}^N y_i + N \ln \alpha \]

Taking the derivative and setting it equal to zero gives

\[ \frac{dl}{d\alpha} = -\sum_{i=1}^N y_i + \frac{N}{\alpha} = 0 \implies \hat{\alpha} = \frac{N}{\sum_{i=1}^N y_i} = \frac{1}{\bar{y}} \]

Thus, \(\hat{\alpha}_i = \frac{1}{\bar{y}}\) for each \(i\), which means that \(\hat{\mu_i} = \hat{y_i} = \bar{y}\). Then the deviance becomes

\[ D = 2 \sum_{i=1}^N \left( {\frac{y_i - \bar{y}}{\bar{y}} - \ln \frac{y_i}{\bar{y}}} \right) \] which is approximately \(\chi_{N-1}^2\) distributed under \(H_0\). \(N-1\) = no. of parameters in the saturated model minus no. of parameters in the model under \(H_0\). We reject \(H_0\) if the deviance \(D\) is large.
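
As a concrete illustration (not part of the exam solution), the test can be carried out numerically for a made-up data vector:

```python
# Sketch: the deviance test of H0: alpha_1 = ... = alpha_N for a made-up data vector.
import numpy as np
from scipy import stats

y = np.array([0.2, 1.7, 0.9, 2.8, 0.4, 1.1, 0.6, 3.5])   # illustrative observations
N = len(y)
ybar = y.mean()                                  # alpha_hat = 1/ybar, so mu_hat_i = ybar

D = 2 * np.sum((y - ybar) / ybar - np.log(y / ybar))
p_value = stats.chi2.sf(D, df=N - 1)             # upper tail: reject H0 for large D
print(D, p_value)
```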

New: could you also use the Wald test? Explain how to do that (no need to calculate the explicit test statistic mathematically).

Answer Yes, you could. You would need to fit the model where all the \(\alpha_i\) are allowed to differ, express \(H_0\) as linear restrictions on these parameters, and insert the estimated covariance matrix of \(\hat{\alpha}\) into the Wald statistic, which is then compared with a \(\chi^2\) distribution with degrees of freedom equal to the number of restrictions.

Problem 4: Exam 2007, problem 2 a, b, c

Assume that \(Y_1, \dots, Y_N\) are independent, continuously distributed random variables, where the density of \(Y_i\) is given by

\[ f(y_i; \gamma_i) = \begin{cases} \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} &\text{ for } y_i \geq 0 \\ 0 &\text{ else} \end{cases}\]

where \(\gamma_i\) is a scalar parameter.

  1. Show that the distribution of \(Y_i\) comes from the exponential family.
Hint Usually we choose to let \(\phi=\frac{1}{2}\).
Answer

This is the Gamma distribution with \(\nu = 2\) and \(\mu = 2/\gamma\).

\[ f(y_i; \gamma_i) = \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} = \exp\Big( -\gamma_i y_i + 2\ln\gamma_i + \ln y_i - \ln2 \Big) = \exp \left( \frac{y_i\left(-\frac{\gamma_i}{2}\right) + \ln\frac{\gamma_i}{2}}{1/2} + \ln y_i + \ln2 \right)\]

\[ f(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} w + c(y, \phi, w)\right) = \exp\Big( \frac{y_i(-\gamma_i) - (-2\ln\gamma_i)}{1}1 + (\ln y_i - \ln2) \Big) \\ = \exp\Big( \frac{y_i(\gamma_i) - (2\ln\gamma_i)}{-1}1 + (\ln y_i - \ln2) \Big) \] Also here we have several choices for parameterization. We choose the following (which is the standard for the Gamma distribution): \(w = 1\), \(\phi = 1/2\), \(c(y, \phi, w) = \ln y_i + \ln2\), \(\theta = -\gamma_i/2\) and \(b(\theta) = -\ln(-\theta)\) (note that the constant part of \(c\) depends on which parameterization is chosen).

(An alternative that might be simpler is: \(w = 1\), \(\phi = -1\), \(c(y, \phi, w) = \ln y_i - \ln2\), \(\theta = \gamma_i\), and \(b(\theta) = 2 \ln \theta\). The results will be the same, but this is not the standard way and we do not use this parameterization.)

Use the general formulas to show that E\((Y_i) = 2/\gamma_i\) and Var\((Y_i) = 2/\gamma_i^2\).

Answer

Expected value: E\((Y_i) = b'(\theta) = \frac{-1}{\theta} = \frac{2}{\gamma_i}\)

The variance: Var\((Y_i) = b''(\theta) \frac{\phi}{w} = \frac{1}{\theta^2}\cdot\frac{1}{2} = \frac{1}{2\theta^2} = \frac{2}{\gamma_i^2}\)
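
These values can be cross-checked against the Gamma distribution with shape \(\nu = 2\) and rate \(\gamma_i\) mentioned above, for instance with scipy (a sketch with an arbitrary \(\gamma\)):

```python
# Sketch: Y_i is Gamma with shape nu = 2 and rate gamma_i (scale 1/gamma_i), so the
# stated mean and variance can be read off scipy's Gamma; gamma_val is arbitrary.
from scipy import stats

gamma_val = 1.7
dist = stats.gamma(a=2, scale=1 / gamma_val)     # scipy uses shape a and scale = 1/rate
print(dist.mean(), 2 / gamma_val)                # both equal 2/gamma
print(dist.var(), 2 / gamma_val**2)              # both equal 2/gamma^2
```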

Assume a GLM for \(Y_1, \dots, Y_N\) where the distribution of \(Y_i\) is as above for all \(i\), with the following link function:

\[\eta = g(\mu) = \ln(\mu) = x^T\beta\]

where \(x, \ \beta \in {\rm I\!R}^p\) and \(\mu\) is the expected value of \(y\).

  1. Use general formulas to find the score vector \(s(\beta) = [s_1(\beta), \dots, s_p(\beta)]^T\) and the expected Fisher information matrix \(F(\beta) = [F_{ij}(\beta)]_{i,j = 1}^p\), expressed using \(y_1, \dots, y_N, \beta, N\) and the covariates \(x\).
Answer for score vector

Score: \(s(\beta) = [s_1(\beta), \dots, s_p(\beta)]^T\)

\[ s_j(\beta) = \sum_{i=1}^N \Big( \frac{y_i-\mu_i}{Var(Y_i)} x_{ij} \frac{\partial \mu_i}{\partial \eta_i} \Big) \]

\[\frac{\partial \eta}{\partial \mu} = 1/\mu \implies \frac{\partial \mu}{\partial \eta} = \mu = e^{x^T\beta} \]

Using that \(\ln \mu = x^T\beta\) and \(\mu = 2/\gamma\) we get \(\mu = 2/\gamma = e^{x^T\beta} \implies \gamma = 2e^{-x^T\beta}\). Using this and the expected value and variance from a), the score \(s_j(\beta)\) becomes

\[ s_j(\beta) = \sum_{i=1}^N \left[ \frac{y_i - 2/\gamma_i}{2/\gamma_i^2}x_{ij}e^{x_i^T\beta} \right] = \sum_{i=1}^N \left[ \frac{y_i - e^{x_i^T\beta}}{\frac{1}{2}e^{2x_i^T\beta}}x_{ij}e^{x_i^T\beta} \right] = 2\sum_{i=1}^N \left[ \frac{y_i - e^{x_i^T\beta}}{e^{x_i^T\beta}}x_{ij} \right] \]
Answer for Fisher Information

Fisher: \(F(\beta) = [F_{ij}(\beta)]_{i,j = 1}^p\)

The general formula for the expected Fisher information is:

\[ F_{jk}(\beta) = \sum_{i=1}^N \left[\frac{x_{ij} x_{ik}}{Var(Y_i)} \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2 \right] \]

Thus we get

\[ F_{jk}(\beta) = \sum_{i=1}^N \left[\frac{x_{ij} x_{ik}}{2/\gamma_i^2} \left(e^{x_i^T\beta}\right)^2 \right] = \sum_{i=1}^N \left[\frac{x_{ij} x_{ik}}{\frac{1}{2}e^{2x_i^T\beta}} \left(e^{x_i^T\beta}\right)^2 \right] = 2\sum_{i=1}^Nx_{ij}x_{ik} \]

which means that the full matrix can be written as \(F(\beta) = 2X^TX\). Note that it is not a function of \(\beta\).

Write down the equation that can be used to find the MLE for \(\beta\). Note that this is a recursive equation.

Answer

The recursive equation we use to find the MLE is

\[ \beta^{(m)} = \beta^{(m-1)} + F^{-1}s(\beta^{(m-1)}) \]
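
Here \(F\) does not depend on \(\beta\), so the same inverse can be reused in every step. Below is a minimal numerical sketch of the recursion for this model, on simulated data of my own choosing:

```python
# Sketch: Fisher scoring for this Gamma (nu = 2) GLM with log link, using the score
# s(beta) = 2*X^T((y - mu)/mu) and F(beta) = 2*X^T X derived above. The simulated
# data, sample size and true coefficients are my own illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
n, p = 300, 2
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, n)])  # intercept + one covariate
beta_true = np.array([0.4, -0.8])
mu = np.exp(X @ beta_true)
y = rng.gamma(shape=2, scale=mu / 2)             # E(y_i) = mu_i, Var(y_i) = mu_i^2/2

beta = np.zeros(p)                               # starting value
F_inv = np.linalg.inv(2 * X.T @ X)               # F(beta) = 2 X^T X is constant here
for _ in range(25):
    mu_hat = np.exp(X @ beta)
    score = 2 * X.T @ ((y - mu_hat) / mu_hat)    # s_j(beta) from above
    beta = beta + F_inv @ score                  # beta^(m) = beta^(m-1) + F^{-1} s(beta^(m-1))

print(beta_true, np.round(beta, 2))              # estimate should be close to the truth
```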

c) Write down the log-likelihood for the model above. Use this to find the deviance \(D\) for the model as a function of \(y_1, \dots, y_N\) and \(\hat{y_1}, \dots, \hat{y_N}\), where \(\hat{y_i}\) is the estimated expected value of \(y_i\).

Find an expression for the deviance residuals \(d_i\) using \(y_i\) and \(\hat{y_i}\).

Answer/Hint (not yet added)

Problem 5: Exam UiO December 2017, Problem 2

We assume that the random variable \(\Lambda\) is gamma distributed with pdf

\[f(\lambda; \nu, \mu) = \frac{(\nu/\mu)^\nu}{\Gamma(\nu)} \lambda^{\nu-1} e^{-\lambda\nu/\mu}; \ \lambda > 0\]

and further that given \(\Lambda = \lambda\), the random variable \(Y\) is Poisson distributed with parameter \(\lambda\). Thus the conditional pmf of \(Y\) given \(\Lambda = \lambda\) takes the form

\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]

  1. Show that the marginal pmf of \(Y\) is given by \[p(y; \mu, \nu) = \frac{\Gamma(y+\nu)}{\Gamma(\nu)\Gamma(y+1)} \left(\frac{\mu}{\mu+\nu}\right)^y \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y = 0, 1, 2, \dots\] This is the negative binomial distribution.
Answer/Hint (not yet added)
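
No derivation is written out yet, but the claimed marginal pmf can at least be verified numerically by integrating the Poisson pmf against the gamma density for \(\Lambda\) (with rate \(\nu/\mu\), so that \(\text{E}(\Lambda) = \mu\)). A sketch with arbitrary values:

```python
# Sketch: numerical check that mixing the Poisson pmf over the Gamma distribution for
# Lambda (shape nu, rate nu/mu) reproduces the stated negative binomial pmf.
# The values of mu, nu and y are arbitrary.
import numpy as np
from scipy import integrate, special, stats

mu, nu, y = 3.0, 2.5, 4

def integrand(lam):
    return stats.poisson.pmf(y, lam) * stats.gamma.pdf(lam, a=nu, scale=mu / nu)

p_mixture, _ = integrate.quad(integrand, 0, np.inf)

p_negbin = (special.gamma(y + nu) / (special.gamma(nu) * special.gamma(y + 1))
            * (mu / (mu + nu)) ** y * (nu / (mu + nu)) ** nu)
print(p_mixture, p_negbin)                       # the two agree
```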

We then assume that the parameter \(\nu\) is fixed, and consider the random variable \(Y^* = Y/\nu\). Note that

\[\text{P}(Y^* = y^*) = \text{P}(Y = \nu y^*) \ \text{for} \ y^* = 0, \frac{1}{\nu}, \frac{2}{\nu}, \dots\]

so \(Y^*\) has pmf

\[p^*(y^*; \mu, \nu) = \frac{\Gamma(\nu y^* + \nu)}{\Gamma(\nu)\Gamma(\nu y^* + 1)} \left(\frac{\mu}{\mu+\nu}\right)^{\nu y^*} \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y^* = 0, \frac{1}{\nu}, \frac{2}{\nu}, \dots\]

  1. Show that the pmf of \(Y^*\) is an exponential family \[\exp\left\{\frac{y \theta - b(\theta)}{\phi}w + c(y, \phi, w)\right\},\] with \(\theta = \log(\mu/(\mu+\nu))\), \(b(\theta) = -\log(1 - e^{\theta})\), \(w = 1\) and \(\phi = 1/\nu\)
Answer/Hint (not yet added)
  1. Use the expressions for \(b(\theta)\) and \(\phi\) to determine \(\text{E}(Y^*)\) and \(\text{Var}(Y^*)\). Show that \(\text{E}(Y) = \mu\) and find \(\text{Var}(Y)\).
Answer/Hint (not yet added)
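
No worked answer is given yet, but the stated construction can be cross-checked numerically: the pmf in a) is that of scipy's negative binomial with \(n = \nu\) and \(p = \nu/(\mu + \nu)\), whose built-in moments illustrate what the general formulas should give:

```python
# Sketch (cross-check only, not a derivation): scipy's negative binomial with
# n = nu and p = nu/(mu + nu) matches the pmf in part a); its built-in moments
# illustrate E(Y) = mu and the standard negative binomial variance mu + mu^2/nu.
from scipy import stats

mu, nu = 3.0, 2.5
dist = stats.nbinom(n=nu, p=nu / (mu + nu))
print(dist.mean(), mu)                           # both equal mu
print(dist.var(), mu + mu**2 / nu)               # both equal mu + mu^2/nu
```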

Exam questions

December 2015

One of the important concepts we have discussed in this course is deviance (for Gaussian regression, Poisson regression and logistic regression).

  1. Explain what deviance is, and how it relates to residual sum of squares (RSS) for Gaussian regression. Remark 2017/2018: we have called this “sums of squares of errors - SSE”

  2. Discuss how it relates to a likelihood ratio test (LRT) for comparing two nested regression models.

  3. Discuss how deviance can be used to construct “ANOVA” tables for Poisson regression and logistic regression. Remark 2017/2018: these are called analysis of deviance tables.

  4. Discuss how deviance can be used to define residuals, for Poisson regression and logistic regression.

Further reading

  • A. Agresti (2015): “Foundations of Linear and Generalized Linear Models.” Wiley.