(Latest changes: 11.10.2018, added links to handwritten materials and dispersion formula, 07.10.2018 first version)
Additional notes (with theoretical focus):
– so, for the first time: no practical examples or data sets to be analysed!
Jump to the interactive lecture.
In Module 1 we introduced distributions of the \(Y_i\) that can be written in the form of a univariate exponential family \[ f(y_i\mid \theta_i)=\exp \left( \frac{y_i \theta_i-b(\theta_i)}{\phi}\cdot w_i + c(y_i, \phi, w_i) \right) \] where we said that
\(\theta_i\) is called the canonical parameter and is a parameter of interest
\(\phi\) is called a nuisance parameter (it is not of interest to us and is therefore a nuisance)
\(w_i\) is a weight function, in most cases \(w_i=1\) (NB: it cannot contain any unknown parameters)
\(b\) and \(c\) are known functions.
We have seen:
Distribution | \(\theta\) | \(b(\theta)\) | \(\phi\) | \(w\) | \(\text{E}(Y)=b'(\theta)\) | \(b''(\theta)\) | \(\text{Var}(Y)=b''(\theta)\phi/w\) |
---|---|---|---|---|---|---|---|
normal | \(\mu\) | \(\frac{1}{2} \theta^2\) | \(\sigma^2\) | \(1\) | \(\mu=\theta\) | \(1\) | \(\sigma^2\) |
Bernoulli | \(\ln \left( \frac{p}{1-p} \right)\) | \(\ln (1+\exp(\theta))\) | \(1\) | \(1\) | \(p=\frac{\exp(\theta)}{1+\exp(\theta)}\) | \(p(1-p)\) | \(p(1-p)\) |
Poisson | \(\ln \mu\) | \(\exp(\theta)\) | \(1\) | \(1\) | \(\lambda=\exp(\theta)\) | \(\lambda\) | \(\lambda\) |
gamma | \(-\frac{1}{\mu}\) | \(-\ln (-\theta)\) | \(\frac{1}{\nu}\) | \(1\) | \(\mu=-1/\theta\) | \(\mu^2\) | \(\mu^2/\nu\) |
\(\text{E}(Y_i)=b'(\theta_i)\) and \(\text{Var}(Y_i)=b''(\theta_i)\frac{\phi}{w_i}\)
In class we study the handwritten proof together: Proof
\(b''(\theta_i)\) is often called the variance function \(v(\mu_i)\).
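As a small numerical sanity check of these formulas (not part of the original notes), we can simulate from one row of the table and compare the empirical mean and variance with \(b'(\theta)\) and \(b''(\theta)\phi/w\); below a Poisson sketch with the illustrative value \(\mu=3.7\), where \(\phi=w=1\).

```r
# Minimal numerical sketch (illustrative values): Poisson with theta = log(mu),
# b(theta) = exp(theta), so b'(theta) = b''(theta) = mu, and phi = w = 1.
set.seed(123)
mu <- 3.7
theta <- log(mu)
y <- rpois(1e6, lambda = mu)
c(empirical_mean = mean(y), b_prime = exp(theta))        # E(Y) = b'(theta)
c(empirical_var = var(y), b_doubleprime = exp(theta))    # Var(Y) = b''(theta)*phi/w
```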
Nothing new here - as always in this course: \(\eta_i={\bf x}_i^T \mathbf{\beta}\), and we require that the \(n \times p\) design matrix \({\bf X}\), with rows \({\bf x}_1^T, {\bf x}_2^T,\ldots, {\bf x}_n^T\), has full rank \(p\).
Remark: in this course we always assume that \(n \gg p\).
Link function \[\eta_i=g(\mu_i)\]
Response function \[\mu_i=h(\eta_i)\]
random component | response function and link function |
---|---|
normal | \(h(\eta_i) = \eta_i\) and \(g(\mu_i) = \mu_i\), “identity link”. |
binomial | \(h(\eta_i) = \frac{e^{\eta_i}}{1+e^{\eta_i}}\) and \(g(\mu_i) = \ln \left(\frac{\mu_i}{1-\mu_i} \right) = \text{logit}(p_i)\). NB: \(\mu_i = p_i\) in our set-up. |
Poisson | \(h(\eta_i) = \exp(\eta_i)\) and \(g(\mu_i) = \ln(\mu_i)\), log-link. |
gamma | \(h(\eta_i) = -\frac{1}{\eta_i}\) and \(g(\mu_i) = -\frac{1}{\mu_i}\), negative inverse, or \(h(\eta_i) = \exp(\eta_i)\) and \(g(\mu_i) = \ln(\mu_i)\), log-link. |
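For reference, a sketch of how these response/link pairs are specified with `glm` in R, using simulated data (the data and coefficient values below are illustrative assumptions only). Note that R's `Gamma` family uses the inverse link \(1/\mu\) rather than the canonical \(-1/\mu\); the fitted model is the same apart from the sign of the coefficients.

```r
# Illustrative simulated data; the "true" coefficients are arbitrary choices.
set.seed(1)
n <- 100
x <- runif(n)
dat <- data.frame(x = x,
                  ynorm  = rnorm(n, mean = 1 + 2 * x),
                  ycount = rpois(n, lambda = exp(0.5 + x)),
                  ypos   = rgamma(n, shape = 2, rate = 2 / exp(0.5 + x)),  # mean exp(0.5 + x)
                  m      = 20)
dat$ybin <- rbinom(n, size = dat$m, prob = plogis(-1 + 2 * x))
glm(ynorm ~ x, family = gaussian(link = "identity"), data = dat)              # identity link
glm(cbind(ybin, m - ybin) ~ x, family = binomial(link = "logit"), data = dat) # logit link
glm(ycount ~ x, family = poisson(link = "log"), data = dat)                   # log link
glm(ypos ~ x, family = Gamma(link = "log"), data = dat)                       # or link = "inverse"
```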
The link function is called canonical if \[\eta_i=\theta_i\] that is, if \[g(\mu_i)=\theta_i.\] When the canonical link is used, some of the results for the GLM (to be studied in the next sections) simplify.
\(\theta_i\leftrightarrow \mu_i\leftrightarrow \eta_i \leftrightarrow \mathbf{\beta}\)
A more informative drawing made in class.
See class notes or Fahrmeir et al (2015), Section 5.8.2 for the derivation of the loglikelihood, score and expected Fisher information matrix.
\[l(\beta)=\sum_{i=1}^n l_i(\beta)=\sum_{i=1}^n\frac{1}{\phi}(y_i\theta_i-b(\theta_i))w_i+\sum_{i=1}^n c(y_i,\phi,w_i)\]
Remark: the part of the loglikelihood involving both the data and the parameter of interest is for a canonical link equal to
\[\sum_{i=1}^n y_i \theta_i=\sum_{i=1}^n y_i {\bf x}_i^T\beta=\sum_{i=1}^n y_i \sum_{j=1}^p x_{ij}\beta_j=\sum_{j=1}^p \beta_j \sum_{i=1}^n y_i x_{ij}\]
\[s(\beta)=\sum_{i=1}^n \frac{(y_i-\mu_i){\bf x}_i h'(\eta_i)}{\text{Var}(Y_i)}={\bf X}^T {\bf D} \Sigma^{-1}({\bf y}-{\mathbf \mu})\] where \(\Sigma=\text{diag}(\text{Var}(Y_i))\) and \({\bf D}=\text{diag}(h'(\eta_i))\) (derivative wrt \(\eta_i\)).
Remark: observe that \(s(\beta)=0\) only depends on the distribution of \(Y_i\) through \(\mu_i\) and \(\text{Var}(Y_i)\).
Canonical link:
\[s(\beta)=\sum_{i=1}^n \frac{(y_i-\mu_i){\bf x}_i w_i}{\phi}\] since for the canonical link \(h'(\eta_i)=\frac{\partial \mu_i}{\partial \eta_i}=b''(\theta_i)\), which cancels against \(\text{Var}(Y_i)=b''(\theta_i)\phi/w_i\).
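A quick numerical illustration (simulated data, not from the notes): for a Poisson regression with the canonical log link and \(\phi=w_i=1\), the score reduces to \({\bf X}^T({\bf y}-\boldsymbol{\mu})\), and evaluated at the MLE it should be numerically zero.

```r
# Score at the MLE for a simulated Poisson/log-link fit (canonical link).
set.seed(42)
x <- runif(50)
X <- cbind(1, x)
y <- rpois(50, lambda = exp(1 + 0.5 * x))
fit <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)
drop(t(X) %*% (y - mu_hat))   # s(beta_hat), numerically ~ 0
```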
\[F_{[h,l]}(\beta)=\sum_{i=1}^n \frac{x_{ih}x_{il}(h'(\eta_i))^2}{\text{Var}(Y_i)}\]
\[F(\beta)={\bf X}^T {\bf W} {\bf X}\]
where \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).
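Continuing the simulated Poisson sketch above: for the log link, \(h'(\eta_i)=\exp(\eta_i)=\mu_i\) and \(\text{Var}(Y_i)=\mu_i\), so \({\bf W}=\text{diag}(\mu_i)\), and the inverse of \(F(\hat{\beta})={\bf X}^T{\bf W}{\bf X}\) should agree with the estimated covariance matrix reported by `vcov`.

```r
# Expected Fisher information for the Poisson example above (reuses X, mu_hat, fit).
W <- diag(mu_hat)
F_hat <- t(X) %*% W %*% X
solve(F_hat)   # inverse expected Fisher information ...
vcov(fit)      # ... matches the estimated covariance matrix of beta_hat
```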
Canonical link:
\[\frac{\partial^2 l_i}{\partial \beta_j \partial \beta_l}=- \frac{x_{ij} w_i}{\phi}\frac{\partial \mu_i}{\partial \beta_l}\] which does not contain any random variables, so the observed Fisher information matrix equals the expected Fisher information matrix.
Details on the derivation: IRWLS
\[\beta^{(t+1)}=\beta^{(t)} + F(\beta^{(t)})^{-1} s(\beta^{(t)})\] Insert formulas for expected Fisher information and score function.
\[\beta^{(t+1)}=({\bf X}^T {\bf W}(\beta^{(t)}) {\bf X})^{-1} {\bf X}^T {\bf W}(\beta^{(t)})\tilde{\bf y}^{(t)}\] where \({\bf W}\) is as before, \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\), but now evaluated at the current version \({\boldsymbol{\beta}}^{(t)}\). The diagonal elements are called the working weights. The vector \(\tilde{\bf y}^{(t)}\) is called the working response vector and has element \(i\) given as
\[\tilde{y}_i^{(t)}={\bf x}_i^T\beta^{(t)}+\frac{y_i-h({\bf x}_i^T\beta^{(t)})}{h'({\bf x}_i^T\beta^{(t)})}.\]
Remark: convergence? With full rank of \({\bf X}\) and positive diagonal elements of \({\bf W}\) we are certain that the inverse exists, but the temporary versions of \({\bf W}\) during the iterations might still cause problems.
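Below is a minimal sketch of the Fisher scoring/IRWLS iterations for the grouped beetle data that is analysed with `glm` right after (logit link, responses taken to be the observed proportions with weights \(w_i=n_i\)); this is illustrative code, not the implementation used by `glm` itself.

```r
# Hand-coded IRWLS for the beetle data (logit link, grouped binomial proportions).
beetle <- investr::beetle
X <- model.matrix(~ldose, data = beetle)
y <- beetle$y / beetle$n            # observed proportions
w <- beetle$n                       # group sizes, w_i = n_i
beta <- c(0, 0)                     # starting value
for (iter in 1:25) {
  eta  <- drop(X %*% beta)
  mu   <- plogis(eta)                    # h(eta)
  dmu  <- mu * (1 - mu)                  # h'(eta)
  Wvec <- dmu^2 / (mu * (1 - mu) / w)    # working weights h'(eta)^2 / Var(Y_i)
  z    <- eta + (y - mu) / dmu           # working response
  beta <- solve(t(X) %*% (Wvec * X), t(X) %*% (Wvec * z))
}
drop(beta)
coef(glm(cbind(y, n - y) ~ ldose, family = "binomial", data = beetle))  # same values
```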
See what is output from glm
- observe the working weights (stored as weights) and the working residuals (stored as residuals)
fitgrouped = glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
names(fitgrouped)
fitgrouped$weights
fitgrouped$residuals
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
## 1 2 3 4 5 6 7
## 3.254867 8.227383 14.321313 13.378893 10.261055 5.156671 2.653398
## 8
## 1.230713
## 1 2 3 4 5 6
## 0.78115418 0.38388091 -0.31082206 -0.44081641 0.18557365 -0.05641516
## 7 8
## 0.67002811 1.02139898
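As an illustrative check (not part of the original output), the working weights printed above can be reconstructed from the fit: for the grouped binomial with logit link they are \(n_i \hat{\pi}_i(1-\hat{\pi}_i)\), i.e. the prior weights times the variance function evaluated at the fitted values.

```r
# Reconstruct the working weights from the fitted probabilities.
pihat <- fitgrouped$fitted.values
fitgrouped$prior.weights * pihat * (1 - pihat)   # equals fitgrouped$weights
```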
Let data be grouped as much as possible. With \(G\) groups (covariate patterns) and \(n_i\) observations in group \(i\) (so that \(\sum_{i=1}^G n_i=n\)):
\[ \hat{\phi}=\frac{1}{G-p} \sum_{i=1}^G \frac{(y_i-{\hat{\mu}}_i)^2}{b''(\theta_i)/w_i}\]
The motivation behind this estimator is as follows: \[ \text{Var}(Y_i)=\phi b''(\theta_i)/w_i \Leftrightarrow \phi=\text{Var}(Y_i)/(b''(\theta_i)/w_i)\]
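In R this estimator is simply the Pearson statistic divided by \(G-p\); a sketch reusing the grouped beetle fit from above (for the quasibinomial family, `summary.glm` reports the same quantity as the estimated dispersion).

```r
# Pearson estimate of the dispersion parameter, phi_hat = X_P^2 / (G - p).
sum(residuals(fitgrouped, type = "pearson")^2) / fitgrouped$df.residual
summary(glm(cbind(y, n - y) ~ ldose, family = "quasibinomial",
            data = investr::beetle))$dispersion   # same number
```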
As before, the maximum likelihood estimator \(\hat{\beta}\) asymptotically follows a multivariate normal distribution with mean \(\beta\) and covariance matrix equal to the inverse of the expected Fisher information matrix. This is also true when we replace the unknown \(\beta\) with the estimate \(\hat{\beta}\) in the expected Fisher information matrix.
\[\hat{\beta} \approx N_p(\beta,F^{-1}(\hat{\beta}))\] and with \[F(\hat{\beta})={\bf X}^T \hat{{\bf W}} {\bf X}\] where \(\hat{{\bf W}}\) denotes that \(\hat{\beta}\) is used when calculating \({\bf W}=\text{diag}(\frac{h'(\eta_i)^2}{\text{Var}(Y_i)})\).
Same as before - for the Wald we insert the formula for the covariance matrix of \(\hat{\beta}\), for the LRT we insert the loglikelihoods and for the score test we insert formulas for the score function and expected Fisher information matrix.
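Both types of tests are directly available from a fitted `glm` object; a small sketch using the beetle fit from above, where `summary` gives the Wald \(z\)-tests for individual coefficients and `anova` with `test = "LRT"` compares nested models via the difference in deviance.

```r
# Wald tests for single coefficients (estimate / standard error):
summary(fitgrouped)$coefficients
# Likelihood ratio test of the ldose effect against the intercept-only model:
fit0 <- glm(cbind(y, n - y) ~ 1, family = "binomial", data = investr::beetle)
anova(fit0, fitgrouped, test = "LRT")
```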
Group observations together in groups of maximal size (covariate patterns, or interval versions thereof). Group \(i\) has \(n_i\) observations, and there are \(G\) groups. The asymptotic distribution is only reliable if all groups have large \(n_i\); for individual (ungrouped) data the asymptotic results cannot be trusted.
Deviance \[D=-2\left[\sum_{i=1}^{G}(l_i(\hat{\mu}_i)-l_i(\bar{y}_i))\right]\] with approximate \(\chi^2\)-distribution with \(G-p\) degrees of freedom.
Pearson: \[ X_P^2=\sum_{i=1}^G \frac{(y_i-\hat{\mu}_i)^2}{v(\hat{\mu}_i)/w_i}\] with approximate \(\phi \cdot \chi^2\)-distribution with \(G-p\) degrees of freedom.
Remember that the variance function \(v(\hat{\mu}_i)=b''(\theta_i)\) (this is a function of \(\mu_i\) because \(\mu_i=b'(\theta_i)\)).
Let \(p\) be the number of regression parameters in our model. \[\text{AIC} =-2 \cdot l(\hat{\beta})+2p\] If the dispersion parameter is estimated use \((p+1)\) in place of \(p\).
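These goodness-of-fit quantities can all be read off a fitted `glm` object; an illustrative sketch with the grouped beetle fit from above.

```r
deviance(fitgrouped)                              # deviance D
sum(residuals(fitgrouped, type = "pearson")^2)    # Pearson X_P^2
fitgrouped$df.residual                            # G - p degrees of freedom
AIC(fitgrouped)                                   # -2*loglik + 2*p
```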
Work with Problem 1 and 2 in IL, and work on Problems 3-5 by yourself.
If you have more time after Problem 1 and 2, look through the theoretical proofs and derivations listed under “Learning material” on the top of this Module page.
a) Define the class of generalized linear models (GLMs), and explicitly list all requirements for each part of the model.
b) Below, three likelihoods, three link functions, and three linear components are listed. Explain which combinations give valid GLMs (8 in total), and also comment on these models (you do not have to mathematically prove which are valid).
Likelihoods:
Link functions:
Linear components:
(New) The matrix \({\bf W}\) is a diagonal matrix. What is on the diagonal?
Calculate the elements of \({\bf W}\) for a Poisson regression, both with log and identity link. Compare.
Calculate the elements of \({\bf W}\) for a binary regression, both with logit and identity link. Compare.
(New) What insight did this give you into the role of the link function and its effect on the covariance of the parameter estimates?
Let \(Y_1,Y_2,\ldots,Y_N\) be independent and exponentially distributed random variables, where \(Y_i\) has the density
\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} \text{ for } y_i>0, \alpha_i>0, i=1,2,\ldots,N.\]
a) Show that the distribution of \(Y_i\) comes from the exponential family. Use the general formulas to find E(\(Y_i\)) and \(\text{Var}(Y_i)\) as functions of \(\alpha_i\).
b) Show that the log-likelihood for the data \(y_1, \dots, y_N\) can be written as
\[ l = \sum_{i=1}^N \{-\alpha_i y_i + \ln \alpha_i\} \]
Use this to show that the deviance for a generalized linear model with estimated expectations \(\hat{\mu_i} = \hat{y_i}\) is
\[ D = 2 \sum _{i=1}^N \left\{ \frac{y_i-\hat y_i}{\hat y_i} - \ln \left(\frac{y_i}{\hat y_i} \right) \right\}\]
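If you want to check this deviance formula numerically (an optional, illustrative sketch with simulated data): the exponential distribution is a gamma distribution with shape 1, and the unscaled deviance reported by `glm` with the `Gamma` family is exactly the expression above.

```r
# Numerical check of the exponential deviance (simulated data, log link).
set.seed(3)
x <- runif(30)
y <- rexp(30, rate = 1 / exp(1 + x))        # E(Y_i) = exp(1 + x_i)
fit <- glm(y ~ x, family = Gamma(link = "log"))
yhat <- fitted(fit)
2 * sum((y - yhat) / yhat - log(y / yhat))  # the formula above
deviance(fit)                               # the same value
```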
d) We want to test the null hypothesis
\[H_0: \ \alpha_1 = \alpha_2 = \cdots = \alpha_N = \alpha \]
against the alternative hypothesis that at least one \(\alpha_i\) is different from the others.
Use b) to find a test statistic for this problem.
What distribution does this test statistic have under \(H_0\)?
New: could you also use the Wald test? Explain how to do that (no need to calculate the explicit test statistic mathematically).
Assume that \(Y_1, \dots, Y_N\) are independent, continuously distributed random variables, where the density of \(Y_i\) is given by
\[ f(y_i; \gamma_i) = \begin{cases} \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} &\text{ for } y_i \geq 0 \\ 0 &\text{ else} \end{cases}\]
where \(\gamma_i\) is a scalar parameter.
a) Show that the distribution of \(Y_i\) comes from the exponential family. Hint: usually we choose to let \(\phi=\frac{1}{2}\).
Use the general formulas to show that E\((Y_i) = 2/\gamma_i\) and Var\((Y_i) = 2/\gamma_i^2\).
Assume a GLM for \(Y_1, \dots, Y_N\) where the distribution of \(Y_i\) is as above for all \(i\), with the following link function:
\[\eta = g(\mu) = \ln(\mu) = x^T\beta\]
where \(x, \ \beta \in {\rm I\!R}^p\) and \(\mu\) is the expected value of \(y\).
b) Use general formulas to find the score vector \(s(\beta) = [s_1(\beta), \dots, s_p(\beta)]^T\) and the expected Fisher information matrix \(F(\beta) = [F_{ij}(\beta)]_{i,j = 1}^p\), expressed using \(y_1, \dots, y_N, \beta, N\) and the covariates \(x\).
Write down the equation that can be used to find the MLE for \(\beta\). Note that this is a recursive equation.
c) Write down the log-likelihood for the model above. Use this to find the deviance \(D\) for the model as a function of \(y_1, \dots, y_N\) and \(\hat{y_1}, \dots, \hat{y_N}\), where \(\hat{y_i}\) is the estimated expected value of \(y_i\).
Find an expression for the deviance residuals \(d_i\) using \(y_i\) and \(\hat{y_i}\).
We assume that the random variable \(\Lambda\) is gamma distributed with pdf
\[f(\lambda; \nu, \mu) = \frac{(\nu/\mu)^\nu}{\Gamma(\nu)} \lambda^{\nu-1} e^{-\lambda/\mu}; \ \lambda > 0\]
and further that given \(\Lambda = \lambda\), the random variable \(Y\) is Poisson distributed with parameter \(\lambda\). Thus the conditional pmf of \(Y\) given \(\Lambda = \lambda\) takes the form
\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]
a) Show that the marginal pmf of \(Y\) is given by \[p(y; \mu, \nu) = \frac{\Gamma(y+\nu)}{\Gamma(\nu)\Gamma(y+1)} \left(\frac{\mu}{\mu+\nu}\right)^y \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y = 0, 1, 2, \dots\] This is the negative binomial distribution.
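A quick numerical check (illustrative parameter values): the marginal pmf in a) is the standard negative binomial, which `dnbinom` parameterizes with `size = nu` and `mu = mu`.

```r
# Compare the derived pmf with R's dnbinom for a few values of y.
mu <- 2.5; nu <- 1.8; y <- 0:5
gamma(y + nu) / (gamma(nu) * gamma(y + 1)) * (mu / (mu + nu))^y * (nu / (mu + nu))^nu
dnbinom(y, size = nu, mu = mu)   # the same probabilities
```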
We then assume that the parameter \(\nu\) is fixed, and consider the random variable \(Y^* = Y/\nu\). Note that
\[\text{P}(Y^* = y^*) = \text{P}(Y = \nu y^*) \ \text{for} \ y^* = 0, \frac{1}{\nu}, \frac{2}{\nu}, \dots\]
so \(Y^*\) has pmf
\[p^*(y^*; \mu, \nu) = \frac{\Gamma(\nu y^* + \nu)}{\Gamma(\nu)\Gamma(\nu y^* + 1)} \left(\frac{\mu}{\mu+\nu}\right)^{\nu y^*} \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y^* = 0, \frac{1}{\nu}, \frac{2}{\nu}, \dots\]
b) Show that the pmf of \(Y^*\) is an exponential family \[\exp\left\{\frac{y \theta - b(\theta)}{\phi}w + c(y, \phi, w)\right\},\] with \(\theta = \log(\mu/(\mu+\nu))\), \(b(\theta) = -\log(1 - e^{\theta})\), \(w = 1\) and \(\phi = 1/\nu\)
c) Use the expressions for \(b(\theta)\) and \(\phi\) to determine \(\text{E}(Y^*)\) and \(\text{Var}(Y^*)\). Show that \(\text{E}(Y) = \mu\) and find \(\text{Var}(Y)\).
One of the important concepts we have discussed in this course is deviance (for Gaussian regression, Poisson regression and logistic regression).
Explain what deviance is, and how it relates to residual sum of squares (RSS) for Gaussian regression. Remark 2017/2018: we have called this “sums of squares of errors - SSE”
Discuss how it relates to a likelihood ratio test (LRT) for comparing two nested regression models.
Discuss how deviance can be used to construct “ANOVA” tables for Poisson regression and logistic regression. Remark 2017/2018: these are called analysis of deviance tables.
Discuss how deviance can be used to define residuals, for Poisson regression and logistic regression.