– so, for the first time: no practical examples or data sets to be analysed!
Work with Problems 1 and 2 in IL, and work on Problems 3-5 by yourself.
If you have more time after Problems 1 and 2, look through the theoretical proofs and derivations listed under “Learning material” at the top of this module page.
Likelihoods:
Link functions:
Linear components:
Exponential family: \(f(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} w + c(y, \phi, w)\right)\) (note that \(\theta\) can be a vector)
Binomial distribution: \(f(y) = \binom{n}{y} \pi^y(1-\pi)^{n-y} = \exp\left( \ln \binom{n}{y} + y \ln \frac{\pi}{1-\pi} + n \ln (1-\pi) \right)\)
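As a quick numeric sanity check (not part of the original solution), the exponential-family rewriting can be compared directly against the binomial log-pmf; the values of \(n\), \(y\) and \(\pi\) below are arbitrary.

```python
# Check that ln C(n,y) + y*ln(pi/(1-pi)) + n*ln(1-pi) equals the binomial log-pmf.
import numpy as np
from scipy import stats
from scipy.special import comb

n, y, pi = 10, 3, 0.4  # arbitrary illustrative values
lhs = stats.binom.logpmf(y, n, pi)
rhs = np.log(comb(n, y)) + y * np.log(pi / (1 - pi)) + n * np.log(1 - pi)
print(lhs, rhs)  # the two values agree
```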
Link functions: 1. (\(\cos(\mu)\)) is not monotone and thus not a valid link function. The other two (2. (\(\mu\)) and 3. (\(\log(\mu)\))) are valid.
Linear component: The third linear component is not linear in \(\beta_1\) and is thus not valid. The others are linear in \(\boldsymbol{\beta}\) and are valid (the linear predictor does not have to be linear in the covariates \(x_i\)).
Thus we have two linear components that can be used with any model, which leaves four alternatives for link and response (and in total 8 different model combinations):
The likelihood for a single datum is
\[ f(y) = \text{P}(Y=y) = \binom{n}{y} \pi^y (1-\pi)^{n-y} \]
So the log likelihood is
\[ \log(f) = \log \binom{n}{y} + y\log(\pi) + (n-y)\log(1-\pi) = \log \binom{n}{y} + y \log\left(\frac{\pi}{1-\pi}\right) + n\log(1-\pi) \]
For \(N\) data points, \(l(\mu, y) = \sum_{i=1}^N \log f(y_i)\), and the deviance is \(D(y, \hat{\mu}) = -2\left(l(\hat{\mu}, y) - l(y, y)\right)\), so
\[ \begin{aligned} D(y, \hat{\mu}) &= 2 l(y, y) - 2 l(\hat{\mu}, y) \\ &= 2 \sum_{i=1}^N\left[y_i \log\left( \frac{y_i}{n_i-y_i}\right) + n_i \log\left( \frac{n_i-y_i}{n_i}\right) - y_i \log\left( \frac{\hat{\mu}_i/n_i}{(n_i-\hat{\mu}_i)/n_i}\right) - n_i \log\left( \frac{n_i - \hat{\mu}_i}{n_i}\right) \right] \\ &= 2 \sum_{i=1}^N\left[y_i \log\left( \frac{y_i}{\hat{\mu}_i}\right) + n_i \log\left( \frac{n_i-y_i}{n_i - \hat{\mu}_i}\right) - y_i \log\left( \frac{n_i - y_i}{n_i - \hat{\mu}_i}\right) \right] \\ &= 2 \sum_{i=1}^N\left[y_i \log\left( \frac{y_i}{\hat{\mu}_i}\right) + (n_i - y_i) \log\left( \frac{n_i-y_i}{n_i - \hat{\mu}_i}\right) \right] \end{aligned} \]
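A minimal numeric check (not from the original text) that this closed-form deviance equals minus twice the log-likelihood difference; the data below are arbitrary, chosen so that \(0 < y_i < n_i\).

```python
import numpy as np
from scipy import stats

n = np.array([10, 20, 15])           # group sizes n_i
y = np.array([3, 12, 7])             # observed successes y_i
mu_hat = np.array([4.2, 10.5, 8.1])  # fitted means, 0 < mu_hat_i < n_i

# Deviance from the final expression derived above
D_formula = 2 * np.sum(y * np.log(y / mu_hat)
                       + (n - y) * np.log((n - y) / (n - mu_hat)))

# Deviance as twice the log-likelihood difference (saturated minus fitted model)
l_sat = stats.binom.logpmf(y, n, y / n).sum()       # saturated: pi_i = y_i/n_i
l_mod = stats.binom.logpmf(y, n, mu_hat / n).sum()  # model: pi_i = mu_hat_i/n_i
print(D_formula, 2 * (l_sat - l_mod))               # the two numbers agree
```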
\[ w_{ii} = \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2 \Big/\text{Var}(y_i) \]
Identity link:
\[ \mu = \eta \implies \frac{\partial \mu}{\partial \eta} = 1 \implies w_{ii} = \frac{1}{\text{Var}(y_i)} \]
Log link:
\[ \mu = e^{\eta} \implies \frac{\partial \mu}{\partial \eta} = e^{\eta} \implies w_{ii} = \frac{e^{2 \eta_i}}{e^{\eta_i}} = e^{\eta_i} = \mu_i \]
Each is the inverse of the other.
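A small numeric sketch (illustrative only, not from the original text) of this reciprocal relationship, approximating \(\partial \mu/\partial \eta\) by a finite difference; the same check works for the binomial case below.

```python
import numpy as np

mu = 2.5   # arbitrary mean value; Var(y) = mu for the Poisson
h = 1e-6   # step for the finite-difference derivative

# Identity link: eta = mu, so d mu/d eta = 1
w_identity = 1.0**2 / mu  # = 1/mu

# Log link: eta = log(mu), so mu = exp(eta)
eta = np.log(mu)
dmu_deta = (np.exp(eta + h) - np.exp(eta - h)) / (2 * h)  # approximately mu
w_log = dmu_deta**2 / mu  # = mu

print(w_identity * w_log)  # approximately 1: the two weights are reciprocals
```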
Identity link:
\[ \mu = \eta \implies \frac{\partial \mu}{\partial \eta} = 1 \implies w_{ii} = \frac{1}{n_i \pi_i (1-\pi_i)} \]
Logit link:
\[ \mu = n\frac{e^{\eta}}{1+e^{\eta}} \implies \frac{\partial \mu}{\partial \eta} = n\frac{e^{\eta}}{(1+e^{\eta})^2} = n \pi (1-\pi) \implies w_{ii} = \frac{(n_i \pi_i (1-\pi_i))^2}{n_i \pi_i (1-\pi_i)} = n_i \pi_i (1-\pi_i) \]
Again, each is the inverse of the other.

Let \(Y_1,Y_2,\ldots,Y_N\) be independent and exponentially distributed random variables, where \(Y_i\) has the density
\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} \text{ for } y_i>0, \alpha_i>0, i=1,2,\ldots,N.\]
\[f(y_i ; \alpha_i)=\alpha_i e^{-\alpha_iy_i} = e^{-\alpha_i y_i + \ln \alpha_i} \]
which comes from the exponential family:
\[ f(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} w + c(y, \phi, w)\right) = \exp\Big(\theta y_i + \ln(-\theta)\Big) \]
with \(\phi = 1\), \(w = 1\), \(c(y, \phi, w) = 0\), \(\theta = - \alpha_i\), and \(b(\theta) = - \ln (\alpha_i) = - \ln(-\theta)\).

Expected value: E\((Y_i) = b'(\theta) = - \frac{1}{\theta} = \frac{1}{\alpha_i}\)

The variance: Var\((Y_i) = b''(\theta) \frac{\phi}{w} = \frac{1}{\theta^2} = \frac{1}{\alpha_i^2}\)
Note that you could also use \(\theta = \alpha_i\), \(b(\theta) = \ln(\theta) = \ln(\alpha_i)\), \(c(y, \phi, w) = 0\), and either \(\phi\) or \(w\) equal to -1 and the other equal to 1, so that \(\frac{\phi}{w} = -1\). Then E\((Y_i) = b'(\theta) = \frac{1}{\theta} = \frac{1}{\alpha_i}\) and Var\((Y_i) = -b''(\theta) = -\frac{-1}{\theta^2} = \frac{1}{\alpha_i^2}\). These two parameterizations are equivalent, which shows that the parameterization is not necessarily unique! There is, however, a standard choice of parameterization. For the Gamma (and thus for the special case of the exponential) we use the first one, where \(\phi = w = 1\).
Show that the log-likelihood can be written as
\[ l = \sum_{i=1}^N \{-\alpha_i y_i + \ln \alpha_i\} \]
By definition:
\[ l = \sum_{i=1}^N \ln f(y_i;\alpha_i) = \sum_{i=1}^N(-\alpha_iy_i + \ln\alpha_i) \]
Use this to show that the deviance for a generalized linear model with estimated expectations \(\hat{\mu_i} = \hat{y_i}\) is
\[ D = 2 \sum _{i=1}^N \left\{ \frac{y_i-\hat y_i}{\hat y_i} - \ln \left(\frac{y_i}{\hat y_i} \right) \right\} \]
The deviance is
\[ D = 2(l_{saturated}-l_{model}) \]
Saturated model: All \(\alpha_i\) are different, so \(\frac{\partial l}{\partial \alpha_i} = -y_i + 1/\alpha_i\). MLE: \(-y_i + 1/\alpha_i = 0 \ \implies \ \hat{\alpha_i} = 1/y_i\) and thus
\[ l_{saturated} = \sum_{i=1}^N(-1 + \ln(1/y_i)) \]
In a GLM we estimate the mean \(\mu_i\) by \(\hat{\mu_i}\) (we do not need the explicit expression here), which we call \(\hat{y_i}\); then \(\hat{\alpha_i} = 1/\hat{y_i}\), and we have
\[ l_{model} = \sum_{i=1}^N \left( -\frac{y_i}{\hat{y_i}} +\ln \frac{1}{\hat{y_i}} \right) = \sum_{i=1}^N \left( -\frac{y_i}{\hat{y_i}} -\ln \hat{y_i} \right) \]
Then we get the deviance
\[ D = 2\sum_{i=1}^N \left(-1-\ln(y_i) + \frac{y_i}{\hat{y_i}} + \ln \hat{y_i} \right) = 2\sum_{i=1}^N \left( \frac{y_i-\hat{y_i}}{\hat{y_i}} - \ln\frac{y_i}{\hat{y_i}} \right) \]
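A quick numeric verification of this expression (not part of the original answer), comparing it against twice the log-likelihood difference computed with scipy; the observations and fitted values below are arbitrary.

```python
import numpy as np
from scipy import stats

y = np.array([0.8, 2.3, 1.1, 4.0])      # arbitrary positive observations
y_hat = np.array([1.0, 2.0, 1.5, 3.0])  # arbitrary fitted means

D_formula = 2 * np.sum((y - y_hat) / y_hat - np.log(y / y_hat))

# The exponential with rate alpha has mean 1/alpha; scipy uses scale = mean
l_sat = stats.expon.logpdf(y, scale=y).sum()      # saturated: alpha_i = 1/y_i
l_mod = stats.expon.logpdf(y, scale=y_hat).sum()  # model: alpha_i = 1/y_hat_i
print(D_formula, 2 * (l_sat - l_mod))             # the two agree
```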
We want to test
\[H_0: \ \alpha_1 = \alpha_2 = \cdots = \alpha_N = \alpha \]
against the alternative hypothesis that at least one \(\alpha_i\) is different from the others. Use b) to find a test statistic for this problem. What distribution does this test statistic have under \(H_0\)?
\(H_0\) corresponds to a GLM where all \(\alpha_i\) are equal to the same value, here denoted \(\alpha\). Then
\[ l = - \alpha \sum_{i=1}^N y_i + N \ln \alpha \]
Taking the derivative and setting it equal to zero gives
\[ \frac{dl}{d\alpha} = -\sum_{i=1}^N y_i + \frac{N}{\alpha} = 0 \implies \hat{\alpha} = \frac{1}{\frac{1}{N}\sum_{i=1}^N y_i} = \frac{1}{\bar{y}} \]
Thus, \(\hat{\alpha_i} = \frac{1}{\bar{y}}\) for each \(i\), which means that \(\hat{\mu_i} = \hat{y_i} = \bar{y}\). Then the deviance becomes
\[ D = 2 \sum_{i=1}^N \left( {\frac{y_i - \bar{y}}{\bar{y}} - \ln \frac{y_i}{\bar{y}}} \right) \] which is approximately \(\chi_{N-1}^2\) distributed under \(H_0\). Here \(N-1\) is the number of parameters in the saturated model minus the number of parameters in the model under \(H_0\). We reject \(H_0\) if the deviance \(D\) is large.
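A minimal sketch (not from the original solution) of carrying out this test numerically, with data simulated under \(H_0\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
N = 50
y = rng.exponential(scale=2.0, size=N)  # H0 holds: common alpha = 1/2

ybar = y.mean()
D = 2 * np.sum((y - ybar) / ybar - np.log(y / ybar))  # deviance under H0

# Compare D to a chi-square with N-1 degrees of freedom; reject H0 for large D
p_value = stats.chi2.sf(D, df=N - 1)
print(D, p_value)
```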
New: could you also use the Wald test? Explain how to do that (no need to calculate the explicit test statistic mathematically).
Assume that \(Y_1, \dots, Y_N\) are independent, continuously distributed random variables, where the density of \(Y_i\) is given by
\[ f(y_i; \gamma_i) = \begin{cases} \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} &\text{ for } y_i \geq 0 \\ 0 &\text{ else} \end{cases}\]
where \(\gamma_i\) is a scalar parameter.
This is the Gamma distribution with \(\nu = 2\) and \(\mu = 2/\gamma\).
\[ f(y_i; \gamma_i) = \frac{\gamma_i^2}{2}y_ie^{-\gamma_iy_i} = \exp\Big( -\gamma_i y_i + 2\ln\gamma_i + \ln y_i - \ln2 \Big) = \exp \left( \frac{-\frac{\gamma_i}{2}y_i + \ln\frac{\gamma_i}{2}}{1/2} + \ln y_i + \ln 2 \right)\]
\[ f(y) = \exp\left(\frac{y\theta - b(\theta)}{\phi} w + c(y, \phi, w)\right) = \exp\Big( \frac{y_i(-\gamma_i) - (-2\ln\gamma_i)}{1}1 + (\ln y_i - \ln2) \Big) \\ = \exp\Big( \frac{y_i\gamma_i - 2\ln\gamma_i}{-1}1 + (\ln y_i - \ln2) \Big) \] Also here we have several choices of parameterization. We choose the following (which is the standard for the Gamma distribution): \(w = 1\), \(\phi = 1/2\), \(c(y, \phi, w) = \ln y_i + \ln2\), \(\theta = -\gamma_i/2\) and \(b(\theta) = -\ln(-\theta)\).
(An alternative that might be simpler is: \(w = 1\), \(\phi = -1\), \(c(y, \phi, w) = \ln y_i - \ln2\), \(\theta = \gamma_i\), and \(b(\theta) = 2 \ln \theta\). The results will be the same, but this is not the standard way and we do not use this parameterization.)
Use the general formulas to show that E\((Y_i) = 2/\gamma_i\) and Var\((Y_i) = 2/\gamma_i^2\).
Expected value: E\((Y_i) = b'(\theta) = \frac{-1}{\theta} = \frac{2}{\gamma_i}\)
The variance: Var\((Y_i) = b''(\theta) \frac{\phi}{w} = \frac{1}{2\theta^2} = \frac{2}{\gamma_i^2}\)
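As a quick cross-check (not part of the original solution), these moments match scipy's Gamma distribution with shape \(\nu = 2\) and scale \(1/\gamma_i\); the value of \(\gamma_i\) is arbitrary.

```python
# E(Y) = 2/gamma and Var(Y) = 2/gamma^2 for Gamma(shape=2, rate=gamma);
# scipy parameterizes by scale = 1/rate.
from scipy import stats

gamma_i = 1.7  # arbitrary illustrative value
mean, var = stats.gamma.stats(a=2, scale=1 / gamma_i, moments="mv")
print(mean, 2 / gamma_i)    # equal
print(var, 2 / gamma_i**2)  # equal
```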
Assume a GLM for \(Y_1, \dots, Y_N\) where the distribution of \(Y_i\) is as above for all \(i\), with the following link function:
\[\eta = g(\mu) = \ln(\mu) = x^T\beta\]
where \(x, \ \beta \in \mathbb{R}^p\) and \(\mu\) is the expected value of \(y\).
Score: \(s(\beta) = [s_1(\beta), \dots, s_p(\beta)]^T\)
\[ s_j(\beta) = \sum_{i=1}^N \Big( \frac{y_i-\mu_i}{Var(Y_i)} x_{ij} \frac{\partial \mu_i}{\partial \eta_i} \Big) \]
\[\frac{\partial \eta}{\partial \mu} = 1/\mu \implies \frac{\partial \mu}{\partial \eta} = \mu = e^{x^T\beta} \]
Using that \(\ln \mu = x^T\beta\) and \(\mu = 2/\gamma\) we get \(\mu = 2/\gamma = e^{x^T\beta} \implies \gamma = 2e^{-x^T\beta}\). Using this and the expected value and variance from a), the score \(s_j(\beta)\) becomes
\[ s_j(\beta) = \sum_{i=1}^N \left[ \frac{y_i - 2/\gamma_i}{2/\gamma_i^2}x_{ij}e^{x_i^T\beta} \right] = \sum_{i=1}^N \left[ \frac{y_i - e^{x_i^T\beta}}{\frac{1}{2}e^{2x_i^T\beta}}x_{ij}e^{x_i^T\beta} \right] = 2\sum_{i=1}^N \left[ \frac{y_i - e^{x_i^T\beta}}{e^{x_i^T\beta}}x_{ij} \right] \]

Fisher information: \(F(\beta) = [F_{jk}(\beta)]_{j,k = 1}^p\)
The general formula for the expected Fisher information is:
\[ F_{jk}(\beta) = \sum_{i=1}^N \left[\frac{x_{ij} x_{ik}}{Var(Y_i)} \left(\frac{\partial \mu_i}{\partial \eta_i}\right)^2 \right] \]
Thus we get
\[ F_{jk}(\beta) = \sum_{i=1}^N \left[\frac{x_{ij} x_{ik}}{2/\gamma_i^2} \left(e^{x_i^T\beta}\right)^2 \right] = \sum_{i=1}^N \left[\frac{x_{ij} x_{ik}}{\frac{1}{2}e^{2x_i^T\beta}} e^{2x_i^T\beta} \right] = 2\sum_{i=1}^Nx_{ij}x_{ik} \]
which means that the full matrix can be written as \(F(\beta) = 2X^TX\). Note that it is not a function of \(\beta\).
Write down the equation that can be used to find the MLE for \(\beta\). Note that this is a recursive equation.
The recursive equation we use to find the MLE is
\[ \beta^{(m)} = \beta^{(m-1)} + F^{-1}s(\beta^{(m-1)}) \]
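A minimal sketch of this iteration in code (not from the original text), using the score \(s_j(\beta) = 2\sum_i x_{ij}(y_i - e^{x_i^T\beta})/e^{x_i^T\beta}\) and \(F = 2X^TX\) derived above; the design matrix, coefficients and simulated responses are illustrative assumptions. Since \(F\) does not depend on \(\beta\), it is inverted once outside the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # design matrix
beta_true = np.array([0.5, -0.3])
mu_true = np.exp(X @ beta_true)
y = rng.gamma(shape=2.0, scale=mu_true / 2.0)  # Gamma with nu = 2, mean mu_true

F_inv = np.linalg.inv(2 * X.T @ X)  # constant Fisher information, inverted once
beta = np.zeros(p)
for _ in range(25):                 # beta^(m) = beta^(m-1) + F^{-1} s(beta^(m-1))
    mu = np.exp(X @ beta)
    s = 2 * X.T @ ((y - mu) / mu)   # score at the current beta
    beta = beta + F_inv @ s

print(beta)  # close to beta_true for large N
```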
c) Write down the log-likelihood for the model above. Use this to find the deviance \(D\) for the model as a function of \(y_1, \dots, y_N\) and \(\hat{y_1}, \dots, \hat{y_N}\), where \(\hat{y_i}\) is the estimated expected value of \(y_i\).
Find an expression for the deviance residuals \(d_i\) using \(y_i\) and \(\hat{y_i}\).
We assume that the random variable \(\Lambda\) is gamma distributed with pdf
\[f(\lambda; \nu, \mu) = \frac{(\nu/\mu)^\nu}{\Gamma(\nu)} \lambda^{\nu-1} e^{-\lambda\nu/\mu}; \ \lambda > 0\]
and further that given \(\Lambda = \lambda\), the random variable \(Y\) is Poisson distributed with parameter \(\lambda\). Thus the conditional pmf of \(Y\) given \(\Lambda = \lambda\) takes the form
\[\text{P}(Y = y | \lambda)= \frac{\lambda^y}{y!} \exp(-\lambda), \ y = 0, 1, 2, \dots.\]
We then assume that the parameter \(\nu\) is fixed, and consider the random variable \(Y^* = Y/\nu\). Note that
\[\text{P}(Y^* = y^*) = \text{P}(Y = \nu y^*) \ \text{for} \ y^* = 0, \frac{1}{\nu}, \frac{2}{\nu}, \dots\]
so \(Y^*\) has pmf
\[p^*(y^*; \mu, \nu) = \frac{\Gamma(\nu y^* + \nu)}{\Gamma(\nu)\Gamma(\nu y^* + 1)} \left(\frac{\mu}{\mu+\nu}\right)^{\nu y^*} \left(\frac{\nu}{\mu+\nu}\right)^\nu; \ y^* = 0, \frac{1}{\nu}, \frac{2}{\nu}, \dots\]
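A numeric sanity check (not part of the original problem) that this pmf really is the Gamma-Poisson mixture: integrating the Poisson pmf against the gamma density should reproduce the closed form, with \(y = \nu y^*\). The parameter values below are arbitrary.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad
from scipy.special import gammaln

nu, mu = 3.0, 2.5  # arbitrary illustrative parameters

def pmf_closed_form(y):  # P(Y = y), i.e. the pmf above evaluated at y = nu*y_star
    return np.exp(gammaln(y + nu) - gammaln(nu) - gammaln(y + 1)
                  + y * np.log(mu / (mu + nu)) + nu * np.log(nu / (mu + nu)))

def pmf_mixture(y):      # integrate Poisson(y; lambda) against the gamma density
    integrand = lambda lam: (stats.poisson.pmf(y, lam)
                             * stats.gamma.pdf(lam, a=nu, scale=mu / nu))
    return quad(integrand, 0, np.inf)[0]

for y in [0, 1, 2, 5]:
    print(pmf_closed_form(y), pmf_mixture(y))  # pairs agree
```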
One of the important concepts we have discussed in this course is deviance (for Gaussian regression, Poisson regression and logistic regression).
Explain what deviance is, and how it relates to the residual sum of squares (RSS) for Gaussian regression. Remark 2017/2018: we have called this “sum of squares of errors (SSE)”.
Discuss how it relates to a likelihood ratio test (LRT) for comparing two nested regression models.
Discuss how deviance can be used to construct “ANOVA” tables for Poisson regression and logistic regression. Remark 2017/2018: these are called analysis of deviance tables.
Discuss how deviance can be used to define residuals, for Poisson regression and logistic regression.