Problem 1

a.

We assume that \(Y_i \sim \operatorname{Poisson}(\lambda_i)\) with pmf \(f(y_i) = \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!}.\)

The canonical link function for Poisson regression is the log function, and we can write

\[ \log(\lambda_i) = \mathbf{x_i}^T \boldsymbol{\beta} \]

and therefore

\[ \lambda_i = \exp(\mathbf{x_i}^T \boldsymbol{\beta}). \] Assuming independence, the log-likelihood is derived as follows

\[\begin{align} l(\beta) &= \log L(\beta) \\ & = \log \prod_{i=1}^n \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!}\\ & = \sum_{i=1}^ny_i\log(\lambda_i) - \sum_{i=1}^n\lambda_i - \sum_{i=1}^n\log(y_i!)\\ & = \sum_{i=1}^ny_i\mathbf{x_i}^T\boldsymbol{\beta} - \sum_{i=1}^n\exp(\mathbf{x_i}^T\boldsymbol{\beta}) - \sum_{i=1}^n\log(y_i!) \end{align}\]

The score function is

\[ s(\boldsymbol \beta) = \frac{\partial l (\boldsymbol \beta)}{\partial \boldsymbol \beta} = \sum_{i=1}^n[y_i - \exp(\mathbf{x_i}^T\boldsymbol\beta)]\mathbf{x_i} = \sum_{i=1}^n(y_i - \lambda_i)\mathbf{x_i} = X^T(\mathbf y - \boldsymbol \lambda) \]

For the observed Fisher information we have

\[\begin{align} H(\boldsymbol\beta) &= -\frac{\partial ^2 l(\boldsymbol \beta)}{\partial \boldsymbol \beta \partial \boldsymbol \beta^T} = - \frac{\partial s(\boldsymbol \beta)}{\partial \boldsymbol \beta^T} = - \sum_{i=1}^n \frac{\partial s_i(\boldsymbol \beta)}{\partial \boldsymbol \beta^T}, \end{align}\]

where

\[\begin{align} \frac{\partial s_i(\boldsymbol \beta)}{\partial \boldsymbol \beta^T} &= \frac{\partial (y_i-\lambda_i) \mathbf{x_i}}{\partial \boldsymbol \beta^T}\\ &= \frac{\partial (y_i-\lambda_i)\mathbf{x_i}}{\partial \lambda_i} \frac{\partial \lambda_i}{\partial \boldsymbol \beta^T}\\ &= -\mathbf{x_i}\lambda_i\mathbf{x_i}^T, \end{align}\]

therefore

\[ H(\boldsymbol\beta) = \sum_{i=1}^n \mathbf{x_i}\mathbf{x_i}^T \lambda_i = X^T \operatorname{diag}(\lambda_1,\dots,\lambda_n) X. \] Finally, the expected Fisher information is equal to the observed Fisher information

\[ F(\boldsymbol \beta) = E[H(\boldsymbol \beta)] = E\left[\sum_{i=1}^n \mathbf{x_i}\mathbf{x_i}^T \lambda_i\right] = \sum_{i=1}^n \mathbf{x_i}\mathbf{x_i}^T \lambda_i = X^T \operatorname{diag}(\lambda_1,\dots,\lambda_n) X, \] since \(H(\boldsymbol \beta)\) does not depend on \(y_i.\)
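
For concreteness, the log-likelihood, score and Fisher information can be coded directly from these formulas. The following is a minimal sketch (the helper names loglik, score and fisher_info are ours and are not used in the implementation below):

loglik = function(beta, X, y) {
  eta = as.vector(X %*% beta)             # linear predictor x_i^T beta
  sum(y * eta - exp(eta) - lgamma(y + 1)) # l(beta); lgamma(y + 1) = log(y!)
}
score = function(beta, X, y) {
  eta = as.vector(X %*% beta)
  t(X) %*% (y - exp(eta))                 # s(beta) = X^T (y - lambda)
}
fisher_info = function(beta, X) {
  eta = as.vector(X %*% beta)
  t(X) %*% (exp(eta) * X)                 # F(beta) = X^T diag(lambda) X
}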

b. Implementation

The implementation is based on the Fisher scoring algorithm

\[ \beta^{(t+1)}=\beta^{(t)} + F^{-1}(\beta^{(t)}) s(\beta^{(t)}), \,\,\, t= 0,1,2,\ldots \]

myglm = function(formula, data, start = rep(0, ncol(model.matrix(formula, data)))) {

  y = data$y
  X = model.matrix(formula, data)
  beta = start
  s = 1
  while (s > 1e-10) {
    eta = as.vector(X %*% beta)                # linear predictor
    score = t(X) %*% (y - exp(eta))            # score function s(beta)
    Fisherinfo = t(X) %*% diag(exp(eta)) %*% X # Fisher information F(beta)
    s = sum(score^2)                           # squared norm of the score (convergence criterion)
    beta = beta + solve(Fisherinfo) %*% score  # Fisher scoring step
  }

  vcov_mat = solve(Fisherinfo)                 # asymptotic covariance matrix F^{-1}
  mu_hat = exp(X %*% beta)                     # fitted means
  log.cand = dpois(y, mu_hat, log = TRUE)      # log-likelihood of candidate model
  log.sat  = dpois(y, y, log = TRUE)           # log-likelihood of saturated model
  D = 2 * sum(log.sat - log.cand)              # deviance
  return(list(coef = beta, deviance = D, vcov = vcov_mat))
}

The deviance compares the candidate model to the saturated model, in which each observation has its own \(\lambda_i\) and the MLE is \(\hat \lambda_i = y_i\):

\[\begin{align} D &= -2\sum_{i=1}^n(l_i(\text{candidate}) - l_i(\text{saturated})) \\ &= 2\sum_{i=1}^n[y_i \log(y_i) - y_i - y_i \log(\hat \lambda_i) + \hat\lambda_i]\\ &= 2\sum_{i=1}^n\left[y_i \log\frac{y_i}{\hat \lambda_i} + \hat \lambda_i - y_i\right] \end{align}\]

Note that we have many zero counts (i.e. \(y_i = 0\)) and \(\log 0 = - \infty.\) To avoid this problem we evaluate the saturated log-likelihood directly with dpois in the code above, which corresponds to the convention \(0 \cdot \log 0 = 0.\)
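
As a sanity check, the closed-form deviance with this convention agrees with the dpois-based computation inside myglm. A small sketch (the name dev_closed is ours):

dev_closed = function(y, lambda) {
  term = ifelse(y == 0, 0, y * log(y / lambda)) # convention: 0 * log(0) = 0
  2 * sum(term + lambda - y)
}
# dev_closed(y, exp(X %*% beta)) reproduces the deviance D returned by myglm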

c. Test function

ns = 1e3
beta0 = 1; beta1 = 3
x = runif(ns)
y = rpois(ns, lambda = exp(beta0 + beta1*x))
dat = data.frame(y,x)
myglm(y~x, data = dat)
## $coef
##                 [,1]
## (Intercept) 1.008061
## x           2.991456
## 
## $deviance
## [1] 996.8517
## 
## $vcov
##               (Intercept)             x
## (Intercept)  0.0005833756 -0.0007306007
## x           -0.0007306007  0.0010152005
glm_mod = glm(y~x, data = dat, family = "poisson")
coef(glm_mod)
## (Intercept)           x 
##    1.008061    2.991456
deviance(glm_mod)
## [1] 996.8517
vcov(glm_mod)
##               (Intercept)             x
## (Intercept)  0.0005833740 -0.0007305989
## x           -0.0007305989  0.0010151984

Problem 2

load(url("https://www.math.ntnu.no/emner/TMA4315/2020h/hoge-veluwe.Rdata"))
d = data
head(d)
##    y  t
## 1  6 16
## 2  7 16
## 3 10 17
## 4  7 16
## 5  9  9
## 6  7 18

a.

  • \(\theta\) is the optimal time to produce the fledglings

  • \(\lambda_0\) is the expected number of fledglings at the optimal time (i.e. the maximum expected number of fledglings)

  • \(\omega\) controls the width of the curve around \(\theta\) (it plays the role of a standard deviation): it determines how quickly the expected number of fledglings drops off away from the optimal time

b.

\[\begin{align} \lambda_0 \exp(-\frac{(t-\theta)^2}{2\omega^2}) &= \exp(-\frac{(t-\theta)^2}{2\omega^2} + \log \lambda_0)\\ & = \exp(-\frac{\theta^2}{2\omega^2}+\log \lambda_0 + \frac{2\theta t}{2\omega^2} - \frac{t^2}{2\omega^2}) \end{align}\]

\[\begin{align} \beta_0 &= -\frac{\theta^2}{2\omega^2}+\log \lambda_0 \\ \beta_1 &= \frac{2\theta }{2\omega^2} \\ \beta_2 &= - \frac{1}{2\omega^2} \end{align}\]

c.

mod = myglm(formula = y~1+t+I(t^2), data = d)
mod$coef
##                     [,1]
## (Intercept)  1.420130461
## t            0.085183057
## I(t^2)      -0.003298608

d.

In order to test the significance of the quadratic effect, i.e.

\[H_0: \beta_2=0 \quad vs \quad H_1:\beta_2 \neq 0 \]

we can use the Wald test or the likelihood ratio test (LRT).

  1. The Wald statistic in this case reduces to \[w = \frac{\hat \beta_2}{\sqrt{\operatorname {Var}(\hat \beta_2)}},\] which is asymptotically \(N(0,1)\) under \(H_0.\)

  2. For the LRT we have that \(2(l_1(\hat \beta) - l_0(\hat \beta))\sim\chi_1^2\) asymptotically under \(H_0.\) Equivalently we can write \[2(l_1(\hat \beta) - l_0(\hat \beta)) = 2(l_{saturated} - l_0(\hat \beta) - l_{saturated}+l_1(\hat \beta)) = D_0 - D_1\sim \chi^2_1.\]

mod2 = myglm(formula = y~1+t, data = d)
LRT =  mod2$deviance - mod$deviance
pchisq(LRT, df = 1, lower.tail = F)
## [1] 0.0005524157
wald = mod$coef[3]/sqrt(mod$vcov[3,3]) # Wald statistic (negative here)
(pval = 2*pnorm(wald, 0, 1, lower.tail = TRUE)) # two-sided p-value, 2*P(Z < wald)
## [1] 0.001213743
  • Both tests indicate that the quadratic effect is significant.

e.

(D = mod$deviance)
## [1] 277.4613
qchisq(0.95, df = nrow(d) - length(mod$coef))
## [1] 159.8135

Since \(D = 277.5 > \chi^2_{0.95,\,132} = 159.8,\) we reject the hypothesis that the model fits the data well.

library(ggplot2)
ggplot(data.frame(y = d$y), aes(x = y)) + geom_bar() + theme_minimal()

We observe many zero counts, so the Poisson assumption appears to be violated. A zero-inflated model may be more suitable.
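
To quantify the excess of zeros, we can compare the observed number of zero counts with the number expected under the fitted Poisson model. A quick sketch using the model from c) (the name lam_fit is ours):

lam_fit = exp(model.matrix(y ~ 1 + t + I(t^2), data = d) %*% mod$coef)
c(observed = sum(d$y == 0), expected = sum(dpois(0, lam_fit)))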

f.

From b) we have that

\[\begin{align} \beta_0 &= -\frac{\theta^2}{2\omega^2}+\log \lambda_0 \\ \beta_1 &= \frac{2\theta }{2\omega^2} \\ \beta_2 &= - \frac{1}{2\omega^2} \end{align}\]

By solving the system we can obtain the following relationships between \(\beta_0, \beta_1, \beta_2\) and \(\omega, \theta\) and \(\lambda_0\):

\[\begin{align} \omega^2 &= -\frac{1}{2\beta_2} \\ \theta &= -\frac{\beta_1 }{2\beta_2} \\ \lambda_0 &= \exp\left(\beta_0 + \frac{\theta^2}{2\omega^2}\right) = \exp\left(\beta_0 - \frac{\beta_1^2}{4\beta_2}\right) \end{align}\]

b0 = mod$coef[1]; b1 = mod$coef[2]; b2 = mod$coef[3]
omega_hat = sqrt(-1/(2*b2))
theta_hat = -b1/(2*b2)
lambda_hat = exp(b0 - b1^2/(4*b2))

c(omega_hat = omega_hat,theta_hat = theta_hat,lambda_hat=lambda_hat)
##  omega_hat  theta_hat lambda_hat 
##  12.311746  12.911969   7.171186
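
As a consistency check of the reparametrization in b), the Gaussian-shaped curve evaluated at \((\hat\omega, \hat\theta, \hat\lambda_0)\) should coincide with the fitted GLM curve. A small sketch reusing the estimates above:

t_grid = seq(min(d$t), max(d$t), length.out = 5)
curve_glm = exp(b0 + b1*t_grid + b2*t_grid^2)                           # GLM parametrization
curve_gauss = lambda_hat * exp(-(t_grid - theta_hat)^2/(2*omega_hat^2)) # Gaussian parametrization
all.equal(curve_glm, curve_gauss) # should be TRUE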

Delta Method

It holds that

\[ \operatorname{Var}(G(\boldsymbol\beta)) \approx \nabla G(\boldsymbol\beta)^T \cdot \operatorname{Cov}(\boldsymbol\beta) \cdot \nabla G(\boldsymbol\beta) \]

We can write

  • \(\omega=G_1(\boldsymbol \beta) = \sqrt{-\frac{1}{2\beta_2}}\) and \(\nabla G_1(\boldsymbol \beta)^T = [0\,\,\,0\,\,\,\frac{1}{2^{3/2}(-\beta_2)^{3/2}}]\)

  • \(\theta=G_2(\boldsymbol \beta) = -\frac{\beta_1}{2\beta_2}\) and \(\nabla G_2(\boldsymbol \beta)^T = [0\,\,\,-\frac{1}{2\beta_2}\,\,\,\frac{\beta_1}{2\beta_2^2}]\)

cov_X = mod$vcov
grad_omega = c(0, 0, (-2*b2)^(-3/2))       # gradient of G1 at beta-hat
grad_theta = c(0, -1/(2*b2), b1/(2*b2^2))  # gradient of G2 at beta-hat
var_omega = t(grad_omega) %*% cov_X %*% grad_omega
sd_omega = drop(sqrt(var_omega))
var_theta = t(grad_theta) %*% cov_X %*% grad_theta
var_theta = drop(var_theta)
sd_theta = sqrt(var_theta)
c(var_omega = drop(var_omega) ,var_theta = drop(var_theta))
## var_omega var_theta 
##  3.619607  2.590123
c(sd_omega = sd_omega, sd_theta = sd_theta)
## sd_omega sd_theta 
## 1.902526 1.609386
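
The analytic gradients can be cross-checked with central finite differences. A throwaway sketch (the names num_grad, G1 and G2 are ours):

num_grad = function(G, beta, h = 1e-6) {
  sapply(seq_along(beta), function(j) {
    e = replace(numeric(length(beta)), j, h)
    (G(beta + e) - G(beta - e))/(2*h) # central difference in coordinate j
  })
}
beta_hat = as.vector(mod$coef)
G1 = function(b) sqrt(-1/(2*b[3])) # omega as a function of beta
G2 = function(b) -b[2]/(2*b[3])    # theta as a function of beta
rbind(analytic = grad_omega, numeric = num_grad(G1, beta_hat))
rbind(analytic = grad_theta, numeric = num_grad(G2, beta_hat))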

g.

We would like to test the following hypothesis

\[ H_0:\mu = \theta \quad vs \quad H_1:\mu \neq \theta, \] or equivalently

\[ H_0:\mu - \theta = 0 \quad vs \quad H_1:\mu - \theta \neq0. \]

The central limit theorem implies that the mean breeding time \(\hat \mu\) is approximately normally distributed, and the same holds for the ML estimator \(\hat \theta\) since the sample size is relatively large. Hence \(\hat\mu - \hat\theta\) is approximately normally distributed as a linear combination of Gaussian random variables, with \(E(\hat\mu - \hat\theta) = \mu-\theta.\) To account for the uncertainty in both estimates we assume that \(\hat \mu\) and \(\hat \theta\) are independent, and therefore

\[\operatorname{Var}(\hat\mu - \hat\theta) = \operatorname{Var}\hat\mu + \operatorname{Var}\hat\theta, \] where \(\operatorname{Var}\hat\mu = \operatorname{Var}(\frac{1}{n}\sum_{i=1}^nt_i) = \frac{\sigma^2}{n}\) and \(\operatorname{Var}\hat\theta\) is given in f).

Thus we obtain the following statistic

\[ Z = \frac{(\hat\mu - \hat\theta)-(\mu - \theta)}{\sqrt{\operatorname{Var}\hat{\mu} + \operatorname{Var}\hat\theta}} \sim N(0,1). \]

Under \(H_0,\) \(\mu-\theta = 0\) and we can calculate the two-sided p-value as follows

mu_hat = mean(d$t)
# Under H0: mu - theta = 0
Z_stat = (mu_hat - theta_hat)/sqrt(var(d$t)/nrow(d) + var_theta)
2*pnorm(Z_stat, lower.tail = FALSE) # two-sided p-value (Z_stat > 0 here)
## [1] 0.06860242
  • Since the p-value is greater than \(\alpha = 0.05,\) we cannot reject the null hypothesis that the mean breeding time equals the estimated optimal time.

Problem 3

We estimate the covariance matrix of \(\hat{\boldsymbol\beta}\) by a parametric bootstrap: we simulate \(B\) response vectors from the fitted model in Problem 2 c), refit the model to each simulated dataset, and compare the empirical covariance of the refitted coefficients with the inverse Fisher information returned by myglm.
B = 1000                      # number of bootstrap replicates
beta = matrix(mod$coef, ncol = 1)
mm = model.matrix(y ~ t + I(t^2), data = d)
eta = mm %*% beta             # linear predictor at the fitted coefficients
n = nrow(d)
coeffs = matrix(NA, nrow = B, ncol = nrow(beta))

for (b in 1:B) {
  y = rpois(n, lambda = exp(eta))  # simulate responses from the fitted model
  dat = data.frame(y = y, t = d$t)
  coeffs[b, ] = myglm(y ~ t + I(t^2), data = dat)$coef # refit
}

cov(coeffs) # bootstrap estimate of Cov(beta-hat)
##               [,1]          [,2]          [,3]
## [1,]  0.0724389874 -8.547311e-03  2.375257e-04
## [2,] -0.0085473110  1.080811e-03 -3.187895e-05
## [3,]  0.0002375257 -3.187895e-05  9.984857e-07
mod$vcov # asymptotic estimate from the inverse Fisher information
##               (Intercept)             t        I(t^2)
## (Intercept)  0.0797693783 -9.308596e-03  2.550195e-04
## t           -0.0093085957  1.159672e-03 -3.369024e-05
## I(t^2)       0.0002550195 -3.369024e-05  1.039306e-06