We assume that \(Y_i \sim \operatorname{Poisson}(\lambda)\) with pmf \(f(y_i) = \frac{\exp(-\lambda)\lambda^{y_i}}{y_i!}.\)
The canonical link function for Poisson regression is the log function and we can write
\[ \log(\lambda_i) = \mathbf{x_i}^T \boldsymbol{\beta} \]
and therefore
\[ \lambda_i = exp(\mathbf{x_i}^T \boldsymbol{\beta}). \] Assuming independence the log-likelihood is derived as follows
\[\begin{align} l(\beta) &= \log L(\beta) \\ & = \log \prod_{i=1}^n \frac{\exp(-\lambda_i)\lambda_i^{y_i}}{y_i!}\\ & = \sum_{i=1}^ny_i\log(\lambda_i) - \sum_{i=1}^n\lambda_i - \sum_{i=1}^n\log(y_i!)\\ & = \sum_{i=1}^ny_i\mathbf{x_i}^T\boldsymbol{\beta} - \sum_{i=1}^n\exp(\mathbf{x_i}^T\boldsymbol{\beta}) - \sum_{i=1}^n\log(y_i!) \end{align}\]
The score function is
\[ s(\boldsymbol \beta) = \frac{\partial l (\boldsymbol \beta)}{\partial \boldsymbol \beta} = \sum_{i=1}^n[y_i - \exp(\mathbf{x_i}^T\boldsymbol\beta)]\mathbf{x_i} = \sum_{i=1}^n(y_i - \lambda_i)\mathbf{x_i} = X^T(\mathbf y - \boldsymbol \lambda) \]
For the observed Fisher information we have
\[\begin{align} H(\beta) &= -\frac{\partial ^2L(\boldsymbol \beta)}{\partial \boldsymbol \beta \partial \boldsymbol \beta^T} = - \frac{\partial s(\boldsymbol \beta)}{\partial \boldsymbol \beta^T} = - \sum_{i=1}^n \frac{\partial s_i(\boldsymbol \beta)}{\partial \boldsymbol \beta^T}, \end{align}\]
where
\[\begin{align} \frac{\partial s_i(\boldsymbol \beta)}{\partial \boldsymbol \beta^T} &= \frac{\partial (y_i-\lambda_i) \mathbf{x_i}}{\partial \boldsymbol \beta^T}\\ &= \frac{\partial (y_i-\lambda_i)\mathbf{x_i}}{\partial \lambda_i} \frac{\partial \lambda_i}{\partial \boldsymbol \beta^T}\\ &= -\mathbf{x_i}\lambda_i\mathbf{x_i}^T, \end{align}\]
therefore
\[ H(\beta) = \sum_{i=1}^n \mathbf{x_i}\mathbf{x_i}^T \lambda_i = X^T diag(\lambda_1,\dots,\lambda_n) X \] Finally, the expected Fisher information is equal to the observed Fisher information
\[ F(\boldsymbol \beta) = E[H(\boldsymbol \beta)] = E[\sum_{i=1}^n \mathbf{x_i}\mathbf{x_i}^T \lambda_i] = \sum_{i=1}^n \mathbf{x_i}\mathbf{x_i}^T \lambda_i= X^T diag(\lambda_1,\dots,\lambda_n) X, \] since it is independent of \(y_i.\)
The imlememtation is based on the Fisher scoring algorithm
\[ \beta^{(t+1)}=\beta^{(t)} + F^{-1}(\beta^{(t)}) s(\beta^{(t)}), \,\,\, t= 0,1,2,\ldots \]
myglm = function(formula, data, start = rep(0, ncol(model.matrix(formula,data)))) {
y = data$y
X = model.matrix(formula,data)
beta = start
s=1
while (s>1e-10) {
eta = as.vector(X %*% beta) # linear predictor
score =t(X) %*% (y - exp(eta)) # score function
Fisherinfo = t(X) %*% diag(exp(eta)) %*% X # Fisher information matrix
s = sum(score^2)
beta = beta + solve(Fisherinfo) %*% score
}
vcov_mat = solve(Fisherinfo) # covariance matrix
mu_hat = exp(X%*%beta)
log.cand = dpois(y, mu_hat, log = T)
log.sat = dpois(y, y, log = T)
D = 2 * sum(log.sat - log.cand) # deviance
return(list(coef = beta, deviance = D, vcov = vcov_mat))
}
The saturated model deviance is the deviance of a model where each observation has there own unique \(\lambda_i\) and the MLE of \(\hat \lambda_i = y_i\)
\[\begin{align} D &= -2\sum_{i=1}^n(l(candidate) - l(saturated)) \\ &= 2\sum_{i=1}^n[y_i \log(y_i) - y_i - y_i \log(\hat \lambda_i) + \lambda_i]\\ &= 2\sum_{i=1}^n[y_i \log\frac{(y_i)}{\hat \lambda_i} + \hat \lambda_i - y_i] \end{align}\]
Note that we have many 0 counts (i.e. \(y_i = 0\)) and \(\log0 = - \infty.\) To avoid this problem we can use the definition of Deviance (see code).
ns = 1e3
beta0 = 1; beta1 = 3
x = runif(ns)
y = rpois(ns, lambda = exp(beta0 + beta1*x))
dat = data.frame(y,x)
myglm(y~x, data = dat)
## $coef
## [,1]
## (Intercept) 1.008061
## x 2.991456
##
## $deviance
## [1] 996.8517
##
## $vcov
## (Intercept) x
## (Intercept) 0.0005833756 -0.0007306007
## x -0.0007306007 0.0010152005
glm_mod = glm(y~x, data = dat, family = "poisson")
coef(glm_mod)
## (Intercept) x
## 1.008061 2.991456
deviance(glm_mod)
## [1] 996.8517
vcov(glm_mod)
## (Intercept) x
## (Intercept) 0.0005833740 -0.0007305989
## x -0.0007305989 0.0010151984
load(url("https://www.math.ntnu.no/emner/TMA4315/2020h/hoge-veluwe.Rdata"))
d = data
head(d)
## y t
## 1 6 16
## 2 7 16
## 3 10 17
## 4 7 16
## 5 9 9
## 6 7 18
\(\theta\) is the optimal time to produce the fledglings
\(\lambda_0\) is the number of fledgings at the optimal time (i.e. the maximum expected number of fledgings)
\(\omega\) is the sd of time of breeding initiation
\[\begin{align} \lambda_0 \exp(-\frac{(t-\theta)^2}{2\omega^2}) &= \exp(-\frac{(t-\theta)^2}{2\omega^2} + \log \lambda_0)\\ & = \exp(-\frac{\theta^2}{2\omega^2}+\log \lambda_0 + \frac{2\theta t}{2\omega^2} - \frac{t^2}{2\omega^2}) \end{align}\]
\[\begin{align} \beta_0 &= -\frac{\theta^2}{2\omega^2}+\log \lambda_0 \\ \beta_1 &= \frac{2\theta }{2\omega^2} \\ \beta_2 &= - \frac{1}{2\omega^2} \end{align}\]
mod = myglm(formula = y~1+t+I(t^2), data = d)
mod$coef
## [,1]
## (Intercept) 1.420130461
## t 0.085183057
## I(t^2) -0.003298608
In order to test the significance of the quadratic effect, i.e.
\[H_0: \beta_2=0 \quad vs \quad H_1:\beta_2 \neq 0 \]
we can use the wald test or the likelihood ratio test.
The Wald statistic in this case reduces to \[w = \frac{\hat \beta_2}{\operatorname {Var}(\hat \beta_2)}\]
For the LRT we have that \(2(l_1(\hat \beta) - l_0(\hat \beta))\sim\chi_1^2.\) Equivalently we can write \[2(l_1(\hat \beta) - l_0(\hat \beta)) = 2(l_{saturated} - l_0(\hat \beta) - l_{saturated}+l_1(\hat \beta)) = D_0 - D_1\sim \chi^2_1.\]
mod2 = myglm(formula = y~1+t, data = d)
LRT = mod2$deviance - mod$deviance
pchisq(LRT, df = 1, lower.tail = F)
## [1] 0.0005524157
wald =mod$coef[3]/sqrt(mod$vcov[3,3])
(pval=2*pnorm(wald,0,1,lower.tail = T))
## [1] 0.001213743
(D = mod$deviance)
## [1] 277.4613
qchisq(0.95, df = nrow(d) - length(mod$coef))
## [1] 159.8135
Since \(D = 277> \chi_{135-3} = 159,\) we reject the hypotheses that the model fits the data well.
ggplot(data.frame(y = d$y), aes(x=y)) + geom_bar() + theme_minimal()
We observe that we have many 0 counts and therefore the poisson assumption is violated. A zero inflated model may be more suitable.
From b) we have that
\[\begin{align} \beta_0 &= -\frac{\theta^2}{2\omega^2}+\log \lambda_0 \\ \beta_1 &= \frac{2\theta }{2\omega^2} \\ \beta_2 &= - \frac{1}{2\omega^2} \end{align}\]
By solving the system we can obtain the following relationships between \(\beta_0, \beta_1, \beta_2\) and \(\omega, \theta \text{ and} \lambda_0\)
\[\begin{align} \omega^2 &= -\frac{1}{2\beta_2} \\ \theta &= -\frac{\beta_1 }{2\beta_2} \\ \lambda_0 &= \exp(\beta_0 + \frac{\beta_1^2}{4\beta_2}) \end{align}\]
b0 = mod$coef[1]; b1 = mod$coef[2]; b2 = mod$coef[3]
omega_hat = sqrt(-1/(2*b2))
theta_hat = -b1/(2*b2)
lambda_hat = exp(b0 + b1^2/(4*b2))
c(omega_hat = omega_hat,theta_hat = theta_hat,lambda_hat=lambda_hat)
## omega_hat theta_hat lambda_hat
## 12.311746 12.911969 2.387364
Delta Method
It holds that
\[ Var(G(\theta)) \approx \nabla G(\theta)^T \cdot Cov(\theta) \cdot \nabla G(\theta) \]
We can write
\(\omega=G_1(\boldsymbol \beta) = \sqrt{-\frac{1}{2\beta_2}}\) and \(\nabla G_1(\boldsymbol \beta)^T = [0\,\,\,0\,\,\,\frac{1}{2^{3/2}(-\beta_2)^{3/2}}]\)
\(\theta=G_2(\boldsymbol \beta) = -\frac{\beta_1}{2\beta_2}\) and \(\nabla G_2(\boldsymbol \beta)^T = [0\,\,\,-\frac{1}{2\beta_2}\,\,\,\frac{\beta_1}{2\beta_2^2}]\)
cov_X= mod$vcov
grad_omega = c(0, 0, (-2*b2)^(-3/2))
grad_theta = c(0, -1/(2*b2), b1/(2*b2^2))
var_omega = t(grad_omega) %*% cov_X %*% grad_omega
sd_omega = drop(sqrt(var_omega))
var_theta = t(grad_theta) %*% cov_X %*% grad_theta
var_theta = drop(var_theta)
sd_theta = drop(sqrt(var_theta))
c(var_omega = drop(var_omega) ,var_theta = drop(var_theta))
## var_omega var_theta
## 3.619607 2.590123
c(sd_omega = sd_omega, sd_theta = sd_theta)
## sd_omega sd_theta
## 1.902526 1.609386
We would like to test the following hypothesis
\[ H_0:\mu = \theta \quad vs \quad H_1:\mu \neq \theta, \] or equivalently
\[ H_0:\mu - \theta = 0 \quad vs \quad H_1:\mu - \theta \neq0. \]
The central limit theorem implies that the mean time \(\hat \mu\) is normaly distributed and the same holds for mle estimator \(\hat \theta\) since the sample size is relativley large. Hence \(\hat\mu - \hat\theta\) is normaly distibuted as linear combination of gaussian random variables and \(E(\hat\mu - \hat\theta) = \mu-\theta.\) To account for the uncertainty in the estimates we assume that \(\hat \mu\) and \(\hat \theta\) are independent, and thererefore
\[\operatorname{Var}(\hat\mu - \hat\theta) = \operatorname{Var}\hat\mu + \operatorname{Var}\hat\theta, \] where \(\operatorname{Var}\hat\mu = \operatorname{Var}(\frac{1}{n}\sum_{i=1}^nt_i) = \frac{\sigma^2}{n}\) and \(\operatorname{Var}\hat\theta\) is given in f).
Thus have that the following statistic
\[ Z = \frac{(\hat\mu - \hat\theta)-(\mu - \theta)}{\sqrt{\operatorname{Var}\hat{\mu} + \operatorname{Var}\hat\theta}} \sim N(0,1). \]
Under \(H_0,\) \(\mu-\theta = 0\) and we can calculate the two sided p-value as follows
mu_hat = mean(d$t)
# Under H0 mu-theta = 0
Z_stat = (mu_hat - theta_hat)/sqrt(var(d$t)/nrow(d) + var_theta)
2*pnorm(Z_stat, lower.tail = F)
## [1] 0.06860242
B=1000
beta = matrix(mod$coef, ncol=1)
mm = model.matrix(y~t+I(t^2), data = d)
eta = mm%*%beta
n = nrow(d)
coeffs = matrix(NA, nrow = B, ncol = nrow(beta))
for(b in 1:B){
y = rpois(n, lambda = exp(eta))
dat = data.frame(y = y, t = d$t)
coeffs[b,] = myglm(y~t+I(t^2), data = dat)$coef
}
cov(coeffs)
## [,1] [,2] [,3]
## [1,] 0.0724389874 -8.547311e-03 2.375257e-04
## [2,] -0.0085473110 1.080811e-03 -3.187895e-05
## [3,] 0.0002375257 -3.187895e-05 9.984857e-07
mod$vcov
## (Intercept) t I(t^2)
## (Intercept) 0.0797693783 -9.308596e-03 2.550195e-04
## t -0.0093085957 1.159672e-03 -3.369024e-05
## I(t^2) 0.0002550195 -3.369024e-05 1.039306e-06