What is the definition of the log-likelihood?
For the logit model the log-likelihood for grouped data is \[l(\boldsymbol{\beta})=\sum_{j=1}^G\left[\ln \binom{n_j}{y_j}+ y_j \ln \pi_j-y_j\ln(1-\pi_j)+n_j\ln(1-\pi_j)\right].\] Explain how we arrive at this formula.
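A sketch of the derivation (as a hint; these steps are standard): each group count \(Y_j\) is binomial with size \(n_j\) and success probability \(\pi_j\), and the groups are independent, so the likelihood is the product of the binomial point probabilities, \[L(\boldsymbol{\beta})=\prod_{j=1}^G \binom{n_j}{y_j}\pi_j^{y_j}(1-\pi_j)^{n_j-y_j}.\] Taking the logarithm gives \[l(\boldsymbol{\beta})=\sum_{j=1}^G\left[\ln\binom{n_j}{y_j}+y_j\ln\pi_j+(n_j-y_j)\ln(1-\pi_j)\right],\] which equals the formula above after expanding \((n_j-y_j)\ln(1-\pi_j)\).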
Write the version of the log-likelihood for individual data (i.e., \(n_j=1\) and \(G=n\)).
Where is \(\boldsymbol{\beta}\) in the log-likelihood above? Rewrite it to be a function of \(\boldsymbol{\beta}\).
Why can we ignore the normalizing constant (what is the constant?) in the case of \(n_j = 1 \ \forall j\)? Considering what the log-likelihood is used for, why can we ignore the normalizing constant in all cases (i.e., also when \(n_j \neq 1\))?
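As a hint for the last two questions (a sketch, using the logit link \(\pi_j=\exp({\bf x}_j^T\boldsymbol{\beta})/(1+\exp({\bf x}_j^T\boldsymbol{\beta}))\)): substituting this into the log-likelihood gives \[l(\boldsymbol{\beta})=\sum_{j=1}^G\left[y_j\,{\bf x}_j^T\boldsymbol{\beta}-n_j\ln\left(1+\exp({\bf x}_j^T\boldsymbol{\beta})\right)\right]+\text{constant},\] which is exactly the form implemented in the R code below.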
What does the graph of \(l\) as a function of \(\boldsymbol{\beta}\) look like for the beetle data? Discuss briefly first; then, to aid you in answering this, we look at the log-likelihood for the beetle data. Read the R code, discuss what is done, and work on interpreting the final graph - in particular, comment on the yellow ridge in the plot.
The beetle data has only one covariate (in addition to the intercept), so \(\boldsymbol{\beta}=(\beta_0,\beta_1)\). Look at the following code and explain what is done. Remark: we have used the \(n_j=1\) version of the log-likelihood here.
library(investr)
library(ggplot2)
library(viridis)
# convert from aggregated (grouped) data to individual data
ldose <- rep(investr::beetle$ldose, investr::beetle$n)
y <- NULL
# for each dose group: n[i] - y[i] zeros (survived), then y[i] ones (killed)
for (i in 1:8) y <- c(y, rep(0, investr::beetle$n[i] - investr::beetle$y[i]), rep(1, investr::beetle$y[i]))
beetleds <- data.frame(killed = y, ldose = ldose)
loglik <- function(par, args){
  y <- args$y; x <- args$x; n <- args$n
  # sum_j [y_j * x_j^T beta - n_j * log(1 + exp(x_j^T beta))]
  # (the binomial-coefficient constant is dropped)
  res <- sum(y*x%*%par - n*log(1 + exp(x%*%par)))
  return(res)
}
loglik(c(1,1), args = list(y = beetleds$killed,
                           x = cbind(rep(1, nrow(beetleds)), beetleds$ldose),
                           n = rep(1, nrow(beetleds))))
## [1] -549.2543
# evaluate the log-likelihood over a 100 x 100 grid of (beta_0, beta_1) values
loglikmat <- matrix(NA, nrow = 100, ncol = 100)
loglikframe <- data.frame()
beta_0 <- seq(-90, -30, length.out = 100)
beta_1 <- seq(20, 50, length.out = 100)
for (i in 1:length(beta_0)){
  for (j in 1:length(beta_1)){
    loglikmat[i, j] <- loglik(c(beta_0[i], beta_1[j]),
                              args = list(y = beetleds$killed,
                                          x = cbind(rep(1, nrow(beetleds)), beetleds$ldose),
                                          n = rep(1, nrow(beetleds))))
    loglikframe <- rbind(loglikframe, c(beta_0[i], beta_1[j], loglikmat[i, j]))
  }
}
names(loglikframe) <- c("beta_0", "beta_1", "loglik")
head(loglikframe)
## beta_0 beta_1 loglik
## 1 -90 20.00000 -15545.83
## 2 -90 20.30303 -15384.56
## 3 -90 20.60606 -15223.28
## 4 -90 20.90909 -15062.01
## 5 -90 21.21212 -14900.73
## 6 -90 21.51515 -14739.46
ggplot(data = loglikframe, mapping = aes(x = beta_0, y = beta_1, z = loglik)) +
  geom_raster(aes(fill = exp(0.0001*loglik))) +
  geom_point(data = loglikframe[which.max(loglikframe$loglik),],
             mapping = aes(x = beta_0, y = beta_1),
             size = 5, col = "red", shape = 21, stroke = 2) +
  scale_shape(solid = FALSE) +
  scale_fill_viridis() +
  geom_contour(col = "black")
Comments to the code: the loglik function has two arguments: par = the parameters to be estimated, and args = a list with the data. The reason for having only these two arguments is that this makes the function easier to use when we later perform optimization (with optim) of the log-likelihood to find the ML estimates.
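As a side note (a sketch, not in the original code): the double loop can be replaced by a vectorized grid construction, assuming beetleds and loglik from above are in the workspace.
grid <- expand.grid(beta_0 = seq(-90, -30, length.out = 100),
                    beta_1 = seq(20, 50, length.out = 100))
args <- list(y = beetleds$killed,
             x = cbind(1, beetleds$ldose),
             n = rep(1, nrow(beetleds)))
# apply loglik to each (beta_0, beta_1) row of the grid
grid$loglik <- apply(grid[, c("beta_0", "beta_1")], 1, loglik, args = args)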
(We did not cover this in the week 1 lecture, but we know one of the definitions from Module 2. Either skip Problem 4 and move on to Problem 5, or look at the sections “Properties of the score function” and “The expected Fisher information matrix \(F(\boldsymbol{\beta})\)” together.)
What is the definition of the expected (and the observed) Fisher information matrix? What is the dimension of this matrix (these matrices)?
What is the role of these matrices in ML estimation?
For the logit model with grouped data the expected and the observed Fisher information matrix are equal and given as
\[F(\boldsymbol{\beta})=\sum_{j=1}^G {\bf x}_j {\bf x}_j^T n_j \pi_j (1-\pi_j)\] Where is \(\boldsymbol{\beta}\) in this expression?
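As an illustration (a sketch, not part of the original text), this expression can be evaluated numerically for the beetle data with the \(n_j=1\) formulation from above:
# F(beta) = sum_j x_j x_j^T n_j pi_j (1 - pi_j), written as t(x) %*% (w * x)
fisher <- function(par, x, n) {
  pii <- as.vector(1/(1 + exp(-x %*% par)))  # pi_j under the logit model
  w <- n * pii * (1 - pii)                   # weights n_j pi_j (1 - pi_j)
  t(x) %*% (w * x)
}
fisher(c(-60.717, 34.270),  # evaluated near the ML estimate found below
       x = cbind(1, beetleds$ldose), n = rep(1, nrow(beetleds)))
Since the expected and observed information coincide for the logit link, this should match minus the Hessian returned by optim below.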
To find the ML estimate for \(\boldsymbol{\beta}\) we may either use the function glm or optimize the log-likelihood manually. We will do both.
First we use the glm function in R, and we also check that the individual and the grouped data give the same parameter estimates for \(\boldsymbol{\beta}\). Read the R code, notice the different input structures, and check the results.
# the beetleds data frame was made above
fitind <- glm(killed ~ ldose, family = "binomial", data = beetleds)  # individual data
summary(fitind)
##
## Call:
## glm(formula = killed ~ ldose, family = "binomial", data = beetleds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -60.717 5.181 -11.72 <2e-16 ***
## ldose 34.270 2.912 11.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.44 on 480 degrees of freedom
## Residual deviance: 372.47 on 479 degrees of freedom
## AIC: 376.47
##
## Number of Fisher Scoring iterations: 5
fitgrouped <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)  # grouped data: the response is the number of successes AND the number of failures (here a dead beetle counts as a success)
summary(fitgrouped)
##
## Call:
## glm(formula = cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -60.717 5.181 -11.72 <2e-16 ***
## ldose 34.270 2.912 11.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 284.202 on 7 degrees of freedom
## Residual deviance: 11.232 on 6 degrees of freedom
## AIC: 41.43
##
## Number of Fisher Scoring iterations: 4
What is the default convergence criterion for glm? (Note: glm uses IRWLS - more in Module 5.)
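A hint (a sketch, using base R): the defaults used by glm are documented in ?glm.control and can be inspected directly.
glm.control()  # prints epsilon (the convergence tolerance), maxit and trace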
We implemented the log-likelihood as a function in item 2 above. Now we will use this together with the optim function on the beetle data set to optimize the log-likelihood. Read the R code, take notice of how we put the data into the args slot and how the optimization is called with optim. (In Compulsory exercise 2 you will use this in a Poisson regression.)
loglik_gr <- function(par, args) {
  y <- args$y
  x <- args$x
  n <- args$n
  # gradient of the log-likelihood: sum_j (y_j - n_j * pi_j) * x_j^T,
  # where pi_j = (1 + exp(-x_j^T beta))^(-1); returned as a row vector
  res <- y %*% x - t(t(n * x) %*% ((1 + exp(-x %*% par))^(-1)))
  return(res)
}
opt <- optim(c(-60, 30), fn = loglik, gr = loglik_gr,
             args = list(y = beetleds$killed,
                         x = cbind(rep(1, nrow(beetleds)), beetleds$ldose),
                         n = rep(1, nrow(beetleds))),
             control = list(fnscale = -1),  # fnscale = -1 turns minimization into maximization
             hessian = TRUE, method = "BFGS")
opt
## $par
## [1] -60.71748 34.27034
##
## $value
## [1] -186.2354
##
## $counts
## function gradient
## 24 9
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $hessian
## [,1] [,2]
## [1,] -58.48417 -104.0105
## [2,] -104.01047 -185.0941
sqrt(diag(-solve(opt$hessian)))  # standard errors of the estimates; compare with the Std. Error column from glm above
## [1] 5.180709 2.912138
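As a final check (a small sketch, assuming fitind from the glm fit above), the optim results can be compared directly with glm:
cbind(optim = opt$par, glm = coef(fitind))  # the two sets of estimates agree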