Interactive lecture - first week

Theoretical questions - with and without use of R

Problem 1: Model assumptions

  1. What are the model assumptions for a binary regression?
  2. Which link function and response function are used for the logit model? (See the short R note after this list.)
  3. What is the difference between the logit model and a logistic regression?
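
For reference (a small note, not part of the exercise questions), the logit link and its inverse, the logistic response function, are available in R as qlogis and plogis:

eta <- qlogis(0.8)  # link function: eta = log(pi/(1 - pi))
eta
## [1] 1.386294
plogis(eta)         # response function: pi = exp(eta)/(1 + exp(eta))
## [1] 0.8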

Problem 2: Log-likelihood.

  1. What is the definition of the log-likelihood?

  2. For the logit model the log-likelihood is \[l(\boldsymbol{\beta})=\sum_{j=1}^G[\ln \binom{n_j}{y_j}+ y_j \ln \pi_j-y_j\ln(1-\pi_j)+n_j\ln(1-\pi_j)]\] for grouped data. Explain how we arrive at this formula.

  3. Write the version of the log-likelihood for individual data (i.e. \(n_j=1\) and \(G=n\)).

  4. Where is \(\boldsymbol{\beta}\) in the log-likelihood in item 3? Rewrite the log-likelihood so that it is an explicit function of \(\boldsymbol{\beta}\).

  5. Why can we ignore the normalizing constant (what is the constant?) in the case of \(n_j = 1 \ \forall j\)? Considering what the log-likelihood is used for, why can we ignore the normalizing constant in all cases (i.e., also when \(n_j \neq 1\))?

  6. What does the graph of \(l\) look like as a function of \(\boldsymbol{\beta}\) for the beetle data? First discuss briefly; then, to aid you in answering this, we look at the log-likelihood for the beetle data. Read the R code, discuss what is done, and work on interpreting the final graph - in particular, comment on the yellow ridge in the plot.

The beetle data has only one covariate (in addition to the intercept), so we have \(\boldsymbol{\beta}=(\beta_0,\beta_1)\). Look at the following code and explain what is done - remark: we have used the \(n_j=1\) version of the log-likelihood here.

library(investr)
library(ggplot2)
library(viridis)

# expand the aggregated beetle data to individual (Bernoulli) observations
ldose <- rep(investr::beetle$ldose, investr::beetle$n)
y <- NULL
for (i in 1:8) y <- c(y, rep(0, investr::beetle$n[i] - investr::beetle$y[i]), rep(1, investr::beetle$y[i]))
beetleds <- data.frame(killed = y, ldose = ldose)

# log-likelihood for the logit model (normalizing constant omitted):
# l(beta) = sum_j [ y_j * x_j^T beta - n_j * log(1 + exp(x_j^T beta)) ]
loglik <- function(par, args){
  y <- args$y; x <- args$x; n <- args$n
  res <- sum(y*x%*%par - n*log(1 + exp(x%*%par)))
  return(res)
}

loglik(c(1,1), args = list(y = beetleds$killed, 
                           x = cbind(rep(1, nrow(beetleds)), beetleds$ldose), 
                           n = rep(1, nrow(beetleds))))
## [1] -549.2543
# evaluate the log-likelihood on a 100 x 100 grid of (beta_0, beta_1) values
loglikmat <- matrix(NA, nrow = 100, ncol = 100)
loglikframe <- data.frame()
beta_0 <- seq(-90, -30, length.out = 100)
beta_1 <- seq(20, 50, length.out = 100)

for (i in 1:length(beta_0)){
  for (j in 1:length(beta_1)){
    
    loglikmat[i,j] <- loglik(c(beta_0[i], beta_1[j]), args = list(y = beetleds$killed, 
                                                                  x = cbind(rep(1, nrow(beetleds)), beetleds$ldose), 
                                                                  n = rep(1, nrow(beetleds))))
    
    loglikframe <- rbind(loglikframe, c(beta_0[i], beta_1[j], loglikmat[i,j]))
  
  }
}
names(loglikframe) <- c("beta_0", "beta_1", "loglik")
head(loglikframe)
##   beta_0   beta_1    loglik
## 1    -90 20.00000 -15545.83
## 2    -90 20.30303 -15384.56
## 3    -90 20.60606 -15223.28
## 4    -90 20.90909 -15062.01
## 5    -90 21.21212 -14900.73
## 6    -90 21.51515 -14739.46
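
As a side remark (not part of the original code), the same grid of log-likelihood values can be built more compactly with expand.grid instead of the double for-loop:

grid <- expand.grid(beta_0 = beta_0, beta_1 = beta_1)
grid$loglik <- apply(grid, 1, function(b) loglik(b, args = list(y = beetleds$killed, 
                     x = cbind(rep(1, nrow(beetleds)), beetleds$ldose), 
                     n = rep(1, nrow(beetleds)))))
max(grid$loglik)  # same maximum as in loglikframe, since the same grid points are evaluated
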
# plot the log-likelihood surface: the fill uses exp(0.0001*loglik) only to compress the scale,
# and the red circle marks the grid point with the largest log-likelihood
ggplot(data = loglikframe, mapping = aes(x = beta_0, y = beta_1, z = loglik)) + geom_raster(aes(fill = exp(0.0001*loglik))) +
  geom_point(data = loglikframe[which.max(loglikframe$loglik),], mapping = aes(x = beta_0, y = beta_1), 
             size = 5, col = "red", shape = 21, stroke = 2) + scale_shape(solid = FALSE) +
  scale_fill_viridis() + geom_contour(col = "black")

Comments on the code: the loglik function has only two arguments, par (the parameters to be estimated) and args (a list with the data). The reason for using only these two arguments is that this makes the function easy to use later, when we optimize the log-likelihood with optim to find the ML estimates.
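
As a small sanity check (an addition, not part of the original code): for individual data the normalizing constant \(\ln\binom{1}{y_i}\) is zero, so the loglik function should agree exactly with the Bernoulli log-likelihood computed via dbinom:

eta <- cbind(rep(1, nrow(beetleds)), beetleds$ldose) %*% c(1, 1)   # linear predictor at beta = (1, 1)
sum(dbinom(beetleds$killed, size = 1, prob = plogis(as.vector(eta)), log = TRUE))
## [1] -549.2543

which matches the value of loglik(c(1,1), ...) computed above.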


Problem 3: Score function

  1. What is the definition of the score function? What is the dimension of the score function?
  2. Derive the score function for the logit model (individual data); a sketch of the main differentiation step is given after this list. The result should be \[s(\boldsymbol{\beta})=\sum_{i=1}^n {\bf x}_i (y_i-\pi_i)=\sum_{i=1}^n {\bf x}_i (y_i-\frac{\exp({\bf x}_i^T\boldsymbol{\beta})}{1+\exp({\bf x}_i^T\boldsymbol{\beta})})\]
  3. What do we need the score function for?
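
A sketch of the main step for item 2, starting from the individual-data log-likelihood \(l(\boldsymbol{\beta})=\sum_{i=1}^n[y_i {\bf x}_i^T\boldsymbol{\beta}-\ln(1+\exp({\bf x}_i^T\boldsymbol{\beta}))]\): differentiating term by term gives \[s(\boldsymbol{\beta})=\frac{\partial l(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}=\sum_{i=1}^n\left[y_i {\bf x}_i-{\bf x}_i\frac{\exp({\bf x}_i^T\boldsymbol{\beta})}{1+\exp({\bf x}_i^T\boldsymbol{\beta})}\right]=\sum_{i=1}^n {\bf x}_i(y_i-\pi_i),\] using \(\frac{\partial}{\partial\boldsymbol{\beta}}\ln(1+\exp({\bf x}_i^T\boldsymbol{\beta}))={\bf x}_i\frac{\exp({\bf x}_i^T\boldsymbol{\beta})}{1+\exp({\bf x}_i^T\boldsymbol{\beta})}\).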

Problem 4: Fisher information.

(We did not cover this in the lecture in week 1, but we know one of the definitions from Module 2. Either skip Problem 4 and move to Problem 5, or look at the sections “Properties of the score function” and “The expected Fisher information matrix \(F(\boldsymbol{\beta})\)” together; a short reminder of the definitions is also given at the end of this problem.)

  1. What is the definition of the expected (and the observed) Fisher information matrix? What is the dimension of this matrix (these matrices)?

  2. What is the role of these matrices in ML estimation?

  3. For the logit model with grouped data the expected and the observed Fisher information matrices are equal and given as

\[F(\boldsymbol{\beta})=\sum_{j=1}^G {\bf x}_j {\bf x}_j^T n_j \pi_j (1-\pi_j)\] Where is \(\boldsymbol{\beta}\) in this expression?

  4. Write the version of the expected Fisher information for individual data (i.e. \(n_j=1\) and \(G=n\)).
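
As a reminder (from Module 2; stated here without derivation, for those who choose to do Problem 4 now): the observed Fisher information is minus the Hessian of the log-likelihood, and the expected Fisher information is its expectation, which under regularity conditions equals the covariance matrix of the score, \[H(\boldsymbol{\beta})=-\frac{\partial^2 l(\boldsymbol{\beta})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^T},\qquad F(\boldsymbol{\beta})=\text{E}[H(\boldsymbol{\beta})]=\text{Cov}[s(\boldsymbol{\beta})].\]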

Problem 5: Maximum likelihood

To find the ML estimate for \(\boldsymbol{\beta}\) we may either use the function glm or optimize the log-likelihood manually. We will do both.

  1. First we use the glm function in R, and we also check that the individual and the grouped data give the same parameter estimates for \(\boldsymbol{\beta}\). Read the R code, notice the different input structures, and check the results (a quick side-by-side comparison of the estimates is given after the two summaries).
# the beetleds data frame was made in Problem 2 above
fitind <- glm(killed ~ ldose, family = "binomial", data = beetleds) # individual data
summary(fitind)
## 
## Call:
## glm(formula = killed ~ ldose, family = "binomial", data = beetleds)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -60.717      5.181  -11.72   <2e-16 ***
## ldose         34.270      2.912   11.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.44  on 480  degrees of freedom
## Residual deviance: 372.47  on 479  degrees of freedom
## AIC: 376.47
## 
## Number of Fisher Scoring iterations: 5
fitgrouped <- glm(cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle) # grouped data: the response is the number of successes AND the number of failures (here a dead beetle is defined as a "success")
summary(fitgrouped)
## 
## Call:
## glm(formula = cbind(y, n - y) ~ ldose, family = "binomial", data = investr::beetle)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -60.717      5.181  -11.72   <2e-16 ***
## ldose         34.270      2.912   11.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 284.202  on 7  degrees of freedom
## Residual deviance:  11.232  on 6  degrees of freedom
## AIC: 41.43
## 
## Number of Fisher Scoring iterations: 4
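
A quick side-by-side comparison of the two sets of estimates (a small addition, not in the original code):

cbind(individual = coef(fitind), grouped = coef(fitgrouped))  # the two columns should agree
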
  2. What is the default convergence criterion for glm? (Note: IRWLS is used in glm - more in Module 5.)

  3. We implemented the log-likelihood as a function in Problem 2 above. Now we will use it together with the optim function to optimize the log-likelihood for the beetle data set. Read the R code, and notice how we put the data into the args slot and how the optimization is called with optim; as a small extra check, the numerical Hessian is compared with the expected Fisher information from Problem 4 after the output. (In Compulsory exercise 2 you will use this in a Poisson regression.)

# gradient of the log-likelihood, i.e. the score function:
# s(beta) = sum_i x_i * (y_i - n_i * pi_i), with pi_i = (1 + exp(-x_i^T beta))^(-1)
loglik_gr <- function(par, args) {
    
    y <- args$y
    x <- args$x
    n <- args$n
    
    res <- y %*% x - t(t(n * x) %*% ((1 + exp(-x %*% par))^(-1)))
    return(res)
}

# maximize (fnscale = -1) the log-likelihood with BFGS, using the analytic gradient,
# and request the numerically computed Hessian at the optimum
opt <- optim(c(-60, 30), fn = loglik, gr = loglik_gr, args = list(y = beetleds$killed, 
    x = cbind(rep(1, nrow(beetleds)), beetleds$ldose), n = rep(1, nrow(beetleds))), 
    control = list(fnscale = -1), hessian = TRUE, method = "BFGS")
opt 
## $par
## [1] -60.71748  34.27034
## 
## $value
## [1] -186.2354
## 
## $counts
## function gradient 
##       24        9 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $hessian
##            [,1]      [,2]
## [1,]  -58.48417 -104.0105
## [2,] -104.01047 -185.0941
sqrt(diag(-solve(opt$hessian))) # standard errors of the parameter estimates (from the inverse of the negative Hessian)
## [1] 5.180709 2.912138
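
As a small extra check (an addition, not part of the original code), connecting back to Problem 4: at the ML estimate the expected Fisher information \(F(\hat{\boldsymbol{\beta}})=\sum_{i=1}^n {\bf x}_i {\bf x}_i^T \hat\pi_i(1-\hat\pi_i)\) should be close to the negative of the numerical Hessian returned by optim, and its inverse gives the approximate covariance matrix of \(\hat{\boldsymbol{\beta}}\):

X <- cbind(rep(1, nrow(beetleds)), beetleds$ldose)
pihat <- plogis(as.vector(X %*% opt$par))      # estimated probabilities at the ML estimate
F_exp <- t(X) %*% (X * (pihat * (1 - pihat)))  # expected Fisher information F(beta-hat)
F_exp                                          # compare with -opt$hessian
sqrt(diag(solve(F_exp)))                       # standard errors, compare with summary(fitind)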

Problem 6: Interpreting results

  1. Interpret the estimated \(\boldsymbol{\beta}\)'s. The odds ratio is useful for this (a small hint is given after this list).
  2. Plot the predicted probability of a beetle dying against the dosage and discuss what you see. (Yes, since this is the last question you may try to program by yourself!)
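
A small hint for item 1 (an addition, not a full answer): odds ratios can be computed from the fitted coefficients, for example

exp(coef(fitind))                  # odds ratio for a one-unit increase in each covariate
exp(0.01 * coef(fitind)["ldose"])  # odds ratio for a 0.01 increase in ldose

where the latter may be easier to interpret, since ldose only varies over a narrow range in the beetle data.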

References for further reading

  • A. Agresti (2015): “Foundations of Linear and Generalized Linear Models.” Wiley.
  • A. J. Dobson and A. G. Barnett (2008): “An Introduction to Generalized Linear Models”, Third edition.
  • J. Faraway (2015): “Extending the Linear Model with R”, Second Edition. http://www.maths.bath.ac.uk/~jjf23/ELM/
  • P. McCullagh and J. A. Nelder (1989): “Generalized Linear Models”. Second edition.