Hints and reminders are in bold

Questions appear in blue.

Some hints and answers are hidden away in a fold, like this: Well done, you found me!

The exam-style questions are optional, but will be helpful for, well, the exam. They will also help you now to see if you understand the content.

Interactive lecture - second week

We will use a data set on contraceptive use in Fiji (data from 1975). The data is to be analysed with “current use of contraceptive” as response and some (or all of) “age”, “level of education”, “desire for more children” as covariates.

The data is available at https://grodri.github.io/datasets/cuse.dat with the following description:

  • Contraceptive use: yes (using) or no (notUsing)
  • age: categorical variable with 4 levels: “<25”, “25-29”, “30-39”, “40-49”
  • education: categorical variable with 2 levels giving highest level of education obtained: low and high
  • wantsMore: Desires more children: yes or no
ds = read.table("https://grodri.github.io/datasets/cuse.dat", header = TRUE)
names(ds)
## [1] "age"       "education" "wantsMore" "notUsing"  "using"
summary(ds)
##      age             education          wantsMore            notUsing     
##  Length:16          Length:16          Length:16          Min.   :  8.00  
##  Class :character   Class :character   Class :character   1st Qu.: 31.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 56.50  
##                                                           Mean   : 68.75  
##                                                           3rd Qu.: 85.75  
##                                                           Max.   :212.00  
##      using      
##  Min.   : 4.00  
##  1st Qu.: 9.50  
##  Median :29.00  
##  Mean   :31.69  
##  3rd Qu.:49.00  
##  Max.   :80.00
dim(ds)
## [1] 16  5
head(ds)
##     age education wantsMore notUsing using
## 1   <25       low       yes       53     6
## 2   <25       low        no       10     4
## 3   <25      high       yes      212    52
## 4   <25      high        no       50    10
## 5 25-29       low       yes       60    14
## 6 25-29       low        no       19    10

We will study binary regression using the logit model, and we will work with grouped data. So the proportion using contraception is using/(using + notUsing). For grouped data as a response, R wants 2 columns, with successes in the first column and failures in the second. We can write this as cbind(success, failure) ~ X1 + X2.
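As an aside, there is an equivalent way of specifying grouped binomial data that you may come across: give the proportion of successes as the response and the group sizes as prior weights. A minimal sketch, using age as an example covariate:

# equivalent to glm(cbind(using, notUsing) ~ age, data = ds, family = binomial)
glm(using/(using + notUsing) ~ age, weights = using + notUsing, data = ds, family = binomial)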

Plan: Start with Problem 2, then move to 3 and 4, and if you have time, look at Problem 1.

Problem 1: The null model - no covariates included.

(This is the most theoretical of the problems and rather technical - but with some cool results on the null model.)

a) Fit the null model and call it fit0. Explain what you see.

Answer
fit0 = glm(cbind(using, notUsing) ~ 1, data = ds, family = binomial(link = logit))
summary(fit0)
## 
## Call:
## glm(formula = cbind(using, notUsing) ~ 1, family = binomial(link = logit), 
##     data = ds)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.77455    0.05368  -14.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.77  on 15  degrees of freedom
## Residual deviance: 165.77  on 15  degrees of freedom
## AIC: 239.28
## 
## Number of Fisher Scoring iterations: 4

We see some output. The estimate of the intercept is -0.77: this is less than 0 (on the logit scale), so the mean probability that a woman uses contraception is less than 0.5. (The response is written as cbind(success, failure): there are other ways of writing it, see the help page for glm.)
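A quick check on the probability scale (plogis is the inverse of the logit):

plogis(coef(fit0))  # about 0.32, i.e. below 0.5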

The null and residual deviances are the same, because this is the null model.

Observe: When we only have an intercept in the model, the model matrix will be a \(G \times 1\)-matrix (one row per covariate pattern) containing only ones. Then \({\bf x}_j^T\boldsymbol{\beta}=\beta_0\). The log-likelihood can be written as (let \(j=1,\ldots,G\))

\[\begin{align} l(\boldsymbol{\beta}) &=\sum_{j=1}^G (y_j{\bf x}_j^T \boldsymbol{\beta} - n_j \log(1+\exp({\bf x}_j^T\boldsymbol{\beta})))= \sum_{j=1}^G (y_j\beta_0 - n_j \log(1+\exp(\beta_0)))\\ &= \beta_0 \sum_{j=1}^G y_j - \log(1+\exp(\beta_0))\sum_{j=1}^G n_j =\beta_0 N_1 - \log(1+\exp(\beta_0))N \end{align}\]

where \(N_1=\sum_{j=1}^G y_j\) is the total number of successes and \(N=\sum_{j=1}^G n_j\) is the total number of trials (over all covariate patterns, that is, the number of individuals in the data set). Also \(N_2=N-N_1\) is the total number of failures.

We will use this loglikelihood in the next question.
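A minimal sketch of this log-likelihood as an R function (the binomial-coefficient terms are dropped, since they do not depend on \(\beta_0\)); maximising it numerically should land very close to the glm estimate:

N1 = sum(ds$using)  # total number of successes
N = sum(ds$using + ds$notUsing)  # total number of trials
loglik0 = function(beta0) beta0 * N1 - N * log(1 + exp(beta0))
optimize(loglik0, interval = c(-5, 5), maximum = TRUE)$maximum  # should be close to coef(fit0)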

b) What is the relationship between your estimated coefficient and the proportion in the data set using contraception (\(N_1/N\))?

Hint 0 What would be your intuitive estimator for \(\pi\) (common to all groups)?
Hint 1 Find the derivative of the log-likelihood with respect to \(\beta_0\), and use this to find the MLE for \(\beta_0\).
Hint 2 It may be easier to first see what \(\hat{\pi}\) is, where \(\hat{\pi}=\frac{\exp(\hat{\beta}_0)}{1+\exp(\hat{\beta}_0)}\) (so plogis), and then \(\hat{\beta}_0=\text{logit}(\hat{\pi})\) (so qlogis).
Hint 3 You can verify by using the estimate from glm.
Answer

The intuitive estimate of \(\pi\) is \(\hat{\pi} = N_1/N\). From above, \(l(\boldsymbol{\beta}) = \beta_0 N_1 - \log(1+\exp(\beta_0))N\), so we can differentiate (to get the score), set that to 0 and solve…

\[ \begin{align} \frac{dl(\boldsymbol{\beta})}{d\beta_0} &= N_1 - N\frac{d}{d\beta_0}\log(1+\exp(\beta_0)) \\ &= N_1 - N \frac{d\log(u)}{du}\frac{du}{d\beta_0} && \text{using } u=1+\exp(\beta_0) \\ &= N_1 - N \frac{1}{1+\exp(\beta_0)}\exp(\beta_0) = 0 \\ \hat{\pi} &= \frac{\exp(\hat{\beta}_0)}{1+\exp(\hat{\beta}_0)} = \frac{N_1}{N} \end{align} \]

Recognise this? It’s the logistic function, so the inverse is the logit link: \(\hat{\beta}_0 = \log(\hat{\pi}/(1-\hat{\pi}))\).

We can verify this in R:

N = sum(ds$using + ds$notUsing)
N1 = sum(ds$using)
N2 = N - N1

log(N1/N2)
## [1] -0.7745545
fit0$coefficients
## (Intercept) 
##  -0.7745545

c) We know that the (asymptotic) estimated covariance matrix of the ML estimate is the inverse of the expected Fisher information matrix; here the matrix is just a scalar, so the covariance matrix is simply the variance.

Find the mathematical expression for the estimated variance of our estimated intercept.

Hint 1 We have \(F(\boldsymbol{\beta}) = \sum_{j=1}^G x_jx_j^T n_j \pi_j(1-\pi_j)\), and then insert \(x_j = 1\) and \(\pi_j(1-\pi_j)=\pi(1-\pi)\), and hopefully you found above that \(\hat{\pi}=N_1/N\) in our model with only intercept.
Hint 2 \(\frac{N}{N_1 \cdot N_2}=\frac{1}{N_1}+\frac{1}{N_2}\) to make things prettier.
Answer

\(Var(\boldsymbol{\beta}) = F^{-1}(\boldsymbol{\beta})\), so find \(F(\boldsymbol{\beta})\):

\[ \begin{aligned} F(\boldsymbol{\beta}) &= \sum_{j=1}^G x_jx_j^T n_j \pi_j(1-\pi_j) \\ &= N \pi(1-\pi) \\ &= N \frac{N_1}{N}\left(\frac{N_2}{N}\right) \\ &= \frac{N_1 N_2}{N} \end{aligned} \]

So, \(F^{-1}(\boldsymbol{\beta}) = Var(\boldsymbol{\beta}) = N/(N_1 N_2) = \frac{1}{N_1}+\frac{1}{N_2}\)

d) What is the estimated (numerical value) standard deviation of the parameter estimate? Did your calculation above give the same result?

Hint vcov(fit0)
Answer
vcov(fit0)
##             (Intercept)
## (Intercept) 0.002881477
Hopefully, “yes” (up to rounding error): 1/sum(ds$notUsing) + 1/sum(ds$using) = 0.0028815, which matches vcov(fit0).
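Or, written out with N1 and N2 from earlier, a minimal check of the formula against what glm reports:

1/N1 + 1/N2  # estimated variance from the formula in c)
vcov(fit0)[1, 1]  # variance reported by glm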

e) What is the asymptotic distribution of the estimated regression coefficient? Use this to write down the formula for the 95% confidence interval for the intercept, and calculate the interval in R. Compare numerically to confint.default and confint.

Answer

The coefficient is asymptotically normally distributed, so the (approximate) 95% confidence interval is

\(\hat{\beta}_0 \pm 1.96 \sqrt{\hat{\sigma}^2_{\hat{\beta}_0}} = -0.77 \pm 1.96 \times 0.05 = (-0.88, -0.67)\).

c(coef(fit0) - 1.96 * sqrt(vcov(fit0)), coef(fit0) + 1.96 * sqrt(vcov(fit0)))
## [1] -0.8797661 -0.6693428
(ci = confint.default(fit0))
##                  2.5 %     97.5 %
## (Intercept) -0.8797641 -0.6693448
confint(fit0)
##      2.5 %     97.5 % 
## -0.8804716 -0.6700014

So, the asymptotic answer is (almost) the same as the one confint.default() gives. confint() (which is actually confint.glm()) is different in the third decimal place. It uses a different approach, something called a profile likelihood.

f) Translate the 95% CI to the probability scale
Hint

plogis is the inverse logit. Or write your own function.

Conceptually, think about what probability mass is above the upper limit on the linear scale and after it has been transformed.
Answer

If a transformation is monotone (strictly increasing), then an x% quantile before transformation is also an x% quantile on the transformed scale (or the (100-x)% quantile if the transformation is decreasing). Recall that this does not hold for other statistics (e.g. the mean, mode or standard deviation).

So, we can simply transform the CI:

plogis(ci)
##                 2.5 %    97.5 %
## (Intercept) 0.2932267 0.3386436
fit0$family$linkinv(ci)
##                 2.5 %    97.5 %
## (Intercept) 0.2932267 0.3386436
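As the hint suggests, you can also write your own inverse-logit function; a minimal sketch that should agree with plogis:

invlogit = function(x) exp(x)/(1 + exp(x))  # inverse of the logit link
invlogit(ci)  # should agree with plogis(ci)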

Problem 2: We then study the effect of the covariate “wantsMore”

(a little less maths)

Fit a regression with wantsMore as covariate, and call this fit1.

a) Explain what the estimated coefficient (\(\beta_1\)) means.

Hint

Interpretation using odds – if wantsMore goes from 0=no to 1=yes, then…

Also, check whether a “success” is using or not using contraceptives.
Answer
fit1 = glm(cbind(using, notUsing) ~ wantsMore, data = ds, family = binomial)
summary(fit1)
## 
## Call:
## glm(formula = cbind(using, notUsing) ~ wantsMore, family = binomial, 
##     data = ds)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.18636    0.07971  -2.338   0.0194 *  
## wantsMoreyes -1.04863    0.11067  -9.475   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  74.098  on 14  degrees of freedom
## AIC: 149.61
## 
## Number of Fisher Scoring iterations: 4
exp(fit1$coefficients)
##  (Intercept) wantsMoreyes 
##    0.8299712    0.3504178
\(\beta_1\) is the difference between the two levels of wantsMore, i.e. it is the difference in the log odds between women who want more children and those who do not. The estimate is negative (-1.05), so a woman who wants more children is less likely to be using contraceptives. The odds for a woman who wants more children are about 35% of (i.e. 65% lower than) the odds for a woman who does not want more children. Put the other way around, the odds are 2.9 (\(=1/0.35\)) times higher for women who do not want more children.
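The “other way around” ratio can be computed directly from the fitted coefficient:

exp(-coef(fit1)["wantsMoreyes"])  # about 2.85, i.e. roughly 1/0.35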

b) Is this covariate significant?

  • Write down the null and alternative hypothesis to test.
  • Then test using the Wald test - write down the formula yourself.
  • Use vcov(fit1) to access the inverse of the Fisher information matrix.
  • What is the degrees of freedom for this test?
  • Compare to the print-out from summary.
Answer
  • Write down the null and alternative hypothesis to test.

\(H_0: \beta_1=0\) vs. \(H_1: \beta_1 \ne 0\)

  • Then test using the Wald test - write down the formula yourself.

Formula:

\[ z = \frac{\hat{\beta}_1 - 0}{\sqrt{\widehat{\text{Var}}(\hat{\beta}_1)}}, \qquad w = z^2, \]

where \(\widehat{\text{Var}}(\hat{\beta}_1)\) is the corresponding diagonal element of the inverse Fisher information (which is what vcov() returns). Under \(H_0\), \(z\) is asymptotically standard normal and \(w\) asymptotically \(\chi^2_1\).

  • Use vcov(fit1) to access the inverse of the Fisher information matrix.

In code:

(w <- coef(fit1)["wantsMoreyes"]^2/vcov(fit1)["wantsMoreyes", "wantsMoreyes"])
## wantsMoreyes 
##     89.77763
(z <- coef(fit1)["wantsMoreyes"]/sqrt(vcov(fit1)["wantsMoreyes", "wantsMoreyes"]))
## wantsMoreyes 
##    -9.475106
  • What is the degrees of freedom for this test?

1: we are testing a single parameter, so \(w\) is compared to a \(\chi^2_1\) distribution (equivalently, \(z\) to a standard normal).

  • Compare to the print-out from summary.
summary(fit1)
## 
## Call:
## glm(formula = cbind(using, notUsing) ~ wantsMore, family = binomial, 
##     data = ds)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.18636    0.07971  -2.338   0.0194 *  
## wantsMoreyes -1.04863    0.11067  -9.475   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  74.098  on 14  degrees of freedom
## AIC: 149.61
## 
## Number of Fisher Scoring iterations: 4

They are the same!
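To reproduce the p-value from the print-out using the statistics computed above (a minimal sketch):

2 * pnorm(abs(z), lower.tail = FALSE)  # two-sided p-value from the z statistic
pchisq(w, df = 1, lower.tail = FALSE)  # the same, from the chi-squared form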

c) Alternatively, the likelihood ratio test can be used.

Write down the formula for the likelihood ratio test statistic. Use this in combination with the fit0 and fit1 objects to calculate the likelihood ratio statistic. What is the \(p\)-value?

Hint Write a function to calculate the log likelihoods
Answer

\[ - 2\ln \lambda=-2(\ln L(\hat{\boldsymbol{\beta}}_B)-\ln L(\hat{\boldsymbol{\beta}}_A)), \] where B is the smaller model (here fit0) and A is the larger model (here fit1). We can calculate the log likelihoods:

calcloglik <- function(par, args) {
    # log-likelihood for grouped logistic regression, up to an additive constant
    # (the binomial coefficients do not depend on the parameters)
    y <- args$y  # number of successes per covariate pattern
    x <- args$x  # model matrix
    n <- args$n  # number of trials per covariate pattern
    res <- sum(y * x %*% par - n * log(1 + exp(x %*% par)))
    return(res)
}

args0 <- list(y = ds$using, n = ds$using + ds$notUsing, x = cbind(rep(1, nrow(ds))))
args1 <- list(y = ds$using, n = ds$using + ds$notUsing, x = cbind(rep(1, nrow(ds)),
    as.numeric(ds$wantsMore == "yes")))

betas0 <- fit0$coefficients
betas1 <- fit1$coefficients

(ll0 <- calcloglik(betas0, args0))
## [1] -1001.847
(ll1 <- calcloglik(betas1, args1))
## [1] -956.0096
(LRTstat <- -2 * (ll0 - ll1))
## [1] 91.6744
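The question also asks for the p-value: the statistic is compared to a \(\chi^2_1\) distribution, since the models differ by one parameter.

pchisq(LRTstat, df = 1, lower.tail = FALSE)  # far below 0.05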

d) The likelihood ratio test statistic can alternatively be calculated using the residual deviance in fit1 and fit0, and is given as fit0$deviance-fit1$deviance.

Do you see why?

Hint

“No” may be correct, but is not the answer we are looking for.

Also, think about deviances being additive.
Answer

Yes, of course I see why! The deviance of each model is \(-2(\ln L(\text{model}) - \ln L(\text{saturated model}))\), so in the difference fit0$deviance - fit1$deviance the saturated-model term cancels, leaving exactly \(-2(\ln L(\text{fit0}) - \ln L(\text{fit1}))\), the LRT statistic. Equivalently, the total deviance is the sum of the “explained” and residual deviance, so we can look at either the increase in explained deviance or the decrease in residual deviance: they are the same.

fit0$deviance - fit1$deviance
## [1] 91.6744
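For comparison, R’s anova() function performs the same likelihood ratio test directly:

anova(fit0, fit1, test = "Chisq")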

e) Are the two test statistics (Wald and LRT) equal? Do the two tests give the same conclusions?

Answer No, they are not quite equal (89.8 vs. 91.7), but they lead to the same conclusion: there is strong evidence for an effect of wanting more children on the probability of using contraception.

Problem 3: Now two covariates - deviance and model comparison

(no maths - only definitions and print-out)

Now we study the response together with age and wantsMore. We will consider the following 5 models. See R-code and print-out below.

ds$Y <- cbind(ds$using, ds$notUsing)
models <- list(null = glm(Y ~ 1, family = binomial, data = ds), age = glm(Y ~ age,
    family = binomial, data = ds), desire = glm(Y ~ wantsMore, family = binomial,
    data = ds), additive = glm(Y ~ age + wantsMore, family = binomial, data = ds),
    interact = glm(Y ~ age * wantsMore, family = binomial, data = ds))
models
## $null
## 
## Call:  glm(formula = Y ~ 1, family = binomial, data = ds)
## 
## Coefficients:
## (Intercept)  
##     -0.7746  
## 
## Degrees of Freedom: 15 Total (i.e. Null);  15 Residual
## Null Deviance:       165.8 
## Residual Deviance: 165.8     AIC: 239.3
## 
## $age
## 
## Call:  glm(formula = Y ~ age, family = binomial, data = ds)
## 
## Coefficients:
## (Intercept)     age25-29     age30-39     age40-49  
##     -1.5072       0.4607       1.0483       1.4246  
## 
## Degrees of Freedom: 15 Total (i.e. Null);  12 Residual
## Null Deviance:       165.8 
## Residual Deviance: 86.58     AIC: 166.1
## 
## $desire
## 
## Call:  glm(formula = Y ~ wantsMore, family = binomial, data = ds)
## 
## Coefficients:
##  (Intercept)  wantsMoreyes  
##      -0.1864       -1.0486  
## 
## Degrees of Freedom: 15 Total (i.e. Null);  14 Residual
## Null Deviance:       165.8 
## Residual Deviance: 74.1  AIC: 149.6
## 
## $additive
## 
## Call:  glm(formula = Y ~ age + wantsMore, family = binomial, data = ds)
## 
## Coefficients:
##  (Intercept)      age25-29      age30-39      age40-49  wantsMoreyes  
##      -0.8698        0.3678        0.8078        1.0226       -0.8241  
## 
## Degrees of Freedom: 15 Total (i.e. Null);  11 Residual
## Null Deviance:       165.8 
## Residual Deviance: 36.89     AIC: 118.4
## 
## $interact
## 
## Call:  glm(formula = Y ~ age * wantsMore, family = binomial, data = ds)
## 
## Coefficients:
##           (Intercept)               age25-29               age30-39  
##               -1.4553                 0.6354                 1.5411  
##              age40-49           wantsMoreyes  age25-29:wantsMoreyes  
##                1.7643                -0.0640                -0.2672  
## age30-39:wantsMoreyes  age40-49:wantsMoreyes  
##               -1.0905                -1.3671  
## 
## Degrees of Freedom: 15 Total (i.e. Null);  8 Residual
## Null Deviance:       165.8 
## Residual Deviance: 20.1  AIC: 107.6
lapply(models, summary)
## $null
## 
## Call:
## glm(formula = Y ~ 1, family = binomial, data = ds)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.77455    0.05368  -14.43   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.77  on 15  degrees of freedom
## Residual deviance: 165.77  on 15  degrees of freedom
## AIC: 239.28
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $age
## 
## Call:
## glm(formula = Y ~ age, family = binomial, data = ds)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.5072     0.1303 -11.571  < 2e-16 ***
## age25-29      0.4607     0.1727   2.667  0.00765 ** 
## age30-39      1.0483     0.1544   6.788 1.14e-11 ***
## age40-49      1.4246     0.1940   7.345 2.06e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  86.581  on 12  degrees of freedom
## AIC: 166.09
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $desire
## 
## Call:
## glm(formula = Y ~ wantsMore, family = binomial, data = ds)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.18636    0.07971  -2.338   0.0194 *  
## wantsMoreyes -1.04863    0.11067  -9.475   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  74.098  on 14  degrees of freedom
## AIC: 149.61
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $additive
## 
## Call:
## glm(formula = Y ~ age + wantsMore, family = binomial, data = ds)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.8698     0.1571  -5.536 3.10e-08 ***
## age25-29       0.3678     0.1754   2.097    0.036 *  
## age30-39       0.8078     0.1598   5.056 4.27e-07 ***
## age40-49       1.0226     0.2039   5.014 5.32e-07 ***
## wantsMoreyes  -0.8241     0.1171  -7.037 1.97e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  36.888  on 11  degrees of freedom
## AIC: 118.4
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## $interact
## 
## Call:
## glm(formula = Y ~ age * wantsMore, family = binomial, data = ds)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.4553     0.2968  -4.903 9.43e-07 ***
## age25-29                0.6354     0.3564   1.783  0.07463 .  
## age30-39                1.5412     0.3183   4.842 1.29e-06 ***
## age40-49                1.7643     0.3435   5.136 2.80e-07 ***
## wantsMoreyes           -0.0640     0.3303  -0.194  0.84637    
## age25-29:wantsMoreyes  -0.2672     0.4091  -0.653  0.51366    
## age30-39:wantsMoreyes  -1.0905     0.3733  -2.921  0.00349 ** 
## age40-49:wantsMoreyes  -1.3672     0.4834  -2.828  0.00468 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  20.099  on  8  degrees of freedom
## AIC: 107.61
## 
## Number of Fisher Scoring iterations: 4
data.frame(deviance = round(unlist(lapply(models, deviance)), 2), df = unlist(lapply(models,
    df.residual)), aic = round(unlist(lapply(models, AIC))))
##          deviance df aic
## null       165.77 15 239
## age         86.58 12 166
## desire      74.10 14 150
## additive    36.89 11 118
## interact    20.10  8 108

Note: The function lapply (list apply) will apply a function to each element of a list.

a) Explain what each of these models include.

Hint We’re interested in which covariate is in which model
Answer
  • null: intercept only, no covariate effects. This is the same as the model in Problem 1.
  • age: only has age as an effect.
  • desire: only has whether the woman wants more children. This is the model in Problem 2.
  • additive: has both main effects, age and wantsMore (the covariates from the age and desire models).
  • interact: also includes their interaction, i.e. the effect of age depends on whether the woman wants more children.

b) What are the definitions of the deviance and the df? What can we use the deviance for?

Hint

df = degrees of freedom

Deviance is a function of likelihoods.
Answer

The deviance is defined as the likelihood ratio statistic, where we put the saturated model in place of the larger model A and our candidate model in place of the smaller model B:

\[\begin{aligned} D&=-2(\ln L(\text{candidate model})-\ln L(\text{saturated model})) \\ &=-2(l(\hat{\pi})-l(\tilde{\pi}))= -2\sum_{j=1}^G(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j)) \end{aligned}\]

The df is the difference in the number of parameters of the models being compared.

We can use it to (a) compare models, (b) test the goodness of fit of a model, and (c) test for overdispersion.
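A minimal sketch of (b): compare each model’s residual deviance to a \(\chi^2\) distribution with the residual degrees of freedom (a small p-value suggests the model does not fit well), using the models list from above.

sapply(models, function(m) pchisq(deviance(m), df.residual(m), lower.tail = FALSE))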

Optional: derive the formula for the deviance for the logit model (the derivation is not given on the module pages.)

Hint

The deviance is

\[ D= -2\sum_{j=1}^G(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j)) = 2\sum_{j=1}^G [y_j\ln(\frac{y_j}{n_j\hat{\pi}_j})+(n_j-y_j)\ln(\frac{n_j-y_j}{n_j-n_j\hat{\pi}_j})] \]

Answer

Because \(D = -2\sum_{j=1}^G(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j))\), we can just work with \(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j)\), where \(\hat{\pi}_j\) is the estimate under our model and \(\tilde{\pi}_j = y_j/n_j\) is the estimate under the saturated model.

\[\begin{aligned} l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j) &= \left( y_j \ln \hat{\pi}_j + (n_j-y_j) \ln(1-\hat{\pi}_j) \right) - \left( y_j \ln \tilde{\pi}_j + (n_j-y_j) \ln(1-\tilde{\pi}_j) \right) \\ &= y_j \ln \frac{\hat{\pi}_j}{\tilde{\pi}_j} + (n_j-y_j) \ln \frac{1-\hat{\pi}_j}{1-\tilde{\pi}_j} \\ &= y_j \ln \frac{n_j\hat{\pi}_j}{y_j} + (n_j-y_j) \ln \frac{n_j(1-\hat{\pi}_j)}{n_j-y_j} = y_j \ln \frac{n_j\hat{\pi}_j}{y_j} + (n_j-y_j) \ln \frac{n_j-n_j\hat{\pi}_j}{n_j-y_j}, \end{aligned}\]

where the last line uses \(\tilde{\pi}_j = y_j/n_j\). Plugging this into \(D = -2\sum_{j=1}^G (l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j))\) gives

\[ D = -2 \sum_{j=1}^G \left[ y_j \ln \frac{n_j\hat{\pi}_j}{y_j} + (n_j-y_j) \ln \frac{n_j-n_j\hat{\pi}_j}{n_j-y_j} \right] = 2 \sum_{j=1}^G \left[ y_j \ln \frac{y_j}{n_j\hat{\pi}_j} + (n_j-y_j) \ln \frac{n_j-y_j}{n_j-n_j\hat{\pi}_j} \right], \]

which is the formula in the hint.

c) Perform a likelihood ratio test - based on the deviance results given in the data frame in the R chunk - to compare the additive and interact models. First write down the null and alternative hypothesis you are testing. Which of the two models would you prefer?
Hint

The models are

additive: Y ~ age + wantsMore
interact: Y ~ age * wantsMore
Answer

The null hypothesis is \(H_0\): Y ~ age + wantsMore is the correct model, i.e. all of the interaction terms are 0. The alternative hypothesis is \(H_1\): Y ~ age * wantsMore is the correct model.

We could do this using anova(models$additive, models$interact, test = "Chisq"), but let’s do it by hand!

The deviances are (e.g. from summary(models$additive)$deviance):

additive: 36.89
interact: 20.10

with 11 and 8 residual degrees of freedom respectively (from summary(models$additive)$df.residual). The test statistic is then:

(test.dev <- summary(models$additive)$deviance - summary(models$interact)$deviance)
## [1] 16.78881
(test.df <- summary(models$additive)$df.residual - summary(models$interact)$df.residual)
## [1] 3
pchisq(test.dev, test.df, lower.tail = FALSE)
## [1] 0.0007810536

Hence, we reject the null hypothesis, and prefer the model with the interactions.
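For reference, the anova() shortcut mentioned above gives the same test:

anova(models$additive, models$interact, test = "Chisq")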

d) If you instead use the AIC for model selection, which model (out of the 5) will you then choose?

Hint Lower AIC is better.
Answer

We can extract the AICs from each object (or just read the output above). Then we can use R to find the model with the lowest AIC:

(AICs <- unlist(lapply(models, AIC)))  # lapply() returns a list, so unlist() tidies this up
##     null      age   desire additive interact 
## 239.2803 166.0886 149.6059 118.3957 107.6069
which(AICs == min(AICs))
## interact 
##        5

And we find the interaction model, once again, is best.

Problem 4: Plotting

(yes, mainly R and trying to understand)

Finally, we want to use the additive model with all three covariates: Y~age+education+wantsMore to study different plots.

fit3add = glm(cbind(using, notUsing) ~ age + education + wantsMore, data = ds, family = binomial)
summary(fit3add)
## 
## Call:
## glm(formula = cbind(using, notUsing) ~ age + education + wantsMore, 
##     family = binomial, data = ds)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.8082     0.1590  -5.083 3.71e-07 ***
## age25-29       0.3894     0.1759   2.214  0.02681 *  
## age30-39       0.9086     0.1646   5.519 3.40e-08 ***
## age40-49       1.1892     0.2144   5.546 2.92e-08 ***
## educationlow  -0.3250     0.1240  -2.620  0.00879 ** 
## wantsMoreyes  -0.8330     0.1175  -7.091 1.33e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 165.772  on 15  degrees of freedom
## Residual deviance:  29.917  on 10  degrees of freedom
## AIC: 113.43
## 
## Number of Fisher Scoring iterations: 4
exp(fit3add$coefficients)
##  (Intercept)     age25-29     age30-39     age40-49 educationlow wantsMoreyes 
##    0.4456506    1.4760678    2.4808804    3.2845805    0.7225312    0.4347628

a) Comment on the output from summary. Use the deviance to test if this is a good model. Which model do you then compare to (what is the saturated model here)?

Hint

Think about both the directions of effects and their size.

You should know how a saturated model is defined. For extra marks, can you write the R formula for it?

Answer

The summary suggests a lot is going on. In particular:

  • low education and wanting more children are associated with a lower probability of using contraceptives: the effect of wanting more children is much stronger (perhaps not surprisingly).
  • the age levels are contrasts to “<25”, so the effect of age 25-29 is to increase the odds \(e^{0.39} = 1.48\) times.
  • we can see that as age increases, the age effect also increases, i.e. the older women are, the more likely they are to use contraception.

We can test the goodness of fit of the model by using the deviance. This is, as you now know, based on the difference between the likelihood for the model and the likelihood for a saturated model. The saturated model has one parameter per data point (i.e. row in the data). Because the covariates are all factors, and we have every combination of their levels, this is the same as the model with all possible interactions, i.e. cbind(using,notUsing) ~ age*education*wantsMore.
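A quick sketch to confirm this (the object name fitsat is just for illustration): the saturated model should have a residual deviance of (essentially) zero on 0 degrees of freedom.

fitsat = glm(cbind(using, notUsing) ~ age * education * wantsMore, data = ds, family = binomial)
c(deviance = deviance(fitsat), df = df.residual(fitsat))  # deviance is (numerically) zero on 0 df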

We can test if this is a good model from this line:

Residual deviance: 29.917 on 10 degrees of freedom

We can compare this to a \(\chi^2\) distribution:

pchisq(29.917, 10, lower.tail = FALSE)
## [1] 0.0008838311
pchisq(fit3add$deviance, fit3add$df.residual, lower.tail = FALSE) < 0.05
## [1] TRUE

The second line is just to show you how to (a) extract the statistics from the model, and (b) reduce the p-value to a yes/no answer, i.e. whether it is less than 0.05.

We can see that the test is significant, i.e. the model does not appear to be good.

b) Look at the plot of fitted values vs. deviance residuals. Do you see a trend?

Answer I don’t. With 16 data points, random variation can look weird in all sorts of ways.
library(ggplot2)
# fit3add$residuals are working residuals, so use residuals(..., type = "deviance")
plotdf <- data.frame(dres = residuals(fit3add, type = "deviance"), fitted = fit3add$fitted.values,
    age = ds$age)
ggplot(plotdf, aes(x = fitted, y = dres)) + geom_point() + labs(x = "Fitted values",
    y = "Deviance residuals")

c) Here is code to produce a plot of the fitted values for the 16 covariate patterns together with the saturated model (called frac) - on the logit scale.

library(ggplot2)
frac = ds$using/(ds$using + ds$notUsing)
logitdf = data.frame(lfitted = qlogis(fit3add$fitted), lfrac = qlogis(frac), age = ds$age,
    wantsMore = ds$wantsMore, education = ds$education)
ggplot(logitdf, aes(x = age)) + geom_point(aes(y = lfrac, colour = "saturated")) +
    geom_point(aes(y = lfitted, colour = "fitted")) + facet_wrap(facets = ~wantsMore *
    education) + labs(x = "age", y = "")

How can you see from the dots for the fitted values that we are assuming an additive model on the logit scale? (With an additive model, the pattern of logit fitted values vs. age is the same in every panel.) Do you also see that for the observed values? What is the implication? Any other observations?

Hint

If it helps, write Age as continuous, and ask how the line changes when factor levels change.

For the observed values, are there clear patterns? Note that this is a subjective judgement!
Answer

If the model is additive on this (logit) scale, the panels should look similar: the age pattern is the same in each panel, and changing education or wantsMore only shifts the whole pattern up or down by a constant amount. So, with some obvious notation we can go from

\[ \eta_i = \beta_0 + \beta_{age25-29} + \beta_{educationHigh} \]

to

\[ \eta_i = \beta_0 + \beta_{age25-29} + \beta_{educationLow} = \beta_1 + \beta_{age25-29} \]

and changing age changes \(\beta_{age}\), but not \(\beta_1\).

For the observed values this is not so clear. For no, low (i.e. not wanting more children and low education) the observed values are close to the fitted values. But for no, high the pattern seems steeper, and yes, low has a lower value for ages 40-49.

The problem with interpreting these is that it is easy to over-interpret random noise (this is called pareidolia). But we know from the test of the residual deviance that this model does not fit the data well enough, so perhaps there is something going on. Or perhaps this extra variation is random noise (Robert Abelson in his book summarised this as “chance is lumpy”).

Exam questions

For this module the following are exam questions to work with

In addition these essay-type exam questions are closely related to this module.

December 2014

There are two main asymptotic results that are behind essentially everything we have done with respect to inference for generalized linear models. These are

  1. asymptotic normality of maximum likelihood estimators (MLE), and

  2. asymptotic result for likelihood ratio tests (LRT).

State, describe and discuss the main assumptions behind these two asymptotic results, how these results are used in practice to do inference for parameters, testing various hypotheses and comparing nested models, for logistic regression.

December 2016

We will consider (binomial or binary) logistic regression, where we have independent observations \[Y_i \sim \text{Binomial}(n_i,\pi_i), \text{ } i=1,\ldots,n\] so that \[P(Y_i=y_i)=\binom{n_i}{y_i}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}, \text{ } y_i=0,1,\ldots, n_i.\] The linear predictor is \[ \eta_i={\bf x}_i^T\boldsymbol{\beta}\] and \[ \pi_i=\frac{\exp(\eta_i)}{1+\exp(\eta_i)}\] or \[ \text{logit}(\pi_i)=\eta_i.\] Here, \({\bf x}_i\) is a vector of the \(p\) covariates for the \(i\)th observation \(y_i\) with size (number of trials) \(n_i\), and \(\boldsymbol{\beta}\) is the vector of \(p\) unknown regression coefficients.

Write an introduction to logistic regression and its practical usage, for a student with a good background in statistics, but no knowledge about Generalized Linear Models (GLM). Topics you may want to consider, are

  • When to use it? Underlying assumptions.
  • Parameter estimation, limiting results for the MLE, Fisher information and observed Fisher information, confidence intervals and hypothesis testing.
  • Output analysis, residual plots (when it is possible) and interpretation of the \(\boldsymbol{\beta}\)-coefficients
  • Deviance and its usage.

R packages

install.packages(c("tidyverse", "investr", "knitr", "kableExtra", "faraway", "viridis",
    "statmod"))

References for further reading

  • A. Agresti (2015): “Foundations of Linear and Generalized Linear Models.” Wiley.
  • A. J. Dobson and A. G. Barnett (2008): “An Introduction to Generalized Linear Models”, Third edition.
  • J. Faraway (2015): “Extending the Linear Model with R”, Second Edition. http://www.maths.bath.ac.uk/~jjf23/ELM/
  • P. McCullagh and J. A. Nelder (1989): “Generalized Linear Models”. Second edition.