Hints and reminders are in bold
Questions appear in blue.
The exam style questions are optional, but will be helpful for, well, the exam. They will also help you now to see if you understand the content.
We will use a data set on contraceptive use in Fiji (data from 1975). The data is to be analysed with “current use of contraceptive” as response and some (or all of) “age”, “level of education”, “desire for more children” as covariates.
The data is available at https://grodri.github.io/datasets/cuse.dat with the following description:
using / notUsing: number of women currently using contraception (using) or not (notUsing)
age: categorical variable with 4 levels: “<25”, “25-29”, “30-39”, “40-49”
education: categorical variable with 2 levels giving the highest level of education obtained (coded low and high in the data)
wantsMore: desires more children: yes or no
ds = read.table("https://grodri.github.io/datasets/cuse.dat", header = TRUE)
names(ds)
## [1] "age" "education" "wantsMore" "notUsing" "using"
summary(ds)
## age education wantsMore notUsing
## Length:16 Length:16 Length:16 Min. : 8.00
## Class :character Class :character Class :character 1st Qu.: 31.00
## Mode :character Mode :character Mode :character Median : 56.50
## Mean : 68.75
## 3rd Qu.: 85.75
## Max. :212.00
## using
## Min. : 4.00
## 1st Qu.: 9.50
## Median :29.00
## Mean :31.69
## 3rd Qu.:49.00
## Max. :80.00
dim(ds)
## [1] 16 5
head(ds)
## age education wantsMore notUsing using
## 1 <25 low yes 53 6
## 2 <25 low no 10 4
## 3 <25 high yes 212 52
## 4 <25 high no 50 10
## 5 25-29 low yes 60 14
## 6 25-29 low no 19 10
We will study binary regression using the logit model, and we will work with grouped data, so the proportion using contraception is using/(using + notUsing). For grouped data as a response, R wants two columns, with successes in the first column and failures in the second. We can write this as cbind(success, failure) ~ X1 + X2.
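Just to see the format R expects (a minimal sketch, using nothing beyond what is already in ds), we can build this two-column response by hand:
# Two-column response for grouped binomial data: successes (using) first, failures (notUsing) second
resp <- cbind(using = ds$using, notUsing = ds$notUsing)
head(resp)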
Plan: Start with Problem 2, then move to 3 and 4, and if you have time you look at Problem 1.
(This is the most theoretical of the problems and rather technical - but with some cool results on the null model.)
a) Fit the null model and call it fit0. Explain what you see.
fit0 = glm(cbind(using, notUsing) ~ 1, data = ds, family = binomial(link = logit))
summary(fit0)
##
## Call:
## glm(formula = cbind(using, notUsing) ~ 1, family = binomial(link = logit),
## data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.77455 0.05368 -14.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.77 on 15 degrees of freedom
## Residual deviance: 165.77 on 15 degrees of freedom
## AIC: 239.28
##
## Number of Fisher Scoring iterations: 4
We see some output. The estimate of the intercept is -0.77. This is on the log-odds scale: since it is negative (and logit(0.5) = 0), the estimated mean probability that a woman uses contraception is below 0.5. (The response is written as cbind(success, failure); there are other ways of writing it, see the help page for glm.)
The null and residual deviances are the same, because this is the null model.
Observe: When we only have an intercept in the model, the model matrix will be an \(n \times 1\)-matrix with only ones. Then \({\bf x}_j^T\boldsymbol{\beta}=\beta_0\). The log-likelihood can be written as (let \(j=1,\ldots,G\))
\[\begin{align} l(\boldsymbol{\beta}) &=\sum_{j=1}^G (y_j{\bf x}_j^T \boldsymbol{\beta} - n_j \log(1+\exp({\bf x}_j^T\boldsymbol{\beta})))= \sum_{j=1}^G (y_j\beta_0 - n_j \log(1+\exp(\beta_0)))\\ &= \beta_0 \sum_{j=1}^G y_j - \log(1+\exp(\beta_0))\sum_{j=1}^G n_j =\beta_0 N_1 - \log(1+\exp(\beta_0))N \end{align}\]
where \(N_1=\sum_{j=1}^G y_j\) is the total number of successes and \(N=\sum_{j=1}^G n_j\) is the total number of trials (over all covariates patterns, that is, here the number of individuals in the data set). Also \(N_2=N-N_1\) is the total number of failures.
We will use this loglikelihood in the next question.
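As a quick sanity check (a sketch, not part of the exercise), we can maximise this log-likelihood numerically and compare with the glm() estimate below:
# Null-model log-likelihood l(beta0) = beta0*N1 - log(1+exp(beta0))*N, maximised numerically
N1 <- sum(ds$using)
N <- sum(ds$using + ds$notUsing)
loglik0 <- function(beta0) beta0 * N1 - log(1 + exp(beta0)) * N
optimize(loglik0, interval = c(-5, 5), maximum = TRUE)$maximum
# should agree with the glm() intercept below (about -0.7746)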
b) What is the relationship between your estimated coefficient and the proportion in the data set using contraception (\(N_1/N\))?
Hint: first find the ML estimate \(\hat{\pi}\) of the probability of using contraception (the inverse logit in R is plogis), and then \(\hat{\boldsymbol{\beta}}_0=\text{logit}(\hat{\pi})\) (so qlogis). Compare with the estimate from glm.
The intuitive estimate of \(\hat{\pi}\) is \(N_1/N\). From above, \(l(\boldsymbol{\beta}) = \beta_0 N_1 - \log(1+\exp(\beta_0))N\), so we can differentiate (to get the score), set that to 0 and solve…
\[ \begin{align} \frac{dl(\beta_0)}{d\beta_0} &= N_1 - N\frac{d}{d\beta_0}\log(1+\exp(\beta_0)) \\ &= N_1 - N \frac{d\log(u)}{du}\frac{d(1+\exp(\beta_0))}{d\beta_0} && \text{using } u=1+\exp(\beta_0) \\ &= N_1 - N \frac{1}{1+\exp(\beta_0)}\exp(\beta_0) = 0 \\ \Rightarrow \quad \hat{\pi} &= \frac{N_1}{N} = \frac{\exp(\hat{\beta}_0)}{1+\exp(\hat{\beta}_0)} \end{align} \]
Recognise this? It’s the logistic function, so the inverse is the logit link: \(\hat{\beta}_0 = \log(\hat{\pi}/(1-\hat{\pi}))\).
We can verify this in R:
N = sum(ds$using + ds$notUsing)
N1 = sum(ds$using)
N2 = N - N1
log(N1/N2)
## [1] -0.7745545
fit0$coefficients
## (Intercept)
## -0.7745545
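Equivalently, since qlogis is the logit function (a one-line sketch using the N1 and N just computed):
qlogis(N1/N)  # logit of the observed proportion; same number as above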
c) We know that the (asymptotic) estimated covariance matrix of the ML estimate is the inverse of the expected Fisher information matrix, here the matrix is only a scalar and the covariance matrix is just the variance.
Find the mathematical expression for the estimated variance of our estimated intercept.
\(Var(\boldsymbol{\beta}) = F^{-1}(\boldsymbol{\beta})\), so find \(F(\boldsymbol{\beta})\):
\[ \begin{aligned} F(\beta_0) &= \sum_{j=1}^G x_jx_j^T n_j \pi_j(1-\pi_j) = \sum_{j=1}^G n_j \pi(1-\pi) && (x_j=1 \text{ and } \pi_j=\pi \text{ for all } j) \\ &= N \pi(1-\pi), \quad \text{which evaluated at } \hat{\pi}=N_1/N \text{ gives} \\ \hat{F} &= N \frac{N_1}{N}\left(\frac{N_2}{N}\right) = \frac{N_1 N_2}{N} \end{aligned} \]
So, \(\widehat{\text{Var}}(\hat{\beta}_0) = \hat{F}^{-1} = \frac{N}{N_1 N_2} = \frac{1}{N_1}+\frac{1}{N_2}\).
d) What is the estimated (numerical value) standard deviation of the parameter estimate? Did your calculation above give the same result?
Hint: use vcov(fit0).
vcov(fit0)
## (Intercept)
## (Intercept) 0.002881477
Hopefully, “yes” (up to rounding error): 1/sum(ds$notUsing) + 1/sum(ds$using) = 0.0028815, which matches vcov(fit0); its square root, about 0.0537, is the standard error reported by summary(fit0).
e) What is the asymptotic distribution of the estimated regression coefficient? Use this to write down the formula for the 95% confidence interval for the intercept, and calculate the interval in R. Compare numerically to confint.default and confint.
The coefficient is asymptotically normally distributed, so the (approximate) 95% confidence interval is
\(\hat{\beta}_0 \pm 1.96 \sqrt{\sigma^2_{\beta_0}} =\) -0.77 \(\pm 1.96 \times\) 0.05 = (-0.88, -0.67).
c(coef(fit0) - 1.96 * sqrt(vcov(fit0)), coef(fit0) + 1.96 * sqrt(vcov(fit0)))
## [1] -0.8797661 -0.6693428
(ci = confint.default(fit0))
## 2.5 % 97.5 %
## (Intercept) -0.8797641 -0.6693448
confint(fit0)
## 2.5 % 97.5 %
## -0.8804716 -0.6700014
So, the asymptotic answer is (almost) the same as the one confint.default() gives. confint() (which is actually confint.glm()) is different in the third decimal place: it uses a different approach, called profile likelihood.
f) Use the confidence interval for \(\beta_0\) to find a confidence interval for the probability of using contraception.
Hint: plogis is the inverse logit. Or write your own function.
If a transformation is strictly monotone, then an x% quantile before transformation is also an x% (or (100-x)%, for a decreasing transformation) quantile on the transformed scale, so we can transform the endpoints of a confidence interval directly. Recall that this does not hold for other statistics (e.g. the mean, mode or standard deviation).
So, we can simply transform the ci:
plogis(ci)
## 2.5 % 97.5 %
## (Intercept) 0.2932267 0.3386436
fit0$family$linkinv(ci)
## 2.5 % 97.5 %
## (Intercept) 0.2932267 0.3386436
(a little less maths)
Fit a regression with wantsMore as covariate, and call this fit1.
a) Explain what the estimated coefficient (\(\beta_1\)) means.
Hint: interpret it using odds – if wantsMore goes from 0=no to 1=yes, then… Also, check whether a “success” is using or not using contraceptives.
fit1 = glm(cbind(using, notUsing) ~ wantsMore, data = ds, family = binomial)
summary(fit1)
##
## Call:
## glm(formula = cbind(using, notUsing) ~ wantsMore, family = binomial,
## data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.18636 0.07971 -2.338 0.0194 *
## wantsMoreyes -1.04863 0.11067 -9.475 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 74.098 on 14 degrees of freedom
## AIC: 149.61
##
## Number of Fisher Scoring iterations: 4
exp(fit1$coefficients)
## (Intercept) wantsMoreyes
## 0.8299712 0.3504178
\(\beta_1\) is the difference between the two levels of wantsMore, i.e. it is the difference in the log odds between women who want more children and those who do not. The estimate is negative (-1.05), so a woman who wants more children is less likely to be using contraceptives. The odds for a woman who wants more children are about 65% lower (they are multiplied by \(e^{-1.05} \approx 0.35\)) than for a woman who does not want more children. Put the other way around, the odds are 2.9 (\(=1/0.35\)) times higher for women who do not want more children.
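As a sketch (not asked for here), exponentiating a Wald interval for the coefficient gives an interval for this odds ratio:
exp(coef(fit1)["wantsMoreyes"])               # estimated odds ratio (yes vs no), about 0.35
exp(confint.default(fit1)["wantsMoreyes", ])  # Wald-type 95% CI on the odds-ratio scale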
b) Is this covariate significant?
Hint: use vcov(fit1) to access the inverse of the Fisher information matrix, and compare with summary.
We test \(H_0: \beta_1=0\) against \(H_1: \beta_1 \ne 0\).
Formula:
\[ z = \frac{\hat{\beta}_1 - 0}{\sqrt{a_{kk}(\hat{\boldsymbol{\beta}})}}, \qquad w = z^2, \]
where \(a_{kk}(\hat{\boldsymbol{\beta}})\) is the corresponding diagonal element of the inverse Fisher information, i.e. the estimated variance of \(\hat{\beta}_1\). Under \(H_0\), \(z\) is approximately standard normal and \(w\) is approximately \(\chi^2_1\).
In code:
(w <- coef(fit1)["wantsMoreyes"]^2/vcov(fit1)["wantsMoreyes", "wantsMoreyes"])
## wantsMoreyes
## 89.77763
(z <- coef(fit1)["wantsMoreyes"]/sqrt(vcov(fit1)["wantsMoreyes", "wantsMoreyes"]))
## wantsMoreyes
## -9.475106
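To get a p-value we can compare \(w\) with a \(\chi^2_1\) distribution, or equivalently \(z\) with a standard normal (a sketch using the objects just created):
pchisq(w, df = 1, lower.tail = FALSE)   # chi-squared form
2 * pnorm(abs(z), lower.tail = FALSE)   # equivalent two-sided normal form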
We can compare with the z value and p-value reported by summary:
summary(fit1)
##
## Call:
## glm(formula = cbind(using, notUsing) ~ wantsMore, family = binomial,
## data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.18636 0.07971 -2.338 0.0194 *
## wantsMoreyes -1.04863 0.11067 -9.475 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 74.098 on 14 degrees of freedom
## AIC: 149.61
##
## Number of Fisher Scoring iterations: 4
They are the same!
c) Alternatively, the likelihood ratio test can be used. Write down the formula for the likelihood ratio test statistic. Use this in combination with the fit0 and fit1 objects to calculate the likelihood ratio statistic. What is the \(p\)-value?
\[ - 2\ln \lambda=-2(\ln L(\hat{\boldsymbol{\beta}}_B)-\ln L(\hat{\boldsymbol{\beta}}_A)) \] where B is the smaller model (here fit0) and A is the larger model (here fit1). We can calculate the log-likelihoods:
# log-likelihood for grouped logistic regression; the binomial coefficients are omitted
# (they do not depend on the parameters, and cancel in likelihood ratios)
calcloglik <- function(par, args) {
    y <- args$y
    x <- args$x
    n <- args$n
    res <- sum(y * x %*% par - n * log(1 + exp(x %*% par)))
    return(res)
}
args0 <- list(y = ds$using, n = ds$using + ds$notUsing, x = cbind(rep(1, nrow(ds))))
args1 <- list(y = ds$using, n = ds$using + ds$notUsing, x = cbind(rep(1, nrow(ds)),
as.numeric(ds$wantsMore == "yes")))
betas0 <- fit0$coefficients
betas1 <- fit1$coefficients
(ll0 <- calcloglik(betas0, args0))
## [1] -1001.847
(ll1 <- calcloglik(betas1, args1))
## [1] -956.0096
(LRTstat <- -2 * (ll0 - ll1))
## [1] 91.6744
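The two models differ by one parameter, so under \(H_0\) the statistic is approximately \(\chi^2_1\); a sketch of the p-value:
pchisq(LRTstat, df = 1, lower.tail = FALSE)  # very small, so we reject H0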
d) The likelihood ratio test statistic can alternatively be calculated using the residual deviances in fit1 and fit0, and is given as fit0$deviance - fit1$deviance. Do you see why?
Hint: “No” may be correct, but is not the answer we are looking for. Also, think about deviances being additive.
Yes, of course we see why! The deviance of a model is \(-2(\ln L(\text{model})-\ln L(\text{saturated model}))\), and the saturated-model term is the same for fit0 and fit1, so it cancels in the difference: fit0$deviance - fit1$deviance \(= -2(\ln L(\text{fit0})-\ln L(\text{fit1}))\), which is exactly the likelihood ratio statistic. Put differently, the total (null) deviance splits into “explained” and residual deviance, so the increase in explained deviance equals the decrease in residual deviance.
fit0$deviance - fit1$deviance
## [1] 91.6744
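The same number can be obtained from logLik(); the binomial-coefficient terms that calcloglik() drops cancel in the difference (a sketch):
-2 * (as.numeric(logLik(fit0)) - as.numeric(logLik(fit1)))  # again about 91.67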
e) Are the two test statistics (Wald and LRT) equal? Do the two tests give the same conclusions?
(no maths - only definitions and print-out)
Now we study the response together with age and wantsMore. We will consider the following 5 models; see the R code and print-out below.
ds$Y <- cbind(ds$using, ds$notUsing)
models <- list(null = glm(Y ~ 1, family = binomial, data = ds),
    age = glm(Y ~ age, family = binomial, data = ds),
    desire = glm(Y ~ wantsMore, family = binomial, data = ds),
    additive = glm(Y ~ age + wantsMore, family = binomial, data = ds),
    interact = glm(Y ~ age * wantsMore, family = binomial, data = ds))
models
## $null
##
## Call: glm(formula = Y ~ 1, family = binomial, data = ds)
##
## Coefficients:
## (Intercept)
## -0.7746
##
## Degrees of Freedom: 15 Total (i.e. Null); 15 Residual
## Null Deviance: 165.8
## Residual Deviance: 165.8 AIC: 239.3
##
## $age
##
## Call: glm(formula = Y ~ age, family = binomial, data = ds)
##
## Coefficients:
## (Intercept) age25-29 age30-39 age40-49
## -1.5072 0.4607 1.0483 1.4246
##
## Degrees of Freedom: 15 Total (i.e. Null); 12 Residual
## Null Deviance: 165.8
## Residual Deviance: 86.58 AIC: 166.1
##
## $desire
##
## Call: glm(formula = Y ~ wantsMore, family = binomial, data = ds)
##
## Coefficients:
## (Intercept) wantsMoreyes
## -0.1864 -1.0486
##
## Degrees of Freedom: 15 Total (i.e. Null); 14 Residual
## Null Deviance: 165.8
## Residual Deviance: 74.1 AIC: 149.6
##
## $additive
##
## Call: glm(formula = Y ~ age + wantsMore, family = binomial, data = ds)
##
## Coefficients:
## (Intercept) age25-29 age30-39 age40-49 wantsMoreyes
## -0.8698 0.3678 0.8078 1.0226 -0.8241
##
## Degrees of Freedom: 15 Total (i.e. Null); 11 Residual
## Null Deviance: 165.8
## Residual Deviance: 36.89 AIC: 118.4
##
## $interact
##
## Call: glm(formula = Y ~ age * wantsMore, family = binomial, data = ds)
##
## Coefficients:
## (Intercept) age25-29 age30-39
## -1.4553 0.6354 1.5411
## age40-49 wantsMoreyes age25-29:wantsMoreyes
## 1.7643 -0.0640 -0.2672
## age30-39:wantsMoreyes age40-49:wantsMoreyes
## -1.0905 -1.3671
##
## Degrees of Freedom: 15 Total (i.e. Null); 8 Residual
## Null Deviance: 165.8
## Residual Deviance: 20.1 AIC: 107.6
lapply(models, summary)
## $null
##
## Call:
## glm(formula = Y ~ 1, family = binomial, data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.77455 0.05368 -14.43 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.77 on 15 degrees of freedom
## Residual deviance: 165.77 on 15 degrees of freedom
## AIC: 239.28
##
## Number of Fisher Scoring iterations: 4
##
##
## $age
##
## Call:
## glm(formula = Y ~ age, family = binomial, data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5072 0.1303 -11.571 < 2e-16 ***
## age25-29 0.4607 0.1727 2.667 0.00765 **
## age30-39 1.0483 0.1544 6.788 1.14e-11 ***
## age40-49 1.4246 0.1940 7.345 2.06e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 86.581 on 12 degrees of freedom
## AIC: 166.09
##
## Number of Fisher Scoring iterations: 4
##
##
## $desire
##
## Call:
## glm(formula = Y ~ wantsMore, family = binomial, data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.18636 0.07971 -2.338 0.0194 *
## wantsMoreyes -1.04863 0.11067 -9.475 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 74.098 on 14 degrees of freedom
## AIC: 149.61
##
## Number of Fisher Scoring iterations: 4
##
##
## $additive
##
## Call:
## glm(formula = Y ~ age + wantsMore, family = binomial, data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8698 0.1571 -5.536 3.10e-08 ***
## age25-29 0.3678 0.1754 2.097 0.036 *
## age30-39 0.8078 0.1598 5.056 4.27e-07 ***
## age40-49 1.0226 0.2039 5.014 5.32e-07 ***
## wantsMoreyes -0.8241 0.1171 -7.037 1.97e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 36.888 on 11 degrees of freedom
## AIC: 118.4
##
## Number of Fisher Scoring iterations: 4
##
##
## $interact
##
## Call:
## glm(formula = Y ~ age * wantsMore, family = binomial, data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.4553 0.2968 -4.903 9.43e-07 ***
## age25-29 0.6354 0.3564 1.783 0.07463 .
## age30-39 1.5412 0.3183 4.842 1.29e-06 ***
## age40-49 1.7643 0.3435 5.136 2.80e-07 ***
## wantsMoreyes -0.0640 0.3303 -0.194 0.84637
## age25-29:wantsMoreyes -0.2672 0.4091 -0.653 0.51366
## age30-39:wantsMoreyes -1.0905 0.3733 -2.921 0.00349 **
## age40-49:wantsMoreyes -1.3672 0.4834 -2.828 0.00468 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 20.099 on 8 degrees of freedom
## AIC: 107.61
##
## Number of Fisher Scoring iterations: 4
data.frame(deviance = round(unlist(lapply(models, deviance)), 2), df = unlist(lapply(models,
df.residual)), aic = round(unlist(lapply(models, AIC))))
## deviance df aic
## null 165.77 15 239
## age 86.58 12 166
## desire 74.10 14 150
## additive 36.89 11 118
## interact 20.10 8 108
Note: the function lapply (list apply) applies a function to each element of a list.
a) Explain what each of these models includes.
b) What is the definition of the deviance and of the df? What can we use the deviance for?
Hint: df = degrees of freedom; the deviance is a function of likelihoods.
The deviance is defined as the likelihood ratio statistic, where we put the saturated model in place of the larger model A and our candidate model in place of the smaller model B:
\[\begin{aligned} D&=-2(\ln L(\text{candidate model})-\ln L(\text{saturated model})) \\ &=-2(l(\hat{\pi})-l(\tilde{\pi}))= -2\sum_{j=1}^G(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j)) \end{aligned}\]
The df is the difference in the number of parameters of the two models being compared.
We can use the deviance to (a) compare nested models, (b) test the goodness of fit of a model, and (c) test for overdispersion.
Optional: derive the formula for the deviance for the logit model (the derivation is not given on the module pages.)
The deviance is
\[ D= -2\sum_{j=1}^G(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j)) = 2\sum_{j=1}^G [y_j\ln(\frac{y_j}{n_j\hat{\pi}_j})+(n_j-y_j)\ln(\frac{n_j-y_j}{n_j-n_j\hat{\pi}_j})] \]
Because \(D = -2\sum_{j=1}^G(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j))\), we can just work with \(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j)\), where \(\hat{\pi}_j\) is the estimate under our model and \(\tilde{\pi}_j = y_j/n_j\) is the estimate under the saturated model. Up to a constant (the binomial coefficient, which cancels), \(l_j(\pi_j) = y_j\ln\pi_j + (n_j-y_j)\ln(1-\pi_j)\), so
\[\begin{aligned} l_j({\hat{\pi}}_j)-l_j({\tilde{\pi}}_j) &= \left( y_j \ln {\hat{\pi}}_j + (n_j-y_j) \ln(1-{\hat{\pi}}_j) \right) - \left( y_j \ln {\tilde{\pi}}_j + (n_j-y_j) \ln(1-{\tilde{\pi}}_j) \right) \\ &= y_j \ln \frac{{\hat{\pi}}_j}{{\tilde{\pi}}_j} + (n_j-y_j) \ln \frac{1-{\hat{\pi}}_j}{1-{\tilde{\pi}}_j} \\ &= y_j \ln \frac{n_j {\hat{\pi}}_j}{y_j} + (n_j-y_j) \ln \frac{n_j-n_j {\hat{\pi}}_j}{n_j-y_j} \end{aligned}\]
Plugging this into \(D=-2\sum_{j=1}^G(l_j(\hat{\pi}_j)-l_j(\tilde{\pi}_j))\) gives
\[ D = -2 \sum_{j=1}^G \left[y_j \ln \frac{n_j {\hat{\pi}}_j}{y_j} + (n_j-y_j) \ln \frac{n_j-n_j {\hat{\pi}}_j}{n_j-y_j}\right] = 2 \sum_{j=1}^G \left[y_j \ln \frac{y_j}{n_j {\hat{\pi}}_j} + (n_j-y_j) \ln \frac{n_j-y_j}{n_j-n_j {\hat{\pi}}_j}\right], \]
which is the formula above.
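To check that this formula matches what glm() reports, here is a sketch computing the deviance of fit1 by hand (there are no zero counts in this data set, so all the logs are finite):
y <- ds$using
n <- ds$using + ds$notUsing
pihat <- fitted(fit1)  # estimated probabilities for each covariate pattern
2 * sum(y * log(y/(n * pihat)) + (n - y) * log((n - y)/(n - n * pihat)))
# should match deviance(fit1), i.e. 74.098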
c) Use the deviance (likelihood ratio test) to compare the additive model and the model with interaction. Which model do you prefer?
The models are
additive: Y ~ age + wantsMore
interact: Y ~ age * wantsMore
The null hypothesis is \(H_0\): Y ~ age + wantsMore is the correct model, i.e. all of the interaction terms are 0. The alternative hypothesis is \(H_1\): Y ~ age * wantsMore is the correct model.
We could do this using anova(models$additive, models$interact, test = "Chisq"), but let’s do it by hand!
The deviances are 36.89 for the additive model and 20.10 for the interaction model (summary(models$additive)$deviance etc.), on 11 and 8 residual degrees of freedom respectively (summary(models$additive)$df.residual etc.).
The test statistic is then:
(test.dev <- summary(models$additive)$deviance - summary(models$interact)$deviance)
## [1] 16.78881
(test.df <- summary(models$additive)$df.residual - summary(models$interact)$df.residual)
## [1] 3
pchisq(test.dev, test.df, lower.tail = FALSE)
## [1] 0.0007810536
Hence, we reject the null hypothesis, and prefer the model with the interactions.
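For reference, the one-call version mentioned above gives the same deviance difference, degrees of freedom and p-value:
anova(models$additive, models$interact, test = "Chisq")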
d) What if you use the AIC for model selection, which model (out of the 5) will you then use?
We can extract the AICs from each object (or just read the output above). Then we can use R to find the model with the lowest AIC:
(AICs <- unlist(lapply(models, AIC))) # lapply() returns a list, so unlist() tidies this up
## null age desire additive interact
## 239.2803 166.0886 149.6059 118.3957 107.6069
which(AICs == min(AICs))
## interact
## 5
And we find the interaction model, once again, is best.
(yes, mainly R and trying to understand)
Finally, we want to use the additive model with all three covariates, Y ~ age + education + wantsMore, to study different plots.
fit3add = glm(cbind(using, notUsing) ~ age + education + wantsMore, data = ds, family = binomial)
summary(fit3add)
##
## Call:
## glm(formula = cbind(using, notUsing) ~ age + education + wantsMore,
## family = binomial, data = ds)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8082 0.1590 -5.083 3.71e-07 ***
## age25-29 0.3894 0.1759 2.214 0.02681 *
## age30-39 0.9086 0.1646 5.519 3.40e-08 ***
## age40-49 1.1892 0.2144 5.546 2.92e-08 ***
## educationlow -0.3250 0.1240 -2.620 0.00879 **
## wantsMoreyes -0.8330 0.1175 -7.091 1.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 29.917 on 10 degrees of freedom
## AIC: 113.43
##
## Number of Fisher Scoring iterations: 4
exp(fit3add$coefficients)
## (Intercept) age25-29 age30-39 age40-49 educationlow wantsMoreyes
## 0.4456506 1.4760678 2.4808804 3.2845805 0.7225312 0.4347628
a) Comment on the output from summary. Use the deviance to test if this is a good model. Which model do you then compare to (what is the saturated model here)?
Hint: think about both the directions of the effects and their sizes.
Hint: you should know how a saturated model is defined. For extra marks, can you write the R formula for it?
The summary suggests a lot is going on: every term is clearly significant, the probability of using contraception increases with age, and it is lower for women with low education and for women who want more children.
We can test the goodness of fit of the model using the deviance. This is, as you now know, based on the difference between the likelihood for the model and the likelihood for a saturated model. The saturated model has one parameter per data point (i.e. per row in the data). Because the covariates are all factors, and we have every combination of their levels, this is the same as the model with all possible interactions, i.e. cbind(using, notUsing) ~ age*education*wantsMore.
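A sketch to convince yourself: that three-way interaction model has 16 parameters for the 16 covariate patterns, so its residual deviance is (numerically) zero on 0 degrees of freedom:
fitsat <- glm(cbind(using, notUsing) ~ age * education * wantsMore, data = ds, family = binomial)
c(deviance = deviance(fitsat), df.residual = df.residual(fitsat))  # essentially 0 on 0 df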
We can test if this is a good model from this line:
Residual deviance: 29.917 on 10 degrees of freedom
We can compare this to a \(\chi^2\) distribution with 10 degrees of freedom:
pchisq(29.917, 10, lower.tail = FALSE)
## [1] 0.0008838311
pchisq(fit3add$deviance, fit3add$df.residual, lower.tail = FALSE) < 0.05
## [1] TRUE
The second line just shows how to (a) extract the statistics from the model object, and (b) reduce the p-value to a yes/no answer, i.e. whether it is less than 0.05.
We can see that the test is significant, i.e. the model does not appear to be good.
b) Look at the plot of fitted values vs. deviance residuals. Do you see a trend?
library(ggplot2)
# residuals(fit3add) gives deviance residuals; fit3add$residuals would give working residuals
plotdf <- data.frame(dres = residuals(fit3add, type = "deviance"), fitted = fit3add$fitted.values,
    age = ds$age)
ggplot(plotdf, aes(x = fitted, y = dres)) + geom_point() + labs(x = "Fitted values",
    y = "Deviance residuals")
c) Here is code to produce a plot of the fitted values for the 16 covariate patterns together with a saturated model (called frac), on the logit scale.
library(ggplot2)
frac = ds$using/(ds$using + ds$notUsing)
logitdf = data.frame(lfitted = qlogis(fit3add$fitted), lfrac = qlogis(frac), age = ds$age,
wantsMore = ds$wantsMore, education = ds$education)
ggplot(logitdf, aes(x = age)) + geom_point(aes(y = lfrac, colour = "saturated")) +
geom_point(aes(y = lfitted, colour = "fitted")) + facet_wrap(facets = ~wantsMore *
education) + labs(x = "age", y = "")
How can you see from the points for the fitted values that we are assuming an additive model on the logit scale (that is, under an additive model the pattern of logit fitted values against age is the same in every panel)? Do you also see that for the observed values? What is the implication? Any other observations?
Hint: if it helps, think of age as continuous, and ask how the line changes when the factor levels change.
Hint: for the observed values, are there clear patterns? Note that this is a subjective judgement!
If the model is additive on this (logit) scale, the panels should look similar: the effect of age should be the same in every panel, and changing education or wantsMore should just shift all the points in a panel up or down by the same amount. So, with some obvious notation, we can go from
\[ \eta_i = \beta_0 + \beta_{\text{age25-29}} + \beta_{\text{educationHigh}} \]
to
\[ \eta_i = \beta_0 + \beta_{\text{age25-29}} + \beta_{\text{educationLow}} = \beta_1 + \beta_{\text{age25-29}}, \]
where \(\beta_1 = \beta_0 + \beta_{\text{educationLow}}\): changing age changes \(\beta_{\text{age}}\), but not \(\beta_1\).
For the observed values this is not so clear. For no, low (i.e. not wanting more children and low education) the observed values are close to the fitted ones. But for no, high the trend over age seems steeper, and yes, low has a lower value for ages 40-49.
The problem with interpreting these plots is that it is easy to over-interpret random noise (pareidolia). But we know from the test of the residual deviance that this model does not fit the data well enough, so perhaps there is something going on. Or perhaps this extra variation is random noise (Robert Abelson, in his book, summarised this as “chance is lumpy”).
For this module, the following are exam questions to work with.
In addition, these essay-type exam questions are closely related to this module.
There are two main asymptotic results that are behind essentially everything we have done with respect to inference for generalized linear models. These are
asymptotic normality of maximum likelihood estimators (MLE), and
asymptotic result for likelihood ratio tests (LRT).
State, describe and discuss the main assumptions behind these two asymptotic results, how these results are used in practice to do inference for parameters, testing various hypotheses and comparing nested models, for logistic regression.
We will consider (binomial or binary) logistic regression, where we have independent observations \[Y_i \sim \text{Binomial}(n_i,\pi_i) \text{ } i=1,\ldots,n\] so that \[P(Y_i)=\binom{n_i}{y_i}\pi_i^{y_i}(1-\pi_i)^{n_i-y_i}, \text{ } y_i=0,1,\ldots, n_i\] The linear predictor is \[ \eta_i={\bf x}_i^T\boldsymbol{\beta}\] and \[ \pi_i=\frac{\exp(\eta_i)}{1+\exp(\eta_i)}\] or \[ \text{logit}(\pi_i)=\eta_i.\] Here, \({\bf x}_i\) is a vector of the \(p\) covariates for the \(i\)th observation \(y_i\) with size (number of trials) \(n_i\), and \(\boldsymbol{\beta}\) is the vector of \(p\) unknown regression coefficients.
Write an introduction to logistic regression and its practical usage, for a student with a good background in statistics, but no knowledge about Generalized Linear Models (GLM). Topics you may want to consider are
install.packages(c("tidyverse", "investr", "knitr", "kableExtra", "faraway", "viridis",
"statmod"))