#Interactive session - second week
(Permitted aids for the exam was “Tabeller og formler i statistikk”, Matematisk formelsamling (Rottmann), one A5 sheet with your own handwritten notes, and a simple calculator.)
The dataset given in smoking.txt
consists of four
variables:
deaths
: number of lung cancer deaths over a period of
six years [remark: incorrectly 1 year in exam question]population
: the number of people [remark: incorrectly
in 100 000 people in exam question]age
: in five-year age groups
(40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+
)ageLevel
: age group as numbers from 1 to 9 (1
corresponds to 40-44, 2 to 45-59, and so on)smoking status
: doesn’t smoke (no
), smokes
cigars or pipe only (cigarPipeOnly
), smokes cigarettes and
cigar or pipe (cigarettePlus
), and smokes cigarettes only
(cigaretteOnly
)You can look at the dataset here: https://www.math.ntnu.no/emner/TMA4315/2018h/smoking.txt, and can be found here as well: http://data.princeton.edu/wws509/datasets/#smoking
Remark: the data set is probably taken from https://www.jstor.org/stable/41983444?seq=1#page_scan_tab_contents and there it is said the the persons under study are males who have contributed in wars before 1956 and who answered a questionary. The authors point out that the dataset is not representative for the whole population.
We are interested in studying if the mortality rate due to lung cancer (the number of deaths due to lung cancer per individual during six year) controlled for age group and smoking status. Assume that the number of deaths for each set of covariate values, \(Y_i\), can be considered Poisson distributed, \(Y_i \sim \text{Poisson}(\lambda_i)\). We fit the following model:
# load data:
smoking <- read.table(file = "https://www.math.ntnu.no/emner/TMA4315/2018h/smoking.txt")
head(smoking)
nrow(smoking)
model1 <- glm(dead ~ age + smoke, family = "poisson", data = smoking, offset = log(pop))
# note that the size of the population for each combination of the covaraites
# is the offset here
summary(model1)
## dead pop age ageLevel smoke
## 1 18 656 40-44 1 no
## 2 22 359 45-59 2 no
## 3 19 249 50-54 3 no
## 4 55 632 55-59 4 no
## 5 117 1067 60-64 5 no
## 6 170 897 65-69 6 no
## [1] 36
##
## Call:
## glm(formula = dead ~ age + smoke, family = "poisson", data = smoking,
## offset = log(pop))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.63222 0.06783 -53.552 < 2e-16 ***
## age45-59 0.55388 0.07999 6.924 4.38e-12 ***
## age50-54 0.98039 0.07682 12.762 < 2e-16 ***
## age55-59 1.37946 0.06526 21.138 < 2e-16 ***
## age60-64 1.65423 0.06257 26.439 < 2e-16 ***
## age65-69 1.99817 0.06279 31.824 < 2e-16 ***
## age70-74 2.27141 0.06435 35.296 < 2e-16 ***
## age75-79 2.55858 0.06778 37.746 < 2e-16 ***
## age80+ 2.84692 0.07242 39.310 < 2e-16 ***
## smokecigarretteOnly 0.36915 0.03791 9.737 < 2e-16 ***
## smokecigarrettePlus 0.17015 0.03643 4.671 3.00e-06 ***
## smokeno -0.04781 0.04699 -1.017 0.309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4055.984 on 35 degrees of freedom
## Residual deviance: 21.487 on 24 degrees of freedom
## AIC: 285.51
##
## Number of Fisher Scoring iterations: 4
a) What is an offset?
For 53-year old non-smokers, what is the estimated number of deaths due
to lung cancer (per person over 6 years)?
Why is the number of degrees of freedom for the deviance of this model
24? Does the model give a good fit?
b) Let \(\lambda(a,s)\) denote the expected number of lung cancer deaths per person in age group \(a\) with smoking status \(s\). For two different smoking statuses \(s_1\) and \(s_2\), define
\[r(a, s_1, s_2) = \frac{\lambda(a, s_1)}{\lambda(a, s_2)}.\]
Explain why \(r(a, s_1, s_2)\) does
not vary as a function of \(a\) in model1
.
For \(s_1 =\)
cigarPipeOnly
and \(s_2
=\) cigaretteOnly
, find an estimate value for \(r(a, s_1, s_2)\) and an approximate 90 %
confidence interval. Is there a significant difference in the expected
number of lung cancer deaths for individuals that smoke cigarettes
cigaretteOnly
versus those that smoke cigar/pipe
cigarPipeOnly
?
c) We will now consider two alternative models,
model2
and model3
:
model2 <- glm(dead ~ smoke, family = "poisson", data = smoking, offset = log(pop))
model3 <- glm(dead ~ ageLevel + smoke, family = "poisson", data = smoking, offset = log(pop))
summary(model2)
summary(model3)
##
## Call:
## glm(formula = dead ~ smoke, family = "poisson", data = smoking,
## offset = log(pop))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.47319 0.03099 -47.532 < 2e-16 ***
## smokecigarretteOnly -0.31219 0.03576 -8.729 < 2e-16 ***
## smokecigarrettePlus -0.43013 0.03468 -12.402 < 2e-16 ***
## smokeno -0.36678 0.04669 -7.855 3.98e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4056.0 on 35 degrees of freedom
## Residual deviance: 3910.7 on 32 degrees of freedom
## AIC: 4158.7
##
## Number of Fisher Scoring iterations: 5
##
##
## Call:
## glm(formula = dead ~ ageLevel + smoke, family = "poisson", data = smoking,
## offset = log(pop))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.705950 0.050717 -73.071 < 2e-16 ***
## ageLevel 0.333006 0.005591 59.559 < 2e-16 ***
## smokecigarretteOnly 0.405019 0.037463 10.811 < 2e-16 ***
## smokecigarrettePlus 0.203426 0.035996 5.651 1.59e-08 ***
## smokeno -0.032927 0.046894 -0.702 0.483
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 4055.984 on 35 degrees of freedom
## Residual deviance: 75.734 on 31 degrees of freedom
## AIC: 325.76
##
## Number of Fisher Scoring iterations: 4
Why does model2
and model3
have 32 and 31
degrees of freedom, respectively? If we want to compare the three models
model1
, model2
and model3
, which
model would you choose as the best? Justify your answer by formulating
relevant hypotheses and perform hypothesis tests, based on the
print-outs above. Hint: remember that to compare two models then one
model needs to be nested within the other model.
New: For model3
: Plot the regression line for the
expected number of lung cancer deaths per person as a function of
age
for each of the four different smoking levels in the
same plot. First use all coefficients, and then only the ones that are
significant. Use ggplot
! Discuss what you see.
Remark: the text is slightly modified from the original exam since we parameterized the gamma as in our textbook.
Remark: Problems 1 (binomial) and 2 (multinomial) at the 2012 exam also asked about percipitation.
We want to model the amount of daily precipitation given that it is precipitation, and denote this quantity \(Y\). It is common to model \(Y\) as a gamma distributed random variable, \(Y \sim Gamma(\nu,\mu)\), with density
\[ f_Y(y) = \frac{1}{\Gamma(\nu)} \left(\frac{\nu}{\mu}\right)^{\nu} y^{\nu-1}\exp\left(-\frac{\nu}{\mu} y \right) \]
In this problem we consider \(N\) observations, each gamma distributed with \(Y_i \sim Gamma(\nu, \mu_i)\) (remark: common \(\nu\)). Here \(\nu\) is considered to be a known nuisance parameter, and the \(\mu_i\)s are unknown.
a) Show that the gamma distribution function is
member of the exponential family when \(\mu_i\) is the parameter of interest.
Use this to find expressions for the expected value and the variance of
\(Y_i\), in terms of \((\nu,\mu_i)\), and interpret \(\nu\).
New: What is the canonical link for the Gamma distribution?
Hint: if you want to focus on discussing - you may look at the solutions from Module 1 together.
b) Explain what a saturated model is.
Set up the log-likelihood function expressed by \(\mu_i\), and use it to find the maximum
likelihood estimators for \(\mu_i\)-s
of the saturated model.
Find the deviance (based on all \(N\)
observations).
Hint: Do you see directly that \(\hat{\mu}_i=y_i\)? If not, you may look at the likelihood for one observation and solve that the derivative equals 0.
c) We now want to construct a model for amount of
precipitation (given that there are occurrence) with precipitation
forecast as explanatory variable.
Let \(Y_i\) be amount of precipitation
for day \(i\), and let \(x_i\) be the precipitation forecast valid
for day \(i\). Set up a GLM for this,
and argue for your choice of link function and linear predictor.
Hint: explain why this fits into our GLM-gamma framework and set up our 3 equation model (random, systematic, link).
(For reference: here is the original exam).
This is a problem on the logistic regression.
Do not look at the dataset before the end of the exercise! You should
solve the exercise without using R
, just as if you were at
an exam.
In this problem you shall consider data of survivals from a study of treatment for breast cancer. The response is the number that survived for three years. The covariates were the four factors
app
: appearance of tumor, two levels, 1 = malignant, 2
= benigninfl
: inflammatory reaction, two levels, 1 = minimal, 2
= moderate or severeage
: age of patients, three levels, 1 = under 50, 2 =
50 to 69, 3 = 70 or oldercountry
: hospital of treatment, three levels, 1 =
Japan, 2 = US, 3 = UKThe dataset we have used differs slightly from the one they used at UiO (the number of survivals and non-survivals differ).
The number of survivors is modelled as a binomially distributed variable using a canonical logit link. Dummy variable coding is used for all factors, with the level coded as “1” as the reference category.
a) The output from fitting the model where only appearance and country are used as covariates, i.e., a model with predictor on the form
\[ \eta = \beta_0 + \beta_1 \texttt{ fapp} + \beta_2 \texttt{ fcountry2} + \beta_3 \texttt{ fcountry3} \]
is displayed below. What is the interpretation of the estimate of the
coefficient of appearance, fapp
(f
means
factor)? Explain also how this coefficient can be expressed in terms of
an odds ratio.
Hint: first explain what is the predicted odds of survivial for country \(j\) for a benign tumor, then the same for a malignant tumor- and then make an odds ratio.
New: The main difference between our dataset and UiO’s dataset is the number of degrees of freedom for the null model; we have 34, they have 35. How many observations do we have in the dataset? And how many does UiO have? Hint: All the covariates in the dataset are categorical, two with 3 levels, and two with 2 levels. How many possible combinations of observations does that make?
Call:
glm(formula = cbind(surv, nsurv) ~ fapp + fcountry, family = binomial,
data = brc)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8142 -0.7279 0.2147 0.7576 1.8715
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0803 0.1656 6.522 6.93e-11 ***
fapp2 0.5157 0.1662 3.103 0.001913 **
fcountry2 -0.6589 0.1998 -3.298 0.000972 ***
fcountry3 -0.4945 0.2071 -2.388 0.016946 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 57.588 on 34 degrees of freedom
Residual deviance: 36.625 on 31 degrees of freedom
b) The output below is an analysis of deviance for
comparing various model spesifications. Fill out the positions indicated
i-iv
.
Note: y ~ x1*x2
gives both the linear components and the
interaction, i.e., it is the same as y ~ x1 + x2 + x1:x2
(x1:x2
gives interaction only).
Remark: remember that anova
gives the sequential
comparison of deviances - that is - comparing each model to the previous
model.
Analysis of Deviance Table
Model 1: cbind(surv, nsurv) ~ fapp + fage + fcountry
Model 2: cbind(surv, nsurv) ~ fapp + fage + finfl + fcountry
Model 3: cbind(surv, nsurv) ~ fapp + finfl + fage * fcountry
Model 4: cbind(surv, nsurv) ~ fapp * finfl + fage * fcountry
Model 5: cbind(surv, nsurv) ~ fapp * finfl + fapp * fage + fage * fcountry
Model 6: cbind(surv, nsurv) ~ fapp * finfl * fage * fcountry
Resid. Df Resid. Dev Df Deviance
1 29 33.102
2 i 33.095 1 0.0065
3 24 25.674 ii 7.4210
4 23 25.504 1 iii
5 21 22.021 2 3.4823
6 0 0.000 iv 22.0214
In the remaining parts of this problem we return to the model in part a) and consider the hypothesis
\[ H_0 : \beta_2 + \beta_3 = -1 \text{ versus } H_1 : \beta_2 + \beta_3 \neq -1 \]
Note: what are you testing now?
c) The estimated covariance matrix between the estimators of the coefficients \(\beta_2\) and \(\beta_3\) above is \(\left(\begin{matrix} 0.040 & 0.021 \\ 0.021 & 0.043\end{matrix} \right)\). Use a Wald test to test the null hypothesis above.
Note: remember - this was a written exam so you do this by hand!
d) Explain how the null hypothesis can be tested
with a likelihood ratio test by fitting two suitable models. No
numerical calculations are necessary, but it must be specified how the
predictors should be defined.
Remark: rather technical - and hint: offset term?
New: Do this test in R
. Here
is the data set!
We will consider the following Poisson regression \[Y_i \sim \text{Poisson}(\exp(\eta_i)), \text{ }i=1,\ldots,n\] where the linear predictor is \(\eta_i={\bf x}_i^T\beta\). Here \({\bf x}_i\) is a vector of the \(p\) covariates for the \(i\)th observation \(Y_i\) and \(\beta\) is unknown \(p\) dimensional column vector of unknown regression coefficients.
Write an introduction to Poisson regression and its practical usage, for a student with a good background in statistics, but no knowledge about Generalized Linear Models (GLM). Topics you may want to consider, are
install.packages(c("tidyverse", "ggplot2", "statmod", "corrplot", "ggplot2", "GGally",
"boot"))