#Interactive session - second week

Problem 1: Exam 2007 (Problem 1, a bit modified) - Smoking and lung cancer

(Permitted aids for the exam was “Tabeller og formler i statistikk”, Matematisk formelsamling (Rottmann), one A5 sheet with your own handwritten notes, and a simple calculator.)

The dataset given in smoking.txt consists of four variables:

  • deaths: number of lung cancer deaths over a period of six years [remark: incorrectly 1 year in exam question]
  • population: the number of people [remark: incorrectly in 100 000 people in exam question]
  • age: in five-year age groups (40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+)
  • ageLevel: age group as numbers from 1 to 9 (1 corresponds to 40-44, 2 to 45-59, and so on)
  • smoking status: doesn’t smoke (no), smokes cigars or pipe only (cigarPipeOnly), smokes cigarettes and cigar or pipe (cigarettePlus), and smokes cigarettes only (cigaretteOnly)

You can look at the dataset here: https://www.math.ntnu.no/emner/TMA4315/2018h/smoking.txt, and can be found here as well: http://data.princeton.edu/wws509/datasets/#smoking

Remark: the data set is probably taken from https://www.jstor.org/stable/41983444?seq=1#page_scan_tab_contents and there it is said the the persons under study are males who have contributed in wars before 1956 and who answered a questionary. The authors point out that the dataset is not representative for the whole population.

We are interested in studying if the mortality rate due to lung cancer (the number of deaths due to lung cancer per individual during six year) controlled for age group and smoking status. Assume that the number of deaths for each set of covariate values, \(Y_i\), can be considered Poisson distributed, \(Y_i \sim \text{Poisson}(\lambda_i)\). We fit the following model:

# load data:
smoking <- read.table(file = "https://www.math.ntnu.no/emner/TMA4315/2018h/smoking.txt")
head(smoking)
nrow(smoking)
model1 <- glm(dead ~ age + smoke, family = "poisson", data = smoking, offset = log(pop))
# note that the size of the population for each combination of the covaraites
# is the offset here
summary(model1)
##   dead  pop   age ageLevel smoke
## 1   18  656 40-44        1    no
## 2   22  359 45-59        2    no
## 3   19  249 50-54        3    no
## 4   55  632 55-59        4    no
## 5  117 1067 60-64        5    no
## 6  170  897 65-69        6    no
## [1] 36
## 
## Call:
## glm(formula = dead ~ age + smoke, family = "poisson", data = smoking, 
##     offset = log(pop))
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.63222    0.06783 -53.552  < 2e-16 ***
## age45-59             0.55388    0.07999   6.924 4.38e-12 ***
## age50-54             0.98039    0.07682  12.762  < 2e-16 ***
## age55-59             1.37946    0.06526  21.138  < 2e-16 ***
## age60-64             1.65423    0.06257  26.439  < 2e-16 ***
## age65-69             1.99817    0.06279  31.824  < 2e-16 ***
## age70-74             2.27141    0.06435  35.296  < 2e-16 ***
## age75-79             2.55858    0.06778  37.746  < 2e-16 ***
## age80+               2.84692    0.07242  39.310  < 2e-16 ***
## smokecigarretteOnly  0.36915    0.03791   9.737  < 2e-16 ***
## smokecigarrettePlus  0.17015    0.03643   4.671 3.00e-06 ***
## smokeno             -0.04781    0.04699  -1.017    0.309    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4055.984  on 35  degrees of freedom
## Residual deviance:   21.487  on 24  degrees of freedom
## AIC: 285.51
## 
## Number of Fisher Scoring iterations: 4

a) What is an offset?
For 53-year old non-smokers, what is the estimated number of deaths due to lung cancer (per person over 6 years)?
Why is the number of degrees of freedom for the deviance of this model 24? Does the model give a good fit?

b) Let \(\lambda(a,s)\) denote the expected number of lung cancer deaths per person in age group \(a\) with smoking status \(s\). For two different smoking statuses \(s_1\) and \(s_2\), define

\[r(a, s_1, s_2) = \frac{\lambda(a, s_1)}{\lambda(a, s_2)}.\]

Explain why \(r(a, s_1, s_2)\) does not vary as a function of \(a\) in model1.
For \(s_1 =\) cigarPipeOnly and \(s_2 =\) cigaretteOnly, find an estimate value for \(r(a, s_1, s_2)\) and an approximate 90 % confidence interval. Is there a significant difference in the expected number of lung cancer deaths for individuals that smoke cigarettes cigaretteOnly versus those that smoke cigar/pipe cigarPipeOnly?

c) We will now consider two alternative models, model2 and model3:

model2 <- glm(dead ~ smoke, family = "poisson", data = smoking, offset = log(pop))
model3 <- glm(dead ~ ageLevel + smoke, family = "poisson", data = smoking, offset = log(pop))

summary(model2)
summary(model3)
## 
## Call:
## glm(formula = dead ~ smoke, family = "poisson", data = smoking, 
##     offset = log(pop))
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.47319    0.03099 -47.532  < 2e-16 ***
## smokecigarretteOnly -0.31219    0.03576  -8.729  < 2e-16 ***
## smokecigarrettePlus -0.43013    0.03468 -12.402  < 2e-16 ***
## smokeno             -0.36678    0.04669  -7.855 3.98e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4056.0  on 35  degrees of freedom
## Residual deviance: 3910.7  on 32  degrees of freedom
## AIC: 4158.7
## 
## Number of Fisher Scoring iterations: 5
## 
## 
## Call:
## glm(formula = dead ~ ageLevel + smoke, family = "poisson", data = smoking, 
##     offset = log(pop))
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -3.705950   0.050717 -73.071  < 2e-16 ***
## ageLevel             0.333006   0.005591  59.559  < 2e-16 ***
## smokecigarretteOnly  0.405019   0.037463  10.811  < 2e-16 ***
## smokecigarrettePlus  0.203426   0.035996   5.651 1.59e-08 ***
## smokeno             -0.032927   0.046894  -0.702    0.483    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4055.984  on 35  degrees of freedom
## Residual deviance:   75.734  on 31  degrees of freedom
## AIC: 325.76
## 
## Number of Fisher Scoring iterations: 4

Why does model2 and model3 have 32 and 31 degrees of freedom, respectively? If we want to compare the three models model1, model2 and model3, which model would you choose as the best? Justify your answer by formulating relevant hypotheses and perform hypothesis tests, based on the print-outs above. Hint: remember that to compare two models then one model needs to be nested within the other model.

New: For model3: Plot the regression line for the expected number of lung cancer deaths per person as a function of age for each of the four different smoking levels in the same plot. First use all coefficients, and then only the ones that are significant. Use ggplot! Discuss what you see.

Problem 2: TMA4315 Exam 2012, Problem 3: Precipitation in Trondheim, amount

Remark: the text is slightly modified from the original exam since we parameterized the gamma as in our textbook.

Remark: Problems 1 (binomial) and 2 (multinomial) at the 2012 exam also asked about percipitation.

We want to model the amount of daily precipitation given that it is precipitation, and denote this quantity \(Y\). It is common to model \(Y\) as a gamma distributed random variable, \(Y \sim Gamma(\nu,\mu)\), with density

\[ f_Y(y) = \frac{1}{\Gamma(\nu)} \left(\frac{\nu}{\mu}\right)^{\nu} y^{\nu-1}\exp\left(-\frac{\nu}{\mu} y \right) \]

In this problem we consider \(N\) observations, each gamma distributed with \(Y_i \sim Gamma(\nu, \mu_i)\) (remark: common \(\nu\)). Here \(\nu\) is considered to be a known nuisance parameter, and the \(\mu_i\)s are unknown.

a) Show that the gamma distribution function is member of the exponential family when \(\mu_i\) is the parameter of interest.
Use this to find expressions for the expected value and the variance of \(Y_i\), in terms of \((\nu,\mu_i)\), and interpret \(\nu\).
New: What is the canonical link for the Gamma distribution?

Hint: if you want to focus on discussing - you may look at the solutions from Module 1 together.

b) Explain what a saturated model is.
Set up the log-likelihood function expressed by \(\mu_i\), and use it to find the maximum likelihood estimators for \(\mu_i\)-s of the saturated model.
Find the deviance (based on all \(N\) observations).

Hint: Do you see directly that \(\hat{\mu}_i=y_i\)? If not, you may look at the likelihood for one observation and solve that the derivative equals 0.

c) We now want to construct a model for amount of precipitation (given that there are occurrence) with precipitation forecast as explanatory variable.
Let \(Y_i\) be amount of precipitation for day \(i\), and let \(x_i\) be the precipitation forecast valid for day \(i\). Set up a GLM for this, and argue for your choice of link function and linear predictor.

Hint: explain why this fits into our GLM-gamma framework and set up our 3 equation model (random, systematic, link).


Problem 3: Taken from UiO, STK3100, 2015, problem 2

(For reference: here is the original exam).

This is a problem on the logistic regression.

Do not look at the dataset before the end of the exercise! You should solve the exercise without using R, just as if you were at an exam.

In this problem you shall consider data of survivals from a study of treatment for breast cancer. The response is the number that survived for three years. The covariates were the four factors

  • app: appearance of tumor, two levels, 1 = malignant, 2 = benign
  • infl: inflammatory reaction, two levels, 1 = minimal, 2 = moderate or severe
  • age: age of patients, three levels, 1 = under 50, 2 = 50 to 69, 3 = 70 or older
  • country: hospital of treatment, three levels, 1 = Japan, 2 = US, 3 = UK

The dataset we have used differs slightly from the one they used at UiO (the number of survivals and non-survivals differ).

The number of survivors is modelled as a binomially distributed variable using a canonical logit link. Dummy variable coding is used for all factors, with the level coded as “1” as the reference category.

a) The output from fitting the model where only appearance and country are used as covariates, i.e., a model with predictor on the form

\[ \eta = \beta_0 + \beta_1 \texttt{ fapp} + \beta_2 \texttt{ fcountry2} + \beta_3 \texttt{ fcountry3} \]

is displayed below. What is the interpretation of the estimate of the coefficient of appearance, fapp (f means factor)? Explain also how this coefficient can be expressed in terms of an odds ratio.

Hint: first explain what is the predicted odds of survivial for country \(j\) for a benign tumor, then the same for a malignant tumor- and then make an odds ratio.

New: The main difference between our dataset and UiO’s dataset is the number of degrees of freedom for the null model; we have 34, they have 35. How many observations do we have in the dataset? And how many does UiO have? Hint: All the covariates in the dataset are categorical, two with 3 levels, and two with 2 levels. How many possible combinations of observations does that make?

Call:
glm(formula = cbind(surv, nsurv) ~ fapp + fcountry, family = binomial, 
    data = brc)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8142  -0.7279   0.2147   0.7576   1.8715  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0803     0.1656   6.522 6.93e-11 ***
fapp2         0.5157     0.1662   3.103 0.001913 ** 
fcountry2    -0.6589     0.1998  -3.298 0.000972 ***
fcountry3    -0.4945     0.2071  -2.388 0.016946 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 57.588  on 34  degrees of freedom
Residual deviance: 36.625  on 31  degrees of freedom

b) The output below is an analysis of deviance for comparing various model spesifications. Fill out the positions indicated i-iv.
Note: y ~ x1*x2 gives both the linear components and the interaction, i.e., it is the same as y ~ x1 + x2 + x1:x2 (x1:x2 gives interaction only).

Remark: remember that anova gives the sequential comparison of deviances - that is - comparing each model to the previous model.

Analysis of Deviance Table

Model 1: cbind(surv, nsurv) ~ fapp + fage + fcountry
Model 2: cbind(surv, nsurv) ~ fapp + fage + finfl + fcountry
Model 3: cbind(surv, nsurv) ~ fapp + finfl + fage * fcountry
Model 4: cbind(surv, nsurv) ~ fapp * finfl + fage * fcountry
Model 5: cbind(surv, nsurv) ~ fapp * finfl + fapp * fage + fage * fcountry
Model 6: cbind(surv, nsurv) ~ fapp * finfl * fage * fcountry
  Resid. Df Resid. Dev Df Deviance
1        29     33.102            
2         i     33.095  1   0.0065
3        24     25.674 ii   7.4210
4        23     25.504  1      iii
5        21     22.021  2   3.4823
6         0      0.000 iv  22.0214

In the remaining parts of this problem we return to the model in part a) and consider the hypothesis

\[ H_0 : \beta_2 + \beta_3 = -1 \text{ versus } H_1 : \beta_2 + \beta_3 \neq -1 \]

Note: what are you testing now?

c) The estimated covariance matrix between the estimators of the coefficients \(\beta_2\) and \(\beta_3\) above is \(\left(\begin{matrix} 0.040 & 0.021 \\ 0.021 & 0.043\end{matrix} \right)\). Use a Wald test to test the null hypothesis above.

Note: remember - this was a written exam so you do this by hand!

d) Explain how the null hypothesis can be tested with a likelihood ratio test by fitting two suitable models. No numerical calculations are necessary, but it must be specified how the predictors should be defined.
Remark: rather technical - and hint: offset term?

New: Do this test in R. Here is the data set!

Work on your own: Exam questions

December 2013 (Essay exam)

We will consider the following Poisson regression \[Y_i \sim \text{Poisson}(\exp(\eta_i)), \text{ }i=1,\ldots,n\] where the linear predictor is \(\eta_i={\bf x}_i^T\beta\). Here \({\bf x}_i\) is a vector of the \(p\) covariates for the \(i\)th observation \(Y_i\) and \(\beta\) is unknown \(p\) dimensional column vector of unknown regression coefficients.

Write an introduction to Poisson regression and its practical usage, for a student with a good background in statistics, but no knowledge about Generalized Linear Models (GLM). Topics you may want to consider, are

  • When to use it? Underlying assumptions.
  • Parameter estimation, limiting results for the MLE, Fisher information and observed Fisher information, confidence intervals and hypothesis testing.
  • Output analysis, residual plots and interpretation of results.
  • Deviance and its usage.
  • What do we do when a covariate is a factor, and should the results be interpreted?
  • The use of Poisson regression in the analysis of contingency tables.

R packages

install.packages(c("tidyverse", "ggplot2", "statmod", "corrplot", "ggplot2", "GGally",
    "boot"))

Further reading

  • A. Agresti (1996): “An Introduction to Categorical Data Analysis”.
  • A. Agresti (2015): “Foundations of Linear and Generalized Linear Models.” Wiley.
  • A. J. Dobson and A. G. Barnett (2008): “An Introduction to Generalized Linear Models”, Third edition.
  • J. Faraway (2015): “Extending the Linear Model with R”, Second Edition. http://www.maths.bath.ac.uk/~jjf23/ELM/
  • P. McCullagh and J. A. Nelder (1989): “Generalized Linear Models”. Second edition.