Problem 1

a.

library(VGAM)
attach(marital.nz)
#plot(age~mstatus)
mod1 <- vglm(mstatus ~ age, multinomial)
summary(mod1)
## 
## Call:
## vglm(formula = mstatus ~ age, family = multinomial)
## 
## Pearson residuals:
##                       Min      1Q   Median       3Q    Max
## log(mu[,1]/mu[,4]) -11.75 -0.1441 -0.13965 -0.13372  5.706
## log(mu[,2]/mu[,4]) -13.53  0.2871  0.31147  0.40939  1.212
## log(mu[,3]/mu[,4]) -12.47 -0.2364 -0.09098 -0.02037 82.311
## 
## Coefficients: 
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept):1  6.753157   0.515150   13.11   <2e-16 ***
## (Intercept):2  9.531824   0.482073   19.77   <2e-16 ***
## (Intercept):3 13.121214   0.513771   25.54   <2e-16 ***
## age:1         -0.099335   0.008043  -12.35   <2e-16 ***
## age:2         -0.102873   0.007100  -14.49   <2e-16 ***
## age:3         -0.252080   0.008955  -28.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Names of linear predictors: log(mu[,1]/mu[,4]), log(mu[,2]/mu[,4]), 
## log(mu[,3]/mu[,4])
## 
## Residual deviance: 6822.79 on 18153 degrees of freedom
## 
## Log-likelihood: -3411.395 on 18153 degrees of freedom
## 
## Number of Fisher scoring iterations: 7 
## 
## Warning: Hauck-Donner effect detected in the following estimate(s):
## '(Intercept):2', 'age:3'
## 
## 
## Reference group is level  4  of the response

i) Assumptions

  • We assume that the categorical response marital status \(Y \sim Multinomial(n,\boldsymbol{\pi}),\) where \(\boldsymbol{\pi}\) is the vector of probabilities for each (unordered) category and \(n\) is the number of independent trials.

  • The categorical response \(Y\) consists of 4 categories and the reference category is Widowed (category with least counts-default choice in R).

  • The probability of occurence for each non-reference category \(r=1,2,3\) is given by

\[ \pi_{ir} = P(Y_i = r) = \frac{\exp(\beta_{0r}+\beta_{1r}X_i)}{1+\sum_{s=1}^3\exp(\beta_{0s}+\beta_{1s}X_i)} \] and for the reference category widowed \(c+1=4\)

\[ \pi_{i4} = P(Y_i = 4) = 1-\sum_{s=1}^3\pi_{is} = \frac{1}{1+\sum_{s=1}^3\exp(\beta_{0s}+\beta_{1s}X_i)}, \] where \(X\) is the age.

ii) Interpretation of the parameters as certain ratio odds:

From the two equations above it follows that

\[ \frac{\pi_{ir}}{\pi_{i4}} = \exp(\beta_{0r} + \beta_{1r}X_i), \] which is the odds of category \(r\) occuring relative to the reference category widowed (c+1). If age is increased one unit the relative risk will increase by \(\exp(\beta_{1r}).\)

iii) \(\pi_{ir}/(1-\pi_{ir})\)

This interpretation can be valid for the ratio \(\pi_{ir}/(1-\pi_{ir})\) in logistic regression with binary response. However, if the response consists of more than two categories the probability \(\pi_{ir}\) and consequently the ratio admits a more complex expression. For \(r=1,2,3\) we have that \[ \frac{\pi_{ir}}{1-\pi_{ir}} = \frac{\exp(\beta_{0r}+ \beta_{1r}X_i)}{1+\sum_{s=1,s\neq r}^3\exp(\beta_{0s}+ \beta_{1s}X_i)} \] and for the reference category

\[ \frac{\pi_{i4}}{1-\pi_{i4}} = \frac{1}{1+\sum_{s=1}^3\exp(\beta_{0s}+ \beta_{1s}X_i)}. \]

Furthermore the probabilities are not necessarily monotonic functions of age and as a counterexample we see that in the plot of b).

iv) Testing the significance of a linear effect

anova(mod1,test="LRT", type="III")
## Analysis of Deviance Table (Type III tests: each term added last)
## 
## Model: 'multinomial', 'VGAMcategorical'
## 
## Link: 'multilogitlink'
## 
## Response: mstatus
## 
##     Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## age  3   1600.9     18156     8423.7 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since p-value is \(\approx 0,\) we conclude that the effect of age effect is significant.

Note that the test statistic is \(-2[l(\hat\beta_{null}) - (\hat\beta_{alternative})]\sim\chi^2_{6-3}.\)

b.

# predictions
newdata = data.frame(age = 16:88)
p1 = predictvglm(mod1, newdata = newdata,  type = "response")
df = cbind(newdata, p1)
dfm = melt(df, id.vars = "age", 
           variable.name = "mstatus", 
           value.name = "probabilities") # predictions df
# observed data
dobs = subset(marital.nz, select = c("age", "mstatus"))
ta = (table(dobs)) 
fr = t(apply(ta, 1, function(x) x/sum(x))) # calculate frequencies of each category for each age
dffr = data.frame(fr) 
dffr$age = as.numeric(rownames(dffr))
rownames(dffr) = c() # remove row names
colnames(dffr)[-5] = names(df)[-1] # set names which are changed from using table() function
dfobs = melt(dffr, id.vars = "age",
             variable.name = "mstatus", 
             value.name = "probabilities") # observed df
# ggplot
pl = ggplot(dfm, aes(x = age, y = probabilities, group = mstatus, colour = mstatus)) +
  geom_line(show.legend = T) +
  geom_point(data = dfobs, 
             aes(x = age, y = probabilities, group = mstatus, colour = mstatus, alpha = 0.5),
             show.legend = F) + 
  scale_alpha(guide = 'none') + 
  scale_color_brewer(palette="Dark2")
pl

The probability of Married is increasing monotonically up to age 55. Then it decreases monotonically while the probability of Widowed is increasing which is reasonable. In the left side of the plot for young ages the Divorce and Widowed categories have probability 0 which make sense. However, we observe that the probability of Married is quite high \(\approx 0.25\) and the probability of Single is relatively large.

c.

Fit models for degrees 1 to 4.

dgr = 1:4
mods = list()
for (i in seq_along(dgr)) { # fit models for polynomial orders 1 to 4
  mods[[i]] = vglm(mstatus ~ poly(age,dgr[i]), multinomial)
}
#lapply(mods, summary)

Predict for ages 16 to 88 and create data frames for plotting

preds = list()
for(i in seq_along(mods)){ # predict and create dfs for plotting
  df = cbind(newdata,predictvglm(mods[[i]], newdata = newdata, type = "response"))
  df$degree = rep(paste0("poly(age,",i, ")"), nrow(newdata))
  preds[[i]] = melt(df, id.vars = c("age","degree"), variable.name = "mstatus", 
                    value.name = "probabilities") 
}
df_all = do.call(rbind,preds) # concatenate dfs
ggplot(df_all, aes(x = age, y = probabilities, group = mstatus, colour = mstatus)) +
  geom_line() + facet_wrap(.~degree, nrow = 2)+
  scale_color_brewer(palette="Dark2") 

fn_aic = function(mod,d) -2*logLik(mod)+2*3*(d+1)
AICs = rep(NA,4)
for(i in seq_along(mods)) AICs[i] = fn_aic(mods[[i]],i)
names(AICs) = paste0("AIC degree ", 1:4)
AICs
## AIC degree 1 AIC degree 2 AIC degree 3 AIC degree 4 
##     6834.790     6583.123     6555.048     6552.665

AIC estimates the relative amount of information lost by a givem model relative to the true process that generated the data. The aim is to choose the model that minimize this information lost, that is the model with the lowest AIC value. It deals with both the risk of over and underfitting by penalizing the model complexity.

In the plot we observe that second order polynomial fit improves on the probabilities of young Married and Single categories. However, the probability of Single increases from age 75, which is not reasonable. A polynomial of order 3 improves the model fit by removing this effect on aged single people. The best (minimum) AIC value belongs to the polynomial of order 4, but it seems that overfits the data as for ages \(>80\) the probability of Married is increased and Widowed is decreased.

To summarise, polynomial of order 3 improves on the pathologies of first (linear) and second order polynomial models and its AIC value is very close to the order 4 polynomial (best AIC) which overfits the data.