TMA4315 Generalized Linear Models

Compulsory exercise 2: Logistic- and Poisson-regression

Deadline: Friday, October 27, 2017 at 16.00

To be handed in on Blackboard.

Students may work on the assignment in groups of two or three, and are also encouraged to collaborate across groups. You do not have to use the same group as in compulsory 1 (maybe it is a positive learning experience NOT to use the same groups as for compulsory 1?). Each group needs to hand in an R Markdown document with answers to the questions as well as all source code. Hand in the file as an R Markdown file (.rmd) and a pdf file (.pdf). Include all code you have written for the new functions (or library) at the end of your file (hint: use eval = FALSE in the chunk options).

Please write the names of the group members and the group number you have on Blackboard on top of your document!

Guidance by the teaching assistant (Ingeborg) will be given Tuesdays after 9 in her office (Central Building 2, 12th floor, office 1224). You can either just show up, or send an email to Ingeborg at ingeborg.hem@ntnu.no to make sure she is available.

In this exercise you shall first use the built-in R function glm to perform logistic regression. Then you must create a function that calculates the regression coefficients for a Poisson regression (you do not have to create a myglm package, as only the estimates of \(\beta\) are needed, but a framework is provided for those who want to look at it).

Part 1: Logistic regression (4 points)

Use the built-in R-function glm for this part!

Wikipedia’s List of highest mountains (https://en.wikipedia.org/wiki/List_of_highest_mountains_on_Earth) lists 118 of the world’s highest mountains, along with some properties of each one, including the number of successful and failed attempts at reaching the summit as of 2004. In this problem, we will consider a data set consisting of the height (in meters), topographic prominence (also in meters), number of successful ascents and number of failed attempts for 113 of the mountains on the list. The mountains Mount Everest (height 8848, prominence 8848), Muztagh Ata, Ismoil Somoni Peak, and Jengish Chokusu/Tömür/Pk Pobeda are excluded from the data set because of incomplete data. In addition, the mountain Chogolisa (height 7665, prominence 1624) is removed from the data set; it will be used for prediction.

In the following, let \(y_i\) be the number of successful ascents, and let \(n_i\) be the total number of attempts (sum of successful and failed) of the \(i\)th mountain. Use a binary regression with logit link to model the probability of success. That is,

  1. Model for response: \(Y_i \sim \text{Bin}(n_i, \pi_i) \quad \text{for} \ i = 1,\ldots,113\)
  2. Linear predictor: \(\eta_i = \mathbf{x}_i^T\boldsymbol{\beta}\)
  3. Link function: \(\eta_i = \ln \left(\frac{\pi_i}{1-\pi_i}\right)\)

where \(\mathbf{x}_i\) is a \(p\)-dimensional column vector of covariates for observation \(i\), and \(\boldsymbol\beta\) is the vector of regression parameters.

a) (1 point)

  • Write down the log-likelihood function \(\ell(\boldsymbol\beta)\) for this model.
  • Explain in a few words what we do in practice to arrive at maximum likelihood estimates for the parameters \(\beta\).

b) (1 point)

Load the dataset:

filepath <- "https://www.math.ntnu.no/emner/TMA4315/2017h/mountains"
mount <- read.table(file = filepath, header = TRUE, col.names = c("height", 
    "prominence", "fail", "success"))

Note that some entries differ from the Wikipedia list.

Fit the model, with height and prominence as predictors. Hint: To fit this model using the glm function, you write

glm(cbind(success, fail) ~ height + prominence, data = mount, family = "binomial")
  • Interpret the model parameters (hint: odds).
  • Discuss their significance (hint: Wald or LRT).
  • Create a 95 % confidence interval for the height parameter. Let \((\hat{\beta}_L,\hat{\beta}_H)\) denote the limits of your interval.
  • If you calculate \((\exp(\hat{\beta}_L),\exp(\hat{\beta}_H))\), how can you interpret this confidence interval?
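As a sketch of the mechanics behind the last two bullets, a Wald interval for a coefficient can be computed by hand and exponentiated. The data below are hypothetical stand-in numbers, not the real mountain data (with the real data you would use the model fitted on mount with both height and prominence):

```r
# Hypothetical stand-in data for illustration only
toy <- data.frame(height  = c(8000, 8200, 8400, 8600),
                  success = c(30, 25, 10, 5),
                  fail    = c(10, 15, 20, 25))
mod <- glm(cbind(success, fail) ~ height, data = toy, family = "binomial")

# Wald 95% CI for the height coefficient: estimate +/- z_{0.975} * SE
est <- coef(summary(mod))["height", "Estimate"]
se  <- coef(summary(mod))["height", "Std. Error"]
ci  <- est + c(-1, 1) * qnorm(0.975) * se

# On the odds scale: multiplicative change in the odds per meter of height
exp(ci)
```

The interval agrees with confint.default(mod), which uses the same Wald construction.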

c) (1 point)

For each of the covariates (height and prominence), use ggplot to plot the deviance residuals as a function of the covariate. Describe what you see. Then use the model deviance to assess the model fit.

Now you shall plot the estimated probabilities as a function of both height and prominence. The procedure will be similar to what we did in the interactive lecture in the second week of module 3. Use geom_raster. The functions geom_contour and scale_fill_gradientn(colours = terrain.colors(10)) can make this graph easier to interpret. Only include estimated probabilities inside the data range. Comment briefly on the plot.
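A minimal sketch of the procedure, using the ggplot functions named above on a hypothetical stand-in data set (with the real data, replace toy by mount and reuse your fitted model): build a regular grid inside the data range, predict probabilities on it, and plot.

```r
library(ggplot2)

# Hypothetical stand-in data for illustration only
toy <- data.frame(height     = c(7200, 7600, 8000, 8400, 8800),
                  prominence = c( 400, 1200, 2000, 3000, 4000),
                  success    = c(40, 30, 20, 10, 5),
                  fail       = c( 5, 10, 15, 20, 30))
mod <- glm(cbind(success, fail) ~ height + prominence,
           data = toy, family = "binomial")

# Regular grid inside the data range, with fitted probabilities on it
grid <- expand.grid(
  height     = seq(min(toy$height),     max(toy$height),     length.out = 50),
  prominence = seq(min(toy$prominence), max(toy$prominence), length.out = 50))
grid$prob <- predict(mod, newdata = grid, type = "response")

ggplot(grid, aes(x = height, y = prominence, fill = prob)) +
  geom_raster() +
  geom_contour(aes(z = prob)) +
  scale_fill_gradientn(colours = terrain.colors(10))
```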

d) (1 point)

The height and prominence of Mount Everest are both equal to 8848 meters. Based on the fitted model, estimate the probability of successfully ascending Mount Everest, and construct a confidence interval for the probability (hint: use the asymptotic distribution of the MLE to find the asymptotic distribution of the linear predictor, then make a confidence interval for the linear predictor, and finally transform this interval (i.e., the lower and upper limits of the interval) to get an interval for the probability of success). Write down the mathematical formula for this confidence interval.
Is it reasonable to use this model to estimate the probability of ascending Mount Everest? Why/why not?
Do the same with Chogolisa (height 7665 and prominence 1624). The data from Chogolisa gave 2 failures and 20 successes. Comment briefly on the result.
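The transform-the-linear-predictor idea in the hint can be sketched as follows, again on hypothetical stand-in data (with the real data, use your fitted mountain model and the covariates of Mount Everest or Chogolisa):

```r
# Hypothetical stand-in data for illustration only
toy <- data.frame(height  = c(7200, 7600, 8000, 8400, 8800),
                  success = c(40, 30, 20, 10, 5),
                  fail    = c( 5, 10, 15, 20, 30))
mod <- glm(cbind(success, fail) ~ height, data = toy, family = "binomial")

# Predict on the linear-predictor (link) scale, with its standard error
new <- data.frame(height = 8848)
p   <- predict(mod, newdata = new, type = "link", se.fit = TRUE)

# 95% interval for eta, then mapped through the inverse logit
# pi = exp(eta) / (1 + exp(eta)) to get an interval for the probability
eta_ci <- p$fit + c(-1, 1) * qnorm(0.975) * p$se.fit
plogis(eta_ci)
```

Because the inverse logit is monotone, mapping the endpoints of the interval for \(\eta\) gives a valid interval for \(\pi\).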

Part 2: Poisson regression – Tippeligaen 2016 (6 points)

In Compulsory exercise 1 you made your own mylm package to handle Gaussian responses. In this exercise you will create a function to estimate regression parameters in a Poisson regression. The main difference from before is that it is now necessary to perform numerical optimization to find the parameter estimates. You can create the function in any way you want; if you wish, you may of course make an R package with only one function, and a framework can be found here: myglm.R. You do not have to create print, summary, plot or anova this time, only the myglm function is necessary.

Introduction

If you choose to create and build a package, follow the instructions from Exercise 1, but now use the myglm.R file from the link above.

Here are some tips and suggestions that can make the implementation of the function easier:

  1. Use the R function optim to find the maximum likelihood estimates for \(\boldsymbol\beta\). Note that optim minimizes by default, but you can either use the negative log-likelihood, or read under Details in the documentation for optim. You must create an R function for the log-likelihood and provide it to optim (the function can be included in the same file as the code for the package). Hint: we did that for Module 2, interactive session in week 2.
  2. If you use hessian = TRUE in optim, it will return the Hessian at the maximum likelihood estimate. Calculate the estimated covariance matrix based on this Hessian (you will not actually need these values, as only the estimates of \(\boldsymbol\beta\) are required, but it is good to know how).
  3. You might want to implement a function that evaluates the gradient of the log-likelihood and provide it to optim if you want to get the same results as the standard glm function (but this is not necessary).
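Tips 1 and 2 can be sketched as follows on simulated data (negloglik is a name chosen here, not prescribed by the exercise); minimizing the negative log-likelihood sidesteps the fact that optim minimizes by default:

```r
# Simulated Poisson-regression data: intercept + one covariate
set.seed(42)
X <- cbind(1, rnorm(100))
y <- rpois(100, exp(X %*% c(1, 0.5)))

# Negative Poisson log-likelihood up to an additive constant
# (the -log(y!) term does not depend on beta and can be dropped)
negloglik <- function(beta, y, X) {
  eta <- X %*% beta
  -sum(y * eta - exp(eta))
}

# optim minimizes negloglik; hessian = TRUE returns the Hessian at the optimum
fit <- optim(c(0, 0), negloglik, y = y, X = X,
             method = "BFGS", hessian = TRUE)

fit$par              # ML estimates of beta
solve(fit$hessian)   # inverse observed information = estimated covariance matrix
```

The estimates should agree closely with glm(y ~ X[, 2], family = "poisson").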

Load the data set in R:

filepath <- "https://www.math.ntnu.no/emner/TMA4315/2017h/tippeligaen2016.dat"
tippeliga <- read.table(file = filepath, header = TRUE, colClasses = c("character", 
    "character", "numeric", "numeric"))

The data set consists of four columns:

  • home: the name of the home team
  • away: the name of the away team
  • yh: the score of the home team
  • ya: the score of the away team

The data set contains results from all football matches of the 2016 (Norwegian) Tippeligaen. You shall use this dataset for the exercise, but the method works on all seasons/leagues. You can also analyse the ongoing Eliteserien (2017 season), but this is not a part of the exercise and shall not be included in your solution. See the Wikipedia article on Tippeligaen 2016 for information on the 2016 season.

Each row of the data corresponds to one played match, where the home team played on their home turf against the away team, who was visiting. Each team in the league faces all other teams twice during the season. Once as the home team, and once as the away team. Hence, a league contested by \(n\) teams will consist of \(n(n-1)\) matches. In our case, \(n=16\), and we have data on 240 matches.

Your task is to investigate whether the official winner of the season really was the best team, and to study the uncertainty of the final ranking. Specifically, for the 2016 Tippeliga, you should be able to give some answer to the question: How likely or unlikely were Rosenborg to become champions? You should begin by reading the attached article (Lee, 1997), which gives more background. However, the model we use in this exercise is a simplification of the model in the article.

Our model is as follows: For each game, we assume that the score (the number of goals) of the home team is independent of the score of the away team.

We assume that each team has a single parameter that measures its strength. We denote this strength parameter \(\beta_\text{A}\) for team A, and so on. (In the article by Lee (1997) they use two parameters per team, one offensive and one defensive. We use only one!)

For a match where the home team is A and the away team is B, the score (number of goals) for team A will have the Poisson distribution with mean \(\lambda_\text{A}\), where \[ \ln\lambda_\text{A}=\beta_0+\beta_\text{home} + \beta_\text{A} - \beta_\text{B}, \] and the score (number of goals) for team B will have the Poisson distribution with mean \(\lambda_\text{B}\), where \[ \ln\lambda_\text{B}=\beta_0-\beta_\text{A} + \beta_\text{B},\] where \(\beta_0\) is our intercept and \(\beta_{\text{home}}\) is a home advantage parameter; both are taken to be the same for all teams. If two teams have equal strength, i.e. \(\beta_\text{A}=\beta_\text{B}\), then \(\exp(\beta_0)\) will give the expected number of goals for the away team, and \(\exp(\beta_0+\beta_{\text{home}})\) will give the expected number of goals for the home team.

a) (1 point)

Is the assumption of independence between the goals made by the home and away teams reasonable?

Hint: See the attached article (Lee, 1997), and Wikipedia article on the chi-squared test. Useful concepts include contingency tables and Pearson’s \(\chi^2\) test. The function chisq.test in R can be used, but you can also implement it yourselves to get a better understanding of the test. Testing for independence in contingency tables will be covered in the lectures (probably Module 4, week 2).
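The mechanics of the hint can be sketched on hypothetical stand-in scores (with the real data, replace yh and ya by tippeliga$yh and tippeliga$ya): tabulate home score against away score and apply Pearson's \(\chi^2\) test to the contingency table.

```r
# Hypothetical stand-in scores for illustration only
set.seed(1)
yh <- rpois(240, 1.6)   # stand-in home goals
ya <- rpois(240, 1.2)   # stand-in away goals

# Contingency table of home score vs away score
tab <- table(home = yh, away = ya)

# Pearson's chi-squared test of independence; with sparse tails, R may warn
# that the chi-squared approximation is inaccurate (cells can then be merged)
chisq.test(tab)
```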

b) (1 point)

If a match has a winning team, the winner gets 3 points and the loser gets 0. If the match is a draw, both teams get 1 point each. Produce the final ranking for the season (see e.g. Wikipedia article on Tippeligaen 2016 to ensure you have done the ranking calculation correctly). Note that you need to calculate the goal differences as well.
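The point-allocation rule above can be sketched as a small helper (points_for is a hypothetical name chosen here); building the full table is then a matter of accumulating these points, and the goal differences, over all matches.

```r
# 3 points to the winner, 0 to the loser, 1 each for a draw
points_for <- function(goals_home, goals_away) {
  if (goals_home > goals_away)      c(home = 3, away = 0)
  else if (goals_home < goals_away) c(home = 0, away = 3)
  else                              c(home = 1, away = 1)
}

points_for(2, 1)   # home win
points_for(1, 1)   # draw
```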

c) (2 points)

Using Poisson regression, estimate the intercept, the home advantage and the strength parameter for each team. Produce a ranking based on estimated strengths and compare with the ranking from b). Discuss.

Hints: The response vector – the scores – has length \(2n(n-1)=480\) and contains the last two columns of tippeliga (yh and ya). From the formulas above, we see that the linear predictors are of the form \[\ln \text{E}(Y)=\beta_0+\beta_\text{home}x_\text{home}+\beta_\text{Aalesund}x_\text{Aalesund}+\cdots+\beta_\text{Stabaek}x_\text{Stabaek}.\] Let A be the home team and B the away team. If the score \(Y\) is for A, then \(x_\text{home}=1\), \(x_\text A=1\), \(x_\text B=-1\) and \(x_\text C=0\) for all other teams C. If the score \(Y\) is for B, then \(x_\text{home}=0\), \(x_\text A=-1\), \(x_\text B=1\) and \(x_\text C=0\) for all other teams C. Thus we have \(n+1+1=18\) covariates (including the intercept).

Instead of constructing \(18\) covariate vectors, you may find it easier to construct the \(480\times18\) design matrix X. Then you can specify the model as formula = goals ~ -1+X, where goals is the response vector, and -1 since R automatically adds an additional intercept term if you do not explicitly prohibit it in this way.

There is a complication: The \(16\) team covariate vectors are linearly dependent (they sum to the zero vector), and thus there are infinitely many parameter vectors giving the same predictor – also for the predictor maximizing the likelihood. You may choose any of them, but to get optim to work, the simplest is to force one of the parameters, say \(\beta_\text{Aalesund}\), to be zero, which is equivalent to removing the covariate \(x_\text{Aalesund}\), thus reducing the number of team covariates from \(16\) to \(15\). In effect, the strength parameter of Ålesund serves as a reference with value zero; stronger teams have larger strength parameters and weaker teams smaller ones.

To make interpretation easier, make sure to name your covariates (use colnames for the columns of X) according to the names of the teams, and e.g. HomeAdvantage for the home advantage covariate.
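The construction described above can be sketched for a hypothetical 3-team league (teams A, B, C, each pair meeting once); with the real data you would loop over the rows of tippeliga and use the 16 team names, dropping the reference team's column as discussed.

```r
teams   <- c("A", "B", "C")
matches <- data.frame(home = c("A", "B", "C"),
                      away = c("B", "C", "A"),
                      stringsAsFactors = FALSE)

n_obs <- 2 * nrow(matches)   # one score per team per match
X <- matrix(0, nrow = n_obs, ncol = 2 + length(teams),
            dimnames = list(NULL, c("Intercept", "HomeAdvantage", teams)))
X[, "Intercept"] <- 1

for (i in seq_len(nrow(matches))) {
  h <- matches$home[i]; a <- matches$away[i]
  # Row 2i - 1: the home team's score (x_home = 1, x_h = 1, x_a = -1)
  X[2 * i - 1, "HomeAdvantage"] <- 1
  X[2 * i - 1, h] <- 1;  X[2 * i - 1, a] <- -1
  # Row 2i: the away team's score (x_home = 0, x_h = -1, x_a = 1)
  X[2 * i, h] <- -1;     X[2 * i, a] <- 1
}

# Drop one team column (the reference team, strength fixed to zero)
# to remove the linear dependence
X <- X[, colnames(X) != "A"]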

You may want to compare your results with the results of the built-in function glm:

summary(glm(goals ~ -1 + X, family = "poisson"))

d) (2 points)

Finally, we want to investigate if the team that won the season was really the best team, by means of simulation instead of comparing estimated strength. To do this, we use the estimated strengths of each team and the intercept and the home advantage, and simulate \(1\,000\) seasons. For each season, produce the final ranking, and then study the simulated distribution of the final rank for each team. Compare with the final ranking of the (real) 2016 season and discuss.

Hints: The simulation may take some time, so you can do the simulation one time, and save the simulated results in a txt file or an R file (see write.table and saveRDS). Then you can load the simulated data into R when they are needed. If you want your simulations to be reproducible you may use set.seed(123) (or any other favorite number of yours) before you perform the simulations.
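Simulating a single match is the core step; a full season repeats it over all 240 fixtures, 1000 times. A sketch with hypothetical parameter values (beta0, beta_home and the two strengths below are made-up numbers, not estimates):

```r
# Hypothetical parameter values for illustration only
beta0     <- 0.1
beta_home <- 0.3
beta_A    <- 0.4    # strength of the home team A
beta_B    <- -0.2   # strength of the away team B

set.seed(123)
# Scores are drawn independently from the two Poisson distributions of the model
goals_home <- rpois(1, exp(beta0 + beta_home + beta_A - beta_B))
goals_away <- rpois(1, exp(beta0 - beta_A + beta_B))
c(home = goals_home, away = goals_away)
```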

The distributions of the final rank for each team can be visualized using histograms. Useful commands in R include the function melt from the reshape2 library, and the command facet_wrap(~ teams, scales = "free_x") helps with displaying the histograms together.