Problem 1

a)

We have the following random effect model

\[ y_{ij} = \beta_0 + \beta_1 x +\gamma_{0i} + \gamma_{1i}x + \epsilon_{ij} \tag{1} \] where \(\boldsymbol{\gamma}_i = [\gamma_{0i} \,\,\, \gamma_{1i}]^T\) are binormally distributed with zero mean and variance matrix

\[ \begin{bmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{bmatrix} \] and \(\varepsilon_{ij}\) are iid \(N(0,\sigma^2)\) for \(i=1,\dots,m\) and \(j=1,\dots,n_i.\) The remaining terms of (1) can be written as vectors

\(\boldsymbol \beta=[\beta_0 \,\,\, \beta_1]^T,\) \(\mathbf{x}_{ij} = [1 \,\,\, x_{ij}]^T\) and \(\mathbf{u}_{ij} = [1 \,\,\, x_{ij}]^T.\)

For a specific cluster \(i\) equation (1) takes the following form \[ \mathbf{y}_i = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{U}_i\boldsymbol{\gamma}_i + \boldsymbol{\varepsilon}_i, \quad i=1,\cdots,m, \tag{2} \] where

\(\mathbf{y}_i = [y_{i1} \,\, \cdots \,\, y_{in_i}]^T,\) \(\mathbf{X}_i = [x_{i1}^T \,\, \cdots \,\, x_{in_i}^T]^T,\) \(\mathbf{U}_i = [u_{i1}^T \,\, \cdots \,\, u_{in_i}^T]^T\) and \(\boldsymbol{\varepsilon}_i = [\varepsilon_{i1} \,\, \cdots \,\, \varepsilon_{in_i}]^T.\)

Finally we can write the generic LMM equation

\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{U}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}, \tag{3} \] where

\(\mathbf{y} = [\mathbf{y}_{1} \,\, \cdots \,\, \mathbf{y}_{i} \,\, \cdots \,\, \mathbf{y}_{m}]^T,\) \(\boldsymbol{\varepsilon} = [\boldsymbol{\varepsilon}_{1} \,\, \cdots \,\, \boldsymbol{\varepsilon}_{i} \,\, \cdots \,\, \boldsymbol{\varepsilon}_{m}]^T,\) \(\boldsymbol{\gamma} = [\boldsymbol{\gamma}_{1} \,\, \cdots \,\, \boldsymbol{\gamma}_{i} \,\, \cdots \,\, \boldsymbol{\gamma}_{m}]^T\) and \(\mathbf{U} = \operatorname{blogdiag}({\mathbf{U}_1, \ldots \mathbf{U}_i, \ldots \mathbf{U}_m}).\)

b)

mylmm = function(y, x, group, REML = FALSE, start=c(1,1,0,1)) {
  library(Matrix)
  n = length(y)
  m = nlevels(group)
  #X = cbind(1,x)
  X = model.matrix(~1+x)
  Ul = list()
  u = unique(group)
  for (i in 1:m){
    Ul[[i]] = cbind(1,x[group==u[i]])
  } 
  U = bdiag(Ul)
  
  # Cov matrix
  Vfn = function(theta) {
    r2 = theta[3]
    tau01 = r2*sqrt(theta[1]*theta[2])
    # 2X2 matrix of tau parameters
    mm = matrix(c(theta[1], tau01, tau01, theta[2]), nrow = 2, byrow = T)
    mlist = list()
    for(i in 1:m){
      mlist[[i]] = mm
    }
    G = bdiag(mlist)
    
    R = theta[4]*diag(n)
    as.matrix(U %*% G %*% t(U) + R)
  }
  
  # estim beta using hat(cov)
  betahatfn = function(V) {
    W = solve(V)
    t_X.W.X = t(X) %*% W %*% X
    solve(t_X.W.X) %*% t(X) %*% W %*% y
  }
  # profile likelihood, ML estimator
  # if true REML
  lfn = function(theta) {
    V = Vfn(theta)
    W = solve(V)
    epsilonhat = y - X %*% betahatfn(V)
    l =
      -.5 * determinant(V)$modulus- .5 * t(epsilonhat) %*% W %*% epsilonhat
    if (REML)
      l = l - .5 * determinant(t(X) %*% W %*% X)$modulus
    l
  }
  opt = optim(start, lfn, method = "L-BFGS-B",control = list(fnscale = -1),
              lower = c(0,0,-1,0), upper = c(Inf,Inf,1,Inf))
  theta = opt$par
  betahat = betahatfn(Vfn(theta))
  W = solve(Vfn(theta))
  var_beta = solve(t(X) %*% W %*%X)
  sd_beta = sqrt(diag(var_beta))
  cor_beta = var_beta[1,2]/(sd_beta[1]*sd_beta[2])
  Fixed_ef = data.frame(Estimate = betahat, Std_Error = sd_beta)
  #list(theta = theta, beta = betahat, sd_beta = sd_beta, cor_beta = cor_beta)
  list(theta = theta, Fixed_ef =Fixed_ef,  cor_beta = cor_beta)

}

c)

library(lme4)

## Loading required package: Matrix

mymod = mylmm(y = sleepstudy$Reaction, x = sleepstudy$Days
              , group = sleepstudy$Subject, REML = FALSE, start=c(1,1,1,1))

rmod = lme4::lmer(Reaction ~ 1 + Days + (1+ Days|Subject), REML=FALSE, data = sleepstudy)
summary(rmod)

## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
##    Data: sleepstudy
## 
##      AIC      BIC   logLik deviance df.resid 
##   1763.9   1783.1   -876.0   1751.9      174 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9416 -0.4656  0.0289  0.4636  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 565.48   23.780       
##           Days         32.68    5.717   0.08
##  Residual             654.95   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  251.405      6.632  37.907
## Days          10.467      1.502   6.968
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

mymod

## $theta
## [1] 565.52507288  32.68247834   0.08131161 654.94128788
## 
## $Fixed_ef
##              Estimate Std_Error
## (Intercept) 251.40510  6.632318
## x            10.46729  1.502242
## 
## $cor_beta
## (Intercept) 
##  -0.1375578

REML produce unbiased estimates of variance and covariance parameters by removing all information from the mean estimator.

Problem 2

long <- read.csv("https://www.math.ntnu.no/emner/TMA4315/2020h/eliteserie.csv",
                 colClasses = c("factor","factor","factor","numeric"))

a)

library(glmmTMB)
mod <- glmmTMB(goals ~ home + (1|attack) + (1|defence), poisson, data=long, REML=TRUE)

Given the random effects \(\gamma_{ai}, \gamma_{dj}\) and the covariate \(x_{k}\) the responses \(Y_{ijk}\) are conditionally independent and the conditional distribution \(f(y_{ijk}\mid \gamma_{ai}, \gamma_{dj})\) is Poisson.
The conditional mean \(\mu_{ijk}=\text{E}(Y_{ijk}\mid \gamma_{ai}, \gamma_{dj})\) is linked to the linear predictor \(\eta_{ij}= \beta_0+ \beta_1x_k + \gamma_{ai}+ \gamma_{dj}\) through the link function \[\eta_{ijk}=g(\mu_{ijk}),\] or \[\mu_{ij}=h(\eta_{ij})\] where \(h^{-1}=g\). Here we use the log link function.
The random effects \(\gamma_{ai}, \gamma_{dj}\), \(i=1,\ldots,m,\) \(j=1,\ldots,m\), \(i\neq j\) are independent and identically distributed \[ \gamma_{ai} \sim N(0, \tau_a) \text{ and } \gamma_{dj} \sim N(0, \tau_d).\] These random effects represent the attack and defence strengths of each team. Note that \(\gamma_{ai}\) and \(\gamma_{ai}\) are assumed to be independent.

The model \(y_{ijk} = \exp(\beta_0 + \beta_1x_{k} + \gamma_{ai} + \gamma_{di})\) gives the expected number of goals that the “attack” team \(i\) will score if it plays against “defence” team \(j.\) The dummy variable \(x_{k}\) represent the factor home having 2 levels (\(k=1,2\)) and it is equal to 1 for \(k=2\) meaning that the “attack” team \(i\) has home field advantage or 0 if \(k=1\) meaning that the “defence” team \(j\) has home field advantage.

b)

summary(mod)

##  Family: poisson  ( log )
## Formula:          goals ~ home + (1 | attack) + (1 | defence)
## Data: long
## 
##      AIC      BIC   logLik deviance df.resid 
##   1147.2   1163.1   -569.6   1139.2      382 
## 
## Random effects:
## 
## Conditional model:
##  Groups  Name        Variance Std.Dev.
##  attack  (Intercept) 0.007478 0.08647 
##  defence (Intercept) 0.016383 0.12800 
## Number of obs: 384, groups:  attack, 16; defence, 16
## 
## Conditional model:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.12421    0.07809   1.591    0.112    
## homeyes      0.40716    0.08745   4.656 3.22e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The estimate of home advantage is 0.41 which means that the home team scores on average \(\exp(0.41) = 1.5\) more goals, that is 50% more goals, regardless of team strengths which is reasonable for a footbal match. We observe that the variance of defence strengths (\(\tau_a = 0.08647\)) vary more than the attack strengths (\(\tau_d = 0.128\)).

To make interpretation easier we can define the strength which is the difference between attack and diffence

rf = ranef(mod)
m = data.frame(attack = unlist(rf$cond$attack), defence = unlist(rf$cond$defence))
rownames(m) = rownames(rf$cond$attack)
m$strength = m$attack-m$defence
m

##                          attack      defence     strength
## BodoeGlimt         -0.036781062 -0.042616090  0.005835029
## Brann               0.012026209 -0.123934761  0.135960970
## Haugesund           0.011223106 -0.061931278  0.073154385
## Kristiansund       -0.011367328  0.008112432 -0.019479760
## Lillestroem        -0.049915996  0.030699257 -0.080615253
## Molde               0.078390643 -0.036630979  0.115021622
## Odd                 0.003654179 -0.052013600  0.055667779
## Ranheim_TF          0.023375599  0.062209734 -0.038834135
## Rosenborg           0.050622609 -0.152631173  0.203253783
## Sandefjord_Fotball -0.058333079  0.133164228 -0.191497307
## Sarpsborg08         0.026946364  0.006574064  0.020372301
## Stabaek            -0.026801293  0.085376126 -0.112177420
## Start              -0.060500163  0.081958112 -0.142458276
## Stroemsgodset       0.024556017  0.040486666 -0.015930650
## Tromsoe             0.005756700 -0.009852817  0.015609517
## Vaalerenga          0.007147494  0.031030079 -0.023882585

The attack and defence strengths of teams are closely related to the ranking of the teams so far as you can see in e).

A team of average attack strength has \(\gamma_a=0\) and similarly an defence attack team has \(\gamma_d=0.\) Since the number of goals given the random effects \(\gamma_a, \gamma_d\) follows Poisson distribution the expection and variance of number of goals are equal

\[ E(y_{ij}\mid \gamma_{a,i}=0, \gamma_{d,i}=0) = Var(y_{ij}\mid \gamma_{a,i}=0, \gamma_{d,i}=0) = \exp(\beta_0+\beta_1X_{home}) \]

betas = fixef(mod)$cond

goals_homeyes = exp(sum(betas))
goals_homeno = exp(betas[1])

c(goals_homeyes = as.numeric(goals_homeyes), goals_homeno = as.numeric(goals_homeno))

## goals_homeyes  goals_homeno 
##      1.701265      1.132258

c)

First, it holds that if a random variable \(Y=\ln(X)\) follows a normal distribution with mean 0 and variance \(\tau^2,\) then X follows a Lognormal distribution with mean \(\exp(\frac{\tau^2}{2})\) and variance \((\exp(\tau^2)-1)\exp(\tau^2).\)

Since the random variables \(\gamma_{a}\sim N(0,\tau^2_a)\) and \(\gamma_d\sim N(0,\tau^2_d)\) are independent, the sum is normally distributed with mean equal to the sum of the corresponding means and variance equal to the sum of the corresponding variances. We write that

\[ \gamma_a+\gamma_d \sim N(0,\tau^2), \] where \(\tau^2 = \tau_a^2 + \tau_d^2.\)

To derive \(E(y_{ij})\) we use the low of total expectation

\[ \begin{align} E(y_{ij}) &= E[E(y_{ij}\mid\gamma_{a,i},\gamma_{d,i})]\\ &= E(\exp(\beta_0+\beta_1X_{home}+\gamma_{a,i} + \gamma_{d,i})) \\ &= \exp(\beta_0+\beta_1X_{home})E(\exp(\gamma_{a,i}+\gamma_{d,i})) \\ &= \exp(\beta_0+\beta_1X_{home})\exp(\frac{\tau^2}{2}) \\ \end{align} \]

To derive \(\operatorname{Var} (y_{ij})\) we use the low of total variance

\[ \begin{align} \operatorname{Var}(y_{ij}) &= E[\operatorname{Var}(y_{ij}\mid\gamma_{a,i},\gamma_{d,i})] + \operatorname{Var}[E(y_{ij}\mid\gamma_{a,i},\gamma_{d,i})]\\ &= E[\exp(\beta_0+\beta_1X_{home}+\gamma_{a,i} + \gamma_{d,i})] + \operatorname{Var}[\exp(\beta_0+\beta_1X_{home}+\gamma_{a,i} + \gamma_{d,i})]\\ &= E(y_{ij}) + \exp[2(\beta_0+\beta_1X_{home})]\operatorname{Var}[\exp(\gamma_{a,i}+\gamma_{d,i})]\\ &= E(y_{ij}) + \exp[2(\beta_0+\beta_1X_{home})](\exp(\tau^2)-1)\exp(\tau^2)\\ \end{align} \]

(tau2_attack = VarCorr(mod)$cond$attack[1])

## [1] 0.007477577

(tau2_defence= VarCorr(mod)$cond$defence[1])

## [1] 0.01638331

tau2 = tau2_attack + tau2_defence

marginal expectation and variance of goals for home field team

marg_exp_homeyes = exp(sum(betas) + 0.5*(tau2) )
marg_var_homeyes = marg_exp_homeyes + exp(2*sum(betas)) * (exp(tau2)-1) * exp(tau2)
c(marg_exp_homeyes=as.vector(marg_exp_homeyes), marg_var_homeyes = as.vector(marg_var_homeyes))

## marg_exp_homeyes marg_var_homeyes 
##         1.721683         1.793262

marginal expectation and variance of goals for away team

marg_exp_homeno = exp(betas[1]+ 0.5*(tau2) )
marg_var_homeno = marg_exp_homeno +  exp(2*betas[1]) * (exp(tau2)-1) * exp(tau2)
c(marg_exp_homeno=as.numeric(marg_exp_homeno), marg_var_homeno = as.numeric(marg_var_homeno))

## marg_exp_homeno marg_var_homeno 
##        1.145847        1.177552

The total variance of the number of goals is \(\operatorname{Var}(y_{ij})\)
the variance of the game itsels is \(E[\operatorname{Var}(y_{ij}\mid\gamma_{a,i},\gamma_{d,i})]\)
the variance of difference between strengths of teams is [E(y_{ij}{a,i},{d,i})]

For the home team we have

var_game_prop_hy = marg_exp_homeyes/marg_var_homeyes
var_strength_prop_hy = exp(2*sum(betas)) * (exp(tau2)-1) * exp(tau2)/marg_var_homeyes
c(var_game_prop_hy =as.numeric(var_game_prop_hy),var_strength_prop_hy =as.numeric(var_strength_prop_hy))

##     var_game_prop_hy var_strength_prop_hy 
##           0.96008456           0.03991544

For the away team we have

var_game_prop_hn = marg_exp_homeno/marg_var_homeno
var_strength_prop_hn = exp(2*betas[1]) * (exp(tau2)-1) * exp(tau2)/marg_var_homeno
c(var_game_prop_hn =as.numeric(var_game_prop_hn),var_strength_prop_hn =as.numeric(var_strength_prop_hn))

##     var_game_prop_hn var_strength_prop_hn 
##           0.97307527           0.02692473

d)

The test of significance of random effects \(\gamma_{a}\sim N(0.\tau_a^2)\) and \(\gamma_{d}\sim N(0.\tau_d^2)\) is equivalent to testing if the variance of each random effect \(\tau_a^2\) and \(\tau_d^2\) is 0 or greater than 0. We construct the following hypothesis testing

\[ H_o:\tau_a^2 = 0 \text{ vs } H_a:\tau_a^2 > 0 \] for testing if the random effect of attack is significant and the same holds for defence as well.

no_at_mod = glmmTMB(goals ~ home + (1|defence), poisson, data=long, REML=TRUE)
LRT_attack = as.numeric(2*(logLik(mod)-logLik(no_at_mod)))

pval = 0.5 * pchisq(LRT_attack, df = 1, lower.tail = FALSE) + 
  0.5 * pchisq(LRT_attack, df = 0 , lower.tail = FALSE) 
pval

## [1] 0.2587558

no_de_mod = glmmTMB(goals ~ home + (1|attack), poisson, data=long, REML=TRUE)
LRT_defence = as.numeric(2*(logLik(mod)-logLik(no_de_mod)))

pval = 0.5 * pchisq(LRT_defence, df = 1, lower.tail = FALSE) + 
  0.5 * pchisq(LRT_defence, df = 0, lower.tail = FALSE) 
pval

## [1] 0.09843786

To test the significance of of the fixed effect parameter for the home field advantage we must use the ordinary instead of the restricted likelihoods as the restricted likelihoods, at least in the LMM setting, involves different subsets of the data and are thus not comparable.

mod_ml = glmmTMB(goals ~ (1|attack) + (1|defence), poisson, data=long, REML=FALSE)
no_home_mod = glmmTMB(goals ~ (1|attack) + (1|defence), poisson, data=long, REML=FALSE)
LRT_home = as.numeric(2*(logLik(mod_ml)-logLik(no_home_mod)))
pchisq(LRT_home, df = 1, lower.tail = F)

## [1] 1

e)

rankings = function(data) {
  data = data[complete.cases(data),]
  tmp =
    data.frame(
      team = levels(data$attack),
      points = 0,
      goalsscored = 0,
      goalsconceeded = 0,
      scorediff = 0
    )
  # add points to each team going through each match separately
  for (i in seq(1, nrow(data), by = 2)) {
    j = data[i,]$attack
    k = data[i,]$defence
    if (data[i,]$goals > data[i + 1,]$goals)
      tmp[tmp$team == j,]$points =
      tmp[tmp$team == j,]$points + 3 # team j won
    else if (data[i,]$goals < data[i + 1,]$goals)
      tmp[tmp$team == k,]$points =
      tmp[tmp$team == k,]$points + 3 # team k won
    else {
      tmp[tmp$team == j,]$points =
        tmp[tmp$team == j,]$points + 1 # draw
      tmp[tmp$team == k,]$points =
        tmp[tmp$team == k,]$points + 1
    }
  }
  # compute goal differences and goals scored
  for (j in levels(data$attack)) {
    tmp[tmp$team == j, ]$goalsscored =
      sum(data[data$attack == j,]$goals)
    tmp[tmp$team == j, ]$goalsconceeded =
      sum(data[data$defence == j,]$goals)
  }
  tmp$scorediff = tmp$goalsscored - tmp$goalsconceeded
  # compute the rankings
  tmp$rank = data.table::frankv(
    tmp,
    col = c("points", "scorediff", "goalsscored"),
    ties.method = "random",
    order = -1
  )
  # return everything
  tmp[order(tmp$points, decreasing = T),]
}
rnk  = rankings(long)
rownames(rnk) = NULL
rnk

##                  team points goalsscored goalsconceeded scorediff rank
## 1           Rosenborg     52          43             20        23    1
## 2               Brann     48          36             23        13    2
## 3               Molde     43          48             30        18    3
## 4           Haugesund     41          36             28         8    4
## 5          Ranheim_TF     38          38             40        -2    5
## 6          Vaalerenga     36          35             37        -2    6
## 7                 Odd     34          35             29         6    7
## 8             Tromsoe     33          35             33         2    8
## 9         Sarpsborg08     32          39             34         5    9
## 10       Kristiansund     31          32             35        -3   10
## 11         BodoeGlimt     27          28             30        -2   11
## 12      Stroemsgodset     26          38             38         0   12
## 13        Lillestroem     25          26             37       -11   13
## 14            Stabaek     23          29             43       -14   14
## 15              Start     23          24             42       -18   15
## 16 Sandefjord_Fotball     15          24             47       -23   16

f)

Simulating the whole series

lambdas = predict(mod, newdata = long, type = "response")
sim_rank = matrix(NA,nrow = 1000, ncol = 16)

for (i in 1:1000) {
  res = long
  res$goals = rpois(length(lambdas), lambda = lambdas)
  r = rankings(res)
  sim_rank[i,] = as.character(r$team)
}

nm = levels(long$attack) # team names
lnm = length(nm) # total number of teams
probs = matrix(0, nrow = lnm, ncol = lnm) # matrix to store the probabilities of rankings
rownames(probs) = nm
colnames(probs) = paste0(1:lnm,"-place")

for (j in 1:ncol(sim_rank)) {
  prob = table(sim_rank[,j])/1000 # calculate probabilities of j-th place
  nms = names(prob) # team names of corresponding probabilities
  for(i in 1:length(prob)){
    ind = which(nm == nms[i]) # find row of i-th name
    probs[ind,j] = as.numeric(prob[i]) # store probability
  }
}

> probs
                   1-place 2-place 3-place 4-place 5-place 6-place 7-place 8-place 9-place
BodoeGlimt           0.059   0.071   0.069   0.064   0.076   0.061   0.058   0.072   0.061
Brann                0.127   0.101   0.112   0.094   0.076   0.068   0.067   0.060   0.064
Haugesund            0.085   0.082   0.082   0.089   0.074   0.067   0.086   0.073   0.056
Kristiansund         0.052   0.054   0.076   0.067   0.052   0.050   0.069   0.079   0.059
Lillestroem          0.031   0.030   0.043   0.063   0.044   0.074   0.069   0.063   0.055
Molde                0.117   0.106   0.084   0.082   0.096   0.079   0.066   0.071   0.063
Odd                  0.070   0.091   0.091   0.080   0.086   0.070   0.057   0.077   0.065
Ranheim_TF           0.039   0.048   0.057   0.048   0.057   0.070   0.063   0.046   0.094
Rosenborg            0.182   0.132   0.087   0.105   0.083   0.080   0.067   0.051   0.048
Sandefjord_Fotball   0.010   0.015   0.017   0.020   0.033   0.030   0.041   0.036   0.059
Sarpsborg08          0.054   0.058   0.072   0.060   0.075   0.073   0.068   0.072   0.078
Stabaek              0.019   0.021   0.032   0.029   0.043   0.048   0.054   0.050   0.053
Start                0.014   0.024   0.019   0.035   0.041   0.043   0.039   0.046   0.059
Stroemsgodset        0.040   0.066   0.047   0.048   0.049   0.062   0.064   0.060   0.056
Tromsoe              0.063   0.057   0.060   0.055   0.055   0.063   0.073   0.074   0.071
Vaalerenga           0.038   0.044   0.052   0.061   0.060   0.062   0.059   0.070   0.059
                   10-place 11-place 12-place 13-place 14-place 15-place 16-place
BodoeGlimt            0.066    0.081    0.056    0.060    0.053    0.051    0.042
Brann                 0.048    0.053    0.042    0.027    0.028    0.015    0.018
Haugesund             0.052    0.056    0.053    0.043    0.051    0.030    0.021
Kristiansund          0.060    0.062    0.075    0.062    0.080    0.058    0.045
Lillestroem           0.072    0.059    0.076    0.082    0.078    0.075    0.086
Molde                 0.039    0.035    0.053    0.036    0.034    0.024    0.015
Odd                   0.056    0.058    0.043    0.054    0.035    0.039    0.028
Ranheim_TF            0.066    0.074    0.077    0.083    0.061    0.066    0.051
Rosenborg             0.027    0.038    0.026    0.020    0.024    0.021    0.009
Sandefjord_Fotball    0.078    0.055    0.083    0.082    0.111    0.128    0.202
Sarpsborg08           0.059    0.064    0.057    0.065    0.053    0.052    0.040
Stabaek               0.097    0.063    0.071    0.085    0.090    0.113    0.132
Start                 0.073    0.082    0.081    0.095    0.078    0.120    0.151
Stroemsgodset         0.062    0.074    0.070    0.078    0.089    0.076    0.059
Tromsoe               0.082    0.072    0.067    0.057    0.059    0.043    0.049
Vaalerenga            0.063    0.074    0.070    0.071    0.076    0.089    0.052
>

The expected rankings for the whole series are

expect_ranking = matrix(0, nrow = nrow(probs))
rownames(expect_ranking) = rownames(probs)
for(i in 1:nrow(probs)){
  expect_ranking[i,] = sum(1:16*probs[i,])
}
t(expect_ranking)

##      BodoeGlimt Brann Haugesund Kristiansund Lillestroem Molde   Odd Ranheim_TF
## [1,]      8.123 6.177     7.144         8.56       9.558 6.415 7.234      8.995
##      Rosenborg Sandefjord_Fotball Sarpsborg08 Stabaek  Start Stroemsgodset
## [1,]     5.386             11.775       8.178  10.718 11.041         9.172
##      Tromsoe Vaalerenga
## [1,]   8.384       9.14

Simulating the remaining matches

lambdas = predict(mod, newdata = long[!complete.cases(long),], type = "response")
sim_rank = matrix(NA,nrow = 1000, ncol = 16)

for (i in 1:1000) {
  res = long
  res$goals[!complete.cases(res$goals)] = rpois(length(lambdas), lambda = lambdas)
  r = rankings(res)
  sim_rank[i,] = as.character(r$team)
}

nm = levels(long$attack) # team names
lnm = length(nm) # total number of teams
probs = matrix(0, nrow = lnm, ncol = lnm) # matrix to store the probabilities of rankings
rownames(probs) = nm
colnames(probs) = paste0(1:lnm,"-place")

for (j in 1:ncol(sim_rank)) {
  prob = table(sim_rank[,j])/1000 # calculate probabilities of j-th place
  nms = names(prob) # team names of corresponding probabilities
  for(i in 1:length(prob)){
    ind = which(nm == nms[i]) # find row of i-th name
    probs[ind,j] = as.numeric(prob[i]) # store probability
  }
}

> probs
                   1-place 2-place 3-place 4-place 5-place 6-place 7-place 8-place 9-place
BodoeGlimt           0.000   0.000   0.000   0.000   0.001   0.003   0.014   0.034   0.080
Brann                0.252   0.669   0.073   0.006   0.000   0.000   0.000   0.000   0.000
Haugesund            0.000   0.031   0.371   0.368   0.161   0.049   0.014   0.005   0.001
Kristiansund         0.000   0.000   0.000   0.001   0.025   0.091   0.117   0.173   0.210
Lillestroem          0.000   0.000   0.000   0.000   0.000   0.000   0.002   0.006   0.023
Molde                0.006   0.058   0.486   0.346   0.082   0.019   0.000   0.003   0.000
Odd                  0.000   0.000   0.003   0.026   0.114   0.188   0.225   0.205   0.128
Ranheim_TF           0.000   0.004   0.037   0.186   0.345   0.202   0.125   0.066   0.024
Rosenborg            0.742   0.238   0.019   0.001   0.000   0.000   0.000   0.000   0.000
Sandefjord_Fotball   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
Sarpsborg08          0.000   0.000   0.001   0.004   0.032   0.070   0.116   0.157   0.208
Stabaek              0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.002   0.009
Start                0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.002   0.004
Stroemsgodset        0.000   0.000   0.000   0.000   0.000   0.001   0.004   0.026   0.043
Tromsoe              0.000   0.000   0.001   0.010   0.054   0.131   0.175   0.185   0.180
Vaalerenga           0.000   0.000   0.009   0.052   0.186   0.246   0.208   0.136   0.090
                   10-place 11-place 12-place 13-place 14-place 15-place 16-place
BodoeGlimt            0.149    0.247    0.238    0.130    0.077    0.027    0.000
Brann                 0.000    0.000    0.000    0.000    0.000    0.000    0.000
Haugesund             0.000    0.000    0.000    0.000    0.000    0.000    0.000
Kristiansund          0.186    0.126    0.047    0.020    0.003    0.001    0.000
Lillestroem           0.050    0.100    0.169    0.271    0.233    0.142    0.004
Molde                 0.000    0.000    0.000    0.000    0.000    0.000    0.000
Odd                   0.080    0.024    0.005    0.002    0.000    0.000    0.000
Ranheim_TF            0.010    0.001    0.000    0.000    0.000    0.000    0.000
Rosenborg             0.000    0.000    0.000    0.000    0.000    0.000    0.000
Sandefjord_Fotball    0.000    0.000    0.000    0.001    0.009    0.067    0.923
Sarpsborg08           0.221    0.120    0.053    0.016    0.002    0.000    0.000
Stabaek               0.022    0.084    0.131    0.213    0.283    0.241    0.015
Start                 0.009    0.037    0.063    0.154    0.237    0.437    0.057
Stroemsgodset         0.088    0.152    0.261    0.187    0.152    0.085    0.001
Tromsoe               0.139    0.085    0.030    0.006    0.004    0.000    0.000
Vaalerenga            0.046    0.024    0.003    0.000    0.000    0.000    0.000

The expected rankings for the remaining series are

expect_ranking = matrix(0, nrow = nrow(probs))
rownames(expect_ranking) = rownames(probs)
for(i in 1:nrow(probs)){
  expect_ranking[i,] = sum(1:16*probs[i,])
}
t(expect_ranking)

##      BodoeGlimt Brann Haugesund Kristiansund Lillestroem Molde   Odd Ranheim_TF
## [1,]      8.123 6.177     7.144         8.56       9.558 6.415 7.234      8.995
##      Rosenborg Sandefjord_Fotball Sarpsborg08 Stabaek  Start Stroemsgodset
## [1,]     5.386             11.775       8.178  10.718 11.041         9.172
##      Tromsoe Vaalerenga
## [1,]   8.384       9.14

g)

library(ggplot2)
nm = levels(long$attack)
pos = matrix(NA, nrow = nrow(sim_rank), ncol = ncol(sim_rank))
for(i in 1:nrow(pos)){
  pos[i,] =  match(nm, sim_rank[i,])
}

pos = data.frame(pos)
colnames(pos)  = nm
pos_plot = tidyr::gather(pos, key = "team", value = "pos")
ggplot(data = pos_plot, aes(pos)) + geom_bar() + facet_wrap(~team)

We observe that the uncertainty regarding the first and last ranking positions is smaller compared to the in between rankings.

exp_pos  = colMeans(pos)
sd_pos = apply(pos,2,sd)/sqrt(1000)
dfplot = data.frame(team = nm, exp_pos = exp_pos, sd_pos = sd_pos, strength = m$strength)
bs = lm(exp_pos~strength, data = dfplot)$coefficients
pd <- position_dodge(0.1) 
ggplot(dfplot, aes(x=strength, y=exp_pos, colour = team)) + 
  geom_point()+
  geom_errorbar(aes(ymin=exp_pos-1.96*sd_pos, ymax=exp_pos+1.96*sd_pos)) +
  geom_abline(intercept = bs[1] ,slope = bs[2], colour = "grey")

Unsurprisingly, in the plot it seems that there is a strong relation between the expected ranking and team strengths. The relationship between the estimated expected rankings based on simulation of the whole series and the difference between attack and defence strength of each team does not seem to be completely montonic but this may be because of Monte carlo error so it remains somewhat unclear if the relationship is monotonic or not.

sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] glmmTMB_1.0.2.1 lme4_1.1-23     Matrix_1.2-18   reshape2_1.4.4 
## [5] ggplot2_3.3.2  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5        TMB_1.7.18        nloptr_1.2.2.2    pillar_1.4.6     
##  [5] compiler_4.0.3    plyr_1.8.6        tools_4.0.3       boot_1.3-25      
##  [9] digest_0.6.27     statmod_1.4.35    nlme_3.1-149      evaluate_0.14    
## [13] lifecycle_0.2.0   tibble_3.0.4      gtable_0.3.0      lattice_0.20-41  
## [17] pkgconfig_2.0.3   rlang_0.4.8       yaml_2.2.1        xfun_0.19        
## [21] withr_2.3.0       dplyr_1.0.2       stringr_1.4.0     knitr_1.30       
## [25] generics_0.1.0    vctrs_0.3.5       grid_4.0.3        tidyselect_1.1.0 
## [29] data.table_1.13.2 glue_1.4.2        R6_2.5.0          rmarkdown_2.5    
## [33] minqa_1.2.4       farver_2.0.3      tidyr_1.1.2       purrr_0.3.4      
## [37] magrittr_1.5      MASS_7.3-53       scales_1.1.1      ellipsis_0.3.1   
## [41] htmltools_0.5.0   splines_4.0.3     colorspace_2.0-0  labeling_0.4.2   
## [45] stringi_1.5.3     munsell_0.5.0     crayon_1.3.4

Obligatory Exercise 3-Solutions

TMA4315 Generalized linear models, autumn 2020

Michail Spitieris

Deadline: November 20

Problem 1

a)

b)

c)

Problem 2

a)

b)

c)

d)

e)

f)

Simulating the whole series

Simulating the remaining matches

g)