We have the following random effect model
\[ y_{ij} = \beta_0 + \beta_1 x +\gamma_{0i} + \gamma_{1i}x + \epsilon_{ij} \tag{1} \] where \(\boldsymbol{\gamma}_i = [\gamma_{0i} \,\,\, \gamma_{1i}]^T\) are binormally distributed with zero mean and variance matrix
\[ \begin{bmatrix} \tau_0^2 & \tau_{01} \\ \tau_{01} & \tau_1^2 \end{bmatrix} \] and \(\varepsilon_{ij}\) are iid \(N(0,\sigma^2)\) for \(i=1,\dots,m\) and \(j=1,\dots,n_i.\) The remaining terms of (1) can be written as vectors
\(\boldsymbol \beta=[\beta_0 \,\,\, \beta_1]^T,\) \(\mathbf{x}_{ij} = [1 \,\,\, x_{ij}]^T\) and \(\mathbf{u}_{ij} = [1 \,\,\, x_{ij}]^T.\)
For a specific cluster \(i\) equation (1) takes the following form \[ \mathbf{y}_i = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{U}_i\boldsymbol{\gamma}_i + \boldsymbol{\varepsilon}_i, \quad i=1,\cdots,m, \tag{2} \] where
\(\mathbf{y}_i = [y_{i1} \,\, \cdots \,\, y_{in_i}]^T,\) \(\mathbf{X}_i = [x_{i1}^T \,\, \cdots \,\, x_{in_i}^T]^T,\) \(\mathbf{U}_i = [u_{i1}^T \,\, \cdots \,\, u_{in_i}^T]^T\) and \(\boldsymbol{\varepsilon}_i = [\varepsilon_{i1} \,\, \cdots \,\, \varepsilon_{in_i}]^T.\)
Finally we can write the generic LMM equation
\[ \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \mathbf{U}\boldsymbol{\gamma} + \boldsymbol{\varepsilon}, \tag{3} \] where
\(\mathbf{y} = [\mathbf{y}_{1} \,\, \cdots \,\, \mathbf{y}_{i} \,\, \cdots \,\, \mathbf{y}_{m}]^T,\) \(\boldsymbol{\varepsilon} = [\boldsymbol{\varepsilon}_{1} \,\, \cdots \,\, \boldsymbol{\varepsilon}_{i} \,\, \cdots \,\, \boldsymbol{\varepsilon}_{m}]^T,\) \(\boldsymbol{\gamma} = [\boldsymbol{\gamma}_{1} \,\, \cdots \,\, \boldsymbol{\gamma}_{i} \,\, \cdots \,\, \boldsymbol{\gamma}_{m}]^T\) and \(\mathbf{U} = \operatorname{blogdiag}({\mathbf{U}_1, \ldots \mathbf{U}_i, \ldots \mathbf{U}_m}).\)
mylmm = function(y, x, group, REML = FALSE, start=c(1,1,0,1)) {
library(Matrix)
n = length(y)
m = nlevels(group)
#X = cbind(1,x)
X = model.matrix(~1+x)
Ul = list()
u = unique(group)
for (i in 1:m){
Ul[[i]] = cbind(1,x[group==u[i]])
}
U = bdiag(Ul)
# Cov matrix
Vfn = function(theta) {
r2 = theta[3]
tau01 = r2*sqrt(theta[1]*theta[2])
# 2X2 matrix of tau parameters
mm = matrix(c(theta[1], tau01, tau01, theta[2]), nrow = 2, byrow = T)
mlist = list()
for(i in 1:m){
mlist[[i]] = mm
}
G = bdiag(mlist)
R = theta[4]*diag(n)
as.matrix(U %*% G %*% t(U) + R)
}
# estim beta using hat(cov)
betahatfn = function(V) {
W = solve(V)
t_X.W.X = t(X) %*% W %*% X
solve(t_X.W.X) %*% t(X) %*% W %*% y
}
# profile likelihood, ML estimator
# if true REML
lfn = function(theta) {
V = Vfn(theta)
W = solve(V)
epsilonhat = y - X %*% betahatfn(V)
l =
-.5 * determinant(V)$modulus- .5 * t(epsilonhat) %*% W %*% epsilonhat
if (REML)
l = l - .5 * determinant(t(X) %*% W %*% X)$modulus
l
}
opt = optim(start, lfn, method = "L-BFGS-B",control = list(fnscale = -1),
lower = c(0,0,-1,0), upper = c(Inf,Inf,1,Inf))
theta = opt$par
betahat = betahatfn(Vfn(theta))
W = solve(Vfn(theta))
var_beta = solve(t(X) %*% W %*%X)
sd_beta = sqrt(diag(var_beta))
cor_beta = var_beta[1,2]/(sd_beta[1]*sd_beta[2])
Fixed_ef = data.frame(Estimate = betahat, Std_Error = sd_beta)
#list(theta = theta, beta = betahat, sd_beta = sd_beta, cor_beta = cor_beta)
list(theta = theta, Fixed_ef =Fixed_ef, cor_beta = cor_beta)
}
library(lme4)
## Loading required package: Matrix
mymod = mylmm(y = sleepstudy$Reaction, x = sleepstudy$Days
, group = sleepstudy$Subject, REML = FALSE, start=c(1,1,1,1))
rmod = lme4::lmer(Reaction ~ 1 + Days + (1+ Days|Subject), REML=FALSE, data = sleepstudy)
summary(rmod)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: Reaction ~ 1 + Days + (1 + Days | Subject)
## Data: sleepstudy
##
## AIC BIC logLik deviance df.resid
## 1763.9 1783.1 -876.0 1751.9 174
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.9416 -0.4656 0.0289 0.4636 5.1793
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## Subject (Intercept) 565.48 23.780
## Days 32.68 5.717 0.08
## Residual 654.95 25.592
## Number of obs: 180, groups: Subject, 18
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 251.405 6.632 37.907
## Days 10.467 1.502 6.968
##
## Correlation of Fixed Effects:
## (Intr)
## Days -0.138
mymod
## $theta
## [1] 565.52507288 32.68247834 0.08131161 654.94128788
##
## $Fixed_ef
## Estimate Std_Error
## (Intercept) 251.40510 6.632318
## x 10.46729 1.502242
##
## $cor_beta
## (Intercept)
## -0.1375578
long <- read.csv("https://www.math.ntnu.no/emner/TMA4315/2020h/eliteserie.csv",
colClasses = c("factor","factor","factor","numeric"))
library(glmmTMB)
mod <- glmmTMB(goals ~ home + (1|attack) + (1|defence), poisson, data=long, REML=TRUE)
Given the random effects \(\gamma_{ai}, \gamma_{dj}\) and the covariate \(x_{k}\) the responses \(Y_{ijk}\) are conditionally independent and the conditional distribution \(f(y_{ijk}\mid \gamma_{ai}, \gamma_{dj})\) is Poisson.
The conditional mean \(\mu_{ijk}=\text{E}(Y_{ijk}\mid \gamma_{ai}, \gamma_{dj})\) is linked to the linear predictor \(\eta_{ij}= \beta_0+ \beta_1x_k + \gamma_{ai}+ \gamma_{dj}\) through the link function \[\eta_{ijk}=g(\mu_{ijk}),\] or \[\mu_{ij}=h(\eta_{ij})\] where \(h^{-1}=g\). Here we use the log link function.
The random effects \(\gamma_{ai}, \gamma_{dj}\), \(i=1,\ldots,m,\) \(j=1,\ldots,m\), \(i\neq j\) are independent and identically distributed \[ \gamma_{ai} \sim N(0, \tau_a) \text{ and } \gamma_{dj} \sim N(0, \tau_d).\] These random effects represent the attack and defence strengths of each team. Note that \(\gamma_{ai}\) and \(\gamma_{ai}\) are assumed to be independent.
The model \(y_{ijk} = \exp(\beta_0 + \beta_1x_{k} + \gamma_{ai} + \gamma_{di})\) gives the expected number of goals that the “attack” team \(i\) will score if it plays against “defence” team \(j.\) The dummy variable \(x_{k}\) represent the factor home having 2 levels (\(k=1,2\)) and it is equal to 1 for \(k=2\) meaning that the “attack” team \(i\) has home field advantage or 0 if \(k=1\) meaning that the “defence” team \(j\) has home field advantage.
summary(mod)
## Family: poisson ( log )
## Formula: goals ~ home + (1 | attack) + (1 | defence)
## Data: long
##
## AIC BIC logLik deviance df.resid
## 1147.2 1163.1 -569.6 1139.2 382
##
## Random effects:
##
## Conditional model:
## Groups Name Variance Std.Dev.
## attack (Intercept) 0.007478 0.08647
## defence (Intercept) 0.016383 0.12800
## Number of obs: 384, groups: attack, 16; defence, 16
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.12421 0.07809 1.591 0.112
## homeyes 0.40716 0.08745 4.656 3.22e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The estimate of home advantage is 0.41 which means that the home team scores on average \(\exp(0.41) = 1.5\) more goals, that is 50% more goals, regardless of team strengths which is reasonable for a footbal match. We observe that the variance of defence strengths (\(\tau_a = 0.08647\)) vary more than the attack strengths (\(\tau_d = 0.128\)).
To make interpretation easier we can define the strength which is the difference between attack and diffence
rf = ranef(mod)
m = data.frame(attack = unlist(rf$cond$attack), defence = unlist(rf$cond$defence))
rownames(m) = rownames(rf$cond$attack)
m$strength = m$attack-m$defence
m
## attack defence strength
## BodoeGlimt -0.036781062 -0.042616090 0.005835029
## Brann 0.012026209 -0.123934761 0.135960970
## Haugesund 0.011223106 -0.061931278 0.073154385
## Kristiansund -0.011367328 0.008112432 -0.019479760
## Lillestroem -0.049915996 0.030699257 -0.080615253
## Molde 0.078390643 -0.036630979 0.115021622
## Odd 0.003654179 -0.052013600 0.055667779
## Ranheim_TF 0.023375599 0.062209734 -0.038834135
## Rosenborg 0.050622609 -0.152631173 0.203253783
## Sandefjord_Fotball -0.058333079 0.133164228 -0.191497307
## Sarpsborg08 0.026946364 0.006574064 0.020372301
## Stabaek -0.026801293 0.085376126 -0.112177420
## Start -0.060500163 0.081958112 -0.142458276
## Stroemsgodset 0.024556017 0.040486666 -0.015930650
## Tromsoe 0.005756700 -0.009852817 0.015609517
## Vaalerenga 0.007147494 0.031030079 -0.023882585
The attack and defence strengths of teams are closely related to the ranking of the teams so far as you can see in e).
A team of average attack strength has \(\gamma_a=0\) and similarly an defence attack team has \(\gamma_d=0.\) Since the number of goals given the random effects \(\gamma_a, \gamma_d\) follows Poisson distribution the expection and variance of number of goals are equal
\[ E(y_{ij}\mid \gamma_{a,i}=0, \gamma_{d,i}=0) = Var(y_{ij}\mid \gamma_{a,i}=0, \gamma_{d,i}=0) = \exp(\beta_0+\beta_1X_{home}) \]
betas = fixef(mod)$cond
goals_homeyes = exp(sum(betas))
goals_homeno = exp(betas[1])
c(goals_homeyes = as.numeric(goals_homeyes), goals_homeno = as.numeric(goals_homeno))
## goals_homeyes goals_homeno
## 1.701265 1.132258
First, it holds that if a random variable \(Y=\ln(X)\) follows a normal distribution with mean 0 and variance \(\tau^2,\) then X follows a Lognormal distribution with mean \(\exp(\frac{\tau^2}{2})\) and variance \((\exp(\tau^2)-1)\exp(\tau^2).\)
Since the random variables \(\gamma_{a}\sim N(0,\tau^2_a)\) and \(\gamma_d\sim N(0,\tau^2_d)\) are independent, the sum is normally distributed with mean equal to the sum of the corresponding means and variance equal to the sum of the corresponding variances. We write that
\[ \gamma_a+\gamma_d \sim N(0,\tau^2), \] where \(\tau^2 = \tau_a^2 + \tau_d^2.\)
To derive \(E(y_{ij})\) we use the low of total expectation
\[ \begin{align} E(y_{ij}) &= E[E(y_{ij}\mid\gamma_{a,i},\gamma_{d,i})]\\ &= E(\exp(\beta_0+\beta_1X_{home}+\gamma_{a,i} + \gamma_{d,i})) \\ &= \exp(\beta_0+\beta_1X_{home})E(\exp(\gamma_{a,i}+\gamma_{d,i})) \\ &= \exp(\beta_0+\beta_1X_{home})\exp(\frac{\tau^2}{2}) \\ \end{align} \]
To derive \(\operatorname{Var} (y_{ij})\) we use the low of total variance
\[ \begin{align} \operatorname{Var}(y_{ij}) &= E[\operatorname{Var}(y_{ij}\mid\gamma_{a,i},\gamma_{d,i})] + \operatorname{Var}[E(y_{ij}\mid\gamma_{a,i},\gamma_{d,i})]\\ &= E[\exp(\beta_0+\beta_1X_{home}+\gamma_{a,i} + \gamma_{d,i})] + \operatorname{Var}[\exp(\beta_0+\beta_1X_{home}+\gamma_{a,i} + \gamma_{d,i})]\\ &= E(y_{ij}) + \exp[2(\beta_0+\beta_1X_{home})]\operatorname{Var}[\exp(\gamma_{a,i}+\gamma_{d,i})]\\ &= E(y_{ij}) + \exp[2(\beta_0+\beta_1X_{home})](\exp(\tau^2)-1)\exp(\tau^2)\\ \end{align} \]
(tau2_attack = VarCorr(mod)$cond$attack[1])
## [1] 0.007477577
(tau2_defence= VarCorr(mod)$cond$defence[1])
## [1] 0.01638331
tau2 = tau2_attack + tau2_defence
marg_exp_homeyes = exp(sum(betas) + 0.5*(tau2) )
marg_var_homeyes = marg_exp_homeyes + exp(2*sum(betas)) * (exp(tau2)-1) * exp(tau2)
c(marg_exp_homeyes=as.vector(marg_exp_homeyes), marg_var_homeyes = as.vector(marg_var_homeyes))
## marg_exp_homeyes marg_var_homeyes
## 1.721683 1.793262
marg_exp_homeno = exp(betas[1]+ 0.5*(tau2) )
marg_var_homeno = marg_exp_homeno + exp(2*betas[1]) * (exp(tau2)-1) * exp(tau2)
c(marg_exp_homeno=as.numeric(marg_exp_homeno), marg_var_homeno = as.numeric(marg_var_homeno))
## marg_exp_homeno marg_var_homeno
## 1.145847 1.177552
For the home team we have
var_game_prop_hy = marg_exp_homeyes/marg_var_homeyes
var_strength_prop_hy = exp(2*sum(betas)) * (exp(tau2)-1) * exp(tau2)/marg_var_homeyes
c(var_game_prop_hy =as.numeric(var_game_prop_hy),var_strength_prop_hy =as.numeric(var_strength_prop_hy))
## var_game_prop_hy var_strength_prop_hy
## 0.96008456 0.03991544
For the away team we have
var_game_prop_hn = marg_exp_homeno/marg_var_homeno
var_strength_prop_hn = exp(2*betas[1]) * (exp(tau2)-1) * exp(tau2)/marg_var_homeno
c(var_game_prop_hn =as.numeric(var_game_prop_hn),var_strength_prop_hn =as.numeric(var_strength_prop_hn))
## var_game_prop_hn var_strength_prop_hn
## 0.97307527 0.02692473
The test of significance of random effects \(\gamma_{a}\sim N(0.\tau_a^2)\) and \(\gamma_{d}\sim N(0.\tau_d^2)\) is equivalent to testing if the variance of each random effect \(\tau_a^2\) and \(\tau_d^2\) is 0 or greater than 0. We construct the following hypothesis testing
\[ H_o:\tau_a^2 = 0 \text{ vs } H_a:\tau_a^2 > 0 \] for testing if the random effect of attack is significant and the same holds for defence as well.
no_at_mod = glmmTMB(goals ~ home + (1|defence), poisson, data=long, REML=TRUE)
LRT_attack = as.numeric(2*(logLik(mod)-logLik(no_at_mod)))
pval = 0.5 * pchisq(LRT_attack, df = 1, lower.tail = FALSE) +
0.5 * pchisq(LRT_attack, df = 0 , lower.tail = FALSE)
pval
## [1] 0.2587558
no_de_mod = glmmTMB(goals ~ home + (1|attack), poisson, data=long, REML=TRUE)
LRT_defence = as.numeric(2*(logLik(mod)-logLik(no_de_mod)))
pval = 0.5 * pchisq(LRT_defence, df = 1, lower.tail = FALSE) +
0.5 * pchisq(LRT_defence, df = 0, lower.tail = FALSE)
pval
## [1] 0.09843786
To test the significance of of the fixed effect parameter for the home field advantage we must use the ordinary instead of the restricted likelihoods as the restricted likelihoods, at least in the LMM setting, involves different subsets of the data and are thus not comparable.
mod_ml = glmmTMB(goals ~ (1|attack) + (1|defence), poisson, data=long, REML=FALSE)
no_home_mod = glmmTMB(goals ~ (1|attack) + (1|defence), poisson, data=long, REML=FALSE)
LRT_home = as.numeric(2*(logLik(mod_ml)-logLik(no_home_mod)))
pchisq(LRT_home, df = 1, lower.tail = F)
## [1] 1
rankings = function(data) {
data = data[complete.cases(data),]
tmp =
data.frame(
team = levels(data$attack),
points = 0,
goalsscored = 0,
goalsconceeded = 0,
scorediff = 0
)
# add points to each team going through each match separately
for (i in seq(1, nrow(data), by = 2)) {
j = data[i,]$attack
k = data[i,]$defence
if (data[i,]$goals > data[i + 1,]$goals)
tmp[tmp$team == j,]$points =
tmp[tmp$team == j,]$points + 3 # team j won
else if (data[i,]$goals < data[i + 1,]$goals)
tmp[tmp$team == k,]$points =
tmp[tmp$team == k,]$points + 3 # team k won
else {
tmp[tmp$team == j,]$points =
tmp[tmp$team == j,]$points + 1 # draw
tmp[tmp$team == k,]$points =
tmp[tmp$team == k,]$points + 1
}
}
# compute goal differences and goals scored
for (j in levels(data$attack)) {
tmp[tmp$team == j, ]$goalsscored =
sum(data[data$attack == j,]$goals)
tmp[tmp$team == j, ]$goalsconceeded =
sum(data[data$defence == j,]$goals)
}
tmp$scorediff = tmp$goalsscored - tmp$goalsconceeded
# compute the rankings
tmp$rank = data.table::frankv(
tmp,
col = c("points", "scorediff", "goalsscored"),
ties.method = "random",
order = -1
)
# return everything
tmp[order(tmp$points, decreasing = T),]
}
rnk = rankings(long)
rownames(rnk) = NULL
rnk
## team points goalsscored goalsconceeded scorediff rank
## 1 Rosenborg 52 43 20 23 1
## 2 Brann 48 36 23 13 2
## 3 Molde 43 48 30 18 3
## 4 Haugesund 41 36 28 8 4
## 5 Ranheim_TF 38 38 40 -2 5
## 6 Vaalerenga 36 35 37 -2 6
## 7 Odd 34 35 29 6 7
## 8 Tromsoe 33 35 33 2 8
## 9 Sarpsborg08 32 39 34 5 9
## 10 Kristiansund 31 32 35 -3 10
## 11 BodoeGlimt 27 28 30 -2 11
## 12 Stroemsgodset 26 38 38 0 12
## 13 Lillestroem 25 26 37 -11 13
## 14 Stabaek 23 29 43 -14 14
## 15 Start 23 24 42 -18 15
## 16 Sandefjord_Fotball 15 24 47 -23 16
lambdas = predict(mod, newdata = long, type = "response")
sim_rank = matrix(NA,nrow = 1000, ncol = 16)
for (i in 1:1000) {
res = long
res$goals = rpois(length(lambdas), lambda = lambdas)
r = rankings(res)
sim_rank[i,] = as.character(r$team)
}
nm = levels(long$attack) # team names
lnm = length(nm) # total number of teams
probs = matrix(0, nrow = lnm, ncol = lnm) # matrix to store the probabilities of rankings
rownames(probs) = nm
colnames(probs) = paste0(1:lnm,"-place")
for (j in 1:ncol(sim_rank)) {
prob = table(sim_rank[,j])/1000 # calculate probabilities of j-th place
nms = names(prob) # team names of corresponding probabilities
for(i in 1:length(prob)){
ind = which(nm == nms[i]) # find row of i-th name
probs[ind,j] = as.numeric(prob[i]) # store probability
}
}
> probs
1-place 2-place 3-place 4-place 5-place 6-place 7-place 8-place 9-place
BodoeGlimt 0.059 0.071 0.069 0.064 0.076 0.061 0.058 0.072 0.061
Brann 0.127 0.101 0.112 0.094 0.076 0.068 0.067 0.060 0.064
Haugesund 0.085 0.082 0.082 0.089 0.074 0.067 0.086 0.073 0.056
Kristiansund 0.052 0.054 0.076 0.067 0.052 0.050 0.069 0.079 0.059
Lillestroem 0.031 0.030 0.043 0.063 0.044 0.074 0.069 0.063 0.055
Molde 0.117 0.106 0.084 0.082 0.096 0.079 0.066 0.071 0.063
Odd 0.070 0.091 0.091 0.080 0.086 0.070 0.057 0.077 0.065
Ranheim_TF 0.039 0.048 0.057 0.048 0.057 0.070 0.063 0.046 0.094
Rosenborg 0.182 0.132 0.087 0.105 0.083 0.080 0.067 0.051 0.048
Sandefjord_Fotball 0.010 0.015 0.017 0.020 0.033 0.030 0.041 0.036 0.059
Sarpsborg08 0.054 0.058 0.072 0.060 0.075 0.073 0.068 0.072 0.078
Stabaek 0.019 0.021 0.032 0.029 0.043 0.048 0.054 0.050 0.053
Start 0.014 0.024 0.019 0.035 0.041 0.043 0.039 0.046 0.059
Stroemsgodset 0.040 0.066 0.047 0.048 0.049 0.062 0.064 0.060 0.056
Tromsoe 0.063 0.057 0.060 0.055 0.055 0.063 0.073 0.074 0.071
Vaalerenga 0.038 0.044 0.052 0.061 0.060 0.062 0.059 0.070 0.059
10-place 11-place 12-place 13-place 14-place 15-place 16-place
BodoeGlimt 0.066 0.081 0.056 0.060 0.053 0.051 0.042
Brann 0.048 0.053 0.042 0.027 0.028 0.015 0.018
Haugesund 0.052 0.056 0.053 0.043 0.051 0.030 0.021
Kristiansund 0.060 0.062 0.075 0.062 0.080 0.058 0.045
Lillestroem 0.072 0.059 0.076 0.082 0.078 0.075 0.086
Molde 0.039 0.035 0.053 0.036 0.034 0.024 0.015
Odd 0.056 0.058 0.043 0.054 0.035 0.039 0.028
Ranheim_TF 0.066 0.074 0.077 0.083 0.061 0.066 0.051
Rosenborg 0.027 0.038 0.026 0.020 0.024 0.021 0.009
Sandefjord_Fotball 0.078 0.055 0.083 0.082 0.111 0.128 0.202
Sarpsborg08 0.059 0.064 0.057 0.065 0.053 0.052 0.040
Stabaek 0.097 0.063 0.071 0.085 0.090 0.113 0.132
Start 0.073 0.082 0.081 0.095 0.078 0.120 0.151
Stroemsgodset 0.062 0.074 0.070 0.078 0.089 0.076 0.059
Tromsoe 0.082 0.072 0.067 0.057 0.059 0.043 0.049
Vaalerenga 0.063 0.074 0.070 0.071 0.076 0.089 0.052
>
expect_ranking = matrix(0, nrow = nrow(probs))
rownames(expect_ranking) = rownames(probs)
for(i in 1:nrow(probs)){
expect_ranking[i,] = sum(1:16*probs[i,])
}
t(expect_ranking)
## BodoeGlimt Brann Haugesund Kristiansund Lillestroem Molde Odd Ranheim_TF
## [1,] 8.123 6.177 7.144 8.56 9.558 6.415 7.234 8.995
## Rosenborg Sandefjord_Fotball Sarpsborg08 Stabaek Start Stroemsgodset
## [1,] 5.386 11.775 8.178 10.718 11.041 9.172
## Tromsoe Vaalerenga
## [1,] 8.384 9.14
lambdas = predict(mod, newdata = long[!complete.cases(long),], type = "response")
sim_rank = matrix(NA,nrow = 1000, ncol = 16)
for (i in 1:1000) {
res = long
res$goals[!complete.cases(res$goals)] = rpois(length(lambdas), lambda = lambdas)
r = rankings(res)
sim_rank[i,] = as.character(r$team)
}
nm = levels(long$attack) # team names
lnm = length(nm) # total number of teams
probs = matrix(0, nrow = lnm, ncol = lnm) # matrix to store the probabilities of rankings
rownames(probs) = nm
colnames(probs) = paste0(1:lnm,"-place")
for (j in 1:ncol(sim_rank)) {
prob = table(sim_rank[,j])/1000 # calculate probabilities of j-th place
nms = names(prob) # team names of corresponding probabilities
for(i in 1:length(prob)){
ind = which(nm == nms[i]) # find row of i-th name
probs[ind,j] = as.numeric(prob[i]) # store probability
}
}
> probs
1-place 2-place 3-place 4-place 5-place 6-place 7-place 8-place 9-place
BodoeGlimt 0.000 0.000 0.000 0.000 0.001 0.003 0.014 0.034 0.080
Brann 0.252 0.669 0.073 0.006 0.000 0.000 0.000 0.000 0.000
Haugesund 0.000 0.031 0.371 0.368 0.161 0.049 0.014 0.005 0.001
Kristiansund 0.000 0.000 0.000 0.001 0.025 0.091 0.117 0.173 0.210
Lillestroem 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.006 0.023
Molde 0.006 0.058 0.486 0.346 0.082 0.019 0.000 0.003 0.000
Odd 0.000 0.000 0.003 0.026 0.114 0.188 0.225 0.205 0.128
Ranheim_TF 0.000 0.004 0.037 0.186 0.345 0.202 0.125 0.066 0.024
Rosenborg 0.742 0.238 0.019 0.001 0.000 0.000 0.000 0.000 0.000
Sandefjord_Fotball 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Sarpsborg08 0.000 0.000 0.001 0.004 0.032 0.070 0.116 0.157 0.208
Stabaek 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.009
Start 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.002 0.004
Stroemsgodset 0.000 0.000 0.000 0.000 0.000 0.001 0.004 0.026 0.043
Tromsoe 0.000 0.000 0.001 0.010 0.054 0.131 0.175 0.185 0.180
Vaalerenga 0.000 0.000 0.009 0.052 0.186 0.246 0.208 0.136 0.090
10-place 11-place 12-place 13-place 14-place 15-place 16-place
BodoeGlimt 0.149 0.247 0.238 0.130 0.077 0.027 0.000
Brann 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Haugesund 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Kristiansund 0.186 0.126 0.047 0.020 0.003 0.001 0.000
Lillestroem 0.050 0.100 0.169 0.271 0.233 0.142 0.004
Molde 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Odd 0.080 0.024 0.005 0.002 0.000 0.000 0.000
Ranheim_TF 0.010 0.001 0.000 0.000 0.000 0.000 0.000
Rosenborg 0.000 0.000 0.000 0.000 0.000 0.000 0.000
Sandefjord_Fotball 0.000 0.000 0.000 0.001 0.009 0.067 0.923
Sarpsborg08 0.221 0.120 0.053 0.016 0.002 0.000 0.000
Stabaek 0.022 0.084 0.131 0.213 0.283 0.241 0.015
Start 0.009 0.037 0.063 0.154 0.237 0.437 0.057
Stroemsgodset 0.088 0.152 0.261 0.187 0.152 0.085 0.001
Tromsoe 0.139 0.085 0.030 0.006 0.004 0.000 0.000
Vaalerenga 0.046 0.024 0.003 0.000 0.000 0.000 0.000
expect_ranking = matrix(0, nrow = nrow(probs))
rownames(expect_ranking) = rownames(probs)
for(i in 1:nrow(probs)){
expect_ranking[i,] = sum(1:16*probs[i,])
}
t(expect_ranking)
## BodoeGlimt Brann Haugesund Kristiansund Lillestroem Molde Odd Ranheim_TF
## [1,] 8.123 6.177 7.144 8.56 9.558 6.415 7.234 8.995
## Rosenborg Sandefjord_Fotball Sarpsborg08 Stabaek Start Stroemsgodset
## [1,] 5.386 11.775 8.178 10.718 11.041 9.172
## Tromsoe Vaalerenga
## [1,] 8.384 9.14
library(ggplot2)
nm = levels(long$attack)
pos = matrix(NA, nrow = nrow(sim_rank), ncol = ncol(sim_rank))
for(i in 1:nrow(pos)){
pos[i,] = match(nm, sim_rank[i,])
}
pos = data.frame(pos)
colnames(pos) = nm
pos_plot = tidyr::gather(pos, key = "team", value = "pos")
ggplot(data = pos_plot, aes(pos)) + geom_bar() + facet_wrap(~team)
We observe that the uncertainty regarding the first and last ranking positions is smaller compared to the in between rankings.
exp_pos = colMeans(pos)
sd_pos = apply(pos,2,sd)/sqrt(1000)
dfplot = data.frame(team = nm, exp_pos = exp_pos, sd_pos = sd_pos, strength = m$strength)
bs = lm(exp_pos~strength, data = dfplot)$coefficients
pd <- position_dodge(0.1)
ggplot(dfplot, aes(x=strength, y=exp_pos, colour = team)) +
geom_point()+
geom_errorbar(aes(ymin=exp_pos-1.96*sd_pos, ymax=exp_pos+1.96*sd_pos)) +
geom_abline(intercept = bs[1] ,slope = bs[2], colour = "grey")
Unsurprisingly, in the plot it seems that there is a strong relation between the expected ranking and team strengths. The relationship between the estimated expected rankings based on simulation of the whole series and the difference between attack and defence strength of each team does not seem to be completely montonic but this may be because of Monte carlo error so it remains somewhat unclear if the relationship is monotonic or not.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] glmmTMB_1.0.2.1 lme4_1.1-23 Matrix_1.2-18 reshape2_1.4.4
## [5] ggplot2_3.3.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 TMB_1.7.18 nloptr_1.2.2.2 pillar_1.4.6
## [5] compiler_4.0.3 plyr_1.8.6 tools_4.0.3 boot_1.3-25
## [9] digest_0.6.27 statmod_1.4.35 nlme_3.1-149 evaluate_0.14
## [13] lifecycle_0.2.0 tibble_3.0.4 gtable_0.3.0 lattice_0.20-41
## [17] pkgconfig_2.0.3 rlang_0.4.8 yaml_2.2.1 xfun_0.19
## [21] withr_2.3.0 dplyr_1.0.2 stringr_1.4.0 knitr_1.30
## [25] generics_0.1.0 vctrs_0.3.5 grid_4.0.3 tidyselect_1.1.0
## [29] data.table_1.13.2 glue_1.4.2 R6_2.5.0 rmarkdown_2.5
## [33] minqa_1.2.4 farver_2.0.3 tidyr_1.1.2 purrr_0.3.4
## [37] magrittr_1.5 MASS_7.3-53 scales_1.1.1 ellipsis_0.3.1
## [41] htmltools_0.5.0 splines_4.0.3 colorspace_2.0-0 labeling_0.4.2
## [45] stringi_1.5.3 munsell_0.5.0 crayon_1.3.4