Sunday, October 28, 2012

Mixed distribution

In the football data there was some reason to use a mixed distribution (ref), so I tried doing that. It was more difficult than I expected. Not only is a mixture of distributions fairly difficult to fit, the system was also over-parameterized, causing grief.

mixed distribution

Data from a mixed distribution is, simply said, data that comes from a combination of distributions. The most obvious example is the height of people: males are taller than females. Combine their heights and a joint, non-normal distribution appears. Separate them and there are two (approximately) normal distributions. This problem is equivalent to the eyes problem in WinBUGS, except that I use JAGS and it was supposed to be part of a larger model. The example section in JAGS shows two ways to do this. In the end I chose the dnormmix distribution in the mix module. To quote the JAGS manual: 'The mix module defines a novel distribution dnormmix(mu,tau,pi) representing a finite mixture of normal distributions' and 'If you want to use the dnormmix distribution but do not care about label switching, then you can disable the tempered transition sampler with
set factory "mix::TemperedMix" off, type(sampler)'
That is me, so here is how to do that in R:
library(R2jags)
Loading required package: coda
Loading required package: lattice
Loading required package: R2WinBUGS
Loading required package: rjags
linking to JAGS 3.3.0
module basemod loaded
module bugs loaded
Loading required package: abind
Loading required package: parallel

Attaching package: 'R2jags'

The following object(s) are masked from 'package:coda':

traceplot

load.module("mix")
module mix loaded
set.factory("mix::TemperedMix", 'sampler', FALSE)
NULL
list.factories('sampler')
factory status
1 mix::TemperedMix FALSE
2 bugs::DSum TRUE
3 bugs::Conjugate TRUE
4 bugs::Dirichlet TRUE
5 bugs::MNormal TRUE
6 base::Finite TRUE
7 base::Slice TRUE
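
The height example from the start of this section can be sketched in base R without JAGS. The means, standard deviations and the 50/50 split below are made-up illustration values, and dmix is a hypothetical helper, not the JAGS dnormmix (which is parameterized by precision tau rather than standard deviation).

```r
set.seed(1)
n <- 2000
# Assumed, illustrative values: not taken from any data set
mu    <- c(167, 180)   # component means (cm)
sigma <- c(6, 7)       # component standard deviations (cm)
p     <- c(0.5, 0.5)   # mixing proportions

# Draw a component label per person, then a height from that component
comp   <- sample(1:2, n, replace = TRUE, prob = p)
height <- rnorm(n, mean = mu[comp], sd = sigma[comp])

# Density of a finite normal mixture (hypothetical helper)
dmix <- function(x, mu, sigma, p) {
  dens <- numeric(length(x))
  for (k in seq_along(p)) dens <- dens + p[k] * dnorm(x, mu[k], sigma[k])
  dens
}

# Compare the combined sample with the mixture density
hist(height, breaks = 40, freq = FALSE, main = "Mixture of two normals")
curve(dmix(x, mu, sigma, p), add = TRUE, lwd = 2)
```

Separating the sample by comp gives back two normal samples; combined, the histogram is wider and flatter than any single normal with the same mean.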

football model

This is the previous model with dnormmix grafted in, and with the overall strength and attack/defense balance removed in an attempt to simplify and understand the estimation problems. Note that Astr[1] is fixed to 1, Dstr[1] is almost fixed to 0 by a tight prior, and all other Astr and Dstr are free. I am not 100% sure about those parts. Astr[1] needs to be fixed because otherwise the whole set of Astr[] and Dstr[] changes in a similar way over the MCMC samples; it is only their relative values that count. Somehow there is still a need to constrain Dstr[1], as there is still an overall change. Yet I have the feeling that fixing Astr[1] gives the information to determine Dstr[2:nclub], hence Astr[2:nclub], hence Dstr[1]. Still, they all move in a similar way, and Dstr[1] is clearly not 0 as suggested by the prior.
fbmodel1 <- function() {
  for (i in 1:N) {
    HomeMadeGoals[i] ~ dpois(HS[i])
    OutMadeGoals[i] ~ dpois(OS[i])
    log(HS[i]) <- Home  + Dstr[OutClub[i]] + Astr[HomeClub[i]]
    log(OS[i]) <-        Dstr[HomeClub[i]] + Astr[OutClub[i]] 
  }
  Astr[1] <- 1
  Dstr[1] ~ dnorm(0,10)
  for (i in 2:nclub) {
    Astr[i] ~  dnormmix(MAStr,tauAStr1,EtaAStr1)
    Dstr[i] ~  dnormmix(MDStr,tauDStr1,EtaDStr1)
  }
  for (i in 1:3) {
    MAStr[i] ~ dnorm(0,.01)
    MDStr[i] ~ dnorm(0,.01)
    tauDStr1[i] <- tauDStr
    tauAStr1[i] <- tauAStr
    eee[i] <- 3
  }
  EtaAStr1[1:3] ~ ddirch(eee[1:3])
  EtaDStr1[1:3]  ~ ddirch(eee[1:3])
  sigmaAstr <- 1/sqrt(tauAStr)
  tauAStr ~ dgamma(.001,.001)
  sigmaDstr <- 1/sqrt(tauDStr)
  tauDStr ~ dgamma(.001,.001)
  Home ~ dnorm(0,.0001)
}
params <- c("Dstr","Astr","sigmaAstr","sigmaDstr","Home")
# Initial values only for nodes that exist in this model;
# the remaining stochastic nodes are initialized by JAGS itself.
inits <- function(){
    list(tauAStr=runif(1,1,10),
         tauDStr=runif(1,1,10),
         Home=rnorm(1)
         )
}
jagsfit <- jags(JAGSData, model=fbmodel1, inits=inits, 
                parameters=params,progress.bar="gui",
                n.iter=5000)
jagsfit

Inference for Bugs model at "C:/Users/.../RtmpSGssTD/modeld94a9b3849.txt", fit using jags,
 3 chains, each with 15000 iterations (first 7500 discarded), n.thin = 7
 n.sims = 3216 iterations saved
           mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat n.eff
Astr[1]      1.000   0.000    1.000    1.000    1.000    1.000    1.000 1.000     1
Astr[2]      1.641   0.160    1.338    1.533    1.639    1.748    1.971 1.033    69
Astr[3]      1.328   0.196    0.943    1.196    1.324    1.463    1.710 1.023    91
Astr[4]      0.878   0.185    0.513    0.753    0.880    1.007    1.228 1.021   120
Astr[5]      0.753   0.207    0.326    0.617    0.755    0.894    1.143 1.017   140
Astr[6]      0.949   0.177    0.598    0.836    0.950    1.065    1.304 1.021   110
Astr[7]      1.559   0.158    1.257    1.450    1.559    1.669    1.875 1.032    73
Astr[8]      1.167   0.193    0.804    1.034    1.161    1.293    1.566 1.022   120
Astr[9]      1.426   0.180    1.078    1.304    1.429    1.551    1.756 1.029    84
Astr[10]     1.117   0.187    0.764    0.990    1.114    1.237    1.499 1.024    99
Astr[11]     1.006   0.179    0.660    0.889    1.003    1.121    1.368 1.024    98
Astr[12]     0.955   0.181    0.603    0.830    0.957    1.073    1.312 1.024    91
Astr[13]     1.599   0.159    1.296    1.490    1.599    1.712    1.919 1.034    71
Astr[14]     0.927   0.179    0.566    0.809    0.928    1.053    1.277 1.025    90
Astr[15]     1.178   0.195    0.810    1.046    1.174    1.308    1.574 1.016   160
Astr[16]     1.539   0.164    1.216    1.429    1.538    1.656    1.854 1.034    70
Astr[17]     1.040   0.183    0.693    0.917    1.039    1.154    1.414 1.019   130
Astr[18]     0.968   0.180    0.614    0.849    0.971    1.090    1.315 1.029    91
Dstr[1]     -0.640   0.162   -0.977   -0.747   -0.638   -0.530   -0.331 1.031    79
Dstr[2]     -1.183   0.179   -1.533   -1.301   -1.183   -1.062   -0.844 1.034    66
Dstr[3]     -1.203   0.181   -1.568   -1.323   -1.196   -1.083   -0.853 1.023    97
Dstr[4]     -0.709   0.160   -1.011   -0.814   -0.710   -0.604   -0.400 1.040    58
Dstr[5]     -0.697   0.162   -1.020   -0.804   -0.697   -0.591   -0.375 1.050    46
Dstr[6]     -0.840   0.177   -1.208   -0.956   -0.833   -0.719   -0.508 1.031    70
Dstr[7]     -1.054   0.189   -1.413   -1.191   -1.055   -0.927   -0.675 1.039    62
Dstr[8]     -0.869   0.181   -1.237   -0.987   -0.862   -0.743   -0.538 1.023    91
Dstr[9]     -1.177   0.177   -1.533   -1.293   -1.178   -1.059   -0.832 1.022   100
Dstr[10]    -0.821   0.169   -1.174   -0.929   -0.812   -0.704   -0.503 1.022    98
Dstr[11]    -0.944   0.188   -1.307   -1.074   -0.935   -0.812   -0.593 1.023    95
Dstr[12]    -1.092   0.177   -1.433   -1.216   -1.092   -0.972   -0.739 1.024    87
Dstr[13]    -1.030   0.188   -1.386   -1.162   -1.036   -0.900   -0.650 1.023   100
Dstr[14]    -1.031   0.189   -1.384   -1.160   -1.028   -0.899   -0.665 1.035    64
Dstr[15]    -0.735   0.164   -1.065   -0.843   -0.734   -0.624   -0.418 1.042    54
Dstr[16]    -0.836   0.178   -1.212   -0.952   -0.830   -0.714   -0.500 1.025    88
Dstr[17]    -1.116   0.178   -1.456   -1.237   -1.121   -0.998   -0.758 1.042    55
Dstr[18]    -0.681   0.163   -1.005   -0.791   -0.681   -0.571   -0.367 1.041    54
Home         0.334   0.062    0.210    0.293    0.335    0.376    0.453 1.012   170
sigmaAstr    0.185   0.103    0.042    0.102    0.169    0.254    0.410 1.014   220
sigmaDstr    0.135   0.079    0.028    0.067    0.123    0.190    0.301 1.033    68
deviance  1889.263   8.560 1873.823 1883.430 1888.635 1894.484 1907.580 1.001  2400

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 36.6 and DIC = 1925.9
DIC is an estimate of expected predictive error (lower deviance is better).
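
The reason one strength has to be pinned down can be checked numerically outside JAGS. The sketch below uses made-up strengths for four hypothetical clubs and the same linear predictors as fbmodel1; it only demonstrates the exact invariance (adding a constant to every Astr and subtracting it from every Dstr) and says nothing about slow mixing of the sampler.

```r
set.seed(2)
nclub <- 4                       # small illustrative league
Astr  <- rnorm(nclub)            # made-up attack strengths
Dstr  <- rnorm(nclub)            # made-up defense strengths
Home  <- 0.3                     # made-up home advantage

# All home/away pairings
games <- expand.grid(HomeClub = 1:nclub, OutClub = 1:nclub)
games <- games[games$HomeClub != games$OutClub, ]

# Poisson rates as in fbmodel1
rates <- function(Astr, Dstr) {
  cbind(HS = exp(Home + Dstr[games$OutClub]  + Astr[games$HomeClub]),
        OS = exp(       Dstr[games$HomeClub] + Astr[games$OutClub]))
}

shift <- 0.7
# Same rates, hence same likelihood, for shifted parameters
all.equal(rates(Astr, Dstr), rates(Astr + shift, Dstr - shift))
```

Since every rate contains exactly one Astr and one Dstr, the shift cancels. Fixing Astr[1] removes this particular freedom; whether the remaining drift in the chains is then a mixing problem rather than a second unidentified direction is something this check cannot settle.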





Sunday, October 14, 2012

Putting a football model into JAGS

In this post the football model is programmed in JAGS. There are plenty of reasons to do so: JAGS 3.3 was recently released, and I was stimulated by Gianluca's post. Obviously I could copy the model from his paper, but that would be too easy, and I am not sure about copyright either. So, in this post, variations of the model from an earlier post are recreated.

Data structure

The starting data looks like this:
 head(Jold)
    Seizoen      Datum      Thuisclub          Uitclub Thuisscore Uitscore
1 2011-2012 2011-08-05      Excelsior        Feyenoord          0        2
2 2011-2012 2011-08-06   RKC Waalwijk  Heracles Almelo          2        2
3 2011-2012 2011-08-06        Roda JC     FC Groningen          2        1
4 2011-2012 2011-08-06  SC Heerenveen              NEC          2        2
5 2011-2012 2011-08-06      VVV-Venlo       FC Utrecht          0        0
6 2011-2012 2011-08-07   ADO Den Haag          Vitesse          0        0
The data has to be made ready for JAGS. All factors have to be converted to numbers. Note that each game is still one row; the two outcomes (home goals and away goals) are handled in JAGS.
JAGSData <- with(Jold,list(

N=length(Datum),
nclub = nlevels(Thuisclub),
HomeClub=(1:nlevels(Thuisclub))[Thuisclub],
OutClub=(1:nlevels(Uitclub))[Uitclub],
HomeMadeGoals=Thuisscore,
OutMadeGoals=Uitscore))
str(JAGSData)
List of 6
 $ N            : int 306
 $ nclub        : int 18
 $ HomeClub     : int [1:306] 5 14 15 16 18 1 3 4 11 7 ...
 $ OutClub      : int [1:306] 9 10 6 12 8 17 13 2 7 3 ...
 $ HomeMadeGoals: int [1:306] 0 2 2 2 0 0 3 1 0 2 ...
 $ OutMadeGoals : int [1:306] 2 2 1 2 0 0 1 4 1 0 ...
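
The factor-to-number step can be illustrated on a toy factor. The club names below are only an example; the point is that indexing 1:nlevels(f) by the factor gives the same integer codes as as.integer(f), and that the home and away factors need identical levels for the codes to refer to the same clubs.

```r
# Toy factors standing in for Thuisclub and Uitclub
home <- factor(c("Feyenoord", "Ajax", "PSV", "Ajax"))
away <- factor(c("Ajax", "PSV", "Feyenoord", "Feyenoord"),
               levels = levels(home))

idx1 <- (1:nlevels(home))[home]   # construction used above
idx2 <- as.integer(home)          # equivalent, shorter form
identical(idx1, idx2)

# The codes of both factors only match if the levels are the same
identical(levels(home), levels(away))

# Going back from code to club name
levels(home)[idx2]
```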

Model

The model in JAGS has to contain all the details and all distributions. That makes it much longer, but also gives the opportunity to fiddle around with details. For now it is reasonably simple. Each club has two properties, an attack strength (Astr) and a defense strength (Dstr). The expected number of goals is based on the combination of these strengths, and the goals themselves are again Poisson distributed. The interesting part is in the way attack and defense strength are constructed. These are created from an overall strength (TopStr) and an attack/defense balance (AD).
fbmodel1 <- function() {
  for (i in 1:N) {
    HomeMadeGoals[i] ~ dpois(HS[i])
    OutMadeGoals[i]  ~ dpois(OS[i])
    log(HS[i]) <- Home + Astr[HomeClub[i]] + Dstr[OutClub[i]]
    log(OS[i]) <-        Dstr[HomeClub[i]] + Astr[OutClub[i]]
  }
  for (i in 1:nclub) {
    Astr[i] <-  TopStr[i]+AD[i]
    Dstr[i] <-  TopStr[i]-AD[i]
    TopStr[i] ~ dnorm(0,tauTop)
    AD[i]     ~ dnorm(0,tauAD)
  }
  tauAD <- pow(sigmaAD,-2)
  sigmaAD ~ dunif(0,100)
  tauTop <- pow(sigmaTop,-2)
  sigmaTop ~ dunif(0,100)
  Home ~ dnorm(0,.0001)
}
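
To see what this model implies, it can be run forwards in plain R: draw club parameters, build the rates and draw goals. The values of sigmaTop, sigmaAD and Home below are assumptions chosen for illustration, not estimates.

```r
set.seed(3)
nclub    <- 18
sigmaTop <- 0.3    # assumed
sigmaAD  <- 0.2    # assumed
Home     <- 0.3    # assumed

TopStr <- rnorm(nclub, 0, sigmaTop)
AD     <- rnorm(nclub, 0, sigmaAD)
Astr   <- TopStr + AD
Dstr   <- TopStr - AD

# Full double round robin: every club hosts every other club once
games <- expand.grid(HomeClub = 1:nclub, OutClub = 1:nclub)
games <- games[games$HomeClub != games$OutClub, ]

# Same linear predictors as in fbmodel1
HS <- exp(Home + Astr[games$HomeClub] + Dstr[games$OutClub])
OS <- exp(       Dstr[games$HomeClub] + Astr[games$OutClub])

games$HomeMadeGoals <- rpois(nrow(games), HS)
games$OutMadeGoals  <- rpois(nrow(games), OS)

nrow(games)                                   # number of simulated games
colMeans(games[, c("HomeMadeGoals", "OutMadeGoals")])
```

With 18 clubs this gives 18 * 17 = 306 games, the same N as in the data. Note that, as the model is written, a larger Dstr for the opponent raises the scoring rate, so Dstr acts as a defensive weakness.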
To make this model work it needs a series of incantations in R. To keep things short, only a basic output plot is given.
params <- c("TopStr","AD","sigmaAD","sigmaTop","Home")
inits <- function(){
    list(TopStr=rnorm(JAGSData$nclub),
         AD=rnorm(JAGSData$nclub),
         sigmaAD=runif(1),
         sigmaTop=runif(1,0,.5),
         Home=rnorm(1)
         )
}
jagsfit <- jags(JAGSData, model=fbmodel1, inits=inits, 
                parameters=params,progress.bar="gui",
                n.iter=4000)
plot(jagsfit)

Parameter extraction

As the interest is in the teams, their properties are extracted. As Gianluca wrote: 'we considered a slightly more complex structure in which we included information on each team's propensity to be "good", "average", or "poor". This helped avoid overshrinkage in the estimations'. To examine this I looked at the teams' relative scores. The plots look like they have a shoulder, so he has a point there.
jags.smat <- jagsfit$BUGSoutput$summary
for (i in 1:nlevels(Jold$Thuisclub))  
  rownames(jags.smat) <- sub(paste('[',i,']',sep=''),levels(Jold$Thuisclub)[i],
      rownames(jags.smat),fixed=TRUE) 
plot(density(jags.smat[grep('^Top',rownames(jags.smat)),1])
                       ,main='Distribution of Strength')
points(x=jags.smat[grep('^Top',rownames(jags.smat)),1],
       y=rep(0,nlevels(Jold$Thuisclub))) 
plot(density(jags.smat[grep('^AD',rownames(jags.smat)),1]),
main='Distribution of Attack / Defense')
points(x=jags.smat[    grep('^AD',rownames(jags.smat)),1],
y=rep(0,nlevels(Jold$Thuisclub)))

model variation

There are a number of ways to create distributions for attack and defense strength (Astr and Dstr). One variation is to make them independent, and also to give them fatter tails, which is handled by a t distribution rather than a normal distribution. These fatter tails could accommodate teams that are much better or worse than the majority. Unfortunately this did not work: the number of degrees of freedom for the t distribution varied a lot and was too high to make this an alternative model.
# section replaced in fbmodel1 
 for (i in 1:nclub) {
    Astr[i] ~ dt(0,tauStr,nuStr)
    Dstr[i] ~ dt(0,tauStr,nuStr)
  }
  tauStr <- pow(sigmaStr,-2)
  sigmaStr ~ dunif(0,100)
  nuStr <- 1/InuStr
  InuStr ~ dunif(0,.5)
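
What "fatter tails" means for the prior above can be tabulated in base R. With InuStr between 0 and 0.5 the degrees of freedom nuStr = 1/InuStr are at least 2. The sketch compares the probability of being more than 3 scale units from the centre for a standard t and a standard normal; it ignores the scale parameter tauStr, and the chosen InuStr values are arbitrary points in the prior range.

```r
InuStr <- c(0.5, 0.25, 0.1, 0.01)   # arbitrary points in (0, 0.5]
nuStr  <- 1 / InuStr                # implied degrees of freedom

tail_t    <- 2 * pt(-3, df = nuStr) # two-sided tail beyond 3 units
tail_norm <- 2 * pnorm(-3)

data.frame(InuStr, nuStr,
           tail_t = signif(tail_t, 3),
           ratio_to_normal = round(tail_t / tail_norm, 1))
```

For large nuStr the t distribution is close to the normal, which fits the observation that a high number of degrees of freedom leaves little to gain from this variation.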

Sunday, October 7, 2012

Football ordinal model: examination and predictions

In the previous entry an ordinal model for football games was developed. It is now time to look a bit closer at the model and use it. This means three sections: a look at likelihood and link function, a model interpretation part that focuses on the effect of playing away or at home, and a look at some predictions.

model definition

Just to recall the model defined. 
clm4b <- clm(oGoals ~OffenseClub + DefenseClub*OffThuis, data=StartData)

link function and likelihood

Looking at the vignette 'Analysis of ordinal data with cumulative link models — estimation with the R-package ordinal' there is a section where the link function is examined. I am not the expert here, but logit, probit and cloglog are most often used. Logit and probit are similar, except for extreme results. cloglog and loglog are asymmetrical. Cauchit I never encountered in the literature, but since it is there it is taken along. A model with logit link is a proportional odds model, a model with cloglog link is a proportional hazards model. See also McCullagh and Nelder (2nd edition, 1989).
The example in the vignette ran thus:
links <- c("logit", "probit", "cloglog", "loglog", "cauchit")
sapply(links, function(link) {
      clm(oGoals ~OffenseClub + DefenseClub*OffThuis 
          ,link=link,data=StartData)$logLik })
    logit    probit   cloglog    loglog   cauchit 
-906.3591 -906.9762 -914.2396 -915.1181 -924.0992 
Luckily the logit link, that is the proportional odds model, has the highest log-likelihood.
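
The shapes of the five links can be compared directly with base R distribution functions. The formulas below are the usual inverse links as I understand them (cloglog as Gumbel minimum, loglog as Gumbel maximum); treat the mapping to the names used by clm as an assumption.

```r
eta <- seq(-4, 4, by = 2)          # linear predictor values

inv_links <- cbind(
  logit   = plogis(eta),
  probit  = pnorm(eta),
  cloglog = 1 - exp(-exp(eta)),
  loglog  = exp(-exp(-eta)),
  cauchit = pcauchy(eta))
rownames(inv_links) <- eta
round(inv_links, 3)

# Symmetric links satisfy F(eta) + F(-eta) = 1
colSums(inv_links[c("-2", "2"), ])
```

Logit, probit and cauchit sum to 1 at plus and minus 2, while cloglog and loglog do not, which is the asymmetry mentioned above. Since all five fits have the same number of parameters, comparing their log-likelihoods directly is reasonable.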

The slice function can be used to plot the behavior of the likelihood as a function of the parameter estimates. It gives one plot for each parameter. To keep this post short only the first 9 are shown. These include the estimates of the category thresholds. It can be seen that thresholds 5|6 and 6|7 are asymmetrical, but close to the maximum likelihood value they are close to quadratic (detail plot not shown).

slice.fm4b <- slice(clm4b, lambda = 5)
par(mfrow = c(3,3))
plot(slice.fm4b,1:9)

Parameter interpretation

The most interesting parameter to assess is the interaction DefenseClub*OffThuis. It is not simple to do so: all parameters depend on each other, and one parameter is not present in order to keep the model estimable. As a way out, the difference for each team between playing at home and playing away is used. Ideally this is against a similar team, so the chances for a team when playing against itself are used. It is a very abstract idea: a team that plays at home against itself playing away. The merit is in the interpretation.
library(lattice)
teams <- data.frame(Off=levels(StartData$OffenseClub),
    Def=levels(StartData$OffenseClub))
homeaway <- morepred(clm4b,teams)
longha <- reshape(homeaway[,-1],
            idvar='club2',varying=list(chance=names(homeaway)[3:5]),
            direction='long',v.names='chance',
            times=c('Home','Draw','Away'),timevar='Location')
dotplot(club2 ~ chance,groups=Location,data=longha,
    auto.key=list(columns=3,space='bottom'),
    xlim=c(0,1),xlab='Chance',
    main='Chance of a club to win from itself, depending on home or away')
As can be seen, VVV-Venlo has a lot of advantage playing at home, as do Heracles Almelo, AZ and ADO Den Haag (these all have a black dot far right and a blue dot far left). In contrast, SC Heerenveen, FC Utrecht and De Graafschap show better chances away than at home. This I don't believe. If this were a Bayesian model, this belief could be enforced; as this is a frequentist model, I have to live with the estimates. Clearly the plot could be improved with confidence intervals, hopefully showing quite some overlap for these estimates. Going Bayesian is clearly in the frame at some point, as Gianluca showed.

Predictions

Previously it was attempted to predict the outcomes of one weekend of games of the current season. The results and predictions are given below.

Roda JC      - Utrecht      0-1
PEC Zwolle   - Groningen    1-2
RKC Waalwijk - VVV          1-1
Vitesse      - Heracles     1-1
NEC          - Willem II    0-0
ADO Den Haag - Ajax         1-1
Twente       - Heerenveen   1-0
NAC Breda    - AZ           2-1
PSV          - Feyenoord    3-0
These are the original expectations:
         club1           club2      win1     equal      win2
1      Roda JC      FC Utrecht 0.4580659 0.2126926 0.3291782
2 RKC Waalwijk       VVV-Venlo 0.6076020 0.2180298 0.1743364
3      Vitesse Heracles Almelo 0.5723334 0.2275537 0.2000907
4 ADO Den Haag            Ajax 0.1037534 0.1511710 0.7446496
5    FC Twente   SC Heerenveen 0.6605607 0.1558135 0.1822923
6    NAC Breda              AZ 0.2539698 0.2627759 0.4832506
7          PSV       Feyenoord 0.5055082 0.2147899 0.2796468
The new model has different predictions. Most interesting is that FC Utrecht now has a slightly better chance to win than to lose, while in the previous model it was predicted to lose. Big changes are in RKC Waalwijk - VVV-Venlo and FC Twente - SC Heerenveen.
         club1           club2       win1     equal      win2
1      Roda JC      FC Utrecht 0.36966744 0.2506325 0.3797000
2 RKC Waalwijk       VVV-Venlo 0.78326557 0.1343054 0.0824290
3      Vitesse Heracles Almelo 0.69169975 0.1776653 0.1306350
4 ADO Den Haag            Ajax 0.09980196 0.1591822 0.7410159
5    FC Twente   SC Heerenveen 0.47168784 0.2140003 0.3143119
6    NAC Breda              AZ 0.29419688 0.2577802 0.4480229
7          PSV       Feyenoord 0.47611532 0.2253700 0.2985147

code for predictions:

topred <- read.table(textConnection("
            'Roda JC'        'FC Utrecht'
            'PEC Zwolle'     'FC Groningen'
            'RKC Waalwijk'   'VVV-Venlo'
            'Vitesse'        'Heracles Almelo'
            'NEC'            'Willem II'
            'ADO Den Haag'   'Ajax'
            'FC Twente'      'SC Heerenveen'
            'NAC Breda'      'AZ'
            'PSV'            'Feyenoord'"
    ),col.names=c('Off','Def'))
morepred(clm4b,topred)

Additional code

fbpredict <- function(object,club1,club2) {
  UseMethod('fbpredict',object)
}

fbpredict.polr <- function(object,club1,club2) {
  top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))
  prepred <- predict(object,top,type='p')
  oo <- outer(prepred[2,],prepred[1,])
  rownames(oo) <- 0:(ncol(prepred)-1)
  colnames(oo) <- rownames(oo)
  class(oo) <- c('fboo',class(oo))
  # rows of oo come from prepred[2,] (club2), columns from prepred[1,] (club1)
  attr(oo,'row') <- club2
  attr(oo,'col') <- club1
  wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))
  names(wel) <- c(club1,'draw',club2)
  return(list(details=oo,'summary chances'=wel))
}

fbpredict.clm <- function(object,club1,club2) {
  top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))
  prepred <- predict(object,top,type='p')$fit
  oo <- outer(prepred[2,],prepred[1,])
  rownames(oo) <- 0:(ncol(prepred)-1)
  colnames(oo) <- rownames(oo)
  class(oo) <- c('fboo',class(oo))
  # rows of oo come from prepred[2,] (club2), columns from prepred[1,] (club1)
  attr(oo,'row') <- club2
  attr(oo,'col') <- club1
  wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))
  names(wel) <- c(club1,'draw',club2)
  return(list(details=oo,'summary chances'=wel))
}

print.fboo <- function(x,...) {
  cat(attr(x,'row'),'in rows against',attr(x,'col'),'in columns \n')
  class(x) <- class(x)[-1]
  attr(x,'row') <- NULL
  attr(x,'col') <- NULL
  oo <- formatC(x,format='f',width=4)
  oo <- gsub('\\.0+$','       ',oo)
  oo <- substr(oo,1,6)
  print(oo,quote=FALSE,justify='left')
}

morepred <- function(mymodel,topred) {
  UseMethod('morepred',mymodel)
}

morepred.polr <- function(mymodel,topred) {
  topred <- topred[topred[,1] %in% mymodel$xlevels$OffenseClub & 
          topred[,2] %in% mymodel$xlevels$OffenseClub ,]
  ap <- lapply(1:nrow(topred),function(irow) {
        fbp <- fbpredict(mymodel,as.character(topred[irow,1]),
            as.character(topred[irow,2]))
        sec2 <- fbp[[2]]
        mydf <- data.frame(club1=topred[irow,1],
            club2=topred[irow,2],
            win1=sec2[1],
            equal=sec2[2],
            win2=sec2[3])
      })
  dc <- do.call(rbind,ap)
  rownames(dc) <- 1:nrow(dc)
  dc
}

morepred.sclm <- function(mymodel,topred) {
  topred <- topred[topred[,1] %in% mymodel$xlevels$OffenseClub & 
          topred[,2] %in% mymodel$xlevels$OffenseClub ,]
  ap <- lapply(1:nrow(topred),function(irow) {
        fbp <- fbpredict(mymodel,as.character(topred[irow,1]),
            as.character(topred[irow,2]))
        sec2 <- fbp[[2]]
        mydf <- data.frame(club1=topred[irow,1],
            club2=topred[irow,2],
            win1=sec2[1],
            equal=sec2[2],
            win2=sec2[3])
      })
  dc <- do.call(rbind,ap)
  rownames(dc) <- 1:nrow(dc)
  dc
}
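
The core of fbpredict is the step from two goal distributions to win, draw and lose chances. A minimal sketch with made-up probabilities for 0 to 3 goals shows the mechanics; like fbpredict, it treats the two scores as independent.

```r
# Made-up probabilities of scoring 0, 1, 2, 3 goals
p_home <- c(0.30, 0.40, 0.20, 0.10)
p_away <- c(0.50, 0.30, 0.15, 0.05)

# Joint probability table: rows are away goals, columns are home goals
# (the first argument of outer() indexes the rows)
oo <- outer(p_away, p_home)
dimnames(oo) <- list(away = 0:3, home = 0:3)

# Above the diagonal the column (home) score exceeds the row (away) score
wel <- c(home = sum(oo[upper.tri(oo)]),
         draw = sum(diag(oo)),
         away = sum(oo[lower.tri(oo)]))
round(oo, 3)
wel
```

The three chances sum to 1 because the table covers every combination of scores exactly once.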