Sunday, September 30, 2012

Football, an ordinal model

On September 19th, flo2speak remarked under a post that his/her experience is that ordinal models had better performance. That seems reason enough to try, so there we are. In examining this type of model it is found that more complex models can be used. There is now an interaction between defense powers and home/away and an effect of winter stop. This leads to a change in prediction values.

Ordinal model and Log linear regression

ordinal model

An ordinal model has as a response a series of ordered categories. An example is liking of a food product (very much dislike, dislike, neutral, like, like very much) or stars for an Amazon product. Essential is that these are a fixed number of ordered categories. For football, the goals could be categories, although this does not have the fixed number of categories. In practice, this is not much of a problem, higher number of goals hardly occur, and the highest category is just including more goals up to infinity. 
An ordinal model describes the chances for each of the categories. One might imagine this model as giving a continuous response, plus a mapping how this response is translated into the categories. So, if category A has been given limits l1 and l2, then the probability that a response falls between l1 and l2, is the probability of category A.
One of the nice properties of the ordinal model is that the categories don't have to be the same size. It does not matter if the step from zero to one star is the same as the step from four to five stars. The model will adapt for that.

log linear model

The log linear model also predicts on a continuous scale. This scale is via the inverse-log (exponential) translated to a positive number. This number is then Poisson distributed. This means that a prediction has a confidence interval from the model, plus the variation due to the Poisson distribution. It should be noted that up to now my predictions were too precise, I only included the Poisson part. It also means that the predictions cannot become more 'narrow' for lack of a better word. It is not possible to get predictions which are saying there is 80% chance of one goal, because the Poisson distribution cannot make such outcomes. I can see that as a potential disadvantage. 

Application

Ordinal model in R

There are at least three packages with functions which can be used to fit ordinal models. The most obvious one is MASS which has the function polr. Almost as obvious is the package ordinal, which is a bit more extensive than just using polr. Finally, package VGAM has most capabilities. In this post I will do one demonstration of polr, and for the remainder use clm from package ordinal. VGAM is not used.
To demonstrate the equivalence between polr and clm:
top <- data.frame(OffenseClub=c('FC Groningen','Vitesse')

StartData$oGoals <- ordered(StartData$Goals)

(pol1 <- polr(oGoals ~OffenseClub,data=StartData))
Call:
polr(formula = oGoals ~ OffenseClub, data = StartData)

Coefficients:
           OffenseClubAjax              OffenseClubAZ 
                2.16090602                 1.10515167 
  OffenseClubDe Graafschap       OffenseClubExcelsior 
               -0.07000361                -0.47817103 
   OffenseClubFC Groningen       OffenseClubFC Twente 
                0.05437062                 1.61304748 
     OffenseClubFC Utrecht       OffenseClubFeyenoord 
                0.72672725                 1.40465078 
OffenseClubHeracles Almelo       OffenseClubNAC Breda 
                0.62502257                 0.38942582 
            OffenseClubNEC             OffenseClubPSV 
                0.26579358                 1.77961585 
   OffenseClubRKC Waalwijk         OffenseClubRoda JC 
                0.07824836                 0.63463829 
  OffenseClubSC Heerenveen         OffenseClubVitesse 
                1.64087778                 0.42797219 
      OffenseClubVVV-Venlo 
                0.14843577 

Intercepts:
       0|1        1|2        2|3        3|4        4|5        5|6        6|7 
-0.5749746  0.8583642  1.9763217  2.9660666  4.1504225  4.9698549  6.2901004 

Residual Deviance: 1929.761 
AIC: 1977.761 
(clm1 <- clm(oGoals ~OffenseClub ,data=StartData))
formula: oGoals ~ OffenseClub
data:    StartData

 link  threshold nobs logLik  AIC     niter max.grad
 logit flexible  612  -964.88 1977.76 6(0)  6.04e-13

Coefficients:
           OffenseClubAjax              OffenseClubAZ 
                   2.16095                    1.10517 
  OffenseClubDe Graafschap       OffenseClubExcelsior 
                  -0.07005                   -0.47813 
   OffenseClubFC Groningen       OffenseClubFC Twente 
                   0.05446                    1.61290 
     OffenseClubFC Utrecht       OffenseClubFeyenoord 
                   0.72677                    1.40463 
OffenseClubHeracles Almelo       OffenseClubNAC Breda 
                   0.62502                    0.38933 
            OffenseClubNEC             OffenseClubPSV 
                   0.26583                    1.77970 
   OffenseClubRKC Waalwijk         OffenseClubRoda JC 
                   0.07833                    0.63467 
  OffenseClubSC Heerenveen         OffenseClubVitesse 
                   1.64085                    0.42778 
      OffenseClubVVV-Venlo 
                   0.14849 

Threshold coefficients:
    0|1     1|2     2|3     3|4     4|5     5|6     6|7 
-0.5750  0.8584  1.9763  2.9661  4.1504  4.9699  6.2902 

predict(pol1,top,type='p')
          0         1         2          3          4           5           6
1 0.3476590 0.3431691 0.1815278 0.07606575 0.03521247 0.009087138 0.005324424
2 0.2683624 0.3376048 0.2187080 0.10209439 0.04962636 0.013063007 0.007703923
            7
1 0.001954373
2 0.002837110

predict(clm1,top,type='p')
$fit
          0         1         2          3          4           5           6
1 0.3476380 0.3431713 0.1815361 0.07607152 0.03521588 0.009087829 0.005325017
2 0.2684004 0.3376141 0.2186884 0.10207952 0.04961821 0.013060380 0.007702605
            7
1 0.001954409
2 0.002836354

Selecting a start model

The feasible models are along the same lines as before. Offense, defense, home/away team, first/second half of the season. It is obvious that teams and hone/away are statistically significant.
clm0 <- clm(oGoals ~1,data=StartData)
clm2 <- clm(oGoals ~OffenseClub + DefenseClub,data=StartData)
clm3 <- clm(oGoals ~OffenseClub + DefenseClub + 
            OffThuis,data=StartData)
anova(clm0,clm1,clm2,clm3)
Likelihood ratio tests of cumulative link models:
     formula:                                      link: threshold:
clm0 oGoals ~ 1                                    logit flexible  
clm1 oGoals ~ OffenseClub                          logit flexible  
clm2 oGoals ~ OffenseClub + DefenseClub            logit flexible  
clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis logit flexible  

     no.par    AIC   logLik LR.stat df Pr(>Chisq)    
clm0      7 2038.1 -1012.03                          
clm1     24 1977.8  -964.88  94.305 17  9.984e-13 ***
clm2     41 1951.1  -934.54  60.677 17  8.122e-07 ***
clm3     42 1922.3  -919.14  30.806  1  2.851e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
So, how does this model clm3 compare with model3? Using a slightly redrafted function fbpredict (see bottom of post) the prediction Roda JC vs FC Utrecht the individual outcomes are a bit different, but the summary is close enough not to make much differences. This is not so strange, in many ways the same model is build.
fbpredict(clm3,"Roda JC","FC Utrecht")
$details
Roda JC in rows against FC Utrecht in columns 
  0      1      2      3      4      5      6      7     
0 0.0173 0.0424 0.0477 0.0297 0.0155 0.0040 0.0024 0.0009
1 0.0349 0.0856 0.0964 0.0599 0.0313 0.0082 0.0048 0.0018
2 0.0303 0.0741 0.0835 0.0519 0.0271 0.0071 0.0042 0.0015
3 0.0153 0.0375 0.0423 0.0263 0.0137 0.0036 0.0021 0.0008
4 0.0072 0.0176 0.0198 0.0123 0.0064 0.0017 0.0010 0.0004
5 0.0018 0.0044 0.0049 0.0031 0.0016 0.0004 0.0002 0.0001
6 0.0010 0.0026 0.0029 0.0018 0.0009 0.0002 0.0001 0.0001
7 0.0004 0.0009 0.0010 0.0006 0.0003 0.0001 0.0001 0     

$`summary chances`
   Roda JC      equal FC Utrecht 
 0.4603544  0.2196707  0.3199749 

Selecting model extensions; interactions

It is also possible to extend the model. This shows that there is an interaction between defense capabilities and playing home or away. The statistical significance is about p=0.06, which is low enough to consider this model for prediction purposes.

clm4a <- clm(oGoals ~OffenseClub*OffThuis + DefenseClub 
     ,data=StartData)
clm4b <- clm(oGoals ~OffenseClub + DefenseClub*OffThuis 
     ,data=StartData)
clm5 <- clm(oGoals ~(OffenseClub + DefenseClub)*OffThuis 
     ,data=StartData)
anova (clm3,clm4a,clm5)
Likelihood ratio tests of cumulative link models:

      formula:                                        link: threshold:
clm3  oGoals ~ OffenseClub + DefenseClub + OffThuis   logit flexible  
clm4a oGoals ~ OffenseClub * OffThuis + DefenseClub   logit flexible  
clm5  oGoals ~ (OffenseClub + DefenseClub) * OffThuis logit flexible  

      no.par    AIC  logLik LR.stat df Pr(>Chisq)  
clm3      42 1922.3 -919.14                        
clm4a     59 1938.2 -910.08  18.115 17    0.38164  
clm5      76 1945.6 -896.81  26.552 17    0.06497 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
anova (clm3,clm4b,clm5)
Likelihood ratio tests of cumulative link models:

      formula:                                        link: threshold:
clm3  oGoals ~ OffenseClub + DefenseClub + OffThuis   logit flexible  
clm4b oGoals ~ OffenseClub + DefenseClub * OffThuis   logit flexible  
clm5  oGoals ~ (OffenseClub + DefenseClub) * OffThuis logit flexible  

      no.par    AIC  logLik LR.stat df Pr(>Chisq)  
clm3      42 1922.3 -919.14                        
clm4b     59 1930.7 -906.36  25.560 17    0.08286 .
clm5      76 1945.6 -896.81  19.107 17    0.32242  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
The implication is that model 'clm4b' is now considered an alternative for model 'clm3'. This makes for an interesting change in predictions. It can be seen that with this model FC Utrecht has a better chance to win than with 'clm3'.
fbpredict(clm4b,"Roda JC","FC Utrecht")
$details
Roda JC in rows against FC Utrecht in columns 
  0      1      2      3      4      5      6      7     
0 0.0379 0.0689 0.0503 0.0220 0.0093 0.0022 0.0012 0.0004
1 0.0701 0.1274 0.0931 0.0406 0.0172 0.0040 0.0023 0.0008
2 0.0523 0.0950 0.0694 0.0303 0.0128 0.0030 0.0017 0.0006
3 0.0231 0.0420 0.0307 0.0134 0.0057 0.0013 0.0007 0.0003
4 0.0098 0.0179 0.0131 0.0057 0.0024 0.0006 0.0003 0.0001
5 0.0023 0.0041 0.0030 0.0013 0.0006 0.0001 0.0001 0     
6 0.0013 0.0024 0.0017 0.0008 0.0003 0.0001 0      0     
7 0.0005 0.0008 0.0006 0.0003 0.0001 0      0      0     

$`summary chances`
   Roda JC      equal FC Utrecht 
 0.3696674  0.2506325  0.3797000 

Winter break

On top of that, the effect of before/after winter break can be examined. It would seem that winter break does have an effect. There is not much point in trying to make predictions using the model with winter break. If winter break has an effect, so does summer break. This does however, fit well with the remark of Huub on  a previous post 'I tried to predict the results using only the data from the present year, and got four correct (FC Utrecht, FC Groningen, FC Twente, PSV). Maybe it is an idea to combine both the data of last year and of the current season and give a weight to the current year of, for example, 2. In this way there is still enough information in the model but it also accounts for more recent results...'. Difference between years is a reason for more recent data to perform better. 
StartData$year <- factor(c(substr(old$Datum,1,4),substr(old$Datum,1,4)))
clm6 <- clm(oGoals ~OffenseClub + DefenseClub  + year + OffThuis 
     ,data=StartData)
clm7 <- clm(oGoals ~(OffenseClub + DefenseClub)*year + OffThuis 
     ,data=StartData)
anova (clm3,clm6,clm7)
Likelihood ratio tests of cumulative link models:
     formula:                                               link: threshold:
clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis          logit flexible  
clm6 oGoals ~ OffenseClub + DefenseClub + year + OffThuis   logit flexible  
clm7 oGoals ~ (OffenseClub + DefenseClub) * year + OffThuis logit flexible  

     no.par    AIC  logLik LR.stat df Pr(>Chisq)  
clm3     42 1922.3 -919.14                        
clm6     43 1924.2 -919.12  0.0341  1    0.85356  
clm7     77 1945.7 -895.84 46.5557 34    0.07407 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

anova(clm3,clm7)
Likelihood ratio tests of cumulative link models:
     formula:                                               link: threshold:
clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis          logit flexible  
clm7 oGoals ~ (OffenseClub + DefenseClub) * year + OffThuis logit flexible  

     no.par    AIC  logLik LR.stat df Pr(>Chisq)  
clm3     42 1922.3 -919.14                        
clm7     77 1945.7 -895.84   46.59 35    0.09106 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Even more complex model

A final step is to combine the good of model 'clm4b' and 'clm7'. It would seem these effects can very well be combined. This means that at some point a year effect must be included.

clmX <- clm(oGoals ~(OffenseClub + DefenseClub)*year + DefenseClub*OffThuis 
     ,data=StartData)
anova(clm3,clmX)
Likelihood ratio tests of cumulative link models:

     formula:                                                             link:
clm3 oGoals ~ OffenseClub + DefenseClub + OffThuis                        logit
clmX oGoals ~ (OffenseClub + DefenseClub) * year + DefenseClub * OffThuis logit
     threshold:
clm3 flexible  
clmX flexible  

     no.par    AIC  logLik LR.stat df Pr(>Chisq)  
clm3     42 1922.3 -919.14                        
clmX     94 1951.3 -881.67  74.942 52     0.0203 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
anova(clm4b,clmX)
Likelihood ratio tests of cumulative link models:

      formula:                                                            
clm4b oGoals ~ OffenseClub + DefenseClub * OffThuis                       
clmX  oGoals ~ (OffenseClub + DefenseClub) * year + DefenseClub * OffThuis
      link: threshold:
clm4b logit flexible  
clmX  logit flexible  

      no.par    AIC  logLik LR.stat df Pr(>Chisq)  
clm4b     59 1930.7 -906.36                        
clmX      94 1951.3 -881.67  49.383 35    0.05424 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
anova(clm7,clmX)
Likelihood ratio tests of cumulative link models:

     formula:                                                             link:
clm7 oGoals ~ (OffenseClub + DefenseClub) * year + OffThuis               logit
clmX oGoals ~ (OffenseClub + DefenseClub) * year + DefenseClub * OffThuis logit
     threshold:
clm7 flexible  
clmX flexible  

     no.par    AIC  logLik LR.stat df Pr(>Chisq)  
clm7     77 1945.7 -895.84                        
clmX     94 1951.3 -881.67  28.353 17    0.04098 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Additional R code

fbpredict <- function(object,club1,club2) {
  UseMethod('fbpredict',object)
}

fbpredict.polr <- function(object,club1,club2) {
  top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))
  prepred <- predict(object,top,type='p')
  oo <- outer(prepred[2,],prepred[1,])
  rownames(oo) <- 0:(ncol(prepred)-1)
  colnames(oo) <- rownames(oo)
  class(oo) <- c('fboo',class(oo))
  attr(oo,'row') <- club1
  attr(oo,'col') <- club2
  wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))
  names(wel) <- c(club1,'equal',club2)
  return(list(details=oo,'summary chances'=wel))
}

fbpredict.clm <- function(object,club1,club2) {
  top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))
  prepred <- predict(object,top,type='p')$fit
  oo <- outer(prepred[2,],prepred[1,])
  rownames(oo) <- 0:(ncol(prepred)-1)
  colnames(oo) <- rownames(oo)
  class(oo) <- c('fboo',class(oo))
  attr(oo,'row') <- club1
  attr(oo,'col') <- club2
  wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))
  names(wel) <- c(club1,'equal',club2)
  return(list(details=oo,'summary chances'=wel))
}

print.fboo <- function(x,...) {
  cat(attr(x,'row'),'in rows against',attr(x,'col'),'in columns \n')
  class(x) <- class(x)[-1]
  attr(x,'row') <- NULL
  attr(x,'col') <- NULL
  oo <- formatC(x,format='f',width=4)
  oo <- gsub('\\.0+$','       ',oo)
  oo <- substr(oo,1,6)
  print(oo,quote=FALSE,justify='left')
}

Sunday, September 23, 2012

Football model; plots and usage

After reading data, making a predictions display and building a football data model it is time to put this to validate a bit more (regression plots) and put to usage. It appears that the regression plots in the car package were not very informative for this model, except to find unexpected results. The predictions are not good at all.

Roda JC-FC Utrecht

First of all, when I started this post on Friday, Roda JC played against FC Utrecht, 0-1. FC Utrecht goes sub top. Frankly, I am not very surprised. While I would expect Roda JC to win this game (p=0.46), chance for FC Utrecht to win is 0.33, so that is not so strange but is disappointing for my model.
fbpredict(model3,"Roda JC","FC Utrecht")
$details
Roda JC in rows against FC Utrecht in columns 
  0      1      2      3      4      5      6      7      8      9     
0 0.0241 0.0489 0.0494 0.0334 0.0169 0.0068 0.0023 0.0007 0.0002 0     
1 0.0410 0.0830 0.0841 0.0567 0.0287 0.0116 0.0039 0.0011 0.0003 0.0001
2 0.0349 0.0706 0.0714 0.0482 0.0244 0.0099 0.0033 0.0010 0.0002 0.0001
3 0.0198 0.0400 0.0405 0.0273 0.0138 0.0056 0.0019 0.0005 0.0001 0     
4 0.0084 0.0170 0.0172 0.0116 0.0059 0.0024 0.0008 0.0002 0.0001 0     
5 0.0029 0.0058 0.0058 0.0039 0.0020 0.0008 0.0003 0.0001 0      0     
6 0.0008 0.0016 0.0017 0.0011 0.0006 0.0002 0.0001 0      0      0     
7 0.0002 0.0004 0.0004 0.0003 0.0001 0.0001 0      0      0      0     
8 0      0.0001 0.0001 0.0001 0      0      0      0      0      0     
9 0      0      0      0      0      0      0      0      0      0     

$`summary chances`
   Roda JC      equal FC Utrecht 
 0.4580659  0.2126926  0.3291782 

Diagnostic plots

But, I should have looked at some of the diagnostic plots first. Luckily between the stats and the car package we have a nice collection of tools.
library(car)
infIndexPlot(model3)
A number of things are interesting. Just under index 100 is the most influential observation. We can track this to row 97. This represents Groningen making six ! goals against Feyenoord. That's the best FC Groningen ever did against Feyenoord. It was also found to be a remarkable result at that time, so if this were not influential or outlyingI should be worried.
outlierTest(model3)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
   rstudent unadjusted p-value Bonferonni p
97 3.634049         0.00027901      0.17075
StartData[97,]
    OffenseClub DefenseClub Goals OffThuis
97 FC Groningen   Feyenoord     6        1
A second 'feature'in the plots is the higher hat value in the first part of the series. This has to do with the variable OffThuis. Removing OffThuis makes the effect go away. Note that the diagonal of the hat matrix is everywhere the same (within numerical accuracy):
mm <- model.matrix(model3)
table(diag(mm %*% solve(t(mm) %*%  mm) %*% t(mm)))
0.0588235294117647 0.0588235294117648 0.0588235294117649 
               179                427                  6 
What would be different are the weights of the data. In the non-gaussian GLM framework these are dependent on the expected values. The slightly higher expected values when the attacking club plays home make the hat matrix in general a bit higher. Interpreting these plots is clearly more difficult in the non linear regression case.

ResidualPlot

Residual plot does not reveal much more than the previous plot. Record 97 has a relatively high residual. The advantage of this plot is that the outliers are more clearly numbered. In the sub-plot bottom right the bunching of the residuals in lines is because the observed data is non-continuous; from bottom to top these are the values for 0 to 7 goals. No goals gives a negative Pearson residual. No surprise there.   
residualPlots(model3)

Predicting

The teams have been playing in the weekend. The model can be used to predict the outcomes. To start, here are the results. The data in gray represent two teams new in Eredivisie in the season, so I got no predictions.
Roda JC      - Utrecht      0-1
PEC Zwolle   - Groningen    1-2
RKC Waalwijk - VVV          1-1
Vitesse      - Heracles     1-1
NEC          - Willem II    0-0
ADO Den Haag - Ajax         1-1
Twente       - Heerenveen   1-0
NAC Breda    - AZ           2-1
PSV          - Feyenoord    3-0
These are the expectations:
         club1           club2      win1     equal      win2
1      Roda JC      FC Utrecht 0.4580659 0.2126926 0.3291782
2 RKC Waalwijk       VVV-Venlo 0.6076020 0.2180298 0.1743364
3      Vitesse Heracles Almelo 0.5723334 0.2275537 0.2000907
4 ADO Den Haag            Ajax 0.1037534 0.1511710 0.7446496
5    FC Twente   SC Heerenveen 0.6605607 0.1558135 0.1822923
6    NAC Breda              AZ 0.2539698 0.2627759 0.4832506
7          PSV       Feyenoord 0.5055082 0.2147899 0.2796468
Needless to say, I am not impressed. Choosing the most probable outcome as result, I have two correct Twente - Heerenveen and PSV - Feyenoord.
Last week flo2speak commented that (s)he tried to do the same in German football and had 30% correct. I am in the same ballpark. Maybe I should try the ordinal regression too. On top of that; I never predict ties, this weekend had four games tied, and season 2011-2012 had 20% of the games tied. Clearly improvement is needed.

Prediction code

To predict for a weekend I made a wrapper around fbpredict in which a data frame can be used. Not the most efficient code, but for this size of calculation that is not a problem. Promoted teams which are lacking data are also removed. 
topred <- read.table(textConnection("
'Roda JC'        'FC Utrecht'
'PEC Zwolle'     'FC Groningen'
'RKC Waalwijk'   'VVV-Venlo'
'Vitesse'        'Heracles Almelo'
'NEC'            'Willem II'
'ADO Den Haag'   'Ajax'
'FC Twente'      'SC Heerenveen'
'NAC Breda'      'AZ'
'PSV'            'Feyenoord'"
),col.names=c('Off','Def'))

morepred <- function(topred) {
  topred <- topred[topred[,1] %in% levels(model3$data$OffenseClub) & 
                   topred[,2] %in% levels(model3$data$OffenseClub) ,]
  ap <- lapply(1:nrow(topred),function(irow) {
        fbp <- fbpredict(model3,as.character(topred[irow,1]),
                                as.character(topred[irow,2]))
        sec2 <- fbp[[2]]
        mydf <- data.frame(club1=topred[irow,1],
                           club2=topred[irow,2],
                           win1=sec2[1],
                           equal=sec2[2],
                           win2=sec2[3])
      })
  dc <- do.call(rbind,ap)
  rownames(dc) <- 1:nrow(dc)
  dc
}
morepred(topred)



Sunday, September 16, 2012

Football model

After reading Dutch football data (Eeredivisie 2011-2012) and making a predictions display it is time to look at a few simple models to predict goals. To reiterate the data setup, each game played consists of two rows in the data frame. One row for the number of goals the home playing team makes, another row for the away team. We start with four models. Two models I don't believe in; A zero  model where the number of goals is independent of the clubs and everything, model 1 where the number of goals is only dependent on the team making the goals. Two other models are probable. Model 2, both the attacking and the defending team determine the number of goals, finally, model 3, both teams determine the number of goals, but also who is playing at home.

model0 <- glm(Goals ~ 1,data=StartData,family='poisson')
model1 <- glm(Goals ~OffenseClub,data=StartData,family='poisson')
model2 <- glm(Goals ~OffenseClub + DefenseClub,data=StartData,family='poisson')
model3 <- glm(Goals ~OffenseClub + DefenseClub +
                     OffThuis,data=StartData,family='poisson')
anova (model0,model1,model2,model3,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ 1
Model 2: Goals ~ OffenseClub
Model 3: Goals ~ OffenseClub + DefenseClub
Model 4: Goals ~ OffenseClub + DefenseClub + OffThuis
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       611     865.23                          
2       594     754.17 17  111.064 7.610e-16 ***
3       577     699.13 17   55.043 6.743e-06 ***
4       576     668.96  1   30.172 3.955e-08 ***
It appears that modelling step which makes the model more complex is significant, we must reject the hypothesis that any of these terms is not relevant. Hence the number of goals is dependent on the teams plus a home team effect.

The twelfth man 

It does make a difference who is playing at home. In practical terms, due to the model used, this advantage  is difficult to interpret. In general, when two clubs of equal strength play each other, they each make 1.3 goals.
exp(coef(model2)[1])
(Intercept) 
   1.346538 
When one of these equally strong teams plays away, the other at home, the numbers change. A team playing at home makes 1.6 goals, while playing away only 1.1. 
exp(coef(model3)[length(coef(model3))] + coef(model3)[1])
OffThuis 
 1.58019 
exp(coef(model3)[1])
(Intercept) 
   1.112886 
This would make playing away or at home both statistical and practically significant. Note that the size of this effect can not be transferred to other circumstances.

The teams 

Each of the teams has two parameters in the model. These can be most easily be interpreted as offensive and defensive power. The following code plots these powers.
co <- coef(model3)
coO <- co[grep('Offense',names(co))]
coD <- co[grep('Defense',names(co))]
names(coO) <- gsub('OffenseClub','',names(coO))
names(coD) <- gsub('DefenseClub','',names(coD))
# Ado Den Haag is missing in the parameterization. so it is added.
coB <- rbind(cbind(coO,coD),matrix(c(0,0)
             ,nrow=1,,dimnames=list('Ado Den Haag',c('coO','coD'))))
# scaled for relative strength 
coB <- as.data.frame(scale(coB,scale=FALSE)) 
# -coD to make more defensive power visually larger
plot(-coD ~coO, type='n', data=coB,xlab='Offensive power',ylab='Defensive power',axes=FALSE)
text(-coD ~coO,data=coB,labels=rownames(coB))
abline(a=0,b=1)
abline(v=0)
abline(h=0)
The plot shows the axes, a team close to the centre (NAC Breda, FC Utrecht) was average in both offensive and defensive strength. A diagonal line depicts the equal defense and offense strength region. Hence Feyenoord is equally strong in offense and defense, same for De Graafschap. The line is not quite diagonal, the range in in offense strength is larger than the range in defense strength. The best teams is top right; Ajax. The worst teams are bottom left; De Graafschap and Excelsior have relegated to eerste divisie. A few clubs are noticeable for their mismatch in offensive and defensive strengths. SC Heerenveen has almost the same goal making power as Ajax, but not enough defensive capacity. In contrast, Vitesse won't receive many goals, but lacks the power to make the goals. Overall they have about the same strength.
Otherwise stated; if SC Heerenveen played against itself. ignoring home team advantage, it would probably make two or even three goals. 
fbpredict(model2,'SC Heerenveen','SC Heerenveen')[[1]]
SC Heerenveen in rows against SC Heerenveen in columns 
  0      1      2      3      4      5      6      7      8      9     
0 0.0060 0.0153 0.0196 0.0167 0.0107 0.0055 0.0023 0.0009 0.0003 0.0001
1 0.0153 0.0391 0.0501 0.0428 0.0274 0.0140 0.0060 0.0022 0.0007 0.0002
2 0.0196 0.0501 0.0641 0.0548 0.0351 0.0180 0.0077 0.0028 0.0009 0.0003
3 0.0167 0.0428 0.0548 0.0467 0.0299 0.0153 0.0065 0.0024 0.0008 0.0002
4 0.0107 0.0274 0.0351 0.0299 0.0192 0.0098 0.0042 0.0015 0.0005 0.0001
5 0.0055 0.0140 0.0180 0.0153 0.0098 0.0050 0.0021 0.0008 0.0003 0.0001
6 0.0023 0.0060 0.0077 0.0065 0.0042 0.0021 0.0009 0.0003 0.0001 0     
7 0.0009 0.0022 0.0028 0.0024 0.0015 0.0008 0.0003 0.0001 0      0     
8 0.0003 0.0007 0.0009 0.0008 0.0005 0.0003 0.0001 0      0      0     
9 0.0001 0.0002 0.0003 0.0002 0.0001 0.0001 0      0      0      0     
If Vitesse played against itself it would make zero or one goal.
fbpredict(model2,'Vitesse','Vitesse')[[1]]
Vitesse in rows against Vitesse in columns 
  0      1      2      3      4      5      6      7      8      9     
0 0.1165 0.1252 0.0673 0.0241 0.0065 0.0014 0.0002 0      0      0     
1 0.1252 0.1346 0.0724 0.0259 0.0070 0.0015 0.0003 0      0      0     
2 0.0673 0.0724 0.0389 0.0139 0.0037 0.0008 0.0001 0      0      0     
3 0.0241 0.0259 0.0139 0.0050 0.0013 0.0003 0.0001 0      0      0     
4 0.0065 0.0070 0.0037 0.0013 0.0004 0.0001 0      0      0      0     
5 0.0014 0.0015 0.0008 0.0003 0.0001 0      0      0      0      0     
6 0.0002 0.0003 0.0001 0.0001 0      0      0      0      0      0     
7 0      0      0      0      0      0      0      0      0      0     
8 0      0      0      0      0      0      0      0      0      0     
9 0      0      0      0      0      0      0      0      0      0    

model extensions

The Residual deviance of model3 is 668.96 on 576 degrees of freedom. That might mean some more effects can be found in the data. 

twelfth man and teams

The first extension is that home and away advantage is different between teams. Based on these data, this does not seem to be statistically significant.
model4a <- glm(Goals ~OffenseClub*OffThuis + DefenseClub 
                       ,data=StartData,family='poisson')
model4b <- glm(Goals ~OffenseClub + DefenseClub*OffThuis 
                       ,data=StartData,family='poisson')
model5 <- glm(Goals ~(OffenseClub + DefenseClub)*OffThuis 
                       ,data=StartData,family='poisson')
anova (model3,model4a,model5,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub * OffThuis + DefenseClub
Model 3: Goals ~ (OffenseClub + DefenseClub) * OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96                     
2       559     649.00 17   19.953   0.2766
3       542     626.77 17   22.236   0.1758
anova (model3,model4b,model5,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub + DefenseClub * OffThuis
Model 3: Goals ~ (OffenseClub + DefenseClub) * OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96                     
2       559     647.46 17   21.499   0.2048
3       542     626.77 17   20.690   0.2404

Before and after winter break

Winter break has the possibility to change players. It might be, that teams change in quality in this period. In these data, it seems this effect is not statistically significant.
StartData$year <- factor(c(substr(old$Datum,1,4),substr(old$Datum,1,4)))
model6 <- glm(Goals ~OffenseClub + DefenseClub  + year + OffThuis 
             ,data=StartData,family='poisson')
model7 <- glm(Goals ~(OffenseClub + DefenseClub)*year + OffThuis 
             ,data=StartData,family='poisson')
anova (model3,model6,model7,test='Chisq')
Analysis of Deviance Table

Model 1: Goals ~ OffenseClub + DefenseClub + OffThuis
Model 2: Goals ~ OffenseClub + DefenseClub + year + OffThuis
Model 3: Goals ~ (OffenseClub + DefenseClub) * year + OffThuis
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       576     668.96                     
2       575     668.82  1    0.135   0.7129
3       541     625.48 34   43.345   0.1308

Sunday, September 9, 2012

Football predictions display

Having looked at the football data earlier, I wanted to look at predictions for new games. This consists of two parts, getting a predictive model, predicting and displaying the predictions. I decided to do this backwards, first to make the displays. This will make things easier when the time is there to compare models. To get the predictions I use a very simple model, which basically states, a club makes about x goals, irrespective of all other conditions. I don't believe this model, but it can give predictions.
model1 <- glm(Goals ~OffenseClub,data=StartData,family='poisson')
The consequence of this setup is that each game needs two predictions, one for the first club, one for the second. For clubs Vitesse and FC Groningen are used to make the predictions.

The prediction of the glm is a mean number of goals, which is still quite far from the reality of a number of goals. For this I use the Poisson distribution and treat the prediction as true. I do not include overdispersion nor standard error of parameters. The result shows FC Groningen has 30% of getting no goals, 35% chance of getting 1 goal, 22 % for two goals, after which the chances become quickly very low. 
top <- data.frame(OffenseClub=c('FC Groningen','Vitesse'))
prepred <- predict(model1,top,type='response')
(dp1 <- dpois(0:5,prepred[1]))[1:4]
[1] 0.29942768 0.36107456 0.21770672 0.08750956
Finally, the predictions need to be combined, to get a pair of goals. Not surprisingly, if the chance of a particular outcome, such as one goal, is 30%, then the chance of a pair of outcomes, such as 1-1 may be 30%*30%=9%. In this case it turns out to be slightly higher, 12%. This is the most probable outcome too. 
dp2 <- dpois(0:6,prepred[2])
oo <- outer(dp1,dp2)
rownames(oo) <- 0:6
colnames(oo) <- 0:6
round(oo,digits=3)
      0     1     2     3     4     5     6
0 0.073 0.103 0.073 0.034 0.012 0.003 0.001
1 0.088 0.124 0.088 0.041 0.015 0.004 0.001
2 0.053 0.075 0.053 0.025 0.009 0.002 0.001
3 0.021 0.030 0.021 0.010 0.004 0.001 0.000
4 0.006 0.009 0.006 0.003 0.001 0.000 0.000
5 0.002 0.002 0.002 0.001 0.000 0.000 0.000
6 0.000 0.000 0.000 0.000 0.000 0.000 0.000
It will be more useful to summarize the outcome as win-equal-lost. These are extracted as sums of probabilities. 
c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))
[1] 0.4167411 0.2612236 0.3211237
It is practical to fit all this in a little function which creates these data in one go. The only new things are the introduction of a new class fboo which is used to direct the prediction to the appropriate accompanying print function and some attributes to administrate the clubs predicted.
fbpredict <- function(object,club1,club2) {
  top <- data.frame(OffenseClub=c(club1,club2),DefenseClub=c(club2,club1),OffThuis=c(1,0))
  prepred <- predict(object,top,type='response')
  dp1 <- dpois(0:9,prepred[1])
  dp2 <- dpois(0:9,prepred[2])
  oo <- outer(dp2,dp1)
  rownames(oo) <- 0:9
  colnames(oo) <- 0:9
  class(oo) <- c('fboo',class(oo))
  attr(oo,'row') <- club1
  attr(oo,'col') <- club2
  wel <- c(sum(oo[upper.tri(oo)]),sum(diag(oo)),sum(oo[lower.tri(oo)]))
  names(wel) <- c(club1,'equal',club2)
  return(list(details=oo,'summary chances'=wel))
}

print.fboo <- function(x,...) {
  cat(attr(x,'row'),'in rows against',attr(x,'col'),'in columns \n')
  class(x) <- class(x)[-1]
  attr(x,'row') <- NULL
  attr(x,'col') <- NULL
  oo <- formatC(x,format='f',width=4) # fixed format
  oo <- gsub('\\.0+$','       ',oo)   # replace trailing 0 by ' '
  oo <- substr(oo,1,6)                # and fix the width
  print(oo,quote=FALSE,justify='left')
}

fbpredict(model1,'FC Groningen','Vitesse')
$details
FC Groningen in rows against Vitesse in columns 
  0      1      2      3      4      5      6      7      8      9     
0 0.0730 0.0880 0.0531 0.0213 0.0064 0.0016 0.0003 0.0001 0      0     
1 0.1030 0.1242 0.0749 0.0301 0.0091 0.0022 0.0004 0.0001 0      0     
2 0.0727 0.0877 0.0529 0.0213 0.0064 0.0015 0.0003 0.0001 0      0     
3 0.0342 0.0413 0.0249 0.0100 0.0030 0.0007 0.0001 0      0      0     
4 0.0121 0.0146 0.0088 0.0035 0.0011 0.0003 0.0001 0      0      0     
5 0.0034 0.0041 0.0025 0.0010 0.0003 0.0001 0      0      0      0     
6 0.0008 0.0010 0.0006 0.0002 0.0001 0      0      0      0      0     
7 0.0002 0.0002 0.0001 0      0      0      0      0      0      0     
8 0      0      0      0      0      0      0      0      0      0     
9 0      0      0      0      0      0      0      0      0      0     

$`summary chances`
FC Groningen        equal      Vitesse 
   0.3213815    0.2612237    0.4173918