Saturday, September 19, 2015

Predicting Titanic deaths on Kaggle VI: Stan

It is a bit of a contradiction: Kaggle provides competitions on data science, while Stan is clearly part of (Bayesian) statistics. Yet after using random forests, boosting and bagging, I think this problem also has a suitable size for Stan, which I understand can handle larger problems than older Bayesian software such as JAGS.
What I aim to do is enter a large number of variables into the Stan model. Aliasing will be ignored, and I hope the hierarchical model will provide suitable shrinkage for the terms which are not relevant.

Data

The data have been mildly adapted from the previous posts. The biggest change is that age has been made into a factor, based on the value when present, with a special level for missing values (a small sketch of this recoding is shown below). The alternative, in the context of a Bayesian model, would be to make imputing age part of the model, but that is more complex than I am willing to go at this point.
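
As a sketch (the same lines appear in the data-reading code further down), the age factor is built by giving the missing ages a dummy value and then cutting into intervals, so the highest interval becomes the 'missing' level:

tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 999   # give missing ages a dummy value of their own
tt$age <- cut(tt$age, c(0,2,5,9,12,15,21,55,65,100,1000))  # the last interval holds the missings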

Model

The idea of the model is pretty simple: it is a logistic regression with only factors as independent variables. All levels of all factors are used. So when Sex enters the model, it adds two parameter values, one for male and one for female (see the code below, variable fsex); when passenger class enters the model, it adds three values (variable fpclass). The parameter values get a prior distribution; I have chosen a normal prior around zero, with standard deviations sdsex and sdpclass respectively. These standard deviations in turn get a common prior with scale sd1, and for sd1 itself I have used both a half-normal with standard deviation 3 (as below) and a uniform(0,3).
  parameters {
    real fsex[2];
    real intercept;
    real fpclass[3];
    real <lower=0> sdsex;
    real <lower=0> sdpclass;
    real <lower=0> sd1;
  }
  transformed parameters {
    real expect[ntrain];
    for (i in 1:ntrain) {
      expect[i] <- inv_logit(
        intercept+
        fsex[sex[i]]+
        fpclass[pclass[i]]
        );
      }    
  }
  model {        
    fsex ~ normal(0,sdsex);
    fpclass ~ normal(0,sdpclass);

    sdsex ~  normal(0,sd1);
    sdpclass ~ normal(0,sd1);
    sd1 ~ normal(0,3);
    intercept ~ normal(0,1);
    survived ~ bernoulli(expect);
  }

Predictions

In this phase I have decided to make the predictions within the Stan program. The approach that worked is to duplicate all independent variables for the test set and compute the predictions in the generated quantities section of the program. These predictions are actually probabilities of survival for each passenger for each MCMC sample, so a bit of post-processing is needed: the mean probability is calculated and a cut-off of 0.5 decides the final classification (see the sketch below).
Since the Stan code runs pretty quickly once it has been compiled, it is feasible to run this as a two-stage process: a first run to examine the model parameters, and a second run just to obtain the predictions.
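
A minimal sketch of that post-processing, assuming fit2 is the second-stage fit that keeps only pred (the full version is in the code section at the bottom of the post):

sims <- as.matrix(fit2)                        # one row per posterior draw, one column per pred[i] (plus lp__)
sims <- sims[, grep('^pred', colnames(sims))]  # keep only the prediction columns
phat <- colMeans(sims)                         # posterior mean survival probability per passenger
decide <- as.numeric(phat > 0.5)               # 0.5 cut-off gives the submitted classification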

Model 1

This is the parameter output of a model with just sex and passenger class. It has been added here mostly to show, with a simple fully coded example, what the code and output look like (an illustration converting the estimates into probabilities follows after the output). Since sdsex is bigger than sdpclass, sex had a bigger influence than passenger class on the chance of survival.
Inference for Stan model: 8fb625a6ccf29aab919e1dcd494247aa.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
intercept    -0.07    0.03 0.83   -1.68   -0.65   -0.07    0.49    1.61   670 1.00
fsex[1]       1.40    0.04 0.89   -0.46    0.83    1.40    1.98    3.07   533 1.01
fsex[2]      -1.24    0.04 0.89   -3.11   -1.81   -1.23   -0.65    0.45   529 1.01
fpclass[1]    0.94    0.03 0.70   -0.40    0.50    0.93    1.33    2.34   536 1.01
fpclass[2]    0.12    0.03 0.69   -1.24   -0.31    0.11    0.52    1.55   527 1.01
fpclass[3]   -0.93    0.03 0.69   -2.25   -1.35   -0.94   -0.50    0.47   499 1.01
sdsex         2.06    0.05 1.20    0.76    1.23    1.74    2.49    5.36   648 1.00
sdpclass      1.40    0.04 0.90    0.50    0.84    1.15    1.67    3.86   661 1.00
sd1           2.41    0.04 1.27    0.71    1.45    2.18    3.11    5.48  1071 1.00
lp__       -421.36    0.09 2.12 -426.25 -422.56 -421.00 -419.82 -418.19   616 1.00

Samples were drawn using NUTS(diag_e) at Sun Sep 13 12:28:49 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).
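
As an illustration, not part of the original run, the posterior means above can be pushed through the inverse logit to get survival probabilities (assuming factor level 1 = female and 2 = male, the alphabetical order of Sex):

intercept <- -0.07
fsex      <- c(1.40, -1.24)        # level 1 = female, level 2 = male (assumed)
fpclass   <- c(0.94, 0.12, -0.93)  # first, second, third class
plogis(intercept + fsex[1] + fpclass[1])   # first-class female: roughly 0.91
plogis(intercept + fsex[2] + fpclass[3])   # third-class male: roughly 0.10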

Model 2

Model 2 has all variables in it as main effects. Predictions using this model were submitted; it gave a score of 0.78947, which was not an improvement over previous scores. I have one submission using bagging which did better, got the same result with boosting, and worse results with random forest.
Note that while this model looks pretty complex, this score was obtained without any interactions.
Inference for Stan model: 6278b3ebade9802ad9544b1242bada20.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

                mean se_mean    sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
intercept       0.28    0.05  0.74   -1.20   -0.16    0.28    0.77    1.68   229 1.01
sd1             0.80    0.03  0.23    0.43    0.63    0.77    0.97    1.31    63 1.05
fsex[1]         0.24    0.03  0.56   -0.64   -0.01    0.13    0.39    1.85   353 1.01
fsex[2]        -0.05    0.04  0.50   -1.29   -0.24    0.00    0.22    0.89   131 1.02
fpclass[1]      0.54    0.07  0.57   -0.39    0.15    0.53    0.90    1.77    60 1.03
fpclass[2]      0.15    0.06  0.50   -0.68   -0.27    0.16    0.50    1.15    76 1.01
fpclass[3]     -0.78    0.05  0.48   -1.66   -1.10   -0.79   -0.47    0.18   102 1.01
fembarked[1]    0.26    0.02  0.33   -0.29    0.06    0.22    0.43    1.02   298 1.01
fembarked[2]    0.08    0.02  0.32   -0.55   -0.07    0.06    0.20    0.78   456 1.00
fembarked[3]   -0.18    0.02  0.31   -0.80   -0.33   -0.18   -0.01    0.45   378 1.01
foe[1]         -0.32    0.08  0.53   -1.63   -0.65   -0.28    0.04    0.54    41 1.07
foe[2]          0.04    0.02  0.43   -0.73   -0.11    0.01    0.19    1.02   433 1.01
foe[3]          0.34    0.03  0.48   -0.41    0.04    0.26    0.57    1.54   227 1.03
fcabchar[1]    -0.25    0.18  0.46   -1.16   -0.49   -0.13    0.03    0.45     7 1.17
fcabchar[2]    -0.04    0.03  0.35   -0.86   -0.16   -0.04    0.10    0.62   143 1.03
fcabchar[3]     0.07    0.01  0.27   -0.47   -0.06    0.05    0.18    0.75   325 1.03
fcabchar[4]    -0.09    0.04  0.31   -0.85   -0.23   -0.03    0.08    0.48    55 1.05
fcabchar[5]     0.22    0.02  0.33   -0.32    0.00    0.17    0.40    1.03   219 1.02
fcabchar[6]     0.31    0.04  0.36   -0.25    0.02    0.25    0.59    1.12    76 1.03
fcabchar[7]    -0.21    0.06  0.40   -1.03   -0.43   -0.11    0.04    0.45    44 1.05
fage[1]         0.14    0.06  0.32   -0.38   -0.04    0.04    0.28    0.78    27 1.10
fage[2]         0.45    0.14  0.59   -0.12    0.02    0.17    0.77    1.68    19 1.17
fage[3]         0.10    0.15  0.53   -0.74   -0.15   -0.01    0.12    1.37    13 1.24
fage[4]        -0.10    0.01  0.31   -0.86   -0.25   -0.05    0.03    0.48   558 1.01
fage[5]         0.17    0.10  0.43   -0.52   -0.04    0.04    0.25    1.07    20 1.14
fage[6]         0.00    0.01  0.21   -0.42   -0.09   -0.01    0.09    0.47   250 1.01
fage[7]         0.09    0.03  0.20   -0.32   -0.01    0.05    0.22    0.47    50 1.05
fage[8]        -0.22    0.03  0.31   -1.06   -0.34   -0.15   -0.01    0.17   129 1.03
fage[9]        -0.02    0.07  0.39   -0.98   -0.16   -0.01    0.12    0.60    27 1.09
fage[10]        0.00    0.01  0.19   -0.43   -0.06    0.01    0.06    0.40   813 1.00
fticket[1]      0.07    0.03  0.21   -0.33   -0.06    0.04    0.22    0.51    54 1.05
fticket[2]     -0.04    0.03  0.34   -0.87   -0.17   -0.01    0.17    0.59   108 1.02
fticket[3]     -0.17    0.05  0.43   -1.23   -0.38   -0.09    0.06    0.60    69 1.04
fticket[4]      0.01    0.05  0.31   -0.48   -0.19    0.01    0.19    0.69    40 1.06
fticket[5]      0.02    0.02  0.36   -0.64   -0.13   -0.02    0.15    0.95   210 1.01
fticket[6]      0.03    0.05  0.32   -0.63   -0.15    0.02    0.21    0.55    37 1.07
fticket[7]      0.14    0.02  0.38   -0.40   -0.05    0.05    0.26    1.18   245 1.01
fticket[8]      0.01    0.01  0.35   -0.84   -0.12    0.02    0.20    0.73   929 1.01
fticket[9]      0.01    0.01  0.32   -0.69   -0.12    0.03    0.14    0.65   766 1.00
fticket[10]    -0.23    0.04  0.46   -1.53   -0.39   -0.08    0.03    0.38   149 1.02
fticket[11]    -0.02    0.01  0.32   -0.66   -0.16   -0.04    0.11    0.70  1001 1.01
fticket[12]     0.37    0.05  0.47   -0.16    0.03    0.20    0.57    1.58    82 1.04
fticket[13]    -0.20    0.02  0.34   -1.05   -0.36   -0.14    0.01    0.36   221 1.03
ftitle[1]       1.26    0.08  0.73   -0.03    0.75    1.20    1.74    2.85    79 1.03
ftitle[2]       0.47    0.04  0.70   -1.05    0.13    0.45    0.87    1.79   372 1.01
ftitle[3]      -2.27    0.03  0.66   -3.59   -2.59   -2.34   -1.88   -0.86   486 1.01
ftitle[4]       0.78    0.06  0.73   -0.86    0.35    0.89    1.25    2.10   144 1.01
fsibsp[1]       0.70    0.06  0.51   -0.33    0.36    0.69    1.09    1.69    77 1.04
fsibsp[2]       0.48    0.08  0.53   -0.62    0.14    0.49    0.93    1.46    49 1.06
fsibsp[3]       0.67    0.11  0.65   -0.60    0.23    0.61    1.16    1.85    35 1.07
fsibsp[4]      -1.58    0.04  0.64   -2.89   -2.03   -1.52   -1.16   -0.38   292 1.02
fparch[1]       0.25    0.02  0.33   -0.25    0.03    0.22    0.39    1.02   450 1.01
fparch[2]       0.04    0.01  0.33   -0.53   -0.13    0.00    0.17    0.83   521 1.01
fparch[3]      -0.03    0.06  0.38   -0.64   -0.22   -0.01    0.14    0.78    35 1.07
fparch[4]      -0.47    0.18  0.54   -1.48   -0.81   -0.31   -0.02    0.22     9 1.12
ffare[1]       -0.01    0.01  0.15   -0.39   -0.04    0.00    0.03    0.29   421 1.01
ffare[2]       -0.02    0.01  0.15   -0.41   -0.06    0.00    0.02    0.26   419 1.01
ffare[3]       -0.02    0.01  0.15   -0.35   -0.06    0.00    0.02    0.26   671 1.00
ffare[4]        0.00    0.01  0.16   -0.31   -0.05    0.00    0.04    0.36   586 1.01
ffare[5]        0.08    0.01  0.20   -0.15   -0.01    0.01    0.12    0.66   247 1.02
sdsex           0.51    0.04  0.44    0.03    0.21    0.35    0.68    1.64   123 1.01
sdpclass        0.75    0.04  0.36    0.32    0.48    0.68    0.94    1.59    64 1.04
sdembarked      0.43    0.02  0.32    0.06    0.21    0.39    0.53    1.24   328 1.01
sdoe            0.56    0.05  0.38    0.06    0.32    0.47    0.73    1.51    48 1.05
sdcabchar       0.39    0.03  0.26    0.03    0.18    0.36    0.59    0.97    61 1.04
sdage           0.33    0.06  0.29    0.02    0.09    0.23    0.53    0.89    23 1.15
sdticket        0.35    0.03  0.24    0.04    0.19    0.30    0.45    0.99    68 1.05
sdtitle         1.29    0.02  0.35    0.75    1.07    1.19    1.47    2.15   320 1.01
sdsibsp         1.00    0.02  0.37    0.49    0.74    0.95    1.16    1.95   222 1.01
sdparch         0.47    0.12  0.37    0.01    0.16    0.38    0.73    1.25    10 1.12
sdfare          0.14    0.02  0.17    0.00    0.02    0.09    0.19    0.60    92 1.03
lp__         -342.28    2.74 15.87 -372.33 -353.23 -344.26 -330.95 -310.12    34 1.09

Samples were drawn using NUTS(diag_e) at Sun Aug 30 13:00:19 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

Code

Data reading

# preparation and data reading section
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Embarked[tt$Embarked==''] <- 'S'
tt$Embarked <- factor(tt$Embarked)
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 999
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100,1000))

tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title=='Dr' & tt$Sex=='female'] <- 'Miss'
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major','Rev','Dr')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
# changed cabin character
tt$cabchar <- substr(tt$Cabin,1,1)
tt$cabchar[tt$cabchar %in% c('F','G','T')] <- 'X';
tt$cabchar <- factor(tt$cabchar)
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
# oe: odd/even cabin number, a proxy for the side of the vessel; -1 when no cabin number
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
tt$ticket <- sub('[[:digit:]]+$','',tt$Ticket)
tt$ticket <- toupper(gsub('(\\.)|( )|(/)','',tt$ticket))
tt$ticket[tt$ticket %in% c('A2','A4','AQ3','AQ4','AS')] <- 'An'
tt$ticket[tt$ticket %in% c('SCA3','SCA4','SCAH','SC','SCAHBASLE','SCOW')] <- 'SC'
tt$ticket[tt$ticket %in% c('CASOTON','SOTONO2','SOTONOQ')] <- 'SOTON'
tt$ticket[tt$ticket %in% c('STONO2','STONOQ')] <- 'STON'
tt$ticket[tt$ticket %in% c('C')] <- 'CA'
tt$ticket[tt$ticket %in% c('SOC','SOP','SOPP')] <- 'SOP'
tt$ticket[tt$ticket %in% c('SWPP','WC','WEP')] <- 'W'
tt$ticket[tt$ticket %in% c('FA','FC','FCC')] <- 'F'
tt$ticket[tt$ticket %in% c('PP','PPP','LINE','LP','SP')] <- 'PPPP'
tt$ticket <- factor(tt$ticket)
tt$fare <- cut(tt$Fare,breaks=c(min(tt$Fare)-1,quantile(tt$Fare,seq(.2,.8,.2)),max(tt$Fare)+1))

train <- tt[tt$status=='train',]
test <- tt[tt$status=='test',]

#end of preparation and data reading

options(width=90)

First model

# data list for Stan; the p-prefixed entries are the test-set copies used for prediction
datain <- list(

    survived = c(0,1)[train$Survived],
    ntrain = nrow(train),
    ntest=nrow(test),
    sex=c(1:2)[train$Sex],
    psex=c(1:2)[test$Sex],
    pclass=c(1:3)[train$Pclass],
    ppclass=c(1:3)[test$Pclass]    
)

parameters=c('intercept','fsex','fpclass','sdsex','sdpclass','sd1')
my_code <- ' 
    data {
    int<lower=0> ntrain;
    int<lower=0> ntest;
    int survived[ntrain];
    int <lower=1,upper=2> sex[ntrain];
    int <lower=1,upper=2> psex[ntest];
    int <lower=1,upper=3> pclass[ntrain];
    int <lower=1,upper=3> ppclass[ntest];

    }
    parameters {
    real fsex[2];
    real intercept;
    real fpclass[3];
    real <lower=0> sdsex;
    real <lower=0> sdpclass;
    real <lower=0> sd1;
    }
    transformed parameters {
    real expect[ntrain];
    for (i in 1:ntrain) {
    expect[i] <- inv_logit(
      intercept+
      fsex[sex[i]]+
      fpclass[pclass[i]]
    );
    }
    
    }
    model {        
    fsex ~ normal(0,sdsex);
    fpclass ~ normal(0,sdpclass);

    sdsex ~  normal(0,sd1);
    sdpclass ~ normal(0,sd1);
    sd1 ~ normal(0,3);
    intercept ~ normal(0,1);
    survived ~ bernoulli(expect);
    }
    generated quantities {
    real pred[ntest];
    for (i in 1:ntest) {
    pred[i] <- inv_logit(
    intercept+
        fsex[psex[i]]+
        fpclass[ppclass[i]]

    );
    }
    }
    '

fit1 <- stan(model_code = my_code, 
    data = datain, 
    pars=parameters,
    iter = 1000, 
    chains = 4,
    open_progress=FALSE)

fit1

Second model


datain <- list(
    survived = c(0,1)[train$Survived],
    ntrain = nrow(train),
    ntest=nrow(test),
    sex=c(1:2)[train$Sex],
    psex=c(1:2)[test$Sex],
    pclass=c(1:3)[train$Pclass],
    ppclass=c(1:3)[test$Pclass],
    embarked=c(1:3)[train$Embarked],
    pembarked=c(1:3)[test$Embarked],
    oe=c(1:3)[train$oe],
    poe=c(1:3)[test$oe],
    cabchar=c(1:7)[train$cabchar],
    pcabchar=c(1:7)[test$cabchar],
    age=c(1:10)[train$age],
    page=c(1:10)[test$age],
    ticket=c(1:13)[train$ticket],
    pticket=c(1:13)[test$ticket],
    title=c(1:4)[train$Title],
    ptitle=c(1:4)[test$Title],
    sibsp=c(1:4,rep(4,6))[train$SibSp+1],
    psibsp=c(1:4,rep(4,6))[test$SibSp+1],
    parch=c(1:4,rep(4,6))[train$Parch+1],
    pparch=c(1:4,rep(4,6))[test$Parch+1],
    fare=c(1:5)[train$fare],
    pfare=c(1:5)[test$fare]
)

parameters=c('intercept','sd1',
    'fsex','fpclass','fembarked',
    'foe','fcabchar','fage',
    'fticket','ftitle',
    'fsibsp','fparch',
    'ffare',
    'sdsex','sdpclass','sdembarked',
    'sdoe','sdcabchar','sdage',
    'sdticket','sdtitle',
    'sdsibsp','sdparch',
    'sdfare')
my_code <- ' 
    data {
    int<lower=0> ntrain;
    int<lower=0> ntest;
    int survived[ntrain];
    int <lower=1,upper=2> sex[ntrain];
    int <lower=1,upper=2> psex[ntest];
    int <lower=1,upper=3> pclass[ntrain];
    int <lower=1,upper=3> ppclass[ntest];
    int <lower=1,upper=3> embarked[ntrain];
    int <lower=1,upper=3> pembarked[ntest];
    int <lower=1,upper=3> oe[ntrain];
    int <lower=1,upper=3> poe[ntest];
    int <lower=1,upper=7> cabchar[ntrain];
    int <lower=1,upper=7> pcabchar[ntest];
    int <lower=1,upper=10> age[ntrain];
    int <lower=1,upper=10> page[ntest];
    int <lower=1,upper=13> ticket[ntrain];
    int <lower=1,upper=13> pticket[ntest];
    int <lower=1,upper=4> title[ntrain];
    int <lower=1,upper=4> ptitle[ntest];
    int <lower=1,upper=4> sibsp[ntrain];
    int <lower=1,upper=4> psibsp[ntest];
    int <lower=1,upper=4> parch[ntrain];
    int <lower=1,upper=4> pparch[ntest];
    int <lower=1,upper=5> fare[ntrain];
    int <lower=1,upper=5> pfare[ntest];
    
    }
    parameters {
    real fsex[2];
    real intercept;
    real fpclass[3];
    real fembarked[3];
    real foe[3];
    real fcabchar[7];
    real fage[10];
    real fticket[13];
    real ftitle[4];
    real fparch[4];
    real fsibsp[4];
    real ffare[5];
    real <lower=0> sdsex;
    real <lower=0> sdpclass;
    real <lower=0> sdembarked;
    real <lower=0> sdoe;
    real <lower=0> sdcabchar;
    real <lower=0> sdage;
    real <lower=0> sdticket;
    real <lower=0> sdtitle;
    real <lower=0> sdparch;
    real <lower=0> sdsibsp;
    real <lower=0> sdfare;

    real <lower=0> sd1;
    }
    transformed parameters {
    real expect[ntrain];
    for (i in 1:ntrain) {
    expect[i] <- inv_logit(
    intercept+
    fsex[sex[i]]+
    fpclass[pclass[i]]+
    fembarked[embarked[i]]+
    foe[oe[i]]+
    fcabchar[cabchar[i]]+
    fage[age[i]]+
    fticket[ticket[i]]+
    ftitle[title[i]]+
    fsibsp[sibsp[i]]+
    fparch[parch[i]]+
    ffare[fare[i]]
    );
    }
    
    }
    model {        
    fsex ~ normal(0,sdsex);
    fpclass ~ normal(0,sdpclass);
    fembarked ~ normal(0,sdembarked);
    foe ~ normal(0,sdoe);
    fcabchar ~ normal(0,sdcabchar);
    fage ~ normal(0,sdage);
    fticket ~ normal(0,sdticket);
    ftitle ~ normal(0,sdtitle);
    fsibsp ~ normal(0,sdsibsp);
    fparch ~ normal(0,sdparch);
    ffare ~ normal(0,sdfare);

    sdsex ~  normal(0,sd1);
    sdpclass ~ normal(0,sd1);
    sdembarked ~ normal(0,sd1);
    sdoe ~ normal(0,sd1);
    sdcabchar ~ normal(0,sd1);
    sdage ~ normal(0,sd1);
    sdticket ~ normal(0,sd1);
    sdtitle ~ normal(0,sd1);
    sdsibsp ~ normal(0,sd1);
    sdparch ~ normal(0,sd1);
    sdfare ~ normal(0,sd1);
    sd1 ~ normal(0,1);
    intercept ~ normal(0,1);

    survived ~ bernoulli(expect);
    }
    generated quantities {
    real pred[ntest];
    for (i in 1:ntest) {
    pred[i] <- inv_logit(
    intercept+
    fsex[psex[i]]+
    fpclass[ppclass[i]]+
    fembarked[pembarked[i]]+
    foe[poe[i]]+
    fcabchar[pcabchar[i]]+
    fage[page[i]]+
    fticket[pticket[i]]+
    ftitle[ptitle[i]]+
    fsibsp[psibsp[i]]+
    fparch[pparch[i]]+
    ffare[pfare[i]]

    );
    }
    }
    '

fit1 <- stan(model_code = my_code, 
    data = datain, 
    pars=parameters,
    iter = 1000, 
    chains = 4,
    open_progress=FALSE)

fit1
#plot(fit1,ask=TRUE)
#traceplot(fit1,ask=TRUE)
# second run: reuse the compiled model via fit=fit1 and keep only the predictions
fit2 <- stan(model_code = my_code, 
    data = datain, 
    fit=fit1,
    pars=c('pred'),
    iter = 2000, 
    warmup =200,
    chains = 4,
    open_progress=FALSE)

fit3 <- as.matrix(fit2)[,-419]  # drop the last column (lp__), keeping the 418 pred[] columns
#plots of individual passengers
#plot(density(fit3[,1]))

#plot(density(fit3[,18]))
#plot(density(as.numeric(fit3),adjust=.3))
decide1 <- apply(fit3,2,function(x) mean(x)>.5)
decide2 <- apply(fit3,2,function(x) median(x)>.5)
#table(decide1,decide2)

out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=as.numeric(decide1),
    row.names=NULL)
write.csv(x=out,
    file='stanlin.csv',
    row.names=FALSE,
    quote=FALSE)

Sunday, September 6, 2015

Predicting Titanic deaths on Kaggle V: Ranger

In two previous posts (Predicting Titanic deaths on Kaggle IV: random forest revisited, Predicting Titanic deaths on Kaggle) I was unable to make random forest predict as well as boosting. Hence when I read about an alternative implementation, ranger, I took the opportunity to check whether I could improve my predictions with it. The claim ranger makes is that it is faster than randomForest.
Meanwhile, I have also been reading that randomForest is not the best implementation. I did not bookmark where, but Google turned up Benchmarking Random Forest Classification on wise.io and Benchmarking Random Forest Implementations on DataScience.LA.

Data

A slightly adapted version of the pre-processing was used. The biggest change is a simpler version of the cabin character: instead of separate variables for A to F there is now one multilevel factor with values A, B, C, D, E and X. In addition, ticket has been made into a factor with more levels.

Age

Ranger is supposed to be fast, so I took the opportunity to do a replicated cross-validation for predicting the missing age values. The cross-validation is replicated because in previous experiments I found that the difference between two settings can be smaller than the variation within a setting. For this, errorest() from the ipred package was used, with a wrapper around ranger's predict to make it work (a minimal sketch of one call follows below; the full loop is in the code section). The plot below shows the cross-validation error for 10 runs at each setting, so it also gives an indication of the variation in the cross-validation error. From the plot it was decided to choose mtry=2 and min.node.size=7. mtry=1 gives the worst predictions, while at node size 7 and higher reasonable predictions can be made. In hindsight, larger node sizes might even be better, but this was not investigated at the time.
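
The wrapper is needed because errorest() calls predict(object, newdata), whereas ranger's predict method takes a data argument and returns its predictions in a predictions element. A sketch of a single cross-validation call, assuming the forage data frame that is built in the code section below:

library(ipred)     # errorest()
library(ranger)
ee <- errorest(Age ~ .,
    model = ranger,                  # the extra arguments below are passed on to ranger()
    predict = function(object, newdata)
        predict(object, data = newdata)$predictions,
    mtry = 2,
    min.node.size = 7,
    write.forest = TRUE,
    data = forage)
ee$error                             # cross-validated prediction error for this setting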

Survival

Again using replicated cross-validation, the following prediction errors were found. mtry=1 is not in the plot, but it did not perform well. In this plot there do not seem to be huge differences, but the combination of a small node size and a larger mtry does not pay off. Since these prediction errors are in the same range as previous results, it was decided not to make a Kaggle submission based on them.

Number of trees

Since it was unclear to me what the influence of the number of trees is, I did a small experiment with 50, 500 and 5000 trees, again with 10 cross-validations each. In the plot, 50 trees gives a surprisingly good prediction error for such a simple model, but 5000 is a bit better. Rather than investigating what number of trees is sufficient, I cut a corner and chose a large number: 200000. It should be noted that this fitted into memory and only took a few minutes. However, it did not improve the Kaggle score of my previous random forest attempt.

Conclusion

Ranger is indeed a fast and memory-efficient random forest implementation. However, it did not improve my prediction error.

Code

Please note that the code has been reformatted after pasting into Blogger to improve the layout. Some intermediate data saving and restoring code has been removed as well.
# preparation and data reading section
library(ranger)
#https://www.reddit.com/r/MachineLearning/comments/3hvy7v/ranger_a_fast_implementation_of_random_forests/
library(lattice)
library(latticeExtra)
# has cross validation
library(ipred)

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Embarked[tt$Embarked==''] <- 'S'
tt$Embarked <- factor(tt$Embarked)
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title=='Dr' & tt$Sex=='female'] <- 'Miss'
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer',
   'Major','Rev','Dr')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms',
   'theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
# changed cabin character
tt$cabchar <- substr(tt$Cabin,1,1)
tt$cabchar[tt$cabchar %in% c('F','G','T')] <- 'X';
tt$cabchar <- factor(tt$cabchar)
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
tt$ticket <- sub('[[:digit:]]+$','',tt$Ticket)
tt$ticket <- toupper(gsub('(\\.)|( )|(/)','',tt$ticket))
tt$ticket[tt$ticket %in% c('A2','A4','AQ3','AQ4','AS')] <- 'An'
tt$ticket[tt$ticket %in% c('SCA3','SCA4','SCAH','SC','SCAHBASLE','SCOW')] <- 'SC'
tt$ticket[tt$ticket %in% c('CASOTON','SOTONO2','SOTONOQ')] <- 'SOTON'
tt$ticket[tt$ticket %in% c('STONO2','STONOQ')] <- 'STON'
tt$ticket[tt$ticket %in% c('C')] <- 'CA'
tt$ticket[tt$ticket %in% c('SOC','SOP','SOPP')] <- 'SOP'
tt$ticket[tt$ticket %in% c('SWPP','WC','WEP')] <- 'W'
tt$ticket[tt$ticket %in% c('FA','FC','FCC')] <- 'F'
tt$ticket[tt$ticket %in% c('PP','PPP','LINE','LP','SP')] <- 'PPPP'
tt$ticket <- factor(tt$ticket)

#end of preparation and data reading

# age section
# get an age without missings
forage <- tt[!is.na(tt$Age) & tt$status=='train',
   names(tt) %in% c('Age','Sex','Pclass','SibSP',
        'Parch','Fare','Title','Embarked','cabchar','ncabin','ticket')]
# note: 'SibSP' above does not match the SibSp column, so SibSp is not actually included
# oe is side of vessel, not relevant for age?

totest <- expand.grid(mtry=1:4,min.node.size=1:11,rep=1:10)

la <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Age ~ .,
          mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          model=ranger,
          predict=function(object,newdata) 
            predict(object,data=newdata)$predictions,
          write.forest=TRUE,
          data=forage)
      cc <- c(mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          error=ee$error)
      # print(cc)
      cc
    })
sla <- do.call(rbind,la)
sla <- as.data.frame(sla)

useOuterStrips(
    densityplot(~ error | factor(mtry)+factor(min.node.size), 
        data=sla))

# 2,7?
rfa1 <- ranger(Age ~ .,
    data=forage,
    mtry=2,
    write.forest=TRUE,
    min.node.size=7)

tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(rfa1,tt[is.na(tt$AGE),])$predictions
summary(tt$AGE)  # quick check of the ages after imputation
# end of age section

#final data section
train <- tt[tt$status=='train',]
test <- tt[tt$status=='test',]
#end of final data section

#model selection 1
forSurf <- train[,names(train) %in% c('Survived','AGE','Sex','Pclass','SibSP',
        'Parch','Fare','Title','Embarked','ncabin','ticket','oe')]
# note: 'SibSP' above does not match the SibSp column, so SibSp is not actually included

totest <- expand.grid(mtry=2:5,
  min.node.size=c(1:4,seq(6,12,2),15,20,25),rep=1:10)

la2 <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          model=ranger,
          predict=function(object,newdata) 
            predict(object,data=newdata)$predictions,
          write.forest=TRUE,
          data=forSurf
      )
      cc <- c(mtry=totest$mtry[ii],
          min.node.size=totest$min.node.size[ii],
          error=ee$error)
      cat('.')
      if (totest$mtry[ii]==max(totest$mtry) & 
          totest$min.node.size[ii]==max(totest$min.node.size)) 
        cat('\n')
      cc
    })
sla2 <- do.call(rbind,la2)
sla2 <- as.data.frame(sla2)

useOuterStrips(
    densityplot(~ error | factor(mtry)+factor(min.node.size), 
        data=sla2))

############

totest <- expand.grid(num.trees=c(50,500,5000),rep=1:10)

la3 <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=4,
          min.node.size=12,
          num.trees=totest$num.trees[ii],
          model=ranger,
          predict=function(object,newdata) 
            predict(object,data=newdata)$predictions,
          write.forest=TRUE,
          data=forSurf
      )
      cc <- c(num.trees=totest$num.trees[ii],error=ee$error)
      cat('.')
      if (totest$num.trees[ii]==max(totest$num.trees)) cat('\n')
      cc
    })
sla3 <- do.call(rbind,la3)
sla3 <- as.data.frame(sla3)
densityplot(~ error | factor(num.trees), data=sla3)

#########
rang3 <- ranger(Survived ~ .,
    mtry=4,
    min.node.size=12,
    write.forest=TRUE,
    data=forSurf,
    num.trees=200000) 

pp <- predict(rang3,test)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp$predictions,row.names=NULL)
write.csv(x=out,
    file='rf.1.sep.csv',
    row.names=FALSE,
    quote=FALSE)

# Your submission scored 0.75598, which is not an improvement of your best score.