Sunday, August 23, 2015

Predicting Titanic deaths on Kaggle IV: random forest revisited

On July 19th I used randomForest to predict the Titanic deaths in the Kaggle competition. Subsequently I found that both bagging and boosting gave better predictions than randomForest. I found that somewhat unsatisfactory, hence I am now revisiting randomForest. To my disappointment, this still does not result in predictions as good as bagging and boosting.
Note that all code is at the bottom of the post.

Data

Data has not changed very much.

Age

Since the ipred package has a convenient function (errorest()) for estimating prediction error via cross-validation, the first adaptation is getting better predictions for Age where it is missing from the data. The model parameters to be optimized are mtry and nodesize. The plot shows that mtry=5 and nodesize=4 should give the best predictions.
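In condensed form the grid search looks like this (essentially the same construction as in the full code at the bottom of the post; it assumes the forage data frame built in the age section there):

library(randomForest)
library(ipred)
totest <- expand.grid(mtry=4:7,nodesize=3:6)
cverror <- sapply(1:nrow(totest),function(ii)
    errorest(Age ~ .,
        data=forage,
        model=randomForest,
        mtry=totest$mtry[ii],
        nodesize=totest$nodesize[ii])$error)
# best combination of mtry and nodesize
totest[which.min(cverror),]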

Using these settings, the following predicted vs observed ages are obtained. I am not really impressed.

Survival model 1

Model building 1

Having complete data, the next step is using cross-validation to select nodesize and mtry for the survival model. The following predictive capability was observed. Note that the error in these models is a bit larger than observed previously with bagging and boosting; however, that observation does not by itself suggest a remedy. I chose nodesize=3 and mtry=7.
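The search itself is the same errorest() grid construction as for age, now with 20-fold cross-validation (condensed from the full code below):

totest <- expand.grid(mtry=6:9,nodesize=3:7)
cverror <- sapply(1:nrow(totest),function(ii)
    errorest(Survived ~ .,
        data=forSurf,
        model=randomForest,
        mtry=totest$mtry[ii],
        nodesize=totest$nodesize[ii],
        ntree=1000,
        est.para=control.errorest(k=20))$error)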

Evaluation model 1

There are a number of ways to have randomForest give predictions: one can ask for the categories directly, or for the probability of each category. At this point I am looking at those probabilities, since I think the model might be improved, and for that improvement I need to understand what is happening. Using the model, the following out-of-bag probabilities per category are found (pp[,1] is the probability of category 0). This is not ideal: ideally most of the probabilities are close to 0 or 1, but here there are quite a number for which this is not the case. Category 1 especially is not easily found, and quite a few of the category 1 observations are seen as category 0. Hence the question becomes whether it is possible to get better-defined categories. As a first step, I will try to optimize the point where the cut is made between the two categories.
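These probabilities come straight from the fitted forest: calling predict() on a randomForest object without newdata returns the out-of-bag predictions. These are the relevant lines from the full code below:

rfx <- randomForest(Survived ~ .,data=forSurf,nodesize=3,mtry=7,ntree=1000)
# out-of-bag probabilities; pp[,1] is the probability of category 0
pp <- predict(rfx,type='prob')
densityplot(~ pp[,1] | forSurf$Survived,adj=.3)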

The plot below shows the number of correct predictions as a function of the cut-off point. It shows that the whole center region is a possible cut-off, except near 0.4. The value 0.5 is not optimal.
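The scan simply counts the correct out-of-bag classifications at each candidate cut-off (condensed from the evaluation code below):

cuts <- seq(.20,.7,.001)
ncorrect <- sapply(cuts,function(cc) {
      decide <- factor(as.numeric(pp[,1]<cc))
      sum(decide==forSurf$Survived)
    })
plot(x=cuts,y=ncorrect)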

Examination of cut-off point

After making this plot I wondered whether this shape would be the same for other settings of nodesize and mtry. Since I have a distinct feeling it all depends on the luck of the draw, the exercise is repeated a number of times for each setting. Based on this I have chosen a cut-off of 0.55 as appropriate for a wider range of settings. The best out-of-bag predictions seem to happen with a higher value for mtry and a low value for nodesize. Thinking back on the density plot, it would seem that high nodesize and low mtry gives low probabilities in the center region. However, the price for that is quite a few errors in the out-of-bag predictions.

Survival model 2

Model building 2

Using the cut-off of 0.55, cross-validation is again used to select the model parameters mtry and nodesize. Again, each setting is tried a few times to get an idea of the variability of prediction quality. Based on these results I have chosen nodesize=6 and mtry=6.
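To make errorest() use the 0.55 cut-off rather than the default majority vote, a small custom prediction function is passed via its predict argument; this is the twpred() function from the code below:

twpred <- function(object,newdata=NULL) {
  # classify as survived (1) when the probability of category 0 is below 0.55
  preds <- predict(object,newdata,type='prob')
  factor(as.numeric(preds[,1]<0.55),levels=c('0','1'))
}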

Submission

The submission scored 0.75598; not really as much as I had hoped for.

Code

Note that the code has been reformatted and cleaned after pasting into the blogging application. This should not have caused any coding errors.
# preparation and data reading section
library(randomForest)
library(lattice)
# has cross validation
library(ipred)


# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 35
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100))
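# extract the title (Mr, Mrs, Miss, ...) from the name and pool the rare titles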
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
tt$A <- factor(grepl('A',tt$Cabin))
tt$B <- factor(grepl('B',tt$Cabin))
tt$C <- factor(grepl('C',tt$Cabin))
tt$D <- factor(grepl('D',tt$Cabin))
tt$E <- factor(grepl('E',tt$Cabin))
tt$F <- factor(grepl('F',tt$Cabin))
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$PC <- factor(grepl('PC',tt$Ticket))
tt$STON <- factor(grepl('STON',tt$Ticket))
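# numeric part of the cabin; odd/even cabin number as a factor (-1 when there is no cabin number)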
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
#end of preparation and data reading

# age section
# get an age without missings
forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in% 
     c('Age','Sex','Pclass','SibSp',
        'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
        'ncabin','PC','STON','oe')]

totest <- expand.grid(mtry=4:7,nodesize=3:6)

la <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Age ~.,
          mtry=totest$mtry[ii],
          nodesize=totest$nodesize[ii],
          model=randomForest,
          data=forage)
      cc <- c(mtry=totest$mtry[ii],nodesize=totest$nodesize[ii],error=ee$error)
      print(cc)
      cc
    })

sla <- do.call(rbind,la)
sla <- as.data.frame(sla)
xyplot(error ~ mtry, groups= nodesize, data=sla,auto.key=TRUE,type='l')
# chosen 5,4
rfa1 <- randomForest(Age ~ .,
    data=forage,
    ntree=1000,
    mtry=5,
    nodesize=4)
plot(tt$Age,predict(rfa1,tt))
abline(a=0,b=1,col='red')
tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(rfa1,tt[is.na(tt$AGE),])
tt$age <- cut(tt$AGE,c(0,2,5,9,12,15,21,55,65,100))
# end of age section

#final data section
train <- tt[tt$status=='train',]
test <- tt[tt$status=='test',]
#end of final data section

#model selection 1
forSurf <- train[,names(train) %in% 
     c('Survived','age','AGE','Sex','Pclass','SibSp',
        'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
        'ncabin','PC','STON','oe')]

# rfx <- randomForest(Survived ~.,data=forSurf)
totest <- expand.grid(mtry=6:9,nodesize=3:7)

la <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=totest$mtry[ii],
          nodesize=totest$nodesize[ii],
          model=randomForest,
          data=forSurf,
          ntree=1000,
          est.para=control.errorest(k=20)
      )
      cc <- c(mtry=totest$mtry[ii],
         nodesize=totest$nodesize[ii],
         error=ee$error)
      print(cc)
      cc
    })
sla <- do.call(rbind,la)
sla <- as.data.frame(sla)
xyplot(error ~ mtry, groups= nodesize, data=sla,auto.key=TRUE,type='l')
#end of model selection 1

#model evaluation section 1a
rfx <- randomForest(Survived ~.,data=forSurf,nodesize=3,mtry=7,ntree=1000)
pp <- predict(rfx,type='prob')
densityplot(~ pp[,1] | forSurf$Survived,adj=.3)
cuts <- seq(.20,.7,.001) 
plot(y=sapply(cuts,function(cc){
          decide=factor(as.numeric(pp[,1]<cc))
          sum(decide==forSurf$Survived)
        }),
    x=cuts)
#end  of model evaluation section 1a


cuts <- seq(.25,.65,.001) 
# model evaluation 1b
eval2 <- expand.grid(nodesize=seq(4,100,8),mtry=seq(2,8,2),count=1:10)
sach <- lapply( 1:nrow(eval2),function(i) {
      rfx <- randomForest(Survived ~.,
          data=forSurf,
          nodesize=eval2$nodesize[i],
          mtry=eval2$mtry[i],
          ntree=1000)
      pp <- predict(rfx,type='prob')
      nerr=sapply(cuts,function(cc){
            decide=factor(as.numeric(pp[,1]<cc))
            sum(decide==forSurf$Survived)})
      data.frame(
          nerr=nerr,
          cuts=cuts,
          mtry=eval2$mtry[i],
          nodesize=eval2$nodesize[i],
          i=rep(i,length(cuts)))
    })
sach <- do.call(rbind,sach)
xyplot(nerr ~ cuts | nodesize + mtry, groups=i, data=sach, auto.key=FALSE, type='l')
##############
# # chose cuts at .55
##############

#biased prediction
twpred <- function(object,newdata=NULL) {
  preds <- predict(object,newdata,type='prob')
  factor(as.numeric(preds[,1]<0.55),levels=c('0','1'))
}
totest2 <- expand.grid(mtry=seq(2,8,2),nodesize=seq(2,30,4),count=1:10)

la2 <- lapply(1:nrow(totest2),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=totest2$mtry[ii],
          nodesize=totest2$nodesize[ii],
          model=randomForest,
          data=forSurf,
          ntree=500,
          predict=twpred,
          est.para=control.errorest(k=10)
      )
      cc <- c(mtry=totest2$mtry[ii],
         nodesize=totest2$nodesize[ii],
         i=totest2$count[ii],
         error=ee$error)
      print(cc)
      cc
    })
sla2 <- do.call(rbind,la2)
sla2 <- as.data.frame(sla2)
xyplot(error ~ factor(mtry) | factor(nodesize),
    groups= i, data=sla2,auto.key=FALSE,type='l')
##
# selected mtry=6, nodesize=6
rf2 <-randomForest(Survived ~ .,
    data=forSurf,
    replace=TRUE,
    ntree=2000,
    nodesize=6,
    mtry=6)  

pp <- predict(rf2,test)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='rf.16.aug.csv',
    row.names=FALSE,
    quote=FALSE)

# get a result
# Your submission scored 0.75598

2 comments:

  1. Thanks for posting your analysis. I appreciate it because I know it represents quite a bit of work on your part.

    When I run your code in a clean RStudio session, I have a problem in the "Age" section with this assignment:

    la <- lapply(1:nrow(totest),function(ii) {
    ee <- errorest(Age ~.,
    mtry=totest$mtry[ii],
    nodesize=totest$nodesize[ii],
    model=randomForest,
    data=forage)
    cc <- c(mtry=totest$mtry[ii],nodesize=totest$nodesize[ii],error=ee$error)
    print(cc)
    cc
    })

    I get the following error.

    [begin output]
    Error in randomForest.default(m, y, ...) :
    NA/NaN/Inf in foreign function call (arg 1)
    In addition: Warning messages:
    1: In data.matrix(x) : NAs introduced by coercion
    2: In data.matrix(x) : NAs introduced by coercion
    Called from: randomForest.default(m, y, ...)
    [end output]

    I am not sure what is going on. I checked the "forage" data.frame for NAs but could not find any.

    TIA.

    Replies
    1. That is difficult to debug remotely. If I have such things, I try to remove some of the variables to see which variable is causing the error. This is most easily done by just fitting the randomForest; wrapping it in errorest and a lapply just makes things more confusing.
