Sunday, July 26, 2015

Predicting Titanic deaths on Kaggle II: gbm

Following my previous post I have decided to try a different method: generalized boosted regression models (gbm). I have read the background in Elements of Statistical Learning and Arthur Charpentier's nice post on it. This data set is a nice occasion to get my hands dirty.

Data 

Data as before, but with some added variables. In addition, during the analysis it appeared that gbm does not like to have logical variables among the x-variables, hence the factor() wrappers below. The one missing value of Fare in the test set gets the median value, so the data contain no missing values. I must say I like using dplyr for this data handling; it allows me to use the same code for training and test set with minimum effort.
library(dplyr)
library(gbm)

set.seed(4321)
titanic <- read.csv('train.csv') %>%
    mutate(.,Pclass=factor(Pclass),
        Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        Title=sapply(Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2]),
        Title=gsub(' ','',Title),
        Title =ifelse(Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major'),'Mr',Title),
        Title =ifelse(Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Dona'),'Miss',Title),
        Title = factor(Title),
        A=factor(grepl('A',Cabin)),
        B=factor(grepl('B',Cabin)),
        C=factor(grepl('C',Cabin)),
        D=factor(grepl('D',Cabin)),
        E=factor(grepl('E',Cabin)),
        F=factor(grepl('F',Cabin)),
        ncabin=nchar(as.character(Cabin)),
        PC=factor(grepl('PC',Ticket)),
        STON=factor(grepl('STON',Ticket)),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
        train = sample(c(TRUE,FALSE),
            size=891,
            replace=TRUE, 
            prob=c(.9,.1)   ) )

test <- read.csv('test.csv') %>%
    mutate(.,
        Embarked=factor(Embarked,levels=levels(titanic$Embarked)),
        Pclass=factor(Pclass),
  #      Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        Title=sapply(Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2]),
        Title=gsub(' ','',Title),
        Title =ifelse(Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major'),'Mr',Title),
        Title =ifelse(Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Dona'),'Miss',Title),
        Title = factor(Title),
        A=factor(grepl('A',Cabin)),
        B=factor(grepl('B',Cabin)),
        C=factor(grepl('C',Cabin)),
        D=factor(grepl('D',Cabin)),
        E=factor(grepl('E',Cabin)),
        F=factor(grepl('F',Cabin)),
        ncabin=nchar(as.character(Cabin)),
        PC=factor(grepl('PC',Ticket)),
        STON=factor(grepl('STON',Ticket)),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1))
)
test$Fare[is.na(test$Fare)]<- median(titanic$Fare)

Age

Age has missing values, thus the first task is to fill those. Since gbm is the method used for the main analysis, I will use it for Age too. This has the added advantage that I get to exercise with both a numerical and a categorical response.
One of the things with boosting is that it opens itself to overfitting. Boosting consists of adding trees, each structured to improve the fit; at some point the added trees will just start boosting the noise rather than the structure. Gbm comes with a cross validation (cv) option, which is the preferred way to assess the predictive quality of models, and cv is used to determine the optimum number of trees. But there is a catch: gbm throws an error if the data.frame contains variables which are not used in the model. Hence in the code below the data is selected first, and subsequently the model is run.
The model parameters interaction.depth=4 and shrinkage=0.0005 come from a bit of experimenting. n.trees has to be high enough that the estimated optimum number of trees is clearly lower than the number fitted. It seems n.cores=2 works under both Windows and Linux.
forage <- filter(titanic,!is.na(titanic$Age)) %>%
     select(.,Age,SibSp,Parch,Fare,Sex,Pclass,Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe)

rfa1 <- gbm(Age ~ ., 
    data=forage,
    interaction.depth=4,
    cv.folds=10,
    n.trees=8000,
    shrinkage=0.0005,
    n.cores=2)

gbm.perf(rfa1)
Using cv method...
[1] 6824
It is time here to confess that I have been working on this over several sessions. When I created the code below, 7118 trees were optimum, while the stored output above comes from a session where 6824 trees were optimal. Such is the way of these methods: unlike traditional statistical methods, they give a slightly different result every time. But, as can be seen from the plot, the difference should be minimal.
titanic$AGE<- titanic$Age
titanic$AGE[is.na(titanic$AGE)] <- predict(rfa1,titanic,n.trees=7118)[is.na(titanic$Age)]
test$AGE<- test$Age
test$AGE[is.na(test$AGE)] <- predict(rfa1,test,n.trees=7118)[is.na(test$Age)]
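Since the cv optimum shifts a little between sessions, an alternative is to capture the estimate from gbm.perf() instead of hard-coding it. A minimal sketch (best.iter is a name I introduce here); the four lines above could then be written as:
best.iter <- gbm.perf(rfa1,method='cv',plot.it=FALSE) # cv-estimated optimum
titanic$AGE <- titanic$Age
titanic$AGE[is.na(titanic$AGE)] <- predict(rfa1,titanic,n.trees=best.iter)[is.na(titanic$Age)]
test$AGE <- test$Age
test$AGE[is.na(test$AGE)] <- predict(rfa1,test,n.trees=best.iter)[is.na(test$Age)]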

Survival

During the calculations I learned that the response should be a float containing 0 and 1. With two categories there are various distributions to choose from: bernoulli, huberized and adaboost. Using the 10% test data I had set apart, it seemed adaboost worked best for these data.
gb1 <- filter(titanic,train) %>%
    select(.,age,SibSp,Parch,Fare,Sex,Pclass,
        Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived)%>%
      mutate(Survived=c(0,1)[Survived]) # not integer or factor but float
#table(gb1$Survived)
gb1m <-      gbm(Survived ~ .,
         cv.folds=11,
         n.cores=2,
         interaction.depth=5,
         shrinkage = 0.0005,
         distribution='adaboost',
    data=gb1,
    n.trees=10000)
gbm.perf(gb1m)
Using cv method...
[1] 6355

In my code I have used 6000 trees. 
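For reference, the distribution comparison mentioned above can be run along these lines. This is a rough sketch using the objects built here; ho is a name I introduce for the 10% hold-out rows, and huberized can be slotted into the same loop, minding that its predictions may come back on the link scale:
ho <- filter(titanic,!train) %>%
    select(.,age,SibSp,Parch,Fare,Sex,Pclass,
        Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived) %>%
    mutate(Survived=c(0,1)[Survived])
for (dist in c('bernoulli','adaboost')) {
    m <- gbm(Survived ~ .,
        data=gb1,
        distribution=dist,
        interaction.depth=5,
        shrinkage=0.0005,
        n.trees=6000)
    p <- predict(m,ho,n.trees=6000,type='response')
    cat(dist,'errors:',sum(ifelse(p<0.5,0,1)!=ho$Survived),'\n')
}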
One thing about gbm is that it does not respond with categories. It returns a continuous score which expresses the propensity for either category.
preds <- predict(gb1m,titanic,
    n.trees=6000,type='response')
density(preds) %>% plot
Thus there is a need, or an opportunity, to determine the cut-off point. For this my test set comes in handy.
preds2<- preds[!titanic$train]
target <- c(0,1)[titanic$Survived[!titanic$train]]
sapply(seq(.3,.7,.01),function(step)
  c(step,sum(ifelse(preds2<step,0,1)!=target)))
     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
[1,]  0.3  0.31  0.32  0.33  0.34  0.35  0.36  0.37  0.38  0.39   0.4  0.41
[2,] 17.0 17.00 17.00 17.00 17.00 17.00 17.00 17.00 18.00 17.00  16.0 16.00
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,]  0.42  0.43  0.44  0.45  0.46  0.47  0.48  0.49   0.5  0.51  0.52  0.53
[2,] 16.00 16.00 15.00 14.00 14.00 14.00 13.00 15.00  15.0 15.00 16.00 16.00
     [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
[1,]  0.54  0.55  0.56  0.57  0.58  0.59   0.6  0.61  0.62  0.63  0.64  0.65
[2,] 16.00 17.00 17.00 17.00 17.00 18.00  18.0 18.00 18.00 19.00 20.00 20.00
     [,37] [,38] [,39] [,40] [,41]
[1,]  0.66  0.67  0.68  0.69   0.7
[2,] 20.00 21.00 21.00 21.00  21.0
It is a bit messy output, but the fewest errors are found at 0.48.
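The same scan can be condensed so that the best cut-off is extracted directly:
steps <- seq(.3,.7,.01)
errs <- sapply(steps,function(step) sum(ifelse(preds2<step,0,1)!=target))
steps[which.min(errs)] # 0.48, with 13 errors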

Predictions

This is fairly straightforward. I am not unhappy to report an improvement, bringing me from the tail to the middle region of the peloton.
pp <- predict(gb1m,test,n.trees=6000,type='response')
pp <- ifelse(pp<0.48,0,1)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='gbm.csv',
    row.names=FALSE,
    quote=FALSE)

Sunday, July 19, 2015

Predicting Titanic deaths on Kaggle

Kaggle has a competition to predict who will die on the famous Titanic: 'Machine Learning from Disaster'. It is placed as a knowledge competition, just up there to learn. I am late to the party; it has been running for 1 1/2 years and ends at the end of 2015. It is a small data set, hence interesting to learn from. It is also a competition with a number of entries which have perfect predictions.
Just for fun, I have been trying to see what I could achieve with a simple attempt with randomForest. For those in the competition, this randomForest got me 0.74163, placing me 2781 out of 3064 entries. An acceptable spot, since this is an off-the-shelf approach. An improvement may follow in a subsequent post.

Data

Data downloaded from Kaggle. It is real world data, hence it has the odd missing value (in passenger Age) and a number of columns with messy data, which might be employed to create additional variables. For the purpose of validation, about 90% of the data gets flagged as training set. test will be the test set, the results of which are to be passed back to Kaggle.
  PassengerId Survived Pclass                                                Name
1           1        0      3                             Braund, Mr. Owen Harris
2           2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3           3        1      3                              Heikkinen, Miss. Laina
4           4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
5           5        0      3                            Allen, Mr. William Henry
6           6        0      3                                    Moran, Mr. James
     Sex Age SibSp Parch           Ticket    Fare Cabin Embarked
1   male  22     1     0        A/5 21171  7.2500              S
2 female  38     1     0         PC 17599 71.2833   C85        C
3 female  26     0     0 STON/O2. 3101282  7.9250              S
4 female  35     1     0           113803 53.1000  C123        S
5   male  35     0     0           373450  8.0500              S
6   male  NA     0     0           330877  8.4583              Q
library(dplyr)
library(randomForest)
library(lattice)
options(width=85)
head(read.csv('train.csv'))

titanic <- read.csv('train.csv') %>%
    mutate(.,Pclass=factor(Pclass),
        Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        A=grepl('A',Cabin),
        B=grepl('B',Cabin),
        C=grepl('C',Cabin),
        D=grepl('D',Cabin),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
        train = sample(c(TRUE,FALSE),
            size=891,
            replace=TRUE, 
            prob=c(.9,.1)   ) )
test <- read.csv('test.csv') %>%
    mutate(.,Pclass=factor(Pclass),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        A=grepl('A',Cabin),
        B=grepl('B',Cabin),
        C=grepl('C',Cabin),
        D=grepl('D',Cabin),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
        Embarked=factor(Embarked,levels=levels(titanic$Embarked))
        )
test$Fare[is.na(test$Fare)]<- median(titanic$Fare)
Age has missing values, hence the first step is to fill those in. In the code above, an age factor has been made, where missing values are imputed with the largest category.
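As a quick check that the imputed value of 35 indeed falls in the largest age bucket:
cut(35,c(0,2,5,9,12,15,21,55,65,100))
# [1] (21,55]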

Model building

A simple prediction using randomForest.
rf1 <- randomForest(Survived ~ 
        Sex+Pclass + SibSp +
        Parch + Fare + 
        Embarked + age +
        A+B+C+D +oe,
    data=titanic,
    subset=train,
    replace=FALSE,
    ntree=1000)
Call:
 randomForest(formula = Survived ~ Sex + Pclass + SibSp + Parch +      Fare + Embarked + age + A + B + C + D + oe, data = titanic,      replace = FALSE, ntree = 1000, subset = train) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 16.94%
Confusion matrix:
    0   1 class.error
0 454  40  0.08097166
1  95 208  0.31353135

This shows some bias in the predictions: class 1 gets predicted as class 0 far too often. Hence I will optimize not only the usual parameters nodesize (minimum size of terminal nodes) and mtry (number of variables randomly sampled as candidates at each split) but also classwt (priors of the classes).

titanic$pred <- predict(rf1,titanic)
with(titanic[!titanic$train,],sum(pred!=Survived)/length(pred))

mygrid <- expand.grid(nodesize=c(2,4,6),
    mtry=2:5,
    wt=seq(.5,.7,.05))
sa <- sapply(1:nrow(mygrid), function(i) {
    rfx <- randomForest(Survived ~ 
        Sex+Pclass + SibSp +
        Parch + Fare + 
        Embarked + age +
        A+B+C+D +oe,
    data=titanic,
    subset=train,
    replace=TRUE,
    ntree=4000,
    nodesize=mygrid$nodesize[i],
    mtry=mygrid$mtry[i],
    classwt=c(1-mygrid$wt[i],mygrid$wt[i]))  
    preds <- predict(rfx,titanic[!titanic$train,])
    nwrong <- sum(preds!=titanic$Survived[!titanic$train])
    c(nodesize=mygrid$nodesize[i],mtry=mygrid$mtry[i],wt=mygrid$wt[i],pw=nwrong/length(preds))
    })
tsa <- as.data.frame(t(sa))
xyplot(pw ~ wt | mtry,group=factor(nodesize), data=tsa,auto.key=TRUE,type='l')

What is less visible from this plot is the amount of variation I had in the results. One prediction better or worse really makes a difference in the figure. This is the reason I have increased the number of trees in the model.
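To get a feel for that variation, the hold-out error at one setting can simply be recomputed a few times. A small sketch at the settings chosen below:
errs <- sapply(1:5,function(i) {
    rfx <- randomForest(Survived ~ 
            Sex+Pclass + SibSp +
            Parch + Fare + 
            Embarked + age +
            A+B+C+D +oe,
        data=titanic,
        subset=train,
        replace=TRUE,
        ntree=4000,
        nodesize=4,
        mtry=3,
        classwt=c(.4,.6))
    preds <- predict(rfx,titanic[!titanic$train,])
    mean(preds!=titanic$Survived[!titanic$train])
})
summary(errs)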

Final predictions

Using the best settings from above still leaves me near the bottom of the ranking. The script below builds the model and writes the predictions in the format required by Kaggle.
rf2 <- randomForest(Survived ~ 
        Sex+Pclass + SibSp +
        Parch + Fare + 
        Embarked + age +
        A+B+C+D +oe,
    data=titanic,
    replace=TRUE,
    ntree=5000,
    nodesize=4,
    mtry=3,
    classwt=c(1-.6,.6))  

pp <- predict(rf2,test)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='rf1.csv',
    row.names=FALSE,
    quote=FALSE)

Sunday, July 5, 2015

More on causes of death in Netherlands over the years

Last week I had a post 'Deaths in the Netherlands by cause and age'. During the creation of that post I made one plot which I had not shown. It shows something odd: there is a vertical striping, meaning mortality varies by year across the age cohorts.
To examine this phenomenon further, here is a plot of some underlying causes. I would say the striping is present for at least three categories: "Diseases of the circulatory system", "Diseases of the respiratory organs" and "Sympt., Abnormal clinical Observations". This is odd, since these do not seem to be contagious. I suspect therefore that something like harsh weather (heat or cold) makes life more difficult, but does not get registered as the final cause.
In addition, there is something which I did not realize before regarding "Mental and behavioral disorders". They are age related. But it also seems that somewhere in the nineties of the last century it became acceptable to register them, and suddenly they are present, across several age categories.
This plot, same data, differently organized, shows that the years with these causes are similar, especially for "Diseases of the circulatory system" and "Diseases of the respiratory organs".

Can it statistically be seen?

It is very nice that I can see that, but how about measuring it? Hence, for age 90 to 95, after detrending, the correlation between the two most visually correlated causes of death.
Pearson's product-moment correlation

data:  xx$`Diseases of the respiratory organs` and xx$`Diseases of the circulatory system`
t = 2.4997, df = 62, p-value = 0.01509
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.06133681 0.51042863
sample estimates:
     cor 
0.302584 

Code

library(dplyr)
library(ggplot2)
txtlines <- readLines('Overledenen__doodsoo_170615161506.csv')
txtlines <- grep('Centraal',txtlines,value=TRUE,invert=TRUE) 
#txtlines[1:5]
#cat(txtlines[4])
r1 <- read.csv(sep=';',header=FALSE,
        col.names=c('Causes','Causes2','Age','year','aantal','count'),
        na.strings='-',text=txtlines[3:length(txtlines)]) %>%
    select(.,-aantal,-Causes2)
transcauses <- c(
    "Infectious and parasitic diseases",
    "Diseases of skin and subcutaneous",
    "Diseases musculoskeletal system and connective ",
    "Diseases of the genitourinary system",
    "Pregnancy, childbirth",
    "Conditions of perinatal period",
    "Congenital abnormalities",
    "Sympt., Abnormal clinical Observations",
    "External causes of death",
    "Neoplasms",
    "Illness of blood, blood-forming organs",
    "Endocrine, nutritional, metabolic illness",
    "Mental and behavioral disorders",
    "Diseases of the nervous system and sense organs",
    "Diseases of the circulatory system",
    "Diseases of the respiratory organs",
    "Diseases of the digestive organs",
    "Population",
    "Total all causes of death")
cbind(transcauses,levels(r1$Causes))
options(width=100)
levels(r1$Causes) <- transcauses
levels(r1$Age) <- 
    gsub('jaar','year',levels(r1$Age)) %>%
    gsub('tot','to',.) %>%
    gsub('of ouder','+',.)
r1 <- mutate(r1,age=as.numeric(sub(' .*$','',Age)))

Deaths <- filter(r1,Causes=='Total all causes of death') %>%
    mutate(.,Total=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[18]) %>%
    mutate(.,Population=count,
        Percentage=100*Total/Population,
        year = as.numeric(gsub('*','',year,fixed=TRUE))) %>%
    select(.,-Causes,-count)

png('deathall.png')
v <- ggplot(Deaths[Deaths$age>60,], aes( year,Age, fill = Percentage))
v + geom_raster() +
    scale_fill_gradientn (
        colours=c('white','black'))+
    theme(legend.position="bottom")+
    ggtitle('Total all causes of death')
dev.off()

v3 <- filter(r1,Causes %in% transcauses[18],
        age>65) %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[c(8,15,9,10,13,16)]) %>%
    mutate(.,Total=count,
        Percentage=100*Total/Population,
        year = as.numeric(gsub('*','',year,fixed=TRUE))) %>%
    select(.,-count)
png('bycause.png')
 ggplot(v3, aes( year,Age, fill = Percentage))+
  geom_raster() +
    scale_fill_gradientn (
        colours=c('white','black'))+
    theme(legend.position="bottom")+
    facet_wrap( ~ Causes,nrow=3)
dev.off()

png('byage.png')
ggplot(v3[v3$age>75,], aes( year,Causes, fill = Percentage))+
    geom_raster() +
    scale_fill_gradientn (
        colours=c('white','black'))+
    theme(legend.position="bottom")+
    facet_wrap( ~ Age,nrow=3)
dev.off()

xx <- filter(r1,Causes %in% transcauses[18],
        age==90) %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[c(8,15,9,16)]) %>%
    mutate(.,Total=count,
        Percentage=100*Total/Population,
        year = as.numeric(gsub('*','',year,fixed=TRUE)),
        Causes=factor(Causes)) %>%
    select(.,-count,-Age,-age,-Population,-Total) %>%
   reshape(.,direction='wide',timevar='Causes',idvar='year')

names(xx) <- gsub('Percentage.','',names(xx))
for (i in 2:ncol(xx)) xx[,i]<- xx[,i] - predict(loess(xx[,i] ~ year,data=xx))

cor.test(xx$`Diseases of the respiratory organs`,
    xx$`Diseases of the circulatory system`)

Sunday, June 28, 2015

Deaths in the Netherlands by cause and age

I downloaded counts of deaths by age, year and major cause from the Dutch statistics site. In this post I make some plots to look at causes and changes over the years.

Data 

Data from CBS. I downloaded the data in Dutch, hence the first thing to do was provide some kind of translation. The coding used seems slightly different from the ICD-10 main categories (and has been alphabetically disordered compared to that). I used Google Translate and ICD-10 to obtain the translations.

Plots

Preparation

In the following I will be using both the percentage of the population and the percentage of deaths by age cohort. The percentage of deaths is needed because in some cohorts the percentages of deaths are much higher, thereby hiding anything happening in the other cohorts. In addition, I should mention that for visual purposes only the eight most important causes are used in the plots.

Young

It seems that most of the risks are associated with birth. In addition, these risks have steadily been decreasing. 
Looking at the age cohorts above 0 years, it seems accidents are most important. Most remarkable is a spike at 1953, which occurs for all four ages. After some consideration, I link this to the North Sea flood of 1953. It is remarkable that this is visible in the plot; it says a lot about how safe we are from accidents that it is. In the age category 15 to 20 there is also a relatively large bump during 1970 to 1975. This is more difficult to explain, but I suspect traffic, especially the moped: a light motorcycle which preferably would be boosted to run much faster than legally allowed. 1975 saw the requirement to wear a helmet. It was much hated at the time, but in hindsight I can see that the government felt compelled to do something and that it did have an effect.
Looking at the plots, it seems the next big cause is Neoplasms. This is not because these became more deadly, it is because accidents are getting under control.

Elderly

For the elderly, diseases of the circulatory system are the main cause, and decreasing quite a bit. The number of deaths from "Sympt., Abnormal clinical Observations" seems to decrease too. Since this seems to be a nice name for the 'other' category, this may reflect better diagnostics.
What is less visible is the increase in mental and behavioral disorders, especially after 1980 and at the oldest ages. It also seems that Neoplasms are decreasing very slowly.

Code

data reading

library(dplyr)
library(ggplot2)
txtlines <- readLines('Overledenen__doodsoo_170615161506.csv')
txtlines <- grep('Centraal',txtlines,value=TRUE,invert=TRUE) 
#txtlines[1:5]
#cat(txtlines[4])
r1 <- read.csv(sep=';',header=FALSE,
        col.names=c('Causes','Causes2','Age','year','aantal','count'),
        na.strings='-',text=txtlines[3:length(txtlines)]) %>%
    select(.,-aantal,-Causes2)
transcauses <- c(
    "Infectious and parasitic diseases",
    "Diseases of skin and subcutaneous",
    "Diseases musculoskeletal system and connective ",
    "Diseases of the genitourinary system",
    "Pregnancy, childbirth",
    "Conditions of perinatal period",
    "Congenital abnormalities",
    "Sympt., Abnormal clinical Observations",
    "External causes of death",
    "Neoplasms",
    "Illness of blood, blood-forming organs",
    "Endocrine, nutritional, metabolic illness",
    "Mental and behavioral disorders",
    "Diseases of the nervous system and sense organs",
    "Diseases of the circulatory system",
    "Diseases of the respiratory organs",
    "Diseases of the digestive organs",
    "Population",
    "Total all causes of death")
#cc <- cbind(transcauses,levels(r1$Causes))
#options(width=100)
levels(r1$Causes) <- transcauses
levels(r1$Age) <- 
    gsub('jaar','year',levels(r1$Age)) %>%
    gsub('tot','to',.) %>%
    gsub('of ouder','+',.) 

Preparation for plots

perc.of.death <- filter(r1,Causes=='Total all causes of death') %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[1:17]) %>%
    mutate(.,Percentage=100*count/Population,
        Causes = factor(Causes),
        year = as.numeric(gsub('*','',year,fixed=TRUE))
    )
perc.of.pop <- filter(r1,Causes=='Population') %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[1:17]) %>%
    mutate(.,Percentage=100*count/Population,
        Causes = factor(Causes),
        year = as.numeric(gsub('*','',year,fixed=TRUE))
    )

young

png('youngpop1.png')
tmp1 <- perc.of.pop %>% filter(.,Age %in% levels(perc.of.pop$Age)[c(1,2,11,3)],
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age,levels=levels(perc.of.pop$Age)[c(1,2,11,3)]),
        Causes =factor(Causes)) 
# select 'important' causes (those which at some point got over 15%)
group_by(tmp1,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp1) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')
dev.off()
###
png('youngpop2.png')
tmp1 <- perc.of.pop %>% filter(.,Age %in% levels(perc.of.pop$Age)[c(2,11,3,4)],
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age,levels=levels(perc.of.pop$Age)[c(2,11,3,4)]),
        Causes =factor(Causes)) 
# select 'important' causes (those which at some point got over 15%)
group_by(tmp1,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp1) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')
# https://en.wikipedia.org/wiki/North_Sea_flood_of_1953
dev.off()

old

png('oldpop.png')
tmp2 <- perc.of.pop %>% filter(.,Age %in% levels(perc.of.pop$Age)[18:21],
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age),
        Causes =factor(Causes)) 
group_by(tmp2,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp2) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')
dev.off()

png('oldpop2.png')
tmp2 <- perc.of.pop %>% 
    filter(.,
        Age %in% levels(perc.of.death$Age)[18:21],
        year>=1980,
        !is.na(Percentage)) %>%
    mutate(.,Age=factor(Age),
        Causes =factor(Causes)) 
group_by(tmp2,Causes)%>%
    summarize(.,mp = max(Percentage)) %>%
    mutate(.,rk=rank(-mp)) %>%
    merge(.,tmp2) %>%
    filter(.,rk<=8) %>%
    ggplot(.,
        aes(y=Percentage,x=year,col=Causes)) +
    geom_line()+
    guides(col=guide_legend(ncol=2)) + 
    facet_wrap( ~Age ) +
    theme(legend.position="bottom")+
    ylab('Percentage of Cohort')

dev.off()

Sunday, June 21, 2015

SAS PROC MCMC example 12 in R: Change point model

I restarted working my way through the PROC MCMC examples. The SAS manual describes this example: consider the data set from Bacon and Watts (1971), where $y_i$ is the logarithm of the height of the stagnant surface layer and the covariate $x_i$ is the logarithm of the flow rate of water.
It is a simple example. It provided no problems at all for STAN and JAGS. For LaplacesDemon, on the other hand, it took me quite some effort to obtain samples which seemed well behaved. I did not try MCMCpack, but noted that the function MCMCregressChange() uses a slightly different model. The section below shows the results first; at the bottom the code is given.
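For clarity, the model fitted in all three programs below is a two-phase linear regression joined at the change point $c_p$, with a normal likelihood:

$$y_i \sim N(\mu_i,\sigma^2), \qquad \mu_i = \begin{cases} \alpha + \beta_1 (x_i - c_p) & \text{if } x_i < c_p \\ \alpha + \beta_2 (x_i - c_p) & \text{if } x_i \ge c_p \end{cases}$$

with vague normal priors on $\alpha$, $\beta_1$, $\beta_2$ and uniform priors on $c_p$ and $\sigma^2$, as in the code at the bottom.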

Previous posts in the series of PROC MCMC examples programmed in R were: example 61.1: sampling from a known density; example 61.2: Box Cox transformation; example 61.5: Poisson regression; example 61.6: Non-Linear Poisson Regression; example 61.7: Logistic Regression Random-Effects Model; and example 61.8: Nonlinear Poisson Regression Multilevel Random-Effects Model.

Data

Data are read as below.
observed <-
'1.12  -1.39   1.12  -1.39   0.99  -1.08   1.03  -1.08
0.92  -0.94   0.90  -0.80   0.81  -0.63   0.83  -0.63
0.65  -0.25   0.67  -0.25   0.60  -0.12   0.59  -0.12
0.51   0.01   0.44   0.11   0.43   0.11   0.43   0.11
0.33   0.25   0.30   0.25   0.25   0.34   0.24   0.34
0.13   0.44  -0.01   0.59  -0.13   0.70  -0.14   0.70
-0.30   0.85  -0.33   0.85  -0.46   0.99  -0.43   0.99
-0.65   1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
    what=list(y=double(),x=double()),
    sep=' ')
stagnant <- as.data.frame(observed)

LaplacesDemon

I have been playing around with LaplacesDemon. There is actually a function LaplacesDemon.hpc which can use multiple cores. However, on my computer it seems more efficient to just use mclapply() from the parallel package and give the result the class 'demonoid.hpc' used by LaplacesDemon.hpc. Having said that, I again had quite some trouble getting LaplacesDemon to work well. In the end I used a combination of two calls to LaplacesDemon. The plot below shows selected samples after the first run: not good enough, but I do like this way of displaying the results of chains. It should be added that the plots looked similar for all parameters, but showing them all would give too much output for this blog. After the second call the results looked acceptable.

Call:
LaplacesDemon(Model = Model, Data = MyData, Initial.Values = apply(cc1$Posterior1,
    2, median), Covar = var(cc1$Posterior1), Iterations = 1e+05,
    Algorithm = "RWM")

Acceptance Rate: 0.2408
Algorithm: Random-Walk Metropolis
Covariance Matrix: (NOT SHOWN HERE; diagonal shown instead)
       alpha        beta1        beta2           cp           s2
4.920676e-04 2.199525e-04 3.753738e-04 8.680339e-04 6.122862e-08

Covariance (Diagonal) History: (NOT SHOWN HERE)
Deviance Information Criterion (DIC):
          All Stationary
Dbar -144.660   -144.660
pD      7.174      7.174
DIC  -137.486   -137.486
Initial Values:
        alpha         beta1         beta2            cp            s2
 0.5467048515 -0.4100040451 -1.0194586232  0.0166405998  0.0004800931

Iterations: 4e+05
Log(Marginal Likelihood): 56.92606
Minutes of run-time: 1.32
Model: (NOT SHOWN HERE)
Monitor: (NOT SHOWN HERE)
Parameters (Number of): 5
Posterior1: (NOT SHOWN HERE)
Posterior2: (NOT SHOWN HERE)
Recommended Burn-In of Thinned Samples: 0
Recommended Burn-In of Un-thinned Samples: 0
Recommended Thinning: 5
Specs: (NOT SHOWN HERE)
Status is displayed every 100 iterations
Summary1: (SHOWN BELOW)
Summary2: (SHOWN BELOW)
Thinned Samples: 40000
Thinning: 1


Summary of All Samples
                  Mean           SD         MCSE      ESS            LB
alpha     5.348239e-01 0.0244216567 2.999100e-04 11442.06  4.895229e-01
beta1    -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2    -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp        2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2        4.472644e-04 0.0001429674 1.383748e-06 16571.94  2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP        4.636511e+01 1.8939530321 2.470244e-02 10134.91  4.164313e+01
                Median            UB
alpha       0.53339024  5.842152e-01
beta1      -0.41996859 -3.903572e-01
beta2      -1.01387256 -9.815650e-01
cp          0.03110570  8.398674e-02
s2          0.00042101  8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP         46.76949458  4.888251e+01


Summary of Stationary Samples
                  Mean           SD         MCSE      ESS            LB
alpha     5.348239e-01 0.0244216567 2.999100e-04 11442.06  4.895229e-01
beta1    -4.196180e-01 0.0142422533 1.661658e-04 12726.60 -4.469688e-01
beta2    -1.013882e+00 0.0164892337 1.681833e-04 15191.59 -1.046349e+00
cp        2.855852e-02 0.0308177765 3.649332e-04 11945.66 -3.406306e-02
s2        4.472644e-04 0.0001429674 1.383748e-06 16571.94  2.474185e-04
Deviance -1.446602e+02 3.7879060637 4.940488e-02 10134.91 -1.496950e+02
LP        4.636511e+01 1.8939530321 2.470244e-02 10134.91  4.164313e+01
                Median            UB
alpha       0.53339024  5.842152e-01
beta1      -0.41996859 -3.903572e-01
beta2      -1.01387256 -9.815650e-01
cp          0.03110570  8.398674e-02
s2          0.00042101  8.006666e-04
Deviance -145.46896682 -1.352162e+02
LP         46.76949458  4.888251e+01

STAN

Stan did not give many problems for this analysis.
Inference for Stan model: smodel.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

         mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
Beta[1] -0.42    0.00 0.01 -0.45 -0.43 -0.42 -0.41 -0.39  1017 1.00
Beta[2] -1.01    0.00 0.02 -1.05 -1.02 -1.01 -1.00 -0.98  1032 1.00
Alpha    0.54    0.00 0.03  0.49  0.52  0.53  0.55  0.59   680 1.00
s2       0.00    0.00 0.00  0.00  0.00  0.00  0.00  0.00  1361 1.00
cp       0.03    0.00 0.03 -0.04  0.00  0.03  0.05  0.09   636 1.00
lp__    90.63    0.06 1.78 86.17 89.71 91.00 91.91 93.03   935 1.01

Samples were drawn using NUTS(diag_e) at Fri Jun 19 21:17:54 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

JAGS

Again no problems for Jags.
Inference for Bugs model at "/tmp/Rtmpy4a6C5/modeld4f6e9c9055.txt", fit using jags,
 4 chains, each with 10000 iterations (first 5000 discarded), n.thin = 5
 n.sims = 4000 iterations saved
          mu.vect sd.vect     2.5%      25%      50%      75%    97.5%  Rhat
alpha       0.534   0.027    0.479    0.518    0.533    0.552    0.586 1.040
beta[1]    -0.420   0.015   -0.450   -0.429   -0.420   -0.410   -0.389 1.013
beta[2]    -1.014   0.017   -1.049   -1.024   -1.014   -1.003   -0.980 1.023
cp          0.029   0.035   -0.038    0.006    0.032    0.051    0.100 1.037
s2          0.000   0.000    0.000    0.000    0.000    0.001    0.001 1.004
deviance -144.501   3.986 -149.634 -147.378 -145.432 -142.584 -134.378 1.021
         n.eff
alpha      160
beta[1]    380
beta[2]    290
cp         160
s2         710
deviance   290

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 7.9 and DIC = -136.6
DIC is an estimate of expected predictive error (lower deviance is better).

CODE used

# http://support.sas.com/documentation/cdl/en/statug/67523/HTML/default/viewer.htm#statug_mcmc_examples18.htm
# Example 61.12 Change Point Models
##############
#Data                                                   
##############
observed <-
'1.12  -1.39   1.12  -1.39   0.99  -1.08   1.03  -1.08
0.92  -0.94   0.90  -0.80   0.81  -0.63   0.83  -0.63
0.65  -0.25   0.67  -0.25   0.60  -0.12   0.59  -0.12
0.51   0.01   0.44   0.11   0.43   0.11   0.43   0.11
0.33   0.25   0.30   0.25   0.25   0.34   0.24   0.34
0.13   0.44  -0.01   0.59  -0.13   0.70  -0.14   0.70
-0.30   0.85  -0.33   0.85  -0.46   0.99  -0.43   0.99
-0.65   1.19'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
    what=list(y=double(),x=double()),
    sep=' ')
stagnant <- as.data.frame(observed)
#plot(y~x,data=stagnant)
##############
#LaplacesDemon                                                   
##############
library('LaplacesDemon')
library(parallel)

mon.names <- "LP"
parm.names <- c('alpha',paste('beta',1:2,sep=''),'cp','s2')

PGF <- function(Data) {
  x <-c(rnorm(5,0,1))
  x[4] <- runif(1,-1.3,1.1)
  x[5] <- runif(1,0,2)
  x
}
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    PGF=PGF,
    x=stagnant$x,
    y=stagnant$y)
#N<-1
Model <- function(parm, Data)
{
  alpha=parm[1]
  beta=parm[2:3]
  cp=parm[4]
  s2=parm[5]
  yhat <- alpha+(Data$x-cp)*beta[1+(Data$x>=cp)]
  LL <- sum(dnorm(Data$y,yhat,sd=sqrt(s2),log=TRUE))
  prior <- sum(dnorm(parm[1:3],0,1e3,log=TRUE))+
      dunif(cp,-1.3,1.1,log=TRUE)+
      dunif(s2,0,5,log=TRUE)
  LP=LL+prior
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
      yhat=yhat,
      parm=parm)
  return(Modelout)
}
Fit1 <- mclapply(1:4,function(i)
      LaplacesDemon(Model,
    Data=MyData,
    Iterations=100000,
    Algorithm='RWM',
    Covar=c(rep(.01,4),.00001),
    Initial.Values = c(.5,-.4,-1,.05,.001)) #Initial.Values 
)
class(Fit1) <- 'demonoid.hpc'
plot(Fit1,Data=MyData,Parms=c('alpha'))
cc1 <- Combine(Fit1,MyData)
#
Fit2 <- mclapply(1:4,function(i)
      LaplacesDemon(Model,
          Data=MyData,
          Iterations=100000,
          Algorithm='RWM',
          Covar=var(cc1$Posterior1),
          Initial.Values = apply(cc1$Posterior1,2,median)) #Initial.Values 
)
class(Fit2) <- 'demonoid.hpc'
#plot(Fit2,Data=MyData,Parms=c('alpha'))
cc2 <- Combine(Fit2,MyData)
cc2
##############
#STAN                                                   
##############
stanData <- list(
    N=nrow(stagnant),
    x=stagnant$x,
    y=stagnant$y)

library(rstan)
smodel <- '
    data {
    int <lower=1> N;
    vector[N] x;
    vector[N] y;
    }
    parameters {
    real Beta[2];
    real Alpha;
    real <lower=0> s2;
    real <lower=-1.3,upper=1.1> cp;
    }
    transformed parameters {
    vector[N] yhat;
    for (i in 1:N)
       yhat[i] <- Alpha+(x[i]-cp)*Beta[1+(x[i]>cp)];
    }
    model {
    y ~ normal(yhat,sqrt(s2));
    s2 ~ uniform(0,1e3);
    cp ~ uniform(-1.3,1.1);
    Alpha ~ normal(0,1000);
    Beta ~ normal(0,1000);
    }
    '
fstan <- stan(model_code = smodel,
    data = stanData,
    pars=c('Beta','Alpha','s2','cp'))
fstan
##############
#Jags                                                  
##############
library(R2jags)
jagsdata <- list(
    N=nrow(stagnant),
    x=stagnant$x,
    y=stagnant$y)

jagsm <- function() {
  for(i in 1:N) {
    yhat[i] <- alpha+(x[i]-cp)*beta[1+(x[i]>cp)]
    y[i] ~ dnorm(yhat[i],tau)
  }
  tau <- 1/s2
  s2 ~  dunif(0,1e3)
  cp ~ dunif(-1.3,1.1)
  alpha ~ dnorm(0,0.001)
  beta[1] ~ dnorm(0,0.001)
  beta[2] ~ dnorm(0,0.001)
}
params <- c('alpha','beta','s2','cp')
myj <-jags(model=jagsm,
    data = jagsdata,
    n.chains=4,
    n.iter=10000,
    parameters=params)
myj