Sunday, September 14, 2014

Trying a prefmap

Preference mapping is a key technique in sensory and consumer research. It links the sensory perception of products to the liking of those products and hence provides clues for the development of new, good-tasting products. Even though it is a key technique, how best to perform such an analysis has been a long-standing problem. In R the SensoMineR package provides a prefmap procedure. This post attempts to create such an analysis with Stan.

Data

Data are the cocktail data from the SensoMineR package. After converting the liking scores to a scale of 0 to 10, with 10 most liked, the product means are as follows (the conversion is sketched just after the table):
   means
1   5.03
2   3.02
3   5.42
4   3.55
5   5.67
6   5.74
7   3.84
8   3.75
9   4.17
10  4.26
11  3.20
12  3.88
13  5.98
14  3.95
15  6.47
16  4.90
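A minimal sketch of that conversion; it uses the same 10 - score flip that appears in the code section at the end of this post:

library(SensoMineR)
data(cocktail)
# hedo.cocktail: rows are products, columns are consumers;
# 10 - score puts 10 at the most liked end of the scale
round(data.frame(means = rowMeans(10 - hedo.cocktail)), 2)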


Model

The model is built upon my post of last week: Mapping products in a space. What is added is a consumer section. Each consumer's preference is modeled as an ideal point where liking is at its maximum, with points further away liked less and less. In mathematical terms the distance-dependent liking function is maxliking * exp(-distance * scale). The ideal point is captured by three numbers: its liking and its two coordinates. The scale parameter is, for now, common to all consumers. In addition there is a fair amount of code for the administration of all the parameters.
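As a small numerical illustration of that liking function (the numbers are made up for the example, not estimated):

maxliking <- 6                      # liking at the consumer's ideal point
scale <- 1                          # common decay parameter
ideal <- c(0.3, -0.2)               # hypothetical ideal point in the two-dimensional space
product <- c(1.1, 0.4)              # hypothetical product location
distance <- sqrt(sum((product - ideal)^2))
maxliking * exp(-distance * scale)  # predicted liking for this consumer and product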
model1 <- "
data {
        int<lower=0> ndescriptor;
        int<lower=0> nproduct;
        matrix[nproduct,ndescriptor] profile;
    int<lower=0> nconsumer;
    matrix[nproduct,nconsumer] liking;
}
parameters {
    row_vector[nproduct] prodsp1;
    row_vector[nproduct] prodsp2;
    real<lower=0,upper=1> sigma1;
    real<lower=0,upper=1> sigma2;
    matrix [nconsumer,2] optim;
    vector <lower=0> [nconsumer] maxima;
    real <lower=0> scale;
    real <lower=0> sliking;
}
transformed parameters {
   vector [ndescriptor] descrsp1;
   vector [ndescriptor] descrsp2;
     matrix[nproduct,ndescriptor] expected1;  
     matrix[nproduct,ndescriptor] expected2;  
     matrix[nproduct,ndescriptor] residual1;  
     vector[nproduct] distances;
     matrix[nproduct,nconsumer] likepred;


   descrsp1 <- profile'*prodsp1';
   expected1 <- (descrsp1*prodsp1)';
   residual1 <- profile-expected1;
   descrsp2 <- profile'*prodsp2';
   expected2 <- (descrsp2*prodsp2)';
   for (i in 1:nconsumer) {
      for (r in 1:nproduct) {
      distances[r] <- sqrt(square(prodsp1[r]-optim[i,1])+
                           square(prodsp2[r]-optim[i,2]));
      likepred[r,i] <- maxima[i]*exp(-distances[r]*scale);
      }
   }
}
model {  
     for (r in 1:nproduct) {
        prodsp1[r] ~ normal(0,1);
        prodsp2[r] ~ normal(0,1);
        for (c in 1:ndescriptor) {
           profile[r,c] ~ normal(expected1[r,c],sigma1);
           residual1[r,c] ~ normal(expected2[r,c],sigma2);
        }
        for (i in 1:nconsumer) {
           liking[r,i] ~ normal(likepred[r,i],sliking);
           optim[i,1] ~ normal(0,2);
           optim[i,2] ~ normal(0,2);
        }
    scale ~ normal(1,.1);
    maxima ~ normal(5,2);
    sliking ~ normal(2,1);
    }
}
generated quantities {
   vector [ndescriptor] descrspace1;
   vector [ndescriptor] descrspace2;
   row_vector [nproduct] prodspace1;
   row_vector [nproduct] prodspace2;
   matrix [nconsumer,2] optima;

   prodspace1 <-(
                     ((prodsp1[12]>0)*prodsp1)-
                     ((prodsp1[12]<0)*prodsp1)
                  );
   prodspace2 <-(
                     ((prodsp2[12]>0)*prodsp2)-
                     ((prodsp2[12]<0)*prodsp2)
                  ); 
   descrspace1 <-(
                     ((prodsp1[12]>0)*descrsp1)-
                     ((prodsp1[12]<0)*descrsp1)
                  );
   descrspace2 <-(
                     ((prodsp2[12]>0)*descrsp2)-
                     ((prodsp2[12]<0)*descrsp2)
                  ); 
   for (i in 1:nconsumer) {
      optima[i,1] <- (
                        ((prodsp1[12]>0)*optim[i,1])-
                        ((prodsp1[12]<0)*optim[i,1])
                     );
      optima[i,2] <- (
                        ((prodsp2[12]>0)*optim[i,2])-
                        ((prodsp2[12]<0)*optim[i,2])
                     );
   }
}
"

Analysis results

Sensominer's result

For comparison, here is the plot resulting from SensoMineR's carto() function; I have followed the parameter settings from the SensoMineR package to get this plot. Color is liking, numbered dots are products. The blue zone is best liked, as can be seen from the products with the highest means residing there.

New method

In the plot the blue dots are samples of ideal points, the bigger black numbers are the locations of the products and the smaller red numbers are the consumers' ideal points.
This is different from the SensoMineR map: the consumers have pulled well-liked products such as 13 and 15 to the center. I suspect that in this analysis the consumers' preferences have overruled most of the information from the sensory space. Given that, I will be splitting the consumers into groups.

Three groups of consumers

Three groups of consumers were created via k-means clustering. From a sensory and consumer-insight point of view the clusters may describe three different ways of experiencing these particular products. Obviously a clustering on demographics or marketing segments might be equally valid, but I don't have that information. The cluster sizes are 15, 52 and 33 respectively.

Cluster 1

This cluster is characterized by a liking for products 8 to 11. Compared to the all-consumer space, this cluster does not like products 13 and 15 as much, and does not dislike products 4 and 12 as much.

Cluster 2

These are the bulk of the consumers, and the all-consumer result appears here in more pronounced form. However, product 1 has shifted quite a distance towards the liked region.

Cluster 3

This plot is again fairly similar to the all-consumer plot. What is noticeable here is that there is a void in the center: the center of the most liked region is not occupied.

Next Steps

There are still some things to improve in this approach: better tuning of the various priors in the model; modeling the range of each consumer's liking rather than solely its maximum; possibly making the scale parameter subject dependent; perhaps a better way to extract the dimensions from the sensory space, thereby avoiding the Jacobian warning and using estimated standard deviations of the sensory profiling data; and finally, improved graphics.

Code

# Reading and first map

# senso.cocktail
# hedo.cocktail
library(SensoMineR)
data(cocktail)
res.pca <- PCA(senso.cocktail,graph=FALSE)
# SensoMineR does a dev.new for each graph, hence captured like this.
dev.new <- function() png('carto.png')
res.carto <- carto(res.pca$ind$coord[,1:2],
    graph.tree=FALSE,
    graph.corr=FALSE,
    hedo.cocktail)
dev.off()
# reset default graph settings
rm(dev.new)
dev.new()

# model

library(rstan)
nprod <- 16
ndescr <- 13
nconsumer <- 100
sprofile <- as.matrix(scale(senso.cocktail))
datain <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=nconsumer,
    liking = as.matrix(10-hedo.cocktail[,1:nconsumer])
)
data.frame(means=rowMeans(10-hedo.cocktail)  )

model1 <- "
data {
        int<lower=0> ndescriptor;
        int<lower=0> nproduct;
        matrix[nproduct,ndescriptor] profile;
    int<lower=0> nconsumer;
    matrix[nproduct,nconsumer] liking;
}
parameters {
    row_vector[nproduct] prodsp1;
    row_vector[nproduct] prodsp2;
    real<lower=0,upper=1> sigma1;
    real<lower=0,upper=1> sigma2;
    matrix [nconsumer,2] optim;
    vector <lower=0> [nconsumer] maxima;
    real <lower=0> scale;
    real <lower=0> sliking;
}
transformed parameters {
   vector [ndescriptor] descrsp1;
   vector [ndescriptor] descrsp2;
     matrix[nproduct,ndescriptor] expected1;  
     matrix[nproduct,ndescriptor] expected2;  
     matrix[nproduct,ndescriptor] residual1;  
     vector[nproduct] distances;
     matrix[nproduct,nconsumer] likepred;


   descrsp1 <- profile'*prodsp1';
   expected1 <- (descrsp1*prodsp1)';
   residual1 <- profile-expected1;
   descrsp2 <- profile'*prodsp2';
   expected2 <- (descrsp2*prodsp2)';
   for (i in 1:nconsumer) {
      for (r in 1:nproduct) {
      distances[r] <- sqrt(square(prodsp1[r]-optim[i,1])+
                           square(prodsp2[r]-optim[i,2]));
      likepred[r,i] <- maxima[i]*exp(-distances[r]*scale);
      }
   }
}
model {  
     for (r in 1:nproduct) {
        prodsp1[r] ~ normal(0,1);
        prodsp2[r] ~ normal(0,1);
        for (c in 1:ndescriptor) {
           profile[r,c] ~ normal(expected1[r,c],sigma1);
           residual1[r,c] ~ normal(expected2[r,c],sigma2);
        }
        for (i in 1:nconsumer) {
           liking[r,i] ~ normal(likepred[r,i],sliking);
           optim[i,1] ~ normal(0,2);
           optim[i,2] ~ normal(0,2);
        }
    scale ~ normal(1,.1);
    maxima ~ normal(5,2);
    sliking ~ normal(2,1);
    }
}
generated quantities {
   vector [ndescriptor] descrspace1;
   vector [ndescriptor] descrspace2;
   row_vector [nproduct] prodspace1;
   row_vector [nproduct] prodspace2;
   matrix [nconsumer,2] optima;

   prodspace1 <-(
                     ((prodsp1[12]>0)*prodsp1)-
                     ((prodsp1[12]<0)*prodsp1)
                  );
   prodspace2 <-(
                     ((prodsp2[12]>0)*prodsp2)-
                     ((prodsp2[12]<0)*prodsp2)
                  ); 
   descrspace1 <-(
                     ((prodsp1[12]>0)*descrsp1)-
                     ((prodsp1[12]<0)*descrsp1)
                  );
   descrspace2 <-(
                     ((prodsp2[12]>0)*descrsp2)-
                     ((prodsp2[12]<0)*descrsp2)
                  ); 
   for (i in 1:nconsumer) {
      optima[i,1] <- (
                        ((prodsp1[12]>0)*optim[i,1])-
                        ((prodsp1[12]<0)*optim[i,1])
                     );
      optima[i,2] <- (
                        ((prodsp2[12]>0)*optim[i,2])-
                        ((prodsp2[12]<0)*optim[i,2])
                     );
   }
}
"

pars <- c('prodspace1','prodspace2','optima','scale','maxima')

fit <- stan(model_code = model1,
    data = datain,
    pars=pars)

# plotting

fitsamps <- as.data.frame(fit)

# combiplot: reshape the sampled product coordinates and consumer ideal points to
# long format, then plot ideal-point samples (blue dots), consumer ideal-point
# means (red labels) and product means (black labels)
combiplot <- function(fitsamps,datain,labs) {
    prod <- reshape(fitsamps,
        drop=names(fitsamps)[33:ncol(fitsamps)],
        direction='long',
        varying=list(names(fitsamps)[1:16],
            names(fitsamps)[17:32]),
        timevar='sample',
        times=1:16,
        v.names=c('PDim1','PDim2')
    )
        sa <- sapply(1:16,function(x)
            c(sample=x,
                Dim1=mean(prod$PDim1[prod$sample==x]),
                Dim2=mean(prod$PDim2[prod$sample==x])))
    sa <- as.data.frame(t(sa))
   
    optimindex <- grep('optima',names(fitsamps))
    noptim <- datain$nconsumer
    loc <- reshape(fitsamps,
        drop=names(fitsamps)[(1:ncol(fitsamps))[-optimindex]],
        direction='long',
        varying=list(names(fitsamps)[optimindex[1:noptim]],
            names(fitsamps)[optimindex[(1:noptim)+noptim]]),
        timevar='subject',
        times=1:noptim,
        v.names=c('Dim1','Dim2')
    )
    locx <- loc[sample(nrow(loc),60000),]
    plot(locx$Dim1,locx$Dim2,
        col='blue',
        pch=46,
        cex=2,
        xlim=c(-1,1)*.7,
        ylim=c(-1,1)*.7)
    sa2 <- sapply(1:noptim,function(x)
            c(sample=x,
                Dim1=mean(loc$Dim1[loc$subject==x]),
                Dim2=mean(loc$Dim2[loc$subject==x])))
    sa2 <- as.data.frame(t(sa2))
    text(x=sa2$Dim1,y=sa2$Dim2,labels=labs,cex=.8,col='red')
    text(x=sa$Dim1,y=sa$Dim2,labels=sa$sample,cex=1.5)
    invisible(fitsamps)
}

combiplot(fitsamps,datain,1:100)

# three clusters

tlik <- t(scale(hedo.cocktail))
km <- kmeans(tlik,centers=3)
table(km$cluster)


datain1 <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=sum(km$cluster==1),
    liking = as.matrix(10-hedo.cocktail[,km$cluster==1])
)
fit1 <- stan(model_code = model1,
    data = datain1,
    fit=fit,
    pars=pars)

fitsamps1 <- as.data.frame(fit1)
#

datain2 <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=sum(km$cluster==2),
    liking = as.matrix(10-hedo.cocktail[,km$cluster==2])
)
fit2 <- stan(model_code = model1,
    data = datain2,
    fit=fit,
    pars=pars)

fitsamps2 <- as.data.frame(fit2)
##
datain3 <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile,
    nconsumer=sum(km$cluster==3),
    liking = as.matrix(10-hedo.cocktail[,km$cluster==3])
)
fit3 <- stan(model_code = model1,
    data = datain3,
    fit=fit,
    pars=pars)

fitsamps3 <- as.data.frame(fit3)
combiplot(fitsamps1,datain1,which(km$cluster==1))
combiplot(fitsamps2,datain2,which(km$cluster==2))
combiplot(fitsamps3,datain3,which(km$cluster==3))

Sunday, September 7, 2014

Mapping products in a space

I have read about people doing a Bayesian PCA at some point and always wondered how that would work. Then, at some point, I thought of a way to do so. As the idea evolved, my interest shifted from PCA as such to a prefmap. As a first step towards that, this post contains the mapping from a sensory space to a two-dimensional space. For a prefmap this step is commonly done via PCA.

Data

Data are the cocktail data from the SensoMineR package.

Algorithm

The calculation is mostly inspired by the many PLS algorithms to which I was exposed when I was doing chemometrics. Scores and loadings can be obtained from each other by multiplying with the data matrix. In this case that means I just take a set of product scores and obtain the associated descriptor loadings via a simple matrix multiplication. The resulting product and descriptor vectors can be used to reconstruct the original matrix; the best solution minimizes the difference between the reconstructed and original data. For dimension two, subtract the reconstructed data from the original data and repeat on the residuals (a sketch in R follows).
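A minimal R sketch of that mechanism outside Stan; the scores here are just random starting values, so this only illustrates the score/loading multiplication and the deflation step, not the actual estimation:

library(SensoMineR)
data(cocktail)
X <- as.matrix(scale(senso.cocktail, scale = FALSE))  # centered profile, as used below
scores1   <- rnorm(nrow(X))                  # a trial set of product scores
loadings1 <- as.vector(t(X) %*% scores1)     # descriptor loadings via matrix multiplication
expected1 <- outer(scores1, loadings1)       # rank-one reconstruction of the data
residual1 <- X - expected1                   # dimension two repeats on these residuals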

Scaling

PCA offers the possibility of having unit-length scores or unit-length loadings, R and Q mode if that is your favorite jargon. Seen from a singular value decomposition point of view, it is just a question of where the singular values go. At this point I made the choice to put them in the variable space.
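In singular value decomposition terms (a sketch, with X the centered profile from the previous snippet), the choice is simply which side absorbs the singular values:

sv <- svd(X)                                     # X = U diag(d) V'
scores_unit   <- sv$u[, 1:2]                     # unit-length product scores
loadings_var  <- sv$v[, 1:2] %*% diag(sv$d[1:2]) # singular values put in the variable space
scores_var    <- sv$u[, 1:2] %*% diag(sv$d[1:2]) # alternatively the scores carry them
loadings_unit <- sv$v[, 1:2]                     # and the loadings have unit length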

Unique solution

PCA is known not to have a unique solution; each solution is equivalent to its mirror image. It seemed most elegant to handle this completely at the end. After inspection of the data, the location of product 12 seemed suitable for making the solution unique, since it is extreme on both dimensions. The final step (generated quantities) forces that location into the top-right quadrant for the reported results.
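The generated quantities block below implements this as a sign flip keyed on product 12. In plain R the equivalent post-processing of a single posterior draw would be (prodsp1, prodsp2, descrsp1 and descrsp2 being the values of that draw):

flip1 <- sign(prodsp1[12])       # +1 or -1, depending on the draw
flip2 <- sign(prodsp2[12])
prodspace1  <- flip1 * prodsp1   # product 12 ends up with positive coordinates
prodspace2  <- flip2 * prodsp2
descrspace1 <- flip1 * descrsp1  # descriptors are flipped consistently
descrspace2 <- flip2 * descrsp2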

Code

library(rstan)
nprod <- 16
ndescr <- 13
sprofile <- as.matrix(scale(senso.cocktail,scale=FALSE))
datain <- list(
    nproduct=nprod,
    ndescriptor=ndescr,
    profile=sprofile
    )
   
model1 <- "
data {
        int<lower=0> ndescriptor;
        int<lower=0> nproduct;
        matrix[nproduct,ndescriptor] profile;
}
parameters {
    row_vector[nproduct] prodsp1;
    row_vector[nproduct] prodsp2;
    real<lower=0,upper=1> sigma1;
    real<lower=0,upper=1> sigma2;
}
transformed parameters {
   vector [ndescriptor] descrsp1;
   vector [ndescriptor] descrsp2;
   matrix[nproduct,ndescriptor] expected1;  
   matrix[nproduct,ndescriptor] expected2;  
   matrix[nproduct,ndescriptor] residual1;  

   descrsp1 <- profile'*prodsp1';
   expected1 <- (descrsp1*prodsp1)';
   residual1 <- profile-expected1;
   descrsp2 <- profile'*prodsp2';
   expected2 <- (descrsp2*prodsp2)';
}
model {  
     for (r in 1:nproduct) {
        prodsp1[r] ~ normal(0,1);
        prodsp2[r] ~ normal(0,1);
        for (c in 1:ndescriptor) {
           profile[r,c] ~ normal(expected1[r,c],sigma1);
           residual1[r,c] ~ normal(expected2[r,c],sigma2);
        }
     }
}
generated quantities {
   vector [ndescriptor] descrspace1;
   vector [ndescriptor] descrspace2;
   row_vector [nproduct] prodspace1;
   row_vector [nproduct] prodspace2;
   prodspace1 <-(
                     ((prodsp1[12]>0)*prodsp1)-
                     ((prodsp1[12]<0)*prodsp1)
                  );
   prodspace2 <-(
                     ((prodsp2[12]>0)*prodsp2)-
                     ((prodsp2[12]<0)*prodsp2)
                  ); 
   descrspace1 <-(
                     ((prodsp1[12]>0)*descrsp1)-
                     ((prodsp1[12]<0)*descrsp1)
                  );
   descrspace2 <-(
                     ((prodsp2[12]>0)*descrsp2)-
                     ((prodsp2[12]<0)*descrsp2)
                  ); 
}
"
pars <- c('prodspace1','prodspace2','descrspace1','descrspace2')
fit1 <- stan(model_code = model1,
    data = datain,
    pars=pars)

Results

For comparison, first a standard biplot.

Product space

It is not difficult to extract the samples and plot them; see the end of this post. One notable property of the plot is that the product samples form ellipses with the minor axis pointing towards the center. Apparently part of the variation between MCMC samples is rotational freedom between the dimensions. Other than that the solution is actually pretty close to the PCA.

Descriptor space

The rotational freedom is even clearer here.

Additional code

data

library(SensoMineR)
data(cocktail)

biplot

pr <- prcomp(senso.cocktail) 
plot(pr)
biplot(pr)

product plot

fit1samps <- as.data.frame(fit1)

prod <- reshape(fit1samps,
    drop=names(fit1samps)[33:59],
    direction='long',
    varying=list(names(fit1samps)[1:16],
        names(fit1samps)[17:32]),
    timevar='sample',
    times=1:16,
    v.names=c('PDim1','PDim2')
)
   
prod <- prod[order(prod$PDim1),]
plot(prod$PDim1,prod$PDim2,
    col=c(2,17,3,4,6,5,7:10,13,12,11,14:16)[prod$sample],
    pch=46,
    cex=2,
    xlim=c(-1,1)*.75,
    ylim=c(-1,1)*.75)
sa <- sapply(1:16,function(x)
        c(sample=x,
            Dim1=mean(prod$PDim1[prod$sample==x]),
            Dim2=mean(prod$PDim2[prod$sample==x])))
sa <- as.data.frame(t(sa))
text(x=sa$Dim1,y=sa$Dim2,labels=sa$sample,cex=1.5)

descriptor plot

descr <- reshape(fit1samps,
    drop=names(fit1samps)[c(1:32,59)],
    direction='long',
    varying=list(names(fit1samps)[33:45],
        names(fit1samps)[46:58]),
    timevar='sample',
    times=1:13,
    v.names=c('DDim1','DDim2')
)

descr <- descr[order(descr$DDim1),]
plot(descr$DDim1,descr$DDim2,
    col=c(2,1,3:13)[descr$sample],
    pch=46,
    cex=2,
    xlim=c(-1,1)*9,
    ylim=c(-1,1)*9)
sa <- sapply(1:13,function(x)
        c(sample=x,
            Dim1=mean(descr$DDim1[descr$sample==x]),
            Dim2=mean(descr$DDim2[descr$sample==x])))
sa <- as.data.frame(t(sa))
text(x=sa$Dim1,y=sa$Dim2,labels=names(senso.cocktail))

Sunday, August 31, 2014

Beta binomial revisited

I got some comments on the priors in last week's post, in which better priors were proposed. Hence this week's post looks at them. After some minor adaptations, needed because I have nine beta parameters and beta has to be fairly large, I can conclude that these priors provide significant improvements. Especially the model which has p(alpha, beta) proportional to (alpha + beta)^(-5/2) worked extremely well.

Priors

There were two proposals. A common step was 'parameterizing in terms of mean theta = alpha / (alpha + beta) and total count kappa = alpha + beta'. In my own data there are nine regions, hence nine sets of theta, kappa, alpha and beta. As hyperprior over the nine regions' incidences I chose a normal distribution for theta. The difference between the proposals is in kappa. I put no hyperprior over the nine values of kappa.
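The reparameterization itself is just this mapping back and forth (a sketch with arbitrary numbers):

theta <- 1e-6; kappa <- 2e6              # mean and 'total count'
a <- kappa * theta                       # as in the transformed parameters blocks below
b <- kappa * (1 - theta)
c(theta = a / (a + b), kappa = a + b)    # recovers theta and kappa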

Prior: Uniform on log scale

'Andrew (...) suggests taking the priors on alpha and beta to be uniform on the log scale.' In code this means log kappa is uniformly distributed.

Prior: p(alpha,beta) proportional to (alpha + beta)^(-5/2)

Apparently in Bayesian Data Analysis 3 it is proposed to make p(alpha, beta) proportional to (alpha + beta)^(-5/2). This is handled via a Pareto distribution on kappa.
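The connection is that a Pareto distribution with shape 1.5 has density proportional to kappa^-(1.5+1) = kappa^(-5/2). A quick numerical check (the lower bound 0.1 matches kappa2 in the sampling program below):

kappa_min <- 0.1                                     # lower bound of the Pareto
shape <- 1.5
kappa <- c(1, 10, 100)
dens <- shape * kappa_min^shape / kappa^(shape + 1)  # Pareto density
dens / kappa^(-5/2)                                  # constant ratio: proportional to kappa^(-5/2)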

Sampling kappa priors

Via a small program I had a look at what this means in terms of samples. The uniform on the log scale seemed reasonable even for the range of my beta parameters. On the other hand, the p(alpha, beta) proportional to (alpha + beta)^(-5/2) proposal kept kappa far too low for the data at hand. This seems informative for a different range of kappa, and hence beta, rather than uninformative. Therefore I also made a variation with a much higher lower bound (kappa3 below).
modelk <- '
parameters {
    real kappa_log1;
    real kappa2;
    real <lower=10^6> kappa3;
}
transformed parameters {
    real kappa1;
    kappa1 <- exp(kappa_log1);
}
model {      
    kappa_log1 ~ uniform(0,18);
    kappa2 ~ pareto(0.1,1.5);
    kappa3 ~ pareto(10^6,1.5);
}
'
parametersk <- c('kappa1','kappa2','kappa3')
initsk <- function() {
    list(
        kappa_log1=runif(1,0,18)
        , kappa2 = runif(1,.2,10)
        , kappa3 = runif(1,2e6,1e7)
    )
}

kfits <- stan(model_code = modelk,
        pars=parametersk,
        inits=initsk)
kfits

Inference for Stan model: modelk.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

             mean   se_mean         sd       2.5%        25%        50%
kappa1 1994621.64 550416.10 7754182.44       1.57      82.85    4772.51
kappa2       0.24      0.02       0.27       0.10       0.12       0.16
kappa3 2602295.43 296845.16 4314974.09 1021790.77 1209898.43 1558773.41
lp__       -18.81      0.16       1.65     -23.11     -19.64     -18.41
              75%       97.5% n_eff Rhat
kappa1  160223.83 25951765.02   198 1.03
kappa2       0.25        0.89   123 1.03
kappa3 2490623.45 11991112.38   211 1.02
lp__       -17.59      -16.83   112 1.03

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:12:51 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Effect in the whole model

Prior: Uniform on log scale 

This works reasonably well. With significantly fewer samples than last week there are reasonable estimates. The only worrying thing is that incidence[1] (it was named betamean[1] in the previous models, but that was actually a bit of a misnomer) is a bit too high.
 model1 <- '
    data {
       int<lower=0> n;
       int<lower=0> nregion;
       int count[n];
       int<lower=1,upper=nregion> Region[n];
       int Population[n];
       real scale;
        }
    parameters {
       vector <lower=0,upper=1> [nregion]  theta;
       vector [nregion]  kappa_log;
       vector <lower=0,upper=1> [n] p1;
       real <lower=0,upper=1> thetam;
       real <lower=0> thetasd;
        }
  transformed parameters {
       vector [nregion] kappa;
       vector [nregion] a;
       vector [nregion] b;

       kappa <- exp(kappa_log);
       for (i in 1:nregion) {
          a[i] <- kappa[i] * theta[i];
          b[i] <- kappa[i] * (1 - theta[i]);
       }
  }
  model {     
        kappa_log ~ uniform(12,18);
        for (i in 1:n) {
            p1[i] ~ beta(a[Region[i]],b[Region[i]]);
        }
        count ~ binomial(Population,p1);
        theta ~ normal(thetam,thetasd);
        thetam ~ uniform(0,1);
        thetasd ~ uniform(0,.1);      
        }
  generated quantities {
        vector [n] pp;
        vector [nregion] incidence;
        real meaninc;
        incidence <- scale*theta;
        pp <- p1 * scale;
        meaninc <- scale*thetam;
        }
'

inits1 <- function()
    list(theta=rnorm(datain$nregion,1e-6,1e-7),
         thetam = rnorm(1,1e-6,1e-7),
         thetasd = rnorm(1,1e-6,1e-7),
         kappa_log=runif(datain$nregion,15,18),
         p1=rnorm(datain$n,1e-7,1e-8))

parameters1 <- c('a','b','incidence','kappa','kappa_log','meaninc')

fits1 <- stan(model_code = model1,
    data = datain,
    pars=parameters1    ,
    init=inits1)
print(fits1,probs=NA)

Inference for Stan model: model1. 
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean    se_mean          sd    n_eff Rhat
a[1]                7.95       1.32       13.67 NA   107 1.04
a[2]                9.16       1.43       16.76 NA   138 1.02
a[3]               50.52       2.61       32.13 NA   151 1.03
a[4]               35.68       2.25       32.61 NA   210 1.01
a[5]                9.10       1.32       13.99 NA   113 1.04
a[6]               11.99       1.48       17.40 NA   139 1.01
a[7]               13.64       1.83       20.56 NA   126 1.01
a[8]               14.19       2.56       20.35 NA    63 1.05
a[9]               12.14       2.51       21.39 NA    73 1.05
b[1]          5956080.76 1213799.11 10903417.54 NA    81 1.04
b[2]          5293475.42  844235.36  9984631.29 NA   140 1.01
b[3]         28376424.43 1399820.31 17900739.23 NA   164 1.03
b[4]         19757543.83 1195288.60 17665183.92 NA   218 1.02
b[5]          5493903.33  897750.46  9136674.30 NA   104 1.05
b[6]          6449567.62  787646.39  9190268.32 NA   136 1.01
b[7]          8500594.37 1172609.55 12531293.24 NA   114 1.01
b[8]          8901106.72 1698740.11 12644831.14 NA    55 1.06
b[9]          7267575.15 1406258.66 12121859.82 NA    74 1.05
incidence[1]        0.10       0.00        0.02 NA    84 1.07
incidence[2]        0.11       0.00        0.02 NA   177 1.02
incidence[3]        0.11       0.00        0.01 NA   153 1.01
incidence[4]        0.11       0.00        0.01 NA   178 1.02
incidence[5]        0.11       0.00        0.02 NA   216 1.01
incidence[6]        0.12       0.00        0.02 NA   121 1.03
incidence[7]        0.10       0.00        0.02 NA   161 1.02
incidence[8]        0.10       0.00        0.02 NA    81 1.06
incidence[9]        0.10       0.00        0.02 NA   119 1.04
kappa[1]      5956088.71 1213800.44 10903430.55 NA    81 1.04
kappa[2]      5293484.58  844236.78  9984647.90 NA   140 1.01
kappa[3]     28376474.95 1399822.89 17900770.95 NA   164 1.03
kappa[4]     19757579.51 1195290.81 17665216.05 NA   218 1.02
kappa[5]      5493912.43  897751.76  9136688.06 NA   104 1.05
kappa[6]      6449579.61  787647.86  9190285.60 NA   136 1.01
kappa[7]      8500608.01 1172611.40 12531313.60 NA   114 1.01
kappa[8]      8901120.91 1698742.67 12644851.22 NA    55 1.06
kappa[9]      7267587.30 1406261.16 12121881.04 NA    74 1.05
kappa_log[1]       14.56       0.18        1.40 NA    60 1.08
kappa_log[2]       14.37       0.13        1.42 NA   122 1.01
kappa_log[3]       16.88       0.07        0.87 NA   167 1.03
kappa_log[4]       16.28       0.08        1.15 NA   210 1.01
kappa_log[5]       14.76       0.11        1.17 NA   119 1.03
kappa_log[6]       15.10       0.08        1.03 NA   153 1.01
kappa_log[7]       15.08       0.17        1.36 NA    61 1.03
kappa_log[8]       15.12       0.22        1.38 NA    41 1.08
kappa_log[9]       14.76       0.17        1.45 NA    70 1.04
meaninc             0.11       0.00        0.01 NA   107 1.02
lp__            -7659.18       1.14       11.63 NA   105 1.01

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:24:52 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

Prior: p(alpha,beta) proportional to (alpha + beta)^(-5/2)


This model did quite well. I wish all my models worked as well.
model2 <- '
data {
    int<lower=0> n;
    int<lower=0> nregion;
    int count[n];
    int<lower=1,upper=nregion> Region[n];
    int Population[n];
    real scale;
}
parameters {
    vector <lower=0,upper=1> [nregion]  theta;
    vector <lower=1e6> [nregion]  kappa; 
    vector <lower=0,upper=1> [n] p1;
    real <lower=0,upper=1> thetam;
    real <lower=0> thetasd;
}
transformed parameters {
    vector [nregion] a;
    vector [nregion] b;
    for (i in 1:nregion) {
        a[i] <- kappa[i] * theta[i];
        b[i] <- kappa[i] * (1 - theta[i]);
    }
   
}
model {      
    kappa ~ pareto(10^6,1.5);
    for (i in 1:n) {
        p1[i] ~ beta(a[Region[i]],b[Region[i]]);
    }
    count ~ binomial(Population,p1);
    theta ~ normal(thetam,thetasd);
    thetam ~ uniform(0,1);
    thetasd ~ uniform(0,.1);       
}
generated quantities {
    vector [n] pp;
    vector [nregion] incidence;
    real meaninc;
    incidence <- scale*theta;
    pp <- p1 * scale;
    meaninc <- scale*thetam;
}
'

inits2 <- function()
    list(theta=rnorm(datain$nregion,1e-6,1e-7),
        thetam = rnorm(1,1e-6,1e-7),
        thetasd = rnorm(1,1e-6,1e-7),
        kappa=runif(datain$nregion,1e6,1e7),
        p1=rnorm(datain$n,1e-7,1e-8))

parameters2 <- c('a','b','incidence','kappa','meaninc')

fits2 <- stan(model_code = model2,
    data = datain,
    pars=parameters2,
    init=inits2)

print(fits2,probs=NA)

Inference for Stan model: model2.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

                    mean   se_mean          sd    n_eff Rhat
a[1]                3.03      0.06        2.54 NA  1851    1
a[2]                3.36      0.05        2.74 NA  2746    1
a[3]               22.28      1.39       39.52 NA   807    1
a[4]                7.27      0.27       12.39 NA  2173    1
a[5]                3.61      0.08        3.50 NA  2025    1
a[6]                4.38      0.09        3.70 NA  1743    1
a[7]                3.58      0.13        4.14 NA   957    1
a[8]                3.54      0.11        3.60 NA  1125    1
a[9]                3.20      0.10        3.81 NA  1346    1
b[1]          2132903.53  47586.20  2085059.70 NA  1920    1
b[2]          1804395.74  29404.95  1540741.77 NA  2745    1
b[3]         12482567.37 782922.81 22226255.17 NA   806    1
b[4]          4057579.18 146369.50  6818624.58 NA  2170    1
b[5]          2080275.79  44577.27  2020891.35 NA  2055    1
b[6]          2368569.05  48762.30  1922950.01 NA  1555    1
b[7]          2281906.54  81326.60  2583506.11 NA  1009    1
b[8]          2319737.52  71109.80  2408437.85 NA  1147    1
b[9]          2092476.48  61180.23  2313166.69 NA  1430    1
incidence[1]        0.09      0.00        0.02 NA   746    1
incidence[2]        0.12      0.00        0.02 NA  1173    1
incidence[3]        0.11      0.00        0.02 NA  1209    1
incidence[4]        0.11      0.00        0.02 NA  1387    1
incidence[5]        0.11      0.00        0.02 NA  1406    1
incidence[6]        0.12      0.00        0.02 NA  1248    1
incidence[7]        0.10      0.00        0.02 NA   999    1
incidence[8]        0.10      0.00        0.02 NA  1063    1
incidence[9]        0.10      0.00        0.02 NA   935    1
kappa[1]      2132906.56  47586.25  2085062.04 NA  1920    1
kappa[2]      1804399.10  29405.01  1540744.41 NA  2745    1
kappa[3]     12482589.65 782924.19 22226294.41 NA   806    1
kappa[4]      4057586.45 146369.76  6818636.82 NA  2170    1
kappa[5]      2080279.40  44577.35  2020894.78 NA  2055    1
kappa[6]      2368573.43  48762.38  1922953.62 NA  1555    1
kappa[7]      2281910.12  81326.73  2583510.17 NA  1009    1
kappa[8]      2319741.06  71109.91  2408441.36 NA  1147    1
kappa[9]      2092479.69  61180.33  2313170.41 NA  1430    1
meaninc             0.11      0.00        0.01 NA   901    1
lp__            -7881.32      0.42        8.66 NA   428    1

Samples were drawn using NUTS(diag_e) at Sat Aug 30 14:34:49 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).


Sunday, August 24, 2014

JAGS and Stan

During the last year I have been running some estimations in both JAGS and Stan. In that period I have seen one example where JAGS could not get me decent samples (in the sense of low Rhat and a high number of effective samples), but that was data I could not blog about. When, two weeks ago, I had a problem where part of my model did not converge well in JAGS, I wondered how Stan would fare. Hence this post. It appears that Stan did not really do much better. What did appear is that the results for this kind of difficult problem can vary depending on the inits and the random samples used in the chain. This probably means that more samples would help, but that is not the topic of this post.

Programs

I expect most readers of this blog to know about both Stan and JAGS, but a few lines about them seem not amiss. Stan and JAGS can be used for the same kinds of problems, but they are quite different. JAGS is a variation on BUGS, similar to WinBUGS and OpenBUGS, in which a model just states relations between variables. Stan, on the other hand, is a program where a model has clearly defined parts and the order of statements matters. Stan is compiled, which takes some time by itself. Both Stan and JAGS can be run by themselves, but I find it most convenient to run them from R. R is then used for pre-processing the data, setting up the model and finally summarizing the samples. Because JAGS and Stan are so different, they need completely different numbers of MCMC samples. Stan is supposed to be more efficient, hence needing fewer samples to obtain a posterior of similar quality.
From a model development point of view, JAGS (rjags, R2jags) is slightly better integrated in R than Stan (Rstan), mostly because a JAGS model can pretend to be an R function, which means my editor will lend a hand, while Rstan has its model just in a character vector. In addition, JAGS has no compilation time. The plus of Stan, though, is its highly organized model code.

Models

The model describes the number of shootings per state, hierarchically under regions. This means there is a binomial probability of interest for each state, under beta-distributed regions. The beta has uninformative priors. After some tweaking the models should be equivalent; this means that the JAGS model is slightly different from previous posts. The number of samples chosen is 250000 with 100000 burn-in for JAGS and 4000 with half as burn-in for Stan. I chose ten chains. Usually I would use four, but since I suspected some chains might misbehave, I opted for a larger number. The inits were either around 1000, which means that a number of parameters have to shift quite a bit to get a beta probability near 1 in 100000, or close to that distribution, which means the parameters mostly have to converge to the correct values for the regions and states. In terms of model behavior I only look at the priors and hyperpriors. Especially a and b of the beta distribution (state level) are difficult to estimate, while their ratio and the state-level estimates are quite easy.

Results

What I expected to write here is that Stan coped a bit better, especially when the inits are a bit off, which is what happened in the first version of this post. But then I did an additional calculation and Stan got worse results too. So part of the conclusion is that the outcome is very dependent on the inits and the random sampling in the MCMC chain.

Speed

In terms of speed, Stan has the property that different chains can have markedly different speeds; one chain can take 90 seconds while the next takes 250 seconds. In JAGS the progress of individual chains is not displayed, so no information there.
In general the run times were about the same, 1000 to 1800 seconds. If that seems long: this was on a Celeron processor with only one core used. MCMC chains are embarrassingly parallel, so gains are easy to make (a sketch follows).
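As an aside, one way to exploit that parallelism from R is sketched below. It is a hedged sketch only, assuming a multi-core machine: it reuses the compiled model in fits1, runs one chain per worker via parallel::mclapply and merges the per-chain fits with rstan's sflist2stanfit().

library(parallel)
sflist <- mclapply(1:10, mc.cores = 4, function(i)
    stan(fit = fits1, data = datain, pars = parameters,
        init = list(inits1l[[i]]), chain_id = i,
        iter = 4000, chains = 1))
fits_par <- sflist2stanfit(sflist)   # combine into one stanfit object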

Gelman diagnostic

This is the diagnostic as calculated in coda, so the diagnostics are comparable. There are eight series of values in the figure; each pair belongs to one run, since coda gives both the point estimate and the 95% upper limit of the estimate. 'Smart' refers to the better inits, '1000' to inits around 1000. The x axis refers to the parameters estimated. There is something odd at variables 19, 20 and 28, 29. Just prior to posting I discovered that some variables, especially aa and bb, are sorted differently in Stan than in JAGS. In hindsight I should have put my parameters in alphabetical order.
The plot shows that Stan is actually doing a bit worse than JAGS, especially with the inits that should have made correct results easier.
 

Effective number of samples

Again the plot is made using calculations in coda, so the numbers are comparable. In terms of the number of effective samples it seems Stan is doing a bit better for the more difficult parameters. For the easy parameters JAGS is a bit better; here the large number of samples for JAGS pays off.

Conclusion

Neither JAGS nor Stan came out clearly on top, which is not what I expected. Nevertheless, it still seems that JAGS is my tool for simple models, while Stan is the choice for more complex models. Conversion from JAGS to Stan was not difficult.

Output from runs

stan & inits1

    user   system  elapsed
1992.725    2.666 2016.079


Inference for Stan model: model1.
10 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=20000.

                   mean    se_mean          sd    n_eff Rhat
a[1]              53.44       4.38       44.95 NA   105 1.10
a[2]              65.07       7.55       53.74 NA    51 1.10
a[3]              66.92       6.93       54.54 NA    62 1.10
a[4]              67.28       7.00       55.38 NA    63 1.11
a[5]              62.51       4.99       50.49 NA   102 1.10
a[6]              70.32       8.18       59.19 NA    52 1.11
a[7]              62.44       7.46       53.44 NA    51 1.10
a[8]              59.37       6.37       49.08 NA    59 1.11
a[9]              63.54       5.29       53.66 NA   103 1.10
b[1]        44661405.38 3638329.39 37093869.18 NA   104 1.10
b[2]        39099064.06 4007140.91 32542035.41 NA    66 1.10
b[3]        38335546.03 4378451.81 31386394.46 NA    51 1.11
b[4]        38034644.61 3133779.95 31056909.88 NA    98 1.10
b[5]        39443027.62 4038448.85 32265357.99 NA    64 1.10
b[6]        35355902.63 2933955.81 29177099.95 NA    99 1.10
b[7]        41237192.06 4561471.29 33706574.62 NA    55 1.10
b[8]        42188390.23 3364049.93 34756182.10 NA   107 1.10
b[9]        39656654.17 4301318.67 32155276.73 NA    56 1.10
betamean[1]        0.08       0.00        0.02 NA   793 1.01
betamean[2]        0.11       0.00        0.02 NA  1253 1.01
betamean[3]        0.11       0.00        0.01 NA  1436 1.01
betamean[4]        0.11       0.00        0.02 NA  1608 1.01
betamean[5]        0.10       0.00        0.02 NA   947 1.01
betamean[6]        0.12       0.00        0.02 NA   683 1.01
betamean[7]        0.09       0.00        0.02 NA   759 1.01
betamean[8]        0.09       0.00        0.01 NA   717 1.01
betamean[9]        0.10       0.00        0.02 NA   737 1.02
aa                63.61       6.74       52.03 NA    60 1.11
bb          39230839.48 3945047.13 31100418.06 NA    62 1.11
sda               11.71       1.18       13.79 NA   137 1.07
sdb          6716646.80  546142.87  8203368.43 NA   226 1.04
lp__           -7555.39       2.81       24.19 NA    74 1.10

Samples were drawn using NUTS(diag_e) at Sun Aug 24 10:32:29 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).


Stan & inits2

    user   system  elapsed
1039.834    1.779 1042.681 

Inference for Stan model: model1.
10 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=20000.

                   mean    se_mean          sd    n_eff Rhat
a[1]              46.65       2.96       40.52 NA   188 1.04
a[2]              56.89       3.51       47.84 NA   186 1.03
a[3]              59.52       3.83       50.16 NA   171 1.04
a[4]              58.93       3.70       50.71 NA   188 1.04
a[5]              54.94       3.32       45.85 NA   190 1.03
a[6]              61.79       3.96       53.71 NA   184 1.04
a[7]              53.75       3.40       46.37 NA   186 1.04
a[8]              50.94       3.22       44.15 NA   187 1.04
a[9]              54.71       3.52       47.86 NA   185 1.04
b[1]        38295125.50 2367547.52 32822125.24 NA   192 1.04
b[2]        34100563.26 2135975.40 29277394.26 NA   188 1.04
b[3]        33605901.88 2089826.28 28387357.36 NA   185 1.04
b[4]        33282843.68 2064977.85 27983832.94 NA   184 1.04
b[5]        34582670.15 2189638.25 29708184.70 NA   184 1.04
b[6]        31320186.63 1969915.12 26631419.99 NA   183 1.04
b[7]        35911817.86 2246565.24 30569577.66 NA   185 1.04
b[8]        36698348.20 2277791.00 31350133.96 NA   189 1.03
b[9]        34452421.07 2110620.24 28624582.07 NA   184 1.04
betamean[1]        0.08       0.00        0.02 NA   531 1.02
betamean[2]        0.11       0.00        0.02 NA   935 1.01
betamean[3]        0.11       0.00        0.01 NA   377 1.02
betamean[4]        0.11       0.00        0.02 NA  1667 1.01
betamean[5]        0.10       0.00        0.02 NA   512 1.02
betamean[6]        0.12       0.00        0.02 NA   791 1.01
betamean[7]        0.09       0.00        0.01 NA  1239 1.01
betamean[8]        0.09       0.00        0.01 NA   426 1.02
betamean[9]        0.10       0.00        0.01 NA   704 1.01
aa                55.36       3.47       46.59 NA   181 1.04
bb          34503256.80 2129396.22 28613633.55 NA   181 1.04
sda               10.07       0.71       11.42 NA   256 1.03
sdb          5272208.87  372540.69  6542979.18 NA   308 1.02
lp__           -7557.13       1.74       23.38 NA   181 1.03

Samples were drawn using NUTS(diag_e) at Sun Aug 24 10:56:17 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).


JAGS inits 1

    user   system  elapsed
1784.454    1.222 1791.519

Inference for Bugs model at "/tmp/Rtmp2VW032/model75d4f74fe01.txt", fit using jags,
 10 chains, each with 250000 iterations (first 1e+05 discarded), n.thin = 150
 n.sims = 10000 iterations saved
                 mu.vect      sd.vect int.matrix  Rhat n.eff
a[1]              52.059       43.877         NA 1.099    65
a[2]              62.548       51.117         NA 1.102    63
a[3]              64.949       53.088         NA 1.104    62
a[4]              65.120       54.238         NA 1.104    62
a[5]              61.225       50.014         NA 1.103    63
a[6]              67.498       56.420         NA 1.102    63
a[7]              59.181       49.137         NA 1.102    63
a[8]              57.787       47.902         NA 1.103    63
a[9]              61.403       51.773         NA 1.102    63
aa                60.755       49.814         NA 1.099    65
b[1]        42704592.037 35622401.674         NA 1.104    62
b[2]        37598265.554 31204530.763         NA 1.100    64
b[3]        36937885.481 30057853.515         NA 1.104    62
b[4]        36766568.933 30153113.256         NA 1.103    63
b[5]        38140646.333 31358488.234         NA 1.103    63
b[6]        34271877.336 28059552.425         NA 1.103    63
b[7]        39747380.703 32646090.587         NA 1.103    63
b[8]        40262327.803 33127265.445         NA 1.102    63
b[9]        38041673.881 30717741.484         NA 1.104    62
bb          37696463.522 30374989.608         NA 1.104    62
betamean[1]        0.079        0.018         NA 1.001  6100
betamean[2]        0.108        0.017         NA 1.005  1200
betamean[3]        0.111        0.012         NA 1.001 10000
betamean[4]        0.111        0.018         NA 1.002  4300
betamean[5]        0.103        0.015         NA 1.002  3200
betamean[6]        0.122        0.016         NA 1.004  1700
betamean[7]        0.094        0.015         NA 1.001 10000
betamean[8]        0.090        0.014         NA 1.002  4300
betamean[9]        0.098        0.015         NA 1.015   420
sda               10.819       15.196         NA 1.043   140
sdb          6136772.765  8870298.491         NA 1.046   130
deviance         266.189       18.534         NA 1.083    76

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 151.1 and DIC = 417.3
DIC is an estimate of expected predictive error (lower deviance is better).

jags & ints 2

user system elapsed
1854.296 0.863 1856.972

Inference for Bugs model at "/tmp/Rtmp2VW032/model75d1dda784d.txt", fit using jags,
 10 chains, each with 250000 iterations (first 1e+05 discarded), n.thin = 150
 n.sims = 10000 iterations saved
                 mu.vect      sd.vect int.matrix  Rhat n.eff
a[1]              62.153       46.590         NA 1.063    98
a[2]              74.562       53.977         NA 1.063    98
a[3]              77.974       56.111         NA 1.067    92
a[4]              77.895       56.781         NA 1.067    93
a[5]              73.379       53.374         NA 1.065    94
a[6]              80.658       59.219         NA 1.065    95
a[7]              71.378       52.600         NA 1.062    99
a[8]              69.329       50.894         NA 1.065    95
a[9]              73.640       55.117         NA 1.064    97
aa                72.389       52.305         NA 1.062    99
b[1]        51494466.291 38425814.261         NA 1.065    95
b[2]        44959657.895 33022305.640         NA 1.062    99
b[3]        44386799.990 31879332.379         NA 1.066    94
b[4]        43857406.601 31539586.787         NA 1.066    94
b[5]        45956962.745 33781483.665         NA 1.066    94
b[6]        40856777.369 29453791.246         NA 1.065    95
b[7]        47575125.880 34573426.507         NA 1.063    98
b[8]        48634592.709 35656103.487         NA 1.066    94
b[9]        45324100.091 32444527.795         NA 1.063    97
bb          44759762.836 31404548.768         NA 1.062    99
betamean[1]        0.078        0.018         NA 1.002  4500
betamean[2]        0.107        0.016         NA 1.004  1600
betamean[3]        0.111        0.012         NA 1.001  9700
betamean[4]        0.112        0.017         NA 1.002  4200
betamean[5]        0.102        0.014         NA 1.002  3500
betamean[6]        0.123        0.015         NA 1.004  1800
betamean[7]        0.094        0.015         NA 1.001 10000
betamean[8]        0.090        0.014         NA 1.001  7800
betamean[9]        0.099        0.014         NA 1.013   470
sda               14.300       21.320         NA 1.027   220
sdb          8080234.258 12065450.388         NA 1.027   220
deviance         269.866       17.631         NA 1.053   120

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 143.4 and DIC = 413.2
DIC is an estimate of expected predictive error (lower deviance is better).

Code

Reading data

r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13,r14)
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)
r3 <- gsub('^ *$','',r2)
r4 <- r3[r3!='']
r5 <- gsub('\\t$','',r4)
r6 <- gsub('\\t References$','',r5)
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
r7$Location[r7$Location=='Washington DC'] <-
    'WashingtonDC, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location','State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ','',r8$State)
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State %in% c('Kansas','KA')] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)
table(r8$State)
r7$StateAbb <- r8$State
r7$Location <- r8$Location
r7 <- r7[! (r7$State %in% c( 'PUERTORICO','PR')),]
r7$Date <- gsub('/13$','/2013',r7$Date)
r7$date <- as.Date(r7$Date,format="%m/%d/%Y")

states <- data.frame(
    StateAbb=as.character(state.abb),
    StateRegion=state.division,
    State=as.character(state.name)
)
states <- rbind(states,data.frame(StateAbb='DC',
        State='District of Columbia',
        StateRegion= 'Middle Atlantic'))
# http://www.census.gov/popest/data/state/totals/2013/index.html
inhabitants <- read.csv('NST-EST2013-01.treated.csv')
#put it all together

states <- merge(states,inhabitants)
r9 <- merge(r7,states)
#########################

r10 <- merge(as.data.frame(xtabs(~StateAbb,data=r9)),states,all=TRUE)
r10$Freq[is.na(r10$Freq)] <- 0
r10$Incidence <- r10$Freq*100000*365/r10$Population/
    as.numeric((max(r7$date)-min(r7$date)))

Common for modelling

datain <- list(
    count=r10$Freq,
    Population = r10$Population,
    n=nrow(r10),
    nregion =nlevels(r10$StateRegion),
    Region=(1:nlevels(r10$StateRegion))[r10$StateRegion],
    scale=100000*365/
        as.numeric((max(r7$date)-min(r7$date))))


parameters <- c('a','b','betamean','aa','bb','sda','sdb')

inits1 <- function()
    list(a=rnorm(datain$nregion,100,10),
        b=rnorm(datain$nregion,1e8,1e7),
        aa=rnorm(1,100,10),
        sda=rnorm(1,100,10),
        sdb=rnorm(1,1e7,1e6),
        bb=rnorm(1,1e7,1e6),
        p1=rnorm(datain$n,1e-7,1e-8))

inits2 <- function()
    list(a=rnorm(datain$nregion,1000,100),
        b=rnorm(datain$nregion,1000,100),
        aa= rnorm(1,           1000,100),
        sda=rnorm(1,1000,100),
        sdb=rnorm(1,1000,100),
        bb= rnorm(1,1000,100),
        p1=rnorm(datain$n,1e-7,1e-8))

inits1l<- lapply(1:10,function(x) inits1())
inits2l<- lapply(1:10,function(x) inits2())

Stan model

model1 <- '
    data {
         int<lower=0> n;
       int<lower=0> nregion;
       int count[n];
       int<lower=1,upper=nregion> Region[n];
       int Population[n];
       real scale;
        }
    parameters {
       vector <lower=0> [nregion]  a;
       vector <lower=0> [nregion]  b; 
       vector <lower=0,upper=1> [n] p1;
       real <lower=0> aa;
       real <lower=0> bb;
       real <lower=0> sda;
       real <lower=0> sdb;
        }
  model {       
        for (i in 1:n) {
            p1[i] ~ beta(a[Region[i]],b[Region[i]]);
        }
        count ~ binomial(Population,p1);
        a ~ normal(aa,sda);
        b ~ normal(bb,sdb);
        aa ~ uniform(0,1e5);
        bb ~ uniform(0,1e8);
        sda ~ uniform(0,1e5);
       sdb ~ uniform(0,1e8);
        }
  generated quantities {

        vector [n] pp;
        vector [nregion] betamean;
        for (i in 1:nregion) {
           betamean[i] <- scale*a[i]/(a[i]+b[i]);
           }
        pp <- p1 * scale;

        }
'

system.time(fits1 <- stan(model_code = model1,
    data = datain,
    pars=parameters,
    init=inits1l,
    iter = 4000,
    chains = 10))
print(fits1,probs=NA)

system.time(fits2 <- stan(model_code = model1,
        data = datain,
        pars=parameters,
        init=inits2l,
        iter = 4000,
        chains = 10))
print(fits2,probs=NA)

JAGS model

 model.jags <- function() {
    for (i in 1:n) {
        count[i] ~ dbin(p1[i],Population[i])
        p1[i] ~ dbeta(a[Region[i]],b[Region[i]])
        pp[i] <- p1[i]*scale
    }
    for (i in 1:nregion) {
        a[i] ~ dnorm(aa,tauaa) %_% T(0,)
        b[i] ~ dnorm(bb,taubb) %_% T(0,)
        betamean[i] <- scale * a[i]/(a[i]+b[i])
     }
    tauaa <- pow(sda,-2)
    sda ~dunif(0,1e5)
    taubb <- pow(sdb,-2)
    sdb ~dunif(0,1e8)
    aa ~ dunif(0,1e5)
    bb ~ dunif(0,1e8)
}

system.time(jags1 <- jags(datain,
    model=model.jags,
    inits=inits1l,
    parameters=parameters,
    n.iter=250000,
    n.burnin=100000,
    n.chain=10))
print(jags1,intervals=NA)


system.time(jags2 <- jags(datain,
        model=model.jags,
        inits=inits2l,
        parameters=parameters,
        n.iter=250000,
        n.burnin=100000,
        n.chain=10))
print(jags2,intervals=NA)

Post processing

st1coda <- mcmc.list(lapply(1:ncol(fits1),
        function(x) mcmc(as.array(fits1)[,x,])))
st2coda <- mcmc.list(lapply(1:ncol(fits2),
        function(x) mcmc(as.array(fits2)[,x,])))
jg1coda <- as.mcmc(jags1)
jg2coda <- as.mcmc(jags2)

options(width=100)

gdiag <- cbind(gelman.diag(st1coda)[[1]],
    gelman.diag(st2coda)[[1]],
    gelman.diag(jg1coda)[[1]],
    gelman.diag(jg2coda)[[1]])

png('Gelman Diagnostic.png')
matplot(gdiag,ylim=c(1,1.5),ylab='Gelman Diagnostic')
legend(x='topleft',pch=format(c(1:8)),
    c('1 Stan smart point',
'2 Stan smart upper',
'3 Stan  1000 point',
'4 Stan  1000 upper',
'5 Jags smart point',
'6 Jags smart upper',
'7 Jags  1000 point',
'8 Jags  1000 upper'),
col=c(1:6,1,2),ncol=2)
dev.off()

efs <- cbind(effectiveSize(st1coda),
    effectiveSize(st2coda),
    effectiveSize(jg1coda),
    effectiveSize(jg2coda))

png('Efsz.png')
matplot(efs,log='y',ylab='Effective sample size')
legend(x='topleft',pch=format(c(1:8)),
    c('1 Stan smart ',
        '2 Stan  1000 point',
        '3 Jags smart point',
        '4 Jags  1000 point'),
    col=c(1:4))
dev.off()