Sunday, November 24, 2013

Dutch Rainwater Composition 1992-2011

Last week I examined rainwater composition from 1992 to 2005. There is additional data, but the National Institute for Public Health changed equipment after 2005. This week I will add those data.

Data

The data is in a number of spreadsheets. The script is a variation on last week's, with an extra wrapper around the per-spreadsheet code. The final parts merge locations (stations Leiduin and Witteveen have been replaced by De Zilk and Valthermond respectively) and add the equipment variable. The code is at the bottom of the post.

pH

It seems pH is still slightly increasing. It also seems there are two preferred levels of pH. For instance, in Gilze-Rijen the pH is either around 5 or around 6, while 5.5 is less probable.
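The two levels are easy to eyeball with a kernel density; a quick sketch, assuming rf2 from the code at the bottom of this post and 'Gilze-Rijen' as the sheet name:
ph <- rf2$Amount[rf2$Compound == 'pH (pH-eenheid)' &
        rf2$Location2 == 'Gilze-Rijen' &
        !is.na(rf2$Amount) & rf2$Amount > 1]  # skip the implausible readings
plot(density(ph), main = 'pH at Gilze-Rijen')  # modes near 5 and 6, dip at 5.5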

Iron

For iron there are just four locations with complete data, so only these are displayed. The downward trend seems to continue.

Iron model

Since I wanted to give the two machines each a separate variance, I opted to use nlme for this particular calculation. In addition, rather than following the plot, all locations are used, and the replacement stations are not merged. The machines seem to give similar variation, which probably means the variation in iron concentration is much larger than the machine accuracy. The decrease is still estimated at around 5% per year.
Linear mixed-effects model fit by REML
 Data: Fe2 
       AIC      BIC    logLik
  3135.969 3184.729 -1559.984

Random effects:
 Formula: ~StartDate | Location
 Structure: General positive-definite, Log-Cholesky parametrization
            StdDev       Corr  
(Intercept) 1.018836e-05 (Intr)
StartDate   6.680830e-06 -0.05 

 Formula: ~1 | machine %in% Location
        (Intercept)  Residual
StdDev:  0.09556604 0.3791318

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~1 | machine 
 Parameter estimates:
   Van Essen/ECN vanger  Eigenbrodt NSA 181 KHT 
               1.000000                1.049395 
Fixed effects: LAmount ~ StartDate 
                 Value  Std.Error   DF    t-value p-value
(Intercept)  0.4373752 0.06188798 3258   7.067208       0
StartDate   -0.0000640 0.00000549 3258 -11.645303       0
 Correlation: 
          (Intr)
StartDate -0.887

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-3.752718189 -0.655477969 -0.003628072  0.645709629  4.226135302 

Number of Observations: 3280
Number of Groups: 
             Location machine %in% Location 
                   17                    21 
#decrease estimate
> (10^fixef(lme1)['StartDate'])^365
StartDate 
0.9476403 
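To attach an interval to that yearly decrease, the fixed-effect confidence limits transform the same way; a sketch, assuming lme1 from the code below and that intervals() succeeds on this fit:
10^(365 * intervals(lme1, which = 'fixed')$fixed['StartDate', ])
# lower / est. / upper of the yearly multiplication factor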

Code

Import data

library(XLConnect)

readsheet <- function(wb,cursheet) {
  # read one sheet (= one location); data starts at row 8; row 6 flags the
  # status columns with 'C', the measurement sits one column to the right,
  # and row 5 holds the units
  df = readWorksheet(wb, sheet = cursheet , header = TRUE)
  topline <- 8
  test <- which(df[6,]=='C')+1
  numin <- df[topline:nrow(df),test]
  names(numin) <- make.names( gsub(' ','',df[6,test]))
  # the numbers use a decimal comma
  for(i in 1:ncol(numin)) numin[,i] <- as.numeric(gsub(',','.',numin[,i]))
  status = df[topline:nrow(df),test-1]
  names(status) <- paste('C',make.names( gsub(' ','',df[6,test])),sep='')
  
  numin$StartDate <- as.Date(substr(df[topline:nrow(df),1],1,10))
  numin$EndDate <-  as.Date(substr(df[topline:nrow(df),2],1,10))
  numin <- cbind(numin,status)
  tall <- reshape(numin,
      varying=list(make.names( gsub(' ','',df[6,test])),paste('C',make.names( gsub(' ','',df[6,test])),sep='')),
      v.names=c('Amount','Status'),
      timevar='Compound',
      idvar=c('StartDate','EndDate'),
      times=paste(df[6,test],'(',df[5,test],')',sep=''),
      direction='long')
  tall$Compound <- factor(gsub(' )',')',gsub(' +',' ',tall$Compound)))
  row.names(tall)<- NULL
  tall$Location <- cursheet
  tall
}

readfile <- function(fname) {
  wb <- loadWorkbook(fname)
  print(wb)
  sheets <- getSheets(wb)[-1]
  tt <- lapply(sheets,readsheet,wb=wb)
  tt2 <- do.call(rbind,tt) 
  tt2$Location <- factor(tt2$Location)
  tt2$fname <- fname
  tt2
}

fnames <- dir(pattern='\\.xls$') # all .xls files in the working directory
rf <- lapply(fnames,readfile)
rf2 <- do.call(rbind,rf)

(ll <- levels(rf2$Location))
ll[c(13,17)] <- 'V&W'
ll[c(9,4)] <- 'L&DZ'
rf2$Location2 <- factor(ll[rf2$Location])
head(rf2)
rf2$machine <- factor(ifelse(rf2$fname=="lmre_1992-2005.xls",
        'Van Essen/ECN vanger',
        ' Eigenbrodt NSA 181 KHT'))

pH plot


library(ggplot2)
library(scales)
levels(rf2$Compound)
(cc <- levels(rf2$Compound)[22])
rf2$Amount2 <- rf2$Amount
rf2$Amount2[rf2$Amount<1 & rf2$Compound==cc] <- NA
rf2[rf2$Amount < 1 & rf2$Compound==cc & !is.na(rf2$Amount),]
pp <- ggplot(rf2[rf2$Compound==cc & !(rf2$Location2 %in% ll[c(2,5,7,15)]),], 
    aes( StartDate,Amount2,col=machine))
pp + geom_point() +
    facet_wrap(~Location2) +
    ylab(cc) + xlab('Year') + ylim(c(4,7)) +
    scale_x_date(breaks=as.Date(paste(c('1992','2000','2010'),'-01-01',sep=''),
            format=('%Y-%m-%d')),
        labels = date_format("%Y")) +
     theme(legend.position = "bottom")


Fe Plot


(cc <- levels(rf2$Compound)[9])
#levels(rf2$Location2)
pp <- ggplot(rf2[rf2$Compound==cc & (rf2$Location2 %in% 
            levels(rf2$Location2)[c(7,8,10,13)]),], 
  aes( StartDate,Amount,color=machine))
pp + geom_point() +
    scale_y_log10() +
    facet_wrap(~Location2) +
    ylab(cc) + xlab('Year') +
    scale_x_date(breaks=as.Date(paste(c('1992','2000','2010'),'-01-01',sep=''),
            format=('%Y-%m-%d')),
        labels = date_format("%Y")) +
        theme(legend.position = "bottom")

Fe model


Fe <- rf2[rf2$Compound==levels(rf2$Compound)[9],]
# plot(density(Fe$Amount[!is.na(Fe$Status) & Fe$Status!=0],adjust=.5))
Fe$LAmount <- log10(Fe$Amount)
Fe$LAmount[Fe$Amount<=0] <- NA
Fe$Location <- factor(Fe$Location) # drop unused location levels
#nlme
library(nlme)
Fe2 <- Fe[complete.cases(Fe),]
lme1 <- lme(LAmount ~  StartDate,
 #   random= ~ 1|machine, # machine = groups
 #   random=~StartDate | Location, # location=groups
     random=list(Location=~StartDate, machine=~1),   
    weights=varIdent(form=~1 |  machine),
    data=Fe2 )    
summary(lme1)

Sunday, November 17, 2013

Dutch Rainwater Composition 1992-2005.

After reading Blog About Stats' Open Data Index blog post I decided to browse a bit in the Open Data Index. Choosing Netherlands and following Emission of Pollutants I ended up on a page from the National Institute for Public Health. The page http://www.lml.rivm.nl/data/gevalideerd/index.html is in Dutch; I chose composition of rain water 1992-2005. There are also data for 2006-2008, 2009, 2010 and 2011. In 2006 the machines were changed; 'a report describing changes is in preparation'. I am only using 1992-2005 in this post.

Data

The data is in a spreadsheet, one page per location. There is quite a big header on each page, and each data column has an associated column marking data below the detection limit. This being real-world data, there are also the odd missing values.
Following Read Excel Data from R I chose to read the data via XLConnect. The whole process is wrapped in a function, so I can read all pages (all locations).
library(XLConnect)
wb = loadWorkbook("lmre_1992-2005.xls")
sheets <- getSheets(wb)[-1]
readsheet <- function(cursheet) {
  df = readWorksheet(wb, sheet = cursheet , header = TRUE)
  topline <- 8
  test <- which(df[6,]=='C')+1
  numin <- df[topline:nrow(df),test]
  names(numin) <- make.names( gsub(' ','',df[6,test]))
  for(i in 1:ncol(numin)) numin[,i] <- as.numeric(gsub(',','.',numin[,i]))
  status = df[topline:nrow(df),test-1]
  names(status) <- paste('C',make.names( gsub(' ','',df[6,test])),sep='')
  numin$StartDate <- as.Date(substr(df[topline:nrow(df),1],1,10))
  numin$EndDate <-  as.Date(substr(df[topline:nrow(df),2],1,10))
  numin <- cbind(numin,status)
  tall <- reshape(numin,
      varying=list(make.names( gsub(' ','',df[6,test])),
         paste('C',make.names( gsub(' ','',df[6,test])),sep='')),
      v.names=c('Amount','Status'),
      timevar='Compound',
      idvar=c('StartDate','EndDate'),
      times=paste(df[6,test],'(',df[5,test],')',sep=''),
      direction='long')
  tall$Compound <- factor(gsub(' )',')',gsub(' +',' ',tall$Compound)))
  row.names(tall)<- NULL
  tall$Location <- cursheet
  tall
}
tt <- lapply(sheets,readsheet)
tt2 <- do.call(rbind,tt) 
tt2$Location <- factor(tt2$Location)

Plots

There are certainly many variables to be analyzed, more than I would cover in a blog post. I have chosen pH (acid rain) and iron (Fe).
levels(tt2$Compound)
 [1] "As (umol/l)"     "Ca (umol/l)"     "Cd (umol/l)"     
 [4] "Cl (umol/l)"     "Co (umol/l)"     "Cr (umol/l)"    
 [7] "Cu (umol/l)"     "F (umol/l)"      "Fe (umol/l)"     
[10] "K (umol/l)"      "K25 (uS/cm)"     "massa_hc (g)"   
[13] "massa_zm (g)"    "Mg (umol/l)"     "Mn (umol/l)"   
[16] "Na (umol/l)"     "NH4 (umol/l)"    "Ni (umol/l)"     
[19] "NO3 (umol/l)"    "nsl (mm)"        "Pb (umol/l)"     
[22] "pH (pH-eenheid)" "PO4 (umol/l)"    "SO4 (umol/l)"
[25] "st.zr. (umol/l)" "V (umol/l)"      "Zn (umol/l)" 

pH   

Acid rain was quite a scare last century. We should be able to see progress, especially in the nineties, when there was quite some activity on it. One location is not shown: Leiduin has data complementary to 'De Zilk'. Removing this location gave a much nicer layout for the remaining data.
The plots show progress, but at some locations (e.g. De Zilk, Rotterdam) there is still a way to go. Is Vredepeel becoming more acidic again? I may need to add more data after all.
library(ggplot2)
library(scales)
(cc <- levels(tt2$Compound)[22])
pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',], 
    aes( StartDate,Amount))
pp + geom_point() +
    facet_wrap(~Location) +
    ylab(cc) + xlab('Year') +
    scale_x_date(breaks='4 years',labels = date_format("%Y"))

Iron

Iron is plotted on a logarithmic y scale. The disadvantage is that this hides the extreme highs (Kollumerwaard) and the values at or below zero. However, on this scale the data seem more or less 'normal'.
(cc <- levels(tt2$Compound)[9])
pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',], aes( StartDate,Amount))
pp + geom_point() +
    scale_y_log10() +
    facet_wrap(~Location) +
    ylab(cc) + xlab('Year') +
    scale_x_date(breaks='4 years',labels = date_format("%Y"))
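To see what the log scale hides, the nonpositive and most extreme iron values can be listed directly; a quick sketch using the objects above:
FeA <- tt2[tt2$Compound == cc & !is.na(tt2$Amount),]
FeA[FeA$Amount <= 0 | FeA$Amount > quantile(FeA$Amount, .999),
    c('Location','StartDate','Amount')]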

Mixed model for Iron

Just to get some quantification of the reduction I ran a model. The obvious choice is again to use log-transformed data. However, some zeros unleashed havoc on the subsequent estimates, hence these values are set to NA prior to the computations.
library(lme4) 
Fe <- tt2[tt2$Compound==levels(tt2$Compound)[9],]
# plot(density(Fe$Amount[!is.na(Fe$Status) & Fe$Status!=0],adjust=.5))
Fe$LAmount <- log10(Fe$Amount)
Fe$LAmount[Fe$Amount<=0] <- NA
ll <- lmer(LAmount ~  StartDate + (1|Location) +(StartDate|Location ) ,
    data=Fe)
summary(ll)
Linear mixed model fit by REML ['lmerMod']
Formula: LAmount ~ StartDate + (1 | Location) + 
        (StartDate | Location) 
   Data: Fe 

REML criterion at convergence: 2170.498 

Random effects:
 Groups     Name        Variance  Std.Dev.  Corr
 Location   (Intercept) 4.727e-11 6.875e-06     
 Location.1 (Intercept) 1.456e-04 1.207e-02     
            StartDate   1.348e-10 1.161e-05 1.00
 Residual               1.436e-01 3.789e-01     
Number of obs: 2338, groups: Location, 17

Fixed effects:
              Estimate Std. Error t value
(Intercept)  5.001e-01  5.971e-02   8.376
StartDate   -6.975e-05  6.338e-06 -11.005

Correlation of Fixed Effects:
          (Intr)

StartDate -0.862
A parameter of interest is the effect of StartDate. It is the size of the effect per day, which translates into a 5% reduction per year. The t-value suggests this is significant, as does the associated ANOVA.
> (10^-6.975e-5)^365
[1] 0.9430642
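A rough Wald interval on the yearly factor, using the printed estimate and standard error (plus/minus 1.96 standard errors, assuming normality):
(10^(-6.975e-5 + c(-1, 0, 1) * 1.96 * 6.338e-6))^365
# about 0.933 0.943 0.953, so roughly a 5% to 7% decrease per year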
ll2 <- lmer(LAmount ~  1 + (1|Location) +(StartDate|Location ) ,
    data=Fe)
anova(ll,ll2)
Data: Fe
Models:
ll2: LAmount ~ 1 + (1 | Location) + (StartDate | Location)
ll: LAmount ~ StartDate + (1 | Location) + (StartDate | Location)
    Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
ll2  6 2196.0 2230.5 -1092.0   2184.0                            
ll   7 2157.3 2197.6 -1071.7   2143.3 40.656      1  1.815e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A bootstrap simulation of this significance needed some more processing; it did not work until I removed all data with missing values prior to the calculation. The result was 0 out of 1000 simulated statistics exceeding the observed one, which in hindsight is obvious. Even if I don't trust the ANOVA, I don't believe it would be a factor 10^7 off either.
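For reference, a sketch of that parametric bootstrap, refitting on complete cases only since refit() does not accept missing values (assuming Fe and the model formulas from the code above):
FeCC <- Fe[!is.na(Fe$LAmount) & !is.na(Fe$StartDate),]
m1 <- lmer(LAmount ~ StartDate + (1|Location) + (StartDate|Location), data=FeCC)
m0 <- lmer(LAmount ~ 1 + (1|Location) + (StartDate|Location), data=FeCC)
obs <- anova(m1, m0)$Chisq[2]
chisim <- replicate(1000, {
      s <- simulate(m0)  # data generated under the null model
      anova(refit(m1, s), refit(m0, s))$Chisq[2]
    })
mean(chisim >= obs)      # 0 out of 1000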

Sunday, November 10, 2013

A small comparison of bio-equivalence calculations.

Last week I looked at two-way cross-over studies and followed the example of Schütz (http://bebac.at/) in the analysis. Since the EU has its own opinions (Questions & Answers: Positions on specific questions addressed to the pharmacokinetics working party) and two example data sets, I was wondering how the various computations compare.

Data

There are six data sets in the analysis. 
  1. The first data set, from Schütz. 
  2. The 'Dataset from Chow & Liu'. This is a two-period data set, which Schütz analyzed in three versions. This version is complete (balanced).
  3. Data set 2; unbalanced (one subject is missing entirely). 
  4. Data set 2; incomplete (one subject misses one evaluation). Note the EU guide states: 'subjects in a crossover trial who do not provide evaluable data for both of the test and reference products should not be included'.  
  5. EU data set 1: 'a four period crossover study where subjects receive both test and reference twice, with some subjects providing data for only a subset of the treatment periods'
  6. EU data set 2: 'a three period crossover study where all subjects receive reference twice and test once'

Bio-equivalence analysis

There are five algorithms. I wrapped each in a function and made a unified print() method, so as to ease comparison. These functions have two input parameters: the data and the index for the level of the reference treatment. A calling example follows after this list.
  1. The code from Schütz, slide 14. Note this only handles two-period data, and not the incomplete data.
  2. ANOVA. In contrast to my previous post, subject is now a fixed effect. For the interval I use just the residual degrees of freedom. 
  3. nlme. This should deliver what proc mixed would deliver.
  4. MCMCglmm. A Bayesian MCMC sampler.  
  5. lme4. A traditional mixed model, with the point estimate and interval obtained via simulation.
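For reference, the unified print() method makes a single call look like this (values as in the Data set 1 table below):
> BESchutz(S1, 2)
 Back transformed (raw data scale)
 Point estimate: 100.82 %
 90 % confidence interval: 95.47 % to 106.46 %
 CVIntra: 7.37 %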

Result comparison

Data set 1

     Method     Point_estimate p5CI     p95CI    CVIntra 
[1,] "Schutz"   100.8168       95.47312 106.4596 7.370138
[2,] "ANOVA"    100.8168       95.47312 106.4596 7.370138
[3,] "nlme"     100.8168       95.47312 106.4596 7.370138
[4,] "MCMCglmm" 100.8011       95.38906 106.5084 8.336442
[5,] "lme4"     100.71         96.03672 105.9925 7.370137

Data set 2

     Method     Point_estimate p5CI     p95CI    CVIntra 
[1,] "Schutz"   97.17545       88.3128  106.9275 19.47357
[2,] "ANOVA"    97.17545       88.3128  106.9275 19.47357
[3,] "nlme"     97.17545       88.31279 106.9275 19.47359
[4,] "MCMCglmm" 97.09287       86.66635 108.7019 24.21701
[5,] "lme4"     97.51018       89.10346 106.7388 19.47351

Data set 3: Unbalanced

     Method     Point_estimate p5CI     p95CI    CVIntra 
[1,] "Schutz"   95.79137       87.02926 105.4356 19.05692
[2,] "ANOVA"    95.60894       86.86353 105.2348 19.05692
[3,] "nlme"     95.60894       86.86352 105.2349 19.05693
[4,] "MCMCglmm" 95.6052        85.20512 107.4927 24.13791
[5,] "lme4"     95.80032       86.8732  103.6471 19.05691

Data set 4: Incomplete

     Method     Point_estimate p5CI     p95CI    CVIntra 
[1,] "ANOVA"    95.60894       86.86353 105.2348 19.05692
[2,] "nlme"     96.46814       87.6157  106.215  19.2212 
[3,] "MCMCglmm" 96.78571       86.78392 108.6978 23.77217
[4,] "lme4"     96.76121       88.38716 105.469  19.22122

Data set 5: four period crossover

     Method     Point_estimate p5CI     p95CI    CVIntra 
[1,] "ANOVA"    115.6587       107.1057 124.8948 41.65396
[2,] "nlme"     115.7298       107.1707 124.9725 41.66876
[3,] "MCMCglmm" 115.7334       107.0849 125.0422 41.87724
[4,] "lme4"     115.7831       106.6829 124.6334 41.66873

Data set 6: three period crossover

     Method     Point_estimate p5CI     p95CI    CVIntra 
[1,] "ANOVA"    102.2644       97.31555 107.4649 11.85557
[2,] "nlme"     102.2644       97.31555 107.4649 11.85557
[3,] "MCMCglmm" 102.2624       97.29063 107.4589 12.17123
[4,] "lme4"     102.3081       97.30974 107.192  11.85559

From this small analysis one can observe:

  • The Schütz code applies only to balanced two-period crossover data, and in that case nlme and ANOVA provide the same answers; hence it is purely educational.
  • Given the width of the intervals, the point estimates are close enough that the differences can be ignored. 
  • For all data except the incomplete data the difference between nlme and ANOVA is absent or very small. 
  • MCMCglmm shows some differences; compared to the other methods it overestimates CVIntra and gives a slightly wider confidence interval for bioequivalence. 
  • lme4 had in most cases a slightly narrower interval.

Code

###############################################################################
# Data
###############################################################################

#Schutz 1
T1 <- c(28.39,33.36,24.29,28.61,59.49,42.30)
T2 <- c(49.42,36.78,34.81,45.54,28.23,25.71)
R1 <- c(39.86,32.75,34.97,45.44,27.87,24.26)
R2 <- c(35.44,33.40,24.65,31.77,65.29,37.01)
S1 <- data.frame(
    aval=c(T1,R2,R1,T2),
    subjid=factor(c(
            rep(c(1,4,6,7,9,12),2),
            rep(c(2,3,5,8,10,11),2))),
    seqa=rep(c('TR','RT'),each=12),
    trta=rep(c('T','R','R','T'),each=6),
    aperiod=factor(rep(c(1,2,1,2),each=6))
)
S1 <- S1[order(S1$subjid,S1$aperiod),]
S1$logAval <- log(S1$aval)

S2 <- read.table(textConnection('
Subject Period Sequence Formulation AUC Subject Period Sequence Formulation AUC
1 1 RT Reference 74.675 13 1 TR Test 122.45
1 2 RT Test 73.675 13 2 TR Reference 124.975
2 1 TR Test 74.825 14 1 TR Test 99.075
2 2 TR Reference 37.35 14 2 TR Reference 85.225
3 1 TR Test 86.875 15 1 RT Reference 69.725
3 2 TR Reference 51.925 15 2 RT Test 59.425
4 1 RT Reference 96.4 16 1 RT Reference 86.275
4 2 RT Test 93.25 16 2 RT Test 76.125
5 1 RT Reference 101.95 17 1 TR Test 86.35
5 2 RT Test 102.125 17 2 TR Reference 95.925
6 1 RT Reference 79.05 18 1 TR Test 49.925
6 2 RT Test 69.45 18 2 TR Reference 67.1
7 1 TR Test 81.675 19 1 RT Reference 112.675
7 2 TR Reference 72.175 19 2 RT Test 114.875
8 1 TR Test 92.7 20 1 RT Reference 99.525
8 2 TR Reference 77.5 20 2 RT Test 116.25
9 1 TR Test 50.45 21 1 TR Test 42.7
9 2 TR Reference 71.875 21 2 TR Reference 59.425
10 1 TR Test 66.125 22 1 TR Test 91.725
10 2 TR Reference 94.025 22 2 TR Reference 114.05
11 1 RT Reference 79.05 23 1 RT Reference 89.425
11 2 RT Test 69.025 23 2 RT Test 64.175
12 1 RT Reference 85.95 24 1 RT Reference 55.175
12 2 RT Test 68.7 24 2 RT Test 74.575
'),header=TRUE,stringsAsFactors=FALSE)
S2 <- with(S2,data.frame(
    aval=c(AUC,AUC.1),
    subjid=factor(c(Subject,Subject.1)),
    seqa=factor(c(Sequence,Sequence.1)),
    trta=factor(c(Formulation,Formulation.1)),
    aperiod=factor(c(Period,Period.1))))
S2$logAval <- log(S2$aval)
S2InC <- S2[!(S2$subjid=='24' & S2$aperiod=='2'),]
S2InB <- S2[!(S2$subjid=='24'),]

##
#The EU data sets I copied-pasted into .csv prior to analysis.
#EU1
EU1 <- read.csv('eu1.csv')
EU1$subjid <- factor(EU1$SUBJECT)
EU1$aperiod <- factor(EU1$PERIOD)
EU1$aval <- EU1$DATA
EU1$seqa <- EU1$SEQUENCE
EU1$trta <- EU1$FORMULATION
EU1$logAval <- log(EU1$aval)
#EU2
EU2 <- read.csv('eu2.csv')
EU2$subjid <- factor(EU2$SUBJECT)
EU2$aperiod <- factor(EU2$PERIOD)
EU2$aval <- EU2$DATA
EU2$seqa <- EU2$SEQUENCE
EU2$trta <- EU2$FORMULATION
EU2$logAval <- log(EU2$aval)

###############################################################################
# Analysis functions
###############################################################################

print.BEQ <- function(x,...) {
  result <- paste(
      " Back transformed (raw data scale)\n",
      "Point estimate:",
      round(x$Point_estimate, digits=2),"%\n",
      "90 % confidence interval:",
      round(x$p5CI,digits=2), "% to",
      round(x$p95CI, digits=2),"%\n",
      "CVIntra:",round(x$CVIntra,digits=2),
      "%\n")
  cat(result)
  invisible(x)
}

BESchutz <- function(datain,iref) {
  prepdata <- datain[order(datain$seqa,datain$subjid,datain$aperiod),]
  T1 <- with(prepdata,aval[seqa==levels(seqa)[iref] & aperiod==1])
  R1 <- with(prepdata,aval[seqa==levels(seqa)[3-iref] & aperiod==1])
  R2 <- with(prepdata,aval[seqa==levels(seqa)[iref] & aperiod==2])
  T2 <- with(prepdata,aval[seqa==levels(seqa)[3-iref] & aperiod==2])
  RT <- log(R1) - log(T2)
  TR <- log(R2) - log(T1)
  n1 <- length(RT)
  mRT <- mean(RT)
  vRT <- var(RT)
  n2 <- length(TR)
  mTR <- mean(TR)
  vTR <- var(TR)
  mD <- mean(log(c(T1,T2))) - mean(log(c(R1,R2)))
  MSE <- (((n1-1)*vRT + (n2-1)*vTR)/(n1+n2-2))/2
  alpha <- 0.05
  lo <- mD - qt(1-alpha,n1+n2-2)*sqrt(MSE)*
      sqrt((1/(2*n1) + 1/(2*n2)))
  hi <- mD + qt(1-alpha,n1+n2-2)*sqrt(MSE)*
      sqrt((1/(2*n1) + 1/(2*n2)))
  result <- list(Method='Schutz',
      Point_estimate=100*exp(mD),
      p5CI=100*exp(lo),
      p95CI=100*exp(hi), 
      CVIntra=100*sqrt(exp(MSE)-1))
  class(result) <- c('BEQ',class(result))
  result
}

#######
# plain anova
#######
BEanova <- function(datain,iref) {
  trtastring <- paste('trta',levels(datain$trta)[iref],sep='')
  A1 <- aov(logAval ~  aperiod +seqa+ trta + subjid,
      data=datain)
  SE <- sqrt(vcov(A1)[trtastring,trtastring])
  td <- qt(.05,A1$df.residual)
  result <- list(Method='ANOVA',
      Point_estimate=100*exp(coef(A1)[trtastring]),
      p5CI=100*exp(coef(A1)[trtastring]+td*SE),
      p95CI=100*exp(coef(A1)[trtastring]-td*SE),
      CVIntra=100*sqrt(exp(sum(A1$residuals^2)/A1$df.residual)-1))
  class(result) <- c('BEQ',class(result))
  result
}

########
# nlme
########
library(nlme)
BEnlme <- function(datain,iref) {
  trtastring <- paste('trta',levels(datain$trta)[iref],sep='')
  L1 <- lme(logAval ~ aperiod + seqa + trta,
      random=~1|subjid,
      data=datain)
  coL1 <- fixef(L1)
  SEL1 <- sqrt(vcov(L1)[trtastring,trtastring])
  dfL1 <- L1$fixDF$terms[names(L1$fixDF$terms)=='trta']
  tdL1 <- qt(.05,dfL1)
  result <- list(Method='nlme',
      Point_estimate=100*exp(coL1[trtastring]),
      p5CI=100*exp(coL1[trtastring]+tdL1*SEL1),
      p95CI=100*exp(coL1[trtastring]-tdL1*SEL1),
      CVIntra=100*sqrt(exp(as.numeric(VarCorr(L1)['Residual','Variance']))-1))
  class(result) <- c('BEQ',class(result))
  result
}

######
# MCMCglmm
######
library(MCMCglmm)
BEmcmcglmm <- function(datain,iref) {
  trtastring <- paste('trta',levels(datain$trta)[iref],sep='')
  
  M1 <- MCMCglmm(logAval ~ aperiod + seqa + trta, 
      random= ~ subjid ,
      data=datain,family='gaussian',nitt=500000,thin=20,
      burnin=50000,verbose=FALSE)
  M1est <- 100*exp(quantile(M1$Sol[,trtastring],c(.05,.5,.95)))
  result=list(Method='MCMCglmm',
      Point_estimate=M1est[2],
      p5CI=M1est[1],
      p95CI=M1est[3],
      CVIntra=    100*sqrt(exp(mean(M1$VCV[,'units']))-1))
  class(result) <- c('BEQ',class(result))
  result
}

#######
# lme4
######
library(lme4)
BElmer <- function(datain,iref) {
  trtastring <- paste('trta',levels(datain$trta)[iref],sep='')
  R1 <- lmer(logAval ~ aperiod + seqa + trta + (1|subjid),
      data=datain)
  simeff <- function(m0) {
    s <- simulate(m0)
    fixef(refit(m0,s))[trtastring]
  }
  ff <- replicate(1000,simeff(R1))
  R1est <- 100*exp(quantile(ff,c(.05,.5,.95)))
  result <- list(Method='lme4',
      Point_estimate=R1est[2],
      p5CI=R1est[1],
      p95CI=R1est[3],
      CVIntra=100*sqrt(exp(sigma(R1)^2)-1))
  class(result) <- c('BEQ',class(result))
  result
}

###############################################################################
# Result summary
###############################################################################

t(cbind(BESchutz(S1,2)
        , BEanova(S1,2)
        , BEnlme(S1,2)
        , BEmcmcglmm(S1,2)
        , BElmer(S1,2)
    ))

t(cbind(BESchutz(S2,2)
        , BEanova(S2,2)
        , BEnlme(S2,2)
        , BEmcmcglmm(S2,2)
        , BElmer(S2,2)
    ))

t(cbind( BESchutz(S2InB,2)
        , BEanova(S2InB,2)
        , BEnlme(S2InB,2)
        , BEmcmcglmm(S2InB,2)
        , BElmer(S2InB,2)
    ))

t(cbind(  BEanova(S2InC,2)
        , BEnlme(S2InC,2)
        , BEmcmcglmm(S2InC,2)
        , BElmer(S2InC,2)
    ))

t(cbind( BEanova(EU1,2)
        , BEnlme(EU1,2)
        , BEmcmcglmm(EU1,2)
        , BElmer(EU1,2)
    ))
t(cbind( BEanova(EU2,2)
        , BEnlme(EU2,2)
        , BEmcmcglmm(EU2,2)
        , BElmer(EU2,2)
    ))

Sunday, November 3, 2013

Statistical aspects of two-way cross-over studies

I ran into this presentation on Statistical aspects of two-way cross-over studies by Ing. Helmut Schütz (http://bebac.at). He presented some code and referred to the bear package. The bear package is menu driven, which is not my thing, so I tried to do the same in R via other packages. The aim is to estimate whether the treatments are bioequivalent and to estimate the intra-subject variation.

Data

This seems like a funny way to create data, but the presentation contains the data in these four lines. The code after that, putting it in a data.frame, is mine.
T1 <- c(28.39,33.36,24.29,28.61,59.49,42.30)
T2 <- c(49.42,36.78,34.81,45.54,28.23,25.71)
R1 <- c(39.86,32.75,34.97,45.44,27.87,24.26)
R2 <- c(35.44,33.40,24.65,31.77,65.29,37.01)
datain <- data.frame(
    aval=c(T1,R2,R1,T2),
    subjid=factor(c(
            rep(c(1,4,6,7,9,12),2),
            rep(c(2,3,5,8,10,11),2))),
    seqa=rep(c('TR','RT'),each=12),
    trta=rep(c('T','R','R','T'),each=6),
    aperiod=factor(rep(c(1,2,1,2),each=6))
)
datain <- datain[order(datain$subjid,datain$aperiod),]
datain$logAval <- log(datain$aval)

Anova

With these balanced data, Anova gives no problems.
A1 <- aov(logAval ~  aperiod +seqa+ trta + Error(subjid),
    data=datain)
summary(A1)

Error: subjid
          Df Sum Sq Mean Sq F value Pr(>F)
seqa       1 0.0023  0.0023   0.014  0.907
Residuals 10 1.5943  0.1594               

Error: Within
          Df  Sum Sq  Mean Sq F value Pr(>F)  
aperiod    1 0.02050 0.020501   3.784 0.0804 .
trta       1 0.00040 0.000397   0.073 0.7921  
Residuals 10 0.05417 0.005417                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extraction of parameters of interest

The degrees of freedom are calculated outside the anova. This particular bit should be correct even with unbalanced data, but that is untested.
wi <- summary(A1)$`Error: Within`[[1]]
countn <- rowSums(table(datain$seqa,datain$subjid)>0)
sn <- sqrt(1/(2*countn[1]) + 1/(2*countn[2]))
SE <- sqrt(wi$`Mean Sq`[rownames(wi)=='Residuals'])*sn
td <- qt(.05,wi$`Df`[rownames(wi)=='Residuals'])
cat('Point estimate (%)',100*exp(coef(A1)$Within['trtaT']),'\n',
    '90% interval (%)  '  ,
    100*exp(coef(A1)$Within['trtaT']+td*SE),'to ',
    100*exp(coef(A1)$Within['trtaT']-td*SE),
    '\n',
    'CVIntra (%)       ',
    100*sqrt(exp(wi$`Mean Sq`[rownames(wi)=='Residuals'])-1),
    '\n')
Point estimate (%) 100.8168 
 90% interval (%)   95.47312 to  106.4596 
 CVIntra (%)        7.370138 

nlme

nlme gave no problems; the parameters are the same.
library(nlme)
L1 <- lme(logAval ~ aperiod + seqa + trta,
    random=~1|subjid,
    data=datain)
summary(L1)

Linear mixed-effects model fit by REML
 Data: datain 
       AIC      BIC   logLik
  6.767892 12.74229 2.616054

Random effects:
 Formula: ~1 | subjid
        (Intercept)   Residual
StdDev:   0.2775045 0.07360159

Fixed effects: logAval ~ aperiod + seqa + trta 
               Value  Std.Error DF   t-value p-value
(Intercept) 3.510257 0.11720776 10 29.949011  0.0000
aperiod2    0.058454 0.03004772 10  1.945360  0.0804
seqaTR      0.019577 0.16301059 10  0.120097  0.9068
trtaT       0.008135 0.03004772 10  0.270732  0.7921
 Correlation: 
         (Intr) aperd2 seqaTR
aperiod2 -0.128              
seqaTR   -0.695  0.000       
trtaT    -0.128  0.000  0.000

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-1.21397837 -0.41862241 -0.05987476  0.37505780  1.30243785 

Number of Observations: 24
Number of Groups: 12 

Extraction of parameters of interest

One noticeable thing here is that VarCorr() returns its values as character, so the residual variance is converted with as.numeric().

coL1 <- fixef(L1)
SEL1 <- sqrt(vcov(L1)['trtaT','trtaT'])
dfL1 <- L1$fixDF$terms[names(L1$fixDF$terms)=='trta']
tdL1 <- qt(.05,dfL1)
cat('Point estimate (%)',100*exp(coL1['trtaT']),'\n',
    '90% interval (%)  '  ,
    100*exp(coL1['trtaT']+tdL1*SEL1),'to ',
    100*exp(coL1['trtaT']-tdL1*SEL1),
    '\n',
    'CVIntra (%)       ',
    100*sqrt(exp(as.numeric(VarCorr(L1)['Residual','Variance']))-1),
    '\n')
Point estimate (%) 100.8168 
 90% interval (%)   95.47312 to  106.4596 
 CVIntra (%)        7.370138 

MCMCglmm

Definitely not the standard or expected calculation, but the answers are really close to the expected values.

library(MCMCglmm)
M1 <- MCMCglmm(logAval ~ aperiod + seqa + trta, 
    random= ~ subjid ,
    data=datain,family='gaussian',nitt=500000,thin=20,
    burnin=50000,verbose=FALSE)
summary(M1)
 Iterations = 50001:499981
 Thinning interval  = 20
 Sample size  = 22500 

 DIC: -41.6716 

 G-structure:  ~subjid

       post.mean l-95% CI u-95% CI eff.samp
subjid   0.09517  0.02514   0.1989    22500

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units  0.006832  0.00187  0.01401    22500

 Location effects: logAval ~ aperiod + seqa + trta 

            post.mean  l-95% CI  u-95% CI eff.samp  pMCMC    
(Intercept)  3.510768  3.245017  3.764446    22518 <4e-05 ***
aperiod2     0.058452 -0.008879  0.125118    22500 0.0838 .  
seqaTR       0.018367 -0.326948  0.397998    22500 0.9148    
trtaT        0.007888 -0.056976  0.077410    22500 0.7985    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Extraction of parameters of interest

Fairly straightforward: it is all in the samples. Notice the CVIntra is larger than expected; this was observed before in previous REML exercises. By and large, getting decent (point) estimates for variances is a major weakness of MCMCglmm.
M1est <- 100*exp(quantile(M1$Sol[,'trtaT'],c(.05,.5,.95)))
cat('Point estimate (%)',M1est[2],'\n',
    '90% interval (%)  '  ,
    M1est[1],'to ',M1est[3],
    '\n',
    'CVIntra (%)       ',
    100*sqrt(exp(mean(M1$VCV[,'units']))-1),
    '\n')

Point estimate (%) 100.7712 
 90% interval (%)   95.45551 to  106.4557 
 CVIntra (%)        8.279958 
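Since the posterior of a variance is right skewed, the posterior mean overstates it; the posterior mode is usually closer to the REML estimate. A quick check, assuming M1 from above:
100 * sqrt(exp(posterior.mode(M1$VCV[, 'units'])) - 1)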

lme4

lme4 was not cooperating. I expected to use mcmcsamp(), but that function has disappeared in my version of lme4. Since I was a minor version behind on R I updated, which led to updating Eclipse from Juno to Kepler, which led to updating Java while I was at it, but mcmcsamp() was still gone. As a note, I had an old version of R and lme4 sitting around; the values out of mcmcsamp() were quite different from expected. I am using lme4 version 0.999999-2 for these calculations.
library(lme4)
R1 <- lmer(logAval ~ aperiod + seqa + trta + (1|subjid),
    data=datain)
summary(R1)
Linear mixed model fit by REML ['lmerMod']
Formula: logAval ~ aperiod + seqa + trta + (1 | subjid) 
   Data: datain 

REML criterion at convergence: -5.2321 

Random effects:
 Groups   Name        Variance Std.Dev.
 subjid   (Intercept) 0.077009 0.2775  
 Residual             0.005417 0.0736  
Number of obs: 24, groups: subjid, 12

Fixed effects:
            Estimate Std. Error t value
(Intercept) 3.510257   0.117208  29.949
aperiod2    0.058454   0.030048   1.945
seqaTR      0.019577   0.163011   0.120
trtaT       0.008135   0.030048   0.271

Correlation of Fixed Effects:
         (Intr) aperd2 seqaTR
aperiod2 -0.128              
seqaTR   -0.695  0.000       
trtaT    -0.128  0.000  0.000

Extraction of parameters of interest

The easy way out, as detailed in http://glmm.wikidot.com/faq, would be to impute the df value from another program, such as aov(). This would mean 10 df, and all the values would be equal to the presentation; a sketch of that route follows after the output below. In this blog I want to try something else. I wonder if this combination of simulate() and refit() is legit? The 90% interval is slightly narrower than the one in the presentation.
simeff <- function(m0) {
  s <- simulate(m0)
  fixef(refit(m0,s))['trtaT']
}
ff <- replicate(1000,simeff(R1))
R1est <- 100*exp(quantile(ff,c(.05,.5,.95)))
cat('Point estimate (%)',R1est[2],'\n',
    '90% interval (%)  '  ,
    R1est[1],'to ',R1est[3],
    '\n',
    'CVIntra (%)       ',
    100*sqrt(exp(sigma(R1)^2)-1),
    '\n')
Point estimate (%) 100.7082 
 90% interval (%)   95.75779 to  105.7563 
 CVIntra (%)        7.370137
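For comparison, the easy way out mentioned above; a minimal sketch that imputes the 10 residual df from aov() and reuses the Std. Error printed in the lmer summary:
est <- fixef(R1)['trtaT']  # 0.008135
SE <- 0.030048             # Std. Error of trtaT from summary(R1)
100 * exp(est + qt(c(.05, .95), df = 10) * SE)
# about 95.47 and 106.46 -- reproducing the aov()/nlme interval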