Sunday, August 25, 2013

More REML exercise

Last week I tried exercise 1 of the SAS(R) proc mixed examples with the R libraries lme4 and MCMCglmm. This week I aimed for exercise 2 but ended up redoing exercise 1 with nlme.
Exercise 2 gave me problems with library lme4, and the latter parts of the exercise involve special correlation structures which do not seem feasible with lme4. So library nlme came into the picture. However, while reading up on it, I found that Pinheiro and Bates (2000) used this same data in their book. I am not about to reproduce the book in a blog, so I have to check in more detail what I can blog about. For now, I am redoing exercise 1 with library nlme.

Running the analysis

The first part of the analysis is actually smooth sailing. The random effects specification is quite different from the other libraries, but it is pretty simple.
library(nlme)
m1 <- lme(Y ~ A + B + A*B, random=~1|Block/A, data=sp)
summary(m1)
Linear mixed-effects model fit by REML
 Data: sp 
       AIC      BIC    logLik
  137.7618 145.7752 -59.88092

Random effects:
 Formula: ~1 | Block
        (Intercept)
StdDev:    7.899104

 Formula: ~1 | A %in% Block
        (Intercept) Residual
StdDev:    3.921982 3.059593

Fixed effects: Y ~ A + B + A * B 
             Value Std.Error DF   t-value p-value
(Intercept)  37.00  4.667411  9  7.927307  0.0000
A2            1.00  3.517318  6  0.284308  0.7857
A3          -11.00  3.517318  6 -3.127383  0.0204
B2           -8.25  2.163459  9 -3.813338  0.0041
A2:B2         0.50  3.059593  9  0.163420  0.8738
A3:B2         7.75  3.059593  9  2.533016  0.0321
 Correlation: 
      (Intr) A2     A3     B2     A2:B2 
A2    -0.377                            
A3    -0.377  0.500                     
B2    -0.232  0.308  0.308              
A2:B2  0.164 -0.435 -0.217 -0.707       
A3:B2  0.164 -0.217 -0.435 -0.707  0.500

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-1.165347655 -0.544427403 -0.007998877  0.526246640  1.473815718 

Number of Observations: 24
Number of Groups: 
       Block A %in% Block 
           4           12 

Covariance


VarCorr(m1)
            Variance     StdDev  
Block =     pdLogChol(1)         
(Intercept) 62.395850    7.899104
A =         pdLogChol(1)         
(Intercept) 15.381944    3.921982
Residual     9.361111    3.059593

Likelihood

-2*logLik(m1)
'log Lik.' 119.7618 (df=9)

ANOVA

anova(m1)

            numDF denDF  F-value p-value
(Intercept)     1     9 55.34418  <.0001
A               2     6  4.06957  0.0764
B               1     9 19.38873  0.0017
A:B             2     9  4.01929  0.0566

Predict A level 1

This is a problem. I went through the book and browsed the web, only to conclude that getting the mean is not difficult, but getting the associated standard error is not supported. Not because of any dislike of standard errors, but because it is an intrinsically difficult problem. I seem to remember reading years ago that lme4 had its MCMC functionality precisely for this reason. Which does not help me if I stubbornly try to get the standard error anyway.

Not a solution

After reading http://glmm.wikidot.com/faq, https://stat.ethz.ch/pipermail/r-sig-mixed-models/2010q2/003585.html and  http://stackoverflow.com/questions/14358811/extract-prediction-band-from-lme-fit I drafted this, which does not really work for my problem as it does not provide level A1 without specifying a level for B.
newdat <- unique(sp[sp$A==1,c('A','B','Block')])
newdat$pred <- predict(m1, newdat, level = 0)
Designmat <- model.matrix(eval(eval(m1$call$fixed)[-2]), newdat[-3]) 
predvar <- diag(Designmat %*% m1$varFix %*% t(Designmat)) 
newdat$SE <- sqrt(predvar) 
newdat$SE2 <- sqrt(predvar+m1$sigma^2)
newdat
   A B Block  pred       SE      SE2
1  1 1     1 37.00 4.667411 5.580846
4  1 1     2 37.00 4.667411 5.580846
7  1 1     3 37.00 4.667411 5.580846
10 1 1     4 37.00 4.667411 5.580846
13 1 2     1 28.75 4.667411 5.580846
16 1 2     2 28.75 4.667411 5.580846
19 1 2     3 28.75 4.667411 5.580846
22 1 2     4 28.75 4.667411 5.580846

Combining parameters

The A1 mean is just a weighted sum of known parameters, each with its own variance, so it can be combined by hand. It is, however, not obvious which level of variance is used. The calculation does reproduce the 4.54 which SAS has labelled 'A1 mean broad'.
fixef(m1)%*%c(1,0,0,.5,0,0)
       [,1]
[1,] 32.875
sqrt(vcov(m1)[1,1]+ .25 *vcov(m1)[4,4]+vcov(m1)[1,4])
[1] 4.540329
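The same calculation can be written with a single contrast vector, which makes it easier to see what is being combined; this is merely a restatement of the two lines above and again gives 32.875 and 4.54.
# variance of L'beta is L' V L, with V the fixed-effects covariance matrix
L <- c(1, 0, 0, .5, 0, 0)
c(estimate = drop(L %*% fixef(m1)),
  se       = sqrt(drop(t(L) %*% vcov(m1) %*% L)))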

Contrasts

A different parameterization of the model, with sum-to-zero contrasts for B and no intercept, also provides the 'A1 mean broad' of SAS directly in the fixed effects table.
contrasts(sp$B) <- contr.sum(levels(sp$B))
m2 <- lme(Y ~ A + B + A*B -1 ,random=~1|Block/A,data=sp)
summary(m2)

Linear mixed-effects model fit by REML
 Data: sp 
       AIC      BIC    logLik
  141.9207 149.9341 -61.96036

Random effects:
 Formula: ~1 | Block
        (Intercept)
StdDev:    7.899104

 Formula: ~1 | A %in% Block
        (Intercept) Residual
StdDev:    3.921982 3.059593

Fixed effects: Y ~ A + B + A * B - 1 
       Value Std.Error DF   t-value p-value
A1    32.875  4.540329  6  7.240665  0.0004
A2    34.125  4.540329  6  7.515975  0.0003
A3    25.750  4.540329  6  5.671395  0.0013
B1     4.125  1.081730 10  3.813338  0.0034
A2:B1 -0.250  1.529797 10 -0.163420  0.8734
A3:B1 -3.875  1.529797 10 -2.533016  0.0297
 Correlation: 
      A1     A2     A3     B1     A2:B1 
A2     0.757                            
A3     0.757  0.757                     
B1     0.000  0.000  0.000              
A2:B1  0.000  0.000  0.000 -0.707       
A3:B1  0.000  0.000  0.000 -0.707  0.500

Standardized Within-Group Residuals:
         Min           Q1          Med           Q3          Max 
-1.165347654 -0.544427403 -0.007998877  0.526246640  1.473815718 

Number of Observations: 24
Number of Groups: 
       Block A %in% Block 
           4           12 
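If only the A1 row is wanted, it can be pulled directly from the fixed effects table of this fit; a small convenience using the m2 object above.
# Value and Std.Error for A1 correspond to the 'A1 mean broad' of SAS
summary(m2)$tTable['A1', ]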

Sunday, August 18, 2013

Exercise in REML/Mixed model

I want to build a bit more experience with REML, so I decided to redo some of the SAS examples in R. This post describes the results for example 59.1 (page 5001, SAS(R)/STAT User's Guide 12.3 link). Following the list from freshbiostats I will analyze using lme4 and MCMCglmm.

Data

The data come from a split-plot design. To quote: 'The split-plot design involves two experimental factors, A and B. Levels of A are randomly assigned to whole plots (main plots), and levels of B are randomly assigned to split plots (subplots) within each whole plot.' I will read the data into r1 and then convert it to a suitable data.frame sp.
r1 <- read.table(textConnection('
1 1 1 56 1 1 2 41
1 2 1 50 1 2 2 36
1 3 1 39 1 3 2 35
2 1 1 30 2 1 2 25
2 2 1 36 2 2 2 28
2 3 1 33 2 3 2 30
3 1 1 32 3 1 2 24
3 2 1 31 3 2 2 27
3 3 1 15 3 3 2 19
4 1 1 30 4 1 2 25
4 2 1 35 4 2 2 30
4 3 1 17 4 3 2 18'),
  col.names=c('Block1','A1','B1','Y1','Block2','A2','B2','Y2'))
sp <- with(r1,data.frame(
        Block=factor(c(Block1,Block2)),
        A=factor(c(A1,A2)),
        B=factor(c(B1,B2)),
        Y=c(Y1,Y2)))
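As a quick sanity check of the split-plot layout, which is not part of the SAS example, a cross-tabulation should show exactly one observation for each Block by A by B combination.
# each Block x A x B cell should contain exactly one observation
ftable(xtabs(~ Block + A + B, data = sp))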

Analysis

I won't repeat the preliminary part (number of levels, number of observations etc.). It is not the R philosophy to have that within the mixed model. Hence it remains to calculate the variances, determine the significance of the fixed effects, and obtain a mean with standard deviation for the first level of factor A.

lme4

The model summary gives the same variances for the random effects and the same residual log likelihood ('-2 Res Log Likelihood' in proc mixed is 'REMLdev' in lme4), but different values for AIC and BIC. lme4 does not give tests for fixed effects by default.
library(lme4)
l1 <- lmer(Y ~ A + B + A*B + (1|Block) + (1|A:Block), data=sp)
summary(l1)
Linear mixed model fit by REML 
Formula: Y ~ A + B + A * B + (1 | Block) + (1 | A:Block) 
   Data: sp 
   AIC   BIC logLik deviance REMLdev
 137.8 148.4 -59.88    141.7   119.8
Random effects:
 Groups   Name        Variance Std.Dev.
 A:Block  (Intercept) 15.3819  3.9220  
 Block    (Intercept) 62.3958  7.8991  
 Residual              9.3611  3.0596  
Number of obs: 24, groups: A:Block, 12; Block, 4

Fixed effects:
            Estimate Std. Error t value
(Intercept)   37.000      4.667   7.928
A2             1.000      3.517   0.284
A3           -11.000      3.517  -3.127
B2            -8.250      2.163  -3.813
A2:B2          0.500      3.060   0.163
A3:B2          7.750      3.060   2.533

Correlation of Fixed Effects:
      (Intr) A2     A3     B2     A2:B2 
A2    -0.377                            
A3    -0.377  0.500                     
B2    -0.232  0.308  0.308              
A2:B2  0.164 -0.435 -0.217 -0.707       
A3:B2  0.164 -0.217 -0.435 -0.707  0.500

Test for fixed effects

The example gives tests for all fixed effects. I follow the school that says testing a main effect while it is present in an interaction has no meaning, so I will only test the A*B interaction. There are two ways to go about this: comparing hierarchical models via anova, and a parametric bootstrap simulation. Proc mixed has p-value 0.0566, the anova gives 0.0204 and the simulation 0.068.
l2 <- lmer(Y ~ A + B + (1 |Block) + (   1| A : Block) ,data=sp)
anova(l2,l1)
Data: sp
Models:
l2: Y ~ A + B + (1 | Block) + (1 | A:Block)
l1: Y ~ A + B + A * B + (1 | Block) + (1 | A:Block)
   Df    AIC    BIC  logLik  Chisq Chi Df Pr(>Chisq)  
l2  7 163.47 171.71 -74.734                           
l1  9 159.69 170.29 -70.844 7.7796      2    0.02045 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# parametric bootstrap: simulate data under the null model m0,
# refit both models to the simulated data and return the likelihood ratio statistic
pboot <- function(m0,m1) {
  s <- simulate(m0)          # response simulated under the null model
  L0 <- logLik(refit(m0,s))  # null model refitted to the simulated data
  L1 <- logLik(refit(m1,s))  # alternative model refitted to the simulated data
  2*(L1-L0)
}
obsdev <- 2*(as.numeric(logLik(l1)) - as.numeric(logLik(l2)))  # observed statistic
set.seed(1001)
reps <- replicate(1000,pboot(l2,l1))
mean(reps>obsdev)
[1] 0.068

Mean for A level 1

The example provides the mean with three standard errors, depending on the inference space. The broad inference space is of interest, mean 32.875, sd=4.54. In lme4 we will run this as another simulation. The sd is a bit smaller.
mcs <- mcmcsamp(l1,10000)
mcsdf <- as.data.frame(mcs)
c(mean=mean(mcsdf[,1]+.5*mcsdf[,'B2']),sd=sd(mcsdf[,1]+.5*mcsdf[,'B2']))
     mean        sd 
32.913643  3.459708 

MCMCglmm

MCMCglmm has a different perspective; for one, it is Bayesian. However, this does not preclude us from looking at the data. It seems the variances are completely different, much larger; more on this later. The fixed effects look quite like the ones from lme4.
library(MCMCglmm)
m1 <- MCMCglmm(Y ~ A + B + A*B, random= ~Block + A : Block ,
    data=sp,family='gaussian',nitt=500000,thin=20,
    burnin=50000,verbose=FALSE)
summary(m1)

 Iterations = 50001:499981
 Thinning interval  = 20
 Sample size  = 22500 

 DIC: 145.191 

 G-structure:  ~Block

      post.mean  l-95% CI u-95% CI eff.samp
Block     149.8 1.985e-17    459.8    21784

               ~A:Block

        post.mean  l-95% CI u-95% CI eff.samp
A:Block     25.99 5.336e-17      120    374.1

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units     18.81    3.764    39.78    606.9

 Location effects: Y ~ A + B + A * B 

            post.mean l-95% CI u-95% CI eff.samp   pMCMC   
(Intercept)   36.9516  23.9439  49.5868    22500 0.00213 **
A2             0.9780  -8.4952  10.5917    24190 0.80222   
A3           -11.0647 -20.8373  -1.5902    22500 0.03076 * 
B2            -8.2458 -14.2918  -2.0896    22991 0.01413 * 
A2:B2          0.5208  -7.8395   9.7543    22500 0.89973   
A3:B2          7.7596  -1.0876  16.2564    22500 0.07511 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Test for fixed effects

Hypothesis tests do not really exist in this context. However, one can have a look at how probable it is that the terms A2:B2 and A3:B2 are both zero at the same time. The posterior means of both terms are positive, but it seems that about 3% of the samples have both terms negative.
table(sign(m1$Sol[,'A2:B2']),sign(m1$Sol[,'A3:B2']))/nrow(m1$Sol)
             -1          1
  -1 0.03195556 0.41791111
  1  0.00560000 0.54453333
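The 3% in the top left cell can also be computed directly, restating the table as a single proportion.
# proportion of posterior samples in which both interaction terms are negative
mean(m1$Sol[,'A2:B2'] < 0 & m1$Sol[,'A3:B2'] < 0)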

Mean for A level 1

This can be pulled from the fixed effects samples. The standard deviation is a bit larger than from proc mixed and lme4.
c(mean=mean(m1$Sol[,1]+.5*m1$Sol[,'B2']),
        sd=sd(m1$Sol[,1]+.5*m1$Sol[,'B2']))
     mean        sd 
32.828740  6.655951 
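For completeness, a quantile-based 95% interval can be taken from the same derived samples; this is an extra check that is not in the SAS example.
# 95% interval for the A level 1 mean from the posterior samples
quantile(m1$Sol[,1] + .5*m1$Sol[,'B2'], c(.025, .975))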

MCMC variances discussion

MCMCglmm had strange results for the variances. For further examination I drew some quantiles.
The first concerns Block: the posterior mean of 150 is way off from the 62 which both lme4 and proc mixed give. Still, the median seems reasonable.
The second is A:Block, 15 from proc mixed and lme4. Many, many MCMC samples have very low values, pointing to a different solution, but there are also 10% of samples over 85.
Third is units: proc mixed and lme4 have 9.4, while most of the samples are much larger.
lapply(1:3,function(i) quantile(m1$VCV[,i],seq(.1,.9,.2)))
[[1]]
         10%          30%          50%          70%          90% 
6.036660e-08 2.804703e+01 5.683922e+01 1.050030e+02 2.735606e+02 

[[2]]
         10%          30%          50%          70%          90% 
6.975880e-13 4.202219e-06 2.628300e+00 2.210180e+01 8.511472e+01 

[[3]]
      10%       30%       50%       70%       90% 
 7.104689 11.608162 16.596443 22.304393 32.750846 

It almost seems as if MCMCglmm finds samples where the A:Block term is absent, while units get a much higher variance. Obviously, this alternative model can be run in lme4 and compared with the first model. The A:Block term turns out to be significant, which makes this explanation less plausible.
l3 <- lmer(Y ~ A + B + A*B + (1 |Block)  ,data=sp)
summary(l3)
 Linear mixed model fit by REML 
Formula: Y ~ A + B + A * B + (1 | Block) 
   Data: sp 
   AIC BIC logLik deviance REMLdev
 139.6 149 -61.81    146.8   123.6
Random effects:
 Groups   Name        Variance Std.Dev.
 Block    (Intercept) 65.472   8.0915  
 Residual             21.667   4.6547  
Number of obs: 24, groups: Block, 4

Fixed effects:
            Estimate Std. Error t value
(Intercept)   37.000      4.667   7.928
A2             1.000      3.291   0.304
A3           -11.000      3.291  -3.342
B2            -8.250      3.291  -2.507
A2:B2          0.500      4.655   0.107
A3:B2          7.750      4.655   1.665

Correlation of Fixed Effects:
      (Intr) A2     A3     B2     A2:B2 
A2    -0.353                            
A3    -0.353  0.500                     
B2    -0.353  0.500  0.500              
A2:B2  0.249 -0.707 -0.354 -0.707       
A3:B2  0.249 -0.354 -0.707 -0.707  0.500
anova(l1,l3)

Data: sp
Models:
l3: Y ~ A + B + A * B + (1 | Block)
l1: Y ~ A + B + A * B + (1 | Block) + (1 | A:Block)
   Df    AIC    BIC  logLik  Chisq Chi Df Pr(>Chisq)  
l3  8 162.83 172.25 -73.414                           
l1  9 159.69 170.29 -70.844 5.1407      1    0.02337 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
obsdev13 <- 2*( as.numeric(logLik(l1))-as.numeric(logLik(l3)))
reps13 <- replicate(1000,pboot(l3,l1))
mean(reps13>obsdev13)
[1] 0.027

Sunday, August 4, 2013

More rainfall calculations - REML

I have wanted to have a look at various REML methods for a long time, and the rainfall data seemed a nice example. On top of that, FreshBiostats had a blog post 'Mixed Models in R: lme4, nlme, or both?'. So lme4 it is.

The data  

As described before, the data are from the KNMI, the Royal Netherlands Meteorological Institute. The actual page for the data is http://www.knmi.nl/klimatologie/monv/reeksen/select_rr.html. The data selected were from six locations: De Bilt, Groningen, Heerde, Kerkwerve, Ter Apel and Winterswijk (Sibinkweg). The post starts from a data.frame named all, which was derived a few weeks ago.

Analysis

A simple analysis would ignore years. The data are plotted below; the figure indicates a clear effect.
library(ggplot2)
sel1 <- all[(all$year <1916 | all$year>2003) 
        & all$Month=='Jan' ,]
sel1$decade <- factor(c('1906-1915','2004-2013')[1+(sel1$year>1950)]) 
sel1$Rain <- as.numeric(sel1$RD>0)

ag1 <- aggregate(sel1$Rain,list(Year=sel1$Year,
        location=sel1$location,
        decade=sel1$decade),FUN=sum)
ag1$Year <- factor(ag1$Year)
p <- ggplot(ag1, aes(decade, x/31))
p + geom_boxplot(notch=TRUE)

Analysis part 2

The problem appears once years are plotted: it seems more of the years in the first decade have a low proportion of days with rain, but some years in the last decade also have fewer days with rain.
p <- ggplot(ag1, aes(decade, x/31,col=Year))
p + geom_boxplot()
As written, lme4 was my package of choice. This does give a significant effect, but only at 3%.
library(lme4)
l1 <- glmer(cbind(31-x,x) ~  (1|Year)  ,data=ag1,family=binomial)
l1
Generalized linear mixed model fit by the Laplace approximation 
Formula: cbind(31 - x, x) ~ (1 | Year) 
   Data: ag1 
   AIC   BIC logLik deviance
 231.3 236.9 -113.7    227.3
Random effects:
 Groups Name        Variance Std.Dev.
 Year   (Intercept) 0.38069  0.617   
Number of obs: 120, groups: Year, 20

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -0.3950     0.1423  -2.775  0.00552 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

l2 <- glmer(cbind(31-x,x) ~  decade + (1|Year) ,data=ag1,family=binomial)
l2
Generalized linear mixed model fit by the Laplace approximation 
Formula: cbind(31 - x, x) ~ decade + (1 | Year) 
   Data: ag1 
   AIC   BIC logLik deviance
 228.5 236.9 -111.3    222.5
Random effects:
 Groups Name        Variance Std.Dev.
 Year   (Intercept) 0.29496  0.5431  
Number of obs: 120, groups: Year, 20

Fixed effects:
                Estimate Std. Error z value Pr(>|z|)  
(Intercept)      -0.1019     0.1783  -0.572   0.5674  
decade2004-2013  -0.5862     0.2528  -2.319   0.0204 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)

dc2004-2013 -0.705
anova(l1,l2)
Data: ag1
Models:
l1: cbind(31 - x, x) ~ (1 | Year)
l2: cbind(31 - x, x) ~ decade + (1 | Year)
   Df    AIC    BIC  logLik Chisq Chi Df Pr(>Chisq)  
l1  2 231.31 236.88 -113.65                          
l2  3 228.53 236.90 -111.27 4.772      1    0.02893 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
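To see what the decade effect means on the probability scale, the fixed effects of l2 can be back-transformed with plogis. Since the response was specified as cbind(31-x, x), the modeled probability is that of a dry day; this back-transformation is a small extra sketch, not part of the original analysis.
# modeled probability of a dry day in each decade (response is cbind(31-x, x))
c('1906-1915' = unname(plogis(fixef(l2)[1])),
  '2004-2013' = unname(plogis(sum(fixef(l2)))))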