Saturday, July 19, 2014

Guns are cool - time effects

In September last year I made a post using the shootingtracker data. Shootingtracker attempts to register all shootings with at least four victims, be they wounded or dead. The data start on January 1st, 2013, which means that by now the amount of data has almost doubled. This is surely a dataset where I hope the maintainers find less and less to add. The analysis below shows that Sundays in summer have the highest number of shootings per day: three to four shootings on a Sunday in July and August.

Data

The data sit in two pages on shootingtracker, one for 2013 and one for 2014. In preparation for this post I copied and pasted those pages into notepad and removed headers and footers; in the 2013 data I kept the column names. The first steps of reading the data remove information that is extraneous for my purpose, such as the references where the data came from. Subsequently state and town are separated, and a few records that do not use the correct state abbreviation are corrected. Finally, '13' is reformatted to '2013' and a proper date is created. The last record used is from July 9th, 2014.
r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13,r14)
head(r1)
[1] "Number\t Date\t Alleged Shooter\t Killed\t Wounded\t Location\t References"
[2] "1\t1/1/13\tCarlito Montoya\t4\t0\tSacramento, CA\t"                        
[3] " [Expand] "                                                          
[4] "2\t1/1/13\tUnknown\t1\t3\tHawthorne, CA\t"                                 
[5] " [Expand] "                                                          
[6] "3\t1/1/13\tJulian Sims\t0\t4\tMcKeesport, PA\t"
tail(r1)
[1] "141\t7/8/2014\tUnknown\t1\t4\tSan Bernardino, CA\t"    
[2] " [Expand] "                                      
[3] "142\t7/8/2014\tUnknown\t0\t5\tProvidence, RI\t"        
[4] " [Expand] "                                      
[5] "143\t7/9/2014\tRonald Lee Haskell\t6\t1\tHouston, TX\t"
[6] " [Expand] " 
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)  # drop the '[Expand]' markers
r3 <- gsub('^ *$','',r2)                # blank out lines containing only spaces
r4 <- r3[r3!='']                        # remove the now empty lines
r5 <- gsub('\\t$','',r4)                # trailing tab: empty References field
r6 <- gsub('\\t References$','',r5)     # drop the References column from the header
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
r7$Location[r7$Location=='Washington DC'] <-
    'Washington, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location','State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ','',r8$State)
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State=='Kansas'] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)

r7$State <- r8$State
r7$Location <- r8$Location
r7 <- r7[r7$State != 'PUERTORICO',]
Sys.setlocale(category = "LC_TIME", locale = "C")  # English day/month abbreviations
r7$Date <- gsub('/13$','/2013',r7$Date)            # 2013 rows use a two-digit year
r7$date <- as.Date(r7$Date,format="%m/%d/%Y")
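A couple of quick sanity checks on the cleaned data (a sketch; none of this is needed for the analysis below):
# quick sanity checks on the cleaned data
range(r7$date)                                # should run from 2013-01-01 to 2014-07-09
nrow(r7)                                      # number of shootings in the data
head(sort(table(r7$State), decreasing=TRUE))  # states with the most shootings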

Effect of day and month

The effect of day of the week is pretty easy to plot: just add the weekday and run qplot(weekday). Two complications arise, though. First, I want the days in a specific order, Monday to Sunday. Second, not all weekdays occur equally often in the period covered. The latter is not enough to invalidate the plot, but since I had to correct for the occurrence of months in a similar plot anyway, I reused that code for weekdays. The data frame alldays is used to count how often each day occurs in the data set. I am not going to over-analyze this: Sundays stick out in a negative way.
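For reference, the quick-and-dirty version mentioned above would be something like the sketch below; it ignores both the ordering of the days and the fact that the weekdays occur unequally often. The corrected version follows.
# naive version: raw counts per weekday, uncorrected and in default order
library(ggplot2)
qplot(format(r7$date,'%a'), xlab='Day of the week')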
library(ggplot2)
r7$weekday <- factor(format(r7$date,'%a'),
    levels=format(as.Date('07/07/2014',format="%m/%d/%Y")+0:6,'%a'))  # 2014-07-07 is a Monday, so levels run Mon..Sun
r7$month <- factor(format(r7$date,'%b'),
    levels=format(as.Date('01/15/2014',format="%m/%d/%Y")+
            seq(0,length.out=12,by=30),'%b'))  # mid-month dates 30 days apart give one level per month, Jan..Dec
alldays <- data.frame(
    date=seq(min(r7$date),max(r7$date),by=1))
alldays$weekday <- factor(format(alldays$date,'%a'),
    levels=levels(r7$weekday))
alldays$month <- factor(format(alldays$date,'%b'),
    levels=format(as.Date('01/15/2014',format="%m/%d/%Y")+
            seq(0,length.out=12,by=30),'%b'))
ggplot(data=data.frame(prop=as.numeric(table(r7$weekday)/
        table(alldays$weekday)),
    weekday=factor(levels(r7$weekday),
        levels=levels(r7$weekday))),
    aes(y=prop,x=weekday)) +
    geom_bar(stat='identity') +
    ylab('Shootings per day') +
    xlab('Day of the week')
In terms of months, it seems summer is worse than the other seasons and winter is best.
ggplot(data=data.frame(prop=as.numeric(table(r7$month)/
                    table(alldays$month)),
            month=factor(levels(r7$month),
                levels=levels(r7$month))),
        aes(y=prop,x=month)) +
    geom_bar(stat='identity') +
    ylab('Shootings per day') +
    xlab('Month')   

A model

Given the unequal distribution of weekdays over months, it is not obvious how significant a month effect is. To examine this, I reorganized the data into shootings per day. The data frame alldays is used again, now to ensure that days with no shootings are correctly represented as zeros. The modeling shows a clear effect of month, with the interaction of day and month on the brink of significance.
r7$one <- 1
ag <- aggregate(r7$one,
    by=list(date=r7$date),FUN=sum)
counts <- merge(alldays,ag,all=TRUE)
counts$x[is.na(counts$x)] <- 0
g0 <- glm(x ~ weekday ,data=counts,
    family='poisson')
g1 <- glm(x ~ weekday + month,data=counts,
    family='poisson')
g2 <- glm(x ~ weekday * month,data=counts,
    family='poisson')
anova(g0,g1,g2,test='Chisq')
Analysis of Deviance Table

Model 1: x ~ weekday
Model 2: x ~ weekday + month
Model 3: x ~ weekday * month
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
1       548     653.86                        
2       537     624.86 11   29.008 0.002264 **
3       471     538.99 66   85.870 0.050718 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predictions

To understand the interaction, a plot is made of the expected values by day and month. Looking at this plot, the Sunday effect is most pronounced in summer. I would not be surprised if next Sunday has four shootings again.
pred1 <- expand.grid(weekday=factor(levels(r7$weekday),
        levels=levels(r7$weekday)),
    month=factor(levels(r7$month),
        levels=levels(r7$month)))
preds <- predict(g2,pred1,
    type='response',
    se=TRUE)
pred1$fit <- preds$fit
pred1$se.fit <- preds$se.fit

limits <- aes(ymax = fit+se.fit, ymin=fit-se.fit)
p <- ggplot(pred1, aes(fill=weekday, y=fit, x=month))
dodge <- position_dodge(width=0.9)
p + geom_bar(position=dodge, stat="identity") +
    geom_errorbar(limits, position=dodge, width=0.25) +
    theme(legend.position = "bottom") +
    ylab('Shootings per Day') +
    guides (fill=guide_legend('Day'))
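To put a number on that last remark, the fitted rate for Sundays in July can be read straight off pred1 (a sketch):
# expected shootings per day for a Sunday in July, with its standard error
subset(pred1, weekday=='Sun' & month=='Jul')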


Follow the example?

One might ask whether shootings that get a lot of media attention trigger copycats. This is not easy to analyze, given the clear day and month effects, and it is not obvious which shootings get a lot of media attention in the first place. Still, we can look at the number of shootings over time and mark the shootings with the most victims in the plot. The threshold was arbitrarily chosen as more than 15 victims which, given the counts below, amounts to 18 or more. In this plot I cannot see a connection.
r7$Victims <- r7$Killed+r7$Wounded
table(r7$Victims)
  4   5   6   7   8   9  10  12  13  14  18  19  20  21 
322 104  25  26  11   5   4   2   2   2   1   1   1   1 

p <- ggplot(counts, aes(y=x,x=date))
p + stat_smooth(span=.11) +
    geom_point() +
    geom_vline(xintercept=as.numeric(r7$date[r7$Victims>15])) +
    ylab('shootings per day')

An ACF

Mostly because an ACF is an integral part of analyzing a time series, here it is, in case anybody doubted the week effect. I chose a fairly long maximum lag because 10 weeks seemed a nice round number.
plot(acf(counts$x,lag.max=70))
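For the record, the lag-7 autocorrelation itself can be extracted from the acf object (a sketch; the first element is lag 0, so lag 7 is element 8):
# autocorrelation at lag 7 (one week)
acf(counts$x, lag.max=70, plot=FALSE)$acf[8]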

Saturday, July 12, 2014

odfweave setup and counting logicals

Two short items in this blog post. First, since it was not obvious how to run odfWeave() in my particular setup, the call I am using. Second, several people have been cross-tabulating logical vectors, so I wanted to play along; my fastest version is about 80 times faster than table().

odfWeave

My particular setup consists of R, 7-zip and LibreOffice, and somehow they do not fully play along when using odfWeave. I ran into this problem in spring and decided to post my solution at some point. The problem occurred with my previous versions, and I have checked that the workaround still runs with my current setup (R 3.1.1, LibreOffice 4.2.5.2). The only loose end is that odfWeave complains when I re-use the working directory, so I need to empty that directory manually.
# the standard example call that works for me
demoFile <- system.file("examples", "simple.odt", package = "odfWeave")
outputFile <- gsub("simple.odt", "output.odt", demoFile)
odfWeave(demoFile, outputFile,
    workDir='C:\\Users\\Kees\\Documents\\tmp',
    odfWeaveControl(zipCmd = 
            c("C:\\Progra~1\\7-Zip\\7z a -tzip $$file$$ . -r", 
                "C:\\Progra~1\\7-Zip\\7z x -tzip $$file$$ -yr") ))
# removing files
file.remove(dir('C:\\Users\\Kees\\Documents\\tmp',
        recursive=TRUE,
        full.names=TRUE))

# using a different directory
odfWeave('C:\\Users\\Kees\\Documents\\test\\testcases.odt',
    'C:\\Users\\Kees\\Documents\\test\\testout.odt',
    workDir='C:\\Users\\Kees\\Documents\\tmp',
    odfWeaveControl(zipCmd = 
            c("C:\\Progra~1\\7-Zip\\7z a -tzip $$file$$ . -r", 
                "C:\\Progra~1\\7-Zip\\7z x -tzip $$file$$ -yr") ))

Cross table of logical vectors

This started with Sometimes Table is not the Answer – a Faster 2×2 Table and was carried on in Sometimes I feel (some) need for speed, so I wanted to add my own attempts. The aim is to cross-tabulate two logical vectors in a minimum of time, which becomes important when these vectors are long. First, the solutions from the previous posts.
set.seed(2014)

manual = sample(c(TRUE, FALSE), 10e6, replace = TRUE)
auto = sample(c(TRUE, FALSE), 10e6, replace = TRUE)

logical.tab = function(x, y) {
  tt = sum(x & y)
  tf = sum(x & !y)
  ft = sum(!x & y)
  ff = sum(!x & !y)
  return(matrix(c(ff, tf, ft, tt), 2, 2))
}

basic.tab2 = function(x, y) {
  dif = x - y
  tf = sum(dif > 0)
  ft = sum(dif < 0)
  tt = sum(x*y)
  ff = length(dif) - tt - tf - ft
  return(c(tf, ft, tt, ff))
}
tabulate(manual + auto *2+1, 4)  # third earlier solution: code the four combinations as 1..4 and tabulate

My idea was to use the margins and work back from there: compute the total of each vector and the number of TRUE/TRUE cases, and derive the remaining cells from those.
my.tab = function(x, y) {
  tt = sum(x * y)
  t1=sum(x)
  t2=sum(y)
  return(matrix(c(length(x)-t1-t2+tt,  t1-tt, t2-tt, tt), 2, 2))
}

my.tab2 <- function(x, y) {
  phase1 <- colSums(cbind(x,y,x*y))
  return(matrix(c(length(x)-sum(phase1[-3])+phase1[3],
     phase1[-3]-phase1[3],
     phase1[3]),2,2))
}
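Before timing, a quick check (a sketch; idx is just a helper index introduced here) that the fast versions reproduce the counts that table() gives, at least on a small slice:
# sanity check on a small slice: same counts, same 2x2 layout as table()
idx <- 1:1000
all(my.tab(manual[idx], auto[idx]) == table(manual[idx], auto[idx]))
all(logical.tab(manual[idx], auto[idx]) == table(manual[idx], auto[idx]))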
On my particular hardware table() is too slow to microbenchmark many times, hence only 20 repetitions, but roughly 80 times faster than table() is not bad.
library(microbenchmark)
microbenchmark(
    logical.tab(manual, auto), 
    basic.tab2(manual, auto),
    my.tab(manual,auto),
    my.tab2(manual,auto),
    tabulate(manual + auto *2+1, 4),
    table(manual,auto),
    times = 20)
Unit: milliseconds
                               expr        min         lq     median         uq        max neval
          logical.tab(manual, auto)  2852.5587  2888.8590  2906.4571  2972.3916  3227.0821    20
           basic.tab2(manual, auto)   705.8153   722.5800   746.1683   765.9400   957.5435    20
               my.tab(manual, auto)   185.8359   186.6829   188.0988   224.2308   413.5623    20
              my.tab2(manual, auto)   463.2731   481.8843   487.7825   512.2563   694.1729    20
 tabulate(manual + auto * 2 + 1, 4)   276.1837   300.8009   315.9451   379.7302   534.7997    20
                table(manual, auto) 15703.0576 16132.0100 16231.3342 16466.7445 19012.0273    20
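Rough arithmetic on the median timings above (a sketch) backs up that claim: my.tab() comes out at roughly 86 times the speed of table().
# speed-up of my.tab() relative to table(), from the medians above
round(16231.3342 / 188.0988, 1)   # about 86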

Sunday, July 6, 2014

Stone Flakes V, networks again

Last week I tried pcalg. This week it is deal (Learning Bayesian Networks with Mixed Variables). The aim is just as much to get a feel for what these methods do as to understand how the stone flakes data fit together.

Data

The data are the stone flakes data which I analyzed previously: the first post was clustering, the second linked to hominid type, the third was regression. Together these covered the bulk of a standard analysis. This new analysis uses the same starting data.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2) 

r12$group <- factor(r12$group,labels=c('Lower Paleolithic',
        'Levallois technique',
        'Middle Paleolithic',
        'Homo Sapiens'))
r12$site <- factor(c('other','gravel pit')[r12$site+1])
r12$mat <- factor(c('flint','other')[r12$mat])
r12$lmage <- log10(-r12$age)

Deal

The starting point of this post was to continue my analysis of last week, but when I discovered that deal could be used to discover the model structure, I chose to repeat last week's analysis instead. Deal does not have a vignette, but there is a paper, deal: A Package for Learning Bayesian Networks, which helped me get started.

First Model

Initially I wanted to start with a model containing only continuous variables, similar to before, but that threw an error in jointprior(). Hence I added group as a factor. autosearch() and heuristic() produce a lot of output, basically one line for each step; for brevity this is not shown. The good thing about this model is that it has a solution where 'group' drives other variables.
library(deal)
rfin <- subset(r12,,c(names(r2)[-1],'group'))
rfin <- rfin[complete.cases(rfin),]
rfin.nw <- network(rfin)
rfin.prior <- jointprior(rfin.nw)

Imaginary sample size: 8 
rfin.nw <- learn(rfin.nw,rfin,rfin.prior)$nw
rfin.search <- autosearch(rfin.nw,
    rfin,
    rfin.prior,
    trace=FALSE)

plot(rfin.search$nw)
heuristic() is used to further improve the model. In the end the model seems a bit more complex than the one from pcalg, but not unreasonably so.
rfin.heuristic <- heuristic(rfin.search$nw,
    rfin,
    rfin.prior,
    restart=10,
    trace=FALSE,
    trylist=rfin.search$trylist)
plot(rfin.heuristic$nw)

Second model

For brevity I won't repeat the code; it is all the same except for the data going in, which is shown below. The second model is similar to the first, but the (potential) outliers have been removed. The resulting model looks even cleaner.
rfin <- subset(r12,
    !(r12$ID %in% c('ms','c','roe','sz','va','arn')),
    c(names(r2)[-1],'group'))
rfin <- rfin[complete.cases(rfin),]

Third model

The model I really wanted included all sensible factors. However, with those factors the imaginary sample size grew to 96, and in my experience higher imaginary sample sizes produce more complex networks and longer run times. The resulting model was a bit too complex to my liking, while moving below the recommended imaginary sample size gave runtime errors.
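The code for this attempt is not in the original post; purely as an illustration, the data step presumably looked something like the sketch below (my reconstruction: keep the measurements and all factors, drop only the identifiers and the raw age).
# sketch only (reconstruction, not code from the original post)
rfin <- subset(r12,
    !(r12$ID %in% c('ms','c','roe','sz','va','arn')),
    c(-ID,-number,-age,-dating))
rfin <- rfin[complete.cases(rfin),]
rfin.nw <- network(rfin)
rfin.prior <- jointprior(rfin.nw)  # here the imaginary sample size grew to 96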

Final model

In the final model (a few intermediate models are skipped here) a number of simplifications were made. Region is removed as a factor and log(-age) as a continuous variable. Group has lost Homo Sapiens, since that category had only three records. The model is also restricted in the sense that group cannot be the result of other variables; in the plot these banned links are shown as red arrows.
rfin <- subset(r12,
    !(r12$ID %in% c('ms','c','roe','sz','va','arn')),
    c(-ID,-number,-age,-dating,-region,-lmage))
rfin <- rfin[rfin$group !='Homo Sapiens',]
rfin <- rfin[complete.cases(rfin),]
rfin.nw <- network(rfin)
rfin.prior <- jointprior(rfin.nw)
mybanlist <- matrix(
    c(2:11,
        rep(1,10)),ncol=2)
banlist(rfin.nw) <- mybanlist
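The remaining fitting steps are not shown in the original post; they presumably mirror the first model, roughly as in the sketch below.
# sketch: fit and search as for the first model, now with the banlist in place
rfin.nw <- learn(rfin.nw,rfin,rfin.prior)$nw
rfin.search <- autosearch(rfin.nw,rfin,rfin.prior,trace=FALSE)
rfin.heuristic <- heuristic(rfin.search$nw,rfin,rfin.prior,
    restart=10,trace=FALSE,trylist=rfin.search$trylist)
plot(rfin.heuristic$nw)   # banned links show up as red arrows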

Conclusion

Deal makes networks that are too complex for my liking, while pcalg cannot use discrete variables. Deal has banned links, a feature which helps. pcalg made nicer plots, but I have the feeling that is relatively easily remedied. Neither gave a model which struck me as one to continue with. I will need quite a bit more study to feel comfortable with this kind of model.

Sunday, June 29, 2014

stone flakes IV

In this post I want to try something new, a causal graphical model. The aim is just as much to get a feel for what these methods do as to understand how the stone flakes data fit together.

Data

The data are the stone flakes data which I analyzed previously: the first post was clustering, the second linked to hominid type, the third was regression. Together these covered the bulk of a standard analysis. This new analysis uses the same starting data.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2)

Packages

The main package used is pcalg (Methods for Graphical Models and Causal Inference). Even though it lives on CRAN, it requires RBGL (An interface to the BOOST graph library), which lives on Bioconductor. Plots are made via Rgraphviz (Provides plotting capabilities for R graph objects), Bioconductor again, which in turn has the hard work done by graphviz; on my Linux machine that is a few clicks to install.
library('pcalg')
library('Rgraphviz')


First analysis

I am just following the vignette here, to get some working code.
rx <- subset(r12,,names(r2)[-1])
rx <- rx[complete.cases(rx) & !(r12$ID %in% c('ms','c','roe','sz','va','arn')),]
suffStat <- list(C = cor(rx), n = nrow(rx))
pc.gmG <- pc(suffStat, indepTest = gaussCItest,
    p = ncol(rx), alpha = 0.01)
png('graph1.png')
plot(pc.gmG, main = "")
dev.off()


Personally I dislike this plot, since you have to know which variable is which number; I don't think that is acceptable for something one wants to share. Since I could not find documentation on how to modify this via the plot statement, I took the ugly road of directly modifying the S4 object pc.gmG.
pc.gmG@graph@nodes <- names(rx)
names(pc.gmG@graph@edgeL) <- names(rx)
png('graph2.png')
plot(pc.gmG, main = "")
dev.off()

This makes some sense looking at the variable names.
RTI (Relative-thickness index of the striking platform) is connected to WDI (Width-depth index of the striking platform). PSF (platform primery (yes/no, relative frequency)) is related to FSF (Platform facetted (yes/no, relative frequency)). PSF is also related to PROZD (Proportion of worked dorsal surface (continuous)) which then goes to ZDF1 (Dorsal surface totally worked (yes/no, relative frequency)). ZDF1 is also influenced by FLA (Flaking angle (the angle between the striking platform and the splitting surface)).

Second analysis

Much as I like this analysis, it does not lead to a connection between the flake variables on one hand and age or group on the other. Since the algorithm assumes normally distributed variables, group is out of the question. Log(-age) seems to be closest to normally distributed.
rx <- subset(r12,,names(r2)[-1])
rx$lmage <- log(-r12$age)
rx <- rx[complete.cases(rx) & !(r12$ID %in% c('ms','c','roe','sz','va','arn')),]
suffStat <- list(C = cor(rx), n = nrow(rx))
pc.gmG <- pc(suffStat, indepTest = gaussCItest,
    p = ncol(rx), alpha = 0.01)
pc.gmG@graph@nodes <- names(rx)
names(pc.gmG@graph@edgeL) <- names(rx)
plot(pc.gmG, main = "")



Adding age links the two parts while keeping most of the previous graph unchanged. The causal link, however, seems reversed: does age cause change in flakes, or do changes in flakes cause age? Nevertheless, it does show a different picture than before. In the linear regression FSF and LBI contributed, but there I had not removed the outliers. In this approach FSF features, but is in turn driven by PSF. The other direct influence is ZDF1, which is now also driven by WDI.

Third analysis

It is probably pushing the limits of what can pass for normally distributed, but there are two binary variables: stone material (1=flint, 2=other) and site (1=gravel pit, 0=other).
rx <- subset(r12,,c('site','mat',names(r2)[-1]))
rx$lmage <- log(-r12$age)
rx <- rx[complete.cases(rx) & !(r12$ID %in% c('ms','c','roe','sz','va','arn')),]
suffStat <- list(C = cor(rx), n = nrow(rx))
pc.gmG <- pc(suffStat, indepTest = gaussCItest,
    p = ncol(rx), alpha = 0.01)
pc.gmG@graph@nodes <- names(rx)
names(pc.gmG@graph@edgeL) <- names(rx)
plot(pc.gmG, main = "")

This adds a link from material to RTI (relative thickness) and a connection from site to FLA (flaking angle); see the boxplots below.
# site (1=gravel pit, 0=other)
boxplot(FLA ~ c('other','gravel pit')[site+1],
    data=r12,
    ylab='FLA',
    xlab='site')

#stone material (1=flint, 2=other)
boxplot(RTI ~ c('flint','other')[mat],
    data=r12,
    ylab='RTI',
    xlab='mat')

Sunday, June 22, 2014

stone flakes III

Stone flakes are waste products from the tool-making process in the stone age. This is the third post: the first was clustering, the second linked to hominid type. The data also contain a more or less continuous age variable, which opens the possibility of regression, the topic of this week.

Data

For the data source, see the first post. Regarding age in particular, the documentation says: 'in millenia (not to be taken too seriously) ... mode of dating (geological=more accurate, typological)'. On inspection, most of the typological dates are 200k, so it seemed sensible to ignore the typological ones. In addition, age is negative; since at some point I want to use a Box-Cox transformation, which requires positive values, -age is added as mAge.
r2 <- read.table('StoneFlakes.txt',header=TRUE,na.strings='?')
r1 <- read.table('annotation.txt',header=TRUE,na.strings='?')
r12 <- merge(r1,r2)
r12$Group <- factor(r12$group,labels=c('Lower Paleolithic',
        'Levallois technique',
        'Middle Paleolithic',
        'Homo Sapiens'))
r12$mAge <- -r12$age
r12c <- r12[complete.cases(r12),]
table(r12$age ,r1$dating )
       geo typo
  -400   2    0
  -300   8    1
  -200  13   12
  -130   1    0
  -120  12    0
  -80   23    2
  -40    3    0

Regression

The tool to start with is linear regression. This shows there is a relation, mostly between age and FSF and LBI. It is not a very good model; the residual standard error is about 55, i.e. 55 thousand years.
l1 <- lm(age ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12c[r12c$dating=='geo',])
summary(l1)
Call:
lm(formula = age ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + 
    PROZD, data = r12c[r12c$dating == "geo", ])

Residuals:
     Min       1Q   Median       3Q      Max 
-127.446  -30.646   -7.889   27.790  159.471 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -328.8260   310.6756  -1.058   0.2953  
LBI          149.1334    65.8323   2.265   0.0281 *
RTI           -0.8196     2.8498  -0.288   0.7749  
WDI           16.2067    20.1351   0.805   0.4249  
FLA           -1.6769     1.8680  -0.898   0.3739  
PSF           -0.9222     1.0706  -0.861   0.3934  
FSF            1.9496     0.8290   2.352   0.0229 *
ZDF1           0.9537     1.1915   0.800   0.4275  
PROZD          1.2245     2.0068   0.610   0.5447  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 55.09 on 47 degrees of freedom
Multiple R-squared:  0.7062, Adjusted R-squared:  0.6562 
F-statistic: 14.12 on 8 and 47 DF,  p-value: 3.23e-10
Some model validation can be made with diagnostic plots; the car package is loaded here because it provides the Box-Cox transformation used below. The plots give the impression that the error increases with age. It is not a particularly strong effect, but then the age range is not that large either: 40 to 400, a factor of 10.
library(car)
par(mfrow=c(2,2))
plot(l1,ask=FALSE)
Since I know of no theoretical basis to choose a transformation, Box-Cox is my method of choice to proceed. Lambda zero, the log transformation, is within the confidence bounds and seems reasonable, hence it is selected.
r12cx <- r12c[r12c$dating=='geo',]
summary(p1 <- powerTransform(
        mAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD, 
        r12cx))
bcPower Transformation to Normality 

   Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
Y1    0.1211   0.1839          -0.2394           0.4816

Likelihood ratio tests about transformation parameters
                             LRT df         pval
LR test, lambda = (0)  0.4343012  1 5.098859e-01
LR test, lambda = (1) 21.3406407  1 3.844933e-06

Linear regression, step 2

Having chosen a transformation, it is time to rerun the model. It is now clear that FSF is the most important variable, with LBI a bit less so.
r12cx$lAge <- log(-r12cx$age)
l1 <- lm(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12cx)
summary(l1)
Call:
lm(formula = lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + 
    PROZD, data = r12cx)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.04512 -0.18333  0.07013  0.21085  0.58648 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)   
(Intercept)  5.071224   2.030941   2.497  0.01609 * 
LBI         -0.953462   0.430357  -2.216  0.03161 * 
RTI          0.005561   0.018630   0.298  0.76664   
WDI         -0.041576   0.131627  -0.316  0.75350   
FLA          0.018477   0.012211   1.513  0.13695   
PSF          0.002753   0.006999   0.393  0.69580   
FSF         -0.015956   0.005419  -2.944  0.00502 **
ZDF1        -0.004485   0.007789  -0.576  0.56748   
PROZD       -0.009941   0.013119  -0.758  0.45236   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3601 on 47 degrees of freedom
Multiple R-squared:  0.6882, Adjusted R-squared:  0.6351 
F-statistic: 12.97 on 8 and 47 DF,  p-value: 1.217e-09

Plot of regression

With two independent variables, it is easy to make a nice plot:
l2 <- lm(lAge ~ LBI + FSF ,
    data=r12cx)
par(mfrow=c(1,1))
incont <- list(x=seq(min(r12cx$LBI),max(r12cx$LBI),length.out=12),
    y=seq(min(r12cx$FSF),max(r12cx$FSF),length.out=13))
topred <- expand.grid(LBI=incont$x,
    FSF=incont$y)
topred$p1 <- predict(l2,topred)        
incont$z <- matrix(-exp(topred$p1),nrow=length(incont$x))
contour(incont,xlab='LBI',ylab='FSF')
cols <- colorRampPalette(c('violet','gold','seagreen'))(4) 
with(r12cx,text(x=LBI,y=FSF,ID,col=cols[group]))

Predictions

I started in data analysis as a chemometrician, so my method of choice for a predictive model with correlated independent variables is PLS. In this case one component seems enough (lowest cross-validated RMSEP), but the model explains only about 60% of the log(age) variability, which is not impressive.
library(pls)
p1 <- mvr(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12cx,
    method='simpls',
    validation='LOO',
    scale=TRUE,
    ncomp=5)
summary(p1)
Data: X dimension: 56 8 
Y dimension: 56 1
Fit method: simpls
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 56 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV          0.6016   0.3714   0.3812   0.3864   0.3915   0.3983
adjCV       0.6016   0.3713   0.3808   0.3859   0.3910   0.3976

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps
X       51.07    64.11    76.81    83.13    89.44
lAge    64.62    67.94    68.70    68.78    68.81
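To check the choice of one component visually, the cross-validated RMSEP per number of components can be plotted with pls' validationplot() (a sketch):
# cross-validated RMSEP per number of components; the minimum is at 1 component
validationplot(p1, val.type='RMSEP')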
A plot shows there are a number of odd points, which are removed in the next section.
r12c$plspred <- -exp(predict(p1,r12c,ncomp=1))
plot(plspred ~age,ylab='PLS prediction',type='n',data=r12c)
text(x=r12c$age,y=r12c$plspred,r12c$ID,col=cols[r12c$group])
In an email, Thomas Weber also indicated that there might be reasons to doubt the hominid group of a few inventories (rows); the reasons include few flakes, changing insight, and misfit from an "impressionist technological" point of view. All inventories mentioned in that email are now removed. As can be seen, a two-component PLS model is now preferred, and about 80% of the variance is explained.
p2 <- mvr(lAge ~ LBI + RTI + WDI + FLA + PSF + FSF + ZDF1 + PROZD,
    data=r12cx,
    method='simpls',
    validation='LOO',
    scale=TRUE,
    ncomp=5,
    subset= !(ID %in% c('ms','c','roe','sz','va','arn')))
summary(p2)

Data: X dimension: 53 8 
Y dimension: 53 1
Fit method: simpls
Number of components considered: 5

VALIDATION: RMSEP
Cross-validated using 53 leave-one-out segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
CV          0.6088   0.3127   0.3057   0.3062   0.3070   0.3153
adjCV       0.6088   0.3126   0.3054   0.3059   0.3065   0.3140

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps
X       52.14    63.94    77.05    79.82    84.77
lAge    75.44    79.66    80.31    81.00    81.56

Plot 

The plot below shows the age in the data versus the model predictions. It should be noted that, after all these steps, the data used to fit the model are expected to be predicted better than the other data. This is especially true for the suspected outliers which were removed, but also for the inventories with typological dating.
Having said that, the model really does not find inventories v1 and v2 to be as old as the data state. Perhaps there was no, less, or different technological change there, which the model does not pick up. In addition, the step between Middle Paleolithic and Homo Sapiens is not picked up by the model. It is my personal suspicion that more Homo Sapiens data would improve all of the models I have made in these three blog posts. As it is, the regression tree was the best tool to detect these inventories, which is a bit odd.
r12c$plspred2 <- -exp(predict(p2,r12c,ncomp=2))
plot(plspred2 ~age,ylab='PLS prediction',type='n',data=r12c)
text(x=r12c$age,y=r12c$plspred2,r12c$ID,col=cols[r12c$group])