Sunday, January 20, 2013

Unemployment

I want to exercise a bit more with ggplot2 and there is always data to be gotten from Eurostat which is interesting. In Netherlands the statistics agency (CBS) brought these headlines (translated with http://translate.google.nl/):
  • Unemployment rose to 7.2 percent in December
  • From mid-2011, almost continuous increase of unemployment
  • Unemployment benefits in December increased by 18,000
  • In 2012, fewer benefits were terminated because of work resumption than in 2011
I lost overview in unemployment ages ago, save to know it is not good at all. Hence unemployment  plots.

Data introduction

Data from Eurostat (Harmonised unemployment rate % (seasonally adjusted)). I fiddles around with data ranges, put age categories in columns, country and month in rows, exported in .xls, cleaned it a bit in LibreOffice and exported to .cvs. It was important to have time as a variable, my first attempt had more month columns than the process could cope with. So, the R script can start with preparations:
library(ggplot2)
library(KernSmooth)
library(plyr)

r1 <- read.csv("unemployment.csv")
levels(r1$Region) <- sub(' countries)',')' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub(' (under United Nations Security Council Resolution 1244/99)','' ,levels(r1$Region),fixed=TRUE)
levels(r1$Region) <- sub('including +former GDR','Incl GDR' ,levels(r1$Region))
levels(r1$Region) <- sub('European Union','EU' ,levels(r1$Region))
levels(r1$Region)[levels(r1$Region)=='United Kingdom'] <- 'UK'
levels(r1$Region)[levels(r1$Region)=='United States'] <- 'US'
levels(r1$Region)[levels(r1$Region)=='Germany (Incl GDR from 1991)'] <- 'Germany'
levels(r1$Region)
 [1] "Austria"        "Belgium"        "Bulgaria"       "Croatia"   
 [5] "Cyprus"         "Czech Republic" "Denmark"        "Estonia"   
 [9] "Euro area (17)" "EU (27)"        "Finland"        "France" 
[13] "Germany"        "Greece"         "Hungary"        "Iceland" 
[17] "Ireland"        "Italy"          "Japan"          "Latvia" 
[21] "Lithuania"      "Luxembourg"     "Malta"          "Netherlands"
[25] "Norway"         "Poland"         "Portugal"       "Romania"
[29] "Slovakia"       "Slovenia"       "Spain"          "Sweden" 
[33] "Turkey"         "UK"             "US"   
r2 <- reshape(r1,varying=list(names(r1)[c(-1,-2)]),
    idvar=c('Region','TIME'),timevar='Age',direction='long',
    v.names=c('Percentage'),
    times=c(names(r1)[c(-1,-2)]))
r2$Age <-     gsub('.',' ',r2$Age,fixed=TRUE)
Agelevels <- unique(r2$Age)
Agelevels <- Agelevels[order(Agelevels)]
Agelevels
[1] "From 25 to 74 years" "Less than 25 years"  "Total"              
r2$Age <- factor(r2$Age,levels=Agelevels[c(2,1,3)]) 
r2$Date <- as.Date(paste(gsub('M','-',as.character(r2$TIME)),'-01',sep=''))

Plots

As far as I understand, Eurostat makes the numbers comparable between countries. We also get US and Japan, so this will be a somewhat global picture. To make the space per country large enough, the countries are split in three: low, middle and high maximum unemployment.
maxi <- aggregate(r2$Percentage,by=list(Region=r2$Region),FUN=max,na.rm=TRUE)
low <- maxi$Region[maxi$x<quantile(maxi$x,1/3)]
middle <- maxi$Region[maxi$x>quantile(maxi$x,1/3) & maxi$x<quantile(maxi$x,2/3)]
high <- maxi$Region[maxi$x>quantile(maxi$x,2/3)]
ggplot(r2[r2$Region %in% low,],aes(x=Date,y=Percentage,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    geom_line()  +
    theme(legend.position = "bottom") +# stat_smooth(span=.1,method='loess') +
    ylab('% Unemployment') + xlab('Year')
The plots show a bit too much wobbles for my taste, and at some countries more than others, but ggplot has a smoothing solution for that:
ggplot(r2[r2$Region %in% low,],aes(x=Date,y=Percentage,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    theme(legend.position = "bottom") + 
    stat_smooth(span=.1,method='loess') +
    ylab('% Unemployment') + xlab('Year')
I am not completely happy with that. If I look at results and remember correctly, these smoothers may choose a smoothing parameter, which is not what I would intent here. So I dragged a kernel smoother out. This used a normal kernel with bandwidth 90 (days), or three months. This means the smoothing uses data over a time period from plus or minus a year, with most of the info from the neighboring months. In addition this also gave me opportunity to try plyr.
r2$class <- interaction(r2$Region,r2$Age)
r3 <- r2[complete.cases(r2),]
r3$class <- factor(r3$class)
Perc <- ddply(.data=r3,.variables=.(class), 
    function(piece,...) {
      lp <- locpoly(x=as.numeric(piece$Date),y=piece$Percentage,
          drv=0,bandwidth=90)
      sdf <- data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
          sPerc=lp$y,Age=piece$Age[1],Region=piece$Region[1])}
    ,.inform=FALSE
)
ggplot(Perc[Perc$Region %in% low ,],aes(x=Date,y=sPerc,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    geom_line()  +
    theme(legend.position = "bottom")+
    ylab('% Unemployment') + xlab('Year')
# removed incantation for middle and high



I will leave interpretation alone. Personally I mainly look at Europe, US, Germany and Netherlands.  The first two give me a global view. Locally, I live in the Netherlands and Germany is so influential in Netherlands it is a need to look at. 
In these plots I was surprised at Iceland (low numbers), Slovenia (big increase) and horrifed by Spain and Greece. I am sure each of us has their own interests and at least as much insight in these things as I do. 

Second part; derivative

The big question is, where are we going. For this I made the first derivative, smoothed with a somewhat bigger bandwidth, half a year. Derivatives are much more noisy and also suffer much more from rounding to the first digit, so this is a suitable bandwidth.
dPerc <- ddply(.data=r3,.variables=.(class), 
    function(piece,...) {
      lp <- locpoly(x=as.numeric(piece$Date),y=piece$Percentage,
          drv=1,bandwidth=365/2)
      sdf <- data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
          dPerc=lp$y,Age=piece$Age[1],Region=piece$Region[1])}
    ,.inform=FALSE
)
ggplot(dPerc[dPerc$Region %in% low ,],aes(x=Date,y=dPerc,colour=Age)) +
    facet_wrap( ~ Region, drop=TRUE) +
    geom_line()  +
    theme(legend.position = "bottom")+
    ylab('Change in % Unemployment') + xlab('Year')
# removed incantation middle and high
To me, these plots show a bit where we are going. Especially the youth unemployment seems to be the weather vane for the total. What I pick here is Portugal which seems to stabilizing. Denmark, UK and Ireland may be improving. The US is improving but while they were first, the numbers are not so spectacular. Germany gets a bit of headwind, Netherlands a bit more. However, France and Slovania may become a point of worry. Latvia also looks bad.

2 comments:

  1. very nice dataviz - easier it would be to use the datamarkt portal www.datamarket.com with the R API like we did here

    http://blog.eoda.de/2012/11/18/pimp-up-datamarket-zeitreihen-mit/

    best regards
    Heiko

    ReplyDelete
    Replies
    1. Thanks. datamarket is a nice place to look for data. Not sure if they do Eurostat, but I'll certainly go and look for data there at some time.

      Delete