Regarding CPU speed, my current laptop has a lowly Celeron 877. From what I see of my computer's activity, under R it is mostly one core that does the work. This means that even though there are two cores, the single-core CPU mark of 715 (from cpubenchmark.net) is what I have available. A bit of checking shows that the current batch of processors mainly has more cores. For instance, the highest rated common CPU, an Intel Core i7-4710HQ, has a CPU mark of 7935 and a single-core mark of 1870. That is roughly 2.6 times faster for one core; the overall mark is so much higher because there are four cores. The same is true down the line: four cores are common, but single-core speed has not improved that much. Unless I can actually use those extra cores, what is the gain? Hence I am wondering: can I do something with extra cores in real-world R computations? That is what I investigate here.
Easy approach, Parallel

A bit of browsing shows that the parallel package is the easy way to use multiple cores: think of using mclapply() rather than lapply(). In many situations this is straightforward. Cross validation, for instance, is easy apart from the small upfront cost of partitioning the data into chunks. Trying different settings for a machine learning problem is similar.
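As a minimal sketch of that swap, the example below runs a 5-fold cross validation once with lapply() and once with mclapply(). It is an illustration only: it uses the built-in mtcars data and a plain lm() fit rather than the protein data, and the names cv_fold, fold and n_cores are made up for this example.

```r
library(parallel)

# mclapply() relies on forking, which is not available on Windows;
# fall back to a single core there. Two cores is an arbitrary choice.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Partition the rows into k folds up front (the small fixed cost
# mentioned above). mtcars is a stand-in data set.
set.seed(123)
k <- 5
fold <- sample(rep(seq_len(k), length.out = nrow(mtcars)))

# Fit on k - 1 folds and return the mean squared error on the held-out fold.
cv_fold <- function(i) {
  train <- mtcars[fold != i, ]
  test  <- mtcars[fold == i, ]
  fit   <- lm(mpg ~ wt + hp, data = train)
  mean((test$mpg - predict(fit, newdata = test))^2)
}

# Same call shape; only the apply function and mc.cores differ.
serial_mse   <- unlist(lapply(seq_len(k), cv_fold))
parallel_mse <- unlist(mclapply(seq_len(k), cv_fold, mc.cores = n_cores))
```

Because each fold is independent and the fit itself is deterministic, both versions should return the same five errors; only the wall-clock time can differ.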
To give this a real-world setting, data was taken from the UCI machine learning repository: the Physicochemical Properties of Protein Tertiary Structure Data Set, which has 45730 rows and 9 variables. A bit of plotting shows this figure for 2000 randomly selected rows. It seems the problem is not so much which variables to use but rather their interactions. This was also suggested by the poor performance of linear regression.
Random forest in parallel

Even though nine variables is a bit low for random forest, I elected to use it as the first technique. The main parameters to tune are nodesize and the number of variables to try (mtry). Hence I wrapped this in mclapply(), not even using cross validation, and taking care not to nest the mclapply() calls. The result was heavy memory usage, which in hindsight may be obvious: each of the instances gets a complete data set. The net effect was that I ran out of RAM and data was swapped to disk, which cannot be good for performance. It may also explain comments I have read that the caret package uses too much memory. A decent set of hardware for machine learning, including a four-core processor, would create four instances of the same data. Perhaps adding another 4 GB of memory and an SSD rather than an HDD would serve me just as well as a new laptop...
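One rough way to reason about that memory pressure is to measure a single copy of the data and then budget per worker. The sketch below is an illustration under stated assumptions, not a measurement from this laptop: the stand-in data frame only mimics the shape described above (a tenth column for the response is an assumption), and budget_mb, model_mb and n_workers are hypothetical names with made-up values.

```r
library(parallel)

# Stand-in data with the shape described above: 45730 rows, 9 predictors
# plus one assumed response column. Values are random noise.
set.seed(1)
dat <- as.data.frame(matrix(rnorm(45730 * 10), ncol = 10))

# Size of one copy of the data, in megabytes.
data_mb <- as.numeric(object.size(dat)) / 1024^2

# Hypothetical numbers: total memory to spend on all workers, and a guess
# at what each worker needs for its fitted model on top of the data.
budget_mb <- 2048
model_mb  <- 500  # assumed; measure with object.size() on a real fit

# detectCores() may return NA on some platforms; treat that as one core.
cores <- detectCores()
if (is.na(cores)) cores <- 1L

# Cap the number of workers by both the core count and the memory budget.
per_worker_mb <- data_mb + model_mb
n_workers <- max(1, min(cores, floor(budget_mb / per_worker_mb)))
```

The resulting n_workers can be passed as mc.cores to mclapply(). Replacing the guessed model_mb with the measured size of a real fitted forest would show whether the data or the models held by each worker take the larger share.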
tol <- expand.grid(mtry=1:3,
bomen <- mclapply(seq(1:nrow(tol)),function(i)