- 1. Introduction
- 2. The parallel R taxonomy
- 3. lapply-based parallelism
- 4. foreach-based parallelism
- 5. Caveats with lapply- and foreach-based parallelism
- 6. Alternative forms of parallelism
- 7. Map-Reduce-based parallelism with Hadoop
1. Introduction
This tutorial goes through various parallel libraries available to R programmers by applying them all to solve a very simple parallel problem: k-means clustering. Although trivially parallel, k-means clustering is conceptually simple enough for people of all backgrounds to understand, yet it can illustrate most of the core concepts common to all parallel R scripts.
Algorithmically, k-means clustering involves arriving at some solution (a local minimum) by iteratively approaching it from a randomly selected starting position. The more random starts we attempt, the more local minima we find. For example, the following diagram shows some random data (top left) and the result of applying k-means clustering from three different random starting guesses:
We can then calculate some value (I think of it as an energy function) that represents the error of each of these local minima. Finding the smallest error (the lowest "energy") across all of the starting positions (and their resulting local minima) gives you the "best" overall solution (the global minimum). However, finding this global minimum is an NP-hard problem, meaning there is no known way to guarantee you have truly found the absolute best answer without an impractical amount of computation. Thus, we rely on increasing the number of random starts to get as close as we can to this one true global minimum.
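This "energy" is exposed directly by base R's kmeans() as the tot.withinss field (the total within-cluster sum of squares), so the idea of comparing local minima can be sketched by running several single-start clusterings and keeping the lowest-energy one. The data below is randomly generated just to make the sketch self-contained:

```r
# Sketch: run k-means from several random starts and compare the
# "energy" (tot.withinss) of each resulting local minimum.
set.seed(1)
data <- matrix(rnorm(200), ncol = 2)   # toy data, not the tutorial's dataset

# Three independent single-start runs, each converging to a local minimum
runs <- lapply(1:3, function(i) kmeans(data, centers = 4, nstart = 1))
energies <- sapply(runs, function(r) r$tot.withinss)

# The "best" solution is the local minimum with the lowest energy
best <- runs[[which.min(energies)]]
print(energies)
```

Passing nstart=100 to kmeans() performs exactly this loop internally, returning only the lowest-energy result.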
The simplest example of a k-means calculation in R looks like

```r
data <- read.csv('dataset.csv')
result <- kmeans(data, centers=4, nstart=100)
print(result)
```
This code tries to find four cluster centers using 100 starting positions, and the value of result is the k-means object with the minimal result$tot.withinss value across all 100 starts. We'll now look at a few different ways to parallelize this calculation. All of the example code presented here can be found in my Parallel R GitHub repository.
This guide is adapted from a talk I give, and it assumes that you already know how to run R jobs on parallel computing systems. I wrote a separate guide, Running R on HPC Clusters, that covers the basics of actually running these example codes.
2. The parallel R taxonomy
There are a number of different ways to use parallelism to speed up a given R script. I like to think of them as generally falling into a few broad categories of parallel R techniques:
- lapply-based parallelism
- foreach-based parallelism
- Poor-man's parallelism and hands-off parallelism
- Map-Reduce-based parallelism
Although an increasing number of libraries on CRAN provide additional means of adding parallelism that I have not included in this taxonomy, they generally fall into (or close to) one of the above categories.
To illustrate how these forms of parallelism can be used in practice, the remainder of this guide will demonstrate how a solution to the aforementioned k-means clustering problem can be found using these parallel methods.
To begin, the most straightforward form of parallelism for R programmers is lapply-based parallelism, which is covered in the next section.
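As a preview of what that looks like, the 100 random starts can be split across workers with the parallel package's mclapply() and the lowest-energy result kept at the end. This is only a sketch: the dataset is simulated here for self-containment (standing in for read.csv('dataset.csv')), and mclapply() relies on forked processes, so it runs serially on Windows:

```r
library(parallel)

set.seed(2)
data <- matrix(rnorm(200), ncol = 2)   # stand-in for the tutorial's CSV

# Split 100 random starts into four batches of 25, one per worker,
# then keep the local minimum with the lowest tot.withinss overall.
results <- mclapply(rep(25, 4),
                    function(n) kmeans(data, centers = 4, nstart = n),
                    mc.cores = min(4, detectCores()))
best <- results[[which.min(sapply(results, function(r) r$tot.withinss))]]
print(best$tot.withinss)
```

Note that reproducibility across forked workers requires extra care with random number streams (e.g., RNGkind("L'Ecuyer-CMRG")); the lapply-based parallelism section explores this style of decomposition in more detail.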