Data-intensive computing has become an area of intense interest in high-performance computing because the rate at which data is produced (and stored) has begun to outstrip our capacity to analyze it. Such problems generally cannot be treated with the same means used to tackle “traditional” problems in computational science (e.g., molecular dynamics, fluid dynamics, and finite element methods), so a set of tools is emerging to fill this gap.
These tools, while not necessarily new, are new to the field of high-performance computing, and conversely, many of the concepts common to high-performance computing are new to the statisticians and computer scientists who have traditionally used these data-oriented tools. I’ve been developing training material to bridge these two fields. Much of this material is included in a tutorial/talk I give on using parallel R and Hadoop on Gordon, and I am in the process of converting those slides into written tutorials here.
These tutorials are all designed to be driven by examples, and the code samples used to illustrate each point should be available via links provided on each tutorial page.
Topics in Hadoop
- Conceptual Overview of Map/Reduce and Hadoop
- Writing Hadoop Applications in Python with Hadoop Streaming
- Parsing VCF Files with Hadoop Streaming
- Running Hadoop on HPC Clusters
- Slides: Hadoop Streaming: Programming Hadoop without Java
- Slides: Introduction to Spark
Topics in R
Topics in Storage
In addition to demonstrating the tools that aid in data-driven analysis, I’ve also started documenting some key storage technologies: