Experimenting with Distributed Data Processing in Haskell
YOW! Lambda Jam 2019
Apache Spark is one of the most popular data processing frameworks in the world, and is widely used in the enterprise. Its popularity is due in no small part to its adoption of the functional paradigm: it demonstrates how purity, higher-order functions and laziness simplify the processing of large datasets. Haskell excels at all of those things, so it is only natural to think that Haskell would be a good fit for distributed data processing. Tweag.io's Sparkle and Soostone's Hadron are two examples of this idea in the Haskell ecosystem.
'distributed-dataset' is a framework written in Haskell, designed to process large amounts of data efficiently. With the StaticPointers extension of GHC Haskell, we are able to distribute a computation across different machines, and using the technique described by Zaharia et al. in the Resilient Distributed Datasets paper that led to Apache Spark, we can express and execute large-scale data transformations using a pretty DSL.
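As a rough illustration of the StaticPointers idea, here is a minimal sketch using only GHC.StaticPtr from base. It is not the API of 'distributed-dataset'; the names square, jobKey and runRemote are hypothetical, and the "remote" side is simulated inside the same process. The point is that a function is never serialized: only a small key is sent, and the receiving side, running the same binary, looks the function up by that key.

```haskell
{-# LANGUAGE StaticPointers #-}
module Main where

import GHC.StaticPtr
  ( StaticKey
  , deRefStaticPtr
  , staticKey
  , unsafeLookupStaticPtr
  )

-- Hypothetical per-element job. It is a closed top-level definition,
-- so it may appear under the `static` keyword.
square :: Int -> Int
square x = x * x

-- The "wire format" of the job: a fingerprint identifying `square`
-- in the static pointer table of this binary.
jobKey :: StaticKey
jobKey = staticKey (static square)

-- Simulated executor (assumption: it runs the same binary as the driver).
-- It resolves the key back to a function and applies it to its partition.
-- The lookup is unsafe because the caller must pick the right type.
runRemote :: StaticKey -> [Int] -> IO (Maybe [Int])
runRemote key partition = do
  mptr <- unsafeLookupStaticPtr key
  pure (fmap (\ptr -> map (deRefStaticPtr ptr) partition) mptr)

main :: IO ()
main = runRemote jobKey [1, 2, 3] >>= print
```

In a real system the key and the partition would travel over the network to another machine; the sketch keeps both sides in one process so that it stays self-contained.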
In this talk, I am going to give a brief introduction to the library, and then move on to explaining the key implementation ideas and the advantages that Haskell offers to distributed data processing.
Reference: "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" by Matei Zaharia et al.