Realizing the Promise of Portable Data Processing with Apache Beam
YOW! Data 2017
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can hold the big data ecosystem together; it enables users to "run anything anywhere".
This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam's architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations, which enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on premises), and give a glimpse of some of the challenges Beam aims to address in the future.
Sr. Software Engineer
I serve as chair of the Apache Beam Project Management Committee and have been regularly committing code to the project since its inception. I work as a Senior Software Engineer at Google.
Before Beam, I worked on its predecessor, Google Cloud Dataflow, from its beginnings, most recently leading the development of the Dataflow SDK for Java.