Validating Big Data & ML Pipelines: Avoiding Burning the World Down on Accident w/ Apache Spark, Beam, Flink & friends
YOW! 2019 Sydney
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs results to production), and it's important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.
Figuring out when things have gone terribly wrong is trickier than it first appears since we want to catch the errors before our users notice them (or failing that before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist us in writing relative validation rules based on historical data.
For folks working in streaming, we will talk about the unique challenges of attempting to validate in a real-time system, and what we can do besides keeping an up-to-date resume on file for when things go wrong. To keep the talk interesting real-world examples (with company names removed) will be presented, as well as several creative-common licensed cat pictures and an adorable panda GIF.
Holden is a transgender Canadian open source developer advocate with a focus on Apache Spark, BEAM, and related "big data" tools. She is the co-author of Learning Spark, High Performance Spark, and another Spark book that's a bit more out of date. She is a committer on and PMC member on Apache Spark and committer on SystemML & Mahout projects. Prior to joining Apple she worked at Google, IBM, Alpine, Databricks, Google (yes again), Foursquare, and Amazon. When not in San Francisco, Holden speaks internationally about different big data technologies (mostly Spark). She was tricked into the world of big data while trying to improve search and recommendation systems and has long since forgotten her original goal. Outside of work she enjoys playing with fire, riding scooters, and dancing.