Stream All the Things!!
YOW! Data 2018
Streaming data architectures aren't just "faster" Big Data architectures. They must be reliable and scalable as never before, more like microservice architectures.
This talk has three goals:
- Justify the transition from batch-oriented big data to stream-oriented fast data.
- Explain the requirements that streaming architectures must meet and the tools and techniques used to meet them.
- Discuss the ways that fast data and microservice architectures are converging.
Big data started with an emphasis on batch-oriented architectures, where data is captured in large, scalable stores, then processed using batch jobs. To reduce the gap between data arrival and information extraction, these architectures are now evolving to be stream oriented, where data is processed as it arrives. Fast data is the new buzzword.
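The difference between the two styles can be sketched in a few lines of Python. This is only an illustrative toy, not taken from the talk: the function names `batch_mean` and `streaming_mean` and the sample `readings` are assumptions made up for the example, and a real system would use a tool such as Spark, Flink, Akka Streams, or Kafka rather than an in-memory list.

```python
from typing import Iterable, Iterator, List


def batch_mean(stored: List[float]) -> float:
    """Batch style: the full data set is captured first, then processed once."""
    return sum(stored) / len(stored)


def streaming_mean(arrivals: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated mean as each record arrives."""
    count = 0
    total = 0.0
    for value in arrivals:
        count += 1
        total += value
        # A result is available immediately, without waiting for the rest.
        yield total / count


# Hypothetical sample data, standing in for records arriving over time.
readings = [4.0, 8.0, 6.0]

batch_result = batch_mean(readings)                     # one answer, after all data is stored
stream_results = list(streaming_mean(iter(readings)))   # one answer per arriving record
```

The batch function cannot produce anything until all the data is in hand, while the streaming version keeps only a small running state and yields information after every record, which is the reduced gap between arrival and extraction described above.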
These architectures introduce new challenges for developers. Whereas a batch job might run for hours, a stream processing system typically runs for weeks or months, which raises the bar for reliability and scalability: the system must handle any contingency that arises while it runs.
The microservice world has faced this challenge for a while. Microservices are inherently message driven, responding to requests for service and, in turn, sending messages to other microservices. Hence, they are also stream oriented, in the sense that they must respond reliably to never-ending input, so they offer guidance for how to build reliable streaming data systems. We'll also discuss how these architectures are merging in other ways.
We'll also discuss how to pick streaming technologies based on four axes of concern:
- Low latency: What's my time budget for handling this data?
- High volume: How much data per unit time must I handle?
- Data processing: Do I need machine learning, SQL queries, conventional ETL processing, etc.?
- Integration with other tools: Which ones and how is data exchanged between them?
We'll consider specific examples of streaming tools and how they fit on these axes, including Spark, Flink, Akka Streams, and Kafka.
Head of Developer Relations
Dean Wampler is an expert in streaming data systems, focusing on applications of ML/AI. He is head of evangelism at Anyscale.io, which is focused on distributed Python for ML/AI. Previously, he was an engineering VP at Lightbend, where he led the development of Lightbend CloudFlow, an integrated system for building and running streaming data applications with popular open source tools. Dean is the author or co-author of three O’Reilly books on Scala, Functional Programming, and Hive. He contributes to several open source projects and he co-organizes and speaks at many technology conferences and Chicago-based user groups. Dean has a Ph.D. in Physics from the University of Washington.