Dean Wampler

Software Developer, Expert in Big Data, Scala, and Functional Programming. Mastering Deep Learning.

United States

Dean Wampler, Ph.D., is a member of the Office of the CTO and the Architect for Big Data Products and Services at Typesafe. He uses Scala and Functional Programming to build Big Data systems with Spark, Mesos, Hadoop, the Typesafe Reactive Platform, and other tools. Dean is the author or co-author of three O’Reilly books on Scala, Functional Programming, and Hive. He contributes to several open source projects (including Spark) and co-organizes and speaks at many technology conferences and Chicago-based user groups.

Talks at YOW!

Streaming Data with Kafka and Microservices - YOW! 2018 Melbourne

When we think of modern data processing, we often think of batch-oriented ecosystems like Hadoop, including processing engines like Spark. However, the sooner we can extract useful information from our data, the better, which is driving an evolution towards stream processing or “fast data”. Many of the legacy tools, including Spark, provide various levels of support for stream processing, but deeper architectural changes are emerging.

Workshop - Streaming Data with Kafka and Microservices - YOW! 2018 Melbourne

When we think of modern data processing, we often think of batch-oriented ecosystems like Hadoop, including processing engines like Spark. However, the sooner we can extract useful information from our data, the better, which is driving an evolution towards stream processing or “fast data”. Many of the legacy tools, including Spark, provide various levels of support for stream processing, but deeper architectural changes are emerging.

Then we’ll work through code examples that use Akka Streams and Kafka Streams with Kafka to implement a machine-learning example, in which the model is updated periodically to simulate the problem of retraining and serving ML models in a streaming context. In particular, if you periodically retrain the model using a separate tool chain, say once a day, how do you incorporate the updated model into a running pipeline for scoring without restarting the pipeline?
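
To make the pattern concrete, here is a minimal sketch of one way to solve it with Akka Streams (not the workshop's actual code): model updates arrive as just another stream, merged with the data stream, and a stateful stage swaps in each new model as it arrives. The event types, the tick-driven stand-in sources, and the trivial linear "model" are all invented for illustration, and the API shown is the Akka 2.5-era one.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.duration._

object ModelServingSketch extends App {
  implicit val system: ActorSystem = ActorSystem("model-serving")
  implicit val materializer: ActorMaterializer = ActorMaterializer() // Akka 2.5-era API

  // Two kinds of events flow through one stream: records to score, and
  // occasional model updates produced by the (separate) retraining tool chain.
  sealed trait Event
  final case class Record(features: Vector[Double]) extends Event
  final case class ModelUpdate(weights: Vector[Double]) extends Event

  // Stand-ins for the two Kafka topics: a fast stream of records to score
  // and a slow stream of freshly retrained models.
  val records: Source[Event, _] =
    Source.tick(0.seconds, 100.millis, Record(Vector(1.0, 2.0)))
  val models: Source[Event, _] =
    Source.tick(0.seconds, 5.seconds, ModelUpdate(Vector(0.5, 0.5)))

  // statefulMapConcat holds the "current" model as local state: a
  // ModelUpdate swaps it in and emits nothing; a Record is scored against
  // whatever model is current. The pipeline itself never restarts.
  val scorer = Flow[Event].statefulMapConcat { () =>
    var weights = Vector(0.0, 0.0)
    (event: Event) => event match {
      case ModelUpdate(w) => weights = w; Nil
      case Record(fs)     => List(fs.zip(weights).map { case (f, w) => f * w }.sum)
    }
  }

  records.merge(models).via(scorer).runWith(Sink.foreach(s => println(f"score = $s%.2f")))
}
```

The key design point is that the scoring stage treats a model update as an ordinary stream element, so no redeployment or restart is ever needed.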

Streaming Data with Kafka and Microservices - YOW! 2018 Brisbane

Streaming Data with Kafka and Microservices - YOW! 2018 Sydney

Workshop - Streaming Data with Kafka and Microservices - YOW! 2018 Sydney

Stream All the Things!! - YOW! Data 2018

Streaming data architectures aren't just "faster" Big Data architectures. They must be reliable and scalable as never before, more like microservice architectures.

This talk has three goals:

  1. Justify the transition from batch-oriented big data to stream-oriented fast data.
  2. Explain the requirements that streaming architectures must meet and the tools and techniques used to meet them.
  3. Discuss the ways that fast data and microservice architectures are converging.

Big data started with an emphasis on batch-oriented architectures, where data is captured in large, scalable stores, then processed using batch jobs. To reduce the gap between data arrival and information extraction, these architectures are now evolving to be stream oriented, where data is processed as it arrives. Fast data is the new buzzword.

These architectures introduce new challenges for developers. Whereas a batch job might run for hours, a stream processing system typically runs for weeks or months, which raises the bar for making these systems reliable and scalable to handle any contingency.

The microservice world has faced this challenge for a while. Microservices are inherently message driven, responding to requests for service and sending messages to other microservices, in turn. Hence, they are also stream oriented, in the sense that they must respond reliably to never-ending input. So, they offer guidance for how to build reliable streaming data systems. I'll discuss how these architectures are merging in other ways, too.

We'll also discuss how to pick streaming technologies based on four axes of concern:

  • Low latency: What's my time budget for handling this data?
  • High volume: How much data per unit time must I handle?
  • Data processing: Do I need machine learning, SQL queries, conventional ETL processing, etc.?
  • Integration with other tools: Which ones and how is data exchanged between them?

We'll consider specific examples of streaming tools and how they fit on these axes, including Spark, Flink, Akka Streams, and Kafka.

Hands-on Kafka Streaming Microservices with Akka Streams and Kafka Streams - YOW! Data 2018

If you're building streaming data apps, your first inclination might be to reach for Spark Streaming, Flink, Apex, or similar tools, which run as services to which you submit jobs for execution. But sometimes, writing conventional microservices, with embedded stream processing, is a better fit for your needs.

In this hands-on tutorial, we start with the premise that Kafka is the ideal backplane for reliable capture and organization of data streams for downstream consumption. Then, we build several applications using Akka Streams and Kafka Streams on top of Kafka. The goal is to understand the relative strengths and weaknesses of these toolkits for building Kafka-based streaming applications. We'll also compare and contrast them to systems like Spark Streaming and Flink, to understand when those tools are better choices.

Briefly, Akka Streams and Kafka Streams are best for data-centric microservices, where maximum flexibility is required for running the applications and interoperating with other systems, while systems like Spark Streaming and Flink are best for richer analytics over large streams, where horizontal scalability through "automatic" partitioning of the data is required.

Each engine has particular strengths that we'll demonstrate:

  • Kafka Streams is purpose-built for reading data from Kafka topics, processing it, and writing the results to new topics. With powerful stream and table abstractions and an "exactly-once" capability, it supports a variety of common scenarios involving transformation, filtering, and aggregation (see the sketch after this list).
  • Akka Streams emerged as a dataflow-centric abstraction over the Akka Actors model, designed for general-purpose microservices, especially when per-event low latency is important, such as for complex event processing, where each event requires individual handling. In contrast, many other systems are efficient at scale, where the overhead is amortized over sets of records processed "in bulk". Also, because of its general-purpose nature, Akka Streams supports a wider class of application problems and third-party integrations, but it's less focused on Kafka-specific capabilities.
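
As a taste of the first of these, here is a minimal Kafka Streams sketch using its Scala DSL (circa Kafka 2.0), not taken from the tutorial materials: it reads a topic as a stream, aggregates it into a table of running counts, and streams the table's changes back to Kafka. The topic names and serde choices are assumptions for illustration.

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._

object WordCountSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "word-count-sketch")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder
  // Read each line from the input topic, split it into words, and keep a
  // running count per word as a table; stream the table's changes back out.
  builder.stream[String, String]("lines")
    .flatMapValues(_.toLowerCase.split("""\W+"""))
    .groupBy((_, word) => word)
    .count()
    .toStream
    .to("word-counts")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()
  sys.addShutdownHook(streams.close())
}
```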

Kafka Streams and Akka Streams are both libraries that you integrate into your microservices, which means you must manage their lifecycles yourself, but you also get lots of flexibility to do this as you see fit.

In contrast, Spark Streaming and Flink run their own services. You write "jobs" or use interactive shells that tell these services what computations to do over data sources and where to send results. Spark and Flink then determine what processes to run in your cluster to implement the dataflows. Hence, there is less of a DevOps burden to bear, but also less flexibility when you might need it. Both systems are also more focused on data analytics problems, with various levels of support for SQL over streams, machine learning model training and scoring, etc.
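
For contrast, here is a minimal sketch of the job-submission style using Spark's Structured Streaming API (again illustrative, not the tutorial's code): you describe the dataflow, hand it to the Spark service, and the service decides which processes in the cluster execute it. The socket source and word-count logic are the stock introductory example, not anything specific to this tutorial.

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // A "job" you submit to a running Spark service (e.g., with spark-submit);
    // Spark determines which cluster processes implement the dataflow.
    val spark = SparkSession.builder.appName("streaming-word-count").getOrCreate()
    import spark.implicits._

    // Read a stream of lines from a socket (a toy source for demos).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Split lines into words and maintain running counts per word.
    val counts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // The service maintains the running counts across micro-batches.
    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```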

For the tutorial, you'll be given an execution environment and the code examples in a GitHub repo. We'll experiment with the examples together, interspersed with short presentations, to understand their strengths, weaknesses, performance characteristics, and lifecycle management requirements.
