Apache Spark for Machine Learning on Large Data Sets

YOW! Data 2017

Apache Spark is a general-purpose framework for distributed data processing. MLlib, Spark's machine learning library, makes it straightforward to fit a model to a very large data set. Spark's general-purpose functionality likewise makes it possible to apply an already-trained model across a large collection of observations. This talk walks through two examples: fitting a model to a big data set using MLlib, and applying a trained scikit-learn model to a large data set.
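
As a rough illustration of the second example (not taken from the talk itself), one common pattern is to broadcast the trained model to the executors and score each partition in batches. The sketch below shows only the per-partition function; `predict_partition`, `ThresholdModel`, and `batch_size` are hypothetical names, and the stand-in model merely mimics scikit-learn's `predict(X)` interface so the snippet runs without Spark or scikit-learn installed.

```python
from itertools import islice


def predict_partition(rows, model, batch_size=1000):
    """Yield (row_id, prediction) pairs for one partition.

    `rows` is an iterator of (row_id, features) tuples, which is the shape
    a function passed to RDD.mapPartitions would receive (assumed layout).
    """
    rows = iter(rows)
    while True:
        # Pull a bounded batch so a large partition is never fully in memory.
        batch = list(islice(rows, batch_size))
        if not batch:
            return
        ids = [row_id for row_id, _ in batch]
        features = [feats for _, feats in batch]
        # One predict call per batch amortizes the per-call overhead.
        for row_id, pred in zip(ids, model.predict(features)):
            yield row_id, pred


class ThresholdModel:
    """Hypothetical stand-in for a trained model with a predict(X) method."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, X):
        # Label a row 1 when the sum of its features exceeds the threshold.
        return [1 if sum(x) > self.threshold else 0 for x in X]


# With Spark, the same function might be wired up like this
# (assumes an existing SparkContext `sc`, an RDD `rdd` of
# (row_id, features) tuples, and a fitted model `trained_model`):
#
#   model_bc = sc.broadcast(trained_model)
#   predictions = rdd.mapPartitions(
#       lambda rows: predict_partition(rows, model_bc.value))
```

Broadcasting sends the model to each executor once rather than with every task, and batching within a partition avoids calling `predict` one row at a time.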

Juliet Hougland

Data Vagabond

Bagged & Boosted

United States

Data scientist and engineer with expertise in computational mathematics and years of hands-on machine learning and big data experience. Regular keynote and invited speaker worldwide. For many years, Juliet has contributed to the open-source community, working on projects such as Apache Spark, Scalding, and Kiji.