Entity Resolution at Scale

YOW! Data 2019

Real-world data is rarely clean: there are often corrupted and duplicate records, and even corrupted records that are duplicates! One step in data cleaning is entity resolution: connecting all of the duplicate records to the single underlying entity that they represent.

This talk will describe how we approach entity resolution, and look at some of the challenges, solutions and lessons learnt when doing entity resolution on top of Apache Spark, and scaling it to process billions of records.
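As a rough illustration of what "connecting duplicate records into a single entity" can mean, here is a minimal single-machine sketch. It is not the approach presented in the talk and does not use Spark. The record fields (`id`, `name`, `email`), the matching rule (identical normalised name or email) and the function names are all assumptions made for this example. Records that share a key are linked, and each connected group of linked records is treated as one entity.

```python
from collections import defaultdict


def find(parent, x):
    # Follow parent pointers to the root, shortening the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    # Merge the groups containing a and b.
    root_a, root_b = find(parent, a), find(parent, b)
    if root_a != root_b:
        parent[root_b] = root_a


def resolve_entities(records):
    """Group record ids into entities; records sharing a key are linked."""
    parent = {r["id"]: r["id"] for r in records}
    first_seen = {}  # matching key -> id of the first record with that key
    for r in records:
        # Hypothetical matching rule: normalised name or normalised email.
        keys = [
            ("name", " ".join(r["name"].lower().split())),
            ("email", r["email"].strip().lower()),
        ]
        for key in keys:
            if key[1] == "":
                continue  # a missing field is not evidence of a match
            if key in first_seen:
                union(parent, first_seen[key], r["id"])
            else:
                first_seen[key] = r["id"]
    groups = defaultdict(list)
    for r in records:
        groups[find(parent, r["id"])].append(r["id"])
    return sorted(sorted(g) for g in groups.values())


records = [
    {"id": 1, "name": "Ada Lovelace", "email": "ada@example.com"},
    {"id": 2, "name": "ada  lovelace", "email": ""},
    {"id": 3, "name": "A. Lovelace", "email": "ADA@example.com "},
    {"id": 4, "name": "Alan Turing", "email": "alan@example.com"},
]
entities = resolve_entities(records)
print(entities)  # [[1, 2, 3], [4]]
```

Records 1 and 2 are linked by name, and records 1 and 3 by email, so all three end up in one entity even though records 2 and 3 share no key directly. Exact-key matching like this is only a toy rule; handling corrupted values and billions of records is the harder problem the talk addresses.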

Huon Wilson

Sr. Software Engineer

CSIRO's Data61


Huon is a software engineer at CSIRO's Data61, working on scalable graph analytics and deep learning as part of the StellarGraph project. The project builds on technologies like TensorFlow, Spark, HBase and Elasticsearch. He previously worked on the Swift compiler at Apple, and before that was deeply involved in the Rust programming language, both as a volunteer and while working at Mozilla.