Respecting privacy with synthetically generated "look-alike" data sets
YOW! Data 2018
Safely handling data that contains sensitive or private information about people is a multi-million dollar problem at many companies. It adds time to the data engineering process, can cost a lot in software licenses for specialised tools, and carries a range of reputational and legal risks.
Recent advances in deep learning have suggested an interesting way to attack this problem. By fitting a certain class of model to a source data set that contains sensitive information, we can produce a generator that outputs a supply of synthetic "look-alike" data. This output data preserves many of the statistical relationships between fields found in the source, while offering mathematical guarantees around the identifiability of individuals in the source data set.
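As a rough illustration of the idea (not the specific model class the talk covers), the sketch below fits a simple generative model, a Gaussian mixture from scikit-learn, to a toy two-field "source" data set and samples synthetic look-alike rows from it. The field names and data are invented for illustration; note that a plain generative model like this does not by itself provide the privacy guarantees mentioned above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Toy "source" data: two correlated fields (hypothetical age and income)
age = rng.normal(45, 12, size=1000)
income = 1500 * age + rng.normal(0, 8000, size=1000)
source = np.column_stack([age, income])

# Fit a generative model to the source data...
gm = GaussianMixture(n_components=5, random_state=0).fit(source)

# ...then draw a synthetic look-alike data set from the fitted generator
synthetic, _ = gm.sample(1000)

# The statistical relationship between the fields is approximately preserved
print(np.corrcoef(source.T)[0, 1])
print(np.corrcoef(synthetic.T)[0, 1])
```

The synthetic rows can then be handed to downstream data engineering work in place of the real records; how close the field-level correlations stay to the source is one simple check of look-alike quality.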
This talk will provide an overview of the approach and show how it can reduce both data engineering effort and risk.