Applying Dynamic Embeddings in Natural Language Processing to Track the Evolution of Tech Skills
YOW! Data 2020
Many data scientists are familiar with word embedding models such as word2vec, which capture the semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context or over time. Moreover, word embedding models need either large amounts of data or tuning through transfer learning on a domain-specific vocabulary, and most commercial applications have a vocabulary of their own.
In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used to conduct advanced topic modeling on medium-sized datasets that are specialized enough to require significant modifications of a word2vec model and that contain more general data types (including categorical, count and continuous data). I will discuss how my team implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both categorical and natural language data associated with jobs, we charted the development of different skill sets over the last few years. I will compare data science skill sets in US and Australian jobs, focusing on how tech and data science skill sets have developed, grown and cross-pollinated other types of jobs over time.
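For readers new to the idea, here is a minimal NumPy sketch of the two ingredients a dynamic embedding model typically combines: a Bernoulli conditional for a target word given its context, and a Gaussian random-walk penalty that ties each time slice's embeddings to the previous slice. It is an illustration under stated assumptions, not the TensorFlow implementation discussed in the talk. The function names, array shapes and the precision parameter `lam` are all hypothetical, and negative sampling and optimization are left out.

```python
import numpy as np

def bernoulli_log_prob(rho_t, alpha, target, context):
    """Log-probability that `target` occurs given its context words.

    rho_t:   (V, K) embedding vectors for one time slice (hypothetical shape)
    alpha:   (V, K) context vectors, assumed shared across time slices
    target:  index of the target word
    context: list of indices of the surrounding words
    """
    context_sum = alpha[context].sum(axis=0)   # sum of context vectors, shape (K,)
    eta = rho_t[target] @ context_sum          # natural parameter of the Bernoulli
    return -np.logaddexp(0.0, -eta)            # numerically stable log sigmoid(eta)

def drift_penalty(rho, lam):
    """Gaussian random-walk log-prior linking consecutive time slices.

    rho: (T, V, K) embeddings, one (V, K) matrix per time slice
    lam: precision of the random walk (larger values mean smoother drift)
    """
    diffs = rho[1:] - rho[:-1]                 # change between adjacent slices
    return -0.5 * lam * np.sum(diffs ** 2)

# Toy data: 3 time slices, 5 vocabulary items, 4 embedding dimensions.
rng = np.random.default_rng(0)
T, V, K = 3, 5, 4
rho = rng.normal(scale=0.1, size=(T, V, K))
alpha = rng.normal(scale=0.1, size=(V, K))

lp = bernoulli_log_prob(rho[0], alpha, target=2, context=[0, 1, 3])
pen = drift_penalty(rho, lam=1.0)
print(lp, pen)
```

A training objective would sum the log-probabilities over all observed (and negatively sampled) positions in every time slice and add the drift penalty, so that a skill's embedding can move from year to year but only gradually.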
Maryam runs research at TapRecruit, a startup that is building software tools to implement evidence-based talent management. TapRecruit’s research program integrates recent advances in NLP, data science and decision science to identify robust methods to reduce bias in talent decision-making and attract more qualified and diverse candidate pools. In a past life, Maryam was a cancer biologist and a data journalist. She holds a PhD from the Icahn School of Medicine at Mount Sinai.