Conference Program

All times displayed are in the Australia/Sydney timezone

8:45 AM - 15 mins

Session Overviews and Introductions

9:00 AM - 45 mins

Grand Ball Room 1

Applying Dynamic Embeddings in Natural Language Processing to track the Evolution of Tech Skills

Maryam Jahanshahi

Many data scientists are familiar with word embedding models such as word2vec, which capture the semantic similarity of words in a large corpus. However, word embeddings are limited in their ability to interrogate a corpus alongside other context, or over time. Moreover, word embedding models either need significant amounts of data or must be tuned through transfer learning to handle the domain-specific vocabularies that are unique to most commercial applications.

In this talk, I will introduce exponential family embeddings. Developed by Rudolph and Blei, these methods extend the idea of word embeddings to other types of high-dimensional data. I will demonstrate how they can be used for advanced topic modeling on medium-sized datasets that are specialized enough to require significant modification of a word2vec model, and that contain more general data types (categorical, count, and continuous). I will discuss how my team implemented a dynamic embedding model using TensorFlow and our proprietary corpus of job descriptions. Using both the categorical and the natural language data associated with jobs, we charted the development of different skill sets over the last few years. I will compare data science skill sets in US jobs with Australian roles, focusing on how tech and data science skill sets have developed, grown and cross-pollinated other types of jobs over time.
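
The core idea of tracking skill evolution can be sketched without a full exponential family embedding model: obtain one embedding table per time slice (for example, per-year word2vec models trained on job descriptions) and compare a skill's nearest neighbours across slices. A minimal, purely illustrative sketch with hypothetical toy vectors (the skill names and numbers are invented for the example):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical embeddings of a few skills in two yearly slices,
# e.g. taken from separate per-year models of job descriptions.
embedding_2016 = {"python": [0.9, 0.1, 0.0], "statistics": [0.8, 0.2, 0.1], "spark": [0.1, 0.9, 0.2]}
embedding_2020 = {"python": [0.5, 0.6, 0.3], "statistics": [0.8, 0.2, 0.1], "spark": [0.4, 0.7, 0.3]}

def nearest(skill, table):
    """Most similar other skill within one time slice."""
    return max((s for s in table if s != skill), key=lambda s: cosine(table[skill], table[s]))

# A skill whose neighbourhood shifts between slices has drifted semantically.
drift = 1 - cosine(embedding_2016["python"], embedding_2020["python"])
```

Note that real dynamic embeddings share statistical strength across time slices rather than training each slice independently, which is what makes them workable on medium-sized corpora.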

9:45 AM - 25 mins

Break / Q&A with Maryam Jahanshahi

10:10 AM - 30 mins

Grand Ball Room 1

The Data Literacy Revolution

Dr Eugene Dubossarsky

The popularity and ubiquity of data science, data analytics, AI and the trend towards digital transformation have led to massive, repeated failures in many businesses. Despite billions spent, hundreds of Ph.D.s hired, and much boasting in conference presentations, many enterprises are still struggling to leverage the value of these new technologies. The missing ingredient is the literacy of the rest of the organisation, particularly senior management.

This presentation will describe this new literacy: “data literacy”, its analogy with computer literacy, and the reasons why this skill set will soon be as essential to all professionals as computer literacy is today. It will address automation, the advent of decision making as the key managerial activity, and the resulting democratisation of AI and analytics, while still maintaining a class of data science and analytics experts. The presentation will address mindset as well as skill set, and the ways in which management engagement with data analytics must change to leverage its value.

10:40 AM - 25 mins

Break / Q&A with Dr Eugene Dubossarsky

11:05 AM - 30 mins

Grand Ball Room 1

Data Maturity Levels

Greg Roodt

At a startup, the main concern is typically survival. Advanced analysis techniques and machine learning are often a luxury, or even a distraction from the prime directive: don't die. However, as a startup grows, its data requirements evolve, and eventually the startup morphs into a larger company where data is a core competitive advantage that drives decision making and product features.

In this talk, I describe what this evolution looks like and provide a framework for evaluating the data maturity level a company may be at. This framework applies not only to a growing company but also to a team or department within an already established company.

11:35 AM - 25 mins

Break / Q&A with Greg Roodt

12:00 PM - 60 mins

Virtual Lunch Break

8:45 AM - 15 mins

Session Overviews and Introductions

9:00 AM - 45 mins

Grand Ball Room 1

Apache Pulsar: The Next Generation Messaging and Queuing System

Karthik Ramasamy

Apache Pulsar is a next-generation messaging and queuing system with unique design trade-offs driven by the need for scalability and durability. Its two-layered architecture, which separates message storage from serving, leads to an implementation that unifies the flexibility and high-level constructs of messaging, queuing and lightweight computing with the scalable properties of log storage systems. This allows Apache Pulsar to be dynamically scaled up or down without any downtime. Using Apache BookKeeper as the underlying data store, Pulsar guarantees data consistency and durability while maintaining strict SLAs for throughput and latency. Furthermore, Apache Pulsar integrates Pulsar Functions, a lambda-style framework for writing serverless functions that natively process data immediately upon arrival. This serverless stream processing approach is ideal for lightweight tasks like filtering, data routing and transformations.

In this talk, we will give an overview of Apache Pulsar and delve into its unique architecture for messaging, storage and serverless data processing. We will also describe how Apache Pulsar is deployed in real use cases and explain how end-to-end streaming applications are written with Pulsar.
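
The storage/serving split can be illustrated with a conceptual toy (this is not the Pulsar client API, just a sketch of the model): messages live in one durable, append-only log, and each named subscription is merely a cursor into that log, so pub/sub and replay fall out naturally.

```python
from collections import defaultdict

class ToyTopic:
    """Conceptual toy, not the Pulsar API: one topic backed by a shared
    append-only log, with independent per-subscription cursors. This
    mirrors how Pulsar layers message serving over BookKeeper's log storage."""
    def __init__(self):
        self.log = []                      # durable, append-only message log
        self.cursors = defaultdict(int)    # subscription name -> read position

    def send(self, msg):
        self.log.append(msg)

    def receive(self, subscription):
        pos = self.cursors[subscription]
        if pos >= len(self.log):
            return None                    # nothing new for this subscription
        self.cursors[subscription] += 1    # acknowledging advances the cursor
        return self.log[pos]

topic = ToyTopic()
for m in (b"a", b"b", b"c"):
    topic.send(m)

# Two independent subscriptions each see the full stream (pub/sub);
# consumers sharing one subscription would instead split it (queuing).
first = topic.receive("analytics")
again = topic.receive("audit")
```

Because cursors are independent of storage, a new subscription can start from any position without copying data, which is the property that lets Pulsar scale serving and storage separately.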

9:45 AM - 25 mins

Break / Q&A with Karthik Ramasamy

10:10 AM - 30 mins

Grand Ball Room 1

How to be a more impactful data analyst

Claire Carroll

As the sole analyst in a fast-growing Australian startup, I experienced the pain of the traditional analyst workflow — stuck on a hamster wheel of report requests, Excel worksheets that frequently broke, an ever-growing backlog, and numbers that never quite matched up.

This story is familiar to almost any analyst. In this talk, I’ll draw on my own experience as well as similar experiences from others in the industry to share how I broke out of this cycle. You’ll learn how you can “scale yourself” by applying software engineering best practices to your analytics code, and how to turn this knowledge into an impactful analytics career.
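
One of those software engineering practices can be sketched concretely: define each metric once, in version-controlled and tested code, instead of re-deriving it in every spreadsheet. The function and field names below are illustrative, not from the talk:

```python
def monthly_active_users(events):
    """One shared definition of a metric. `events` is a list of
    (user_id, month) tuples; duplicate events for a user in the
    same month are counted once."""
    by_month = {}
    for user_id, month in events:
        by_month.setdefault(month, set()).add(user_id)
    return {month: len(users) for month, users in by_month.items()}

# A tiny automated check -- the habit that stops "numbers that never
# quite match up": metric logic gets tests, review and version control.
sample = [("u1", "2020-05"), ("u2", "2020-05"), ("u1", "2020-06"), ("u1", "2020-05")]
mau = monthly_active_users(sample)
```

With the definition in one place, every report that needs the metric calls the same function, so the numbers agree by construction.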

10:40 AM - 25 mins

Break / Q&A with Claire Carroll

4:00 PM - 45 mins

Grand Ball Room 1

Stream Processing for Everyone with Continuous SQL Queries

Fabian Hueske

About four years ago, we started adding SQL support to Apache Flink, with the primary goal of making stream processing technology accessible to non-developers. An important design decision in achieving this goal was to provide the same syntax and semantics for continuous streaming queries as for traditional batch SQL queries. Today, Flink runs hundreds of business-critical streaming SQL queries at Alibaba, Criteo, DiDi, Huawei, Lyft, Uber, Yelp, and many other companies. Flink is, of course, not the only system providing a SQL interface for processing streaming data; several commercial and open source systems offer similar functionality. However, the syntax and semantics of the various streaming SQL offerings differ quite a lot.

In late 2018, members of the Apache Calcite, Beam, and Flink communities set out to write a paper describing their joint approach to streaming SQL. We submitted the paper "One SQL to Rule Them All – a Syntactically Idiomatic Approach to Management of Streams and Tables" to SIGMOD, the premier database research conference, and it was accepted. Our goal was to have our approach validated by the database research community and to trigger a wider discussion about streaming SQL semantics. Today, the SQL standards committee is discussing an extension of the standard to pin down the syntax and semantics of streaming SQL queries.

In my talk, I will briefly introduce the motivation for SQL queries on streams. I'll present the three-part extension proposal discussed in our paper, consisting of (1) time-varying relations as a foundation for classical tables as well as streaming data, (2) event time semantics, and (3) a limited set of optional keyword extensions to control the materialization of time-varying query results. Finally, I'll discuss how these concepts are implemented in Apache Flink and show some streaming SQL queries in action.
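
The event-time idea at the heart of point (2) can be sketched outside SQL: assign each record to a window by its event timestamp, not its arrival order, so late or out-of-order records still land in the right group. A minimal Python sketch of a tumbling-window count (roughly what a streaming `GROUP BY` over a time window computes):

```python
def tumbling_window_counts(events, width):
    """Group timestamped events into fixed-width event-time windows.
    `events` is a list of (event_time, key) pairs; times are in seconds.
    Returns {(window_start, key): count}."""
    counts = {}
    for event_time, key in events:
        window_start = (event_time // width) * width
        counts[(window_start, key)] = counts.get((window_start, key), 0) + 1
    return counts

# Events arrive out of order (3, 12, 7, 11), but grouping by event time
# still assigns each one to the correct 10-second window.
events = [(3, "click"), (12, "click"), (7, "click"), (11, "view")]
result = tumbling_window_counts(events, width=10)
```

The streaming-SQL question the paper addresses is when and how such a time-varying result may be materialized, which is what the optional keyword extensions in point (3) control.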

4:45 PM - 25 mins

Break / Q&A with Fabian Hueske

5:10 PM - 60 mins

Virtual Happy Hour

8:45 AM - 15 mins

Session Overviews and Introductions

9:00 AM - 45 mins

Grand Ball Room 1

Cluster-wide Scaling of Machine Learning with Ray

Dean Wampler

Popular ML techniques like reinforcement learning (RL) and hyperparameter optimization (HPO) require a variety of computational patterns for data processing, simulation (e.g., game engines), model search, training, serving, and other tasks. Few frameworks efficiently support all of these patterns, especially when scaling to clusters.

Ray is an open-source, distributed framework from U.C. Berkeley’s RISELab that easily scales applications from a laptop to a cluster. It was created to address the needs of reinforcement learning and hyperparameter tuning, in particular, but it is broadly applicable for almost any distributed Python-based application, with support for other languages forthcoming.

I'll explain the problems Ray solves and how Ray works. Then I'll discuss RLlib and Tune, the RL and HPO systems implemented with Ray. You'll learn when to use Ray versus alternatives, and how to adopt it for your projects.
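
The task pattern Ray provides (remote functions scheduled across a cluster, results gathered later) can be approximated on a single machine with the standard library. A hedged sketch of a toy hyperparameter grid search, where `evaluate` is a hypothetical stand-in objective; Ray's version would decorate it with `@ray.remote` and collect results with `ray.get`, and Tune would add search algorithms and early stopping on top:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(learning_rate):
    """Stand-in objective for one trial; a real trial would train a
    model and return its validation score."""
    return -(learning_rate - 0.1) ** 2

# Trials are independent, so they can run in parallel -- locally here,
# or across a whole cluster when expressed as Ray remote tasks.
grid = [0.01, 0.05, 0.1, 0.5]
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, grid))
best = grid[max(range(len(grid)), key=lambda i: scores[i])]
```

The point of Ray is that the code shape barely changes when the same trials move from one laptop to a cluster.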

9:45 AM - 25 mins

Break / Q&A with Dean Wampler

10:10 AM - 30 mins

Grand Ball Room 1

How COVID-19 has Accelerated the Journey to Data-driven Health Decisions

Dr. Denis Bauer

The speed with which COVID-19 has taken over the world has raised the demand for data-driven health decisions, and the shift towards virtual care may actually enable the necessary data collection. This session describes how CSIRO has leveraged cloud-native technologies to advance three areas of the COVID-19 response. First, we worked with GISAID, the largest data resource for the virus causing COVID-19, and use standard health terminologies (FHIR) to help collect clinical patient data. This feeds into a Docker-based workflow that creates identifying “fingerprints” of the virus for guiding vaccine development and investigating whether there are more pathogenic versions of the virus. Second, we developed a fully serverless web service for tailoring diagnostic efforts, capable of differentiating between strains. Third, we are creating a serverless COVID-19 analysis platform that allows distributed genomics and patient data to be shared and analysed in a privacy- and ownership-preserving manner, functioning as a surveillance system for detecting more virulent strains early.

10:40 AM - 25 mins

Break / Q&A with Dr. Denis Bauer

11:05 AM - 30 mins

Grand Ball Room 1

Self-supervised learning & making use of unlabelled data

Mat Kelcey

The general supervised learning problem starts with a labelled dataset, but it's common to also have a large collection of unlabelled data. Self-supervision techniques are a way to use this data to boost performance. In this talk, we'll review some contrastive learning techniques that can either provide weakly labelled data or act as a form of pre-training for few-shot learning.
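
The mechanism behind contrastive learning can be sketched in a few lines: pull an anchor towards its positive (for example, an augmented view of the same example) while pushing it away from negatives. A minimal InfoNCE-style loss in plain Python, with invented toy vectors (a real model would learn the embeddings that are hard-coded here):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Contrastive (InfoNCE-style) loss: low when the anchor is more
    similar to its positive than to any of the negatives."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    exps = [math.exp(s / temperature) for s in sims]
    return -math.log(exps[0] / sum(exps))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]                     # e.g. an augmented view of the anchor
negatives = [[0.0, 1.0], [-1.0, 0.2]]     # other examples in the batch

loss_good = info_nce(anchor, positive, negatives)          # well-aligned pair
loss_bad = info_nce(anchor, [0.0, 1.0], negatives + [[0.9, 0.1]])  # mismatched pair
```

Because no labels are needed (the positive comes from the data itself, e.g. by augmentation), this objective lets unlabelled data shape the representation before any few-shot fine-tuning.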

11:35 AM - 25 mins

Break / Q&A with Mat Kelcey
