Conference Program

All times displayed are in the Australia/Sydney timezone

8:30 AM

8:30 AM - 30 mins

YOW! Data Workshop Registration

9:00 AM

9:00 AM - 480 mins

Cliftons Sydney

Beyond Relational: Applying Big Data Cloud Pipeline Patterns

Lynn Langit

In this full-day workshop, you will learn applied big data solution patterns, most often, but not always, using the public cloud. We’ll cover Amazon Web Services and Google Cloud Platform, and work in small groups to design data pipeline architectures for common scenarios.

8:00 AM

8:00 AM - 45 mins

Registration for YOW! Data 2017

8:45 AM

8:45 AM - 15 mins

Session Overviews & Introductions

9:00 AM

9:00 AM - 45 mins

Grand Lodge

Cloud Data Pipelines for Genomics from a Bioinformatician and a Developer

Lynn Langit and Dr. Denis Bauer

Dr. Bauer and her team have been working to build genome-scale data pipelines that address the computational challenges and limits present in today’s cancer genomics (bioinformatics) data workflows.

Dr. Bauer and her team have built solutions which use modern architectures, such as serverless (AWS Lambda), as well as customised machine learning on Apache Spark. AWS Community Hero and cloud architect Lynn Langit is also collaborating with the CSIRO team to push solutions at the cutting edge of bioinformatics research which best utilise advances in cloud technologies.

In this demo-filled session, Lynn and Denis will discuss and demonstrate some of the latest cloud data pipeline work that they’ve been building together for the bioinformatics community.

9:50 AM

9:50 AM - 30 mins

Grand Lodge

The Future of Art

J. Rosenbaum

Most people are aware of the impact machine learning will have on jobs, on the future of research and autonomous machines, but few seem to be aware of the future role machine learning could play in the creative arts, in visual art and music. What will art be like when artists and musicians routinely work collaboratively with machines to create new and interesting artworks? What can we learn from art created using neural networks and what can we create? From the frivolous to the beautiful what does art created by computers look like and where can it take us?

This talk will explore Magenta in TensorFlow and Neural Style in Caffe, Google Deep Dream, The Next Rembrandt, and convolutional neural networks. I will look into some of the beautiful applications of machine learning in art and some of the ridiculous ones as well.

10:20 AM

10:20 AM - 30 mins

Morning Break

10:50 AM

10:50 AM - 30 mins

Grand Lodge

Looking Behind Microservices to Brewer's Theorem, Externalised Replication, and Event Driven Architecture

Mick Semb Wever

Scaling data is difficult, scaling people even more so.

Today, microservices make it possible to effectively scale both data and people by taking advantage of bounded contexts and Conway's law. But there's still a lot more theory coming together in our adventures with ever more data. Some of these ideas and theories are just history repeating, while others are newer concepts.

These ideas can be seen in many Microservices platforms, within the services' code but also in the surrounding infrastructural tools we become ever more reliant upon.

Mick'll take a dive into it using examples and offer recommendations after seven years of coding Microservices around 'big data' platforms. The presentation will be relevant to people wanting to move beyond REST-based synchronous platforms, to eventually consistent asynchronous designs that aim towards the goal of linear scalability and 100% availability.

11:25 AM

11:25 AM - 30 mins

Grand Lodge

Video Game Analytics on AWS

Richard Morwood

This talk will cover how to use AWS technologies to build an analytics system for video games which can be used to analyse player behaviour in near real-time. This system enables developers to identify trends in player difficulties, ease of use, the highs and lows of player engagement and how to visualise these results in-game. This demo uses a serverless approach for data capturing, processing and serving using AWS Mobile Analytics, Apache Spark on DataBricks, Athena and Lambda technologies.

Representing data in-game enables developers to see results in an environment they are already very familiar with and adjust level design to maximise engagement. Developers can use this information to track the effects of updated releases and easily identify whether changes have had the intended effect. These same techniques can be applied in many scenarios, including web tracking and click stream analytics.

12:00 PM

12:00 PM - 30 mins

Grand Lodge

Cast a Net Over your Data Lake

Natalia Ruemmele

As the variety of data continues to expand, the need for different kinds of analytics is increasing – big data is no longer just about the volume, but also about its increasing diversity. Unfortunately, there is no one-size-fits-all approach to analytics – no magic pill that will get your organization the insight it needs from data. Graph analytics offers a toolset to visualize your diverse data and to build more accurate predictive models by uncovering non-obvious inter-connections among your data sources.

In this talk we will discuss some use cases for graph analytics and walk through a particular scenario to find power-users for a promotion campaign. We will also cover machine learning approaches which can assist you in constructing graphs from diverse data sources.

12:30 PM

12:30 PM - 60 mins

Lunch Break

1:30 PM

1:30 PM - 30 mins

Grand Lodge

Energy Monitoring with Self-Taught Deep Networks

Sau Sheong Chang

Energy disaggregation allows detection of individual electrical appliances from aggregated energy usage time series data. Insights into individual appliances are very useful for different energy-related applications, for example energy monitoring, demand response, etc. Although it is very easy to collect a large volume of energy usage data, inspecting and labelling time series is very tedious and expensive.

In this talk, I will present a solution to explore this unlabelled time-series data using two deep networks. The first, RNN-based, deep network extracts good representations of energy time series windows without much human intervention. By transferring these representations from unlabelled to labelled data, the second deep network learns the model of the targeted electrical appliance.

2:00 PM

2:00 PM - 30 mins

Grand Lodge

Dipping into the Big Data River: Stream Analytics at Scale

Radek Ostrowski

This presentation explains the concept of Kappa and Lambda architectures and showcases how useful business knowledge can be extracted from the constantly flowing river of data.

It also demonstrates how a simple POC could be built in a day with only getting your toes wet by leveraging Docker and other technologies like Kafka, Spark and Cassandra.
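The Kappa-style idea hinted at here, that batch processing is just a bounded replay of the same stream, can be sketched in plain Python (a toy illustration only; the talk itself works with Kafka, Spark and Cassandra):

```python
from collections import Counter
from typing import Iterable

def count_events(events: Iterable[str]) -> Counter:
    """One code path serves both modes: a live stream and a batch
    replay of history are just iterables of events."""
    counts = Counter()
    for event in events:
        counts[event] += 1
    return counts

# "Batch" is simply a bounded replay of the historical stream.
historical = ["click", "view", "click", "purchase"]
counts = count_events(historical)
print(counts["click"])  # 2
```

The same aggregation function can then be fed from a Kafka consumer in production and from an archived topic for reprocessing.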

2:35 PM

2:35 PM - 30 mins

Grand Lodge

Writing Better R Code

Ondrej Ivanič

Data scientists, analysts, and statisticians are passionate about the data, models, and insights, but the code used to produce the results is, in many cases, left behind. We have a very good understanding of our code base while we are working on a project, but most of the time we do not write the code for the "future me".

In this talk, I describe and explain common coding pitfalls in R, and then introduce functional programming using functions from base R, purrr (part of the tidyverse) and pipes as a preferred solution for creating robust and reusable R code. Between the topics, I briefly touch on controversial opinions such as "loops are bad" and "pipes are the best".

3:05 PM

3:05 PM - 30 mins

Afternoon Break

3:35 PM

3:35 PM - 30 mins

Grand Lodge

From Little Things, Big Data Grow - IoT at Scale

Christopher Biggs

The Internet of Things (IoT) is really about the ubiquity of data: the possibility of humans extending their awareness and reach globally, or further.
IoT frees us from the tedium of physically monitoring or maintaining remote systems, but to be effective we must be able to rely on data being not only accessible but comprehensible.

This presentation covers three main areas of an IoT big data strategy:

  • The Air Gap - options (from obvious to inventive) for connecting wireless devices to the internet
  • Tributaries - designing a scalable architecture for amalgamating IoT data flows into your data lake. Covers recommended API and message-bus architectures.
  • Management and visualisation - how to characterise and address IoT devices in ways that scale to continental populations. I will show some examples of large scale installations to which I've contributed and how to cope with information overload.

4:10 PM

4:10 PM - 30 mins

Grand Lodge

A Geometric Approach towards Data Analysis and Visualisation

Daniel Filonik

Beginning with the work of Bertin, visualisation scholars have attempted to systematically study and deconstruct visualisations in order to gain insights about their fundamental structure. More recently, the idea of deconstructing visualizations into fine-grained, modular units of composition also lies at the heart of graphics grammars. These theories provide the foundation for visualization frameworks and interfaces developed as part of ongoing research, as well as state-of-the-art commercial software, such as Tableau. In a similar vein, scholars like Tufte have long advocated to forego embellishments and decorations in favor of abstract and minimalist representations. They argue that such representations facilitate data analysis by communicating only essential information and minimizing distraction.

This presentation continues along such lines of thought, proposing that this pursuit naturally leads to a geometric approach towards data analysis and visualisation. Looking at data from a sufficiently high level of abstraction, one inevitably returns to fundamental mathematical concepts. As one of the oldest branches of mathematics, geometry offers a vast amount of knowledge that can be applied to the formal study of visualisations.

“Visualization is a method of computing. It transforms the symbolic into the geometric.” (McCormick et al., 1987)

In other words, geometry is the mathematical link between abstract information and graphic representation. In order to graphically represent information, we assign to it a geometric form. In this presentation we will explore the nature of these mappings from symbolic to geometric representations. This geometric approach provides an alternative perspective towards analysing data. This perspective is inherently equipped with high-level abstractions and invites generalization. It enables the study of abstract geometric objects independent of a concrete presentation medium. Consequently, it allows us to interpret data directly through geometric primitives and transformations.

The presentation illustrates the geometric approach using diverse examples and illustrations. In turn, we discuss the opportunities and challenges that arise from this perspective. For instance, a key benefit of this approach is that it allows us to consider seemingly disparate visualization types in a unified framework. By systematically enumerating the design space of geometric representations, it is possible to apply extensions and modifications trivially, resulting in great expressiveness. The approach naturally extends to visualisation techniques for complex, multidimensional, multivariate data sets. However, the effectiveness of the resulting representations, and the cognitive challenges in their interpretation, require careful consideration.

4:45 PM

4:45 PM - 30 mins

Grand Lodge

Learnings from Building Data Products at Zendesk

Bob Raman

In this talk you will learn about the team structure and process for building Data Products, drawing on the lessons of one of the teams that builds Data Products at Zendesk. The Data Product team uses machine learning to build Data Products that will reduce the cost of customer support for Zendesk's 100,000-odd customers.

This talk will explain the journey of the Data Product team to date: its structure and how it has evolved, its challenges, as well as its successes and failures.

5:20 PM

5:20 PM - 45 mins

Grand Lodge

Scalable IOT with Apache Cassandra.

Aaron Morton

IOT and event-based systems can process huge volumes of data, which typically needs to be stored and read in near real time for event processing, in addition to being read in bulk to feed data-hungry learning systems. Apache Cassandra provides a high performance, scalable, and fault tolerant database platform with excellent support for the time series data models typically seen in IOT systems. Its millisecond (or better) latency can support systems that react to events in real time, while scalable bulk reads via batch processing systems such as Apache Hadoop and Apache Spark can support learning applications. These features, and more, make Cassandra an ideal persistence platform for modern data intensive, event driven systems.

In this talk Aaron Morton, CEO at The Last Pickle, will discuss lessons learned using Cassandra for IOT systems. He will explain how Cassandra fits into the modern technology landscape and dive into data modelling for common IOT use cases, capacity planning for huge data loads, tuning for high performance, and integration with other data driven systems. Whether starting a new project, or deep in the weeds on an existing system, attendees will leave with an understanding of how Apache Cassandra can help build robust infrastructure for IOT systems.

6:05 PM

6:05 PM - 60 mins

Conference Drinks & Networking

8:45 AM

8:45 AM - 15 mins

Session Overviews & Introductions

9:00 AM

9:00 AM - 45 mins

Grand Lodge

Processing Data of Any Size with Apache Beam

Jesse Anderson

Rewriting code as you scale is a terrible waste of time. You have perfectly working code, but it doesn’t scale. You really need code that works at any size, whether that’s a megabyte or a terabyte. Beam allows you to learn a single API and process data as it grows. You don’t have to rewrite at every step.

In this session, we will talk about Beam and its API. We'll see how Beam executes on big data or small data. We'll touch on some of the advanced features that make Beam an interesting choice.

9:50 AM

9:50 AM - 30 mins

Grand Lodge

Image Recognition for Non-Experts: From Google Cloud Vision to Tensorflow

Gareth Jones

Displaying an inappropriate ad on a website can be a major headache for an Ad network. Showing ads for a site’s major competitor, or ads in a category at odds with the site’s brand, for example, can cause embarrassment and lost revenue. With the selection of ads being largely algorithmic it can be hard to set up rules to make sure this doesn’t happen. You also don’t want your first awareness of the problem being a call from an angry CEO.

This talk shows how we built a system that uses image recognition to detect Ad Breaches. Our first version makes use of Google’s Cloud Vision API. The Cloud Vision API is a pre-trained service that recognises many categories of objects from images, along with some text recognition. I’ll discuss how to use the Cloud Vision API in your applications, what it is good at, and what it is not.

I’ll then look at using transfer learning to improve our system’s ability to recognise Ad Breaches. I will look at how we can use the popular Tensorflow library to build our own image recognition model. Tensorflow comes with several pre-trained models for image recognition - using these I will show you how to build your own specialised image recognition models in a fraction of the time, and with a fraction of the input data, by re-using existing pre-trained layers from the best models out there. I’ll investigate whether we can train a model to detect potential ad breaches from a small set of examples.

10:20 AM

10:20 AM - 30 mins

Morning Break

10:50 AM

10:50 AM - 30 mins

Grand Lodge

Realizing the Promise of Portable Data Processing with Apache Beam

Davor Bonaci

The world of big data involves an ever changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the Big Data ecosystem together; it enables users to "run-anything-anywhere".

This talk will briefly cover the capabilities of the Beam model for data processing, as well as the current state of the Beam ecosystem. We'll discuss Beam architecture and dive into the portability layer. We'll offer a technical analysis of Beam's powerful primitive operations that enable true and reliable portability across diverse environments. Finally, we'll demonstrate a complex pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Spark on Amazon Web Services, Apache Flink on Google Cloud, Apache Apex on-premise), and give a glimpse at some of the challenges Beam aims to address in the future.

11:25 AM

11:25 AM - 30 mins

Grand Lodge

Covariate Shift - Challenges and Good Practice

Joyce Wang

A fundamental assumption in supervised machine learning is that both the training and query data are drawn from the same population/distribution. However, in real-life applications this is very often not the case as the query data distribution is unknown and cannot be guaranteed a-priori. Selection bias in collecting training samples will change the distribution of the training data from that of the overall population. This problem is known as covariate shift in the machine learning literature, and using a machine learning algorithm in this situation can result in spurious and often over-confident predictions.

Covariate shift is only detectable when we have access to query data. Visualisation of training and query data can help to gain an initial impression. Machine learning models can also be used to detect covariate shift: for example, a Gaussian Process can model how similar each query point is to the feature space of the training data, and a one-class SVM can flag query points that are outliers with respect to the training data. Both strategies detect query points that live in a different domain of the feature space from the training dataset.

We suggest two strategies to mitigate covariate shift: re-weighting training data, and active learning with probabilistic models.

First, re-weighting the training data is the process of matching distribution statistics between the training and query sets in feature space. When the model is trained (and validated) on re-weighted data, it is expected to generalise better to query data. However, significant overlap between training and query datasets is required.

Secondly, there may be a situation where we can acquire the labels of a small portion of the query set, potentially at great expense, to reduce the effects of covariate shift. Probabilistic models are required in this case because they indicate the uncertainty in their prediction. Active learning enables us to optimally select small subsets of query points that aim to maximally shrink the uncertainty in our overall prediction.
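As a rough sketch of the first strategy, re-weighting can be done by estimating the density ratio q(x)/p(x) between the query and training distributions. The histogram-based, one-dimensional version below is a hypothetical illustration only, not the speakers' method:

```python
from collections import Counter

def importance_weights(train, query, bin_width=1.0):
    """Toy 1-D density-ratio re-weighting: weight each training point
    by q(x)/p(x), the ratio of query to training densities, estimated
    from histogram bin frequencies. Assumes the supports overlap."""
    def to_bin(x):
        return int(x // bin_width)

    p = Counter(to_bin(x) for x in train)   # training histogram
    q = Counter(to_bin(x) for x in query)   # query histogram
    weights = []
    for x in train:
        b = to_bin(x)
        p_hat = p[b] / len(train)
        q_hat = q[b] / len(query)           # may be 0 outside query support
        weights.append(q_hat / p_hat)       # p_hat > 0 since x is in train
    return weights

# Training data over-samples small x while the query set favours
# large x, so the high-x training point receives a larger weight.
weights = importance_weights([0.1, 0.2, 0.3, 1.5], [1.1, 1.4, 1.8, 0.2])
```

In practice, kernel-based density-ratio estimators or a classifier trained to separate training from query samples are more robust than raw histograms, but the weighting principle is the same.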

12:00 PM

12:00 PM - 30 mins

Grand Lodge

Batch as a Special Case of Streaming

Roman Kovalik

In this talk I will share my team's gruelling journey in attempting to migrate a batch-like system onto a streaming framework.

Walking through the various solutions that we tested using Flink, I'll discuss each one's performance characteristics and bring to light misconceptions in their designs.

12:30 PM

12:30 PM - 60 mins

Lunch Break

1:30 PM

1:30 PM - 45 mins

Grand Lodge

Apache Spark for Machine Learning on Large Data Sets

Juliet Hougland

Apache Spark is a general purpose framework for distributed data processing. With MLlib, Spark's machine learning library, fitting a model to a huge data set becomes very easy. Similarly, Spark's general purpose functionality enables application of a model across a large collection of observations. We'll walk through fitting a model to a big data set using MLlib, and applying a trained scikit-learn model to a large data set.

2:20 PM

2:20 PM - 30 mins

Grand Lodge

Metrivour - Recording and Analyzing Metrics from the Electric Power Network

Arnold deVos

Metrivour is a metrics recording and analytics system we developed for electric power operations and planning. The metrics are physical quantities such as voltage, current and power in an electric power network.

Some aspects of the system are familiar. The storage backend is a Cassandra database cluster (often used for metrics). The implementation consists of services written in Java and Scala.

Other aspects are distinctive: the system has a purpose-built analytics engine and query language.

The goal is to reduce volumes of noisy, irregular transducer measurements to regular time series of reasonable dimensions. This enables the next level of analysis to be performed by standard tools.
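The reduction step described here, from irregular readings to a regular series, commonly amounts to bucketing by a fixed interval and aggregating within each bucket. A minimal sketch follows; the Metrivour engine itself is not public, so the names and numbers are illustrative:

```python
from collections import defaultdict

def regularise(samples, interval=60):
    """Reduce irregular (timestamp, value) transducer readings to a
    regular time series by averaging within fixed-width intervals."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % interval].append(value)  # align to interval start
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

# Noisy voltage readings at irregular times, averaged per minute.
readings = [(3, 230.0), (45, 229.0), (70, 231.0)]
series = regularise(readings, interval=60)
print(series)  # {0: 229.5, 60: 231.0}
```

The regular series that results can then be consumed by standard time series analysis tools.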

2:55 PM

2:55 PM - 30 mins

Grand Lodge

Low Latency Polyglot Model Scoring using Apache Apex

Ananth Gundabattula

Data science is fast becoming a complementary approach and process for solving business challenges today. The explosion of frameworks that help data scientists build models bears testimony to this. However, when a model needs to be turned into a production version in very low latency, enterprise-grade environments, there are very few choices, each with its own strengths and weaknesses. Adding to this is the current disconnect between a data scientist's world, which is all about modelling, and an engineer's world, which is about SLAs and service guarantees. A framework like Apache Apex can complement each of these roles and provide constructs for both of these worlds. This would help enterprises drastically cut the cost of deploying models to production environments.

The talk will present Apache Apex as a framework that can enable engineers and data scientists to build low latency, enterprise-grade applications. We will cover the foundations of Apex that contribute to the low latency processing capabilities of the platform. Subsequently, aspects of the platform that make it qualify as enterprise grade are discussed. Finally, we will cover the main aspect of the talk's title, wherein models developed in Java, R and Python can co-exist in the same scoring application framework, thus enabling a truly polyglot framework.

3:25 PM

3:25 PM - 30 mins

Afternoon Break

3:55 PM

3:55 PM - 30 mins

Grand Lodge

Introduction to Apache Amaterasu (Incubating): A CD Framework for your Big Data Pipelines

Yaniv Rodenski

In the last few years, the DevOps movement has introduced groundbreaking approaches to the way we manage the lifecycle of software development and deployment. Today organisations aspire to fully automate the deployment of microservices and web applications with tools such as Chef, Puppet and Ansible. However, the deployment of data-processing pipelines remains a relic from the dark ages of software development.

Processing large-scale data pipelines is the main engineering task of the Big Data era, and it should be treated with the same respect and craftsmanship as any other piece of software. That is why we created Apache Amaterasu (Incubating) - an open source framework that takes care of the specific needs of Big Data applications in the world of continuous delivery.

In this session, we will take a close look at Apache Amaterasu (Incubating), a simple and powerful framework to build and dispense pipelines. Amaterasu aims to help data engineers and data scientists to compose, configure, test, package, deploy and execute data pipelines written using multiple tools, languages and frameworks.
We will see what Amaterasu provides today, how it can help existing Big Data applications, and demo some of the new bits that are coming in the near future.

4:30 PM

4:30 PM - 30 mins

Grand Lodge

The Network, The Kingmaker: Distributed Tracing and Zipkin

Mick Semb Wever

Adding Zipkin instrumentation to a codebase makes it possible to create one tracing view across an entire platform. This is the oft-mentioned "correlation identifier" recommended for Microservices, yet one with few solid open-source solutions available. It is an aspect of monitoring distributed platforms akin to the separate concerns of metrics and log aggregation.

This talk will use the case of extending Apache Cassandra's tracing to use Zipkin, so as to demonstrate a single tracing view across an entire system: from the browser and HTTP, through a distributed platform, and into the database, down to seeks on disk. Put together, it makes it easy to identify which queries to a particular service took the longest and to trace back how the application made them.
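The mechanics of the correlation identifier are simple: each service adopts the incoming trace id and mints a new span id before calling downstream. A minimal sketch, using Zipkin's B3 header names but an otherwise hypothetical service:

```python
import uuid

def propagate(headers):
    """Adopt the caller's trace id (or start a new trace at the edge)
    and mint a new span id for this hop. Header names follow Zipkin's
    B3 convention; the service itself is hypothetical."""
    trace_id = headers.get("X-B3-TraceId") or uuid.uuid4().hex[:16]
    return {
        "X-B3-TraceId": trace_id,              # constant across the call tree
        "X-B3-SpanId": uuid.uuid4().hex[:16],  # unique per hop
        "X-B3-ParentSpanId": headers.get("X-B3-SpanId"),
    }

# Every downstream call reuses the trace id, so one request can be
# followed from the browser through each service it touches.
edge = propagate({})          # new trace begins at the edge
downstream = propagate(edge)  # same trace, new span
assert downstream["X-B3-TraceId"] == edge["X-B3-TraceId"]
```

Real instrumentation libraries also record timing and report spans to a Zipkin collector; the header propagation above is only the correlation part.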

This presentation will raise the requirements and expectations DevOps has of their infrastructural tools. It is for people who want to take their infrastructural tools to the next level, where the network is known as the kingmaker.

5:05 PM

5:05 PM - 30 mins

Grand Lodge

What is the Most Common Street Name in Australia?

Rachel Bunder

Finding the most common street name in Australia may sound relatively simple, but it quickly leads to other questions. What is a street name? Do The Avenue, The Grand Parade and The Serpentine all share the same name? And what is a street? Is the M5 Motorway a street? What about M5 Motorway Offramp?

This talk will answer these questions using Open Street Map and Python: in particular, reading in and manipulating Open Street Map data using geopandas, exploring the structure of Open Street Map, creating models for parsing street names, and finding the most common street name in Australia.
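The core of the counting problem can be sketched with the standard library alone. The street list and suffix set below are hypothetical; the real analysis reads Open Street Map extracts with geopandas:

```python
from collections import Counter

# Hypothetical sample; a real analysis would load OSM data.
streets = ["Victoria Street", "George Street", "Victoria Road",
           "The Avenue", "Victoria Street"]

SUFFIXES = {"Street", "Road", "Avenue", "Parade", "Lane"}

def base_name(street: str) -> str:
    """Strip a trailing street-type word, keeping 'The ...' names whole."""
    words = street.split()
    if len(words) > 1 and words[-1] in SUFFIXES and words[0] != "The":
        return " ".join(words[:-1])
    return street

counts = Counter(base_name(s) for s in streets)
print(counts.most_common(1))  # [('Victoria', 3)]
```

Even this toy version has to take a position on the questions the abstract raises: here "The Avenue" keeps its article, and "Victoria Street" and "Victoria Road" are counted as the same name.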
