Building Genomics Pipelines with AWS Lambda and Apache Spark
YOW! Hong Kong 2017
Lynn Langit shares lessons learned and cloud data pipeline patterns via examples from work she’s doing with CSIRO Bioinformatics Australia. The team there, led by Dr. Denis Bauer, is analyzing a number of large genomic datasets.
First, Lynn examines real-time analysis with cloud-based solutions. Keeping runtime constant can be challenging for problems that vary in complexity, such as genome engineering. The CSIRO GT-Scan2 tool works by instantaneously recruiting additional Lambda functions as the complexity increases. It was built on AWS services using a serverless microservices pattern.
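The fan-out idea described above can be sketched roughly as follows. This is a minimal illustration, not GT-Scan2's code: a local thread pool stands in for Lambda invocations, and the batch size `TARGETS_PER_WORKER`, the toy GC-fraction score, and the function names are all hypothetical. The point is only that the number of workers grows with the size of the problem, so each worker handles a bounded amount of work.

```python
import math
from concurrent.futures import ThreadPoolExecutor

TARGETS_PER_WORKER = 50  # hypothetical batch size, not a GT-Scan2 setting


def score_batch(batch):
    """Stand-in for one Lambda invocation: score each candidate sequence."""
    # Toy score (GC fraction); assumes every sequence is non-empty.
    return [(seq, sum(base in "GC" for base in seq) / len(seq)) for seq in batch]


def fan_out(targets):
    """Use more workers as the number of targets grows, one batch per worker."""
    n_workers = max(1, math.ceil(len(targets) / TARGETS_PER_WORKER))
    # Stride split: every batch holds at most TARGETS_PER_WORKER items.
    batches = [targets[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = [row for rows in pool.map(score_batch, batches) for row in rows]
    return n_workers, results


if __name__ == "__main__":
    workers, scored = fan_out(["ACGT"] * 120)
    print(workers, len(scored))
```

In a real serverless deployment each batch would presumably be handed to a separate function invocation rather than a local thread, which is what lets runtime stay roughly flat as the input grows.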
Next, Lynn demos a Jupyter notebook that shows how genomic research can leverage Apache Spark to massively parallelize the generation of random forests to identify disease genes efficiently. She also discusses the pipeline’s use of VariantSpark, an OSS library written by the team at CSIRO.
VariantSpark can analyze 3,000 samples with 80 million features in under 30 minutes. This pipeline enables real-time diagnosis by finding similar patients. This platform is contributing to motor neuron disease research (publicized by the Ice Bucket Challenge) in Australia.
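To make the parallel random-forest idea concrete, here is a toy sketch. It is not VariantSpark's API or algorithm: it builds single-split decision stumps rather than full trees, uses a local thread pool as a stand-in for Spark executors, and every name in it (`gini`, `best_feature`, `rank_features`) is hypothetical. What it does show is why the approach parallelizes well: each tree is trained independently on its own bootstrap sample and random feature subset, and features that win splits most often are the candidates worth a closer look.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_feature(job):
    """Train one decision stump and return the index of the feature it picks."""
    X, y, seed, n_try = job
    rng = random.Random(seed)
    rows = [rng.randrange(len(X)) for _ in X]        # bootstrap sample of rows
    feats = rng.sample(range(len(X[0])), n_try)      # random subset of features

    def split_impurity(f):
        # Weighted impurity after splitting on "feature f is zero or not".
        left = [y[r] for r in rows if X[r][f] == 0]
        right = [y[r] for r in rows if X[r][f] != 0]
        return (len(left) * gini(left) + len(right) * gini(right)) / len(rows)

    return min(feats, key=split_impurity)


def rank_features(X, y, n_trees=200, n_try=5):
    """Count how often each feature wins across independently trained stumps."""
    jobs = [(X, y, seed, n_try) for seed in range(n_trees)]
    with ThreadPoolExecutor() as pool:
        return Counter(pool.map(best_feature, jobs))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic genotypes (0, 1, or 2 per feature); feature 4 drives the label.
    X = [[rng.randint(0, 2) for _ in range(10)] for _ in range(200)]
    y = [1 if row[4] > 0 else 0 for row in X]
    print(rank_features(X, y).most_common(3))
```

Because no stump depends on any other, the same map step could be distributed across a cluster; the synthetic data here is tiny compared with the sample and feature counts quoted above.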
Big Data and Cloud Architect
Lynn Langit Consulting
Lynn Langit is an independent software architect and educator. She is an AWS Community Hero, Google Cloud Developer Expert, Microsoft MVP and technical author for LinkedIn Learning. She has most recently worked as a lead architect on an AWS IoT Enterprise project, where she applied Mob Programming.
Lynn is also Director & Lead Courseware Author for “Teaching Kids Programming”. She has 8 years of experience authoring technical courseware for middle school kids and has been a key contributor to the TKPJava courseware library, which has 70+ open source kids coding lessons.