Rolling out Error Budgets across a 1000-person global engineering organisation
YOW! CTO Summit 2019 Brisbane
Zendesk has struggled with reliability since its beginning; in many ways it has been a victim of its own overnight success. Over the last few years we’ve had to take drastic measures to address major outages, such as implementing company-wide change freezes.
These measures hurt when you have 1000 engineers in 120 product development teams across the globe, and in many ways create more risk when the freeze begins to thaw.
To avoid these freezes we have recently moved to implement concepts from the Site Reliability Engineering (SRE) discipline, specifically Error Budgets along with SLOs/SLIs. The aim is to “scope” the freeze to those systems that have more reliability issues.
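As a rough illustration of the idea (a minimal sketch, not Zendesk's actual tooling), an error budget can be read as the share of failures an SLO still allows within a measurement window, and a freeze can then be limited to the services that have spent theirs. The service names, the request-based SLI and the "freeze when the budget is exhausted" rule below are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceWindow:
    """Request counts for one service over one SLO window (hypothetical shape)."""
    name: str
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def budget_remaining(window: ServiceWindow) -> float:
    """Fraction of the error budget left; zero or negative means it is spent."""
    allowed_failures = (1 - window.slo_target) * window.total_requests
    if allowed_failures == 0:
        # No failures are permitted (100% target or no traffic).
        return 1.0 if window.failed_requests == 0 else 0.0
    return 1 - window.failed_requests / allowed_failures


def services_to_freeze(windows: list[ServiceWindow]) -> list[str]:
    """Scope the freeze to services whose budget is exhausted (assumed policy)."""
    return [w.name for w in windows if budget_remaining(w) <= 0]


# Hypothetical data: both services share a 99.9% SLO over 1,000,000 requests.
windows = [
    ServiceWindow("service-a", 0.999, 1_000_000, 200),    # well within budget
    ServiceWindow("service-b", 0.999, 1_000_000, 1_500),  # budget overspent
]
frozen = services_to_freeze(windows)
```

Under these assumptions only `service-b` would be frozen, while `service-a` keeps shipping changes.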
We’ve had some wins in introducing this approach, but are still very much at the beginning of this journey. This talk tells the story of that journey and offers practical suggestions on tooling and practices to implement.
Senior Director of Engineering
I am a technology executive with over 20 years' experience creating software solutions for the investment banking, insurance, logistics, utilities, travel publishing, sport and digital marketplace industries.
World-class software delivery is a complex dance that brings together product ideas, user experience design, architecture, high-quality, testable software and frequent, iterative deployments to production.
My career has been about choreographing teams, people, processes and technology to make that happen.
I currently manage the engineering teams that build and run the Zendesk Developer Platform, based in Melbourne.