Wednesday, October 2
 

07:45 BST

Morning Coffee and Tea
Wednesday October 2, 2019 07:45 - 08:45 BST
The Forum

08:45 BST

Opening Remarks
Speakers

Emil Stolarsky

Site Reliability Engineer, Wave Mobile Money
Emil is an SRE at Wave Mobile Money, helping make Africa the first cashless continent. Previously, he worked on caching, performance, and disaster recovery at Shopify, the internal Kubernetes platform at DigitalOcean, and everything in between at Cheddar. In addition to speaking at...

Murali Suriar

Snowflake
Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Working on traffic management at Snowflake after 12 years at Google. Currently learning what "the cloud is just someone else's computer" means.


Wednesday October 2, 2019 08:45 - 09:00 BST
The Liffey B

09:00 BST

The SRE I Aspire to Be
Yaniv Aknin dives into the secret sauce for a successful SRE organization: high-quality measurements of reliability. He explains why measuring reliability is crucial (and why it’s so hard), shares a couple of tips for getting it right, and explores why it's key for SREs doing Engineering work.

Speakers

Yaniv Aknin

Google Cloud
Yaniv Aknin is Google Cloud Platform’s lead for quantitative reliability. He works with product managers, developers, and fellow SREs to create availability and performance metrics that accurately model customers’ experience, then optimizes those metrics toward the right reliability/cost...


Wednesday October 2, 2019 09:00 - 09:45 BST
The Liffey B

09:45 BST

Opening Plenary
Speakers

Nancy Leveson

MIT
Dr. Nancy Leveson is a professor of Aeronautics and Astronautics at MIT. She has spent 35 years working to make the world safer in such fields as transportation, healthcare, petrochemicals, nuclear power, aerospace, etc. One common element throughout all her work is an emphasis on...


Wednesday October 2, 2019 09:45 - 10:30 BST
The Liffey B

10:30 BST

Break with Refreshments
Wednesday October 2, 2019 10:30 - 11:00 BST
The Forum

11:00 BST

SLOs for Data-Intensive Services
Designing and maintaining a search engine service can be challenging. One of the challenges is to set insightful SLOs where standard availability/latency SLOs do not fit. We will go through our journey towards defining a monitoring process for such services at Booking.com, from ineffective availability/latency SLOs to the current setup and all its advantages; travelling in a world where providing accurate and consistent responses can be as important as availability.

Speakers

Yoann Fouquet

Booking.com
Yoann is a Site Reliability Engineer at Booking.com, working on core services within the Booking.com infrastructure.


Wednesday October 2, 2019 11:00 - 11:30 BST
Track 1: The Liffey B

11:00 BST

A Tale of Two Rotations: Building a Humane & Effective On-Call
Everyone wants to provide excellent and reliable service to their customers, but the world is a messy place. Things will break for reasons inside and outside of your control, and for the most unexpected reasons. At the end of the day, someone is going to be the on-call and step in to restore order to keep customers happy.

The question is, how do you keep your on-call as happy as your customers?

This talk examines how highly critical on-call rotations that protect core functionality can be made extremely effective and low-stress, and how completely ordinary rotations can get out of hand. We’ll discuss how best practices from the first rotation were successfully applied to the second, and how to apply them in your own rotation.

Speakers

Nick Lee

Production Engineer, Uber
Nick Lee has worked at Uber Amsterdam for three years, starting off as a backend engineer building user facing features and transitioning over to Production Engineering as he discovered how great reliability and toil reduction work is.


Wednesday October 2, 2019 11:00 - 11:30 BST
Track 2: The Liffey A

11:00 BST

Automating HA Deployments with BGP, IPv6, and Anycast
Using BGP to load balance and fail over an application for higher site reliability has traditionally been expensive and tricky. In this workshop, we’ll walk through setting up a multi-host, load-balanced web application with failover using BGP, IPv6, and open source technologies including Terraform and BIRD, an open source router.

Speakers

John Studarus

Cloud Engineer, Packet
John merges his interests in computing infrastructure, networking, and software security. His background includes leading product teams, writing prototype code and examining distributed systems at Fortune 500s and startups alike. He brings a rare combination of technical expertise...


Wednesday October 2, 2019 11:00 - 12:30 BST
Track 3: Liffey Hall 2

11:30 BST

Latency SLOs Done Right
Latency is a key indicator of service quality, and important to measure and track. However, measuring latency correctly is not easy. In contrast to familiar metrics like CPU utilization or request counts, the "latency" of a service is not easily expressed in numbers. Percentile metrics have become a popular means to measure request latency, but have several shortcomings, especially when it comes to aggregation. The situation is particularly dire if we want to use them to specify Service Level Objectives (SLOs) that quantify performance over longer time horizons. In this talk we will explain these pitfalls and suggest three practical methods for implementing effective latency SLOs.

Speakers

Wednesday October 2, 2019 11:30 - 12:00 BST
Track 1: The Liffey B

11:30 BST

Support Operations Engineering: Scaling Developer Products to the Millions
Large scale internet infrastructure companies are increasingly relied upon by other engineering organisations, from self-serve customers to large enterprise organisations. The duty of helping customers’ SRE and engineering teams diagnose complex and stressful issues will likely rest with technical support. Support Operations Engineers complement this by treating support as an optimisation problem with engineering solutions.

This talk describes the essential principles that Cloudflare’s Support Operations Engineers use to scale developer support in a large-scale internet infrastructure company serving 16+ million customer domains and more than 10% of global HTTP requests, whilst driving dramatic improvements in operational efficiency and delivering exceptional business value.

This talk will cover how Stateless Testing is used to introduce proactive support and improve customer retention, how Safety Engineering strategies are used with Machine Learning to automate customer support and how Operations Research with alerting data is used to create a next-gen Security Operations Centre.

Speakers

Junade Ali

Cloudflare
Junade Ali is an Engineering Manager at Cloudflare, focusing on building the Support Operations Group. He has previously worked on high-integrity software for safety critical applications and previously served as Lead Developer of the largest digital agency in the UK (by headcount...


Wednesday October 2, 2019 11:30 - 12:00 BST
Track 2: The Liffey A

12:00 BST

Building a Scalable Monitoring System
A year ago, my company's monitoring setup was a disaster! We had 6 different monitoring tools sending alerts all over the place. In this talk, I will share how we overhauled our entire monitoring system and created a single, centralized, easy-to-use system that fits all of our needs. Not only does it get the job done, but because it is so simple to use, developers have bought into the system and are actively helping to improve it as well.

Speakers

Molly Struve

Kenna Security
Molly Struve is the Lead Site Reliability Engineer at Kenna Security. She joined Kenna in 2015 and has had the opportunity to work on some of the most challenging aspects of Kenna’s code base. This includes scaling Elasticsearch, sharding MySQL databases, and creating infrastructure...


Wednesday October 2, 2019 12:00 - 12:30 BST
Track 1: The Liffey B

12:00 BST

The Unmonitored Failure Domain: Mental Health
As stigma around mental health slowly peels away, a lot of our current conversations are centered around this individual model: Operators are responsible for watching their own stress levels, well-being, and avoiding burnout.

Yet, mental health can be contagious among team members. Studies show that if one team member is feeling stressed, anxious, or burnt out, that feeling will slowly spread to their co-workers. We must start addressing mental health on a team, organizational, and systemic level.

Attendees will leave with a new perspective of how they can use existing SRE approaches to improve mental health (e.g. SLOs) and a set of strategies for improving mental health (e.g. self-compassion and mindfulness). They’ll understand how the benefits from improving team well-being are widespread, and that, just as there are patterns for ensuring our systems remain resilient in the face of pressure, we can arm our teams with techniques as well.

Speakers

Jaime Woo

Incident Labs
Jaime began his career as a molecular biologist before following his passion for writing. He is an award-nominated writer, focusing his work on the locus between culture and technology, with recent works in the Advocate, the Globe and Mail, and StarTrek.com. He is co-founder of Incident...


Wednesday October 2, 2019 12:00 - 12:30 BST
Track 2: The Liffey A

12:30 BST

Luncheon
Wednesday October 2, 2019 12:30 - 14:00 BST
The Forum

14:00 BST

Control Theory for SRE
Control Theory is a long and well-studied discipline in engineering. Nearly every large scale industrial process has dedicated control engineers, creating and maintaining safety and quality systems by assuring that parameters remain within bounds—or alert appropriately.

This session will teach you how to create a PID (Proportional, Integral, Derivative) controller to autoscale your Kubernetes deployment based on a custom target. This controller ensures smooth scale-up and scale-down.

Speakers

Ted Hahn

Site Reliability Engineer, TCB Technologies, Inc
Ted Hahn is an SRE for hire working on planet-scale distributed systems.

Mark Hahn

Solutions Architect, Qualys
Mark Hahn is Qualys’s Solutions Architect for Cloud and DevOps Security. In this role he works with Qualys’s clients to ensure that cloud applications and infrastructure are secure and reliable. Mark uses DevSecOps and Site Reliability Engineering practices to ensure that software...


Wednesday October 2, 2019 14:00 - 14:30 BST
Track 1: The Liffey B

14:00 BST

Being Reasonable about SRE
When companies try to adopt SRE they're often just following a trend. They're doing so without previous analysis of the situation, expecting magic to start happening from day one. By the time they learn that this step hasn't really given them what they hoped for, there's a ton of frustration and bad taste. Let's look at how to explore what SRE is going to be doing in your company and how to build strong relationships with other teams.

Speakers

Vit Urbanec

Unity Technologies
Vítek has joined the SRE movement with a background of systems architecture, consulting and infrastructure automation. He likes bridging the gap between the operations and service owners to get the most out of the DevOps ideals. He also leads the Unity DevMetal band in Helsinki...


Wednesday October 2, 2019 14:00 - 14:30 BST
Track 2: The Liffey A

14:00 BST

Implementing Distributed Consensus
May we introduce "Skinny," an education-focused, distributed lock service.

With the help of Skinny, we will:
  • briefly look at the Paxos protocol
  • see an example of a typical Paxos run
  • design a simple distributed consensus protocol
  • learn the tricky parts of implementing our simple distributed consensus protocol
  • gradually move from theory-level to coding-level, solving small challenges (network, availability, fault-tolerance) along the way

This short workshop addresses engineers who have had little exposure to the inner workings of distributed consensus, who want to learn about distributed consensus as they start building distributed systems, and who have worked with ready-made distributed consensus solutions such as ZooKeeper and etcd but strive to understand the underlying theory as well.

Disclaimer: This work is not affiliated with any company (including Google) and is purely educational!

Speakers

Dan Lüdtke

Google
Dan is a Site Reliability Manager in Munich. He contributes to open source software projects, regularly helps to organize large hacker events, runs an autonomous system for fun, and dreams of space travel. Prior to Google, Dan served his country, worked as a security consultant, joined...

Kordian Bruck

Google
As a Site Reliability Engineer, Kordian is touching production systems every day to prevent disasters. He loves iterating over architecture and organization structure to overcome Conway's Law. Pizza and funny cat videos enabled him to get a master's degree in computer science from...


Wednesday October 2, 2019 14:00 - 15:30 BST
Track 4: Liffey Meeting Room 2

14:00 BST

SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours
This workshop ties together academic and practical aspects of systems engineering, with an emphasis on applying principles of systems design to a production service. We will analyze the service to quantify its performance, and iteratively improve the design.

Participants will work together in small groups to sketch out the design, identify components and their relationships, and to assess the suitability of the design to the system’s Service Level Objective (SLO). Participants will have a system design and bill of materials at the conclusion of this workshop.

Participants will not need laptops or specific coding experience; participants will need enthusiasm for collaborating in small groups, and for discussion-based problem-solving. Participants will come away with an understanding of the principles of iterative systems engineering, popularly known as “Non-abstract large systems design”.

This workshop covers material critical for SRE, an increasingly-broad field that combines software engineering and systems design.

Speakers

Alex Perry

Google LLC
Alex Perry is a Staff SRE in Los Angeles and has been at Google for the last 13 years. He has worked on many layers of network infrastructure, from fabrics to beyond corp services, as well as social and other applications. Recently, he has been working on migrating internal enterprise systems from existing...

Andrew Suffield

Goldman Sachs
Andrew Suffield is an SRE at Goldman Sachs in London. They tend to focus on production automation, distributed systems design, and teaching.


Wednesday October 2, 2019 14:00 - 17:30 BST
Track 3: Liffey Hall 2

14:30 BST

Eventually Consistent Service Discovery
Traditionally, service discovery has leaned towards strong consistency. If you are querying an endpoint, ideally you don't have to deal with split brain on the set of active healthy nodes. This talk will demonstrate how systems such as Envoy are shifting away from the strongly consistent coordinator model and keeping service discovery strictly out of the hot path of the data plane.

We will be covering systems like ZooKeeper and Raft from first principles to discuss how systems like Kafka and etcd handle their service discovery. We will talk about the practicalities of scaling systems like Envoy for tens of thousands of endpoints in a constantly shifting environment.

Speakers

Suhail Patel

Platform Engineer, Monzo Bank
Suhail is an Engineer on the Platform Squad. He focuses on reliability and database operations, ensuring that Monzo customers have access to managing their money 24/7. Suhail has spoken at other conferences such as SRECon Ireland 2019 and QCon London 2019. He has also spoken in various...


Wednesday October 2, 2019 14:30 - 15:00 BST
Track 1: The Liffey B

14:30 BST

From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations
2019 is a brilliant time for SRE, and it's time to bring the field to organisations of every size! Smaller tech teams (≤ ~50 engineers) often encounter unique technical and management challenges during SRE adoption. For example, a full on-call load may spell burnout, yet a typical SRE approach to risk may cause concern. Moreover, misaligned incentives impede operational excellence whenever handing back the pager could spell the end of the service – and the organisation!

This talk follows the journey of an SRE team built from scratch – starting with "Is SRE right for you?", we explore practical technical and team guidance to gain buy-in for SRE and usher cultures of continual experimentation. We discuss challenges and blindspots which may cause surprise for teams at all stages of maturity.

Whether you are preparing to establish a reliability team or you already practise SRE, the practical guidance in this talk will ensure your efforts are a success.

Speakers

Matthew Huxtable

Sparx
Matt founded and now leads the Site Reliability Engineering team at Sparx, an evidence-based education technology and data science company. With a background in systems engineering and Computer Science, he spends his days maintaining and promoting reliability of the core Sparx platform...


Wednesday October 2, 2019 14:30 - 15:00 BST
Track 2: The Liffey A

15:00 BST

Network Monitor: A Tale of ACKnowledging an Observability Gap
In the Fall of 2018 we spent nearly 6 weeks debugging Redis connection issues from our core app, pulling in many engineers along the way. The smoking gun to get our cloud provider involved was a high number of TCP retransmits. After bringing this evidence to them, their network engineers were able to fix the issue.

This incident showed us that we had an observability gap, due to lack of access and monitoring in our cloud environment. To this end, we built network monitor, a daemon running on all of our nodes to collect relevant network data. This daemon has evolved into a generic eBPF (extended Berkeley Packet Filter) orchestrator. In this talk, you'll learn about what we've built, and should walk away understanding why monitoring your network is a valuable endeavour, as well as how your teams can use eBPF to improve your observability stack.

Speakers

Jason Gedge

Staff Production Engineer, Shopify
Jason is a Staff Production Engineer on the service communication team at Shopify. In the past, he spearheaded the first iteration of Shopify’s self-serve cloud platform and is now rolling out their first cloud service communication mesh. On the side, he is keeping busy in the #crazy-cat-people...


Wednesday October 2, 2019 15:00 - 15:30 BST
Track 1: The Liffey B

15:00 BST

My Life as a Solo SRE
2015 was the worst year of my professional career. Between botched fail-forward releases, major customer-impacting incidents and weakly supported features, I was being worn down. And then in 2016, I helped introduce the organisation to SRE culture. I became the first SRE and helped change the engineering organisation. Over the next few months and years, I championed SLIs, drove down MTTA/MTTR and improved release cadence. In this talk, you will hear my tales of woe, but I will leave you with advice and tips on how to make your life and your organisation better.

Speakers

Brian Murphy

SRE Manager, G-Research
Brian Murphy is an SRE by nature and a manager by training. He currently works for G-Research where he built and leads an SRE team. Previously, he was an SRE Manager for a startup that was bought by Cisco. He currently lives with his wife, son and sassy dog in West London. His superpower...


Wednesday October 2, 2019 15:00 - 15:30 BST
Track 2: The Liffey A

15:30 BST

Break with Refreshments
Wednesday October 2, 2019 15:30 - 16:00 BST
The Forum

16:00 BST

Zero Touch Prod: Towards Safer and More Secure Production Environments
Many outages are caused by human mistakes when interacting with the production environment: typos when running the tools, accidentally running tests against production systems, errors in configuration files, etc. In addition, there is a risk of such outages being caused by malicious insider actors. Zero Touch Prod (ZTP) mitigates those risks by providing principles and tooling to make all production changes via automation, safe proxies, or audited break-glass.

Speakers

Michał Czapiński

Google Switzerland
Michał Czapiński is a senior SRE focusing on security and safety of the compute infrastructure at Google. Before joining Google Switzerland in 2014, he had received a PhD in High-Performance Computing at Cranfield University (UK). Outside of work he loves mountaineering, race snowboard...

Rainer Wolafka

Google Switzerland
Rainer Wolafka is a Site Reliability Manager focusing on planet scale technical infrastructure and production safety at Google. Before joining Google Switzerland in 2015, Rainer worked on distributed file systems for IBM's Research and Development organization. Outside of work he...


Wednesday October 2, 2019 16:00 - 16:45 BST
Track 1: The Liffey B

16:00 BST

All of Our ML Ideas Are Bad (and We Should Feel Bad)
The vast majority of proposed production engineering uses of Machine Learning (ML) will never work. They are structurally unsuited to their intended purposes. There are many key problem domains where SREs want to apply ML but most of them do not have the right characteristics to be feasible in the way that we hope. After addressing the most common proposed uses of ML for production engineering and explaining why they won't work, several options will be considered, including approaches to evaluating proposed applications of ML for feasibility. ML cannot solve most of the problems most people want it to, but it can solve some problems. Probably.

Speakers

Todd Underwood

Senior Director of Engineering, Founder of ML SRE Google, Google
Presentation: ML in Real Life. Machine Learning has captured the attention of enterprises across the world. But most implementations of ML will face challenges as enterprises decide how much to invest in ML and how to mitigate some of the implementation and execution risks. ML is only...


Wednesday October 2, 2019 16:00 - 16:45 BST
Track 2: The Liffey A

16:45 BST

Zero-Downtime Rebalancing and Data Migration of a Mature Multi-Shard Platform
Application-level sharding is a common pattern for scaling multi-tenant architectures. However, once it has been put into production, you inevitably run into follow-up problems that aren't as widely discussed. In this talk, we will share years' worth of experience and connect the dots to outline a full sharding solution that goes beyond the initial implementation and deployment. At the core of our toolkit is the "binlog", an event stream used by the MySQL replication protocol. The tooling we've built on top of this idea is being used in production at Shopify to balance hundreds of MySQL shards for uniform load distribution and to isolate heavy tenants from each other, and has in the past been used to safely transfer the entire dataset of our more than 800,000 tenants from physical datacenters to a cloud environment. All of this happens online, without downtime, and is practically invisible to the tenants.

Speakers

Justin Li

Shopify
Justin is a production engineer at Shopify. He likes performance problems, parsers, and distributed systems, and has worked on many aspects of Shopify’s production system, notably resiliency, sharding, flash sale preparations, scriptable load balancing and routing, and optimizing...


Wednesday October 2, 2019 16:45 - 17:30 BST
Track 1: The Liffey B

16:45 BST

Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems
Operators are increasingly being asked to release and manage services whose behavior is harder to reason about than that of traditional application services. Data products, model-based machine learning services, ensemble models, and large microservices architectures are founded on deliberate complexity, such that their availability is only correctly measured via an SLA/QoS around their behavior, and is also threatened by the unknown-unknown emergent behavior arising from their interactions.

Incidents move from being about general service availability to being about behavior.

Safely operating these types of service in production presents a host of challenges that even the most experienced SRE may not expect. Severe incidents with stable infrastructure, invisible error rates, IMPROVING response times, but the business failing catastrophically, losing millions of dollars? Absolutely!

Speakers

Ramin Keene

fuzzbox.io
Ramin has helped enterprises large and small to put machine learning, a/b testing, and data science products into production. He’s made ALL the mistakes and then some, helping companies lose thousands, if not millions, of dollars along the way. He is currently based in Los Angeles...


Wednesday October 2, 2019 16:45 - 17:30 BST
Track 2: The Liffey A

17:30 BST

Social Hour
Wednesday October 2, 2019 17:30 - 18:30 BST
The Forum
 
Thursday, October 3
 

08:00 BST

Morning Coffee and Tea
Thursday October 3, 2019 08:00 - 09:00 BST
The Forum

09:00 BST

Advanced Napkin Math: Estimating System Performance from First Principles
Ever stood in front of the whiteboard with a group of your co-workers designing a system, but found yourself in that awkward position where none of you were able to answer whether something would be fast enough? In this talk, you'll learn how to combine base rates to answer challenging questions in a jiffy on pull requests, technical reviews, or in meetings. For example: Can we, in the 5-second critical window of a regional failover, do a snapshot-and-restore of Memcached to the other region? How much overhead should we expect a proxy to incur? How far is this system from optimum performance? With the methodology from this talk, you'll learn to quickly estimate expected system performance instead of building the system first!

Speakers

Simon Eskildsen

Shopify
As Director of Production Engineering at Shopify Simon works with teams to increase the performance, scalability, and resiliency of Shopify. Other than that, as a new resident of Canada, fulfilling his obligation to call everyone out when they think they've experienced "cold weat...


Thursday October 3, 2019 09:00 - 09:45 BST
Track 1: The Liffey B

09:00 BST

Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program
Structured education is important for ramping up new SREs to build confidence and fight imposter syndrome. In this talk, we take a look behind the scenes of the SRE EDU Orientation curriculum at Google from a technical standpoint and organizational point of view while highlighting best practices that can be applied at organizations of all sizes. We’ll show how we applied SRE best practices to the program itself to minimize toil for the organizers (keyword: automation!) and keep the training software reliable and up to date.

By implementing judicious monitoring, we learned that hands-on exercises are a more successful way to ramp people up than one-way lectures. We built a rigged production system where an instructor can trigger outages that the students need to triage, mitigate and resolve. As the system is internal only, students cannot cause externally visible harm, creating a safe learning environment that allows for experimentation.

Speakers

Jennifer Petoff

Director, SRE Education, Google Portugal
Jennifer Petoff is Director of Google Cloud Platform (GCP) & Technical Infrastructure (TI) Education and is based in Lisbon, Portugal. She leads training programs for Google's GCP and TI Engineering Teams. Jennifer is one of the co-editors of the best-selling book, Site Reliability...

JC Van Winkel

SRE EDU lead educator, Google Switzerland
JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the...


Thursday October 3, 2019 09:00 - 09:45 BST
Track 2: The Liffey A

09:00 BST

Effective Distributed Tracing Workshop
If you're working in a large organization which is using a micro-service architecture, you can find it hard to keep tabs on what is going on under the hood. Performing root cause analysis of incidents can be as complicated as your organization makes it. Traditional metrics and logging, although essential, are also somewhat limited in some regards. To help with some of the issues mentioned before we can turn to Distributed Tracing and the view it gives us into our services.

In this workshop, we will give an introduction to Distributed Tracing and OpenTracing, the open specification for vendor-neutral APIs for Distributed Tracing. After the introduction, there will be a hands-on opportunity to see how a distributed system is instrumented. Finally, we will break those applications and use Distributed Tracing to help us figure out what is going on.

For the hands-on part, make sure to have either Java, Golang, or Docker available to run the test application.



Speakers

Pedro Alves

Zalando SE
Pedro has been focusing on developing back end code for webapps since 2008. At Zalando since 2013, he has worked in different areas of Zalando’s business, and is now working in the SRE team, making sure people can buy shoes reliably.

Serbay Arslanhan

Zalando SE
Serbay is learning to be a Site Reliability Engineer at Zalando in Berlin. Before joining the SRE team at Zalando, he worked on building systems related to customer facing Checkout solutions at Zalando and personal health, social networks, news aggregators in different companies...

Luis Mineiro

Zalando SE
Luis's broad background in software engineering includes experience in DevOps, networks, mobile development, and more. Luis has been with Zalando since 2013, shaving yaks and creating the most beautiful bike sheds in the Shop team, later joining Platform Infrastructure to support...


Thursday October 3, 2019 09:00 - 12:30 BST
Track 3: Liffey Hall 2

09:45 BST

The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It
SLOs are a wonderfully intuitive concept: a quantitative contract that describes expected service behavior. These are often used in order to build feedback loops that prioritize reliability, communicate expected behavior when taking on a new dependency, and synchronize priorities across teams with specialized responsibilities when problems occur, among other use cases. However, SLOs are built on an implicit model of service behavior, with a raft of simplifying assumptions that don't universally hold.

These simplifying assumptions make SLO rules of thumb fall apart with complex modern services, which can result in bad decision making. In this talk, I will catalog a range of these issues with SLOs and demonstrate how they cause systematic failures of SLO-based processes. Armed with the knowledge of these failure modes, I'll present a set of best practices for understanding when SLOs produce incorrect and unexpected results and a set of techniques for constructing robust SLOs.

Speakers

Narayan Desai

SREconEU Program Chair, Google
Narayan Desai is an SRE at Google, where he focuses on the reliability of Google Cloud Platform Data Analytics products. He has a checkered past, having worked on scheduling, configuration management, supercomputers, and metagenomics—always in the context of production systems... Read More →


Thursday October 3, 2019 09:45 - 10:30 BST
Track 1: The Liffey B

09:45 BST

SRE by Influence, Not Authority: How the New York Times Prepares for Large Scale Events
How do you SRE in a large decentralized organization where development teams manage their own deployments and infrastructure? In this session, we’ll cover how we formed our team at The New York Times and our rationale behind it, talk through our challenges, and explain how we leveraged an incident to kick off Elections readiness, our largest SRE effort to date.

Attendees will understand how we organized this effort and integrated our team to partner with application teams. We will detail how we increased reliability through a combination of architecture reviews, monitoring improvements, and stress testing.

Speakers

Vinessa Wan

The New York Times
Vinessa Wan has been working in project management for the past 10 years. In her past 5 years at The New York Times, she has worked in R&D, product discovery, and now oversees the SRE and internal tooling & automation portfolio.

Brett Haranin

The New York Times
Brett Haranin has been working as a software engineer and tech lead at various companies, large and small, for the last 17 years. Currently, he works as an SRE at The New York Times and is focused on helping teams mature the security and reliability of their systems. In his spare... Read More →


Thursday October 3, 2019 09:45 - 10:30 BST
Track 2: The Liffey A

10:30 BST

Break with Refreshments
Thursday October 3, 2019 10:30 - 11:00 BST
The Forum

11:00 BST

Load Balancing Building Blocks
Load balancing is often presented as a simple solution for difficult application problems, like providing redundancy and smooth blue/green application upgrades. But not all load balancers are created equal. Is an L7 load balancer better than an L4 one? What makes DNS a load balancing technique? Does using a CDN help?

This talk answers these questions and more. It covers 3 common variants of load balancing (L4, L7 and DNS) in a product agnostic manner, important properties of each variant, and why you would consider using them. It concludes with an overview of how Facebook uses all 3 variants to manage and control traffic flows globally.

Speakers

Kyle Lexmond

Facebook
Kyle is a Production Engineer on the Traffic Applications team at Facebook Seattle, working to make sure requests from people get a 200 OK, not an error or vanishing into the ether(net). Previously at Twitter and AWS, he focuses on simplifying systems and making them more resilient... Read More →


Thursday October 3, 2019 11:00 - 11:45 BST
Track 1: The Liffey B

11:00 BST

Are We All on the Same Page? Let's Fix That
The industry has established that it is good practice to have as few alerts as possible, alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused.

Organizations with complex distributed systems that span dozens of teams can have a hard time following such practice without burning out the teams owning the client-facing services. A typical solution is to have alerts on all the layers of their distributed systems. This approach almost always leads to an excessive number of alerts and results in alert fatigue.

Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable cause, paging the respective team instead of the alert owner.
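
As a rough illustration of the idea, and not the speaker's actual implementation, one possible heuristic is to start at the span that triggered the alert and follow child spans carrying the OpenTracing `error` tag until none remain. The trace layout, the `team` field, and the function name below are all hypothetical.

```python
# Hypothetical trace: each span has an id, a parent id, an owning team, and
# OpenTracing-style tags. Only the boolean "error" tag is used here.
SPANS = [
    {"id": "a", "parent": None, "team": "checkout", "tags": {"error": True}},
    {"id": "b", "parent": "a", "team": "cart", "tags": {}},
    {"id": "c", "parent": "a", "team": "payments", "tags": {"error": True}},
    {"id": "d", "parent": "c", "team": "fraud", "tags": {"error": True}},
    {"id": "e", "parent": "c", "team": "ledger", "tags": {}},
]


def team_to_page(spans, alerting_span_id):
    """Follow error-tagged child spans downward; return the owner of the deepest one."""
    by_id = {span["id"]: span for span in spans}
    children = {}
    for span in spans:
        children.setdefault(span["parent"], []).append(span)

    current = by_id[alerting_span_id]
    while True:
        failing = [
            child
            for child in children.get(current["id"], [])
            if child["tags"].get("error")
        ]
        if not failing:
            # No failing dependency below this span: treat it as the probable cause.
            return current["team"]
        # Simplification: follow only the first failing child.
        current = failing[0]


print(team_to_page(SPANS, "a"))
```

With this sample trace, the alert fires on the checkout span but the page goes to the team owning the deepest failing span. A real handler would need more heuristics than this, for example for several failing children or missing ownership data.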

Speakers

Luis Mineiro

Zalando SE
Luis's broad background in software engineering includes experience in DevOps, networks, mobile development, and more. Luis has been with Zalando since 2013—shaving yaks and creating the most beautiful bike sheds in the Shop team, later joining Platform Infrastructure to support... Read More →


Thursday October 3, 2019 11:00 - 11:45 BST
Track 2: The Liffey A

11:45 BST

What Happens When You Type en.wikipedia.org?
What happens when you type en.wikipedia.org? It is one of the most popular interview questions, and one we have been asked quite a few times. But what about what happens on the server side? What happens on our end?

At Wikimedia, we run the world’s favourite encyclopædia and one of the top 5 websites on the Internet! In our talk, we will describe the architecture of Wikipedia, how routers, load balancers, caching, a bit more caching, message queues, databases, microservices, and containers are pieced together to serve you, and how open source plays a major role in it.

Furthermore, we will briefly talk about our transition from a monolith, to service-oriented architecture and microservices, to migrating them to Kubernetes.

Wikipedia is a very good example of a complex system; joining this talk will help you demystify one in an understandable way.

Speakers

Effie Mouzeli

Site Reliability Engineer, Wikimedia Foundation
Effie studied physics and scientific computing but decided to follow neither. Instead she became a sysadmin, later systems engineer, now SRE. She has worked in a number of startups and small organisations, where her responsibilities were usually automation, infrastructure architecture... Read More →

Alexandros Kosiaris

Senior SRE, Wikimedia Foundation
A Linux sysadmin, turned FreeBSD sysadmin, turned Linux sysadmin, turned systems engineer (somewhere along that path there’s a Devops hat as well), Alexandros has been in the space since 1999, starting as a hobbyist, then a professional. Currently working with the Wikimedia Foundation... Read More →


Thursday October 3, 2019 11:45 - 12:30 BST
Track 1: The Liffey B

11:45 BST

Weathering the Storm: How Early Warnings Save the Farm
LinkedIn’s production stack consists of thousands of different applications with complex dependencies among them. In this environment, when a production issue is caused by one or more misbehaving microservices, finding the right culprit can be both challenging and time-consuming.

At LinkedIn, we have built a framework to automate the incident correlation process by ingesting data pertaining to incidents and associated dependencies to identify the unhealthy microservice(s). This gives us the ability to directly escalate an incident to the corresponding team, thus cutting down MTTD/MTTR while improving the quality of life of the on-call engineers.

In this talk, we will give a high-level overview of the correlation engine, how we do correlations, how we reduce false positives and increase the accuracy of the correlated results, and finally the lessons we learned.

Speakers

Brian Sherwin

LinkedIn
Brian Cory Sherwin has been a Sr. SRE at LinkedIn since 2012. Brian has had many responsibilities at LinkedIn ranging from auto-remediation, business metric collection and analysis, host-level monitoring, disaster recovery, data center decommissions, and incident command. The common... Read More →


Thursday October 3, 2019 11:45 - 12:30 BST
Track 2: The Liffey A

12:30 BST

Luncheon
Thursday October 3, 2019 12:30 - 14:00 BST
The Forum

14:00 BST

Refining Systems Data without Losing Fidelity
It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. If your SLO is 99.95%, then you'll be naively collecting 2,000 times as much data about requests that satisfied your SLI as about those that burnt error budget. How do you scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your system's production behaviors?

Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services' top-level performance and inner workings. We can keep the context of the anomalous data flows and cases in our supported services while not allowing the volume of ordinary data to drown it out.
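
One common way to do this, sketched here under assumptions rather than taken from the talk, is to keep every failing event, keep only about 1 in N successful ones, and record how many original events each kept event stands for. The event shape, the `weight` field, and the function names are hypothetical.

```python
import random


def sample_events(events, success_rate=100, rng=None):
    """Keep every failed event; keep roughly 1 in `success_rate` successful ones.

    Each kept event records how many original events it represents.
    """
    rng = rng or random.Random()
    kept = []
    for event in events:
        if not event["ok"]:
            kept.append({**event, "weight": 1})
        elif rng.randrange(success_rate) == 0:
            kept.append({**event, "weight": success_rate})
    return kept


def estimated_totals(kept):
    """Estimate the original request and failure counts from the weighted sample."""
    total = sum(event["weight"] for event in kept)
    failures = sum(event["weight"] for event in kept if not event["ok"])
    return total, failures


# Synthetic traffic at 99.95% success: 1,999 good requests for each bad one.
events = [{"ok": i % 2000 != 0} for i in range(200_000)]
kept = sample_events(events, success_rate=100, rng=random.Random(42))
total, failures = estimated_totals(kept)
print(len(events), len(kept), total, failures)
```

The failure count is exact because no failing event is dropped, while the total is an estimate whose error shrinks as more events are kept. Stored volume falls to a small fraction of the original stream.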

Speakers

Liz Fong-Jones

Field CTO, Honeycomb
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 18+ years of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.


Thursday October 3, 2019 14:00 - 14:45 BST
Track 1: The Liffey B

14:00 BST

How to Do SRE When You Have No SRE
This talk is for people in engineering organisations that don't have anyone dedicated to SRE-type work, but who know enough to know that they really REALLY need it.

Where on earth can you even start? You want to make things better but you already have at least 2 other jobs to do as well. It doesn't seem possible.

Unfortunately, you're also the person that will be stuck fixing things when it all breaks so it's doubly in your best interest to try to prevent disasters if you can.

This talk gives extremely practical and realistic advice on how to get started, even if you only have 1 hour a week to dedicate to it. It'll help you find your weakest points and make things a little better, even without dedicated resources. Step by step, you'll be able to make your org much more reliable and reduce your stress levels. Also, fewer things will be on fire.

Speakers

Joan O'Callaghan

Udemy
Joan O'Callaghan is an engineering director at Udemy. She has worked in SRE and Incident Management (in one form or another), for many years. She likes to host and write blameless post-mortems and take long walks on the beach where she has imaginary arguments with people that don't... Read More →


Thursday October 3, 2019 14:00 - 14:45 BST
Track 2: The Liffey A

14:00 BST

Statistics for Engineers
Gathering all kinds of telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge is to make sense of it and extract valuable information, like:
  • Are we fulfilling our SLO/SLA?
  • How did our query response times change with the last update?
  • When will we run out of disk space if we continue to grow like this?

Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you in your daily work as a system operator. From the mathematical side, we will cover probabilistic models, summarising distributions with mean values, quantiles, and histograms and their relations. From the technological side, we will discuss metrics vs. event data, the effects of sub-sampling, how not to aggregate percentiles, t-digest, and histogram summaries.

The tutorial will be tool agnostic, but tailored towards applications. In the computational examples we will be using Python and data from our production systems. At the end of the workshop attendees should have a clear picture of the mathematical features they need from their monitoring tools, for their application at hand.
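
To illustrate the point about not aggregating percentiles, here is a small sketch with made-up latency data. It is not taken from the tutorial material. It compares averaging per-host 99th percentiles with computing the percentile over merged data or merged histograms. The nearest-rank definition and all names are assumptions chosen for simplicity.

```python
from collections import Counter


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling division on integers
    return ordered[rank - 1]


def percentile_from_histogram(hist, p):
    """Same nearest-rank percentile, computed from a value -> count histogram."""
    total = sum(hist.values())
    rank = max(1, -(-p * total // 100))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value


# Hypothetical latencies in milliseconds from two hosts of one service.
host_a = [10] * 10_000  # busy and fast
host_b = [500] * 100    # nearly idle and slow

average_of_p99s = (percentile(host_a, 99) + percentile(host_b, 99)) / 2
true_p99 = percentile(host_a + host_b, 99)
merged_p99 = percentile_from_histogram(Counter(host_a) + Counter(host_b), 99)
print(average_of_p99s, true_p99, merged_p99)
```

Averaging the two per-host values gives 255 ms, while the 99th percentile of the combined traffic is 10 ms, because the slow host serves under 1% of requests. Merging the histograms first gives the correct answer.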

Speakers

Thursday October 3, 2019 14:00 - 17:30 BST
Track 3: Liffey Hall 2

14:00 BST

Managing Microservices with Istio Service Mesh
Managing disparate microservices at scale is a real challenge for Ops and SRE teams. The workshop will explain and demonstrate the service mesh patterns implemented by Istio, which uses the same declarative approach as Kubernetes to handle microservice concerns without changes to your services.

Requirements
  1. Participants should be able to install gcloud CLI on their laptops.
  2. Participants should have a Google Cloud Platform account. If you don’t yet have one, you could create one using Google Cloud Free Tier.
Workshop: https://www.istioworkshop.io

Speakers

Rafik Harabi

Solutions Architect, Innovsquare
Rafik Harabi is a Solution Architect devoted to helping customers in their digital transformation journey. He currently spends most of his time architecting and deploying Cloud Native Services Platforms using Kubernetes and recently Istio Service Mesh. Before working on cloud migration... Read More →


Thursday October 3, 2019 14:00 - 17:30 BST
Liffey Hall 1

14:45 BST

Tracing Real-Time Distributed Systems
The concept of distributed tracing has often been explored in the context of web-based microservices in predominantly request/response style systems. But, what if you're dealing with a real-time data streaming system? How do you even start to model strongly asynchronous message flows, consisting of multi-service pipelines originating from many sources and distributed to even more consumers? These are the general characteristics of trading systems, which make tracing incredibly challenging.

This talk will explore our approach to applying these concepts to latency-sensitive real-time data streaming in large scale distributed systems. We will discuss the challenges of tracking long-running sessions, handling fan-in/fan-out data flows, and reducing storage costs while still capturing granular in-process tracing data. We will demonstrate how we utilise tracing to diagnose issues and measure service level indicators, as well as share our thoughts on how to further improve observability by applying these concepts on the client-side.

Speakers

Evgeny Yakimov

Bloomberg LP
Evgeny is a software engineer turned SRE working at Bloomberg London with a focus on real-time distributed systems. He is a keen technology enthusiast, exploring how to apply SRE concepts such as tracing to the area of trading systems. He advocates for an SRE culture shift at Bloomberg... Read More →


Thursday October 3, 2019 14:45 - 15:30 BST
Track 1: The Liffey B

14:45 BST

One on One SRE
As someone who has carried the title of Site Reliability Engineer at many companies, I have struggled with how to influence an organization to make the changes necessary to ensure high availability, sustainably. Fixing things directly doesn't scale with the organization, so a broader approach is needed. Leadership usually desires the outcome of better availability and sometimes even says so publicly, but then what? In this talk, I will discuss the one-on-one approach I created for SRE outreach both proactively and in incident debriefs. I will demonstrate how this approach enables vulnerability and better information gathering through application of psychological safety principles.

Speakers

Amy Tobey

Sr SRE, Equinix
Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she is senior principal engineer leading Applied Resilience Engineering at Equinix. When she's not working, she can be found with... Read More →


Thursday October 3, 2019 14:45 - 15:30 BST
Track 2: The Liffey A

15:30 BST

Break with Refreshments
Thursday October 3, 2019 15:30 - 16:00 BST
The Forum

16:00 BST

A Customer Service Approach to SRE
SREs are highly technical people, and have a bias toward technical solutions to technical problems. They enjoy well-crafted APIs that they can build solid SLAs around, and that allow teams to work out where a problem lies.

This strength can hide an antipattern: being able to tell ourselves that our system is OK, it's everyone else who has the problem. This talk will take some case studies from Facebook's "Server Lifecycle" team to show how engineers can pretend that the systems they have built are perfect, and that it's actually the rest of the world that is to blame for misusing them.

I will talk about how the team used a customer service ethos to redesign their metrics, their service and the support methods, to build something that really served its customers.

Speakers

John Looney

Production Engineering Manager, Reddit
John Looney has been working in multinationals that handle large amounts of private & personal data for two decades, and has been thinking about the real-world implications of applying the spirit of EU human rights legislation on global datacenters. He's been involved with running... Read More →


Thursday October 3, 2019 16:00 - 16:45 BST
Track 1: The Liffey B

16:00 BST

Prioritizing Trust While Creating Applications
Managing risk needs to scale as your product grows in popularity and complexity. In traditional software development, security was often treated as a final gating factor at best and a post-incident concern at worst. How do we shift our security processes left—in other words, earlier in the development lifecycle? The cost of applying security practices too late can be catastrophic to a company, leading to the loss of customer trust and affecting the bottom line.

Join me in this session to learn how to leverage security tools and recommended practices to enable everyone to play a part in securing your application, from discovery through to operation.

Speakers

Jennifer Davis

Dev Rel Manager, Google


Thursday October 3, 2019 16:00 - 16:45 BST
Track 2: The Liffey A

16:45 BST

SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
SRE and product management—do those even go together? Yes! In this talk, we'll go over small ways and big strategies to form sustainable, impactful relationships with your users and build products that they love whether or not your SRE team has an official product manager. SRE teams' users are other engineers, data scientists, designers, and anyone else who pushes code at your company. It's not enough to build perfectly engineered platforms and tooling. SRE teams must build scalable, opinionated, USABLE products and workflows. This talk will give you the framework to get there and show you what traits translate to good product managers.

Speakers

Jen Wohlner

Product Manager, Platform Engineering, Livepeer
Jen Wohlner leads product management at Livepeer, a decentralized video transcoding and live-streaming platform built on the Ethereum blockchain. Before Livepeer, Jen was the product manager for platform engineering at Fastly, an edge cloud platform that provides a content delivery... Read More →


Thursday October 3, 2019 16:45 - 17:30 BST
Track 1: The Liffey B

16:45 BST

Software Patching Needn't Be a Can of Worms
Let's face it—manual patching is no fun, all toil. The solution? Better automation, better patching.

"There's no record of what third-party software/versions we use. I don't know what updates are available, and of those, which are the most important. It's hard to get downtime on production systems. There's no test environment for this. I'm scared the upgrade will break stuff, and when it does, rolling back will be even harder."

If these complaints ring true for your organisation, this talk chases each one away with examples of applying automation for an easier life.

Speakers

Philip Rowlands

Jane Street
Philip Rowlands has been an SRE since before he really understood what it was. Because he doesn't scale, he relies on software for leverage. He has worked over the years on automated telephony, Google Production SRE, Mainframe Linux, and more recently for various financial firms... Read More →


Thursday October 3, 2019 16:45 - 17:30 BST
Track 2: The Liffey A

17:30 BST

Lightning Talks
  • How to Achieve "100%" Availability
    Igor Ebner de Carvalho, Microsoft
  • Understanding Vicious Cycles with Causal Loop Diagrams
    Laura Nolan, Slack
  • Flamegraphs—A Meeting Point between SRE and Developers
    Amir Langer and Doron Sekler, eBay
  • Smart and Effective Way to Reduce Distributed Tracing Overhead
    Susobhit Panigrahi, VMware
  • No, Your Hardware is Still Mostly Software: Handling FW in Your Fleet
    Yannick Brosseau, Facebook
  • TLS Certificate Issuance Controls
    James Renken, Let’s Encrypt
  • Copilot: Stateless Service Mesh Routing for Performance and Resiliency
    Brennen Smith, Ookla (Speedtest.net)
  • How Shopify Launched the Welcome Back Returnship
    Jane Maguire
  • Managing On-Call Atrophy
    James Wynne, Pivotal

Thursday October 3, 2019 17:30 - 18:30 BST
The Liffey B

18:30 BST

Conference Reception
Thursday October 3, 2019 18:30 - 20:00 BST
Level 3 Foyer
 
Friday, October 4
 

08:00 BST

Morning Coffee and Tea
Friday October 4, 2019 08:00 - 09:00 BST
The Forum

09:00 BST

Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way
In this talk, we'll examine the development of a global multi-tenant "Bigtable Service" based on Bigtable, a highly scalable wide column store originally developed for single user, single cluster instances. Because SREs value deduplication of effort, this type of service development work is often undertaken by SREs, but building a service is far more complicated than just wrapping "deploy" in a for loop. We'll discuss the challenges of correctly defining your "product", the revelation that the service layer wrapped around the core is a complex distributed system itself, some common traps that SREs fall into when designing services, and the challenges of migrating users to a central service. Finally, we will describe how the relationship between the core product development team and the SRE team has evolved and highlight best practices and anti-patterns for the developer-SRE relationship that we've learned on our journey.

Speakers

Brendan Gleason

Google
Brendan is a Site Reliability Engineer in Google's New York City office. He primarily worked on Bigtable during his first six years at Google, first on the Bigtable Service and eventually on Cloud Bigtable. More recently Brendan has worked on Google Cloud Platform's developer and... Read More →

Gaurav Prabhu Gaonkar

Google
Gaurav Prabhu Gaonkar leads a team of site reliability engineers responsible for running Bigtable as a service at Google. He is passionate about problem solving and has held multiple roles across various companies. Gaurav has deep experience in designing, building and scaling distributed... Read More →


Friday October 4, 2019 09:00 - 09:45 BST
Track 1: The Liffey B

09:00 BST

Building Resilience: How to Learn More from Incidents
Learning from incidents: it's not as easy as it sounds! Research from numerous safety-critical industries (aviation! healthcare! firefighting!) is changing what we know about how to build resilient systems and organizations in a turbulent world. This talk is going to share some of that research with you in a direct and practically-applicable way.

One major obstacle to building resilience in an engineering organization is the traditional approach to post-incident review, which focuses heavily on incident prevention. Come and learn:

  • that there is and always will be more to incident response and review than prevention,
  • how to recognize and avoid four common traps during incident investigations, and
  • when to apply four concrete recommendations on how to learn more from incidents in your organization.

Speakers

Nick Stenning

Microsoft
Nick Stenning is a Site Reliability Engineer on Azure, poking and prodding at the internals of "somebody else's computers." He previously worked at the UK's Government Digital Service and at open-source startup Travis CI. He's been talking his colleagues' ears off on the topic of... Read More →


Friday October 4, 2019 09:00 - 09:45 BST
Track 2: The Liffey A

09:00 BST

What I Wish I Knew before Going On-Call
Firefighting a broken system is time-sensitive and stressful, but becomes even more challenging as teams and systems evolve. For on-call engineers, scaling processes among humans is an important problem to solve. How do we ramp up new engineers effectively? How can we bring existing on-call engineers up to speed? In this workshop we’ll share common myths among new on-call engineers and the Do’s and Don’ts of on-call onboarding, as well as run through hands-on activities you can take back to work and apply directly to your own on-call processes.

Speakers

Chie Shu

Yelp
Chie Shu is a backend Software Engineer at Yelp. She has worked on improving Yelp's revenue-critical Ads data pipeline to be more resilient to system failures, and designed heuristics used internally by executives and Product Managers to assess the financial impact of on-call incidents... Read More →


Friday October 4, 2019 09:00 - 10:30 BST
Track 3: Liffey Hall 2

09:45 BST

SDKs Are Not Services and What This Means for SREs
Building an SDK or embedded libraries for client teams to integrate into their software seems like a straightforward approach to onboarding a large number of teams onto your infrastructure. Depending on the domain, it can even seem like an obvious and durable decision. Unfortunately, this path is mired in hidden complexity that can have serious productivity consequences for client teams, increasing support-related toil for the SRE team significantly and imperiling the viability of both the team and the service they provide.

Speakers

Justin Coffey

Engineer, Criteo
Justin Coffey is an engineering director in the SRE department of Criteo where he has led efforts in building out much of Criteo's data processing platform. In past lives he has built ecommerce, emailing and real estate platforms. He got his start in the industry way back in 1996... Read More →


Friday October 4, 2019 09:45 - 10:30 BST
Track 1: The Liffey B

09:45 BST

How Stripe Invests in Technical Infrastructure
Deciding what to work on is always difficult and is especially treacherous for folks working as infrastructure engineers and leaders. Will Larson unpacks the process of picking and prioritizing technical infrastructure work, which is essential to long-term company success but discussed infrequently. Will shares Stripe's approaches to prioritizing infrastructure as your company scales, justifying—and maybe even expanding—your company's spend on technical infrastructure, exploring the whole range of possible areas to invest into infrastructure, adapting your approach between periods of firefighting and periods of innovation, and balancing investment in supporting existing products and enabling new product development.

Speakers

Will Larson

Stripe
Will Larson has been an engineering leader and software engineer at a number of technology companies including Digg, Uber, and Stripe. He is also the author of An Elegant Puzzle: Systems of Engineering Management.


Friday October 4, 2019 09:45 - 10:30 BST
Track 2: The Liffey A

10:30 BST

Break with Refreshments
Friday October 4, 2019 10:30 - 11:00 BST
The Forum

11:00 BST

Why Automating Everything Adds to Your Toil
An often-heard phrase whenever there is toil is "just automate it." You would expect that from an SRE who has a software engineering background. It is what distinguishes Site Reliability Engineering from operations. But the wrong automation and too much automation can replace the existing toil with different toil, or in the worst case grow the existing mound of toil.

Find out when you should or shouldn't add automation, and how to build the right, sustainable automation.

Speakers

Colin Thorne

SRE Lead, IBM
Colin is the worldwide SRE lead for IBM's Kubernetes Service with a career-long dedication to clean code and architecture. He loves learning and applying new practices and technologies, and then sharing them with anyone who happens to be near. Colin runs new graduate education, cloud... Read More →

Cameron McAllister

IKS SRE, IBM
The hardest thing for Cameron McAllister when automating something is coming up with a good name for it! With that problem solved, he has been responsible for many automated systems, most of them with Slack integrations, providing a self-service model to enhance development and SRE... Read More →


Friday October 4, 2019 11:00 - 11:45 BST
Track 1: The Liffey B

11:00 BST

Pushing through Friction
Things are broken. The deployment pipeline is painfully slow. Your engineering team has doubled in the last year and there's a lack of sufficient process and management. You git blame a file that's used everywhere but nobody understands it; the person who wrote it left the company five years ago.

As a senior-level engineering leader, experience tells you things could be better. You see the gaps. If only the company adopted policy A or dumped technology B, everyone would benefit. But there's so much inertia. The company has always used B. You are frustrated. Can you actually make a difference?

Yes. You are encountering organizational friction, and learning to identify, accept and push through friction is a key skill of engineering leaders. In this talk, Dan will talk about why organizational friction occurs and how to mitigate it. The ability to push through friction will distinguish you throughout your career.

Speakers

Dan Na

Staff Engineer and Team Lead, Squarespace
Dan Na is a Staff Engineer and Team Lead on the Internationalization Platform team at Squarespace in NYC. Previously he was an Engineering Manager and Senior Software Engineer at Etsy. He loves learning and teaching in a collaborative environment and solving both the technical and... Read More →


Friday October 4, 2019 11:00 - 11:45 BST
Track 2: The Liffey A

11:00 BST

Unconference: Unsolved Problems in SRE
There are a variety of unsolved problems in and around site reliability engineering. Some of these have been addressed during the conference, some deserve additional discussion, and some may have become apparent during the hallway tracks here in Dublin. If you would like to discuss any of these, this is your session. You can suggest topics on the USENIX-SREcon Slack #unsolved_problems channel.

Speakers

Kurt Andersen

Kurt Andersen worked as the head of strategy for Blameless.com. Prior to that he was one of the leads for the Product-SRE organization at LinkedIn. Across the full spectrum of IT-influence, he is strongly committed to developing the best engineers and teams, and enabling them with... Read More →


Friday October 4, 2019 11:00 - 12:30 BST
Track 3: Liffey Hall 2

11:45 BST

Autopsy of a MySQL Automation Disaster
You deployed automation, enabled automatic database master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expect. This talk will be about such a surprise.

Once upon a time, a failure brought down a MySQL master database. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in tooling and scripting led to a split-brain and data corruption. This talk will go into details about the convoluted—but still real-world—sequence of events that led to this disaster. I cover what could have avoided the split-brain and what could have made data reconciliation easier.

Speakers

Jean-François Gagné

Infrastructure Engineer / System and MySQL Expert, MessageBird
Jean-François is a System/Infrastructure Engineer and MySQL Expert. One year ago, he joined MessageBird, an IT telco startup in Amsterdam, with the mission of scaling the MySQL infrastructure. Before that, J-F worked on growing the Booking.com MySQL and MariaDB installations (he... Read More →


Friday October 4, 2019 11:45 - 12:30 BST
Track 1: The Liffey B

11:45 BST

Perks and Pitfalls of Building a Remote First Team
Building teams is hard. Building remote teams is harder but definitely worth it. You get access to a global talent pool. You get engineering coverage that follows the sun. And you get to build a much more diverse and inclusive team. A remote team isn't without its perils, though: it is easy to build silos, burn out engineers, inflame imposter syndrome, and starve community building.

Speakers
Ryan Neal

Netlify
Ryan Neal is Head of Infrastructure and part of the founding team at Netlify. Previously, he worked on the infrastructure team at Yelp and worked in the defense sector at Palantir in the Middle East. Ryan is based in the Bay Area and loves distributed systems, firespinning, and his golden... Read More →


Friday October 4, 2019 11:45 - 12:30 BST
Track 2: The Liffey A

12:30 BST

Luncheon
Friday October 4, 2019 12:30 - 14:00 BST
The Forum

14:00 BST

Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures
A core concept in SRE is that we learn from major system failures, using the experience gained to improve the resiliency of our systems. If we are successful at this, we avoid repeating the same customer impact the next time our systems fail in a similar way. This means when the next big failure happens, it will often be a novel problem. This talk will focus on how to prepare for novel large-scale failures. I will start by summarizing common methods of incident training. These include simulated disaster scenarios and live system exercises involving controlled but real production system failures. I will outline the benefits of each approach, and our experience in employing them at Shopify as our team has grown. This talk will wrap up with a summary of a large-scale incident exercise we ran involving a hundred people, an office building, and 20,000 pieces of LEGO.

Speakers
John Arthorne

Senior Production Engineer, Shopify
John leads a developer team within the Shopify Production Engineering group, with a focus on building tools to improve the quality of production systems, and on engineering incident response. John is a frequent speaker at technical conferences, including most recently SRECon, DevOps... Read More →


Friday October 4, 2019 14:00 - 14:30 BST
Track 1: The Liffey B

14:00 BST

Evolution of Observability Tools at Pinterest
This talk will cover how observability tools at Pinterest evolved over time to fulfill the changing requirements as we grew from a small startup to web scale. These tools include a metrics system, log search, and distributed tracing.

Speakers
Naoman Abbas

Pinterest
Naoman Abbas is the engineering manager for Pinterest's Observability team, which is responsible for building and operating observability tools like the company's metrics system, logsearch, and distributed tracing. Previously, Naoman was a software engineer building cloud platform... Read More →


Friday October 4, 2019 14:00 - 14:45 BST
Track 2: The Liffey A

14:30 BST

Hiring Great SREs
Hiring is hard. Hiring in tech is often harder because we tend to focus on concrete, measurable skills and often ignore or devalue soft skills since they're not as easy to evaluate.

Geared toward both ICs and managers, this talk presents some directed ways of thinking about hiring, conducting interviews, and performing evaluations, with concrete examples that can be used in practice to improve your hiring pipeline.

Speakers
Brian Rutkin

Staff SRE, Twitter, Inc.
Brian is an SRE at Twitter where he works on Core Services and all the things they touch (so pretty much everything). Often that means just trying to ensure all the different services and people get along together.


Friday October 4, 2019 14:30 - 15:00 BST
Track 1: The Liffey B

14:45 BST

How to SRE When Everything's Already on Fire
We've all read the SRE books and heard stories of a magical land of Engineering organizations with functioning SRE; one where following SRE best practices will lead to a better reality for both you and your users. But how do we get there? And, what does that road look like?

This talk presents a case study on how our team, stuck in a deep reliability hole maintaining our company's centralized logging platform, adopted many SRE best practices to resolve a several-months-long incident. It's the story of how we took the highest-trafficked system in our infrastructure from being reliable ~85% of the time to a trusted and documented 99.9%.

Speakers
Alex Hidalgo

Site Reliability Engineer, Squarespace
Alex Hidalgo has been a Site Reliability Engineer since 2011. During that time he has developed a deep love for sustainable operations, metrics and monitoring, and using error budgets to drive almost every decision. Alex's previous jobs have included IT support, network security... Read More →

Alex Lee

Squarespace
Alex Lee is an SRE at Squarespace, where he's spent the past 5 years working on systems and processes that enable more reliable engineering. He currently leads the Observability Team, building and maintaining the tools that monitor Squarespace. Based out of New York City, Alex is... Read More →


Friday October 4, 2019 14:45 - 15:30 BST
Track 2: The Liffey A

15:00 BST

SRE in the Third Age
In the first age, SRE was proprietary to Google. As a term, it was so puzzling that Google recruiters tried for a while to avoid it in job descriptions, because nobody would apply for such a mystery job.

In the second age, SRE became a well-known discipline in the tech community, including books and conferences (like this one). Organizations that were distinctly different from Google, not only in terms of scale but also culturally, adopted SRE for their own circumstances and needs.

These days, it appears we are approaching the late stage of the second age. The signs are that recruiters now use the term SRE in job descriptions to attract applicants and that we can pride ourselves on our desirability in the job market.

The time is ripe to think about the third age—it might very well mean the end of SRE as we know it!

Speakers
Björn Rabenstein

Engineer, Grafana Labs
Björn is an engineer at Grafana and a Prometheus developer. Previously, he was a Production Engineer at SoundCloud, a Site Reliability Engineer at Google, and a number cruncher for science.


Friday October 4, 2019 15:00 - 15:30 BST
Track 1: The Liffey B

15:30 BST

Break with Refreshments
Friday October 4, 2019 15:30 - 16:00 BST
Level 1 Foyer

16:00 BST

Fault Tree Analysis Applied to Apache Kafka
This talk should provide a framework for answering the following common questions a Kafka operator or user might have: What should your replication factor be for your Kafka topics? How many partitions should you have? How many consumers should you provision? What should your ISR setting be? Should you use RAID or not?

Speakers
Andrey Falko

Lyft
Andrey Falko is one of the first Reliability Software Engineers hired at Lyft, where he has been for more than a year. He is currently focused on building and scaling reliable PubSub systems for Lyft's Data Platform. Prior to Lyft, Andrey worked at Salesforce for nine years where... Read More →


Friday October 4, 2019 16:00 - 16:45 BST
The Liffey B

16:45 BST

Applicable and Achievable Formal Verification
Formal verification is often considered an overly rigorous and potentially unnecessary technique to be deployed on everyday systems. There are numerous misconceptions about the capability and automation of formal verification techniques, and when and how they can be deployed. This talk will thus provide an introductory overview of the verification tools and techniques deployed in industry, specifically the safety-critical industry, at different rigour levels, and how these techniques can be adapted to your existing system infrastructure.

Speakers
Heidy Khlaaf

Adelard LLP
Heidy Khlaaf is a Research Consultant at Adelard LLP where she evaluates, specifies, and verifies the implementations of safety-critical systems. She received her Ph.D. from University College London where she developed novel research methodologies, in part with Microsoft Research... Read More →


Friday October 4, 2019 16:45 - 17:30 BST
The Liffey B

17:30 BST

Closing Remarks
Speakers
Emil Stolarsky

Site Reliability Engineer, Wave Mobile Money
Emil is an SRE at Wave Mobile Money, helping make Africa the first cashless continent. Previously, he worked on caching, performance, and disaster recovery at Shopify, the internal Kubernetes platform at DigitalOcean, and everything in between at Cheddar. In addition to speaking at... Read More →

Murali Suriar

Snowflake
Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Working on traffic management at Snowflake after 12 years at Google. Currently learning what "the cloud is just someone else's computer" means.


Friday October 4, 2019 17:30 - 17:35 BST
The Liffey B
 