SLOconf 2022
The world wants to share and learn about SLOs and who are we to stop them?
Learn about the success of SLOconf 2022, as we’re bringing back the virtual conference to our community in 2023!
Speakers
Ajuna Kyaruzi
Developer Relations
Datadog
Ensuring Reliability using SLO Burn Rates
Service Level Objectives (SLOs) are a measurement of the reliability and general experience your end users and customers can expect. SLO burn rate is a value that indicates how fast your error budget is consumed relative to your SLO’s target length. In this talk, we’ll cover how to calculate burn rates and how to use them alongside error budgets to get a solid, actionable metric for balancing innovation and velocity with reliability and safety over a specific period of time. You will also learn how to use burn rates as a measure of the risk of missing SLO targets.
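As a rough illustration of the burn rate idea described above (not taken from the talk itself), a burn rate can be computed as the observed error rate divided by the error rate the SLO allows. The function names and the numbers below are hypothetical, and the sketch assumes a constant error rate over the window.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_rate / allowed


def hours_until_budget_exhausted(rate: float, window_hours: float) -> float:
    """Hours a full error budget would last at a constant burn rate."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


# Hypothetical numbers: a 99.9% SLO over 30 days, with 0.5% of requests failing.
rate = burn_rate(error_rate=0.005, slo_target=0.999)
hours = hours_until_budget_exhausted(rate, window_hours=30 * 24)
print(f"burn rate: {rate:.2f}")
print(f"budget lasts about {hours:.0f} hours at this rate")
```

A burn rate of 1 would spend the budget exactly at the end of the window; anything above 1 spends it early.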
Aleksander Tarraro
Software Engineer
Nobl9
What is an Error Budget For?
Alex Hidalgo
Principal Reliability Advocate
Nobl9
SLOs for Everyone
Service Level Objectives are often thought of as an engineering task meant to help engineers better understand their services. But the concepts behind such an approach can be useful for so many other things! Let’s talk about how you can use SLOs to better think about money, your human factors, high-level business decisions, and even uses outside the world of tech!
Alex Rasmussen
Principal Cloud Economist
The Duckbill Group
The Real World Math and Implications of S3's 99.999999999% Advertised Durability
Amazon S3 advertises 11 9’s of durability for both Standard and One Zone-IA object storage classes. This is a mind-bogglingly high level of durability. The way AWS puts it, if you’ve stored 10 million objects, you can expect to lose, on average, one object every 10,000 years - even within a single availability zone!
It’s important to note, though, that this is the designed durability, not the guaranteed durability. There are all kinds of things that can go wrong that could impact durability on AWS’s side, and it’s even more likely that problems with your application can impact the durability of your data in S3, even if AWS does everything right.
In this talk, we’ll talk about some of those things that might go wrong, and explore what S3’s advertised durability actually means to you if you’re someone who builds systems that use it.
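The 10,000-year figure quoted above can be checked with a little arithmetic. This sketch assumes the 11 9’s figure is an annual, per-object durability and that objects are lost independently; both are simplifying assumptions for illustration, not guarantees from AWS, and the function names are made up for this example.

```python
from fractions import Fraction

# 99.999999999% designed durability, read as a per-object, per-year figure
# (an assumption made for this worked example).
DESIGNED_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)
ANNUAL_LOSS_PROBABILITY = 1 - DESIGNED_DURABILITY


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Expected number of objects lost in one year."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


stored = 10_000_000
print("expected losses per year:", expected_objects_lost_per_year(stored))
print("years per expected loss:", years_per_expected_loss(stored))
```

Exact fractions are used here because subtracting 0.99999999999 from 1 in floating point introduces visible rounding error at this scale.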
Andreas Grabner
DevOps Activist at Dynatrace & DevRel for CNCF Keptn
Dynatrace
Tips for Running Successful SLO Workshops in Under One Hour
I have been advising organizations over the past year on how to start with SLOs. I struggled the same as many do but eventually found an approach that has now worked repeatedly over the past months. Based on one specific workshop I ran, my audience will learn about Business Level Objectives such as “Mobile App Adoption”, “App Rating” and how to break those into more technical Service Level Objectives such as “Crash Rates”, “Availability”, “Performance”, “Error Rates”. I will explain how to define but also capture and report on those SLOs.
Andrew Newdigate
Distinguished Engineer
GitLab Inc.
After living in London in the UK for 17 years, he recently relocated back to his hometown of Cape Town in South Africa.
Everyone Can Contribute to Our SLO
At GitLab, we’ve built an extensive framework for defining service level indicators (SLIs) for our different services. This allows us to take a simple definition and turn it into dashboards and alerts. There are different owners involved: Infrastructure and stage groups. The SLIs we use to monitor GitLab.com are attributed to the groups building the features we run. Everyone is held to the same 99.95% SLO, and everyone can contribute to our observability.
Join this talk to learn about the challenges with SLOs and error budgets. Hear how we are aggregating our infrastructure SLIs by feature group, and how we are involving groups in improving our SLI definitions.
Andrew Snyder
Senior DevOps Consultant
Contino
Using a Service Canvas to Define SLOs
Through our visual Service Canvas methodology, we have enabled IT teams to collaboratively discuss and readily identify their catalog of services from their internal customers’ perspectives, and to identify Service Level Objectives (SLOs) and corresponding Service Level Indicators (SLIs) for all of their offerings. The Service Canvas method uses a top-down whiteboard of swim lanes, which may be completed remotely using freely available SaaS tools such as Miro, or in person using sticky notes. Each swim-lane row holds the descriptive components required to establish the services that need objectives and indicators, along with the corresponding qualities that are necessary for identifying all meaningful metrics. With the following top-down visual approach, subsequent answers arise organically once the team completes its first service mapping, and we have had much success enabling teams to quickly identify their Service Catalog along with corresponding Service Level Objectives and Service Level Indicators (image of Service Canvas available upon request – it’s much prettier in person!):
- Service: What is the service that your internal customer wants to use?
- Components: Which components within the service do your internal users interact with?
- Customers: Who uses your services?
- Target Category: Which qualities of the service are important to your internal customers?
- Customer Needs: How does your internal customer describe a level of service that meets their business requirements?
- Service Level Indicator: How can we measure that we are meeting our customer’s service quality needs?
- Service Level Objective: What are the goals we will target in order to maintain our internal customer’s satisfaction with our services?
By iterating through each of the team’s services and coaching them to get feedback and buy-in from their internal customers on the SLO and SLI definitions for each service, we have had so much success with this method that it has become a staple of our SRE transformation engagements.
This talk will provide all of the necessary tools and demonstrate the facilitation of this Service Canvas method, and attendees will gain this powerful new way of helping IT teams catalog their services and assign meaningful objectives to them, in coordination with the consumers of these services – their internal customer pool.
Ashutosh Agrawal
Platform architect
Disney+Hotstar
In the past, Ashutosh has built one of the highest-scale systems operating at Hotstar, SSAI, which does live ad insertion in a live video stream for more than 15M concurrent users. Building this system has helped him understand a lot of fundamentals about how tech works.
Using SLOs at Scale: How Disney+ Hotstar Streams One of the Biggest Sports Tournaments in the World!
The last few years of successfully streaming the IPL at this scale have forced us at Disney+Hotstar to go back to the drawing board and think about site reliability engineering from a first-principles point of view.
Outside-in monitoring of services was our goal as we headed into this uphill task, with the following acceptance criteria for our systems:
- Build correlation across different services across different layers.
- Drill down and narrow down to a layer/component causing the trouble.
- Detect degradations post changes/deployments on the platform.
- Automatically detect thresholds and set alerts on the data.
- Detect anomalous patterns across the platform.
- Monitor all resources created in production, and auto-discover correlations.
It was clear to us as an organisation that to be able to achieve this wish list, we have to have a way to measure and monitor the current performance of our services - this is where the concept of service level objectives entered the picture!
Today, Disney+ Hotstar extensively uses SLOs to measure performance of our systems. This talk will deep dive into how we went about thinking about SLOs for this scale, how we measured the base performance of our systems to actually start setting SLOs, why we chose the compliance targets and periods we chose and much more. It has been an exciting last few years at Disney+ Hotstar and I would love for everyone to know how SLOs help engineering scale!
Austin Krauza
Site Reliability Engineer
JPMorgan Chase
What Are SLIs and Why Should I Care?
Every day we interact with various systems, both online and offline – from our visit to the bagel store to checking our email on mobile. We commonly talk about Service Level Objectives (SLOs) within the Reliability Engineering community. Still, we rarely talk about the underpinning indicators that tell us whether a system is healthy. Typically, engineers begin picking the low-hanging fruit objectives (such as request time and volume) rather than doing a complete analysis of their dependencies and the metrics available to them from their systems or application. By leveraging the VALET (Volume, Availability, Latency, Errors, and Tickets) framework, teams can begin exploring their metrics and better define their “health” indicators, leading to more sustainable Service Level Objectives.
Austin Parker
Head of Developer Relations
Lightstep
Burn Your Dashboards: The Case For SLO-First Monitoring
SLO adoption in existing organizations is often seen as a secondary method of communicating reliability: a way to translate internal performance metrics into more “human-digestible” forms. What if we flipped this supposition around, though? Is our reliance on the traditional monitoring practice of staring at a bunch of indicators on a dashboard actually holding us back from realizing the benefits of SLOs? In this talk, I’ll present an alternate theory of how SLOs should be used as your primary measurement of service health for on-call, and how you should incorporate them into your runbooks and release processes.
Bob Van Landuyt
Senior Developer Scalability
GitLab Inc.
Everyone Can Contribute to Our SLO
At GitLab, we’ve built an extensive framework for defining service level indicators (SLIs) for our different services. This allows us to take a simple definition and turn it into dashboards and alerts. There are different owners involved: Infrastructure and stage groups. The SLIs we use to monitor GitLab.com are attributed to the groups building the features we run. Everyone is held to the same 99.95% SLO, and everyone can contribute to our observability.
Join this talk to learn about the challenges with SLOs and error budgets. Hear how we are aggregating our infrastructure SLIs by feature group, and how we are involving groups in improving our SLI definitions.
Christian Beedgen
CTO & Co-Founder
Sumo Logic
Dynamic Environments Need SLOs
The goal is to refocus attention from individual KPIs to broader indicators to help observers manage very large, complex systems and ensure that their SLOs can be met. This service-level orientation allows teams to address unreliable applications by using SLOs to objectively calculate adherence to the SLA. Managing SLOs is key to ensuring that modern app stacks perform reliably for end users. However, many organizations either avoid defining SLOs altogether or use manual processes and tools to track and manage them. Without a structured approach to SLO management, observability is a pointless investment; simply put, you get what you measure. This talk will discuss how to create and monitor SLOs, the dependencies that drive composite SLOs, and how to use the service map to analyze, set, and predict dependencies as a key differentiator.
Colin Curtin
Engineering Leader
Square
Humans First: Using Error Budgets to Keep Your Team Happy and Healthy
Humans are minimal creatures. They need food, water, sleep, sunlight, joy, etc. What a terrible choice for a system that requires consistent performance, right? Well, except for how adaptable humans are! They can do many different things, like writing software in collaboration with others, dreaming up wild plans, and solving constraints in hundreds of dimensions.
In this talk, we’ll pound more nails in the coffin of industrialization by centering human needs at work. We will measure and correct simple indicators of success: PTO, interruptions, total working time, and work outside of work hours. That’s it? Yes. If you nail these SLOs, I guarantee your humans will be happy, healthy, and referring colleagues to your team.
Colin Douch
Systems Reliability Engineer
Cloudflare
You Have an SLO, Whether You Know It or Not
Cloudflare’s Observability Team learned the lesson of implicit SLOs the hard way. When you aren’t actively setting expectations with your customers, they pick their own SLOs, which are often at odds with what you were expecting to offer in the first place. In this talk, Colin will outline the steps that his Observability Team took to get a handle on the expectations of its customers, properly establish SLOs, and use those SLOs to advance discussions about the reliability of their systems.
In the process, Colin will talk about how damaging these implicit SLOs can be to the perceived reliability of your systems, how to have those conversations with your customers, and what operators can do to avoid them in the first place.
Dotan Horovits
Developer Advocate
Logz.io
OpenTelemetry: the Open Source Vision for Unified Observability
Everyone wants observability into their system, but many find themselves with too many vendors and tools, each with its own API, SDK, agent and collectors.
In this talk I will present OpenTelemetry, an ambitious open source project with the promise of a unified framework for collecting observability data. With OpenTelemetry you could instrument your application in a vendor-agnostic way, and then analyze the telemetry data in your backend tool of choice, whether Prometheus, Jaeger, Zipkin, or others.
I will cover the current state of the various projects of OpenTelemetry (across programming languages, exporters, receivers, protocols), some of which are not even GA yet, and provide useful guidance on how to get started with it.
Emily Gorcenski
Head of Data and AI
Thoughtworks
A Better SLO for Data-Intensive Systems
Availability is one of the core SLOs that comes to mind when designing any new system. However, in the analytics and business intelligence space, stakeholders still demand an impossible “24/7 uptime” SLO while simultaneously struggling to find ways to improve reliability in the system. Every data engineer is familiar with broken pipelines, long-running jobs crashing after 11 hours of runtime, and angry analytics stakeholders wondering where the data is or why it isn’t up-to-date. Let’s solve that problem by designing SLOs that make sense for the unique characteristics of analytics and business intelligence use cases. This talk proposes such an SLO in a way that should inspire similar thinking that can be extended to quality, latency, and other data-relevant measures.
Erik Morgan
Professional Chef
8 Hands Farm
The SLO Food Movement: Error Creep Management in the Restaurant Kitchen
In addition to guests, restaurants also host many errors great and small. Therefore there is a constant challenge to prevent small error creep as well as the obvious catastrophes; but when bad things happen, and they always do, how do you get back on track?
Fred Moyer
Observability Engineer
Zendesk
Don’t Go Chasing Percentiles, Use Histograms if You Want Precision SLO Latency
This talk will discuss the motivations and mathematical reasons for using statistical distributions for latency-based SLOs. The status quo in monitoring and observability tooling relies upon percentile-based latency metrics. Those inputs work fine when everything is working well, but as practitioners we are tasked with delivering best-in-class results when things are at their worst.
Come learn a little math as I present proven techniques at scale for high-precision latency SLOs using histograms.
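One way to see why histograms help, sketched here under simple assumptions and not taken from the talk: histograms with shared bucket bounds can be merged across hosts by adding counts, so the fraction of requests within a latency threshold stays exact, whereas averaging per-host summaries weights a quiet host the same as a busy one. The bucket bounds, host data, and function names below are all hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical upper bucket bounds in milliseconds, shared by every host.
BOUNDS_MS = [50, 100, 200, 500, 1000]


def to_histogram(latencies_ms):
    """Count samples per bucket; index len(BOUNDS_MS) is the overflow bucket."""
    hist = Counter()
    for value in latencies_ms:
        hist[bisect_left(BOUNDS_MS, value)] += 1
    return hist


def fraction_within(hist, threshold_ms):
    """Fraction of requests at or below a threshold that is a bucket bound."""
    idx = BOUNDS_MS.index(threshold_ms)
    good = sum(count for bucket, count in hist.items() if bucket <= idx)
    return good / sum(hist.values())


# A busy host with some slow requests, and a quiet host with none.
host_a = to_histogram([40] * 900 + [450] * 100)
host_b = to_histogram([40] * 100)

merged = host_a + host_b  # adding bucket counts keeps every request
exact = fraction_within(merged, 200)
naive = (fraction_within(host_a, 200) + fraction_within(host_b, 200)) / 2

print(f"merged histogram: {exact:.4f}")
print(f"average of per-host ratios: {naive:.4f}")
```

The merged figure counts every request once, while the per-host average overstates compliance because the quiet host pulls it up. The same weighting problem applies when per-host percentiles are averaged.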
Greg Patmore
Contino
Over the last 4 years, Greg has been leading teams for Contino on a variety of Cloud and DevOps projects that are truly transforming the organizations he works with. His projects are centered around rapid assessment, cloud transformation, workload and technology optimization, security and compliance integration, and process engineering.
Using a Service Canvas to Define SLOs
Through our visual Service Canvas methodology, we have enabled IT teams to collaboratively discuss and readily identify their catalog of services from their internal customers’ perspectives, and to identify Service Level Objectives (SLOs) and corresponding Service Level Indicators (SLIs) for all of their offerings. The Service Canvas method uses a top-down whiteboard of swim lanes, which may be completed remotely using freely available SaaS tools such as Miro, or in person using sticky notes. Each swim-lane row holds the descriptive components required to establish the services that need objectives and indicators, along with the corresponding qualities that are necessary for identifying all meaningful metrics. With the following top-down visual approach, subsequent answers arise organically once the team completes its first service mapping, and we have had much success enabling teams to quickly identify their Service Catalog along with corresponding Service Level Objectives and Service Level Indicators (image of Service Canvas available upon request – it’s much prettier in person!):
- Service: What is the service that your internal customer wants to use?
- Components: Which components within the service do your internal users interact with?
- Customers: Who uses your services?
- Target Category: Which qualities of the service are important to your internal customers?
- Customer Needs: How does your internal customer describe a level of service that meets their business requirements?
- Service Level Indicator: How can we measure that we are meeting our customer’s service quality needs?
- Service Level Objective: What are the goals we will target in order to maintain our internal customer’s satisfaction with our services?
By iterating through each of the team’s services and coaching them to get feedback and buy-in from their internal customers on the SLO and SLI definitions for each service, we have had so much success with this method that it has become a staple of our SRE transformation engagements.
This talk will provide all of the necessary tools and demonstrate the facilitation of this Service Canvas method, and attendees will gain this powerful new way of helping IT teams catalog their services and assign meaningful objectives to them, in coordination with the consumers of these services – their internal customer pool.
Gwen De Leon
Site Reliability Engineer
IAG
Defining SLOs When You Don't Know Anything About SLOs
In this talk we walk through our SLO definition workshop, a facilitated session that we used at IAG as an experiment to help teams embed customer focus. We talk openly about what did and did not work, and the experimentation and adjustments we made along the way.
Heidi Waterhouse
Developer advocate
LaunchDarkly
Uh-oh: How Automating Responses Saves Your SLO
Service-level objectives are about the way your users experience your system. We spend a lot of time trying to ensure that the system meets those objectives, but because of the nature of systems, humans, and fate, sometimes our systems have problems.
When that happens, how do we respond? Is it a panic? Is it a manual method? Is it something that requires people noticing something and reacting the right way?
Planning and automating our response as much as possible saves time and our SLO number, even if we don’t have a precise answer for what has gone wrong. In this talk, we’ll walk through a couple examples of how you can use standard tools to preserve your service levels.
Ian Bartholomew
Site Reliability Engineering Manager
Nobl9
What Does "Reliability" Actually Mean?
One of the most important things when developing SLOs is to have meaningful SLIs. A key characteristic of meaningful SLIs is focusing on reliability from a user perspective. This talk will focus on what reliability is and how to integrate that into the creation of meaningful SLIs and ultimately better SLOs.
Ioannis Georgoulas
Senior Site Reliability Engineering Manager
Paddle.com
SLIs the Hard Way
In this talk, Ioannis will focus on the challenging task of setting up SLIs. The speaker explains, with real-world examples, why some use cases (cron jobs, queues, low traffic, internal tooling, etc.) make it difficult to set up meaningful SLIs, and presents some tips and lessons learned… the hard way.
Isobel Redelmeier
Senior Site Reliability Engineer
Discord
SLOs for the New Dog Owner
Jakub Warczarek
Principal Architect
Nobl9
How to Answer Tricky Questions the SRE Way
Save money and time with answers to the following questions: Why shouldn’t you aim for 100% reliability? When does making an application twice as fast not improve user experience? And what should wake up on-call engineers?
James Strong
Lead Solutions Architect
Chainguard
SLOs For Your Software Supply Chain Security
Supply chain security is paramount to your service level objectives. A compromised system impacts services and your company’s reputation. A rogue dependency will demolish your error budget and is 100% preventable. In this talk, we will discuss SLOs from a security perspective that developers and organizations as a whole should measure to increase the security of their services, supply chains, and peace of mind.
SLOs extend to the entire Supply Chain, not just operations. Adding SLOs into the Development of services improves SLOs at the run time of an application’s life cycle. We will review metrics at the Organization level that companies should be enforcing, such as the number of signed containers, the number of container image sources, MTTR of a CVE, among others impacting the security of development, build, and ultimately the runtime of your services. This presentation is for anyone responsible for their software pipeline, from developers who can learn about what security measures they should be implementing for their services up to CISOs looking to improve the supply chain security for their organization.
Jan Ritter
DevOps Engineer Lead
Trusted Shops
Jan Ritter is DevOps Engineer Lead at Trusted Shops
Good Reasons for SLOs in Less Than 10 Minutes
For both engineering and product / business, there are good reasons to use service level objectives (SLOs). This talk will, based on concrete examples, give you some of these reasons to start your SLO journey.
Jason Yee
Director of Advocacy
Gremlin Inc.
Budgets are for Spending
Whenever the news would announce that some extremely wealthy person died, my dad would remind me, “You can’t take it with you.” The statement was intended to frame money as a resource—something to be used with intention, not squandered or hoarded.
With reliability, we’ve adopted the financial framing of Error Budgets, and along with it we’ve also adopted some poor financial habits. This often leads to teams wasting their Error Budgets on unforeseen incidents or simply not using them at all.
In this session, I’ll share how to spend your Error Budget in order to improve your applications, your engineering teams, and the overall success of your organization.
Jayesh Ahire
Product Manager
Last9
Jayesh Ahire is the Product Manager at Last9. He is an AWS ML Hero, a Twilio Champion, and a maintainer of the OSS project Hypertrace. He is the organizer of AWS UG, Elastic UG, TensorFlow UG, the Microsoft AI community, and many other communities in India. His research interests include distributed neural computers and DeFi. In his free time, he likes to read, and he is learning to play the piano.
Monitoring Services Not Servers
The Internet and pets have an old relationship. It started with the infamous Pets.com. While the business unfortunately crashed, it established that online was here to stay. To run a business online, we used to buy server hardware for operations. We named these servers with respect, after animals, dragons, Star Wars, wines, or movie characters, just like our pets. Fast forward to today, and infrastructure is overwhelmed with pets again. This time around, we are exchanging pet photos and not pet supplies. Suddenly, we have a flock of these servers at our disposal. As scale evolved, the rise of service-oriented architecture was inevitable. But in this microservices-led world, it has become absolutely clear that one needs to start monitoring services and not servers. In this talk, we explain the rationale behind monitoring services and not servers, define the different types of services available today, and define the key SLOs that should be measured for each type of service. We will also cover how SLOs help you settle the ultimate debate of features vs. stability and address cascading impacts, ever so common in today’s world of distributed systems.
Jennifer Robertson, M.S., CCC-SLP
Speech-Language Pathologist
Now You're Talking: SLIs & SLOs for SLPs
SLIs and SLOs aren’t just for tech! While trying to teach small children how to communicate, I need a way to keep track of what they’re doing, what they should be doing, and why. The objectives SLPs set are critical not just for data, but to ensure that what we’re keeping track of actually matters.
John Willis
Red Hat
Dr. Deming Would Have Hated OKRs, but Would Have Loved SLOs
Julie Gunderson
Senior Reliability Advocate
Gremlin Inc.
In her off time Julie can be found either traipsing through the mountains in Idaho, or making circuit boards into wearable art.
Reducing Trauma in Production with SLOs and Chaos Engineering
Customer experience is the responsibility of the entire team. Many organizations leave reliability up to the SRE team; however, reliability should be built in from the very beginning. In this talk Mandi and Julie will discuss what Service Level Objectives are, why they are important to the organization, and how to define and set them. Going beyond SLOs, attendees will learn what Chaos Engineering is and practical ways to ensure compliance and resilience with best practices. We’ll show you how to focus your goals and error budgets with examples that will lead to reliability and improved user experience.
Kasia Zemka
Engineering Manager & Software Engineer
Nobl9
Kasia’s adventure with programming and the IT industry began 10 years ago. Over those years, she has worked in various roles and fields, including frontend development, testing, team leadership, and management. Currently, she uses all the experience she gained during this time to build reliable software and spread domain knowledge. Kasia is glad to work at Nobl9, a startup where she can pursue these interests on a daily basis.
SLOs are Not Only for SREs
When you think about SLOs, you probably think about SREs. But if we think about the whole idea behind SLOs and SLO culture, we quickly realize it is a great tool for teams to stay on track with their goals. One of the best ways to reach consensus is to have a common understanding of what our objective is and when good is good enough. SLOs are a great way to do that, and they are applicable throughout the whole software development process. We have things we monitor and standards we want to meet. In my talk, I want to dive deep into this aspect of SLOs, explaining why they’re not only for SREs and how they can be used by anyone in our industry in everyday work.
Keri Melich
Senior Site Reliability Engineer
Nobl9
Growing Business Ops Through SLOs
SLOs are a great tool for helping engineers communicate with the business. But in this talk, we’re going to look at SLOs from the perspective of everyone except an engineer. We’ll touch on SLAs, Roadmapping, and the monetary benefits of setting realistic reliability goals.
Kit Merker
COO
Nobl9
The SLODLC: SLO Development Lifecycle
An overview of the SLODLC, a repeatable methodology for adopting SLOs across the organization. You’ll get a walkthrough of the methodology, the resources, examples, and how to get started in your organization.
Leo Vasiliou
Director of Product Marketing
Catchpoint Systems
Perform: How many Nines? Depends on Accumulation
Meet the powerful analytic for performance-based SLOs. This talk starts with the fact that most discussions that teach SLOs use an internal, non-cumulative endpoint (e.g. how many successful GET requests to /API) to illustrate SLO concepts. It arrives at the fact that when setting an SLO for cumulative endpoints (e.g. an app or page consisting of many distributed requests), the number of nines for the objective must be adjusted accordingly. In other words, three or four nines may be acceptable for /API, but three or four nines for an experience-based (cumulative) endpoint is not practical. In this session, we will discuss the various adjustments needed for experience-based (cumulative) endpoints through both an availability and a performance lens. We will further expand on the performance lens and discuss semi-advanced distribution functions for analyzing them, with the ultimate goal being reliable, resilient experiences to better serve self, team, and business.
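To make the accumulation argument concrete, here is a minimal sketch, not taken from the talk, of how per-request availability compounds across a page. It assumes the page needs every one of its requests to succeed and that failures are independent, a simplification that real systems rarely satisfy exactly. The function names are hypothetical.

```python
import math


def composite_availability(per_request: float, n_requests: int) -> float:
    """Availability of an experience that needs all n requests to succeed.

    Assumes independent failures (illustrative simplification).
    """
    return per_request ** n_requests


def required_per_request(target: float, n_requests: int) -> float:
    """Per-request availability needed so the whole experience meets `target`."""
    return target ** (1.0 / n_requests)


def nines(availability: float) -> float:
    """Express an availability as a number of nines (0.999 is about 3)."""
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    # A page made of 50 requests, each individually at three nines.
    page = composite_availability(0.999, 50)
    print(f"page availability: {page:.4f} ({nines(page):.2f} nines)")

    # What each request would need for the page itself to reach three nines.
    needed = required_per_request(0.999, 50)
    print(f"needed per request: {needed:.6f} ({nines(needed):.2f} nines)")
```

Under these assumptions, 50 requests at three nines each give a page that succeeds only about 95% of the time, which is why the target for a cumulative endpoint cannot simply be copied from a single endpoint.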
Liz Fong-Jones
Developer advocate, Labor And Ethics Organizer, & Site Reliability Engineer
Honeycomb
She lives in Vancouver, BC with her wife Elly, partners, and a Samoyed/Golden Retriever mix, and in Sydney, NSW. She plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.
Evaluating Event-based SLOs at Scale
How do you evaluate and alert on thousands of SLOs based on millions of incoming telemetry events per second? Originally, we evaluated the data at rest every minute, but we found that inefficient and expensive. Learn how you too can turn high-throughput logs into SLO gold with streaming evaluation.
In the early days of our SLO evaluation system, we ran up an AWS Lambda bill of $10k in a few days, before implementing caching. As our system scaled, we discovered that a majority of our total computational resources were devoted just to processing the SLOs from the events that arrived in the last minute, let alone interactive user querying or other analysis. Thus, we needed to switch to streaming SLO evaluation based on incoming events, leveraging Kafka consumer groups. In this talk, you’ll learn from our mistakes and be better able to implement streaming SLO evaluation at scale for true real-time SLO computation, action, and iteration on existing SLOs.
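The difference between re-querying data at rest and streaming evaluation can be sketched in a few lines. The example below is an illustrative, in-memory sketch and not Honeycomb's implementation: it keeps running good/total counters per time bucket and updates them as each event arrives, so no past events are rescanned. The class name `StreamingSLO`, the bucket scheme, and the assumption that events arrive in non-decreasing bucket order are all hypothetical; a production system would read events from a stream such as Kafka.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StreamingSLO:
    """Rolling-window SLO compliance, updated one event at a time."""

    target: float          # e.g. 0.999
    window_buckets: int    # number of time buckets kept in the window
    buckets: deque = field(default_factory=deque)  # items: [bucket_id, good, total]
    good: int = 0
    total: int = 0

    def observe(self, bucket_id: int, is_good: bool) -> None:
        # Assumes bucket_id never decreases (events arrive in time order).
        if not self.buckets or self.buckets[-1][0] != bucket_id:
            self.buckets.append([bucket_id, 0, 0])
        # Evict buckets that have fallen out of the rolling window.
        while self.buckets[0][0] <= bucket_id - self.window_buckets:
            _, old_good, old_total = self.buckets.popleft()
            self.good -= old_good
            self.total -= old_total
        current = self.buckets[-1]
        current[1] += int(is_good)
        current[2] += 1
        self.good += int(is_good)
        self.total += 1

    def compliance(self) -> float:
        # With no events in the window, treat the SLO as met.
        return self.good / self.total if self.total else 1.0

    def is_met(self) -> bool:
        return self.compliance() >= self.target


if __name__ == "__main__":
    slo = StreamingSLO(target=0.9, window_buckets=2)
    for bucket, ok in [(0, True), (0, False), (1, True), (1, True)]:
        slo.observe(bucket, ok)
    print(slo.compliance(), slo.is_met())
```

Each event costs constant work apart from occasional bucket eviction, which is the property that makes per-event evaluation cheaper than rescanning the last minute of data on a timer.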
Luis Parada
Head of Engineering
FARFETCH
Prior to this Parada led Farfetch ID, the group responsible for FARFETCH’s Identity Provider, Authentication & Authorisation, Customer and Partner Account data.
Parada likes to focus a lot on continuous improvement of teams and organisations and on personal development, sharing insights on his YouTube channel and via a Monthly Newsletter called A Leader’s Mindset.
How FARFETCH is dealing with SLOs
FARFETCH is the leading global platform for the luxury fashion industry, currently with over 2500 engineers who build our platform of around 1000 services. For us, customer and partner satisfaction is paramount, and as such the reliability of our platform is extremely important. This is the story of how we’re tackling the issue of reliability transparency and how we’re implementing SLOs using the OpenSLO standard.
Lukasz Dobek
Software Engineer
Nobl9
Currently, he’s developing a Service Level Objectives platform at Nobl9, helping to make a cultural shift to the Site Reliability Engineering mindset.
Service Level Objectives: What Are Those?
As Service Level Objectives gain more and more traction, they’re also becoming a foundation for building reliable and customer-centric software. It’s important to know what they are, how they can help, and how to pick them. This quick talk aims to go through the very basics of SLOs and enable beginners to take their first steps in the reliability domain.
Mandi Walls
DevOps Advocate
PagerDuty
Reducing Trauma in Production with SLOs and Chaos Engineering
Customer experience is the responsibility of the entire team. Many organizations leave reliability up to the SRE team; however, reliability should be built in from the very beginning. In this talk, Mandi and Julie will discuss what Service Level Objectives are, why they are important to the organization, and how to define and set them. Going beyond SLOs, attendees will learn what Chaos Engineering is and practical ways to ensure compliance and resilience with best practices. We’ll show you how to focus your goals and error budgets with examples that will lead to reliability and improved user experience.
Matthew Macdonald-Wallace
Principal Consultant
Contino
Matt’s career started in first-line tech support over 20 years ago. Since then, he has worked at all levels of customer support, managed a shared web hosting cluster serving data and email for over 500,000 sites, held roles as a software developer building a cloud platform for public use, and provided consultancy services to some of the UK’s best-known companies.
Matt’s experience of working with physical hardware, IoT devices, and cloud platforms has led him to a strange position in the world of observability, in which he has been able to monitor servers, serverless, and even farm animals using the same set of open source tools, tools that he continues to recommend to customers.
When he’s not preaching about the wonders of what good dashboards look like, he helps run the local Makerspace, plays his guitar, and spends time with his family. Of course, he’s also frequently seen out and about with the family dog!
Automated deployments of SLOs Using Sloth and Prometheus
In this talk, I’ll take you through a quick tour of how Contino.io helps our customers manage and deploy their SLIs and SLOs using https://sloth.dev
We’ll cover the CI pipeline commands that we use to validate, generate, and deploy SLIs and SLOs at the same time as the application, and the different options available for Prometheus, Cortex, and Grafana Cloud.
We’ve only got 10 minutes, so it will need to be quick, but thankfully the tools available make this a lot easier than you think!
Max Knee
Principal Software Engineer
Comcast
Defining SLOs for Running a Delivery Tool as a Platform
We at Comcast run ConcourseCI as our CI/CD tool company-wide. To make it easier for us to operate and to identify problem areas, we needed to implement SLIs and SLOs. This brings up an interesting problem: how does a team running an open source project implement SLIs and SLOs if they don’t have direct ownership of the code and must adhere to the maintainers’ vision? We did exactly that, using the project’s existing metrics to define and create our own SLIs and SLOs to ensure that our instance performs up to our needs and standards.
Michael Ericksen
Site Reliability Engineer
Intelligent Medical Objects
Reliability Lessons from the Back of a Bicycle: 100% is Never the Goal
According to a 2018 study by the US Department of Transportation, almost 60% of vehicle trips are less than six miles and 75% are less than ten. This talk discusses anthropogenic climate change and examines how service level objectives shaped one family’s strategy to reduce their dependency on cars and fossil fuels. In it, the presenter explores how tools like service level indicators and objectives helped their family side-step impossible goals like eliminating 100% of car trips and instead focus on more realistic ones like “How many trips less than 2 miles might we make by bicycle instead?” To conclude, the presenter discusses how a focus on good enough rather than perfection leaves space for “reliability events” like a burst appendix or winter in Chicago that periodically interrupt daily life.
Michael Friedrich
Senior Developer Evangelist
GitLab Inc.
Left Shift your SLOs with Chaos
Developers and SREs are instrumenting applications and applying observability workflows with metrics, traces, logs, and beyond. The first service level objective (SLO) is defined. Now what? Wait for the first production incident?
Think of day-2 ops: SLOs need to be well understood and simulated early in the development process. False-positive alerts can lead to on-call fatigue. How do you simulate an incident? Add chaos to production, simulate network failures, broken apps, and so on, and validate the SLOs. Developers can add their own chaos experiments too.
Join this talk to learn how SLOs can be shifted left with chaos, and get inspired by new tools and workflows for your production environment.
Nikhil Unni
Co-Founder & Chief Architect
Cortex
We’re Not the Bad Guys: Extending Your Influence as an SRE
To start, how do you get people to care about reliability? At the end of the day, all engineering work should tie back to the business. Underscore that reliability leads to money saved in the long run. Maintaining SLOs, minimizing downtime, and improving developer productivity all translate into real dollars saved for businesses.
Next, you need to align with leadership and engineers on standards: Avoid subjective criteria, instead defining objective standards like SRE readiness checklists and SLO adoption targets. This will not only allow you to take stock of the current state of the world, but will also let you track your reliability journey over time.
After that, make it as easy as possible for your engineers to adopt best practices: use automation, tooling and work with counterparts in infra/platform to drive org-wide adoption of tools and processes. It’s much easier to get an organization aligned on the “right way” to do things if it’s seamless for them to do so.
Finally, after driving reliability initiatives for several quarters, prove the value of the work. Tie it back to the business use case, and track results to make sure that the investment is paying off.
Pedro Alves
Zalando
Operation Based SLOs
The industry by and large assigns SLOs to services (it is in the name after all). At Zalando, we did the same thing, but we weren’t getting all the nice benefits we were expecting. Two things we struggled with were managing the vast collection of SLOs for our hundreds of microservices and translating the performance of all those SLOs into the performance of the customer experience.
In this talk we’ll present a new implementation of SLOs that are defined on end-customer operations. We’ll see what the advantages are for engineers and managers alike when using these SLOs. We’ll also take a look at how Distributed Tracing can be used to enable the measurement of these SLOs and serve as key ingredients for Symptom Based Alerting.
Petr Hajek
Senior Engineer
Omio
Alerting on High and Low Traffic Using SLOs
This talk describes a solution for implementing SLO alerting in an environment of 50+ services with varying traffic:
- some services generate hundreds of events per hour, while others generate only a handful of events per day
- peak vs off-peak traffic variation is 10x
- the monitored services integrate with 3rd party APIs, which often generate a high volume of errors
Our solution uses standard tooling (Grafana, Graphite, and Terraform), which makes it easy for other teams or organizations to implement.
Pieter van Noordennen
Senior Director, Growth
Slim.AI
Growing Fast, Growing SLO
We know SLOs are great for SREs and DevOps professionals in gauging reliability, managing tech debt, and ensuring the team is working on the right things at the right time. More and more business leaders, especially in product management and related areas, are also embracing SLOs as ways to manage business outcomes. And the best SLO practitioners are adept at showing the impact SLOs have on their customers, not just their up-time.
Pieter van Noordennen is Senior Director of Growth for Slim.AI, a fast-scaling startup working on dev tools for cloud-native apps. Growth is rife with metrics, most commonly tracked in the form of OKRs or KPIs, and when a product is expanding quickly, there are often trade-offs that are difficult to assess and prioritize in the heat of the moment. For instance, should the team focus on increasing the acquisition of new users or focus on more deeply engaging with the users you already have? Most CEOs would want both, but how do we come up with a framework for making these trade-offs in real-time?
A recent convert to SLOs (thanks Kit!), Pieter wanted to see what he could learn from translating the current OKRs and KPIs into SLOs and tracking them that way. In this talk, he’ll share his approach, process, and learning. He will share:
- A comparison of SLOs and OKRs as tools for guiding business outcomes
- An assessment of how SLOs can be applied to traditional marketing metrics
- Learnings, gotchas, and real-life experience in socializing SLOs in a non-traditional function area
This talk is aimed at:
- business leaders looking to assess the use of SLOs in their daily work
- startup founders and early employees interested in new approaches to measuring key metrics in their growth journey
- and SLO practitioners interested in expanding the use of SLOs in their organizations to new functional areas.
Piotr Ptak
Software Engineer
Nobl9
Known mainly as a human being living in Poznań, Piotr is a programmer interested in all things computers. He’s a free software enthusiast who (currently) focuses on the world of distributed systems. Despite often being stuck in his own mind, he can be found wandering through different dimensions playing guitar or harmonica, drawing things that grab his attention, and talking about SLOs.
He’s constantly in search of new ways of communicating with nature, such as hiking and climbing. And for unknown reasons, he’s fond of staying deep inside the forest at night.
Write Your Code with SLOs in Mind
The Software Development Life Cycle consists of multiple phases, such as Requirements Analysis, Design, Implementation, Testing, and Deployment. As more and more people adopt the Site Reliability Engineering philosophy, the question that comes to mind is: how do SLOs fit into this cycle? How do we incorporate SLOs into the methodologies and techniques that we have built and polished over years of software development?
Adding SLOs as the final step of such a cycle may seem tempting (for some), but there are many advantages to doing it in the earlier phases. In this talk, I’m going to discuss these benefits and highlight some principles and practices for teams following this approach.
Prathamesh Sonpatki
Software Engineer
Last9
Custom SLOs for all: Manage Your Business With Your Own Metrics
As software becomes more and more complicated, it is essential to give users the power to measure and optimize whatever metric they desire, depending on the priority for their business. This becomes important when out-of-the-box solutions fail to provide users with the flexibility to choose their metrics. Users typically tend to move to open source tools providing such leeway.
This talk is intended for such users - power users who are writing their own PromQL queries and heavily leveraging existing tools in the market for custom tailor-made monitoring solutions for their organizations OR up & coming users that intend to go down this rabbit hole.
Rama Kulasekaran
Service Reliability Advocate
Optum
Measuring Service Reliability for Greater User Experience: Theory to Practice!
If you are a business that stands up and sells a service to your customers and assures them of always-on reliability, where and how do you get started? What framework and approach could you use to build a tangible process that makes sense for you and your customers? Perfection and 100% reliability aren’t the goal, but setting measurable and concrete reliability targets will result in happy customers. Start the SLO discussion early in the design process. So, what are SLIs and SLOs? What should your SLO be? Why should you pick SLOs from historical performance vs current performance? What other industry-standard metrics should you care about and track while you make better software, faster? What specific measures can prevent you from getting paged in the middle of the night as a result of an unstable service?
Ricardo Castro
Lead SRE
Anova
First Principles: How to Learn about SLO-based Approaches from Scratch
SRE and SLOs are all the hype. Everywhere you look, there they are. You’ve read about it. Watched a few talks. But still, you’re having some trouble applying them to your context. Don’t worry, you’re not alone.
Putting SRE and SLOs aside for a bit, your ultimate goal is to amaze your customers. You want to make them happy using your product or service. And you know that happy customers are good for business. SRE and SLOs can help you achieve that.
This talk will reason from first principles. It will break down these complicated concepts into basic elements. They will then be reassembled from the bottom-up. Reasoning from first principles is one of the best ways to learn new and complex concepts.
Robert Ross
CEO
FireHydrant
Why You Should Probably Only Alert on SLOs
This talk focuses on moving away from alerting on computer vitals and toward alerting on customer pain measured through SLOs. We’ll focus on symptoms, not causes, and how this can better tie your product/engineering organization to customer satisfaction. This is a highly energetic talk that will question the norm that almost all organizations I’ve worked at have operated under: “I’ll just page someone when CPU is greater than 80%.” Why? I say we should be cheering about that with a pizza party if customers are as happy as they were before the alert went out.
Sal Furino
CRE
Sal Furino is a Customer Reliability Engineer. During his career he's worked as a TPM, SRE, Developer, Sys Admin, and in IT support. While not working he enjoys cooking, gaming, traveling, skiing, and golfing. Sal lives in Queens with his partner and has a BS in Applied Mathematics from Marist College.
Life of an SLO
Sal Kimmich
Product Strategist, Developer Advocate
Reliably
Culture Clash: Why DevOps, SRE and Cybersecurity Teams Have Different Motivations and Professional Cultures
The most beautiful thing about SRE is your error budget, but in cybersecurity, error budgets just can’t happen when a critical vulnerability is found in what we call “zero day” events. Here we’ll talk about shared SLOs that serve to keep source code running while keeping vulnerabilities out. This is a great talk for SREs, cybersecurity engineers and researchers, and especially the managers who oversee both of them. The methods and motivations behind accelerating delivery and tolerating some risk with an error budget are incredibly different from the cybersecurity approach of zero tolerance for critical vulnerabilities. Both approaches are absolutely necessary to keep software running in the real world. This talk will show how we can structurally support shared aims with SLOs. Most importantly, you’ll walk away with a little more empathy for two dev teams who typically only come together to put out fires, and with ideas for a better culture across teams in the modern DevOps-driven enterprise.
Shelby Spees
Site Reliability Engineer
Equinix Metal
Intro to Tracing-Based SLOs
Distributed tracing is becoming more accessible with libraries like OpenTelemetry, and custom instrumentation helps us capture data that better represents the user experience. Shelby Spees, SRE at Equinix, shares the benefits of tracing-based SLOs and how to define SLIs for trace data using examples from Equinix Metal.
Stefan Zier
Chief Architect
Sumo Logic
Customer-Centric SLOs: Track SLOs for Every Customer
As an enterprise vendor, we don’t have millions of customers, but thousands, each with specific needs and behaviors. In this talk, I will walk you through how we mapped “is Sumo Logic working” to concrete SLOs, and how we track them on a per-customer basis. We use this data to surface new issues that would have been lost in a “global SLO” but may have major impact on individual customers.
Stephan Mousset
Lead Reliability Advocate
ING
How to Plant the SLO Seed, from Availability Reporting to SLO: A Real Life Story from the ING
ING is a major global financial institution with many external stakeholders and regulators that request availability reporting from us to prove that we meet their availability targets. But how do we translate this into the experience of our typical customers, and how do we steer on meeting their expectations? And how do we use it to steer our squads on where to put their engineering power to get the maximum value for our customers? This is where the concept of SLIs, SLOs, and error budgets comes in, which gives us an opportunity to steer not only on end-to-end availability but also on all the other quality criteria our typical customer expects from us. In this story, I will explain our journey so far and how we are working to make this a global ING standard in the future. Expect a practical talk with a lot of usable real-life examples.
Stephen Townshend
Site Reliability Engineer
IAG
Defining SLOs When You Don't Know Anything About SLOs
In this talk we walk through our SLO definition workshop, a facilitated session that we used at IAG as an experiment to help teams embed customer focus. We talk openly about what did and did not work, and the experimentation and adjustments we made along the way.
Steve McGhee
Reliability Advocacy Engineer
Google SRE
SLO Classical Lit
How can Greek Literature help us understand Internet Services? In this talk, Steve will provide a few points from this Ancient Literature and how to relate them to our modern Internet world.
Timothy Bonci
Principal DevOps Engineer
Cimpress
Measure What Matters: How it Started and How it's Going Providing a CI/CD Platform
When I started building a new platform I had to decide when it was important enough to be woken up, so I needed to establish an SLO. What I found is that measuring what is easy to measure doesn’t always capture the intent of your SLI. I had to iterate a few times and take new things into consideration as the proverbial water started to drain and uncovered more rocks.
Weyert de Boer
Head of App Store Engineering
Tapico
Weyert also contributes to various communities and is part of the OpenSLO team to help define the SLOs in a declarative way.
In his spare time, Weyert enjoys reading about ancient history, pursuing paleoanthropology as a hobby, helping developers out in various communities, and trying to get better at oil painting.
OpenSLO Alerting
As part of the OpenSLO specification team, I have worked on defining an extension that allows you to define alerting as part of your OpenSLO definitions. I would like to talk about how you can define alerts for your OpenSLOs.
2022 SLOconf Highlights
2022 Talks
Custom SLOs for all: Manage Your Busines...
Automated deployments of SLOs Using Slot...
Burn Your Dashboards: The Case For SLO-F...
First Principles: How to Learn about SLO...
You Have an SLO, Whether You Know It or ...
Defining SLOs When You Don't Know Anythi...
SLIs the Hard Way
Good Reasons for SLOs in Less Than 10 Mi...
Tips for Running Successful SLO Workshop...
Customer-Centric SLOs: Track SLOs for Ev...
Previous Sponsors
Media Sponsors
DevOps.com
DevOps.com hosts a variety of articles, videos, podcasts and custom content, all designed to educate, inform and engage.
TFiR
TFiR is a video-focussed story-telling platform covering Open Source, Cloud Native Computing, Security, Edge, 5G & AI/ML.
The New Stack
For developers and engineers building and managing new stacks around the world that are built on open source technologies and distributed infrastructures.
VMblog
VMblog.com is dedicated to spreading the word about modern Data Center technologies like Virtualization, Cloud Computing, Containers, Hyperconvergence, IoT, Software-Defined "X", etc.