Why You Need an Error Budget and How to Make It Work

Wed 26 September 2018

Editorial note: I originally wrote this post for the TechBeacon blog. You can check out the original here, at their site.

How many times have you seen Google go down? Not many, I bet. You might not even notice it if it happened. If you did, you’d probably assume it’s an internet connection problem.

But Google isn’t perfect. As Werner Vogels says, “Everything fails, all the time.” If even Google doesn’t have 100% uptime, maybe we should accept that we won’t either. Shouldn’t we focus instead on how to recover from failure? Systems tend to fail when they change, but your customers probably wouldn’t like it much if you simply never updated your software. So since we have to change, how do we know when it’s safe to change?

Maybe you’ve had good uptime recently, and a small failure won’t hurt you much. But you have to know whether that’s the case. You need metrics that tell you whether it’s a good idea to freeze changes for a while or whether some errors are still acceptable.

Why You Need an Error Budget

If you’ve compared cloud services, you may have heard about the nines of availability. It’s basically a number that tells you how much of the time a system is expected to be up. When someone talks about having four nines (99.99%) of availability for a system, that person is saying the system will be down only 52 minutes and 35 seconds a year. The more nines, the more uptime. “Down” doesn’t have to mean completely unreachable, either. For instance, let’s say that you defined a rule specifying that the system has to respond in under 500 ms 99.99% of the time. If latency goes above that 500 ms threshold, your system is considered down, even though it’s still responding.
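To make the arithmetic behind the nines concrete, here’s a minimal Python sketch. The function name allowed_downtime and the choice of an average year of 365.25 days are my assumptions for illustration; a different year length shifts the result by a couple of seconds.

```python
from datetime import timedelta

# Assumption: an average year of 365.25 days (accounts for leap years).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def allowed_downtime(availability_percent: float) -> timedelta:
    """Return the yearly downtime that an availability target permits."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    # Truncate to whole seconds so the result is easy to read.
    return timedelta(seconds=int(SECONDS_PER_YEAR * unavailable_fraction))


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% allows {allowed_downtime(target)} of downtime per year")
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why every additional nine tends to cost much more than the previous one.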

This number is used to define the service level agreement (SLA) or service level objective (SLO). The error budget is how much time you’re willing to allow your systems to be down, and it will depend heavily on the SLA that you’ve defined with the product team. Everyone would like to have systems with 100% uptime, but you need to be realistic. How much availability are you willing to provide, based on how much your customers care? Are your users going to notice that your system is up 100% of the time? What about 99.99%? Or even 99%? They might not.
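As a rough illustration of how an SLO turns into a budget you can spend, the sketch below computes the budget in minutes and checks how much is left. The 30-day window and the function names are assumptions I’ve made for the example; pick whatever window you agreed on with the product team.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Total downtime (in minutes) the SLO allows within the window."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return window_days * MINUTES_PER_DAY * (1 - slo_percent / 100)


def remaining_budget_minutes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> float:
    """Budget left after subtracting the downtime already observed."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


def should_freeze_changes(
    slo_percent: float, downtime_minutes: float, window_days: int = 30
) -> bool:
    """Suggest a change freeze once the budget is fully spent."""
    return remaining_budget_minutes(slo_percent, downtime_minutes, window_days) <= 0


if __name__ == "__main__":
    print(error_budget_minutes(99.9))            # budget for a 30-day window
    print(remaining_budget_minutes(99.9, 10.0))  # after 10 minutes of downtime
    print(should_freeze_changes(99.9, 50.0))     # budget exhausted?
```

With a 99.9% SLO over 30 days the budget is roughly 43 minutes, so a single 50-minute outage would be enough to spend all of it.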

It’s important that you have an SLA and SLO that work for you so that, the moment a deployment fails, you’ll think twice about whether to try fixing something in production or to roll back to a stable version. Having an error budget also supports the decision not to push changes when people lack trust in them.

Uptime vs Innovation: Should I Pick One?

High uptime carries costs beyond money and complexity. It’ll also put you in the position of worrying too much when deploying changes. Some might use error budgets to support their theory that every change affects the stability of the system. In their mind, that means no more changes. But I’d advise against having that mindset. It’s better to protect stability in other ways.

Operations will always seek to have systems that are highly available by putting in place replication, redundancy, auto-scaling, backups, and everything that makes systems more robust. On the other hand, developers will try to write code that satisfies the requirements that came from the business. That’s how the DevOps movement started: people wanted to create a culture where these frictions are minimal.

If you care more about having several nines of availability than about releasing new features, innovation will stop. Sure, it might feel better to be conservative than to take the risk, right? But let’s face it. No one will care very much about your reliability if your system doesn’t provide any value. Successful systems are those that solve a problem users have. There are always trade-offs, but keeping systems static won’t keep your customers happy.

How Do We Keep the Budget Positive?

It’s important to have room in your error budget in case something happens that’s external to deployments: internet connection issues, fires in the data center, cloud providers being down, or any other problem that’s not in your hands to fix (and that complaining on Twitter won’t solve).

When you push changes gradually, you’re more in control of the error budget. If something starts to affect uptime, you can roll back immediately before it consumes the budget. Also, you might soar over your error budget if you don’t release in small batches. Deployment strategies like blue/green deployments or canary releases are a good option to keep numbers positive. Automation becomes your best friend here. Every second counts, especially when you need to do a rollback.
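The rollback decision for a gradual release can be sketched in a few lines. This is a simplified illustration, not how any particular deployment tool works: the function canary_decision, its parameters, and the minimum sample size of 100 requests are all hypothetical, and in practice the request counts would come from your monitoring system.

```python
def canary_decision(
    canary_requests: int,
    canary_errors: int,
    slo_percent: float,
    min_requests: int = 100,  # assumption: smallest sample worth judging
) -> str:
    """Return "wait", "rollback", or "promote" for a canary release."""
    if canary_requests < min_requests:
        # Not enough traffic yet to judge the new version.
        return "wait"
    error_rate = canary_errors / canary_requests
    allowed_error_rate = 1 - slo_percent / 100
    # Roll back as soon as the canary fails more often than the SLO allows.
    if error_rate > allowed_error_rate:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    print(canary_decision(50, 0, slo_percent=99.9))
    print(canary_decision(1000, 5, slo_percent=99.9))
    print(canary_decision(1000, 0, slo_percent=99.9))
```

Because only a small share of traffic hits the canary, a bad release spends a small share of the budget before the rollback kicks in.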

You can also start with the code. How does your code respond if the database has problems? Or what about Redis? Problems with dependencies will always happen, so it’s better if your application can tolerate them. Let’s say your system is composed of several microservices. If one of those microservices is down, instead of failing, the client should return a default response or take data from a local cache. For example, Netflix has a really good library for this called Hystrix. If you’re not in the Java world, you can still apply the same principles: isolate calls to dependencies, fail fast, and fall back gracefully.
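Here’s a minimal Python sketch of the “default response or local cache” idea. It isn’t Hystrix and doesn’t implement circuit breaking or timeouts; the class FallbackClient and its fetch callable are hypothetical names I’ve introduced just to show the fallback order.

```python
from typing import Callable, Dict


class FallbackClient:
    """Wraps a call to a dependency and degrades gracefully when it fails."""

    def __init__(self, fetch: Callable[[str], str], default: str) -> None:
        self._fetch = fetch          # hypothetical call to another service
        self._default = default      # response used when nothing is cached
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> str:
        try:
            value = self._fetch(key)
        except Exception:
            # Dependency is down: serve the last known value, else the default.
            return self._cache.get(key, self._default)
        # Remember the fresh value so it can be served during an outage.
        self._cache[key] = value
        return value


if __name__ == "__main__":
    def flaky_fetch(key: str) -> str:
        raise ConnectionError("service unavailable")

    client = FallbackClient(flaky_fetch, default="n/a")
    print(client.get("profile"))
```

Serving slightly stale data is usually cheaper in error-budget terms than returning an error, but whether stale data is acceptable depends on what the data is.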

Fail, But Don’t Get Caught

Failure is an option, but the trick is how you manage it. Netflix actually practices failure all the time with its Chaos Monkey tooling. They go to the extreme of repeatedly bringing down entire clusters in production. Now, let’s be clear about this. If Netflix is down, users might get mad, but no one will die. Netflix won’t lose money right away, either, because customers pay for monthly plans. But its reputation could suffer if the system is constantly down, and that would hurt revenue months later through lost users.

So, should you start bringing servers down? What will be the business’s reaction when you tell them? They’ll probably freak out and respond with a solid no. It’ll depend on the impact downtime has on the users. But even if you don’t put your production systems in failure situations the way Netflix does, you should at least practice failure in a testing environment. Doing that shows you care about reliability and are prepared for common failure scenarios.

When AWS went down some years ago, Netflix was one of the few that survived that failure. They failed but didn’t get caught.

Keep Innovating While Staying Up

With DevOps, some organizations are so focused on delivering fast that they sometimes don’t adequately assess risk. And if that’s you, developing an error budget can help you to be aware and respond properly. Most businesses prefer to be conservative, so you need to learn how to sell DevOps or even automation to management while accounting for risk. At the same time, you need to keep innovating without affecting reliability.

Having an error budget will force you to have metrics in place to know if you’re meeting expectations or not, and it will help you take action to reduce the chances of being unreliable.

Error budgets give you more than just a number. They’ll change your thinking when you’re delivering software. You’ll want to shift to the left everything that will make your systems more reliable.

Christian Melendez
