Which DevOps Metrics Matter?

Mon 20 August 2018

Editorial note: I originally wrote this post for the Rollout blog. You can check out the original here, at their site.


Some time ago, I decided to start dieting for the millionth time. All previous tries were a complete failure. But this last time was different.

It was different because not only was I going to a nutritionist, but also every time I visited her, she took my measurements. That way, she knew if I was following the recommended diet and exercise, and if I was, she knew what changes she needed to suggest based on the results. I was seeing a small amount of progress every week. It didn’t matter if I lost only one pound; it was something.

When building a product, we should always try to find ways of improving what we’re doing to produce—to deliver faster, cheaper and on time. There are practices and processes that help us, especially in software development. But when you implement something new in your process, you also need to know if it’s working or not. Even if everyone else says it works for them, you might be doing something a different way…maybe even the wrong way. How can you know if what you’re doing is adding value or not? Well, you need to measure.

When you start practicing DevOps, it’s very helpful to have numbers that tell you how well or badly it’s going. But which metrics should you use? Let’s explore the what, why, and how of some important DevOps metrics.

How Many Times Are You Doing Deployments?

Organizations usually do deployments quarterly or monthly—or with the rise of Scrum, every two weeks, if user stories are complete.

You need a deployment frequency metric. That means counting how many times you deploy the applications under your DevOps implementation. The more deployments you do (or can do), the better. That said, you don’t necessarily need to deploy frequently if you’re not changing the system very often. In other words, deployment frequency correlates with how often you complete changes in the code.

Is your change ready to be deployed? Go ahead, do it! Or maybe you’re good enough that you have it ready to go on the date that marketing set for a campaign. Awesome! But it doesn’t matter if you deploy when coding is done or not. You can use feature flags to deploy anytime you want with the feature turned off and then turn it on when you’re prepared to release.
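To make the "deploy with the feature turned off" idea concrete, here’s a minimal sketch of an in-memory feature flag. The `FeatureFlags` class, the `summer-campaign` flag name, and the banner text are all made up for illustration; a real setup would most likely use a feature flag service or a config store instead.

```python
class FeatureFlags:
    """Tiny in-memory flag store (hypothetical, for illustration only)."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_enabled(self, name):
        # Unknown flags default to off, so new code ships dark.
        return self._flags.get(name, False)


def home_banner(flags):
    """The new code path is deployed, but only runs when the flag is on."""
    if flags.is_enabled("summer-campaign"):
        return "Summer sale: 20% off"
    return "Welcome back"


flags = FeatureFlags()
banner_at_deploy = home_banner(flags)      # feature deployed, still off

flags.set("summer-campaign", True)         # release day: turn it on
banner_at_release = home_banner(flags)
```

The point of the design is that deploying and releasing become two separate decisions: the code goes out whenever it’s ready, and the flag decides when users see it.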

It’s important that you know how many times you’re doing deployments because it’s a sign of how fast and easy a deployment is to do. If they’re not easy and you don’t deploy very often, your next task is to find out why and improve. It might be because your developers’ confidence in deployments is broken. Or maybe testing quality isn’t good enough. Don’t let these issues stand; you have to address them.

You don’t have to be Amazon, Facebook, or Flickr to do ten deployments a day. It should be possible when you need to.

Now, how do you keep track of this? Well, every time you do a deployment, save a record in a database or any other service you use for logging, like ELK. Then, build dashboards to have a better view of the past. You’ll know at a glance if you’re improving or getting worse.
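As a rough sketch of what that record keeping could feed into, the snippet below counts deployments per ISO week from a list of timestamps. The `deploy_log` data and the function name are assumptions for the example; in practice the timestamps would come from your database or logging service.

```python
from collections import Counter
from datetime import datetime


def deployments_per_week(deploy_times):
    """Count deployments per ISO week, keyed by (year, week number)."""
    counts = Counter()
    for ts in deploy_times:
        iso = ts.isocalendar()          # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


# Hypothetical deployment log: one timestamp per deployment.
deploy_log = [
    datetime(2018, 8, 6, 10, 15),
    datetime(2018, 8, 7, 16, 40),
    datetime(2018, 8, 9, 9, 5),
    datetime(2018, 8, 14, 11, 30),
]

weekly = deployments_per_week(deploy_log)
```

A dashboard would then plot these weekly counts over time so the trend is visible.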

How Much Time Does It Take From Code Commit to Release?

The DevOps movement started because after developers finish coding, moving to production tends to be problematic. I’ve experienced this struggle in my job. Basically, we would spend two weeks coding and another two weeks trying to release. It didn’t always take that long, but the time to deploy after development delivered certainly wasn’t consistent. Add to that the pressure of management asking why it was taking so long to deliver, and you have one stressed-out team.

My second suggested DevOps metric measures the time between a developer saying “OK, I’m ready to have this change in production” and actually putting that change in production. If you don’t measure that time, you might not improve it, or it will be too hard to identify why it’s taking so long. In this case, the less time it takes, the better. The number should be as low as you can make it, bearing in mind that compiling, publishing, and testing all take time.

Knowing that you take too long to deliver will incentivize you to identify where you’re spending most of that time. It could be that you’re wasting too much time fixing merge conflicts. If that’s the case, then the solution is to adopt trunk-based development with feature flags. It could be that you’re doing things differently in each environment, so there’s always something new to fix. In that case, the solution is to have production-like environments.

The idea is to know you have a problem and that you need to do something to fix it.

So, how can you keep track of this? Well, because the time between delivery and deployment could be due to many things, start by recording that time in a database or app. Save the date and time the change was validated in a dev environment and ready to be released. Then, when that change is deployed, record the date and time again, do the math, and store the total time it took.
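The "do the math" step could look something like the sketch below. It assumes each change is stored as a pair of timestamps (ready, deployed); the function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def lead_time(ready_at, deployed_at):
    """Time between a change being ready and being in production."""
    if deployed_at < ready_at:
        raise ValueError("deployed_at is earlier than ready_at")
    return deployed_at - ready_at


def average_lead_time(changes):
    """Average lead time over (ready_at, deployed_at) pairs."""
    durations = [lead_time(ready, deployed) for ready, deployed in changes]
    return sum(durations, timedelta()) / len(durations)


# Hypothetical records: when each change was ready, and when it shipped.
changes = [
    (datetime(2018, 8, 1, 9, 0), datetime(2018, 8, 1, 15, 0)),    # 6 hours
    (datetime(2018, 8, 2, 10, 0), datetime(2018, 8, 3, 10, 0)),   # 24 hours
]

avg = average_lead_time(changes)
```

Storing the raw timestamps rather than only the computed duration keeps the door open for later breakdowns, for example by team or by application.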

How Many Times Do Deployments Fail?

The next metric measures how many times you need to roll back or turn a feature off because of an increase in errors. The closer to zero this number is, the better. Measuring this will tell you about the quality of your process before the release. It can also tell you how many times you’ve practiced putting things in a production environment.

It’s important to have this metric because it will tell your customers or upper management how well or badly your DevOps efforts are going.

You need to measure this if you want to worry less about deployments. I know some organizations that reserve Christmas Eve to do high-risk deployments. Why? Users are too busy spending time with family, eating, and celebrating to care if the system is down. But you shouldn’t have to wait for a time when your users won’t notice you’ve failed.

How can you keep track of this? This metric isn’t about measuring time. You might already be measuring what you need, just not for the purpose of DevOps metrics. I’m talking about knowing how many 5XX and 4XX errors you have, or how high latency is, after a deployment. Even if your system isn’t a web app, you still need to keep track of system errors.

An even better metric is to have key performance indicators (KPIs). These will vary from one industry to another. Retailers may measure sales while banks may measure transactions, for instance. So your KPIs could be something like the number of failed purchases or the number of times users abandon the page. You get the idea. As long as you have a way of knowing the system is failing, you should use that information to know the number of times a deployment has failed.
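Here’s one way that counting could work, as a sketch only. It assumes you already collect request and 5XX counts for an observation window after each deployment, and it treats a deployment as failed when the error rate goes above a threshold. The 1% threshold, the `DeploymentWindow` type, and the sample numbers are all assumptions; you could swap in a KPI such as failed purchases just as easily.

```python
from dataclasses import dataclass


@dataclass
class DeploymentWindow:
    name: str
    requests: int        # requests served in the window after the deploy
    server_errors: int   # 5XX responses in that same window


def is_failed(window, max_error_rate=0.01):
    """Flag a deployment as failed if its error rate exceeds the threshold."""
    if window.requests == 0:
        return False     # assumption: no traffic means nothing to judge
    return window.server_errors / window.requests > max_error_rate


def count_failed(windows, max_error_rate=0.01):
    return sum(1 for w in windows if is_failed(w, max_error_rate))


# Hypothetical measurements for three deployments.
windows = [
    DeploymentWindow("release-1", requests=10_000, server_errors=12),
    DeploymentWindow("release-2", requests=8_000, server_errors=400),
    DeploymentWindow("release-3", requests=12_000, server_errors=30),
]

failed = count_failed(windows)
```

Only `release-2` crosses the 1% line here, so one of the three deployments counts as failed.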

How Much Time Does It Take To Recover?

The next metric I’ll mention requires you to measure the time it takes you to recover from a failed deployment. It’s also known as “mean time to recover” (MTTR). The less time it takes, the better.

You need to measure this because it will tell you how good you are at recovering the state of the system. If you’re putting stability above all else, your response to seeing a deployment fail will be to immediately roll back. And you don’t have to build a rollback process. It could be as simple as turning a feature flag off. If you read this blog, you’re already aware of how powerful feature flags can be. And they’re especially useful in these types of scenarios.

How can we keep track of how we’re doing on recovery time? It’s as easy as keeping a record of the time it takes you to roll back to a stable version. If you have one-click deployments, going back to a stable state of the system shouldn’t take that much time. The more automation you have here, the more it will help.
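A minimal sketch of that record, assuming each incident is stored as a pair of timestamps (when the failure was detected, when the system was stable again). The names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_recover(incidents):
    """Average recovery time over (failed_at, recovered_at) pairs."""
    if not incidents:
        return None      # nothing to average yet
    total = sum(
        (recovered - failed for failed, recovered in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Hypothetical incidents: failure detected, then system stable again.
incidents = [
    (datetime(2018, 8, 3, 14, 0), datetime(2018, 8, 3, 14, 10)),    # 10 minutes
    (datetime(2018, 8, 10, 9, 30), datetime(2018, 8, 10, 10, 0)),   # 30 minutes
]

mttr = mean_time_to_recover(incidents)
```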

How Are Your Metrics Doing?

Correlate all metrics and you’ll have a better understanding of what’s happening with your DevOps implementation. You don’t need to have good numbers when you start, but make sure you know your initial state—like the before/after photos you see when someone succeeded in losing weight.

In the end, what you should seek to have is:

A significantly higher number of deployments per day or per week. The more, the better.
A decrease in the amount of time you take to deploy a change after the code is finished. The less time, the better.
A smaller number of failures after doing a deployment. The fewer, the better.
A decrease in the amount of time you take to recover from a failure. The less time, the better.
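If you keep a snapshot of your initial state, checking those four directions can be as simple as the sketch below. The `Snapshot` fields, their units, and the sample numbers are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    deployments_per_week: float
    lead_time_hours: float
    failed_deployments: int
    recovery_minutes: float


def improved(before, after):
    """Report, per metric, whether 'after' moved in the right direction."""
    return {
        "deployment frequency": after.deployments_per_week > before.deployments_per_week,
        "lead time": after.lead_time_hours < before.lead_time_hours,
        "failed deployments": after.failed_deployments < before.failed_deployments,
        "recovery time": after.recovery_minutes < before.recovery_minutes,
    }


# Hypothetical "before" and "after" photos of a DevOps implementation.
before = Snapshot(deployments_per_week=1, lead_time_hours=72,
                  failed_deployments=4, recovery_minutes=120)
after = Snapshot(deployments_per_week=5, lead_time_hours=8,
                 failed_deployments=1, recovery_minutes=15)

result = improved(before, after)
all_better = all(result.values())
```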

You can start with a simple measurement of each of the above metrics: a single number that captures the idea. Or you can have detailed metrics, which you’ll need when trying to improve the numbers. Either way, don’t make guesses about your progress with DevOps. Use metrics to prove it with data.

Christian Melendez