Growing a High-Performance DevOps Culture

Fri 08 June 2018

Editorial note: I originally wrote this post for the Scalyr blog. You can check out the original here, at their site.

Culture is one of those things where we all know what it is but can’t explain it. Well, according to Wikipedia, culture is “the social behavior and norms found in human societies.” But in simple words, it’s all about people: how they interact, how they behave, how they talk, and what they practice. And culture is the foundation of a successful implementation of DevOps.

John Willis, an established speaker and writer on the subject of DevOps, coined the term CAMS (culture, automation, measurement, sharing) at a talk where he explained that DevOps culture is about breaking down silos. But what I find most striking about his discussion of culture, as summarized in the DevOps Dictionary, is the observation that “fostering a safe environment for innovation and productivity is a key challenge for leadership and directly opposes our tribal managerial instincts.” So the starting point for your DevOps journey is good leadership. After that, it’s just about how to grow your team to become a high-performing one.

A high-performing team in DevOps, according to recent research, is one that:

Does deployments often, meaning several times a day.
Delivers a change with a fast lead time (minutes) after it’s been pushed to a shared repository.
Has a short (again, minutes) mean time to recover (MTTR).
Has a small change failure rate (described here).
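To make those four measures concrete, here is a minimal sketch of how you might compute them from a log of deployments. The `Deployment` record, its field names, and the sample data are all hypothetical, invented for illustration. Your own tooling will store this information differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one deployment (field names are assumptions)."""
    committed_at: datetime                    # change pushed to the shared repository
    deployed_at: datetime                     # change live in production
    failed: bool = False                      # did this deployment cause a failure?
    recovered_at: Optional[datetime] = None   # when service was restored, if it failed


def summarize(deployments: list[Deployment], days: int) -> dict:
    """Compute the four metrics over a window of `days` days."""
    failures = [d for d in deployments if d.failed]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    return {
        "deployments_per_day": len(deployments) / days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deployments) if deployments else 0.0,
        "mean_time_to_recover": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }


# Made-up sample: four deployments in one day, one of which failed.
t0 = datetime(2018, 6, 8, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(minutes=10)),
    Deployment(t0 + timedelta(hours=2), t0 + timedelta(hours=2, minutes=20)),
    Deployment(t0 + timedelta(hours=4), t0 + timedelta(hours=4, minutes=30),
               failed=True, recovered_at=t0 + timedelta(hours=4, minutes=45)),
    Deployment(t0 + timedelta(hours=6), t0 + timedelta(hours=6, minutes=40)),
]
report = summarize(sample, days=1)
for name, value in report.items():
    print(f"{name}: {value}")
```

With this made-up data, the team deploys four times a day, one change in four fails, and recovery takes 15 minutes. The point isn't the exact numbers but having them in one place so you can watch them move over time.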

So how do you grow a high-performance DevOps culture? You create a culture that will produce a team that delivers on time with confidence in a predictable manner. Here are the things that will help you get there.

Increase Deployment Frequency

Increasing deployment frequency could be seen as a contradiction to delivering in a reliable manner. But working in small batches isn’t anything new. Scrum and Extreme Programming are two management systems for software development that emphasize delivering in short cycles. While deployment frequency varies from one team to the next, people who practice Scrum typically deploy every two weeks. And sure, that may seem fast if you’re new to agile. But there are some companies that would find it laughable to call two weeks a “short cycle.”

High performers are known for doing several deployments per day. The average high-performance team does four daily deployments, and some do way more than that. Etsy does eighty, and Amazon and Netflix deploy thousands of times every day, according to this study. That seems crazy, but doing so increases delivery throughput and lets organizations get feedback all the time. Customers or stakeholders don’t have to wait for the sprint to finish. As soon as development has finished coding and verified that the change adds value (and doesn’t break anything), there’s no reason to keep waiting to deploy.

Working in small batches is more than a practice. It’s a mindset. It requires that all members of the organization—developers, operators, businesspeople, and most importantly, leaders—think about it all the time. The result will be that you’ll increase customer collaboration because they’ll have something to try every day.

When you incubate an idea for too long, it becomes obsolete and useless. Deliver, and do it often.

Find Countermeasures to Increase Speed

How will you know if something’s working or not if you don’t measure it?

Measuring plays a big role in building a high-performance team. It’s important that you identify what’s stopping you from delivering on time. High performers have all sorts of metrics in place to help them do that.

One very important metric is lead time. Lead time is how long it takes for a change to reach the production environment after the code has been committed to version control. Basically, it’s how much time it takes to test and deploy code changes. This matters because no matter how fast a developer codes, a change could still take weeks to be delivered.

Do you know what’s causing big lead times in your workflow? You might not.

For example, you might be working with long-lived feature branches. Are you measuring the time it takes to merge the feature into the master branch? If you’re spending too much time on this task that doesn’t have much value, it’s time to find a countermeasure.

According to lean training company Velaction, countermeasures are “the actions taken to reduce or eliminate the root causes of problems that are preventing you from reaching your goals.” So a good countermeasure in the above example is to use trunk-based development. You’ll always commit to the master branch, which reduces the time you spend fixing merge conflicts.

After you’ve applied a countermeasure, you might need to apply another one somewhere else. But it will be hard to identify which if you’re not measuring everything you’re doing.
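One way to see where the next countermeasure belongs is to split lead time into stages and check which one eats the most time. Here is a rough sketch of that idea. The stage names and timestamps are hypothetical, and a real pipeline will have its own steps.

```python
from datetime import datetime, timedelta

# Hypothetical pipeline stages, in order. Each change records when it reached each one.
STAGES = ["committed", "merged", "built", "tested", "deployed"]


def stage_durations(timestamps: dict[str, datetime]) -> dict[str, timedelta]:
    """Time spent between each consecutive pair of stages for one change."""
    return {
        f"{start} -> {end}": timestamps[end] - timestamps[start]
        for start, end in zip(STAGES, STAGES[1:])
    }


def slowest_stage(changes: list[dict[str, datetime]]) -> tuple[str, timedelta]:
    """Sum stage durations over all changes and return the largest contributor."""
    totals: dict[str, timedelta] = {}
    for change in changes:
        for step, spent in stage_durations(change).items():
            totals[step] = totals.get(step, timedelta()) + spent
    step = max(totals, key=totals.get)
    return step, totals[step]


# Made-up data: two changes that sat on feature branches for days before merging.
t0 = datetime(2018, 6, 8, 9, 0)
change_a = {
    "committed": t0,
    "merged": t0 + timedelta(days=3),
    "built": t0 + timedelta(days=3, minutes=10),
    "tested": t0 + timedelta(days=3, minutes=40),
    "deployed": t0 + timedelta(days=3, minutes=45),
}
change_b = {
    "committed": t0,
    "merged": t0 + timedelta(days=1),
    "built": t0 + timedelta(days=1, minutes=10),
    "tested": t0 + timedelta(days=1, minutes=35),
    "deployed": t0 + timedelta(days=1, minutes=40),
}
step, total = slowest_stage([change_a, change_b])
print(f"Biggest contributor to lead time: {step} ({total})")
```

In this invented example, waiting to merge dwarfs building, testing, and deploying combined, which is the long-lived feature branch problem showing up in the numbers. Once that stage shrinks, running the same measurement again points to the next candidate.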

Having a culture of continuous learning in your organization will help you become better across the board. You won’t stop improving once you’ve made the transition from deploying once a month to deploying every two weeks. You’ll continue working to identify what’s preventing you from doing daily deployments. And deployments are just one example. What about code coverage? Manual testing? Immutable infrastructure? The list goes on. But the idea is that you’ll always have a chance to improve what remains.

Stop Pointing Fingers

There’s nothing more harmful in a culture than pointing fingers when something goes wrong. No one who caused downtime ever said, “I’m bored. Let’s bring the system down, just for fun.” (Well, unless they’re hackers.) And if colleagues make someone feel guilty after causing downtime, will that person admit fault the next time? Not likely.

Failure is inevitable, so organizations should embrace it. It’s better to find mechanisms to reduce the time it takes to recover from failure than to try to avoid mistakes. One way to avoid failure is to do deployments less frequently, but that stops innovation.

What you need in your organization is a blameless culture.

This is a culture that accepts that we all make mistakes sometimes. Being able to say, “I screwed up. I’m sorry. Can someone help me fix it?” will help your organization focus on a solution rather than the problem. And if there’s no stigma associated with guilt, the team can be honest about who else contributed to the problem.

It’s important to build trust in your team. Making mistakes shouldn’t be a punishable offense. Instead, you should anticipate failure and then fail on purpose. Netflix does it all the time—they bring servers down just to test their fault-tolerant architecture.

Now, telling management that you want to intentionally cause problems doesn’t sound good. But what if you did it in an environment other than production? What if it only affected certain users? These are ways that you can enable experimentation and build trust in the team.
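As a rough illustration of those two questions, here is a small sketch of fault injection that stays out of production and only touches a fixed slice of users. Everything in it (the function names, the environment string, the 5 percent default) is an assumption made for the example, not a description of how Netflix or any particular tool does it.

```python
import hashlib


class InjectedFault(Exception):
    """Raised on purpose to exercise the error-handling path."""


def in_experiment(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets.

    The same user always lands in the same bucket, so the set of
    affected users stays stable between requests.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def handle_request(user_id: str, environment: str, percent: int = 5) -> str:
    # Assumption for this sketch: faults are never injected in production.
    if environment != "production" and in_experiment(user_id, percent):
        raise InjectedFault(f"simulated outage for {user_id}")
    return "ok"


# Hypothetical usage: count how many of 1,000 staging users hit the fault.
affected = 0
for n in range(1000):
    try:
        handle_request(f"user-{n}", environment="staging")
    except InjectedFault:
        affected += 1
print(f"{affected} of 1000 staging users saw the simulated outage")
```

Hashing the user ID instead of picking users at random keeps the experiment repeatable: if a user reports a problem, you can tell whether they were in the affected group.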

Let your people learn from their mistakes. Find ways to prevent those things from happening again. When something bad happens, it’s not about one person. Stop pointing fingers so everyone accepts responsibility for their mistakes. A healthy organization embraces failure and learns from it.

Enable Production-Like Environments

High performers have a low failure rate. Deployments don’t usually fail because high-performing teams deploy in small batches, they have practices in place that reduce lead time, and most importantly, they have a way to recover quickly in case of any failure.

How is that possible?

Well, one option is to have a production-like environment. It lets everyone experiment with what it’s like to work on a production server. And it lets the team practice deployments as if they were doing it in a production environment.

For example, if you’re using Jenkins as your CI tool, you can make sure every developer has their own copy of Jenkins installed locally. Why? Having it running locally on their machines will let developers test their code before integrating it into source control. The flow will be like this: make some changes, pull the latest changes from the repository, fix any conflicts, simulate the integration locally, and if it works, push the changes. By doing this, the practice of continuous integration will have the team always ready to deliver.

Work on building a culture that’s always ready by practicing before going live. This includes building, packaging, testing, deploying, rolling back…you name it. You’ll avoid surprises when going live, and that will reduce your failure rate and increase trust in the team.

Do It Again, but Better Every Time

You’ll start seeing the benefits of everything I described here once you’ve made it a habit. After all, practice makes perfect. People might not see the value of these initiatives right away, but don’t get frustrated. Just keep pushing until everyone gets it.

Fostering a safe environment for innovation and productivity should be the main goal in every DevOps implementation. Building trust in the team brings enormous benefits because members will have a feeling of ownership when doing things.

When you reward certain behaviors, you see that behavior more often. So make sure everything you reward contributes to building a high-performance culture.

Christian Melendez