Pitfalls with DevOps at Scale

Mon 28 January 2019

Editorial note: I originally wrote this post for the Enov8 blog. You can check out the original here, at their site.

Let's get started by defining what DevOps is.

I know, I know; there are tons of definitions. But the one I like most is from Gene Kim:

DevOps is those set of cultural norms and technology practices that enable the fast flow of planned work from, among others, development, through tests into operations while preserving world-class reliability, operation and security. DevOps is not about what you do, but what your outcomes are.

Some might think that because DevOps is hard to implement, it’s not for everyone, especially not large organizations. That’s not true. Outcomes are what really matter, not how you get there. That’s why I like this definition; it shows that DevOps really is for everyone.

As with almost everything in IT, there are pitfalls that can keep you from reaching your expected outcomes. But before we get into them, remember that for the purposes of this article, we’re looking at DevOps at scale. So first, let’s define our terms.

What Does “At Scale” Mean?

An organization that’s operating at scale is able to grow to meet greater demand without too much hassle.

In a small startup, disruptive actions aren’t that difficult to implement—there’s not much to lose, and you need to meet deliverables fast in order to get more funds. With only a few people, it’s easier to communicate and reach agreements. But when more people are involved and the application becomes more critical, things start to change.

Organizations that run at scale have a whole host of issues to consider that small startups do not. They usually need to deal with compliance and auditors. They have to coordinate many developers, each with a different mindset and knowledge base. And sometimes, they depend on partners in order to solve problems faster, even if doing so costs more money in the short term.

Big, scaled-up organizations also like to make long-term plans and release app changes infrequently, because things going wrong could mean losing tons of money. It’s understandable, then, why these organizations are very cautious about anything that might speed up their work (like DevOps).

So yes, things are different at scale. And at scale, DevOps has some specific pitfalls that you need to take care of.

Difficult Coordination Between Teams

The more people are involved, the more difficult it is to coordinate teams. That’s not just because as humans we’re not great communicators. It’s also because having more people means that there are more dependencies and applications are more complex.

Not knowing the application’s dependencies is a big pitfall that can keep you from being able to confidently release code. If you don’t know how parts of your application relate to each other, you might end up fixing one thing while breaking another. It’s necessary to have a shared understanding of the impact of each component in the system.
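To make that shared understanding concrete, here is a minimal sketch of how you could answer "what might break if I change this component?" from a dependency map. The service names and the DEPENDS_ON map are made up for illustration; in practice you would build the map from whatever service catalog or configuration you already have.

```python
from collections import defaultdict, deque

# Hypothetical service map: each key depends on the services in its list.
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["auth"],
    "inventory": ["auth"],
    "search": ["inventory"],
    "auth": [],
}


def impacted_by(changed, depends_on):
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the map so we can walk from a service to its consumers.
    consumers = defaultdict(set)
    for service, deps in depends_on.items():
        for dep in deps:
            consumers[dep].add(service)

    # Breadth-first walk over the consumers, collecting each one once.
    seen = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for consumer in consumers[current]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return sorted(seen)


if __name__ == "__main__":
    print(impacted_by("inventory", DEPENDS_ON))
```

With this map, a change to inventory touches checkout and search, while a change to auth reaches every other service. Even a rough map like this is enough to start a conversation between teams before a release.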

I’ve heard that at Netflix there’s no single person who knows every part of the system in detail. That can be problematic, and it’s why Jeff Bezos says that, at Amazon, developers should always think about exposing services through APIs. Doing so means that every interaction between different teams is clearly documented, and everyone works in a culture that values knowledge-sharing.

Picking the Wrong Projects

You can’t eat an elephant in one bite; you need to eat it in small chunks. And you can’t apply DevOps to the organization all at once; you need to do it little by little. Every time you choose a project, you need to know all its dependencies, its impact, and how stable it is. This is especially crucial for the first project.

You first need to establish some base knowledge and start simple. Don’t get overwhelmed because everyone is talking about cool things like infrastructure as code, containers, or microservices. Sure, those things can help, but you don’t need all of that to improve the outcomes of DevOps. Why not simply start by getting rid of manual changes, such as hand-run deployments, and by leaving a trace so that every change is visible?
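As a rough illustration of "leaving a trace," here is a small sketch that appends one record per deployment to a log file. The function names, the field names, and the JSON-lines file format are my own assumptions for the example, not a standard; the point is only that a scripted deployment can record who changed what, and when.

```python
import json
import time
from pathlib import Path


def record_deployment(log_path, service, version, deployed_by, succeeded):
    """Append one deployment record as a JSON line and return it."""
    entry = {
        # UTC timestamp so records from different servers are comparable.
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": service,
        "version": version,
        "deployed_by": deployed_by,
        "succeeded": succeeded,
    }
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def read_deployments(log_path):
    """Load every recorded deployment, oldest first."""
    with Path(log_path).open(encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]
```

Calling record_deployment at the end of a deployment script is enough to get started. You can move the records to a proper audit system later, once the habit is in place.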

Find out what aspects of your application are not adding value and focus on those. Automation is just one example.

When starting out, pick a project that is complex enough to do interesting things but that has a low enough impact on revenue that it won’t make too big a mess if things don’t go well. You can then scale and replicate what you learn in the process. Soon you’ll see better outcomes as an organization, not just by project.

Picking the wrong projects could make everyone think that DevOps is making things worse.

Lack of a Well-Established Framework

Employees last an average of two years at big tech companies. Make sure you’ve established a framework that works for you before scaling DevOps. This will help with both onboarding new employees and ensuring that when someone leaves, their knowledge will be retained.

Just make sure you don’t create strict rules that are too difficult to change. For instance, you could make practices like infrastructure as code mandatory because you’ve noticed that the labor of maintaining infrastructure is not adding value. But you could let the team choose which tool to use to implement infrastructure as code.

Other practices you could build into your framework include trunk-based development, feature flags, test-driven development (TDD), avoiding SSH access to servers, and centralized logging, among others.

Let’s take the Google example. Google runs production systems at scale by having site reliability engineers (SREs) who sometimes function as temporary consultants. SREs make sure that the team stabilizes the project so that it becomes more reliable. They have a checklist of things that the team needs to do to make sure ensuing changes comply with Google’s standards.

Large organizations need a standardized way of working; otherwise, the turnover will be too problematic.

Lack of Production-Like Environments

Production-like environments can be difficult and expensive to build if you don’t design your systems from the start with homogeneous environments in mind.

Preparing test environments is no easy task; preparing production-like environments is even trickier. But if this process is not automated as self-service, the path to deploying in production could be painful.

Having a production-like environment doesn’t mean you’ll have an exact copy of the environment for development or testing. It means that if in production you have, say, a load balancer, you also have a load balancer everywhere else.

Your production-like environments do not have to be at the same scale as production. The quantity and capacity of the resources should be the only differences. If you need to do performance testing, that should just be a matter of scaling out the infrastructure.
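One way to picture this is to describe every environment with the same fields and then check that only the scale-related ones differ. The sketch below is just that, a sketch: the Environment fields, the SCALE_FIELDS set, and the example values are assumptions I made up for illustration.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class Environment:
    name: str
    load_balancer: bool
    app_servers: int       # quantity: allowed to differ
    instance_size: str     # capacity: allowed to differ
    database_engine: str
    cache: bool


# Fields where environments may legitimately differ.
SCALE_FIELDS = {"name", "app_servers", "instance_size"}


def topology_differences(reference, other):
    """Return the non-scale fields where `other` deviates from `reference`."""
    ref, oth = asdict(reference), asdict(other)
    return {
        key: (ref[key], oth[key])
        for key in ref
        if key not in SCALE_FIELDS and ref[key] != oth[key]
    }


PRODUCTION = Environment("production", True, 12, "large", "postgres", True)
# Staging is production with fewer, smaller servers: still production-like.
STAGING = replace(PRODUCTION, name="staging", app_servers=2, instance_size="small")
# Dev drops the load balancer, so it is no longer production-like.
DEV = replace(STAGING, name="dev", load_balancer=False)
```

Here topology_differences(PRODUCTION, STAGING) comes back empty, while the comparison with DEV reports the missing load balancer. A check like this could run before every environment change.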

Why is having homogeneous environments important? Well, they’ll let you do deployments the same way in all environments and run experiments before you go live. You don’t want any surprises when you make changes available to your users.

Lack of Meaningful Metrics

How sure are you that the things you’re doing are adding value? What’s the impact of automating certain manual processes? If you’re automating something that’s rarely used, the effort may not be worth it. If you’re using trunk-based development and spending too much time resolving merge conflicts, maybe you should switch to short-lived branches instead.

It’s important that every decision you make is based on data, not on assumptions or on the fact that everyone else is doing it. According to last year’s State of DevOps Report, these are the key metrics you should be measuring:

Deployment frequency. Small batches are less risky.
Time it takes to do deployments. It should be a deterministic and boring task.
Number of times a deployment has failed. There might be things you didn’t consider before going live.
Time it takes to recover from a failure. Things like rollbacks or self-healing architectures could help.

Measure meaningful things. Otherwise, you won’t know what’s working to improve the outcomes of DevOps.
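If you already keep a record of each deployment, the four metrics above take very little code to compute. The following is a minimal sketch; the Deployment record, its field names, and the summarize function are assumptions for the example, not something taken from the report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    started: datetime
    finished: datetime
    failed: bool = False
    recovered: Optional[datetime] = None  # when service was restored after a failure


def mean_minutes(deltas):
    """Average a list of timedeltas, in minutes (0.0 for an empty list)."""
    if not deltas:
        return 0.0
    return sum(deltas, timedelta()).total_seconds() / 60 / len(deltas)


def summarize(deployments, period_days):
    """Compute the four metrics for deployments made over `period_days` days."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    durations = [d.finished - d.started for d in deployments]
    # Recovery time is measured from the end of the failed deployment.
    recoveries = [d.recovered - d.finished for d in failures if d.recovered]
    return {
        "deployments_per_day": total / period_days,
        "mean_deploy_minutes": mean_minutes(durations),
        "failed_deployments": len(failures),
        "mean_recovery_minutes": mean_minutes(recoveries),
    }
```

Feed it the deployments from the last week or month and track how the numbers move over time. The trend matters more than any single value.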

Know Your Weakness and Get Stronger

DevOps outcomes take time to be seen. That’s to be expected—you need cultural change before you can make real progress with different processes like DevOps. And changing people’s mindsets is usually hard to do. It’s in our nature as humans to avoid the feeling of discomfort that accompanies change.

In your continuous DevOps journey, measurement is crucial. You need to know what and where your weaknesses are in order to get stronger and achieve better outcomes for the organization. All industries and organizations are different. Some will encounter all the pitfalls I mentioned in this list; others, only a few.

Remember: DevOps can benefit your organization only as long as you set yourself up for success. If you’re putting time and effort into improving outcomes but aren’t adding value, you aren’t serving the organization. Check to make sure you aren’t falling prey to one of these pitfalls before applying DevOps at scale.

Christian Melendez