What is site reliability engineering (SRE) and how is it different from DevOps?

Site reliability engineering (SRE) is Google’s model of service management, in which software engineers run production systems using a software engineering approach. Google is clearly an unusual case, and it often has to tackle software bugs and errors in unconventional ways. But having software engineers doing a …

Continue reading »

Why You Need an Error Budget and How to Make It Work

How many times have you seen Google go down? Not many, I bet. You might not even notice if it did. And if you did notice, you’d probably assume it was a problem with your internet connection.

But Google isn’t perfect. As Werner Vogels says, “Everything fails, all the time.” If …

Continue reading »