HTTP Monitor: What It Is and Why You Need ItTweet Mon 26 March 2018
Editorial note: I originally wrote this post for the Scalyr blog. You can check out the original here, at their site.
One day, one of our main web APIs was down, and the first person that knew it was my boss. We were so worried about bringing the API up that we never paid attention to how he was able to be one step ahead of us. There were times when we even thought he had nothing else to do than constantly refresh the web page. But the truth is that he wasn’t doing that at all. He was using an HTTP monitor that emailed him every time the API was down, slow, or unresponsive.
It was actually lucky for us that he had that monitor: it helped everyone fix things before our clients could notice. But what is an HTTP monitor, anyway? And why else would you need it?
It’s a Friend Who’s Watching Out For You All the Time
Think of an HTTP monitor like it’s a friend you’re paying to constantly check if your HTTP endpoints are working. That friend will hit the endpoints you tell him to, and he’ll follow the exact orders you gave him. Those instructions will be something like, “Hey, I need you to hit this endpoint every five minutes. And only call me if they’re responding with anything other than OK or if you haven’t heard anything back from them in less then a second.”
You can use an HTTP monitor not just for web APIs but also for any of your websites. All that matters is that the application you want to monitor can be accessed using HTTP. It’s also important that they’re publicly accessed. If they’re not, you can make use of a bastion host for critical-no-public endpoints. This is needed because an HTTP monitor is something that’s outside your network. Imagine that you or your provider have massive downtime. Your application will be down, but you also have that good friend who’s supposed to tell that you’re down. People also use HTTP monitors as part of the integration testing suite to validate a deployment.
HTTP monitors are really simple, but fault tolerance and high availability in these type of systems make them complicated. And I bet you have enough of a time trying to keep your application up and running to want to worry about something else. Scalyr helps you to alleviate the load by offering HTTP monitors to probe your servers and measure availability, status, and performance.
Some Metrics You Need to Know Before You Start
It’s not just about whether it works or it doesn’t work. In order to make better use of HTTP monitors, we need to understand a handful of metrics and codes you can use to know how bad, exactly, your app is behaving.
Let’s explore these metrics and codes:
HTTP Response Codes
There are several codes that will give you more information about what happened in a request. But when monitoring HTTP endpoints, the only code that you’ll probably use is the “200 OK.” If you don’t receive this code in the response, something is wrong.
They are grouped like this:
- 1XX is only informational.
- 2XX means success.
- 3XX means a redirection happened.
- 4XX means something in the user’s request is not correct.
- 5XX means that something is wrong in your servers.
So, it doesn’t make sense to tell your monitor to watch for any code other than the “200 OK.” If you get something different than this code, it will be useful to troubleshoot, not to monitor.
This is simply the payload you receive back when hitting the endpoint. It could be in various formats such as JSON, HTML, or even plain text. This is useful because you can configure your monitors to extract, parse, and analyze the text.
So for example, if the word “error” is found, you can trigger an alert even though you’ve received a 200-status code.
Or what about when, for example, an endpoint should return data for Ferraris and you’re getting back data for Lamborghinis? That could be caused by a data migration in the database. But it doesn’t really matter what caused it; it matters that you can be notified almost in real time automatically.
The response time is the time the endpoint takes to respond with a status code and a response body. It’s also called latency. Response time will depend on the endpoint you’re monitoring and the SLAs you’ve defined for it. If it’s an API, it usually needs to be in a range of 100 ms to 500 ms, but if it’s a website, the rage varies from 1 s to 2 s. But it depends on your use case.
There are different metrics that serve different purposes. You can combine them so the monitor will alert you only when something really bad is happening, and not just because there was a temporary problem. And it’s important that your applications are returning the proper code too. I’ve seen cases where the endpoint is returning a 200 code, but when I looked at the response body, it was clear that the application was throwing an error.
You Can React When Something’s Wrong
After you know what data is important when probing HTTP endpoints, the next step is to configure the monitor so that it can notify the proper people. This part is tricky, but it’s the most important step because you don’t want your monitor to become the boy who cried wolf.
For this reason, you might want to only be alerted when something can’t be recovered or healed by itself. Otherwise, the next time you receive an alert, you’ll rapidly think, “Nah, I always get these alerts. The recovery notifications will come after—they always do.” But no, after five minutes you start receiving calls from everyone, including your boss, telling you that the site is down.
So let me give you some examples of HTTP monitor configurations that have worked for me. An alert will be triggered when at least one of the following happens:
- The system received a status code that’s different than 200 OK.
- It takes more than 800 ms to respond back.
- The response body size in KBs is zero.
- The response JSON doesn’t include the “success” value in the “status” property.
- A custom header in the response isn’t found.
- The HTTPS endpoint is timing out after 800 ms and the status code is not 200 OK.
The idea is that when you receive an alert, you previously verified all the contexts of the request. Downtime could be caused by a ton of different reasons. It could be something with a server, and you just need to replace it. Or maybe an external dependency like others’ API endpoints are having problems and you just need to disable them temporarily. What happened doesn’t matter too much at the moment of being alerted. What matters is that you can act soon to reduce your system’s downtime.
You’ll Always Be One Step Ahead
Don’t let your clients be the first in line when reporting a problem in your system. Always be one step ahead of them and find the mechanisms to recover quickly. HTTP monitors will help you to minimize downtime by being aware. They’ll do this boring job for you.
These type of monitors will inform you about the uptime percentage of your system endpoints. They’ll let you know when endpoints aren’t loading or taking too much time. And it’s not just that. You can do interesting things like parse the response body and look for any unexpected results. So it’s important that the dev team knows what status should be returned depending on the result.
Know what you have at your disposal, and take advantage of it.