The Eclectic Art Of System Monitoring

The Problem

At Gnip, we deploy our software to hundreds of servers across multiple data centers and heterogeneous environments. Architecting the software itself and creating a streamlined deployment process at this scale is worth a full blog post of its own. In this post, though, I would like to focus on how to make sure that your full application stack is healthy all the time.

The Solution?

On the mechanical side the solution is very straightforward: you deploy a system monitoring tool that continuously watches all servers and all applications and sends you friendly messages when something does not look right. There are plenty of tools available for this purpose. We use Nagios, which has become, to some extent, an industry standard.

That was easy, right?

Sadly, that first sentence leaves a lot of details out, and these details can kill your application stack and your engineering team at the same time (either by not finding your problems, or by keeping your engineers up all night). In this post, I'd like to share how we at Gnip go about keeping software, hardware, and people-ware happy.

Watching All Applications

It is certainly not enough to just ensure that an application is running. In order to get meaningful insight, you have to add health checks that are specific to the application itself. For instance, in an application that consumes a continuous stream of data, you can be fairly certain that there is a problem if the transaction volume drops below a certain rate. Nagios integrates quite easily with this kind of application-specific health check. We decided to expose an HTTP endpoint on each of our applications that returns a digest of all health checks registered for that application. This allows the Nagios check to be uniform across all applications.

TIP: Standardize health APIs
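
To illustrate the idea, here is a minimal sketch in Python on the standard library. The check name, threshold, port, and /health path are all invented for the example; they describe the pattern, not our production code.

    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # Registry of application-specific health checks; each application adds its own.
    HEALTH_CHECKS = {}

    def health_check(name):
        """Decorator that registers an application-specific health check."""
        def register(fn):
            HEALTH_CHECKS[name] = fn
            return fn
        return register

    def current_transactions_per_second():
        return 120  # stand-in for a real metric lookup

    @health_check("stream_transaction_rate")
    def stream_transaction_rate():
        # Example check: a streaming consumer is suspect below a floor rate.
        rate = current_transactions_per_second()
        return {"healthy": rate >= 50, "detail": "%d tx/s" % rate}

    class HealthHandler(BaseHTTPRequestHandler):
        """Serves a uniform digest of all registered checks at /health."""
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            checks = {name: fn() for name, fn in HEALTH_CHECKS.items()}
            body = json.dumps({"checks": checks}).encode()
            # Always answer 200 and put the verdict in the body, so a single
            # generic monitoring check only has to parse JSON.
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("", 8080), HealthHandler).serve_forever()

Because every application answers in the same shape, the monitoring side needs exactly one check, no matter how many applications you run.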

Sending Friendly Messages

We configured our system so that certain types of alerts in our production system cause text messages to be sent to the engineer who is on call. All engineers participate in this on-call rotation. This process creates a strong sense of ownership, because it does not take many pages in the middle of the night to change an engineer's attitude towards writing reliable and scalable code. (If it doesn't bother the engineer herself, the grumpy spouse will have a strong opinion, too.)

TIP: Every engineer participates in on-call rotation

We also have a set of health checks that send out emails instead of pages. We use these for situations where it appears that no immediate action can or needs to be taken.

TIP: Categorize actionable and non-actionable alerts
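
One way to wire this up, sketched against the /health digest from the example above: a generic Nagios plugin that uses the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and downgrades failures it knows to be non-actionable. Which states page and which only email is then a matter of your notification configuration. The EMAIL_ONLY set, the "replay_backlog" check name, and the default URL are invented for the sketch.

    import json
    import sys
    from urllib.error import URLError
    from urllib.request import urlopen

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

    # Hypothetical list of checks whose failures are known to be non-actionable.
    EMAIL_ONLY = {"replay_backlog"}

    def main():
        url = sys.argv[1] if len(sys.argv) > 1 else "http://localhost:8080/health"
        try:
            digest = json.load(urlopen(url, timeout=10))
        except (URLError, ValueError):
            print("UNKNOWN - health endpoint unreachable or unparsable")
            return UNKNOWN
        failed = [name for name, check in digest["checks"].items()
                  if not check["healthy"]]
        actionable = [name for name in failed if name not in EMAIL_ONLY]
        if actionable:
            print("CRITICAL - failing checks: %s" % ", ".join(actionable))
            return CRITICAL
        if failed:
            print("WARNING - non-actionable failures: %s" % ", ".join(failed))
            return WARNING
        print("OK - %d checks healthy" % len(digest["checks"]))
        return OK

    if __name__ == "__main__":
        sys.exit(main())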

What Does Healthy Mean Anyway?

In this setup it quickly becomes apparent that finding the right thresholds for triggering alerts is non-trivial. Because inbound data volumes are inconsistent, it is easy to set thresholds that are too aggressive, which leads to a flood of meaningless noise drowning out critical failures. Set thresholds that are too lax, on the other hand, and you'll never find out about real problems. So, instead of using your gut instinct to set thresholds, mine data like log files and metrics to figure out what thresholds would historically have been appropriate.

TIP: Mine metrics for proper thresholds
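
Here is what that mining can look like in practice, sketched against an invented CSV export of per-minute transaction counts: sort a few weeks of history, look at how low the volume legitimately gets, and derive the alert floor from that number instead of from a guess.

    import csv

    def suggested_floor(metrics_csv, percentile=1.0, margin=0.8):
        """Suggest an alert floor from historical per-minute transaction counts.

        The CSV layout (timestamp, transactions_per_minute) and the file name
        are invented for this sketch; substitute whatever your metrics store
        actually exports.
        """
        counts = []
        with open(metrics_csv) as handle:
            for _timestamp, count in csv.reader(handle):
                counts.append(int(count))
        counts.sort()
        # The value below which only `percentile` percent of history falls.
        index = max(0, int(len(counts) * percentile / 100.0) - 1)
        historical_low = counts[index]
        # Alert a bit below the historical low so normal dips stay quiet.
        return int(historical_low * margin)

    if __name__ == "__main__":
        print("suggested alert floor:", suggested_floor("transactions_per_minute.csv"))

The same approach works for ceilings, latencies, and queue depths; the point is that the threshold comes out of your own history rather than out of thin air.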

Self Healing

As it turns out, you periodically get the same alert for the same reason, and the fix is to manually step through some actions that usually resolve the problem. At that point, the obvious question is why not automate the fix and trigger it directly when the health check detects the problem. The tendency can be for engineers to postpone the automation effort, partly because it is not interesting feature work, and partly because automating the manual steps can be tremendously tedious, since they frequently involve tools that application developers are not regularly exposed to (cron, bash, etc.).

Sometimes you might have to spend a couple of hours automating something that takes one minute to do manually. The importance of reducing context switches away from feature work cannot be stressed enough, and the investment will always pay off.

TIP: Application developers learn a new tool every week!
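
To make that concrete, here is a sketch of what such an automated fix can look like as a Nagios event handler, which Nagios can run whenever a service changes state. The arguments mirror the usual $SERVICESTATE$ / $SERVICESTATETYPE$ / $SERVICEATTEMPT$ macros; the service name and the restart command are invented for the example, not a recipe.

    #!/usr/bin/env python3
    import subprocess
    import sys

    def main(state, state_type, attempt):
        # Only act on confirmed (HARD) critical states, so a transient blip
        # does not trigger a restart.
        if state != "CRITICAL" or state_type != "HARD":
            return 0
        print("attempting automated fix (attempt %s): restarting stream consumer" % attempt)
        result = subprocess.run(
            ["systemctl", "restart", "stream-consumer"],  # hypothetical unit name
            capture_output=True, text=True)
        if result.returncode != 0:
            print("automated fix failed, leaving it for the on-call engineer:")
            print(result.stderr)
        return result.returncode

    if __name__ == "__main__":
        # Nagios passes the three macros as arguments in the command definition.
        sys.exit(main(*sys.argv[1:4]))

Even a handler this small removes one recurring class of middle-of-the-night pages, which is exactly the point of the exercise.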

Conclusion

One can categorize issues in a production system by assessing whether the issue raised an alert, whether the issue requires manual intervention, and whether the issue is an actual problem or just a reporting flaw. The following Venn diagram illustrates a situation that leads to unhappy customers and tired engineers:

So, for a smoothly running engineering organization, you clearly want to focus your energy on getting to a better state.

TIP: Get to this state: