Tracking Cluster Health With Redis

Historical PowerTrack Background

Our new Historical PowerTrack for Twitter allows you to run filtered searches across the entire Twitter archive. At its heart, it is a cluster of machines where each machine runs a handful of Resque workers that do all of the heavy lifting of processing and filtering the data.

Like any good production application, we have an abundant amount of health reporting in place to monitor problems and quickly respond to issues.

Resque itself runs on top of Redis, an open source data structure server. Since we already have our Redis infrastructure set up to power Resque, we can leverage Redis to track our workers’ health information.

One of the things we track per worker is an “error rate”. The error rate is the number of errors seen in the last 30 minutes. Once that climbs above a threshold, we send an alert so that we can examine why that worker is having issues.

Implementation 0

Using Resque Job Hooks, a worker is notified when it fails to process a job.

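module HealthChecks
  # Resque calls on_failure_* hooks with the exception and the job arguments.
  def on_failure_record_health_status(exception, *args)
    # Redis code will go here
  end
end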

Inside that callback, we then simply create a new key to represent the failure:

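setex gnip:health:worker:<worker_id>:error_instance:<unix_timestamp> 1800 ""

SETEX sets the value and the expiry in one step, so the key is created with a time to live of 1800 seconds (30 minutes). The value itself is empty because only the existence of the key matters.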

Then in our health checking code we can simply get the count of error_instances for a given worker_id:

keys gnip:health:worker:<worker_id>:error_instance:*

Since an error instance key dies after 30 minutes, the size of the resulting set from `KEYS` will tell us the error rate! Huzzah!
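As a rough illustration, the two steps above might look like this in Ruby. This is a sketch, not our production code: it assumes a client that responds to `setex` and `keys` the way the redis-rb gem's `Redis` object does, and `FakeRedis`, `record_failure`, and `error_rate` are hypothetical names. The in-memory stand-in is there only so the example runs without a server.

```ruby
# Hypothetical in-memory stand-in for the two Redis commands used here,
# so the sketch runs without a server. With the redis-rb gem you would
# pass Redis.new instead.
class FakeRedis
  def initialize(clock = -> { Time.now.to_i })
    @clock = clock
    @expires_at = {} # key => unix time at which the key expires
  end

  # SETEX key seconds value (the value is ignored in this stand-in)
  def setex(key, ttl, _value)
    @expires_at[key] = @clock.call + ttl
    "OK"
  end

  # KEYS pattern: looks at every key, which is what makes it slow on real Redis
  def keys(pattern)
    now = @clock.call
    @expires_at.select { |key, at| at > now && File.fnmatch(pattern, key) }.keys
  end
end

THIRTY_MINUTES = 30 * 60

# One key per failure; it expires by itself after 30 minutes.
def record_failure(redis, worker_id, now = Time.now.to_i)
  redis.setex("gnip:health:worker:#{worker_id}:error_instance:#{now}", THIRTY_MINUTES, "")
end

# The error rate is the number of failure keys still alive.
def error_rate(redis, worker_id)
  redis.keys("gnip:health:worker:#{worker_id}:error_instance:*").size
end
```

Inside the failure hook you would call something like `record_failure(redis, worker_id)`, and the health checker would compare `error_rate(redis, worker_id)` against its threshold.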

While this solution works, and was quick to get up and running, there was a catch, namely this tidbit from the Redis documentation on the KEYS command:

“Warning: consider KEYS as a command that should only be used in production environments with extreme care. It may ruin performance when it is executed against large databases. This command is intended for debugging and special operations, such as changing your keyspace layout. Don’t use KEYS in your regular application code. If you’re looking for a way to find keys in a subset of your keyspace, consider using sets.”

Our first implementation is workable, but not very performant or safe. Luckily the docs give us a clue in the warning message: sets.

Current Implementation

Redis describes itself as a data structure server. This means that the value for a given key can be a basic string or number, but can also be a Hash, List, Set, or Sorted Set.

What we’re interested in are Sorted Sets. They work just like normal Sets: members must be unique (just like mathematical sets), and you can union, intersect, and difference them. However, each member also has an associated score, which does not have to be unique, and Redis automatically keeps the set sorted by that score.

Sorted Sets have an associated command that is crucial for the current implementation:

ZCOUNT key min max – Returns the number of elements in the sorted set at key with a score between min and max.

ZCOUNT counts elements by *score*, not by the value of the members. It is efficient too: because Redis already keeps the set sorted by score, the documented time complexity is O(log(N)) for a set of N elements.
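To make the semantics concrete, here is a toy model of a sorted set in plain Ruby. `ToySortedSet` is a hypothetical class for illustration only and is not how Redis implements the structure; it just shows that members are unique, scores may repeat, and the count is taken over scores.

```ruby
# Toy model of a Redis sorted set: unique members, each with a score.
# Hypothetical class for illustration; real Redis keeps members ordered
# by score instead of scanning them as this model does.
class ToySortedSet
  def initialize
    @scores = {} # member => score
  end

  # Like ZADD: adding an existing member only updates its score.
  def zadd(score, member)
    @scores[member] = score
  end

  # Like ZCOUNT: count members whose score is within min..max (inclusive).
  def zcount(min, max)
    @scores.count { |_member, score| score >= min && score <= max }
  end

  # Members ordered by score, ties broken by member.
  def members
    @scores.sort_by { |member, score| [score, member] }.map(&:first)
  end
end

set = ToySortedSet.new
set.zadd(10, "a")
set.zadd(10, "b") # same score as "a" is fine: scores are not unique
set.zadd(30, "c")
set.zadd(20, "a") # "a" already exists, so this just moves it to score 20

puts set.members.inspect # ["b", "a", "c"]
puts set.zcount(10, 20)  # 2
```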

The current implementation now works like this:

Every time an error occurs, we record the Unix timestamp (seconds since Jan 1st 1970).

We then add a new member to the error rate set for a worker, where both the member and the score are the Unix timestamp. We also reset the expiration for the key to 30 minutes, or 1800 seconds (fewer things to manually clean up later).

zadd gnip:health:worker:<worker_id>:error_instances <unix_ts> <unix_ts>
expire gnip:health:worker:<worker_id>:error_instances 1800

To calculate the error rate for a worker we can then do:

zcount gnip:health:worker:<worker_id>:error_instances <unix_timestamp_30_minutes_ago> <unix_timestamp_now>

Because our score is a timestamp, old errors “fall out” of our view when calculating the error rate. Then after 30 minutes of no errors, the key disappears.
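Put together, the recording and counting sides might look roughly like the Ruby below. It is a sketch under assumptions: the client is expected to respond to `zadd(key, score, member)`, `expire(key, seconds)`, and `zcount(key, min, max)` as the redis-rb gem's `Redis` object does, and `ErrorRate` and `FakeSortedSetRedis` are hypothetical names. The stand-in exists only so the example runs without a server.

```ruby
# Hypothetical in-memory stand-in for ZADD / EXPIRE / ZCOUNT so the sketch
# runs without a server. With redis-rb you would pass Redis.new instead.
class FakeSortedSetRedis
  def initialize(clock = -> { Time.now.to_i })
    @clock = clock
    @sets = {}       # key => { member => score }
    @expires_at = {} # key => unix time at which the key expires
  end

  def zadd(key, score, member)
    purge(key)
    (@sets[key] ||= {})[member] = score
  end

  def expire(key, seconds)
    purge(key)
    @expires_at[key] = @clock.call + seconds if @sets.key?(key)
  end

  def zcount(key, min, max)
    purge(key)
    @sets.fetch(key, {}).count { |_member, score| score >= min && score <= max }
  end

  private

  # Drop the whole key once its expiry has passed, as Redis would.
  def purge(key)
    at = @expires_at[key]
    return unless at && at <= @clock.call
    @sets.delete(key)
    @expires_at.delete(key)
  end
end

class ErrorRate
  WINDOW = 30 * 60 # seconds

  def initialize(redis)
    @redis = redis
  end

  # Record one failure: the timestamp is both the member and the score.
  def record(worker_id, now = Time.now.to_i)
    @redis.zadd(key(worker_id), now, now)
    @redis.expire(key(worker_id), WINDOW)
  end

  # Number of failures with a timestamp in the last 30 minutes.
  def count(worker_id, now = Time.now.to_i)
    @redis.zcount(key(worker_id), now - WINDOW, now)
  end

  private

  def key(worker_id)
    "gnip:health:worker:#{worker_id}:error_instances"
  end
end
```

The failure hook would call `record`, and the health checker would call `count` and compare the result against the alert threshold. Passing `now` explicitly is only a convenience for trying the sketch with made-up times.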

Redis commands are also very forgiving of state, so a ZCOUNT against a non-existent key is 0, which is usually going to be the happy path.

What’s really great is that instead of one worker generating multiple keys to represent errors, we now only create one key for each worker to represent the error rate.

Other Considerations

There are a few other ways we could have implemented this feature. We could have used milliseconds, but in our app it’s highly unlikely that multiple errors would be thrown in a second.
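To see what second resolution costs, recall that the timestamp is the set member and members are unique, so two failures in the same second are stored as one. The plain Ruby below illustrates that with made-up failure times; `distinct_members` is a hypothetical helper, not part of our code.

```ruby
# Each failure time becomes a set member, so failures that map to the
# same member are only counted once. Times are given in milliseconds.
def distinct_members(failure_times_ms, resolution)
  members = failure_times_ms.map do |ms|
    resolution == :milliseconds ? ms : ms / 1000
  end
  members.uniq.size
end

# Three failures, two of them within the same second (made-up values).
failures_ms = [1_700_000_000_120, 1_700_000_000_870, 1_700_000_003_400]

puts distinct_members(failures_ms, :seconds)      # 2
puts distinct_members(failures_ms, :milliseconds) # 3
```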

We could also store the exception message (or even the full stack trace) as the member in the set instead of repeating the timestamp (perhaps prefixed with the timestamp to keep it unique), and then use ZRANGEBYSCORE, just as we used ZCOUNT, to get the list of errors in the last 30 minutes.
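That variation could be sketched as follows. As before, this assumes a client that responds to `zadd(key, score, member)` and `zrangebyscore(key, min, max)` like the redis-rb gem's `Redis` object; `FakeRedisWithRange`, `record_error`, and `recent_errors` are hypothetical names, and the stand-in is only there so the example runs without a server.

```ruby
# Hypothetical in-memory stand-in for ZADD / ZRANGEBYSCORE so the sketch runs
# without a server. With redis-rb you would pass Redis.new instead.
class FakeRedisWithRange
  def initialize
    @sets = Hash.new { |hash, key| hash[key] = {} } # key => { member => score }
  end

  def zadd(key, score, member)
    @sets[key][member] = score
  end

  # Members with min <= score <= max, ordered by score (ties by member).
  def zrangebyscore(key, min, max)
    @sets[key]
      .select { |_member, score| score >= min && score <= max }
      .sort_by { |member, score| [score, member] }
      .map(&:first)
  end
end

WINDOW_SECONDS = 30 * 60

def errors_key(worker_id)
  "gnip:health:worker:#{worker_id}:error_instances"
end

# Prefix the message with the timestamp so repeated messages stay unique members.
def record_error(redis, worker_id, message, now = Time.now.to_i)
  redis.zadd(errors_key(worker_id), now, "#{now}:#{message}")
end

# Messages recorded in the last 30 minutes, oldest first.
def recent_errors(redis, worker_id, now = Time.now.to_i)
  redis.zrangebyscore(errors_key(worker_id), now - WINDOW_SECONDS, now)
end
```

The error rate is then simply the length of the list returned by `recent_errors`, at the cost of storing more data per failure.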

While this health check (and others like it) was written while developing our product, it is not specific to it. Would you, dear Reader, find it helpful if Resque tracked things like error rate, sequential failures, etc. per worker (either in Resque core or in an add-on gem)?

To wrap up, Redis has a bunch of great commands and data types to solve real problems quickly and easily! Consider this pattern the next time you’re looking to keep track of the rate of events in a given time window.