Enriching With Redis: Part 1

We’ve talked previously about how we use Redis to Track Cluster Health for our Historical PowerTrack product.

Now we’ll be exploring how we enrich with Redis.

Adding the Bling

For every activity that flows through Gnip, we add extra metadata that provides information our customers are interested in. We call this extra metadata Enrichments, and it’s what allows customers to filter on URLs or Profile Location.

As you can imagine, enriching every activity in realtime requires a lot of work, and we have to make sure we don’t add significant delay to the realtime stream. When you think of an enrichment such as Profile Location, it’s hard to imagine doing all the work of matching the user’s bio location for every single activity as we see it.

Instead, we depend on “pre-computing” as much as we can, caching that pre-chewed data, and then in realtime quickly fetching and enriching a particular activity.

We still need something low-latency and fast to serve as our cache, and Redis fits that bill pretty nicely.

Profile Geo

For this post, we’ll use our Profile Geo enrichment as the example for all of our work. If you’re not familiar with this enrichment you can read more about it here, but as a quick overview: this enrichment takes the opaque, user-defined profile location string from each activity and attempts to find a matching, structured geo entity. This adds metadata like name, timezone, long/lat center point, and hierarchical info (e.g. Boulder is part of Colorado). Our PowerTrack customers are then able to filter on this new data.

Infrastructure

Non-realtime processes are responsible for crunching the relevant enrichment data and storing it in the Redis cache. The data is stored as Google protocol buffers, which keeps fetching and parsing quick and reduces memory pressure in Redis.

[Diagram: locaterator]

For Profile Geo, we have an app that sniffs for geotagged activities. When it receives one, it attempts to geolocate the location string in that user’s profile. If the geo entity we find based on the profile location is close to the longitude/latitude in the activity, we store the mapping between that profile location string and the geo entity. If that mapping already exists, we increase a “score” for it. Later, when our worker process performs the actual enrichment, it pulls the full set of mappings for that profile location string and chooses the best one based on score and other signals.
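
One natural way to model those scored mappings in Redis is a sorted set per normalized profile location string. This is only a sketch of the idea (the key prefix and class name here are made up, not our actual schema), but it shows the shape of the write and read paths:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.Tuple;

public class ProfileLocationScores {
    // Hypothetical key prefix; illustrative only.
    private static final String PREFIX = "profile_location_candidates:";

    private final Jedis jedis;

    public ProfileLocationScores(Jedis jedis) {
        this.jedis = jedis;
    }

    // A geotagged activity confirmed that this profile location string maps to
    // this geo entity: bump the score for that mapping by one.
    public void recordMatch(String normalizedLocation, long geoEntityId) {
        jedis.zincrby(PREFIX + normalizedLocation, 1.0, Long.toString(geoEntityId));
    }

    // The enrichment worker asks for the highest-scoring candidate. (The real
    // worker also weighs other signals beyond the raw score.)
    public String bestCandidate(String normalizedLocation) {
        for (Tuple t : jedis.zrevrangeWithScores(PREFIX + normalizedLocation, 0, 0)) {
            return t.getElement();
        }
        return null; // nothing cached for this location string
    }
}
```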

Redis really shines in this type of low-latency, batched workload. The main Redis instance in the Profile Location cache performs about 20k operations per second over a working set of 2.8GB.

Redis meet Gnip, Gnip meet Redis

Each Redis host also runs a Gnip-specific app that ties the host into our own health checks and configuration rollout. The app does a few extra things, like periodically uploading RDBs to S3 as insurance against losing an entire machine.

Jedis

We make use of the excellent Jedis library to access all the data that we store in our various Redis caches.
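
Day to day that looks like plain Jedis code pulled from a connection pool. A minimal sketch (the host name and key are illustrative):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

public class CacheClientExample {
    public static void main(String[] args) {
        // Pooled connections so worker threads don't share a single socket.
        JedisPool pool = new JedisPool(new JedisPoolConfig(), "redis-cache-01", 6379);
        try (Jedis jedis = pool.getResource()) {
            System.out.println(jedis.ping());                  // "PONG" if the host is healthy
            byte[] raw = jedis.get("example_key".getBytes());  // binary-safe fetch of a cached value
            System.out.println(raw == null ? "miss" : raw.length + " bytes");
        }
        pool.close();
    }
}
```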

Failover

We built in a simple notion of “failover”. All of our Redis instances participate in master-slave replication, and our “failover” amounts to: given a priority-sorted list of Redis hosts, always try to use the highest-priority host.

For example, if the master goes down, every client fails over to querying the slave (the app responsible for populating the cache is never configured to see the slave, so it simply stops writing). When the master comes back online, everyone starts querying it again.
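
A sketch of that client-side selection logic is below. The host list, port, and health check are illustrative; in our real setup this ties into the Gnip-specific health checking mentioned above.

```java
import java.util.List;

import redis.clients.jedis.Jedis;

public class PriorityFailover {
    // Hosts sorted by priority: master first, then slave(s).
    private final List<String> hostsByPriority;

    public PriorityFailover(List<String> hostsByPriority) {
        this.hostsByPriority = hostsByPriority;
    }

    // Always try the highest-priority host first, and only fall down the list
    // when a host is unreachable.
    public Jedis connect() {
        for (String host : hostsByPriority) {
            Jedis jedis = new Jedis(host, 6379);
            try {
                jedis.ping(); // throws if the host is down or not responding
                return jedis;
            } catch (Exception e) {
                jedis.close(); // this host is out; try the next one
            }
        }
        throw new IllegalStateException("No Redis host available");
    }
}
```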

This lets us ride out operational issues such as network hiccups or node failures, and it also lets us do things like upgrade Redis live without having to stop delivering enrichments.

This is a trade-off between availability and correctness. We’re OK with serving slightly stale data if it means we can still return a reasonable answer, rather than making the entire enrichment unavailable for everyone.

Optimizations

Pipelining

We are also able to get such good performance by using a feature called pipelining. Redis is a request/response protocol: send a command, wait, read the response. While that’s simple, a network round trip per command is not going to give you good throughput when you need to keep up with fast-moving streams (like the Twitter firehose). But Redis also ships with a time-honored feature, pipelining, whereby you send multiple commands at once and then wait for the responses to all of them. This is what allows us to bulk write to and read from our caches.
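
With Jedis, a pipelined bulk read looks roughly like this (a sketch; the class and method names are made up, the Jedis calls are real): queue all the commands, flush them in one round trip, then read the responses.

```java
import java.util.ArrayList;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;
import redis.clients.jedis.Response;

public class BatchedReads {
    // Fetch many cached values with a single network round trip
    // instead of one round trip per key.
    public static List<byte[]> fetchAll(Jedis jedis, List<String> keys) {
        Pipeline pipeline = jedis.pipelined();
        List<Response<byte[]>> pending = new ArrayList<>();
        for (String key : keys) {
            pending.add(pipeline.get(key.getBytes())); // queued locally, not yet sent
        }
        pipeline.sync(); // send everything, then read back all of the replies

        List<byte[]> values = new ArrayList<>();
        for (Response<byte[]> response : pending) {
            values.add(response.get()); // only safe to call after sync()
        }
        return values;
    }
}
```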

Hashing

When we first started proving out a cache with Redis, we used it in a pretty straightforward manner. Each individual geo entity was given its own key (a `geo_entities:` prefix followed by the entity id), with the value being the GPB (Google protocol buffer) representing all the interesting things about the entity (name, long/lat, etc.).

For example, a geo entity with an id of ‘1235594’ would get the key `geo_entities:1235594`, and the value for that key would be the binary protobuf.

However, this naive approach is wasteful with memory. Knowing that Redis stores small hashes very efficiently, and with some inspiration from this blog post, we tweaked our data model to get better memory performance.

We decided to take each geo entity id, divide it by 1000, and use the quotient to build the key. We then treat that key like a hash and use the remainder (id - (quotient * 1000)) as the field within that hash. By creating many small hashes, we give Redis a chance to encode the fields and values within each hash more efficiently, which dramatically reduces our memory usage. (You can read more about Redis memory optimization here.)

Again working with a geo entity with an id of ‘1235594’:

The original plan is simple; the key is `geo_entities:1235594`, and we do a simple SET:

SET ‘geo_entities:1235594’ <binary protobuf value>

Using hashes, things are a little bit more complicated:

key_id = 1235594 / 1000 = 1235 # integer division
field_id = 1235594 - (1235 * 1000) = 594

Given those, we just do an HSET:

HSET ‘geo_entities:1235’ ‘594’ <binary protobuf value>

Another example, with geo entity id 1235777:

HSET ‘geo_entities:1235’ ‘777’ <binary protobuf value>
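
Putting the bucketing together, a small helper along these lines does the key/field math and the binary HSET/HGET (a sketch; the class and method names are ours for illustration). One caveat from the Redis memory-optimization docs: the compact hash encoding only applies while the number of fields in a hash stays under `hash-max-ziplist-entries`, so that setting has to be at least the bucket size for this trick to pay off.

```java
import redis.clients.jedis.Jedis;

public class GeoEntityCache {
    private static final String PREFIX = "geo_entities:";
    private static final long BUCKET_SIZE = 1000;

    private final Jedis jedis;

    public GeoEntityCache(Jedis jedis) {
        this.jedis = jedis;
    }

    // geo entity 1235594 -> key "geo_entities:1235"
    private static byte[] keyFor(long geoEntityId) {
        return (PREFIX + (geoEntityId / BUCKET_SIZE)).getBytes();
    }

    // geo entity 1235594 -> field "594"
    private static byte[] fieldFor(long geoEntityId) {
        return Long.toString(geoEntityId % BUCKET_SIZE).getBytes();
    }

    // Store the serialized protocol buffer for one geo entity.
    public void put(long geoEntityId, byte[] serializedProtobuf) {
        jedis.hset(keyFor(geoEntityId), fieldFor(geoEntityId), serializedProtobuf);
    }

    // Fetch the serialized protocol buffer, or null if it isn't cached.
    public byte[] get(long geoEntityId) {
        return jedis.hget(keyFor(geoEntityId), fieldFor(geoEntityId));
    }
}
```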

However, the payoff is worth it. At the time of posting, we hold about 3.5 million individual (normalized) user-generated profile locations, each mapped to a candidate set of geo entities. Each member of that set has associated metadata that we use to improve and refine the system over time.

All told, that consumes about 3GB of memory in Redis. On average, that is around 100 bytes per unique user profile location. That is a pretty great use of space!

Same Time, Same Place

Next time, we’ll discuss another abstraction we’ve built over Jedis to ease interacting with Redis from our client applications.