Exploring S3 Read Performance

This blog post is a collaboration between Jud Valeski and Nick Matheson.

When you measure your S3 storage usage in petabytes, you’re dealing with a lot of data. Gnip has been a tremendously heavy S3 user since S3 came online, and it remains our large-scale durable storage solution.

Over the years, we’ve worked with a wide array of data access models to accomplish a variety of things. From backfilling a customer’s data outage from our archives, to creating low-latency search products, how and where we access data is an integral part of our thinking.

If you’ve ever worked with really large data sets, you know that the main challenge is moving them from application A to application B. As it turns out, sneakernet is often the best way.

Some of our datasets on S3 are large enough that moving them around is impractical, so we’ve had to explore various data access patterns and approaches that use S3 as big blob backend storage.

One such approach is using HTTP byte range requests into our S3 buckets in order to access just the blocks of data we’re interested in. This allows one to separate the data request from the actual “file” (which may be inconveniently large to transfer across the wire in low-latency access scenarios). It does require an index into those S3 buckets, however, and that index needs to be calculated ahead of time.
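To make the mechanics concrete, here is a minimal sketch of such a ranged read using the AWS SDK for Java; the bucket, key, and offsets below are placeholders, and in practice they would come from the precomputed index. Under the hood, withRange() simply sets the HTTP Range header on the GET, so only the requested bytes cross the wire.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    public class RangeReadExample {
        public static void main(String[] args) throws Exception {
            AmazonS3 s3 = new AmazonS3Client(); // credentials picked up from the environment

            // Placeholder values: in practice these come from a precomputed index
            // mapping logical records to (key, offset, length) entries.
            String bucket = "my-archive-bucket";
            String key = "archive/part-0001.gz";
            long start = 1048576L;               // byte offset of the block we want
            long end = start + (10 * 1024) - 1;  // 10KB block; Range end offsets are inclusive

            S3Object object = s3.getObject(new GetObjectRequest(bucket, key).withRange(start, end));
            InputStream in = object.getObjectContent();
            try {
                ByteArrayOutputStream block = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) >= 0) {
                    block.write(buffer, 0, read);
                }
                System.out.println("Fetched " + block.size() + " bytes");
            } finally {
                in.close(); // release the connection back to the pool
            }
        }
    }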

Gnip recently experimented with this concept, and here are our findings. Over the years we’ve run into a few folks who considered using S3 in this manner, and we wanted to share what we learned in hopes of furthering exploration of S3 as a blob storage device.

The Setup

Our tests explored a few combinations of node (EC2 instance) count, thread count, and byte range size. The node count was varied from 1 to 7, the thread count from 1 to 400, and the byte range size from 1KB to 10KB. The range lookups were performed randomly over a single bucket with roughly half a million keys and 100TB of data.
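To give a feel for the shape of the harness, here is a simplified sketch, assuming the AWS SDK for Java and a shared client per node; the bucket name, key list, and offset logic are hypothetical stand-ins for our actual index and configuration.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class RandomRangeLoadTest {
        private static final String BUCKET = "my-archive-bucket"; // hypothetical bucket name
        private static final int THREADS = 100;                   // varied from 1 to 400 in the tests
        private static final int RANGE_BYTES = 10 * 1024;         // 1KB and 10KB ranges were tested
        private static final long DURATION_SECONDS = 60;

        public static void main(String[] args) throws Exception {
            final AmazonS3 s3 = new AmazonS3Client();
            final List<String> keys = loadKeyIndex();
            final AtomicLong lookups = new AtomicLong();
            final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(DURATION_SECONDS);

            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int i = 0; i < THREADS; i++) {
                pool.submit(new Runnable() {
                    public void run() {
                        Random random = new Random();
                        while (System.nanoTime() < deadline) {
                            String key = keys.get(random.nextInt(keys.size()));
                            long start = (long) random.nextInt(1024 * 1024); // sketch: stay within the first 1MB
                            GetObjectRequest request = new GetObjectRequest(BUCKET, key)
                                    .withRange(start, start + RANGE_BYTES - 1);
                            try {
                                S3Object object = s3.getObject(request);
                                drain(object.getObjectContent());
                                lookups.incrementAndGet();
                            } catch (Exception e) {
                                // a production harness would count connect/read timeouts here
                            }
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(DURATION_SECONDS + 30, TimeUnit.SECONDS);
            System.out.printf("%.1f lookups/sec%n", lookups.get() / (double) DURATION_SECONDS);
        }

        // Read the response fully and close it so the connection can be reused.
        private static void drain(InputStream in) throws IOException {
            try {
                byte[] buffer = new byte[8192];
                while (in.read(buffer) >= 0) { /* discard; only timing matters */ }
            } finally {
                in.close();
            }
        }

        // Hypothetical: the real harness loads the precomputed index of keys (and object sizes) for the bucket.
        private static List<String> loadKeyIndex() {
            return Arrays.asList("archive/part-0001.gz", "archive/part-0002.gz");
        }
    }

Each node in a test ran an instance of something like this, with the thread count and range size varied per run.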

The Results

The investigation started with a focus on the scalability of a single node. The throughput of a single thread is not very compelling at ~8 lookups per second, but these lookups scale nearly linearly with increased thread count, even out to 400 threads on our c1.xlarge instance. At 400 threads making 10KB lookups, network saturation becomes an issue; sustained throughput of 40+ MB/s is high for a c1.xlarge in our experience.

The next phase of the investigation focused on scaling out by adding more nodes. For these tests there was a significant difference between the scalability of 1KB and 10KB lookups. With 10KB lookups, scaling beyond even 2 nodes yielded rapidly diminishing returns. In the context of this test, we were unable to find a similar limit for 1KB lookups, although there was a slight slope change at the last data point (7 nodes). It should be noted that these are limits for our specific bucket/key/partition structure and access pattern, and thus may not be representative of true S3 limits or other usage scenarios.

In addition to the scalability tests, we also tracked request latencies with a focus on the 90th, 95th, and 99th percentile times. Under light load, the 99th percentile times were in the 450-500ms range; under moderate and heavy load they degraded significantly, to roughly 1s and 6s respectively. The frequency of read/connect timeouts under moderate/heavy load increased significantly as well. There was significant variance in the percentile results, both over time of day/week and from minute to minute.
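For reference, the percentile numbers come from straightforward nearest-rank math over the recorded per-request latencies; a small sketch (with illustrative sample values only) looks like this.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class LatencyPercentiles {
        /** Nearest-rank percentile (e.g. 99.0) over recorded request latencies in milliseconds. */
        public static long percentile(List<Long> latenciesMillis, double pct) {
            List<Long> sorted = new ArrayList<Long>(latenciesMillis);
            Collections.sort(sorted);
            int index = (int) Math.ceil((pct / 100.0) * sorted.size()) - 1;
            return sorted.get(Math.max(0, index));
        }

        public static void main(String[] args) {
            // Illustrative numbers only; real samples come from timing each ranged GET.
            List<Long> samples = java.util.Arrays.asList(
                    120L, 95L, 480L, 200L, 310L, 150L, 90L, 1000L, 130L, 110L);
            System.out.println("p90=" + percentile(samples, 90) + "ms"
                    + " p95=" + percentile(samples, 95) + "ms"
                    + " p99=" + percentile(samples, 99) + "ms");
        }
    }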

The Bottlenecks

Overall, this approach scaled better than we had initially anticipated, but it did take some work on the client side to be as lean as possible. Things like keep-alive/connection pool tuning and releasing network resources as soon as possible (gotta read from and close those input streams) were all necessary to get to this level of throughput.
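As a rough sketch of what that client-side leanness looks like with the AWS SDK for Java: size the connection pool to the thread count, bound the connect and socket timeouts, and always drain and close the response stream so the connection goes back to the pool. The specific values below are illustrative, not our production settings.

    import com.amazonaws.ClientConfiguration;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    import java.io.IOException;
    import java.io.InputStream;

    public class LeanS3Client {
        public static AmazonS3 buildClient() {
            ClientConfiguration config = new ClientConfiguration();
            config.setMaxConnections(400);      // sized to the thread count so requests don't queue on the pool
            config.setConnectionTimeout(2000);  // fail fast on connect problems (illustrative value)
            config.setSocketTimeout(5000);      // bound read stalls rather than hanging a worker thread
            return new AmazonS3Client(config);
        }

        /** Read the ranged response fully and close it so the connection returns to the pool. */
        public static byte[] readRange(AmazonS3 s3, String bucket, String key,
                                       long start, long length) throws IOException {
            S3Object object = s3.getObject(
                    new GetObjectRequest(bucket, key).withRange(start, start + length - 1));
            InputStream in = object.getObjectContent();
            try {
                byte[] buffer = new byte[(int) length];
                int offset = 0;
                while (offset < buffer.length) {
                    int read = in.read(buffer, offset, buffer.length - offset);
                    if (read < 0) break; // server returned fewer bytes than requested
                    offset += read;
                }
                return buffer;
            } finally {
                in.close(); // skipping this leaks connections and stalls the pool
            }
        }
    }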

These tests started to bump into both per-node and aggregate bandwidth limits. On a per-node basis this is unlikely to be of practical concern, since at this level of throughput CPU was at 50% utilization and any ‘real application’ logic would likely become the actual bottleneck, effectively easing pressure off the network. We will continue to investigate the aggregate performance limits, as we suspect they have more to do with our particular data distribution, usage, and S3 partitioning, for which we have done limited tuning and investigation.

These tests should also be taken with a grain of salt, as they represent a sunny-day scenario that likely will not hold under the variety of conditions we see running production systems on Amazon.