Exploring S3 Read Performance

This blog post is a collaboration between Jud Valeski and Nick Matheson.

When you measure your S3 storage usage in petabytes, you’re dealing with a lot of data. Gnip has been a tremendously heavy S3 user since S3 came online, and it remains our large-scale durable storage solution.

Over the years, we’ve worked with a wide array of data access models to accomplish a variety of things. From backfilling a customer’s data outage from our archives, to creating low-latency search products, how and where we access data is an integral part of our thinking.

If you’ve ever worked with really large data sets, you know that the main challenge is moving them from application A to application B. As it turns out, sneakernet is often the best way.

Some of our datasets on S3 are large enough that moving them around is impractical, so we’ve had to explore data access patterns and approaches that use S3 as big blob backend storage.

One such approach is using HTTP byte range requests into our S3 buckets in order to access just the blocks of data we’re interested in. This allows one to separate the data request from the actual “file” (which may be inconveniently large to transfer across the wire in low-latency access scenarios). It does require an index into said S3 buckets, however, and that index needs to be calculated ahead of time.
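
To make the idea concrete, here is a minimal Ruby sketch of what such a ranged read looks like at the HTTP level. The bucket, key, and byte offsets are hypothetical (in practice they come from the precomputed index), and request signing for a private bucket is omitted for brevity.

require 'net/http'
require 'uri'

# Hypothetical object; the offsets would come from the precomputed index.
uri = URI('https://example-bucket.s3.amazonaws.com/archives/2013/01/blob.dat')

request = Net::HTTP::Get.new(uri.request_uri)
# Ask S3 for just the ~10KB block we care about instead of the whole "file".
request['Range'] = 'bytes=1048576-1058815'

response = Net::HTTP.start(uri.host, uri.port, :use_ssl => true) do |http|
  http.request(request)
end

puts response.code           # => "206" (Partial Content) for a satisfiable range
puts response.body.bytesize  # => 10240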

Gnip recently experimented with this concept, and here are the findings. Over the years we’ve run into a few folks who have considered using S3 in this manner, and we wanted to share our knowledge in hopes of furthering exploration of S3 as a blob storage device.

The Setup

Our tests explored a few combinations of node (ec2 instance) count, thread count and byte range size. The node count was varied from 1 to 7, the thread count from 1 to 400 and the byte range size between 1KB and 10KB. The range lookups were performed randomly over a single bucket with roughly half a million keys and 100TB of data.

The Results

The investigation started with a focus on the scalability of a single node. The throughput of a single thread is not very compelling at ~8 lookups per second, but these lookups scale near-linearly with increased thread count, even out to 400 threads on our c1.xlarge instance. At 400 threads making 10KB lookups, network saturation becomes an issue; sustained 40+ MB/s is high for a c1.xlarge in our experience.

The next phase of the investigation focused on scaling out by adding more nodes. For these tests there was a significant difference between the scalability of 1KB and 10KB lookups. With 10KB lookups, scaling beyond even 2 nodes yielded rapidly diminishing returns. In the context of this test we were unable to find a similar limit for 1KB lookups, although there was a slight slope change for the last data point (7 nodes). It should be noted that these are limits for our specific bucket/key/partition structure and access pattern, and thus may not be representative of true S3 limits or other usage scenarios.

In addition to the scalability tests, we also tracked request latencies with a focus on the 90th, 95th and 99th percentile times. Under light and moderate load, the 99th percentile times were both in the 450-500ms range. Under moderate and heavy load these increased significantly, to 1s and 6s respectively. The frequency of read/connect timeouts also increased significantly under moderate/heavy load. There was significant variance in the percentile results, both by time of day/week and from minute to minute.

The Bottlenecks

Overall, this approach scaled better than we had initially anticipated but it did take some work on the client side to be as lean as possible. Things like keep-alive/connection pool tuning and releasing network resources as soon as possible (gotta read from/close those input streams) were all necessary to get to this level of throughput.
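
To give a flavor of what “lean” means here, the sketch below shows the connection-reuse half of it in Ruby: many ranged reads issued over a single keep-alive connection, with each response body fully consumed so the socket is free for the next lookup. As before, the bucket, key, and offsets are hypothetical and request signing is omitted.

require 'net/http'

host = 'example-bucket.s3.amazonaws.com'
total = 0

# Net::HTTP.start with a block opens one TCP connection and reuses it for
# every request issued inside the block (HTTP keep-alive).
Net::HTTP.start(host, 443, :use_ssl => true) do |http|
  [0, 10_240, 20_480].each do |offset|
    request = Net::HTTP::Get.new('/archives/2013/01/blob.dat')
    request['Range'] = "bytes=#{offset}-#{offset + 10_239}"
    response = http.request(request)
    # The body must be fully read before the connection can serve the next
    # lookup; Net::HTTP buffers it for us in this (non-streaming) mode.
    total += response.body.bytesize
  end
end

puts "read #{total} bytes over one connection"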

These tests started to bump into both per-node and aggregate bandwidth limits. On a per-node basis this is unlikely to be of practical concern: at this level of throughput, CPU was at 50% utilization, and any ‘real application’ logic would likely become the actual bottleneck, effectively easing off the network. We will continue to investigate the aggregate performance limits, as we suspect they have more to do with our particular data distribution, usage, and S3 partitioning, for which we have done limited tuning/investigation.

These tests should also be taken with a grain of salt, as they represent a sunny-day scenario that will likely not hold under the variety of conditions we see running production systems at Amazon.

 

The Eclectic Art Of System Monitoring

The Problem

At Gnip, we deploy our software to hundreds of servers across multiple data centers and heterogeneous environments. Architecting the software itself and creating a streamlined deployment process at this scale is worth a full blog post of its own. In this post, though, I would like to focus on how to make sure that your full application stack is healthy all the time.

The Solution?

On the mechanical side the solution is very straightforward: you deploy a system monitoring tool that continuously watches all servers and applications and sends you friendly messages when something does not look right. There are plenty of tools available for this purpose. We use Nagios, which has become, to some extent, an industry standard.

That was easy, right?

Sadly, the first sentence of the previous paragraph leaves out a lot of details, and these details can kill your application stack and your engineering team at the same time (either by not finding problems, or by keeping your engineers up all night). In this post, I’d like to share how we at Gnip go about keeping software, hardware, and people-ware happy.

Watching All Applications

It is certainly not enough to just ensure that an application is running. In order to get meaningful insight, you have to add health checks that are particular to the application itself. For instance, in an application that consumes a continuous stream of data, you can be fairly certain that there is a problem if the transaction volume drops below a certain rate. Nagios allows quite easy integration with these kinds of application-specific health checks. We decided to expose an HTTP endpoint on each of our applications that returns a digest of all health checks registered for that application. This allows the Nagios check to be uniform across all applications.

TIP: Standardize health APIs
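
As a rough sketch of what that standardized endpoint can look like, here is a minimal Sinatra version with made-up check names; every application answers the same URL with the same digest format, so the Nagios side never changes.

require 'sinatra'
require 'json'

# Hypothetical application metric; in a real app this would be wired into the
# stream consumer being monitored.
def transactions_per_minute
  1_250
end

# Registry of application-specific health checks; each returns true when healthy.
HEALTH_CHECKS = {
  'transaction_rate_ok' => lambda { transactions_per_minute > 100 },
  'disk_space_ok'       => lambda { true }  # placeholder second check
}

# Uniform endpoint: Nagios polls /health on every application and gets back a
# digest of all registered checks plus an overall status code.
get '/health' do
  results = Hash[HEALTH_CHECKS.map { |name, check| [name, check.call] }]
  status(results.values.all? ? 200 : 500)
  content_type :json
  results.to_json
end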

Sending Friendly Messages

We configured our system so that certain types of alerts in our production system will cause text messages to be sent to the engineer who is on call. All engineers participate in this on-call rotation. This process instills a strong sense of ownership, because it does not take many pages in the middle of the night to change an engineer’s attitude towards writing reliable and scalable code. (If it doesn’t bother the engineer herself, the grumpy spouse will have a strong opinion, too.)

TIP: Every engineer participates in on-call rotation

We also have a set of health checks that send out emails instead of pages. We use these for situations where it appears that no immediate action can or needs to be taken.

TIP: Categorize actionable and non-actionable alerts

What Does Healthy Mean Anyway?

In this setup it quickly becomes apparent that finding the right thresholds for triggering alerts is non-trivial. Due to inconsistent inbound data volumes, it is easy to set thresholds that are too aggressive, which leads to a flood of meaningless noise drowning out critical failures. Set thresholds that are too lax, on the other hand, and you’ll never find out about real problems. So, instead of using gut instinct to set thresholds, mine data such as log files and metrics to figure out what thresholds would historically have been appropriate.

TIP: Mine metrics for proper thresholds
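
As a trivial example of what that mining can look like, a minimum-transaction-rate threshold could be derived from a low percentile of historical per-minute volumes rather than picked by feel. The file name and format below are hypothetical: one per-minute transaction count per line.

# Derive an alert threshold from history instead of gut instinct.
volumes = File.readlines('transactions_per_minute.log').map(&:to_i).sort

# Alert only when volume drops below what the slowest 1% of historical minutes
# looked like, so ordinary dips in inbound data don't page anyone.
threshold = volumes[(volumes.length * 0.01).floor]

puts "alert if transactions/minute < #{threshold}"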

Self Healing

As it turns out, you periodically get the same alert for the same reason, and the fix is to manually step through some actions that usually resolve the problem. At that point, the obvious question is why not automate the fix and trigger it directly when the health check detects the problem. The tendency can be for engineers to postpone the automation effort, partially because it is not interesting feature work. Additionally, it can be tremendously tedious to automate the manual steps, since they frequently involve tools that application developers are not regularly exposed to (cron, bash, etc.).

Sometimes you might have to spend a couple of hours automating something that takes one minute to do manually. Still, the importance of reducing context switches away from feature work cannot be stressed enough, and the investment will always pay off.

TIP: Application developers learn a new tool every week!
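
The automation itself is usually mundane. Below is a sketch of the kind of remediation script a monitoring check can trigger directly; the service name, health URL, and restart mechanism are all hypothetical.

#!/usr/bin/env ruby
# Hypothetical remediation: invoked when a specific, well-understood alert
# fires, it performs the same steps an engineer would otherwise do by hand.
SERVICE = 'flingerator'

restarted = system("service #{SERVICE} restart")
sleep 10
healthy = system("curl -sf http://localhost:8080/health > /dev/null")

# Exit non-zero (and escalate to a human) only if the automated fix did not take.
abort "automated restart of #{SERVICE} failed" unless restarted && healthy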

Conclusion

One can categorize issues in a production system by assessing whether an issue raised an alert, whether the issue requires manual intervention, and whether the issue is an actual problem or just a reporting flaw. The following Venn diagram illustrates a situation that leads to unhappy customers and tired engineers:

So, for a smoothly running engineering organization you clearly want to focus your energy on getting to a better state.

TIP: Get to this state:

Chef, EC2 Auto Scaling, and Spot Instances for Fun and Profit

If you have a problem with dynamic CPU and/or I/O requirements that is easily distributed across homogeneous nodes, EC2 Auto Scaling can be a great tool to add to your stack. If you can design your application in a way that gracefully handles nodes terminating while processing work, Amazon EC2 Spot Instances can provide great cost savings. We’ve used these technologies in conjunction with Resque with great success. This model would likely fit other job management frameworks such as Sidekiq or Gearman.
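
“Gracefully handling nodes terminating” mostly comes down to writing jobs that are safe to retry. A minimal sketch of what that can look like with Resque follows; the job, its arguments, and the output path are illustrative, not our production code.

require 'resque'

# A job designed to tolerate its worker node disappearing mid-run: the work is
# idempotent, so a re-enqueued job simply starts over.
class BackfillChunk
  @queue = :backfill

  def self.perform(customer_id, day)
    # Write results to a deterministic location keyed by the job's arguments,
    # so a retry overwrites partial output instead of duplicating it.
    File.write("/tmp/backfill-#{customer_id}-#{day}.json", "{}")
  end
end

# If the spot instance running the worker is terminated, this job can be
# re-enqueued and picked up by any other node.
Resque.enqueue(BackfillChunk, 42, '2013-01-15')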

If you’re in your own datacenter you can achieve similar functionality with Eucalyptus or OpenStack/Heat, but we’ll focus on using EC2 Auto Scaling as a concrete example for this blog post.  By the end of this post, you should have a good idea of how to set up your own EC2 Auto Scaling cluster using Chef and Spot instances.

Create a Package That Installs and Runs Chef
You’ll want to create a package (rpm, deb, etc.) that contains definitions of all the software necessary to start “doing work” on your newly provisioned worker node.  We managed to achieve this by creating an RPM that includes our chef cookbooks, using the process outlined in this previous blog post.  In the next step we’ll create an AMI that will install these cookbooks via yum when our node boots up and then run chef to turn our base server into a fully functional worker node.

Create Base AMI
We subscribe to the principle that you should have a small number of Amazon Machine Images (AMIs) that have only the minimal amount of information to bootstrap themselves.  This information is typically just:

1.  Which environment am I currently in? (review, staging, or prod)
2.  What type of node am I?  (worker node, web server, etc.)
3.  Where can I download the necessary cookbooks to provision myself? (configure yum or deb sources)

Following this principle, we have created an AMI that runs a “firstboot.sh” init.d script.  This first boot script configures the node to look at the appropriate yum repo and installs the RPM we created in the previous step.  This way the AMI can remain relatively static, and you can iterate on your bootstrap code without having to follow the cumbersome process of creating a new AMI each time.  After the cookbooks have been pulled down to the local filesystem, our first boot script runs chef-solo to install the necessary software to start “doing work”.

In order to create your own AMI, you’ll need to:

1.  Boot a new EC2 Server.
2.  Install Amazon EC2 Command Line Tools
3.  Customize your server to run your own “first boot” script that will install and run chef-solo.
4.  Run the following command to bundle your server to a compressed image in the /tmp directory:

ec2-bundle-vol -k /path/to/aws-priv-key.pem \
  -c /path/to/aws-cert.pem -u <aws-account-id>

5.  Upload your bundled image to s3

ec2-upload-bundle -b <s3_path> -m /tmp/image.manifest.xml \
  -a <access_key_id> -s <secret_key>

6.  Register your newly created AMI with your account

ec2-register <s3_path>/image.manifest.xml \
  -K /path/to/aws-priv-key.pem -C /path/to/aws-cert.pem

Create Auto Scaling Group
Now that we have an AMI that will start performing work upon boot, we can leverage Amazon’s EC2 Auto Scaling to start going wide.  The idea is that you define a cluster of machines that use the AMI we just created, and you define how many servers you want up at any given time.  If you want to spin up 50 servers, you simply set the “DesiredCapacity” for your group to 50, and within minutes you will have 50 fresh new worker nodes.  There are two discrete steps needed to make this happen.  Let’s illustrate how to do this with Fog:

Create Launch Config

as = Fog::AWS::AutoScaling.new(:aws_access_key_id => access_key_id, 
                               :aws_secret_access_key => access_secret_key)
as.create_launch_configuration(<ami_id>, 
                               <machine_type>, 
                               <launch_config_name>, 
                               "SecurityGroups" => <security_groups>, 
                               "KeyName" => <aws_key_pair_name>, 
                               "SpotPrice" => <spot_bid_price>)

This will create a launch configuration that we will use to define our Auto Scaling group.

Create Auto Scaling Group

as.create_auto_scaling_group(<auto_scaling_group_name>, 
                             <availability_zones>, 
                             <launch_config_name>, 
                             <max_size>, <min_size>, 
                             "DesiredCapacity" => <number_of_instances>)

This will create an Auto Scaling Group and will spin up <number_of_instances> servers using the AMI defined in our launch configuration above.

Note that one of the parameters we’ve passed to our Launch Configuration is “SpotPrice”.  This allows you to leverage Amazon’s Spot Instances.  The idea is that you pay whatever the “market rate” is for the given machine_type you’re provisioning.  If the market rate rises above your SpotPrice, instances in your cluster will begin to terminate.  Your application should be tolerant of these failures.  If this is a mission-critical application, you should likely create a “backup” Auto Scaling group without the SpotPrice parameter.  You will pay the On-Demand price for your given machine_type, but it will allow you to continue to process work when spot capacity is sparse.
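
A hypothetical “backup” group is just the same two calls again with different names and without the SpotPrice parameter, for example:

# Same calls as above, minus "SpotPrice": these instances run On-Demand and
# keep processing work when spot capacity dries up. Names are illustrative.
as.create_launch_configuration(<ami_id>,
                               <machine_type>,
                               "worker-ondemand-config",
                               "SecurityGroups" => <security_groups>,
                               "KeyName" => <aws_key_pair_name>)

# Keep the backup group at zero capacity until spot instances become scarce,
# then grow it with set_desired_capacity.
as.create_auto_scaling_group("worker-ondemand-group",
                             <availability_zones>,
                             "worker-ondemand-config",
                             <max_size>, <min_size>,
                             "DesiredCapacity" => 0)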

Grow / Shrink Auto Scaling Group

Depending on your application, you’ll likely want to grow and shrink your Auto Scaling group depending on how much work needs to be done.  This is as simple as the following API call:

as.set_desired_capacity(<auto_scaling_group_name>, 
                        <number_of_instances>)

The ability to easily spin up 500 worker nodes with one API call can be a very powerful thing, especially when dealing with the amount of data we deal with at Gnip.

Future improvements include wiring this process into Chef Server for easier centralized config management across datacenters.  If these sorts of challenges sound interesting, be sure to check out our job postings; we’d love to talk to you!

Application Deployment at Gnip

Managing your application code on more than 500 servers is a non-trivial task. One of the tenets we’ve held onto closely as an engineering team at Gnip is “owning your code all the way to the metal”. In order to promote this sense of ownership, we try to keep a clean and simple deployment process.

To illustrate our application deployment process, let’s assume that we’re checking in a new feature to our favorite Gnip application, the flingerator. We will also assume that we have a fully provisioned server that is already running an older version of our code (I’ll save provisioning / bootstrapping servers for another blog post). The process is as follows:

1. Commit: git commit -am “er/ch: checking in my super awesome new feature”
2. Build: One of our numerous cruisecontrol.rb servers picks up the changeset from git and uses maven to build an RPM.
3. Promote: After the build completes, run cap flingerator:promote -S environment=review.
4. Deploy: cap flingerator:roll -S environment=review

Let me break down what is happening at each step of the process:

Commit
Every developer commits or merges their feature branch into master. Every piece of code that lands on master is code reviewed by the developer who wrote the code and at least one other developer. The commit message includes the initials of the developer who wrote the feature as well as the person who code reviewed it. After the commit is made, the master branch is pushed up to github.

Build
After the commit lands on master, our build server (cruisecontrol.rb) uses maven to run automated tests, build jars, and create RPM(s). After the RPM is created, cruisecontrol.rb then copies said RPM to our yum repo server into the “build” directory. Although the build is copied to our yum repo server, it is not ready for deployment just yet.

Promote
After cruise.rb has successfully transferred the RPM to the yum server’s “build” directory, the developer can promote the new code into a particular environment by running the following capistrano command: cap flingerator:promote -S environment=review. This command uses capistrano to ssh to the yum repo server and create a symlink from the “build” directory to the review (or staging or prod) environment directory. This makes said RPM available to install on any server in that environment via yum.
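
Under the hood, the promote task is conceptually just a symlink. A simplified sketch (Capistrano 2 style; the paths, role name, and repo layout are illustrative, not our actual deploy scripts):

namespace :flingerator do
  desc "Symlink the latest build RPM into an environment's yum repo directory"
  task :promote, :roles => :yum_repo do
    env = fetch(:environment, "review")
    # Promote = make the already-built RPM visible to this environment's repo.
    run "ln -sf /srv/yum/build/flingerator.rpm /srv/yum/#{env}/flingerator.rpm"
    # Regenerate the repo metadata so yum clients in this environment see it.
    run "createrepo --update /srv/yum/#{env}"
  end
end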

Deploy
Now that the RPM has been promoted, it is available via the Gnip yum repo. It is up to the dev to run another capistrano command to deploy the code: cap flingerator:roll -S environment=review. This command ssh’es to each flingerator server and runs “yum update flingerator”, which installs the new code onto the filesystem. After the yum update completes successfully, the application process is restarted and the new code is running.
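
The roll task is similarly small; again a simplified sketch with illustrative role and service names:

namespace :flingerator do
  desc "Update the flingerator RPM on every app server and restart the process"
  task :roll, :roles => :flingerator do
    # yum installs whatever version the environment's repo currently points at.
    run "#{sudo} yum -y update flingerator"
    # Bounce the process so the newly installed code starts serving.
    run "#{sudo} service flingerator restart"
  end
end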

This approach uses proven technologies to create a stable and repeatable deployment process, which is extremely important in order to provide an enterprise-grade customer experience.