Chef, EC2 Auto Scaling, and Spot Instances for Fun and Profit

If you have a problem with dynamic CPU and/or I/O requirements that is easily distributed across homogeneous nodes, EC2 Auto Scaling can be a great tool to add to your stack.  If you can design your application to gracefully handle nodes terminating while processing work, Amazon EC2 Spot Instances can provide great cost savings.  We’ve used these technologies in conjunction with Resque with great success.  This model would likely fit other job management frameworks such as Sidekiq or Gearman.
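
“Gracefully handling termination” mostly comes down to keeping jobs small and idempotent, so that a job lost when a node disappears can simply be enqueued again.  A minimal sketch of that idea (the job, queue, and model names below are hypothetical, not our actual code):

# Hypothetical Resque job: a small, idempotent unit of work that is safe to
# re-enqueue if the node running it is terminated mid-job.
class ProcessWorkUnit
  @queue = :work_units

  def self.perform(work_unit_id)
    unit = WorkUnit.find(work_unit_id)
    return if unit.complete?     # idempotent: re-running a finished job is a no-op
    unit.process!
    unit.mark_complete!
  end
end

# Enqueueing a unit of work:
Resque.enqueue(ProcessWorkUnit, 42)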

If you’re in your own datacenter you can achieve similar functionality with Eucalyptus or OpenStack/Heat, but we’ll focus on EC2 Auto Scaling as a concrete example here.  By the end of this post, you should have a good idea of how to set up your own EC2 Auto Scaling cluster using Chef and Spot Instances.

Create a Package That Installs and Runs Chef
You’ll want to create a package (RPM, deb, etc.) that contains definitions of all the software necessary to start “doing work” on your newly provisioned worker node.  We achieved this by creating an RPM that includes our Chef cookbooks, using the process outlined in this previous blog post.  In the next step we’ll create an AMI that installs these cookbooks via yum when the node boots up and then runs Chef to turn our base server into a fully functional worker node.

Create Base AMI
We subscribe to the principle that you should have a small number of Amazon Machine Images (AMIs) that have only the minimal amount of information to bootstrap themselves.  This information is typically just:

1.  Which environment am I currently in? (review, staging, or prod)
2.  What type of node am I?  (worker node, web server, etc.)
3.  Where can I download the necessary cookbooks to provision myself? (configure yum or deb sources)

Following this principle, we have created an AMI that runs a “firstboot.sh” init.d script.  This first boot script configures the node to look at the appropriate yum repo and installs the RPM we created in the previous step.  This way the AMI can remain relatively static and you can iterate on your bootstrap code without having to follow the cumbersome process of creating a new AMI each time.  After the cookbooks have been pulled down to the local filesystem, the first boot script runs chef-solo to install the necessary software to start “doing work”.
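
As a rough sketch of that last step, the first boot script might point chef-solo at the cookbooks the RPM laid down with a configuration along these lines (the paths here are assumptions, not our exact layout):

# /etc/chef/solo.rb -- minimal chef-solo configuration (paths are assumptions)
cookbook_path "/var/chef/cookbooks"   # where the cookbook RPM installed our cookbooks
json_attribs  "/etc/chef/node.json"   # answers from the list above: environment, node type
log_level     :info

The script would then finish by running chef-solo -c /etc/chef/solo.rb to converge the node into a worker.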

In order to create your own AMI, you’ll need to:

1.  Boot a new EC2 server.
2.  Install the Amazon EC2 Command Line Tools.
3.  Customize your server to run your own “first boot” script that will install and run chef-solo.
4.  Run the following command to bundle your server to a compressed image in the /tmp directory:

ec2-bundle-vol -k /path/to/aws-priv-key.pem \
  -c /path/to/aws-cert.pem -u <aws-account-id>

5.  Upload your bundled image to S3:

ec2-upload-bundle -b <s3_path> -m /tmp/image.manifest.xml \
  -a <access_key_id> -s <secret_key>

6.  Register your newly created AMI with your account:

ec2-register <s3_path>/image.manifest.xml \
  -K /path/to/aws-priv-key.pem -C /path/to/aws-cert.pem

Create Auto Scaling Group
Now that we have an AMI that will start performing work upon boot, we can leverage Amazon’s EC2 Auto Scaling to start going wide.  The idea is that you define a cluster of machines that use the AMI we just created, along with how many servers you want up at any given time.  If you want to spin up 50 servers, you simply set the “DesiredCapacity” for your group to 50 and within minutes you will have 50 fresh new worker nodes.  There are two discrete steps needed to make this happen.  Let’s illustrate how to do this with Fog:

Create Launch Config

as = Fog::AWS::AutoScaling.new(:aws_access_key_id => access_key_id,
                               :aws_secret_access_key => access_secret_key)
as.create_launch_configuration(<ami_id>,
                               <machine_type>,
                               <launch_config_name>,
                               "SecurityGroups" => <security_groups>,
                               "KeyName" => <aws_key_pair_name>,
                               "SpotPrice" => <spot_bid_price>)

This will create a launch configuration that we will use to define our Auto Scaling group.

Create Auto Scaling Group

as.create_auto_scaling_group(<auto_scaling_group_name>,
                             <availability_zones>,
                             <launch_config_name>,
                             <max_size>, <min_size>,
                             "DesiredCapacity" => <number_of_instances>)

This will create an Auto Scaling Group and will spin up <number_of_instances> servers using the AMI defined in our launch configuration above.

Note that one of the parameters we’ve passed to our launch configuration is “SpotPrice”.  This allows you to leverage Amazon’s Spot Instances: you pay whatever the current “market rate” is for the machine_type you’re provisioning.  If the market rate rises above your SpotPrice, instances in your cluster will begin to terminate, so your application should be tolerant of these failures.  If this is a mission-critical application, you should likely create a “backup” Auto Scaling group without the SpotPrice parameter.  You will pay the On-Demand price for your machine_type, but it will allow you to continue processing work when Spot capacity is scarce.
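
A sketch of what that backup group could look like, reusing the same Fog calls as above but leaving out “SpotPrice” (the launch config and group names below are placeholders):

# Backup launch configuration and group running On-Demand instances,
# so work can continue while Spot capacity is being reclaimed.
as.create_launch_configuration(<ami_id>,
                               <machine_type>,
                               "worker-ondemand-config",
                               "SecurityGroups" => <security_groups>,
                               "KeyName" => <aws_key_pair_name>)   # no "SpotPrice" here

as.create_auto_scaling_group("worker-ondemand-group",
                             <availability_zones>,
                             "worker-ondemand-config",
                             <max_size>, 0,               # min_size of 0: idle by default
                             "DesiredCapacity" => 0)      # raise this when Spot capacity dries up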

Grow / Shrink Auto Scaling Group

Depending on your application, you’ll likely want to grow and shrink your Auto Scaling group based on how much work needs to be done.  This is as simple as the following API call:

as.set_desired_capacity(<auto_scaling_group_name>, 
                        <number_of_instances>)
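
For example, a simple scheduled script could size the group from Resque’s pending job count, reusing the “as” connection from above.  The queue name and jobs-per-node ratio below are purely illustrative:

# Hypothetical sizing script: one worker node per 100 pending Resque jobs,
# clamped between 1 and 50 nodes.
require 'resque'

pending = Resque.size(:work_units)          # pending jobs on the (hypothetical) :work_units queue
desired = [[pending / 100, 1].max, 50].min

as.set_desired_capacity(<auto_scaling_group_name>, desired)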

The ability to easily spin up 500 worker nodes with one API call can be a very powerful thing, especially when dealing with the amount of data we deal with at Gnip.

Future improvements include wiring this process into Chef Server for easier centralized configuration management across datacenters.  If these sorts of challenges sound interesting, be sure to check out our job postings; we’d love to talk to you!

Application Deployment at Gnip

Managing your application code on more than 500 servers is a non-trivial task. One of the tenets we’ve held onto closely as an engineering team at Gnip is “owning your code all the way to the metal”. In order to promote this sense of ownership, we try to keep a clean and simple deployment process.

To illustrate our application deployment process, let’s assume that we’re checking in a new feature to our favorite Gnip application, the flingerator. We will also assume that we have a fully provisioned server that is already running an older version of our code (I’ll save provisioning / bootstrapping servers for another blog post). The process is as follows:

1. Commit: git commit -am "er/ch: checking in my super awesome new feature"
2. Build: One of our numerous cruisecontrol.rb servers picks up the changeset from git and uses maven to build an RPM.
3. Promote: After the build completes, run cap flingerator:promote -S environment=review.
4. Deploy: Run cap flingerator:roll -S environment=review.

Let me break down what is happening at each step of the process:

Commit
Every developer commits or merges their feature branch into master. Every piece of code that lands on master is reviewed by the developer who wrote it and at least one other developer. The commit message includes the initials of the developer who wrote the feature as well as the person who reviewed it. After the commit is made, the master branch is pushed up to GitHub.

Build
After the commit lands on master, our build server (cruisecontrol.rb) uses maven to run automated tests, build jars, and create RPM(s). After the RPM is created, cruisecontrol.rb copies said RPM into the “build” directory on our yum repo server. Although the build now lives on the yum repo server, it is not ready for deployment just yet.
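
On the CruiseControl.rb side this is just a project configuration that shells out to maven; a rough sketch (the maven goals and polling interval here are assumptions, not our exact setup):

# cruise_config.rb -- hypothetical CruiseControl.rb project configuration for the flingerator
Project.configure do |project|
  project.build_command = "mvn clean package rpm:rpm"   # run tests, build jars, and create the RPM
  project.scheduler.polling_interval = 1.minute         # poll git for new commits on master
end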

Promote
After cruise.rb has successfully transferred the RPM to the yum server’s “build” directory, the developer can promote the new code into a particular environment by running the following Capistrano command: cap flingerator:promote -S environment=review. This command uses Capistrano to ssh to the yum repo server and create a symlink from the “build” directory into the review (or staging or prod) environment directory. This makes said RPM available to install via yum on any server in that environment.
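
Under the hood, a promote task like this is little more than a symlink; here is a hypothetical version in Capistrano 2 syntax (the paths, role name, and createrepo step are assumptions):

# Hypothetical Capistrano task: promote a built RPM into an environment's yum repo
namespace :flingerator do
  task :promote, :roles => :yum_repo do
    env = fetch(:environment)   # set on the command line with -S environment=review
    run "ln -sf /opt/yum/build/flingerator.rpm /opt/yum/#{env}/flingerator.rpm"
    run "createrepo /opt/yum/#{env}"   # rebuild the repo metadata so yum clients see the new RPM
  end
end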

Deploy
Now that the RPM has been promoted, it is available via the Gnip yum repo. It is then up to the developer to run another Capistrano command to deploy the code: cap flingerator:roll -S environment=review. This command ssh’es to each flingerator server and runs “yum update flingerator”, which installs the new code onto the filesystem. After the “yum update” completes successfully, the application process is restarted and the new code is running.
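
The roll task is similarly small; a hypothetical version of it (the role name and init script are assumptions):

# Hypothetical Capistrano task: install the promoted RPM on every flingerator node and restart
namespace :flingerator do
  task :roll, :roles => :flingerator do
    run "sudo yum update -y flingerator"          # pull the newly promoted code from the yum repo
    run "sudo /etc/init.d/flingerator restart"    # restart so the new code is running
  end
end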

This workflow uses proven technologies to create a stable and repeatable deployment process, which is extremely important for providing an enterprise-grade customer experience.