If you have a problem with dynamic CPU and/or I/O requirements that is easily distributed across homogeneous nodes, EC2 Auto Scaling can be a great tool to add to your stack. If you can design your application to gracefully handle nodes terminating while processing work, Amazon EC2 Spot Instances can provide great cost savings. We’ve used these technologies in conjunction with Resque with great success, and this model would likely fit other job management frameworks such as Sidekiq or Gearman.
If you’re in your own datacenter you can achieve similar functionality with Eucalyptus or OpenStack/Heat, but we’ll focus on using EC2 Auto Scaling as a concrete example for this blog post. By the end of this post, you should have a good idea of how to set up your own EC2 Auto Scaling cluster using Chef and Spot instances.
Create a Package That Installs and Runs Chef
You’ll want to create a package (rpm, deb, etc.) that contains definitions of all the software necessary to start “doing work” on your newly provisioned worker node. We achieved this by creating an RPM that includes our chef cookbooks, using the process outlined in this previous blog post. In the next step we’ll create an AMI that installs these cookbooks via yum when our node boots up and then runs Chef to turn our base server into a fully functional worker node.
Create Base AMI
We subscribe to the principle that you should have a small number of Amazon Machine Images (AMIs) that have only the minimal amount of information to bootstrap themselves. This information is typically just:
1. Which environment am I currently in? (review, staging, or prod)
2. What type of node am I? (worker node, web server, etc.)
3. Where can I download the necessary cookbooks to provision myself? (configure yum or deb sources)
Following this principle, we have created an AMI that runs a “firstboot.sh” init.d script. This first boot script configures the node to look at the appropriate yum repo and installs the RPM we created in the previous step. This way the AMI can remain relatively static, and you can iterate on your bootstrap code without the cumbersome process of creating a new AMI each time. After the cookbooks have been pulled down to the local filesystem, the first boot script runs chef-solo to install the necessary software to start “doing work”.
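A minimal sketch of such a first boot script, assuming a hypothetical internal yum repo and a cookbook RPM named worker-cookbooks (the repo URL, file locations, and names here are all illustrative, not our exact implementation):

```shell
#!/bin/bash
# /etc/init.d/firstboot -- illustrative sketch; repo URL, RPM name, and
# file paths are assumptions, not the exact production setup.

# 1. Who am I? These answers are baked into the AMI (or passed via
#    EC2 user-data) and are the only state the image carries.
ENVIRONMENT=$(cat /etc/node_environment)   # review, staging, or prod
NODE_TYPE=$(cat /etc/node_type)            # e.g. worker

# 2. Where do I get my cookbooks? Point yum at the matching repo.
cat > /etc/yum.repos.d/cookbooks.repo <<EOF
[cookbooks]
name=Internal cookbook repo (${ENVIRONMENT})
baseurl=http://yum.${ENVIRONMENT}.example.internal/x86_64/
enabled=1
gpgcheck=0
EOF

# 3. Install the cookbook RPM built in the packaging step, then run
#    chef-solo to converge the box into a working node of this type.
yum install -y worker-cookbooks
chef-solo -c /opt/cookbooks/solo.rb -j "/opt/cookbooks/${NODE_TYPE}.json"
```

Because the AMI only carries these three answers, everything else can change release to release by rebuilding the RPM rather than the image.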
In order to create your own AMI, you’ll need to:
1. Boot a new EC2 Server.
2. Install the Amazon EC2 AMI and API command line tools.
3. Customize your server to run your own “first boot” script that will install and run chef-solo.
4. Run the following command to bundle your server to a compressed image in the /tmp directory:
ec2-bundle-vol -k /path/to/aws-priv-key.pem -c /path/to/aws-cert.pem -u <aws-account-id>
5. Upload your bundled image to S3:
ec2-upload-bundle -b <s3_path> -m /tmp/image.manifest.xml -a <access_key_id> -s <secret_key>
6. Register your newly created AMI with your account:
ec2-register <s3_path>/image.manifest.xml -K /path/to/aws-priv-key.pem -C /path/to/aws-cert.pem
Create Auto Scaling Group
Now that we have an AMI that will start performing work upon boot, we can leverage Amazon’s EC2 Auto Scaling to go wide. The idea is that you define a cluster of machines based on the AMI we just created, along with how many servers you want up at any given time. If you want to spin up 50 servers, you simply set the “DesiredCapacity” of your group to 50, and within minutes you will have 50 fresh worker nodes. There are two discrete steps needed to make this happen. Let’s illustrate how to do this with Fog:
Create Launch Config
as = Fog::AWS::AutoScaling.new(:aws_access_key_id => access_key_id, :aws_secret_access_key => access_secret_key)
as.create_launch_configuration(<ami_id>, <machine_type>, <launch_config_name>, "SecurityGroups" => <security_groups>, "KeyName" => <aws_key_pair_name>, "SpotPrice" => <spot_bid_price>)
This will create a launch configuration that we will use to define our Auto Scaling group.
Create Auto Scaling Group
as.create_auto_scaling_group(<auto_scaling_group_name>, <availability_zones>, <launch_config_name>, <max_size>, <min_size>, "DesiredCapacity" => <number_of_instances>)
This will create an Auto Scaling Group and will spin up <number_of_instances> servers using the AMI defined in our launch configuration above.
Note that one of the parameters we’ve passed to our launch configuration is “SpotPrice”. This allows you to leverage Amazon’s Spot Instances: you pay the current “market rate” for the machine_type you’re provisioning. If the market rate rises above your SpotPrice, instances in your cluster will begin to terminate, so your application should be tolerant of these failures. If this is a mission-critical application, you should likely create a “backup” Auto Scaling group without the SpotPrice parameter. You’ll pay the On-Demand price for your machine_type, but you’ll be able to continue processing work when spot capacity is scarce.
Grow / Shrink Auto Scaling Group
Depending on your application, you’ll likely want to grow and shrink your Auto Scaling group depending on how much work needs to be done. This is as simple as the following API call:
The ability to easily spin up 500 worker nodes with one API call can be a very powerful thing, especially when dealing with the amount of data we deal with at Gnip.
Future improvements include wiring this process into Chef Server for easier centralized config management across datacenters. If these sorts of challenges sound interesting, be sure to check out our job postings; we’d love to talk to you!