DocRaptor

Gone in 60 Seconds: How We Moved From Linode to AWS With Less Than a Minute of Downtime

Recently we moved DocRaptor from Linode to AWS, increasing average document creation speed by 6.9%, reducing network errors by an incredible 84.1%, and saving ourselves a ton of devops time related to scaling. Our move to AWS also allowed us to double our simultaneous request limit, scale up to meet customer demand in minutes instead of hours, and spend more time helping you create the documents you need.

DocRaptor is an infrastructure service, so our customers expect us to be up and creating PDFs and Excel docs at all hours. When we had data center issues two weeks in a row with Linode, we knew it was time to switch. We thought through all the details that needed to be accounted for and started working on a migration plan. Over the course of a few weeks this plan solidified, and we scripted the portions that could be automated.

The week before the migration we ran through the whole plan a few times, only excluding steps that would actually break production. This revealed a few small things that we missed, and highlighted a few more things we could automate, but mostly gave us confidence that our plan was going to work.

Even though we were migrating from Linode to AWS, most of our plan was data center agnostic, and we hope anyone who is thinking about migrating data centers will benefit. Here’s a step-by-step account of what our data center migration from Linode to AWS looked like.

Minimizing Downtime

One of our core goals for this migration was keeping customer impact low, and we spent a bunch of time planning to make our window of downtime as small as possible. We achieved this by breaking the migration down into several smaller, easily reversible steps, as well as having many tests along the way to verify that we were able to move from one step to the next with confidence. This careful approach allowed us to fully migrate DocRaptor from Linode to AWS with less than 60 seconds of downtime.

Testing AWS

We needed to verify that DocRaptor was set up and configured correctly in the new data center. We began by making a ton of documents on AWS and verifying that these documents came out perfectly. We take customer privacy very seriously, so it was important to automate document verification; that way, not even we had to manually inspect a customer’s document to confirm it looked good.

Luckily, most of the documents we generate produce the same binary file each time they’re generated (excluding a randomly generated ID field in the PDF). This allowed us to verify AWS was working correctly for most documents, but there were a handful with different binary files. We needed to know why these documents were different, so we did some science.

Verifying Document Creation

DocRaptor stores a log of your document generation so you can get more detail if you run into issues. The reason a few identical inputs produced differing binary files turned out to be minute differences in asset loading: sometimes a customer has network or asset loading issues, and if an image or a remote font fails to load, the resulting documents are obviously going to differ. This meant we actually needed to measure the variation within both systems and compare those variations in order to know that AWS was performing at least as well as Linode.

Fortunately for us, there are automated visual comparison tools. ImageMagick has a great tool called compare that accepts two input images and returns a number indicating how similar they are (we used its perceptual hash, or PHASH, metric). Here you can see an example of an image diff using different font options from the example invoice on our try it out page.

Image diff using Compare

This technique allowed us to automatically run PDFs through both Linode and AWS, convert them to images, and compare the resulting PHASH scores against a threshold. We sampled a huge number of customer documents and ran this comparison process on all of them. Because customer assets sometimes vary from request to request, we made our comparison more statistical by running the same document through both Linode and AWS multiple times, then checking the variation within each group.
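To make that concrete, a single comparison looked roughly like this (a minimal sketch; the file names and threshold value are illustrative, not our production settings):

  # Render the first page of each PDF to an image
  convert -density 150 'linode.pdf[0]' linode.png
  convert -density 150 'aws.pdf[0]' aws.png
  # compare writes the PHASH distance to stderr (0 means visually identical)
  # and saves a visual diff like the one pictured above
  score=$(compare -metric PHASH linode.png aws.png diff.png 2>&1)
  # Flag the pair for a closer look if the distance exceeds the threshold
  awk -v s="$score" -v t=5.0 'BEGIN { exit !(s+0 > t) }' && echo "needs review: $score"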

Lo and behold, this finally gave us the confidence we were looking for that AWS was generating great-looking documents. But what about when DocRaptor doesn’t produce a PDF or Excel document at all?

We still needed to verify that invalid input was handled the same between the AWS and Linode clusters. DocRaptor produces helpful error messages for a variety of invalid inputs, such as document timeouts, invalid HTML, and JavaScript timeouts. While running documents through both services, we compared the error messages returned and verified that they were the same.

This was a great opportunity to gather statistics on how long it took to generate these documents, as well as any network errors that occurred. This experiment proved that on average we were producing documents 6.9% faster on AWS, with a whopping 84.1% fewer network errors!

6.9% faster

84.1% fewer network errors

Automating Our Infrastructure

DocRaptor is a complex system with a ton of moving parts. Different tools are required for HTML parsing, JavaScript execution, font handling, and the PDF or Excel generation itself. We’d been successful up to that point by cloning servers with the same configuration, but this method had quite a few downsides. Each new server still required some manual setup, and we had never clearly documented and codified the setup procedure for a new server. As anyone who’s written software before knows, manual steps are a recipe for problems. It was time to fully automate our app setup!

Automating from scratch can definitely be a hassle, but we had already automated many of the big components thanks to our recent infrastructure work on Gauges and Instrumental. We’ve automated the setup and deployment of those apps using a combination of our own scripts and Puppet, so we just had to add the DocRaptor-specific setup to our automation.

With very little effort we were able to spin up a whole new cluster, including web servers and background worker servers. Using a combination of Puppet scripts and Capistrano tasks, we reduced the spinup of a new server (or even a completely new cluster) to a matter of minutes. This work has already paid off internally by making it incredibly easy to spin up a new server for testing new code.
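At a high level, bringing a new box into the cluster now boils down to something like this (a sketch only; the manifest path and hostname are placeholders, not our actual Puppet layout or task names):

  # Converge the base server: packages, fonts, users, monitoring, and so on
  puppet apply --modulepath=puppet/modules puppet/manifests/docraptor.pp
  # Ship the application code to just the new box
  cap production deploy HOSTFILTER=web-003.docraptor.com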

Migration Details

Below we’ve broken down our migration plan into sections with some explanation of each step. Want to check out our entire migration plan? We’ve added a link to a gist at the bottom of this post.

Estimated total time to execute: 2 hours, not including pre-replication

Ensure performance test make_doc script has production endpoint
Comment out deploy tests, DO NOT COMMIT

To ensure updates we make to DocRaptor don’t cause issues, we have a set of tests that run against production after every deploy. For this migration, however, we tested out of band in order to keep deploys fast and limit any downtime.

Setup SSH tunnel between AWS background-001 to Linode MySQL
Replicate Linode MySQL to AWS (3hr+) as docraptor-production
One-time copy non-queue Redis data
Verify AWS endpoint working
  ./script/service_test http://aws.docraptor.com && ./script/service_test https://aws.docraptor.com

We’re using automated tests here to verify everything is working correctly.
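For context, the tunnel and replication steps above look roughly like the following (a sketch only; the hostnames, replication user, and binlog coordinates are placeholders rather than our real values):

  # From AWS background-001: forward a local port to Linode's MySQL over SSH
  ssh -f -N -L 3307:127.0.0.1:3306 deploy@mysql.linode.example.com
  # Point the AWS replica at the tunnel and start replicating (MySQL 5.x syntax)
  mysql -e "CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=3307,
            MASTER_USER='repl', MASTER_PASSWORD='...',
            MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=4;
            START SLAVE;"
  # Watch replication lag until the replica has caught up
  mysql -e "SHOW SLAVE STATUS\G" | grep Seconds_Behind_Master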

-- Wait Till Day of Maintenance --

Tweet: Reminder: we'll be doing maintenance from 2-3pm EDT today.

We definitely wanted to keep all of our customers informed, and posting notice ahead of time covered us just in case there were issues.

Connect AWS to Linode MySQL
  Fallback: (AWS BRANCH) cap production resque:stop
Verify AWS endpoint working
  ./script/service_test http://aws.docraptor.com && ./script/service_test https://aws.docraptor.com

Here we pointed both the AWS and Linode servers at our Linode MySQL instance so everything was using the same persistent data store. Note that our fallback is one simple command that runs very quickly. If the verification test had failed, there would only have been a few seconds of downtime at this point, and we could then investigate and resolve at our leisure.

MySQL switch

Switch to old Instrumental API key in app and automation
  be sure to restart instrument_server

Up until this point we had AWS connected to a test application monitoring project so it didn’t conflict with the DocRaptor production monitoring data. Since AWS was now using the main persistent store, we wanted all our data in one place.

Start continuous testing against production endpoint
Direct Linode LB to AWS ELB for 10 seconds
  Verify production traffic hits Linode
  Update Linode LB config to point to AWS
  Verify production traffic hits AWS
  Verify queued jobs in AWS
  Update Linode LB config to point back to Linode
  Verify production traffic hits Linode
  Verify queue draining on AWS

  Fallback: Manually move some jobs from AWS -> Linode Redis
Stop continuous testing (TODO: COULD BE MOVED LATER)

Here we’re running a small amount of production traffic through AWS using nginx as a proxy. This allows us to test AWS connectivity between all servers other than MySQL and verify that documents continue to be processed. In our configuration we had two different queues, one in Linode and one in AWS, so enqueued documents would automatically drain out of the old queue as they were worked. We also switched from Passenger to Unicorn as part of the migration, so we used that fact as a quick additional check to verify which service was actually handling requests. Again, the fallback is easy, and in fact this entire step doubles as a practice run for the next step and its fallback.
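For the curious, the Linode load balancer change itself is just an nginx config swap along these lines (illustrative config, not our exact file; the ELB hostname is a placeholder):

  upstream docraptor_backend {
    # Was the Linode app servers; now the AWS ELB.
    # nginx resolves this hostname when the config is (re)loaded.
    server docraptor-production.us-east-1.elb.amazonaws.com;
  }

  server {
    listen 80;
    server_name docraptor.com;
    location / {
      proxy_set_header Host $host;
      proxy_pass http://docraptor_backend;
    }
  }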

Direct Linode LB to AWS ELB permanently
  cp /opt/nginx/conf/nginx.production.lb.conf.new /opt/nginx/conf/nginx.production.lb.conf
  /etc/init.d/lb_nginx configtest

  /etc/init.d/lb_nginx reload
  curl -I http://docraptor.com | grep Pass  # should be empty
  curl -I https://docraptor.com | grep Pass # should be empty and no ssl error
  Fallback: Same as the 10 second one above
Verify production endpoint working
  ./script/service_test http://docraptor.com && ./script/service_test https://docraptor.com
Verify no traffic hitting Linode app servers
Verify Linode Resque is drained
Make a new branch of aws_migration
  with AWS MySQL + Port
Deploy out-of-band AWS app instance pointed at AWS MySQL
  # uncomment, DO NOT COMMIT web-oob enabled!!!
  cap production deploy HOSTFILTER=web-oob.docraptor.com
  # recomment web-oob so we do not deploy it in the next steps
  # verify connection to correct MySQL:
  eb ssh web-oob.docraptor.com
  /data/docraptor/current/script/rails runner 'User.last; puts `lsof -i -p #{Process.pid}`'
Deploy AWS MySQL with pause to AWS instances
  cap production deploy deploy:web:enable
Clear failed jobs on Linode

We’ve verified that production traffic is handled correctly through AWS, so now we start forwarding all traffic to AWS. In order to minimize downtime we spun up a new out-of-band server that ran the new AWS-only code (web-oob) but wasn’t added to our AWS load balancer. This allowed us to later run tests against AWS without waiting for new code to be deployed to the AWS web servers. Another optimization to note here: the deploy of the new code to the in-band AWS servers pauses right before restarting the DocRaptor services. This is another way we kept downtime low, by essentially shrinking the “deploy time” to only as long as it takes to restart Unicorn and Resque.
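The pause itself is nothing fancy. Conceptually, the deploy finishes all of the slow work (code push, bundling, asset compilation) and then waits for a go signal before bouncing services, something like this sketch (the flag file and init scripts are illustrative, not our actual deploy code):

  # ...slow deploy steps have already completed on every in-band server...
  # Hold here until we give the go-ahead during the maintenance window
  while [ ! -f /tmp/continue_deploy ]; do sleep 1; done
  # The only user-visible gap is the restart itself
  sudo /etc/init.d/unicorn restart
  sudo /etc/init.d/resque-worker restart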

Stage and switch using out-of-band server

-- Must Be In Maintenance Window --

Enable maintenance mode in PagerDuty
  Pingdom
Tweet: We're starting maintenance now, you may see intermittent errors over the next hour.
Run dc_switch cap task
  cap production dc_switch
  Fallback: is automatic
Continue paused deploy
  Issue?: cap production deploy:restart
Verify no passenger
  curl -I http://docraptor.com | grep Pass   # should be empty
  curl -I https://docraptor.com | grep Pass # should be empty and no ssl error
Verify production endpoint
  ./script/service_test http://docraptor.com && ./script/service_test https://docraptor.com
Requeue failed jobs
Stress test (10min)
  ./script/performance.rb old 1000000 pdf small | tee -a tmp/performance-old-pdf-small-final-switch.log
  Fallback: switch Linode LB to point to Linode apps (will lose data)
Wait a safe period (1hr?)

-- Maintenance Complete --

This is the only part of the migration where there is downtime. We need to switch which MySQL instance we’re using, and we want to ensure replication has caught up when we do so. We inform customers that the maintenance window is beginning. We codified the majority of the actual switching procedure in a Capistrano task called dc_switch, which allowed us to fall back to Linode very quickly if anything went wrong.
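We won’t reproduce the whole dc_switch task here, but the core of the cutover is roughly this sequence (a sketch; the hostname and flag file are placeholders, and the real task wraps these steps in Capistrano with an automatic fallback):

  # Stop background workers so no new writes land on Linode during the switch
  cap production resque:stop
  # Wait for the AWS replica to fully catch up with the Linode master
  mysql -h aws-mysql.internal -e "SHOW SLAVE STATUS\G" | grep 'Seconds_Behind_Master: 0$'
  # Promote the AWS replica to a standalone master
  mysql -h aws-mysql.internal -e "STOP SLAVE; RESET SLAVE ALL;"
  # Signal the paused deploy to continue, restarting the app against AWS MySQL
  touch /tmp/continue_deploy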

Disable maintenance mode in PagerDuty
Tweet: Maintenance complete! Please enjoy your regularly scheduled document service :)

Scheduled maintenance is complete! Party Time! Everything below this point doesn’t have to be done immediately, so the pressure is off. Whew!

Enable CloudFront
Move cron jobs from Linode to AWS
Remove out-of-band server and cleanup cap tasks
Remove deploy pauser
Enable gitflow
Switch DNS
Wait at least 48 hours
Move any outstanding temporary storage files
Shut down Linode non-load-balancer boxes
Possibly wait more
Verify no traffic hitting Linode load balancers for 1 day
Shut down Linode
Remove Linode boxes from AWS security groups

And there you have it, a sub-60-second downtime data center migration!

Want to check out our full migration plan? Here’s a gist we put together while working on our data center migration.

A data center migration is always a nerve-wracking process, but with careful planning and by limiting your exposure to downtime, you can make it a much less stressful experience.