AWS Lambda - Automated Snapshots

This is my version of the code from Ryan S. Brown's blog post. I recommend reading his post before this one; you can find it here. I'm also including a recipe for deploying it, my opinion on why you should use the function the way it is, and instructions for using it to back up and restore your EC2 volumes. Enjoy!

What's the difference?

I added the ability to copy the EC2 instance tags to each snapshot. The downside is that the function now makes one API call per snapshot instead of tagging all of them with a single call.
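As a rough illustration of that trade-off, here is a minimal sketch, not the code from this repo. The helper names `copyable_tags` and `tag_snapshots` are hypothetical, `ec2` is assumed to be a boto3 EC2 client passed in by the caller, and the tag lists are assumed to have the `[{"Key": ..., "Value": ...}]` shape that boto3 returns.

```python
def copyable_tags(instance_tags):
    """Return the instance tags that can be copied onto a snapshot."""
    # Keys starting with 'aws:' are reserved by AWS and cannot be set by users.
    return [
        {"Key": t["Key"], "Value": t["Value"]}
        for t in instance_tags
        if not t["Key"].startswith("aws:")
    ]


def tag_snapshots(ec2, snapshot_ids, instance_tags):
    """Tag each snapshot separately; returns the number of API calls made."""
    tags = copyable_tags(instance_tags)
    calls = 0
    if not tags:
        return calls
    for snapshot_id in snapshot_ids:
        # One create_tags request per snapshot, as described above.
        ec2.create_tags(Resources=[snapshot_id], Tags=tags)
        calls += 1
    return calls
```

Since `create_tags` accepts several resource IDs at once, snapshots that share exactly the same tags could in principle be tagged together, but tags differ from instance to instance, so this sketch keeps one call per snapshot.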

What does it do?

The purpose of this tool is to automate backups of important servers in AWS. The idea is to control which servers need daily backups and what retention period each one needs. For this we're going to use two AWS features: Lambda and EC2 tags.

Daily snapshots let us restore an instance easily. The reasons vary, but common ones are corrupted volumes, an AZ failure, etc. So, how does the process work? Here's how:

  1. There are two Lambda functions: one takes the snapshots and the other acts as a janitor
  2. Both functions are configured to run daily
  3. The backup function looks for instances that have the 'Backup' tag (the value doesn't matter), then takes a snapshot of every volume attached to each instance. It reads the instance's 'Retention' tag to set an expiration date for the snapshot; if the tag is missing it uses a default of 30 days. When it creates the snapshot it tags it with that expiration date (the 'DeleteOn' tag the janitor looks for). For example, if the function runs on November 1st with a retention period of 7 days, the expiration date will be November 8th.
  4. The janitor function looks for snapshots it has to delete: it takes the date on which it is running and searches for snapshots whose 'DeleteOn' tag holds that date.
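The date logic in steps 3 and 4 can be sketched in a few lines. This is an illustration under assumptions, not the code in src: the function names are hypothetical, the `YYYY-MM-DD` format of the 'DeleteOn' value is assumed, and the tag lists are assumed to have the `[{"Key": ..., "Value": ...}]` shape that boto3 returns.

```python
from datetime import date, timedelta

DEFAULT_RETENTION_DAYS = 30  # default used when there is no 'Retention' tag


def retention_days(instance_tags):
    """Return the 'Retention' tag value as an int, or the default."""
    for tag in instance_tags:
        if tag["Key"] == "Retention":
            return int(tag["Value"])
    return DEFAULT_RETENTION_DAYS


def delete_on(run_date, days):
    """Expiration date stored in the 'DeleteOn' tag (format is an assumption)."""
    return (run_date + timedelta(days=days)).strftime("%Y-%m-%d")


def expired_today(snapshots, today):
    """IDs of snapshots whose 'DeleteOn' tag equals today's date."""
    stamp = today.strftime("%Y-%m-%d")
    return [
        s["SnapshotId"]
        for s in snapshots
        if any(
            t["Key"] == "DeleteOn" and t["Value"] == stamp
            for t in s.get("Tags", [])
        )
    ]
```

With the example from step 3, `delete_on(date(2016, 11, 1), 7)` gives `"2016-11-08"` (the year is only for illustration). Because the janitor matches the exact date, a snapshot is only picked up on the day it expires, so under this scheme a skipped run would leave that day's snapshots behind.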

Why snapshots and not AMIs?

There are a couple of reasons for this decision:

AMIs
  - AMIs are cool because to restore we just launch a new instance with a few clicks
  - The downside is that we often need to keep the IP, the assigned IAM role, the tags and the SGs, and in an emergency taking care of all of those is too risky and complex
  - There's no 100% guarantee that an AMI taken from a running instance will work when it's restored: because the instance was still running, some files might be corrupted (that's why the recommendation is to stop the instance first)

Snapshots
  - They are incremental, so only the first snapshot of a volume takes a long time to create
  - They are easy to restore (read below for details)

For more details about them, please check the official docs: AMIs and Snapshots.

How to upload/configure the code to AWS Lambda?

For both functions, we'll need to do the following:

  1. Create an IAM role for Lambda with the permissions described in the policy from this repo (iam/policy.json)
  2. Go to AWS Lambda and click on "Create a lambda function" button
  3. Select "Blank Function" blueprint
  4. Configure the trigger, click on the left button and choose: "CloudWatch Events - Schedule"
  5. Enter a rule name if it's a new rule, or choose an existing one
  6. Choose a schedule expression; in this case it should run once a day, e.g. rate(1 day)
  7. Check the "Enable trigger" checkbox
  8. Click "Next"
  9. Configure the lambda function:
  10. Enter the name of the function
  11. Choose "Python 2.7" as the Runtime
  12. In the box below, paste the code of the lambda function from the src folder of this repo
  13. For Role, select the option "Choose an existing role"
  14. Choose an existing role (the one created in step 1)
  15. Leave everything else as default
  16. Click Next

Ref: http://docs.aws.amazon.com/lambda/latest/dg/get-started-create-function.html
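If you'd rather script the schedule from steps 4 to 7 than click through the console, the request parameters look roughly like this. It's a sketch under assumptions: the helper names, the rule name and the target Id are hypothetical, and the dictionaries are meant to be passed to the `put_rule` and `put_targets` calls of a boto3 CloudWatch Events client.

```python
def daily_rule_kwargs(rule_name):
    """Arguments for put_rule: an enabled rule that fires once a day."""
    return {
        "Name": rule_name,
        "ScheduleExpression": "rate(1 day)",
        "State": "ENABLED",
    }


def target_kwargs(rule_name, function_arn):
    """Arguments for put_targets: point the rule at the Lambda function."""
    return {
        "Rule": rule_name,
        # 'Id' only has to be unique within the rule; "1" is arbitrary.
        "Targets": [{"Id": "1", "Arn": function_arn}],
    }


# Example use (needs boto3 and AWS credentials, so it is left commented out):
# import boto3
# events = boto3.client("events")
# events.put_rule(**daily_rule_kwargs("daily-ebs-backup"))
# events.put_targets(**target_kwargs("daily-ebs-backup", function_arn))
```

When the trigger is created this way instead of through the console, the function also needs a resource policy that allows events.amazonaws.com to invoke it (Lambda's `add_permission` call).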

How to restore an instance from a snapshot?

Follow these instructions to restore an instance from a previous snapshot:

  1. Go to EC2
  2. Identify the instance and the volume(s) with problems, and take note of the volume ID and the AZ of the instance
  3. Go to Snapshots and search for snapshots of the problem volume by its volume ID
  4. Choose the most recent snapshot and create a volume from it in the same AZ as the instance
  5. While the volume(s) is/are being created, stop the instance with problems
  6. Detach the old volume(s) from the instance and take note of the "Block Device" name (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-detaching-volume.html)
  7. When the new volume(s) is/are ready, attach each one to the instance with the same "Block Device" name (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-attaching-volume.html)
  8. Start the instance
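The same steps can be scripted. Below is a minimal sketch for a single volume, assuming `ec2` is a boto3 EC2 client supplied by the caller. The function names are hypothetical, it ignores pagination of the snapshot list, and it creates the new volume with default settings, so treat it as a starting point rather than a tested recovery tool.

```python
def latest_snapshot(snapshots):
    """Return the most recent completed snapshot, or None if there is none."""
    done = [s for s in snapshots if s.get("State") == "completed"]
    return max(done, key=lambda s: s["StartTime"]) if done else None


def restore_volume(ec2, instance_id, volume_id, az, device):
    """Replace volume_id on instance_id with a volume built from its latest snapshot."""
    # Step 3: find the snapshots taken from the problem volume.
    found = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[{"Name": "volume-id", "Values": [volume_id]}],
    )["Snapshots"]
    snap = latest_snapshot(found)
    if snap is None:
        raise ValueError("no completed snapshot for " + volume_id)

    # Step 4: create the new volume in the same AZ as the instance.
    new_id = ec2.create_volume(
        SnapshotId=snap["SnapshotId"], AvailabilityZone=az
    )["VolumeId"]

    # Step 5: stop the instance and wait until it is fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # Step 6: detach the old volume; wait for both volumes to be available.
    ec2.detach_volume(VolumeId=volume_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id, new_id])

    # Steps 7 and 8: attach under the same block device name, then start.
    ec2.attach_volume(VolumeId=new_id, InstanceId=instance_id, Device=device)
    ec2.start_instances(InstanceIds=[instance_id])
    return new_id
```

A call would look like `restore_volume(boto3.client("ec2"), instance_id, volume_id, az, device)`, using the volume ID, AZ and "Block Device" name noted in steps 2 and 6. The old volume is left detached, not deleted, so it can still be inspected afterwards.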

Questions or suggestions?

Easy: just send me an email or open an issue here.