Running AWS ECS with Spot Instances

In November 2015 we started using AWS Elastic Container Service in production. I became deeply involved with this new AWS product and quickly found its upsides and downsides (e.g. you still can't delete Task Definitions; wtf AWS, we have around 200 of them by now). We had to come up with several tweaks to get all the microservices we run into containers. By now we mostly control ECS via its API, because the console is very confusing and overloaded.

Customized ECS instances

We created a specific AMI that is used as the root partition of a new instance. Two EBS volumes are added to hold Docker images and Docker logs. All launch parameters of an instance are defined in an Ansible playbook to avoid manual errors. The playbook makes the API call to AWS and starts the instance.

tasks:
  - name: Launch instance
    ec2:
      key_name: "{{ key_name }}"
      group_id: sg-****
      instance_type: m4.xlarge
      image: ami-****
      wait: true
      region: eu-west-1
      vpc_subnet_id: "{{ subnet }}"
      assign_public_ip: no
      user_data: |
          #!/bin/bash 
          echo "Defaults:ec2-user !requiretty" > /etc/sudoers.d/disable_requiretty
          rm /var/lib/ecs/data/ecs_agent_data.json
          cat << EOF > /etc/ecs/ecs.config
          ECS_CLUSTER={{ ecs_cluster }}
          ECS_ENGINE_AUTH_TYPE=docker
          ECS_ENGINE_AUTH_DATA={"https://ourdockerrepo.it/v2/":{"username":"********","password":"******************","email":"administration@company.com"}}
          EOF
      instance_tags:
        Name: "{{ ecs_cluster }}-ECS-{{ suffix.stdout }}"
        stage: live
        zone: "{{ dns_zone }}"
        type: "ecs"
      volumes:
        - device_name: /dev/xvda
          snapshot: snap-*****
          volume_type: gp2
          volume_size: 30
        - device_name: /dev/sdb
          snapshot: snap-*****
          volume_type: gp2
          volume_size: 80
          delete_on_termination: true
        - device_name: /dev/sdc
          snapshot: snap-*****
          volume_type: gp2
          volume_size: 80
          delete_on_termination: true
    register: ec2

It then hands the instance over to another playbook that does some bootstrapping. In the end the instance reboots and joins one of our ECS clusters.

Just recently AWS came out with the brand new ECS AMI 2015.09.f, and one thing in particular caught my eye:

Amazon ECS-optimized AMIs from version 2015.09.d and later launch with an 8 GiB volume for the operating system that is attached at /dev/xvda and mounted as the root of the file system. There is an additional 22 GiB volume that is attached at /dev/xvdcz that Docker uses for image and metadata storage. The volume is configured as a Logical Volume Management (LVM) device and it is accessed directly by Docker via the devicemapper back end. Because the volume is not mounted, you cannot use standard storage information commands (such as df -h) to determine the available storage. However, you can use LVM commands and docker info to find the available storage...

This means we can get rid of our custom setup, as the instance sets up the LVM volume automatically.
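Since `df -h` won't show the devicemapper volume, the quote above points to `docker info` (and LVM commands like `sudo lvs`) instead. Here's a minimal sketch of pulling the free space out of that output; the function name is my own, not anything from the AMI:

```shell
# Extract "Data Space Available" from `docker info` output on a
# devicemapper host.
# Usage: docker info 2>/dev/null | docker_space_available
docker_space_available() {
  awk -F': *' '/Data Space Available/ { print $2; exit }'
}
```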

Spot Instances

Spot instances are kind of the meta problem Amazon is facing ~10 years after starting AWS: they have so much unused compute power lying around that they are selling it on a stock-like market. You bid how much money you are willing to spend on an instance. Whenever the market price is below that value, you get an instance. If it rises above that value, your instance has 2 minutes to get its shit together before AWS terminates it. However, if you diversify your instances over all AZs and different instance sizes, the chance of termination is very low. There are also some limitations to these instances:

  • Spot instances cannot be stopped. Shutting them down from the command line will terminate the instance.
  • EC2 does not allow specifying an EBS volume as the root of a spot instance, only an AMI.
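The 2-minute warning mentioned above is exposed through the EC2 instance metadata service. Graceful handling is a topic for the next post, but a minimal check could look like the sketch below; the helper name and polling setup are my own:

```shell
# Ask the EC2 metadata service whether this spot instance has received
# a termination notice. Prints the termination time and returns 0 if a
# notice is pending, returns 1 otherwise (the endpoint 404s normally).
spot_termination_pending() {
  local notice
  notice=$(curl -s --max-time 2 \
    http://169.254.169.254/latest/meta-data/spot/termination-time)
  case "$notice" in
    20*T*Z) echo "$notice"; return 0 ;;  # a timestamp like 2016-02-15T14:40:00Z
    *)      return 1 ;;
  esac
}
```

You would call this from a cron job or a small loop and start draining the instance as soon as it returns 0.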

Combining ECS and Spot Instances

I got the idea to run ECS on spot instances as a cheap way of scaling up and down.

But but but, the instances can be terminated any second! You cannot run this in a live environment!!

I think I can, because with a price that is more than 80% below the on-demand price, I can overprovision with a diversified group of instance sizes.

If you launch a group of spot instances together, they are called a fleet, and you can manage them via the Spot Fleet Requests API. So if you have 15 instances in that fleet and need 5 more, you run:
aws ec2 modify-spot-fleet-request --spot-fleet-request-id sfr-73fbd2ce-aa30-494c-8788-1cee4EXAMPLE --target-capacity 20
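Note that `--target-capacity` is absolute, not a delta, so you have to compute the new total yourself. A tiny wrapper could look like this (the function is my own illustration; in real use the current capacity would come from `aws ec2 describe-spot-fleet-requests`):

```shell
# Grow (or shrink, with a negative delta) a spot fleet. The actual AWS
# call is commented out so the sketch stays side-effect free.
scale_fleet() {
  local fleet_id=$1 current=$2 delta=$3
  local target=$((current + delta))
  echo "scaling $fleet_id to $target"
  # aws ec2 modify-spot-fleet-request \
  #   --spot-fleet-request-id "$fleet_id" --target-capacity "$target"
}

scale_fleet sfr-73fbd2ce-aa30-494c-8788-1cee4EXAMPLE 15 5
# prints: scaling sfr-73fbd2ce-aa30-494c-8788-1cee4EXAMPLE to 20
```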

Okok, but how do you get it all up and running in the first place?

You can use the new AWS Spot Instance console and get all your settings right. Be sure to add one extra EBS volume under some mount point. In the end the console will give you a nice button where you can download your configuration as JSON.
We need to modify this JSON file so that the extra EBS volume is attached at /dev/xvdcz. It will look like the snippet below. I edited out the other instance sizes; you can just add more blocks to the LaunchSpecifications array.

{
  "IamFleetRole": "arn:aws:iam::***********:role/aws-ec2-spot-fleet-role",
  "AllocationStrategy": "diversified",
  "TargetCapacity": 5,
  "SpotPrice": "1.00",
  "ValidFrom": "2016-02-15T14:40:42Z",
  "ValidUntil": "2017-02-15T14:28:42Z",
  "TerminateInstancesWithExpiration": false,
  "LaunchSpecifications": [
    {
      "ImageId": "ami-76e95b05",
      "InstanceType": "m4.xlarge",
      "KeyName": "YourSSHKey",
      "IamInstanceProfile": {
        "Arn": "arn:aws:iam::***********:instance-profile/SomeProfile"
      },
      "BlockDeviceMappings": [
        {
          "DeviceName": "/dev/xvdcz",
          "Ebs": {
            "DeleteOnTermination": true,
            "VolumeType": "gp2",
            "VolumeSize": 40,
            "Encrypted": false
          }
        }
      ],
      "SecurityGroups": [
        {
          "GroupId": "sg-******"
        }
      ],
      "SubnetId": "subnet-******",
      "UserData": "---UserDataInBase64---"
    }
  ]
}

You can now feed this file directly to the API via:
aws ec2 request-spot-fleet --spot-fleet-request-config file://spot_request.json

As you already saw in a previous section, we use the UserData field to run some commands after the instance launches. I extended this field so that each new instance now pulls a playbook repository from our Bitbucket and has Ansible run the bootstrap playbook locally. The following blog post was very helpful with this: https://ivan-site.com/2014/10/auto-scaling-on-amazon-ec2-with-ansible/
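As a sketch of what such a script might look like, and of how to base64-encode it for the `UserData` field of the fleet JSON: note that the repo URL, playbook name, and file names below are placeholders, not our real setup.

```shell
# Write a hypothetical bootstrap user-data script.
cat > user_data.sh <<'EOF'
#!/bin/bash
yum install -y git ansible
# Clone the playbook repo and run the bootstrap playbook locally.
ansible-pull -U git@bitbucket.org:yourorg/playbooks.git bootstrap.yml
EOF

# The Spot Fleet JSON wants this base64-encoded, on a single line.
base64 -w0 user_data.sh > user_data.b64
```

The contents of user_data.b64 then go into the "UserData" field of the request JSON.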

Conclusion

In this post I described my idea to utilize AWS Spot Instances as a cheap way of scaling an ECS cluster. I showed only API examples, as I still think the console is not a good starting point for this. However, AWS is very well aware of the limitations, and ECS is under heavy development. There are some cool features to come in the near future.

The next post here will cover a way to gracefully handle the termination of a spot instance.

Sebastian Herzberg
