Gracefully handle the termination of AWS Spot Instances

As announced in my last post, today I want to write about the termination of AWS Spot Instances and how I set up a termination-spotter service.

If the price for a spot instance rises above the limit that you are willing to pay for it, you will lose the instance. However, you will not lose it all of a sudden: AWS gives you a two-minute warning before termination. This warning comes in the form of an instance metadata endpoint at http://169.254.169.254/latest/meta-data/spot/termination-time. The endpoint answers with a 404 until your instance has been marked for termination, and only then becomes available. AWS recommends that interested applications poll for the termination notice at five-second intervals.

Quick and dirty, the following lines will do the trick:

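#!/usr/bin/env bash
# Poll the spot termination endpoint every five seconds.
while true
  do
    # The status line contains 404 until the instance is marked for termination.
    # Once it no longer does, the substitution is empty and the check succeeds.
    if [ -z "$(curl -Is http://169.254.169.254/latest/meta-data/spot/termination-time | head -1 | grep 404 | cut -d \  -f 2)" ]
      then
        # run something that deals with the termination
        break
      else
        sleep 5
    fi
  done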
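One caveat of the check above: anything other than a 404 status line counts as a termination notice, including a request that fails and returns nothing at all. A slightly more defensive variant could ask curl for the bare HTTP status code and react only to a 200. The following is a minimal sketch of that idea, not part of my running setup; the names `TERMINATION_URL`, `notice_status` and `is_termination_notice` and the two-second timeout are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: react only to an explicit 200 from the termination endpoint.
TERMINATION_URL="http://169.254.169.254/latest/meta-data/spot/termination-time"

# Print the HTTP status code of the termination endpoint.
# curl prints 000 when the request itself fails (timeout, no route),
# so a broken network is not mistaken for a termination notice.
notice_status() {
  curl -s -o /dev/null -m 2 -w '%{http_code}' "$TERMINATION_URL"
}

# Succeed only when the given status code is 200.
is_termination_notice() {
  [ "$1" = "200" ]
}

# Possible use in the polling loop (left commented out so the file can be sourced):
# until is_termination_notice "$(notice_status)"; do sleep 5; done
```

With this variant a 404, a 5xx response, or a failed request all lead to another five-second sleep instead of triggering the termination handling.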

You may want to write an init file (here an Upstart job) to start the script easily:

description "Termination Spotter"
author "Sebastian Herzberg"
start on runlevel [2345]

pre-start script
  echo "[`date`] Termination Spotter starting" >> /var/log/termination-spotter.log
end script

# No trailing "&": Upstart supervises the process that exec starts.
exec /bin/bash /var/opt/termination-spotter.sh > /dev/null

All our ECS instances are registered with several load balancers, one for each service container that runs on the instance. When the termination notice comes up, the instance needs to be pulled out of every ELB and deregistered from the ECS cluster. I am an Ansible fan, so I wrote a playbook that is triggered by the above script. It makes use of the local ECS metadata API that the ECS agent exposes at http://localhost:51678/v1/metadata:

---
- name: Deregister from any ELB and Cluster
# =========================================
  hosts: localhost
  gather_facts: yes
  connection: local

  tasks:
    - name: Gather EC2 facts
      ec2_facts:

    - name: Deregistering target instance from ELBs
      ec2_elb:
        instance_id: "{{ ansible_ec2_instance_id }}"
        region: "eu-west-1" 
        state: "absent"
        wait: no

    - name: Get the cluster name of the current instance
      shell: "curl -m 10 -s http://localhost:51678/v1/metadata | jq -r '. | .Cluster' | awk -F/ '{print $NF}'"
      register: clustername

    - name: Get the instance ARN
      shell: "curl -m 10 -s http://localhost:51678/v1/metadata | jq -r '. | .ContainerInstanceArn'"
      register: instance_arn

    - name: Deregister instance from cluster
      shell: "aws ecs deregister-container-instance --cluster {{ clustername.stdout }} --region eu-west-1 --container-instance {{ instance_arn.stdout }} --force"

First, the playbook deregisters the instance from all the ELBs it is registered with. Then it looks up which ECS cluster the instance belongs to and fetches the ARN of the instance in order to deregister it from the cluster. I skipped a step here, in which the playbook fires a Slack notification about the upcoming termination. After that, we just wait for the instance to die.
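A note on the cluster-name task: it pipes the `Cluster` value through `awk -F/ '{print $NF}'`, which keeps only the text after the last slash. That way the task yields a bare name whether the value is already a plain name or a slash-separated, ARN-style string. As a small sketch, the same trimming can be done in plain bash with parameter expansion; the function name `last_path_segment` and the sample values below are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: keep only the text after the last "/" of a value,
# mirroring awk -F/ '{print $NF}' from the playbook.
last_path_segment() {
  printf '%s\n' "${1##*/}"
}

# Hypothetical sample values, for illustration only:
last_path_segment "arn:aws:ecs:eu-west-1:123456789012:cluster/my-cluster"  # prints my-cluster
last_path_segment "my-cluster"                                             # prints my-cluster
```

A value without any slash passes through unchanged, so the trimming is safe to apply in both cases.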

That's it for today. Thank you for reading, and see you soon.

Sebastian Herzberg
