Part 1 - Why change a running system?
At FREE NOW (formerly mytaxi) we run a microservice backend consisting of roughly 300 services supported by a set of auxiliary applications. Since 2015 we have been using AWS ECS as the underlying container platform. A lot of tooling was built around it, including a custom-tailored tool called IDOL. It handles deployments so that developers do not have to deal with the AWS API: it updates ECS task definitions and the services they belong to. IDOL makes a lot of assumptions about the service landscape and the containers that are used. You could only deploy services that expose a healthcheck, and you were forced to go through a canary stage in which a single container running the new version was deployed as a separate ECS service. Other pain points of the existing platform were:
- Can't deploy every type of workload, e.g. daemons or cron jobs.
- Conventions are scattered across multiple sources, including a large CSV acting as a service catalogue, Atlassian Bamboo plugin configuration and ECS autoscaling configuration.
- Requires manual work (e.g. DNS entries, autoscaling limits).
- Two HTTP routers: Netflix Zuul for path-based routing and Fabio for host-based routing.
- ECS cannot prevent unhealthy services from being targeted by the HTTP routers; to fill this gap we use Consul (sketched below). However, ECS does not act on a failing healthcheck in Consul.
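To illustrate that last point, here is a minimal sketch of a Consul service definition with a health check; the service name, port and /health path are assumptions made for this example. Consul takes an unhealthy instance out of router rotation, but ECS never learns about it and keeps the task running.

```hcl
# Hypothetical Consul service definition (HCL), registered by the agent
# running alongside the ECS task. Name, port and path are placeholders.
service {
  name = "example-service"
  port = 8080

  check {
    # Consul marks the instance unhealthy when this endpoint fails, so the
    # HTTP routers stop sending traffic -- but ECS does not restart the task.
    http     = "http://localhost:8080/health"
    interval = "10s"
    timeout  = "2s"
  }
}
```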
To cope with the rapid growth of our tech teams and to get rid of the above-mentioned pain points, in spring 2018 we started researching how to bring our platform and service deployments into the future.
After visiting KubeCon Copenhagen (2018), we settled on Kubernetes as the target container platform we would build on.
We certainly didn’t spend all of our time in the Bella Center. Mikkeller Baghaven is worth a visit to try one of their awesome barrel-aged beers!
After choosing Kubernetes, the first order of business was to learn how to set up clusters in our existing AWS accounts. We run ~20 AWS accounts that are bootstrapped with a Terraform module. The only things required to start are a /19 legacy IP range and a name. Once the module has run successfully we end up with an account that is connected to our corporate network and to our authentication system, Okta; it comes with private and public subnets and a set of sane security defaults.
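As a rough sketch of what such an account bootstrap looks like, the module invocation below is hypothetical (module source, name and variable names are assumptions, not our real module); the point is simply that a name and a /19 CIDR are the only required inputs.

```hcl
# Hypothetical invocation of the account bootstrap module.
# Source path and variable names are placeholders for illustration only.
module "aws_account" {
  source = "./modules/aws-account-bootstrap"

  # The only required inputs: a name and a /19 legacy IP range.
  name       = "payments-prod"
  cidr_block = "10.12.0.0/19"
}

# Everything else (VPC, private/public subnets, attachment to the corporate
# network, Okta integration, security baselines) is derived inside the
# module from sane defaults.
```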
Since we build almost everything with Terraform, we wanted to build our Kubernetes clusters with it as well. However, the knowledge about how to set up a cluster was limited within our team. The requirements that needed to be fulfilled were:
- Must work with existing AWS VPCs
- Authentication/authorization must hook into Okta
- Should allow for customization of API server flags and kubelet flags
- Cluster updates should be automatable
Kops offered solutions for all of the above, but the kops Terraform output wasn't satisfying. So we started to use Terraform data resources to query existing infrastructure and render out a kops cluster.yaml. Since all AWS accounts are set up the same way, a Terraform module emerged that generates cluster specifications. These cluster.yaml files are then used to manually create a cluster with kops create -f cluster.yaml. Subsequent rolling cluster updates are done in a custom Jenkins pipeline.
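A heavily simplified sketch of that idea is shown below; the data source filters, tag names, template variables and file layout are assumptions made for this example, not our actual module. The gist is that Terraform looks up the existing VPC and subnets and renders them into a kops cluster spec, which is then applied with kops create -f cluster.yaml as described above.

```hcl
# Hypothetical sketch: query existing infrastructure and render a kops
# cluster.yaml from it. Names, tags and the template layout are placeholders.

data "aws_vpc" "main" {
  tags = {
    Name = "main" # assumes the bootstrap module tags the VPC like this
  }
}

data "aws_subnets" "private" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.main.id]
  }
  tags = {
    Tier = "private"
  }
}

# Render the kops cluster spec from a template shipped with the module.
resource "local_file" "cluster_yaml" {
  filename = "${path.module}/cluster.yaml"
  content = templatefile("${path.module}/cluster.yaml.tpl", {
    cluster_name = "k8s.example.internal"
    vpc_id       = data.aws_vpc.main.id
    subnet_ids   = data.aws_subnets.private.ids
  })
}
```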
By autumn 2018 we had a couple of clusters running Kubernetes 1.10 and 1.11, but no way to bootstrap them with auxiliary tools and no way to deploy actual services to them from our CI/CD pipelines.
So we did what we always do when we have a large project and need some quiet time to really dive deep into it. We packed a car full of food and beverages and rented a huge house in Denmark for a week!
More about the targets for the new platform and what else happened in Denmark will follow in part 2…