When Amazon Web Services overtakes one of your custom features
This post is also published on Medium.

It’s something of a standing joke in the industry: do some cool thing with AWS to implement an infrastructure feature and, if it works well, Amazon will come along a few months later with a matching in-house feature. Sometimes that feature is a relatively simple thing, maybe something that was obviously missing; sometimes you had a whole project that is now essentially deprecated; and sometimes it’s one feature of a larger piece of work, which means you have to adjust or re-evaluate your approach. What do you do when this happens? This article covers an example of each: one that happened with a company I was working with, one that happened with a third-party project we made use of, and one that happened with my own project.
The Simple Missing Feature
In early 2018 I took up a position with DAZN. We had decided to run many of our services on AWS’ Elastic Container Service (ECS). For those not familiar with it, ECS is an AWS-specific Docker orchestrator and scheduler - broadly equivalent to Kubernetes, but predating it and with lots of AWS-specific tie-ins and features. At the time it was possible to use ECS either on customer-managed virtual machine instances (ECS on EC2) or on an abstracted AWS-managed platform (Fargate). We were using ECS on EC2, but neither flavour offered the equivalent of k8s daemon sets - ensuring that a service or container runs on each and every node. Our team got around this by going old-school and preparing custom AMIs (system images) with RPM installs of our logging and metrics agents etc. This was a quick and effective solution.

In June 2018 AWS introduced daemon tasks. Although this was exactly what we would have asked for originally, there were initially some gaps and delays in implementation - both with AWS and with third-party tooling, e.g. Terraform - so it was not immediately ready for us to switch to. A few months later it was fully implemented and we could have moved, but by then we were set up and comfortable with our own solution. We also had some things built in that could not be implemented as daemon tasks - specifically logging onto ECS cluster hosts using widdix/aws-ec2-ssh (see next example). At the time I moved on to a new organisation (summer 2019), we were still using the RPM method for pre-existing features and for those that were not compatible with daemon tasks. Simply put, the gains of moving to the OEM standard were not sufficient to offset the cost of switching.
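For readers who have not met daemon tasks, the following is a minimal sketch of what requesting one looks like through the ECS `CreateService` API. It only builds the request parameters. The cluster, service and task definition names are hypothetical placeholders, and this is an illustration rather than anything we ran at DAZN.

```python
from typing import Any, Dict


def daemon_service_request(cluster: str, service_name: str, task_definition: str) -> Dict[str, Any]:
    """Build keyword arguments for the ECS CreateService API using daemon scheduling."""
    return {
        "cluster": cluster,
        "serviceName": service_name,
        "taskDefinition": task_definition,
        # The daemon strategy applies to container instances, i.e. ECS on EC2.
        "launchType": "EC2",
        # DAEMON asks ECS to place one task on each eligible container instance.
        "schedulingStrategy": "DAEMON",
        # desiredCount is left out: ECS does not require it for daemon services.
    }


if __name__ == "__main__":
    # All three names below are hypothetical examples.
    request = daemon_service_request("example-cluster", "log-agent", "log-agent:1")
    print(request)
    # With boto3 installed and credentials configured, this could be sent as:
    #   boto3.client("ecs").create_service(**request)
```

The point of the sketch is that the scheduler, not the machine image, becomes responsible for getting the agent onto every node.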
The third-party project now ‘eclipsed’?
As mentioned above, when I was at DAZN we made use of the third-party widdix/aws-ec2-ssh project. As it says on the GitHub page, this project lets you ‘Use your IAM user’s public SSH key to get access via SSH to an EC2 instance’. We used this on our ECS cluster hosts and as a heavily locked-down ‘tunnel’ SSH server in place of a VPN. Eventually we swapped out the authentication logic due to IAM rate limiting, but the principle was the same. We could manage and log SSH access to specific instances for individual users across AWS accounts in our AWS organisation with self-service SSH keys.

In September 2018 AWS introduced Systems Manager Session Manager. Initially this looked like it might be able to replace the Widdix project, but this impression was short-lived - it was, and is, a custom agent solution requiring supported software on both the remote and the local host. As others have described: “An agent running on the EC2 instance connects to the Systems Manager’s backend and executes commands on the machine. Therefore, the EC2 instance needs access to the Internet or a VPC endpoint”. Separately, whilst user access could be granted on an individual basis, user identities are flattened to ‘ec2-user’. This and other deficiencies at the time of review meant that we did not take up Systems Manager Session Manager. In summer 2019 I left DAZN and went to work at a GCP shop. I subsequently learned that Michael Wittig (the ‘Wi..’ in Widdix) wrote (11 Jun 2019) ‘AWS SSM is a trojan horse: fix it now!’.

In June 2019, Amazon introduced EC2 Instance Connect (‘EIC’). From the summary: “Amazon EC2 Instance Connect provides a simple and secure way to connect to your instances using Secure Shell (SSH). With EC2 Instance Connect, you use AWS Identity and Access Management (IAM) policies and principals to control SSH access to your instances, removing the need to share and manage SSH keys. All connection requests using EC2 Instance Connect are logged to AWS CloudTrail so that you can audit connection requests.”

At the time of writing I have not looked into EIC thoroughly. Arguably EIC DOES look like an effective replacement for the Widdix project in many circumstances. There are some ‘nice-to-have’ features missing - more on these in the next section - some of which would be critical in some settings, but the core idea of self-service SSH keys being managed via AWS with IAM policies and APIs and used to access EC2 instances is covered. That is, until you read that Widdix feel that there may be some issues - ‘EC2 Instance Connect is an insecure default!’.
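To make ‘IAM policies and principals to control SSH access’ a little more concrete, here is a hedged sketch of the kind of IAM policy document involved. It allows the `ec2-instance-connect:SendSSHPublicKey` action on one instance and restricts the operating system user through the `ec2:osuser` condition key. The instance ARN and account number are made-up placeholders, and a real policy would likely need further statements.

```python
import json
from typing import Any, Dict


def eic_policy(instance_arn: str, os_user: str) -> Dict[str, Any]:
    """Return an IAM policy document allowing an SSH key push for one OS user on one instance."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ec2-instance-connect:SendSSHPublicKey",
                "Resource": instance_arn,
                # Only allow keys to be pushed for this operating system user.
                "Condition": {"StringEquals": {"ec2:osuser": os_user}},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN, used purely for illustration.
    arn = "arn:aws:ec2:eu-west-1:123456789012:instance/i-0123456789abcdef0"
    print(json.dumps(eic_policy(arn, "ec2-user"), indent=2))
```

Note that the policy names an operating system user rather than a person, which is relevant to the identity-flattening discussion in the next section.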
Re-evaluation of a core feature in my own project
Whilst I was at DAZN, I was able to develop significant improvements to my own open source bastion project - terraform-aws-ssh-bastion-service - and we used it there from early 2018. The key features of this project are:
- Containerising a bastion to render it stateless
- Making it highly-available
- Self-service SSH keys being managed via AWS with IAM policies and APIs and used to access an instance in a VPC
- Making the project available as a public Terraform module
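The self-service key idea in the list above can be sketched as follows: look up the active SSH public keys that a user has uploaded to IAM, so that a host can accept them at login time. This is an illustration of the general idea only, not the code used in the module or in the Widdix project. `list_ssh_public_keys` and `get_ssh_public_key` are the IAM API calls as exposed by boto3, while `FakeIam` and the user name `alice` are hypothetical stand-ins so that the example runs without AWS access. Pagination and error handling are ignored.

```python
from typing import Any, List


def active_ssh_keys(iam_client: Any, user_name: str) -> List[str]:
    """Return the OpenSSH-formatted bodies of a user's active IAM SSH public keys."""
    listing = iam_client.list_ssh_public_keys(UserName=user_name)
    keys = []
    for entry in listing.get("SSHPublicKeys", []):
        if entry.get("Status") != "Active":
            continue  # skip keys the user has deactivated
        detail = iam_client.get_ssh_public_key(
            UserName=user_name,
            SSHPublicKeyId=entry["SSHPublicKeyId"],
            Encoding="SSH",
        )
        keys.append(detail["SSHPublicKey"]["SSHPublicKeyBody"])
    return keys


class FakeIam:
    """Hypothetical stand-in for boto3.client("iam"), holding two example keys."""

    _keys = {
        "APKAEXAMPLEACTIVE": ("Active", "ssh-ed25519 AAAAexampleactive alice"),
        "APKAEXAMPLEOLD": ("Inactive", "ssh-ed25519 AAAAexampleold alice"),
    }

    def list_ssh_public_keys(self, UserName):
        return {
            "SSHPublicKeys": [
                {"UserName": UserName, "SSHPublicKeyId": key_id, "Status": status}
                for key_id, (status, _body) in self._keys.items()
            ]
        }

    def get_ssh_public_key(self, UserName, SSHPublicKeyId, Encoding):
        status, body = self._keys[SSHPublicKeyId]
        return {"SSHPublicKey": {"SSHPublicKeyId": SSHPublicKeyId,
                                 "SSHPublicKeyBody": body, "Status": status}}


if __name__ == "__main__":
    for key in active_ssh_keys(FakeIam(), "alice"):
        print(key)
```

Because each key is looked up against a named IAM user, access can be granted, revoked and logged per person rather than per shared account.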
As described above, I’d considered and disregarded Systems Manager Session Manager when it came out. I then left to work in a GCP shop in late June. As a result I did not become aware of EIC until December 2019, when I was giving a presentation at the London HashiCorp User Group. You can see the video on YouTube. I was covering the above points - the specifics of this bastion project and the generality of using a microcomponent approach to infrastructure. At the end of my presentation there was a question regarding EC2 Instance Connect. At the time I was not familiar with it but I would now like to respond here.

EC2 Instance Connect (‘EIC’, introduced late June 2019) is a great-looking system which I haven’t used in production, having moved to a GCP shop just before it was introduced. Once deployed, the principal differences between it and the system that I’ve developed are that with EIC everyone is logging in as ec2-user on a given host, which is by default stateful and not containerised. Although CloudTrail logging is provided, this is actually for the pushing of SSH keys to the metadata service and not anything directly on the host. Since all users are flattened into a single identity on the stateful host, there is no compartmentalisation between them and any onward events cannot be discretely logged against a single user. Users can easily interfere with one another’s work. Of course all instances using it must have port 22 exposed to the outside world - this does not appear to be configurable.

In summary, whilst EIC has the distinct advantages of being first party, with ephemeral SSH keys, it has the downsides that instances are stateful and that flattened identities preclude detailed logging. It is also of course an AWS feature, rather than a Terraform module, and addresses nothing in the way of high availability etc. The key inspiration for my design was in fact bastion containerisation, which is similarly unaddressed. Concerns have been raised that permitting access via this method grants onward access to all instances in the VPC running AWS-specific Amazon Linux or Ubuntu AMIs and further onward to everything that those instances have access to.

Of course there is nothing to stop you using the public module that I’ve created together with EC2 Instance Connect. The module allows swapping out the authentication mechanism with anything you like. You could then have a containerised bastion with EIC-based authentication. The cost of the ephemeral containerised hosts in my design with individual user identities is of course the greater vulnerability to DDoS than a stateful host, just as I described in reference to the Widdix project in my presentation and project readme.

An obvious next step for my own project is to implement EIC as an alternative authentication mechanism (or to create a new project implementing it in this way). At the moment I am reviewing the possibilities!