Cloud infra as code solutions

I recently completed some projects requiring use of CloudFormation in a real environment where I would typically use Terraform. I wanted to make the most of it, so I wrote up my experiences and conclusions across various projects using CloudFormation and its variations (SAM, CDK, CDKTF) compared to Terraform. Spoiler: CloudFormation has its niche(s), but it remains (IMO) a specialised tool.

The SAM Stack

As countless others have written, ‘AWS Serverless Application Model’ (‘SAM’) is an area where CloudFormation really shines. I have now done a few projects with it; the most complex was a solution I designed for an archiving project that involved:

  • 2 Lambdas
  • 2 DynamoDB tables
  • An S3 Bucket
  • 2 State machines (1 to control each Lambda)

With SAM we can define AWS resources specific to, and in the context of, our Lambdas. The implied policy generation and policy templates can then really speed things up compared with having to explicitly state policies, trust relationships, attachments, etc. Having sam build and sam package slipstreamed into the workflow is excellent for managing imports of external libraries (I was using Python with, e.g., the external requests module). You can also bust out to ‘plain’ CloudFormation in SAM templates, so SAM can be considered a superset of it. I did not explore the further features around tests, logging, etc. on this occasion.

Here SAM greatly simplifies the development and deployment workload compared to Terraform (or plain CloudFormation!). I would certainly use it again for this type of use case.

Connecting outside of AWS

In one project I worked on, there was a requirement to create an AWS SaaS app instance and integrate it with an external Single Sign-On identity provider at the account level, i.e. not through AWS SSO/‘IAM Identity Center’. Connecting to services outside of AWS can be challenging with CloudFormation, and this project was no exception. In this particular case there were some mutual back-and-forth steps that meant it was not possible to create just the resources on one side of the divide in isolation. We needed the app instance to exist so that we could get the instance ID, to use with the external IdP, to get the XML config, to supply to the AWS identity provider… If the shop is only using CloudFormation then most or all of the task is going to be Click-Ops in this sort of scenario. I did not try to do the equivalent with Terraform on this specific occasion, but I have worked with Okta and AWS SSO with Terraform and done similar with e.g. Kubernetes federation. Entra/AAD has an officially supported Terraform provider.

Conditional resources

For another project I needed to create a (WAF) WebACL with essentially a checkbox selection of rules. In this case the rules were IPSets. Nearly all of these would be the same in each case where used but each deployment needed to be able to have its own configurable selection applied from these. My requirement included 14 rules, each with a default value and an accompanying boolean conditional.

Plain CloudFormation

Despite the fact that I was working exclusively with AWS first-party resources, CloudFormation was a disappointment here. Whilst CloudFormation technically supports loops, with Fn::ForEach as part of the AWS::LanguageExtensions transform, I wasn’t able to find a way to make useful use of this and get it working in this case. SAM was no help here because I wasn’t doing anything with Lambda, IAM or associated resources.

My resulting CloudFormation (YAML) template was 555 lines long and very hard to manage. For a given rule I had 9 lines in the ‘Parameters’ section; 2 in ‘Conditions’; 14 for the attribute of the WebACL resource; and 10 for the IPSet resource. That is a total of 35 lines per rule in 4 isolated sections, all of which needed to match and correlate with one another. Bear in mind that there were 14 sets of these to manage in a single file. For a single rule this looked like:

Sample CloudFormation YAML:

Parameters:
  IncludeClientSpecificIPs:
    Type: String
    AllowedValues: ["yes", "no"]
    Default: "no"
    Description: Include Client-specific IP addresses in the passlist
  ClientSpecificIPAddresses:
    Type: CommaDelimitedList
    Description: Client-specific IP addresses to be passlisted.
    Default: ""

Conditions:
  IncludeClientSpecificIPsCondition: !Equals [!Ref IncludeClientSpecificIPs, "yes"]
  HasClientSpecificIPs: !Not [!Equals [!Join ["", !Ref ClientSpecificIPAddresses], ""]]

Resources:
  # Showing top level resource for clarity
  WebACL:
    Type: AWS::WAFv2::WebACL
    Properties:
      Name: !Ref WebACLName
      DefaultAction:
        Allow: {}
      Scope: REGIONAL
      VisibilityConfig:
        SampledRequestsEnabled: true
        CloudWatchMetricsEnabled: true
        MetricName: MyWebACL
      Rules:
        # Our specific rule:
        - !If
          - IncludeClientSpecificIPsCondition
          - Name: ClientSpecificIPPasslist
            Priority: 3
            Action:
              Allow: {}
            Statement:
              IPSetReferenceStatement:
                Arn: !GetAtt ClientSpecificIPSet.Arn
            VisibilityConfig:
              SampledRequestsEnabled: true
              CloudWatchMetricsEnabled: true
              MetricName: ClientSpecificIPPasslist
          - !Ref "AWS::NoValue"
    

  ClientSpecificIPSet:
    Type: AWS::WAFv2::IPSet
    Condition: IncludeClientSpecificIPsCondition
    Properties:
      Name: ClientSpecificIPSet
      Scope: REGIONAL
      IPAddressVersion: IPV4
      Addresses: !If
        - HasClientSpecificIPs
        - !Split [",", !Join [",", !Ref ClientSpecificIPAddresses]]
        - ["127.0.0.1/32"]

Variations for different deployments go in the usual CloudFormation JSON parameters file, like:

[
    {
        "ParameterKey": "IncludeClientSpecificIPs",
        "ParameterValue": "yes"
    }
]
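As a rough illustration of how such per-deployment files could be generated rather than hand-edited, here is a minimal Python sketch. The helper name `build_parameters` and the example addresses are hypothetical; only the ParameterKey/ParameterValue shape and the two parameter names are taken from the template above.

```python
import json


def build_parameters(include_client_ips: bool, client_ips: list[str]) -> list[dict]:
    """Return a CloudFormation parameters list for one deployment."""
    return [
        {
            # The template expects the strings "yes" or "no", not a boolean.
            "ParameterKey": "IncludeClientSpecificIPs",
            "ParameterValue": "yes" if include_client_ips else "no",
        },
        {
            # CommaDelimitedList parameters are passed as one comma-joined string.
            "ParameterKey": "ClientSpecificIPAddresses",
            "ParameterValue": ",".join(client_ips),
        },
    ]


if __name__ == "__main__":
    # Hypothetical deployment values, for illustration only.
    params = build_parameters(True, ["203.0.113.10/32", "203.0.113.11/32"])
    print(json.dumps(params, indent=4))
```

This only moves the problem, of course: with 14 rules the generator still has to know every parameter name that the template declares.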

I had great difficulty finding an effective solution to live linting and tool-tipping in an IDE (despite trying several different ones). Of course CloudFormation YAML isn’t quite valid YAML (even with the Fn::/!Sub etc shorthands excepted) and cfn-lint and friends often seem to struggle to differentiate between errors and warnings/recommendations, or to recognise that actually it is valid CloudFormation. I understand that this is a common frustration but realise also that this is not a core limitation of CloudFormation.

A limitation of CloudFormation is its inability to query the environment at deploy time - there is no equivalent to Terraform’s data lookups. If I wanted to e.g. query the VPC that I was in then my choices seemed to be:

  1. Manually copy across a static value
  2. Create and include a Lambda in my stack to make the query and return the result
  3. Rely on an Export from another stack

Whilst the 2nd option is possible, at this point it’s questionable whether we are still using the right tool for the job. We’re essentially breaking out into free-form customisation to make up for gaps in the tool itself. Yes, you can do this in other configuration management tools, e.g. Ansible with ansible.builtin.command or Terraform with null_resource and the remote-exec/local-exec provisioners, but both tools warn against this use and advise that it’s a last resort. Yes, you can even create custom providers, but as a rule you shouldn’t!
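To give a feel for how much glue option 2 involves, here is a minimal Python sketch of a Lambda-backed custom resource handler. It is not the code from my project: the choice of looking up the default VPC, the function names, and the fallback physical ID `vpc-lookup` are all assumptions for illustration. The response fields are the ones CloudFormation expects back from a custom resource.

```python
import json
import urllib.request


def find_default_vpc_id() -> str:
    """Assumed lookup: return the ID of the account's default VPC via boto3."""
    import boto3  # imported lazily so the rest of the sketch runs without it

    ec2 = boto3.client("ec2")
    result = ec2.describe_vpcs(Filters=[{"Name": "is-default", "Values": ["true"]}])
    return result["Vpcs"][0]["VpcId"]


def build_response(event: dict, status: str, data: dict, reason: str = "") -> dict:
    """Build the JSON body CloudFormation expects back from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason,
        "PhysicalResourceId": event.get("PhysicalResourceId", "vpc-lookup"),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data,
    }


def send_response(event: dict, body: dict) -> None:
    """PUT the response to the pre-signed URL supplied in the event."""
    request = urllib.request.Request(
        event["ResponseURL"],
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": ""},  # stop urllib adding a form content type
    )
    urllib.request.urlopen(request)


def handler(event, context, lookup=find_default_vpc_id, send=send_response):
    """Lambda entry point; lookup and send are injectable so the logic can be tested."""
    try:
        # Create and Update perform the lookup; Delete only acknowledges.
        data = {"VpcId": lookup()} if event["RequestType"] != "Delete" else {}
        body = build_response(event, "SUCCESS", data)
    except Exception as error:  # always reply, otherwise the stack waits for a timeout
        body = build_response(event, "FAILED", {}, reason=str(error))
    send(event, body)
    return body
```

The template would then read the value with !GetAtt on the custom resource. That is around fifty lines of code, plus a role and a function resource, to do what a single data block does in Terraform.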

Option 3 pulls us into one of the traps of CloudFormation:

After another stack imports an output value, you can’t delete the stack that is exporting the output value or modify the exported output value. All the imports must be removed before you can delete the exporting stack or modify the output value.

This limitation causes some teams (quite reasonably!) to eschew the use of CloudFormation Export/!ImportValue.

Alternative 1: CDK

It seems that a ‘typical’ next step for people running into these sorts of issues is CDK. I suppose that this is understandable if you already find yourself writing custom Lambdas to make obvious queries for your CloudFormation, and it may be considered trivial if you are in the business of writing Lambdas regularly. I have had concerns in the past about this methodology blurring the architectural domain boundaries between infra and application code, but I explored CDK here with both Python and TypeScript. With good architectural discipline there is no intrinsic reason why it should not be possible to maintain good boundaries, just as discussed with SAM above. Obviously any CDK approach is optimised toward a code-literate team. I considered 2 language variations: TypeScript (seems most popular), and Python (more familiar in my case):

CDK (TypeScript)

At this point I would not consider myself expert with TypeScript, let alone CDK. Given the generation and synth process I also don’t think it would be fair to look at total lines of code. The lines added across the project after the init process were:

  • cdk.json (5 lines for total 77)
  • bin/waf-api-gateways.ts (20 lines for total 26)
  • lib/waf-api-gateways-stack.ts (51 lines for total 56)

So a total of 76 (or 159, depending how you count) lines for the changed files only, and I was able to generate sensible CloudFormation with cdk synth. Because TypeScript is a ‘real’ programming language I could use a loop in my main code. For different deployments I can supply different context values, either through a different cdk.json or by overriding individual keys with the --context flag at deploy time. The cdk.json looks like:

{
  "app": "npx ts-node --prefer-ts-exts bin/waf-webacls.ts",
  "context": {
    "webAclName": "DevWebACL",
    "includeRules": ["ClientSpecificIPPasslist"]
  }
}

I can’t speak from experience as to how extensible this is with regard to, e.g. subsequent integration into a bigger stack, but certainly as a free standing deployment project this is vastly more sane than the raw CloudFormation. It’s straightforward and simple to add or modify rules etc.

CDK (Python)

I had a brief look at a similar port to Python CDK. The lines of code and usage, as one might expect, were very similar to the TypeScript example apart from the language syntax, and I would express similar sentiments.
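To show why a loop in a general-purpose language helps so much here, the following is a minimal sketch in plain Python. It deliberately does not use the CDK library or reproduce my stack code: it just builds a CloudFormation-shaped dictionary, which is roughly the kind of output a synth step produces. The `IP_SETS` catalogue, its addresses, and the function names are hypothetical.

```python
import json

# Hypothetical rule catalogue: rule name -> IP addresses (illustrative values).
IP_SETS = {
    "DataCentreIPPasslist": ["192.0.2.10/32"],
    "ClientSpecificIPPasslist": ["203.0.113.10/32"],
    "VPNIPPasslist": ["198.51.100.0/24"],
}


def visibility(metric_name: str) -> dict:
    """Shared VisibilityConfig block, written once instead of once per rule."""
    return {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": metric_name,
    }


def build_template(web_acl_name: str, include_rules: list[str]) -> dict:
    """Build one IPSet resource and one WebACL rule per included rule name."""
    resources = {}
    rules = []
    for priority, name in enumerate(include_rules, start=1):
        logical_id = f"{name}IPSet"
        resources[logical_id] = {
            "Type": "AWS::WAFv2::IPSet",
            "Properties": {
                "Name": name,
                "Scope": "REGIONAL",
                "IPAddressVersion": "IPV4",
                "Addresses": IP_SETS[name],
            },
        }
        rules.append({
            "Name": name,
            "Priority": priority,
            "Action": {"Allow": {}},
            "Statement": {
                "IPSetReferenceStatement": {"Arn": {"Fn::GetAtt": [logical_id, "Arn"]}}
            },
            "VisibilityConfig": visibility(name),
        })
    resources["WebACL"] = {
        "Type": "AWS::WAFv2::WebACL",
        "Properties": {
            "Name": web_acl_name,
            "Scope": "REGIONAL",
            "DefaultAction": {"Allow": {}},
            "VisibilityConfig": visibility("MyWebACL"),
            "Rules": rules,
        },
    }
    return {"Resources": resources}


if __name__ == "__main__":
    print(json.dumps(build_template("DevWebACL", ["ClientSpecificIPPasslist"]), indent=2))
```

The parameters, conditions and !If wrappers from the hand-written template disappear entirely: a rule that is not selected is simply never generated.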

Alternative 2: Terraform

The obvious alternative to CloudFormation would be Terraform. As with CloudFormation there are a couple of variations on it. Rendering the equivalent of the 555-line CloudFormation template described above in plain Terraform initially came to 249 lines, using a dynamic block for the rule with a nested loop, and an additional nested loop for the conditional resource creation. A simpler approach, using a map variable for the IPSet rules coupled with a for_each loop in main.tf for the IPSets and a dynamic block for the rule attribute, totalled 79 lines. N.B. this is a complete code solution, not an excerpt or module:

Shorter Terraform solution:

provider "aws" {
  region = "us-west-2"
}

resource "aws_wafv2_web_acl" "web_acl" {
  name  = var.web_acl_name
  scope = "REGIONAL"
  default_action {
    allow {}
  }
  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "MyWebACL"
    sampled_requests_enabled   = true
  }

  dynamic "rule" {
    for_each = var.include_rules
    content {
      name     = rule.value
      priority = index(var.include_rules, rule.value) + 1
      action {
        allow {}
      }
      statement {
        ip_set_reference_statement {
          arn = aws_wafv2_ip_set.passlist[rule.value].arn
        }
      }
      visibility_config {
        cloudwatch_metrics_enabled = true
        metric_name                = rule.value
        sampled_requests_enabled   = true
      }
    }
  }
}

resource "aws_wafv2_ip_set" "passlist" {
  for_each = { for key, value in var.ip_configs : key => value if length(value) > 0 }

  name               = each.key
  scope              = "REGIONAL"
  ip_address_version = "IPV4"
  addresses          = each.value
}

variable "web_acl_name" {
  description = "The name of the Web ACL to be created."
  type        = string
  default     = "WebACL"
}

variable "include_rules" {
  description = "List of rules to include in the Web ACL."
  type        = list(string)
  default     = []
}

variable "ip_configs" {
  description = "Mapping of IP set names to their IP addresses."
  type        = map(list(string))
  default = {
    DataCentreIPPasslist        = ["1.1.1.1/32", "1.0.0.1/32"]
    ClientSpecificIPPasslist    = []
    VPNIPPasslist               = ["8.8.8.8/32", "8.8.8.1/32"]
    SharedTestEuWest1IPPasslist = []
    SharedProdEuWest1IPPasslist = []
    SharedTestEuWest2IPPasslist = []
    SharedProdEuWest2IPPasslist = []
    DevEuWest1IPPasslist        = []
    ProdEuWest1IPPasslist       = []
    TestEuWest1IPPasslist       = []
    DevEuWest2IPPasslist        = []
    ProdEuWest2IPPasslist       = []
    TestEuWest2IPPasslist       = []
  }
}
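For readers less familiar with HCL, here is a small Python sketch that mirrors what the two key expressions in the configuration above evaluate to: the for_each filter on ip_configs and the index(...) + 1 priority calculation. The function name `plan_rules` and the `missing` check are my own additions for illustration; as far as I can tell, a rule listed in include_rules whose address list is empty would fail at plan time with an invalid index error, because no IP set is created for it.

```python
def plan_rules(include_rules: list[str], ip_configs: dict[str, list[str]]) -> dict:
    """Mirror the Terraform expressions to preview what would be created."""
    # Mirrors the for_each filter: only non-empty address lists become IP sets.
    ip_sets = {name: addrs for name, addrs in ip_configs.items() if len(addrs) > 0}
    # Mirrors index(var.include_rules, rule.value) + 1.
    priorities = {name: include_rules.index(name) + 1 for name in include_rules}
    # Included rules with no matching IP set cannot resolve their ARN reference.
    missing = [name for name in include_rules if name not in ip_sets]
    return {"ip_sets": sorted(ip_sets), "priorities": priorities, "missing": missing}


if __name__ == "__main__":
    # A subset of the default ip_configs map from the configuration above.
    configs = {
        "DataCentreIPPasslist": ["1.1.1.1/32", "1.0.0.1/32"],
        "ClientSpecificIPPasslist": [],
        "VPNIPPasslist": ["8.8.8.8/32", "8.8.8.1/32"],
    }
    print(plan_rules(["VPNIPPasslist", "ClientSpecificIPPasslist"], configs))
```

In practice this means a deployment has to supply addresses for any rule it includes whose default list is empty, which is the same intent as the paired boolean and address-list parameters in the CloudFormation version.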

Alternative 3: CDKTF

I briefly looked at what CDKTF (TypeScript) might look like and, at least superficially, it seemed similar to CDK TypeScript. It was more fiddly to get working, required more imports and the code was formatted a little differently, but it was usable.

CDKTF supports several providers including AWS, Azure, Google Cloud and GitHub, so it already has a broader set of targets than CDK. The HashiCorp upsell to Terraform Cloud is a little annoying but not unbearable. I can’t speak from experience as to how difficult it would be to mix and match CDK and CDKTF for different deployments in an organisation, but I would presume not very, since the languages and abstraction techniques are similar.

CloudFormation Hooks to an Organization with service-managed StackSets

This is not something I have looked at recently, but I did make use of it with my team on a past project and briefly mentioned it when writing about CloudFormation previously. I will defer to AWS’s own published article for the details, but the key point is that with CloudFormation it is possible to deploy to Organisational Units (OUs) and define rules for inheritance, i.e. what happens when an AWS account joins or leaves an OU. In that case the existing StackSet can be instantiated for a new account, or the stack removed for a departing one, automatically based on lifecycle events.

Terraform does not have a direct equivalent to this, essentially having a static model, but there are a couple of workarounds for this sort of case. One is to deploy CloudFormation StackSets with Terraform. This is certainly possible but not enjoyable: when I did this I found that I was working to the lowest common denominator of each. The other would be to use Account Factory for Terraform with some Lambda glue triggered by Control Tower lifecycle events. I confess I have not used this myself, but it is certainly possible. Being honest, however, this is a fairly rarefied use case. If it’s something that you’re considering then I would expect that you are already a long way down the road and have the capability to work around the issues in any event.

An Aside on CloudFormation Modules

I put this in a separate section because it’s relevant to the overall discussion but not directly related to the projects discussed. There’s a reason for this. CloudFormation modules are, so far as I can tell, essentially a CTO tick-box feature with little clear practical utility. Compared to Terraform, where modules are a key feature of the landscape, CloudFormation modules barely warrant the name. If I understand correctly from the CloudFormation module documentation:

  • A module must be registered in the account and region in which you want to use it.

  • During stack operations, CloudFormation uses whatever version of the module that’s currently registered as the default version in the account and region in which the stack operation is being performed. This includes modules that are nested in other modules.

  • Therefore, be aware that if you have different versions of the same module registered as the default version in different accounts or regions, using the same template may result in different results.

From my perspective this feature has limited practical utility. My key reasons for using modules in Terraform are:

  • Avoiding having to duplicate code
  • Being able to use off the peg components from third parties without having to manage them yourself
  • Being able to ensure that changes from new versions don’t impact existing deployments using old versions, i.e. unpredictable regressions.

These advantages are less ‘pronounced’ in the CloudFormation implementation…

Based on the documentation, rather than my own real-world experience at this point, this difference also holds for the CDK-style transpiler built on each, i.e. CDKTF can use regular Terraform modules (where the provider is supported by CDKTF).

Thoughts and Recommendations

Based on my experiences and the examples discussed, CloudFormation may not be the best fit as a general-purpose tool in the infrastructure space. It shines in specific areas, such as SAM deployments with Lambdas, which I would recommend. However, for many other use cases, alternatives like Terraform or CDK offer more flexibility and manageability.

The limitations in loops and modules in CloudFormation lead to more complex and difficult-to-manage code compared to tools that support these features. Additionally, CloudFormation’s inability to handle environment queries and non-AWS resources introduces avoidable complexity.

Arguments that CloudFormation is better supported by AWS, or simpler due to its YAML syntax, do not fully address the practical challenges. For example, my real-world use case resulted in 555 lines of CloudFormation code with complex internal dependencies, whereas the equivalent Terraform configuration was a much simpler and more maintainable 79 lines. For the argument that YAML is ‘simpler’ I would point to Kubernetes, which is largely configured with YAML and yet few would describe as simple. I don’t think that there’s any real comparison for complexity between the Terraform and the plain CloudFormation equivalents above. If I were leading a team with limited experience then I would feel more certain of predictable use and maintenance with the Terraform than with the CloudFormation. Even if you don’t understand how it works, it is straightforward to see how rulesets are defined and included, and little understanding of the core logic is needed to add to or modify them. Moreover, strong tooling support for Terraform (linting, formatting, tool-tipping, etc.) enhances its usability and reliability.

For teams comfortable with coding, CDK and CDKTF offer sensible solutions, using familiar programming languages with the caveats that good architectural discipline is needed and the range of providers is limited. While I am not an expert in CDK, it appears that transitioning between CDK and CDKTF is practicable and offers additional flexibility. Although I have not at this point explored it myself I understand that it is possible to use SAM and CDK together too.

Overall, while CloudFormation has specific strengths, broader infrastructure needs are often better served by alternative tools that provide greater flexibility, simplicity, and support.

Update 12th July - word from another source

Wouldn’t you know? Just after publishing I came across this article from SST: Moving away from CDK. It covers the nuances of CDK, and of CloudFormation in the context of it, in far more depth and detail than I could above. Short version: a significant third-party tooling provider has decided to move away from CDK, and towards Terraform providers, because of the limitations that they have found in it as part of CloudFormation and that they describe in some detail.