Monolith and Microservices in the Cloud

How involved is it to break down a monolith into microservices? Is it worth it? Is serverless better than Kubernetes? Here’s my example of breaking up my own Kubernetes ‘mini monolith’ into microservices: you can see both the before and after, together with my thoughts on it, below. A deployed, running example service is currently (August 2024) available here, although I don’t promise to keep it running there.

Debugging is like being a detective in a crime movie where you are also the murderer - @iamdevloper.

Background

I previously ported my own command-line Go app to a web app, containerised it, and developed a Helm chart to deploy it on Kubernetes with ArgoCD. The original app was intended to be cross-platform and to implement some features that aren’t normally considered in card-shuffling apps. There wasn’t that much to the container port, and so the article I wrote about it is fairly boring. I would have liked to host a running instance of the app myself to give convenient access to the result, but didn’t want to pay $5 ($6 with tax) each month to do so with AWS App Runner, or more for a cluster or EC2 host. I’ve recently been doing some work with serverless (AWS SAM) and thought that Lambda might be a cost-effective way to run this publicly.

Initial Design

I started with my containerised app because I had already added web pages and pictures to it, served as assets from the binary’s local file system, using Gin to serve three routes/APIs (sketched after the list below):

  • /options - the landing page and form where the user may choose options and proceed to either of the other two routes
  • /license - a static page with a copy of the AGPL software license
  • /draw - a web page with the illustrated results of the card draw
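
A minimal sketch of that routing, for orientation. The handler bodies are placeholders rather than the real code (which renders HTML templates and serves the card images alongside them), and the HTTP methods are inferred from the description above:

```go
package main

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// newRouter wires up the three routes listed above. The handler bodies
// here are placeholders; the real app renders HTML templates and serves
// the card images alongside them.
func newRouter() *gin.Engine {
	r := gin.Default()
	r.GET("/options", func(c *gin.Context) { // landing page / options form
		c.String(http.StatusOK, "options form goes here")
	})
	r.GET("/license", func(c *gin.Context) { // static AGPL license page
		c.String(http.StatusOK, "AGPL license text goes here")
	})
	r.POST("/draw", func(c *gin.Context) { // illustrated results of the draw
		c.String(http.StatusOK, "card draw results go here")
	})
	return r
}

func main() {
	// In the containerised version the engine just listens directly.
	newRouter().Run(":8080")
}
```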

My initial plan was to make the smallest change possible and turn all of this into a Lambda function rather than a Docker image. Go, as a custom runtime, is clearly not a first-class citizen for Lambda, certainly compared to e.g. Python. Go projects don’t normally need a Makefile, and Lambda doesn’t normally need a bootstrap file, but here the custom runtime needed both; anyway, once set up, this was (initially!) a relatively minor speed bump.
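
Roughly, that ‘least change’ wiring looks like the sketch below, reusing the newRouter engine from the earlier sketch and assuming the Gin adapter from awslabs/aws-lambda-go-api-proxy (this is an illustration of the shape, not the code verbatim):

```go
package main

import (
	"context"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	ginadapter "github.com/awslabs/aws-lambda-go-api-proxy/gin"
)

// ginLambda wraps the existing Gin engine (newRouter from the sketch
// above) so that API Gateway proxy events are translated into the
// http.Requests that Gin expects.
var ginLambda *ginadapter.GinLambda

func init() {
	ginLambda = ginadapter.New(newRouter())
}

// handler receives the (v1/REST-style) API Gateway proxy event and hands
// it to Gin via the adapter.
func handler(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	return ginLambda.ProxyWithContext(ctx, req)
}

func main() {
	// For the provided.al2 custom runtime this is compiled to a binary
	// named "bootstrap", typically via a Makefile target that SAM calls
	// during `sam build`.
	lambda.Start(handler)
}
```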

I got my deployment building, packaging, and deploying with the SAM CLI, and I could get to the first endpoint of my app. What I couldn’t do was get to the two subsequent pages (/draw and /license) via API Gateway. In each case I got {"message":"Forbidden"} and, on inspection, a 403 error in my browser’s network tools. I wasn’t seeing anything in CloudWatch logs, but I could access the resources without difficulty using curl. Initially, I wasted time investigating this as a CORS error, although in hindsight it should have been obvious that it wasn’t that. Unsurprisingly, using https://github.com/rs/cors - a dedicated module for handling CORS with Gin - didn’t help. Neither did hunting for detailed API Gateway settings that don’t exist when you are using an HTTP API on API Gateway v2…

The First Redesign

The issue turned out to be a good deal more subtle than expected. After many blind alleys chasing CORS phantoms, I wound up splitting my application into three separate ones, one for each function. This turned out to be much simpler to do than expected, and I would probably do it sooner in future because it made it significantly easier to isolate the problem. I was now able to get to /options and /license, but not /draw. After much investigation, it turned out that API Gateway was sending POST requests but these were being interpreted by the code in my Lambda as GET. After adding debug code to examine the request from inside the Lambda, it turned out that this was because the HTTPMethod field in the API Gateway request appeared to Gin to be empty. Further investigation showed that this, in turn, was because I was using API Gateway v2 with an HTTP API, which Gin doesn’t support, at least not without Lambda proxy…
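
The debug code amounted to decoding the raw event in both payload shapes; a sketch (not the exact code) of the kind of handler that shows the mismatch, with the method present in the v2 (HTTP API) payload but absent from the v1 field the Gin adapter reads:

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

// debugHandler takes the raw event so it can be decoded as both the v1
// (REST/proxy) and v2 (HTTP API) payload shapes for comparison.
func debugHandler(ctx context.Context, raw json.RawMessage) (events.APIGatewayV2HTTPResponse, error) {
	var v1 events.APIGatewayProxyRequest
	var v2 events.APIGatewayV2HTTPRequest
	_ = json.Unmarshal(raw, &v1)
	_ = json.Unmarshal(raw, &v2)

	// With an HTTP API (payload format 2.0) the v1 HTTPMethod field comes
	// back empty, while the v2 field carries the real method.
	log.Printf("v1 HTTPMethod=%q v2 Method=%q", v1.HTTPMethod, v2.RequestContext.HTTP.Method)

	return events.APIGatewayV2HTTPResponse{StatusCode: 200, Body: "debug"}, nil
}

func main() {
	lambda.Start(debugHandler)
}
```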

I now had to decide which part(s) of my /draw service to redesign: I could reconfigure for an (API Gateway formalised) REST API, or set up Lambda proxy, or look for an alternative to Gin. I chose the last, replacing Gin with Amazon’s own httpadapter, which was able to correctly recognise API Gateway v2 requests without other changes to the deployment. I suppose the second benefit of breaking into microservices here was that I only had to do this refactor for /draw, since the other APIs were now working.
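
A sketch of the replacement for /draw, assuming the httpadapter package from awslabs/aws-lambda-go-api-proxy (method names here are from my reading of that package’s V2 adapter) and a plain net/http mux standing in for the previous Gin handler:

```go
package main

import (
	"net/http"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/awslabs/aws-lambda-go-api-proxy/httpadapter"
)

func main() {
	// A plain net/http mux in place of the Gin engine for /draw.
	mux := http.NewServeMux()
	mux.HandleFunc("/draw", func(w http.ResponseWriter, r *http.Request) {
		// Placeholder: the real handler renders the illustrated draw results.
		w.Write([]byte("card draw results go here"))
	})

	// NewV2 speaks the API Gateway v2 (HTTP API) payload format directly,
	// so the POST to /draw arrives with its method intact.
	lambda.Start(httpadapter.NewV2(mux).ProxyWithContext)
}
```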

The Second Redesign

I was now able to reach all three endpoints, but image display was broken for /draw. I thought that the issue might be the Lambda binary trying to retrieve the images from the local file system, and so I moved to embedding them in the binary instead, but that didn’t help. I did a lot of debugging there to confirm that the image files were present, could be listed, and so on. Regardless, they still didn’t show in my browser.
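
For reference, the embedding uses Go’s standard embed package; a minimal sketch, with the directory and route names being illustrative rather than the real ones:

```go
package main

import (
	"embed"
	"io/fs"
	"net/http"
)

// Embed the card images into the binary at build time so the Lambda does
// not depend on files being present on the local file system.
// The directory name is illustrative.
//
//go:embed assets/images
var imageFiles embed.FS

// imageHandler serves the embedded images under /images/.
func imageHandler() http.Handler {
	sub, err := fs.Sub(imageFiles, "assets/images")
	if err != nil {
		panic(err)
	}
	return http.StripPrefix("/images/", http.FileServer(http.FS(sub)))
}

func main() {
	mux := http.NewServeMux()
	mux.Handle("/images/", imageHandler())
	// In the Lambda version this mux is wrapped by the adapter as above;
	// locally it can just listen directly.
	http.ListenAndServe(":8080", mux)
}
```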

The Third Redesign

I happened upon the suggestion that whilst the /draw page was being served in response to a POST, the images on it were still being retrieved by my browser with GET. Whilst this could be considered ‘obvious’, it was perhaps something of a Go issue, each Lambda function here being in effect its own standalone custom web server. The most immediate and obvious solution was to leave the bulk of the code unchanged but serve the images from S3/CloudFront instead of locally. This would reduce Lambda overhead, likely be a better fit with the ecosystem, and handle the browser’s separate GET requests for images without any fuss. I refactored to this, initially in a more complicated manner than was needed, before recalling modern CloudFront practice and reworking these components based on the Terraform I originally wrote for this very website… Once I moved from OriginAccessIdentity to OriginAccessControl, I had all three functions working. Good times!
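
The application-side change is then small: the /draw page just emits image URLs that point at the CloudFront distribution rather than serving the bytes itself. A minimal sketch, where the environment variable name and the template are illustrative stand-ins for the real code:

```go
package main

import (
	"html/template"
	"net/http"
	"os"
)

// drawPage is a cut-down stand-in for the real results template: the
// real page lists every drawn card, each with its own image.
var drawPage = template.Must(template.New("draw").Parse(
	`<html><body><img src="{{.ImageBase}}/{{.Card}}.png" alt="{{.Card}}"></body></html>`))

func drawHandler(w http.ResponseWriter, r *http.Request) {
	data := struct {
		ImageBase string
		Card      string
	}{
		// e.g. the CloudFront distribution domain, injected through the SAM
		// template as an environment variable (the variable name is illustrative).
		ImageBase: os.Getenv("IMAGE_BASE_URL"),
		Card:      "ace_of_spades", // placeholder; the real handler uses the draw result
	}
	if err := drawPage.Execute(w, data); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}

func main() {
	http.HandleFunc("/draw", drawHandler)
	http.ListenAndServe(":8080", nil)
}
```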

Finishing Touches?

After some small tweaks around graphic design (‘there’s a design?’, yes, I know) and display, and reducing logging chatter, the last part was to integrate a proper DNS entry and TLS certificate to make this service appear as part of my website. This wasn’t too much work to specify once I got the hang of the SAM way of handling it. What was difficult was that I wanted to include an alternative template.yaml in case some imaginary user wanted to deploy without a proper DNS entry. This consideration turned out to be extremely painful.

The Final Boss

I had decided to implement the code for my preassigned DNS entry as a feature branch off the ‘plain’ version of the code without this feature. The plain version ‘worked’, and so I wanted to keep it as a reference against any changes. It still worked, of course, as did the ‘proper DNS’ version. What didn’t work was every possible way of trying to combine the two. From my feature branch, the version with the Route 53 DNS entry deployed fine. For some reason, for the version without Route 53, the options binary appeared to be built locally, but the source files and assets were uploaded rather than the built binary and assets. The Makefile, code, assets, and the section of the template.yaml describing that Lambda were the same in both cases. I tried deleting the local .aws-sam directory, deleting and redeploying the stack, and even deleting the SAM CLI S3 bucket. The mismatch was consistent.

The issue turned out to be an apparent bug (I’m not 100% certain of this, or sure where to report it, since nobody else seems to have run into it). Unfortunately, as of August 2024, using the --template-file flag to specify a template named something other than the default template.yaml means that the source code and assets are uploaded rather than the built binaries and assets for the Lambdas. I don’t know why this is, or whether it affects other compiled-language Lambdas. Obviously, scripting-language Lambdas deployed as source to run on one of the managed runtimes, e.g. Python or JavaScript, wouldn’t be affected by this. My solution here was not to try to have switchable SAM templates 🫤

As a fallback, I tried to implement conditionally created resources in my SAM template. This is an approach that I’ve used many times with Terraform and that is documented for CloudFormation. Unfortunately, I wasn’t able to get conditional logic working in a combined template at either the resource or the resource-attribute level. Supposedly it is possible, but as with many things CloudFormation, the documentation and available examples are extremely sparse. I simply wasn’t prepared to spend hours or days on something that wasn’t even assured to be possible. In the ‘finished’ version of my project there are two template files, with advice in the README to rename them if need be before deployment.

Reflections

In the finished project, the original self-contained monolith is broken into multiple microservices: a separate Lambda for each of the three functions, an S3 bucket and CloudFront distribution for the image assets, and an API Gateway for the coordination. It’s now serverless, and the user experience is indistinguishable from the Docker/Kubernetes version. On to the hard questions:

How involved was it to break down a monolith into microservices?

As you can see from the above, the serverless port was a significant amount of work, especially for something that was essentially fairly trivial to begin with. I would be very circumspect about recommending this for anything in a real-world production organisation. Yes, it would probably be a lot simpler to port something already written in a scripting language, and arguably it would have been quicker and easier here to simply re-create the original functionality in one, starting with a serverless design. In any case, it might as well be a whole new app given the effort expended. The starting-point app was written in Go for good reasons, of course. Obviously getting to the end was something of a journey, and it should be possible to make a monolithic Lambda with separate entry points for each API/function, e.g. with Lambda proxy. I’m not sure that would be desirable from any perspective given where we are now, however. Splitting this specific app into microservices was only done in order to track down and resolve issues. It wasn’t the goal, and it wasn’t really the difficult part; the blind alleys and debugging were.

Was it worth it?

This is like the old quip to hobbyists that ‘Linux is only free if your time has no value’. My back-of-envelope calculations, not including data transfer, were $5/month to run the Docker version on App Runner or (assuming one million invocations each month) about 30 cents/month for serverless. Yes, that’s a great saving as a percentage, probably even greater since I don’t expect millions of invocations, but on a pretty small total. The developer time certainly would not be worth it in the real world.

Is serverless better than Kubernetes or other Container Solutions in this case?

(Note that I am not considering the general case.) Now that we have it, the serverless version is obviously more practical and economical for me to support for public consumption in my particular use case. Realistically, I can’t see this being a meaningful question in the real world, where there would be many services running across multiple environments.

Considerations for Future Projects and the Real World

It was several orders of magnitude quicker to get from the command-line desktop app to the dockerised web app and the accompanying packaged Kubernetes installation than to the serverless version. The Gin framework ‘just worked’ in the Docker-based settings, without any of the POST/GET complications with API Gateway. The tooling for Docker, Helm, ArgoCD, etc. is better documented and supported, and the iteration speed is vastly quicker than with SAM. The Helm/ArgoCD installation and uninstallation take a few seconds, whereas the SAM stack takes several minutes. Fundamental show-stopping bugs in documented behaviour simply don’t hang around that long in popular open-source tooling (either the behaviour or the documentation gets fixed fairly swiftly), but the backend of CloudFormation is a proprietary black box. These and other criticisms have been made by others much better than I could describe them. It’s also been pointed out by others that serverless expects high privileges to deploy, especially in development. By contrast, the Docker-based port can be deployed with far more limited privileges, scoped to the target host rather than the whole environment. Whilst I have highlighted Kubernetes, it is far from the only option for hosting containerised applications.

With containerised apps, you have the freedom to use any version of any language or library and can rely on the predictability of containerisation. With SAM, there is a limited number of ‘first-class’ languages and, whilst you can do your own thing with custom runtimes, you can also run into undocumented behaviour that can be tricky or impossible to remedy directly. I suppose I could have gone a step further and put my project into CDK; this would have eliminated the template.yaml issue with SAM, since the template would have been generated as part of each run. I don’t know how it might have affected the other issues.

As I have said, I don’t think there is a ‘general’ answer, but for any given scenario, based on my experience here, ‘which is better’ looks to depend on multiple factors:

  • Are you already using containerised applications/ container orchestration?
  • Are you already big on serverless?
  • Are you starting from scratch (or able to start over) or are you starting with something not ‘optimised’ for serverless?
  • Do you need to run across multiple clouds? Whilst Azure Functions supports Go as a custom runtime and GCP (unsurprisingly) supports Go natively for Cloud Functions, the SAM version is very obviously heavily tied to AWS. Whilst it would doubtless be possible to port it to other clouds, it would be a non-trivial rewrite to cover the different permission models, feature sets, etc. for serverless on them, in ways that simply wouldn’t affect e.g. a Kubernetes deployment.

My experience of the additional factors below may not be universal, but it would certainly colour my opinion in future:

  • Containerisation tooling (e.g. Docker, Kubernetes, Helm, ArgoCD) is much richer, more robust, more predictable, and better documented, and has better examples of code in use, than SAM/CloudFormation.
  • Container deployment can be managed with fewer and more obvious privileges in the wider environment than serverless.
  • There are many options for local workstation development for containerised apps. The serverless landscape doesn’t really match this.
  • Container iteration can be much quicker than serverless deployments.
  • Container development/deployment is more language/framework agnostic than serverless (only matters if you are not using an already-favoured language).
  • If you are using favoured languages and resources, then serverless can be quick to develop and deploy.
  • Container orchestration platforms like Kubernetes are resource-heavy to manage, hence the prevalence of platform teams in organisations that employ them.
  • For a standalone developer, serverless obviously requires far less overhead to deploy and run than e.g. setting up Kubernetes, a deployment path, etc., let alone to scale and monitor, but at the cost of a potentially tougher development and deployment journey. There are also other options such as App Runner, which I have not explored, or ECS, which I have not used for some time.
  • Once developed, serverless can potentially be run for a lower ‘standalone’ ongoing cost than a container-orchestrated version.