Stepping Into Serverless

I recently had the opportunity to work on a project archiving data from a third-party service programmatically and making it available on a new platform. This was a change from the platform-focused work I’ve tended towards, but I agreed to look into it and see what I could come up with. I was pleasantly surprised by the results, using the Serverless Application Model (SAM) and AWS Lambda. This article discusses that project and a (pocket-sized) equivalent available on my GitHub, demonstrating a similar approach to a similar challenge: archiving health data with easy retrieval. It’s one thing to talk about solutions; it’s another to demonstrate them!

The Original Project

The situation was that a third-party system hosted a large number of records related to a SaaS product. The org had since moved to a different supplier and wanted these records archived to AWS. The requirement was not only to retrieve these records but to make them searchable/accessible by semi-technical people. The simple solution of carrying the data out on physical storage was not an option, but an API was available. Apparently someone had been asked to tackle this in the past but had struggled with ‘timeouts’. There was no code or documentation from any previous efforts available to me, nor any other technical people with any familiarity with the situation.

Research

I started by locating the correct API and documentation - not as straightforward as it might sound, due to previous commercial acquisitions - and attempting to query it with some desktop scripts. I established that:

  • There were a large number of records
  • I would have to enumerate the records to get basic information
  • I would need a second pass to get the more detailed data pertaining to each entry
  • There were various restrictions to work within, including:
    • The range of records which could be enumerated via any single request
    • A limited lifetime for the bearer token issued at sign in for a request

At this point I had a script to handle the enumeration with a given range and another to populate a given record.
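The windowed enumeration can be sketched as follows. This is an illustrative reconstruction, not the vendor's actual API: `record_windows`, the limit values and the sequential id scheme are all assumptions standing in for the real restrictions.

```python
def record_windows(total_records: int, max_range: int):
    """Yield (start, end) id pairs covering 1..total_records inclusive,
    each window no wider than the API's per-request limit."""
    start = 1
    while start <= total_records:
        end = min(start + max_range - 1, total_records)
        yield (start, end)
        start = end + 1
```

Each window then becomes one enumeration request, with the detailed second pass run per record afterwards. For example, `list(record_windows(10, 4))` gives `[(1, 4), (5, 8), (9, 10)]`.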

The Design Phase

I had the idea of some version of my desktop scripts being orchestrated to achieve this with retries, parallelism, etc. For each ‘script’ I would need some way to ensure that it picked up where the previous run or batch left off, without repeating previous work, and put the results somewhere accessible.

Whatever I came up with, I expected that I would be building it myself and could not count on any subsequent maintenance. Nor was there a team available to me providing any sort of in-house hosting or deployment service. This led me to think of a serverless solution. AWS Lambda can be thought of as a script-hosting service, and of course has good support for managing retries, parallelism, etc. SAM has ‘easy’ integrations for building, packaging and deploying Lambdas as part of a complete stack.

The Demo Project

At this point I shall move to referencing principally the demo project. It’s easier to talk about in detail and the relevance to the original project should be obvious.

I decided to query NHS website content. This material is available via an API that is well documented and can be signed up for without manual review or payment. There’s also the example of the NHS’s own website displaying the same material, so it’s easy to see what things should look like! I decided to query ‘medicines’.

Research

Again I started with some desktop scripts - and continued from there. Again it wasn’t immediately obvious which API I should be querying out of the multiple available. For example, the ‘Sandbox’ API is very quick to query because it doesn’t require authentication and has only a tiny selection of mocked data. On the other hand, it’s not really representative because it doesn’t require authentication and has only a tiny selection of mocked data…

The Integration API, which I wound up using, requires a significantly more complex authentication process than my original project. It is also heavily rate-limited, and there are only (at the time of writing) 274 medicines listed. In both cases the bearer token issued on authentication has a short lifetime- for the NHS API this is 5 minutes.

I got to the point where I had two scripts: one which could set up the authentication material, and one which could authenticate and then download a list of all medicines - essentially the ‘enumerate’ step - with an exponential backoff to handle rate limiting.
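The backoff wrapper can be sketched like this. It is a minimal illustration, not the script's actual code: the `(status, body)` return shape and the delay constants are assumptions.

```python
import random
import time

def with_backoff(call, max_attempts=6, base_delay=1.0, retryable=(429,)):
    """Retry `call` with exponential backoff and jitter while it signals
    rate limiting. `call` returns (status_code, body); a status code in
    `retryable` triggers a wait of base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in retryable:
            return status, body
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError(f"still rate limited after {max_attempts} attempts")
```

Jitter matters even for a single client: it avoids retrying on exactly the boundary of the rate-limit window.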

The Design Phase

Here I had a clear idea of what I wanted to do in AWS:

  1. Create a Lambda to automate setting up the authentication material: essentially creating an RSA key pair and matching JSON Web Key Set (‘JWKS’), putting the public key and JWKS into Parameter Store and the private key into Secrets Manager.
  2. Create a second Lambda to use the issued API key and accessory secrets from the first to enumerate the medicines and write the results to DynamoDB.
  3. Use a Step Function to orchestrate a third Lambda to populate further data for each item in the table using the same secrets.
  4. Deploy everything using SAM and Python. For the original project the org had been very keen on using CloudFormation and was a Python shop, so this was a ‘safe’ choice at the Layer 8 level. It turned out to be well suited for this use case, so I did the same here.

The key differences from the original project here were that:

  • With the NHS API:
    • Due to the small(ish) data set I could enumerate all the data in the lifetime of a single bearer token/Lambda. This is fine because we can demonstrate state management and orchestration with the ‘populate’ Lambda
    • The rate limiting meant that I would not need to worry about any parallelism but I would need to handle waits/retries
  • In the original project:
    • No accessory secrets were needed beyond an API key and so the authentication lambda was not needed
    • There was already an assigned UUID for each entry
    • Some of the data would be put in DynamoDB and some in S3
    • In order to reduce API calls and accommodate parallelism I implemented a caching mechanism (to DynamoDB) for the JWT bearer token.
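That caching idea can be sketched as below. This is an in-memory stand-in for illustration only: the original project persisted the token to DynamoDB, and the class name, TTL and margin values here are assumptions.

```python
import time

class TokenCache:
    """Reuse a bearer token until shortly before expiry, so parallel
    workers don't each hit the auth endpoint for every request."""

    def __init__(self, fetch_token, ttl_seconds=300, safety_margin=30, clock=time.time):
        self._fetch = fetch_token      # callable returning a fresh token
        self._ttl = ttl_seconds        # token lifetime granted at sign-in
        self._margin = safety_margin   # refresh this long before expiry
        self._clock = clock            # injectable for testing
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._margin:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token
```

The safety margin avoids handing out a token that expires mid-request; a DynamoDB-backed version would do the same check against a stored expiry timestamp, with a conditional write to avoid racing refreshes.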

Implementation

The final implementation is available in the repo so I will only cover the narrative here. In summary:

  • There is a single SAM deployment
  • The first lambda is run manually one time only to set up accessory secrets which need to be manually registered with the NHS API
  • The second lambda is run manually, again one time only, to enumerate the DynamoDB table
  • A Step Function state machine is run manually to populate all items in the DynamoDB table with the chosen additional field, calling the third Lambda to process entries in batches:
```mermaid
stateDiagram-v2
    [*] --> FetchAdditionalField: Start
    FetchAdditionalField --> CheckMoreItems: Fetch additional field
    CheckMoreItems --> WaitBeforeNextFetch: More items?
    WaitBeforeNextFetch --> FetchAdditionalField: Wait 1 second
    CheckMoreItems --> EndState: No more items
    EndState --> [*]: Succeed
```
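In Amazon States Language the same loop might look roughly like this. This is a hand-written sketch rather than the deployed definition; the Lambda ARN placeholder and the `moreItems` output field are assumptions.

```json
{
  "StartAt": "FetchAdditionalField",
  "States": {
    "FetchAdditionalField": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:FetchAdditionalField",
      "Next": "CheckMoreItems"
    },
    "CheckMoreItems": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.moreItems", "BooleanEquals": true, "Next": "WaitBeforeNextFetch"}
      ],
      "Default": "EndState"
    },
    "WaitBeforeNextFetch": {
      "Type": "Wait",
      "Seconds": 1,
      "Next": "FetchAdditionalField"
    },
    "EndState": {"Type": "Succeed"}
  }
}
```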

Translating the GetAuth function to Lambda was not especially difficult - RSA key pairs are pretty well understood and the NHS developer portal validates the JWKS on upload. I had more trouble with the enumerate (ListAllMedicines) function. It was difficult to see what the Lambda was doing, and this bug in SAM made it difficult for me to update my logger output level. Eventually I got there and was able to validate the authentication, resolving things like secrets being fetched as JSON when they should be a string, or being indexed under the wrong key… It was then fairly simple to unpack and index the correct fields into DynamoDB.

I now had an equivalent in AWS to my desktop code, from which to move forward with populating my data in DynamoDB. There isn’t a provided UUID field, so I decided to use the (medicine) URL as the partition key and the name as the sort key. This allows using the scan operation to find items without the additional field and get the data in batches, rather than having to use the query operation to get a single item. That enables state management and orchestration with the ‘populate’ function. In the original project I had been able to use a provided item UUID, so this approach aligned with that and with the constraints and requirements of the NHS API.

I hadn’t felt it made much sense to develop a desktop version of the next part - the state machine and the ‘populate’ (FetchAdditionalField) Lambda. Here I decided that I would scan the DynamoDB table on each iteration, get 25 unprocessed rows and process them, orchestrated by a Step Function before the next iteration. As described above, due to rate limiting and bearer-token lifetime, I didn’t implement any auth caching or parallelism here as I had previously. This also meant it wasn’t worth doing duration-based invocations, e.g. ‘get what you can in 12 minutes and then stop new work’, hence the batch-based approach.
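The per-iteration selection logic can be sketched in plain Python. In AWS this would be a DynamoDB scan with a filter expression; here a list of dicts stands in, and the function name and `moreItems`-style flag are illustrative.

```python
BATCH_SIZE = 25

def next_unprocessed_batch(items, field="additional", batch_size=BATCH_SIZE):
    """Return up to `batch_size` items that don't yet have the populated
    field, plus a flag telling the state machine whether another
    iteration is needed."""
    pending = [item for item in items if field not in item]
    batch = pending[:batch_size]
    more_items = len(pending) > batch_size
    return batch, more_items
```

Because processed rows gain the field, each scan naturally excludes earlier batches - this is the ‘pick up where the previous run left off’ property, with no separate progress marker to maintain.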

Reflection

Ultimately I am pleased with this step into Serverless. I’m sure it’s very old news for many, and yes there are some sharp edges, but I found it an efficient way to get my project up and running without requiring additional infra support for deployment or ongoing maintenance. I really liked the implied IAM policy relationships and the all-in-one build for my lambdas that SAM supported.