Implementing the ELK stack with microservice containers on AWS with Terraform
**25 minute reading time** (but the article is composed of short, numbered sections!)
1 - why is this article different to every other blog post on the ELK stack?
There are a lot of articles on Elastic Stack/ELK components out there. I found a LOT that were extremely basic, essentially school-project-level reiterations of the official elastic.co documentation, and a few that were very high level, essentially assuming that you already know everything and just want to 'talk shop'. I really struggled to find anything that covered a full use case in any detail without hand-waving over the fiddly bits. Whilst I do give some basic info here, I don't intend to reiterate official documentation, and I cover a full use case including some of the blind alleys and pitfalls. This article concentrates on the technical challenges and solutions; it is not intended to be an introduction or comprehensive guide to the Elastic Stack. Bear in mind that I was starting more or less from scratch: no previous production experience with AWS, negligible previous exposure to Docker, and no prior experience with Terraform or the Elastic Stack.
2 - The existing base, driver for change and brief
The existing services were deployed principally as microservices in Docker containers. Each container writes its logs to the systemd journal on the CoreOS host, from where a logging container sends the collated journal to Splunk.

Splunk is really good from an ease-of-use perspective, especially if you don't know exactly what your logging requirements are upfront. It is a paid service: you upload your data and then you can construct charts, queries etc. on the results. The queries are in a proprietary format but it isn't anything too hard. You buy a licence to upload so much data per day and you pay extra if you go over that. It's a bit like a mobile phone plan, although not a very flexible one. In our case we only had Splunk logging in production, not staging, and we sometimes went over the daily allowance. The idea was that we could reduce overheads with a (semi) bespoke solution that we could deploy to staging (where it would essentially be doing 'production' work), polish, and then potentially deploy to production as well. Not realised at the time was that Splunk indexes data on read, making queries extremely flexible but relatively slow, whereas Elasticsearch indexes on write, requiring more care at the indexing stage but giving far faster interaction at the query stage.

The idea for the logging section of the stack was to switch the logging output from Splunk to the Amazon Elasticsearch Service and visualise with Kibana using the AWS-provided endpoint. To fit in with the existing infrastructure the solution would need to be
- Implemented using Docker Container(s) on CoreOS via cloudconfig
- Deployed via Terraform
Refactoring an existing solution has its good points and bad. On the one hand, what you're trying to achieve is relatively clearly defined and obvious; on the other, you're not starting with a clean slate. I am also a fan of the principle of least change, i.e. not changing more than necessary. In my case I was fortunate that I could build on the existing base: I would not need to make any changes to existing running code or hosts to accommodate the changed logging output, and I could work alongside the existing stack without making any changes or interruption to it. **N.B.** I started with version 5.5/5.6 of the Elastic Stack but moved to 6.0.1 after it became available on AWS.
3 - a working stack
I had not previously deployed stuff with Docker but, even aside from the requirement to deploy this way, it turned out to be admirably suited. I had initially presumed that I would be able to download some sort of Logstash container, plug some variables into a config file and get going, but oh no. People sometimes describe ELK as 'build your own stack'. To me that sounds like Ikea furniture. I'd describe it as more like 'here are some instructions and some seeds. Over there is a hill with iron ore and a river with some clay…'

After some research and experimentation I decided that my best bet was to use a Journalbeat container to collect the systemd journal and pass it to an (official elastic.co) Logstash container, which would in turn pass it up to the AWS Elasticsearch service. I toyed with trying to get everything going in a single container, but the Journalbeat dev likes to use Debian and write in Go. He says _"The underlying system library go-systemd makes heavy usage of cgo and the final binary will be linked against all client libraries that are needed in order to interact with sd-journal. That means that the resulting binary is not really Linux distribution independent (which is kind of expected in a way)."_ The Logstash container that I settled on was the official one from elastic.co, which is built on CentOS 7… Additionally, I needed to build in the AWS Logstash output plugin, which is a Ruby gem… so yeah, two separate containers. Microservices FTW. Later on I tried again- see 'Analysis in staging'- but it was a waste of time (although I discovered that I did not need to build the gem myself); essentially this is a philosophical rather than practical concern, and the paired containers work fine.

I connected the two containers on the host using a private bridged Docker network and from there to an AWS Elasticsearch domain. I soon found that my console, when I connected to the Logstash container, was filling up with rubbish because of course elastic.co want you to be logging to their services and the official container was basically complaining that I wasn't. The easiest fix was to uninstall some of the plugins from the Docker image, but due to dependency issues I wound up having to remove three: x-pack, couchdb and elastic.co. Once I had this working I could work on my output configuration to match up with the endpoint. I think 'confusing' is a good way to describe an endpoint, because there is no protocol at the beginning of the URL. UPDATE: from version 6 it is possible to use the '-oss' Logstash container versions provided by elastic.co and no plugins need to be disabled or removed.

At this point I simply had systemd output from my dev host- in this case the Ubuntu desktop installation on my work machine- going to my Elasticsearch domain. Not very interesting, and beyond 'yay, it works' not much to be done with the data. I also had to learn how to implement a suitable AWS policy to whitelist the office CIDR (IP address) ranges for access control, and an IAM user for the Logstash output.
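For illustration only, a minimal pipeline for the Logstash worker along these lines might look like the sketch below; the endpoint, region and variable names are placeholders rather than my actual configuration.

```
input {
  beats {
    # Journalbeat ships to this port over the private Docker network
    port => 5044
  }
}

output {
  amazon_es {
    hosts                 => ["search-example-abc123.eu-west-1.es.amazonaws.com"]
    region                => "eu-west-1"
    # credentials for the IAM user created for the Logstash worker,
    # passed into the container as environment variables
    aws_access_key_id     => "${AWS_ACCESS_KEY_ID}"
    aws_secret_access_key => "${AWS_SECRET_ACCESS_KEY}"
  }
}
```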
4 - preparing the stack to be adaptable
The next step was to shift my static Docker configurations to variables so that the containers could be launched with environment variables set by command arguments. This would ensure that, for instance, the Logstash worker's Amazon output gem configuration could be set at runtime rather than build time, allowing the same image to be used across various AWS accounts. A bit fiddly (with lots of -e -e -e) because I was not using docker-compose or another container orchestration system, but nothing too terrible. If I had been more accustomed to Docker I could have done this from the beginning, and I would definitely recommend it as an approach ('immutable' guys nod heads in agreement).
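Purely to illustrate the approach (the image name, variable names and values are all invented), launching the worker then looks something like this:

```bash
# same image, different environment: everything account-specific arrives
# as -e variables at run time
docker run -d --name logstash-worker --net logging \
  -e ES_ENDPOINT="search-staging-abc123.eu-west-1.es.amazonaws.com" \
  -e AWS_REGION="eu-west-1" \
  -e AWS_ACCESS_KEY_ID="AKIA..." \
  -e AWS_SECRET_ACCESS_KEY="..." \
  example-registry/logstash-worker:latest
```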
5 - Autobuilding
In order to assure consistency, code portability, etc., my stack had to be assembled using Jenkins and from there pushed to a private Docker hub. In retrospect this was something of a red herring, or an artefact of the pre-existing landscape: the revisions I made to my logging containers from upstream were small enough not to need the Jenkins build and could have been handled with docker-compose, although I don't know how well this would have worked with the pre-existing CoreOS cloud-config setup. Again, I had not created Jenkins builds before but it was not _so_ arduous. The gotchas that I faced were all around the signed AWS output plugin (a Ruby gem)- I had to call installation of rbenv on the Jenkins slave container to build it- and learning how to archive the build output for incorporation into the Logstash container image. Soon enough though I had my stack autobuilding: a matched pair of functioning Docker images that could be run with arbitrary environment variables in order to suit different environments- dev, staging, production etc.- using the same image. UPDATE: when I moved to later versions of Logstash I found that I could no longer easily build and install the Ruby gem, BUT I found that I could simply do an online installation of the plugin, removing a build step and an artifact from my Jenkins pipeline.
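With the later versions, the whole customisation step reduces to something like the following (the exact image tag here is my assumption):

```dockerfile
# build on the upstream -oss image and install the AWS output plugin online,
# rather than baking in a locally built gem
FROM docker.elastic.co/logstash/logstash-oss:6.0.1
RUN logstash-plugin install logstash-output-amazon_es
```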
6 - creating a Terraform plan
Terraform is something else I had not worked with previously. I like it, but I found it _very_ terse with regard to documentation. Even after several weeks I found it easy to spend hours trying to get some small section of a plan syntactically correct and working, because the documentation is so brief that a lot of info is considered 'implicit' and (I guess because it is still so new) there are few examples from other users. Once a plan is working correctly it is a wonderful thing, but it is not fast to write with for a novice, certainly not compared to procedural tools like Ansible. Initially I found the variable interpolation a bit mind-boggling- clearly I am not alone in this- but I stuck with it to get something with some grace that could all work off an obvious set of variables. After I had finished the Mk I version of the plan I had to revise it to run as its own AWS user, but the final stack looks like this:

Terraform is obviously run with AWS environment variables set for a given environment (AWS account and region). Typically you set these using 'aws configure', but in my case I have a handy set of aliases that sets them as child shell process environment variables. Either way, using these with interpolation, the Terraform plan creates a user (for the Logstash worker) and collects an AWS key pair for them. Then the plan creates an Elasticsearch domain with a security policy permitting IAM access to that user, using the AWS account code pulled back from the AWS provider. The three Docker elements are then called with arguments. The Logstash worker is provided with the AWS account keys, Elasticsearch endpoint etc. using cloud-config, i.e. systemd unit files for CoreOS- essentially arriving at the container as shell launch commands. These are interpolated from the user data template in the Terraform plan, which in turn interpolates them from the resources created earlier. Interpolaception.

The big win for me at this point was the interpolation for the Elasticsearch policy: using *aws_elasticsearch_domain_policy* I was able to have Terraform create the AWS user based on the stack name, call keys for them, create an Elasticsearch domain, create a security policy for that domain to permit access for this user (as well as a series of whitelisted IP address ranges) and pass the user credentials and endpoint details down to the final container running on the CoreOS host (which is also created dynamically). Most comparable examples online describe manually creating the user, getting keys, etc, etc. The really big advantage of having Terraform apply the policy this way, rather than old-schooling a blob of JSON into the plan, is that as well as permitting interpolation, Terraform can actually compare the proposed policy with the one in place and recognise if they match- no more waiting ages for AWS to recreate a policy that's the same as the one you already had when there are no changes (although changes do still take forever). We now have the glimmerings of an automatically built and deployed reproducible working stack. I won't kid, it took me a while, and I wasn't entirely alone, but we got there!
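To give a flavour of the shape of the plan, here is a heavily cut-down sketch. Resource names, CIDRs, versions and file names are invented for illustration; the real plan naturally had rather more in it.

```hcl
variable "stack_name"   { default = "logging-staging" }
variable "office_cidrs" { default = ["203.0.113.0/24"] }   # whitelisted ranges

# IAM user and keys for the Logstash worker, named after the stack
resource "aws_iam_user" "logstash" {
  name = "${var.stack_name}-logstash"
}

resource "aws_iam_access_key" "logstash" {
  user = "${aws_iam_user.logstash.name}"
}

resource "aws_elasticsearch_domain" "logging" {
  domain_name           = "${var.stack_name}"
  elasticsearch_version = "6.0"
}

# the policy is a resource in its own right, so Terraform can diff it
resource "aws_elasticsearch_domain_policy" "logging" {
  domain_name = "${aws_elasticsearch_domain.logging.domain_name}"

  access_policies = <<POLICY
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "${aws_iam_user.logstash.arn}" },
      "Action": "es:*",
      "Resource": "${aws_elasticsearch_domain.logging.arn}/*"
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "es:*",
      "Resource": "${aws_elasticsearch_domain.logging.arn}/*",
      "Condition": { "IpAddress": { "aws:SourceIp": ${jsonencode(var.office_cidrs)} } }
    }
  ]
}
POLICY
}

# user data for the CoreOS host: endpoint and credentials are interpolated in
data "template_file" "cloud_config" {
  template = "${file("${path.module}/cloud-config.yaml.tpl")}"

  vars {
    es_endpoint           = "${aws_elasticsearch_domain.logging.endpoint}"
    aws_access_key_id     = "${aws_iam_access_key.logstash.id}"
    aws_secret_access_key = "${aws_iam_access_key.logstash.secret}"
  }
}
```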
7 - preparing to deploy to staging
I now had to split my plan into two parts and amend with env files to:
- Create a single ES domain in staging but permit many installations of the docker stack logging to it
- Set conditional variables to only create a CoreOS host in dev, not in staging

The key trick was to use Terraform remote state to dump the variables from creating the domain, policy and user into an S3 bucket and then pick them up with a second plan. This approach gave the required modularity and integration with pre-existing Terraform plans.
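In outline (the bucket, key and output names are invented), the first plan exports what the second one needs and the second reads it back via the terraform_remote_state data source:

```hcl
# plan 1: expose the values other plans will need
output "es_endpoint" {
  value = "${aws_elasticsearch_domain.logging.endpoint}"
}

# plan 2: read them back out of the shared state in S3
data "terraform_remote_state" "logging" {
  backend = "s3"

  config {
    bucket = "example-terraform-state"
    key    = "staging/logging.tfstate"
    region = "eu-west-1"
  }
}

# ...and interpolate, e.g. "${data.terraform_remote_state.logging.es_endpoint}"
```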
8 - Deploying to Staging
This went really well and my ES domain stayed up for, oh, about 11 hours before it died, overwhelmed with data. I hadn't considered proper clustering and had just gone with the 'm3.medium' with 8GB SSD, as I had in dev, and deploying to a few hosts in staging put far greater demand on the ES domain. Researching what I needed was 'tricky' because
- There aren't any straight and simple answers to be had- lots of tooth-sucking and chin-stroking around write speed, read speed and other stuff that seems broadly reminiscent of people discussing the best CFLAGS for a Gentoo laptop installation back when that was a thing… moving on…
- Most advice and documentation presumes that you are managing your own installation on your own server which again doesn’t help.
- There are extremely valid reasons for clustering with an (odd) number of master nodes and some data nodes but this is well described elsewhere.
In the end I decided to go with a cluster of three master nodes and five data nodes, and wound up (top suggestion from my colleague) applying the changes in the AWS console and then iterating the Terraform plan until it proposed no changes. This was complicated because I still wanted the single ES instance for dev, in the same plan, just with a different env file exporting different values. That was a bit of a pain with a few wrong turns, since Terraform doesn't like passing booleans through variable strings, but a few 'blah_instances_number=0' later and we got there (the sketch after this paragraph shows the shape of the trick). I went with EBS magnetic storage with the idea that it would have capacity for a month's data. A key concept here is that you don't 'delete' data from ES as such; rather you mark it as deleted. If you want to reclaim the space then you have to combine one or more indices containing deleted data into a new index, leaving the deleted data behind, and then delete the old index.
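With invented variable and resource names, the trick is essentially:

```hcl
variable "dev_host_count" {
  # dev.tfvars sets this to "1"; the staging env file leaves it at "0"
  default = "0"
}

variable "coreos_ami" {
  default = "ami-00000000"   # placeholder
}

# created in dev, skipped entirely in staging
resource "aws_instance" "dev_coreos_host" {
  count         = "${var.dev_host_count}"
  ami           = "${var.coreos_ami}"
  instance_type = "t2.medium"
}
```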
9 - analysis in staging
This was my first chance to look at the real log material in Kibana. I really struggled with the visualisations at first, but I found the introductory videos on Logstash and Kibana to be valuable (and no, I do not normally care for video tutorials). It soon became evident that although I could do a lot with Kibana, I wasn't going to be able to present much meaningful analysis without structuring the data. Obviously we were used to the Splunk approach, but ELK is the opposite in this key respect: Splunk indexes on read, ELK indexes on write. This means that ELK is really fast and interactive compared to Splunk, but you _have_ to structure your data somehow. This was not only a key revelation for me- it came from an elastic.co staffer- but also where I encountered some really deep pitfalls. The Elasticsearch field auto-suggestions, whilst useful, are NOT enough.

During this analysis period I tried to get the logstash-input-journald (gem) plugin going, on the presumption that it might be easier to structure my data with just a single 'piece' of software. I soon ruled out this approach because it isn't simpler, or even relevant, and because it's actually quite hard to reliably mount and access a host systemd journal. Did you know that systemd has optional features for the journal that may or may not be compiled in, including lz4? Neither did I. Or that even an Ubuntu Docker container may not be able to read its host journal if the host has lz4 and the container doesn't? Seriously, when trying to get to the bottom of this brings you to a post from Lennart Poettering on GitHub saying (in 2015)
I cannot verify @scampi’s files, since they use LZ4 compression, which I cannot build on Fedora. (Given that we never supported that officially, we should probably just kill support for it in systemd, actually. @keszybz, what’s the latest on LZ4 in Fedora with sane ABI?
you can see that something is screwy. I was not going to try and recompile systemd so I went back to my two original containers which work.
10 - stumbling around in a sea of incorrect documentation
ELK does a great job of identifying fields in JSON documents (as supplied by Logstash), typing them, and letting you query and chart the results. In my case, however, all of these fields were metadata, with the actual meat in a journal message string. This is what I would have to extract in order to structure and analyse my data.

ELK has had some major changes in a relatively short time and this appears to be continuing. If you want to script fields into your data then there are a number of methods available. You can script if you like, but you have to pick a language. It would seem that at one time Java was favoured, then Python in a subsequent version, then Groovy, but now all of these are deprecated in favour of Lucene expressions, which only handle numeric operations, so now the new one is 'Painless'. Yeah. It's a bit like Google's instant messenger Android apps, TBH. The generational thing isn't entirely obvious either, so you could be reading a blog entry from 18 months ago and then find that nothing in it is recommended any more and it misses the modern options that came subsequently. Still worse if the piece is years old. Even worse if you are trying to understand and evaluate several possible alternatives. When I realised that Lucene and Painless scripting weren't really relevant for a hosted Kibana service (they are pitched at Elastic apps) I had to find something else.

Elastic.co do try with their documentation, but their editors don't do a great job of prettying up the dev documentation from the Git repos, so paragraphs can be broken or jumbled. In all cases there is an over-reliance on canned examples like processing syslog requests (over and over), which is incredibly unhelpful because there is a lot of built-in magic just for processing syslog, so it isn't a generic example and doesn't help with the vanilla stuff. (I also think the focus on syslog is a bit ridiculous considering that new Linux deployments will no longer be using native syslog logging as an originator.) I really think that this is the number one thing elastic.co could improve: yes, they have clear versioning in their online documentation, but in many cases it is by no means clear whether a feature has been deprecated or superseded. An overall feature timeline would be extremely useful, as would some real-world worked examples! Seriously, Hashicorp manage examples for every single (official) Terraform feature. Yes, it is terse and unforgiving, but it is extremely clear and unambiguous.
11 - Enter Grok
This is the unsung hero. I had been concentrating on my input (Journalbeat) and my endpoints and had not looked much at Logstash itself. The grok filter was just what was needed to process the logs: a nice, stable, well-understood and well-documented regex model with online debugging tools. I still did not find it easy, or obvious, but I was able to get a couple of patterns sorted for what I needed. Initially I was using custom patterns where I could have used built-in ones, but I got there and refined the result. The patterns aren't the same as traditional regexes and you can't use awk, cut and friends, but they are OK, and once you understand what the built-in patterns are they're great.
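To give a flavour (the pattern and field names here are invented rather than the ones I actually used), a grok filter built from the standard patterns looks like this:

```
filter {
  grok {
    # pull structured fields out of the raw journal message string
    match => {
      "message" => "%{IPORHOST:client_ip} %{WORD:http_method} %{URIPATHPARAM:request} %{NUMBER:response_code:int} %{NUMBER:bytes:int}"
    }
  }
}
```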
12 - Logstash Templates
The thing that I got really hung up on for a while was trying to get two IP addresses geolocated. I could get one, or the other, but not both. Eventually I found that this was down to the template in Logstash, which I had not previously been aware of. To take a high-level view: in order to analyse your data it has to be structured, and you can structure it in Elasticsearch and/or in Logstash. I was working only with the Logstash config, which was great- I was able to get my custom fields picked out in JSON and represented as fields when I reviewed them in Kibana. Fantastic!

The issue was with GeoIP. I could see that the Logstash GeoIP library was working, because I was getting a whole load of geographic fields populated by the parser; I just wasn't getting the 'GeoIP' field that I needed. Doubly frustrating was that when I looked at the raw JSON in Kibana, it was clearly there. The subtle point is that the 'GeoIP' key takes an *array* as a value, yeah… …and so it wasn't being indexed, even when I changed my (custom) name to 'GeoIP'. It turns out that I needed a matching index template. Now, I could have gone on a deep dive again, figuring out where the template file is in my Logstash container, how to effectively edit it and build that into my deployment image, or… what I actually did was change the output index string- which I'd left as the Amazon gem's example default, 'production_logs-%{+YYYY.MM.dd}'- to 'logstash-%{+YYYY.MM.dd}'. Immediately the new stack went live I was using the default output template (with GeoIP) and the default index pattern in Kibana, and everything worked much more smoothly. Yes, there may be occasions when you want two or more independent sets of indices for your domain, or a non-default base naming scheme, but really, who is going to be looking at that from a data analytics point of view, or even care? If you are deploying in the cloud then you almost certainly want separate domains for each set of data anyway. At this point I deployed the new configuration to staging, looking forward to some structured and relevant data to build with.
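A rough sketch of the sort of Logstash configuration this ends up as; the field names and variables are placeholders:

```
filter {
  geoip {
    # client_ip was extracted earlier by the grok filter
    source => "client_ip"
    target => "geoip"
  }
}

output {
  amazon_es {
    hosts  => ["${ES_ENDPOINT}"]
    region => "${AWS_REGION}"
    # matching the default logstash-* index template means the geoip
    # mapping is applied without any template surgery
    index  => "logstash-%{+YYYY.MM.dd}"
  }
}
```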
13 - remapping and reindexing
And here, at a point I am sure any Elasticsearch veteran will be familiar with, I found that I had a field mapping conflict: whilst grappling with Logstash grok filters I had initially used custom mappings for some fields, and these were now text when I needed numbers. I recreated the filter with standard Logstash patterns and identified the fields in question as 'number:int'. To be doubly sure, I proactively uploaded an index template with this mapping for the next day. It took, the new documents followed the mapping and I was delighted- until the next day, when I found that the field mapping had changed yet again, to 'number:long'. This was disappointing, as I now had a field mapping conflict with three different datatypes. I decided that since Elasticsearch wanted to default to 'long' for a number field I would live with that, since it was not so terribly important, despite the official guidance advising the smallest type that suits the purpose (and actually a 'short' would have been sufficient). The official guidance also says that, unless explicitly told otherwise, Elasticsearch will decide for itself what the most suitable mapping type is, regardless of what is requested by the log shipper… yeah…

After a time spent learning how to use the web console/curl commands to download and upload mapping templates and to create indices, I decided that the easiest option was to create 'b' versions of the indices with undesired mappings for the fields in question, upload copies of the current, automatically created and working index template, and then copy across my historic data. With Elasticsearch a reindex means reading all of your data out of the old index and into a new one. My reindex took about 24 hours for 30 GB of data. This was a flag for production use: if we are expecting up to 30 GB/day of incoming data then it is essentially not going to be possible to reindex with the same size of cluster.
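The 'b' index dance itself is just a reindex call (the index names below are examples) made with curl from a whitelisted address:

```bash
curl -s -X POST "https://${ES_ENDPOINT}/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{
        "source": { "index": "logstash-2018.01.15"  },
        "dest":   { "index": "logstash-2018.01.15b" }
      }'
```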
14 - The Kibana Dashboard
Once I had my indices and field mappings in order I set about creating a Kibana dashboard. Initially this was frustrating and I struggled to understand how to create worthwhile charts; the elastic.co introductory video was helpful. Obviously everyone will want their own dashboard based on their own circumstances. In my case I found the Timelion and time series charts ultimately not very useful beyond slavishly recreating the pre-existing Splunk dashboards. The tile server for the GeoIP map is a bit of a gotcha- I wound up using ows.mundialis.de because elastic.co (quite understandably) don't make their map tile server available for the AWS offering.
15 - adding User agent
Buoyed up by finally having a dashboard to show, and by being able to install Logstash plugins online, I iterated my stack to include the useragent plugin. This worked well with the field that I had already pulled out with grok filters, and I found it useful to have charts of both the raw data and the plugin-processed fields- a lot of client user agents for my use case are embedded systems and not templated in the plugin.
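The filter itself is tiny; 'agent_string' below stands in for whatever field your grok pattern produces:

```
filter {
  useragent {
    source => "agent_string"   # field previously extracted by grok
    target => "useragent"
  }
}
```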
16 - new Major release
Around this time, Amazon moved to version 6 of the Elastic Stack. I decided that it was probably better to refactor for this ahead of deployment to production, but there were a number of breaking changes as well as a number of pluses. The significant issues were around getting Elasticsearch templated for index creation. Short version: I dumped an index mapping from the version 5 domain and massaged it until I got something acceptable out of the new one.
17 - Access and privilege control - reviewing the options
N.B. In April 2018 Amazon introduced Amazon Cognito authentication for Kibana, so some of the following must be considered dated, if not obsolete.

This is one of the big selling points of the licensed version of the Elastic Stack: smoothly integrated and tightly controlled access and privilege control via X-Pack, which can integrate with LDAP, OAuth, Active Directory, etc. This isn't available with Amazon's version, and frankly Amazon handwave over it, IMO. All of the canned examples talk about using IP address whitelisting, which is only *access control*. Yes, there are IAM controls, which are fine for *access*, but they aren't helpful within Kibana. Whilst you can set IAM privileges on the root Elasticsearch domain by attaching policies to the IAM users to whom access is permitted, these don't really constrain Kibana, even though it runs under an IAM identity- the 'dev tools' console remains available as the equivalent of an SQL root prompt on the domain. Only the AWS ES IAM privileges
- es:ESHttpGet
- es:ESHttpHead
- es:ESHttpPost
- es:ESHttpPut
really affect Kibana, and if you disable them then you don't get a working Kibana. All of the documentation that I found around non-X-Pack controls was based around configuring Kibana itself, which of course you can't do with the AWS service. Unfortunate. Around this time I attended an Elastic Stack meetup, spoke with some of the team and was encouraged to explore their AWS-hosted licensed service; I subsequently spent an hour online with one of the team exploring the options. Whilst the solution offered would undoubtedly have ticked all the boxes, the cost was *very* high- essentially *double* what we were paying for Splunk. Since cost was a key driver for commencing the project, this completely ruled that option out. So how to proceed? I could not find a ready solution described anywhere. It seemed implicit that once people start using the stack 'properly' in an organisation of size, they are expected to sign up for the licensed option.
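For reference, the data-plane policy in question- attached to the IAM identity that signed Kibana requests run under- looks something like this (the account ID and domain name are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpGet",
        "es:ESHttpHead",
        "es:ESHttpPost",
        "es:ESHttpPut"
      ],
      "Resource": "arn:aws:es:eu-west-1:123456789012:domain/example-logging/*"
    }
  ]
}
```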
18 - privilege and access control implemented
This part involved a great deal of trial and error, but eventually a solution came into being. I implemented it with docker-compose, with the design based on using upstream containers and run-time variables as much as possible. I had been specifically asked to ensure the read-side solution could be portable, even to a bare-metal datacentre. I decided to run my cluster on a dedicated EC2 instance initially, with the idea of maintaining the flexibility to move to Nomad, ECS or some other scheduling service in principle. Starting from the back end:

- **rlister/aws-es-kibana** - a signing proxy that allows access to the AWS Elasticsearch and Kibana endpoints under an IAM user identity- simply 'es-maintenance-worker' in our case.
- **docker.elastic.co/kibana/kibana-oss** - a Kibana instance running (unmodified) in a container. This allows admin access via the signing proxy, with a 'virtual host' flag set for the nginx proxy.
- Another Kibana instance running with console.enabled=false, otherwise the same as above. This is our 'read only' Kibana, with access via the same signing proxy. It would be nice to also disable the 'management' tab, but this is not an option without X-Pack. We now have privilege control.
- **jwilder/nginx-proxy** - a dynamic Docker proxy using docker-gen. This automatically finds the Kibana instances and allows us to address them as virtual hosts with simple auth. We now have our own user access authorisation.

We already had some user access set up elsewhere for some nginx proxies using Consul keys as the source of authority, so I was asked to leverage that. To do this I added to the stack:

- **hashicorp/consul** - a Consul client instance to run on the ES cluster host and join the pre-existing Consul cluster.
- **hashicorp/consul-template** - a consul-template instance to take the relevant keys from the local Consul agent and bang them out to nginx as the relevant htpasswd files.

The two Consul containers are simply off-the-peg Hashicorp containers; consul-template and nginx mount a common volume. We now have effective privilege separation and login authorisation. Since the AWS side is all IAM based, we don't need to worry about any IP whitelisting or where our monitoring cluster runs: as long as it has HTTP access to the Elasticsearch domain and the Consul cluster it will work.
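Below is an indicative docker-compose sketch of the read-side stack. Image tags, ports, host names and variable names are from memory and should be checked against the upstream READMEs rather than treated as authoritative.

```yaml
version: "3"
services:
  es-proxy:
    image: rlister/aws-es-kibana
    command: "${ES_ENDPOINT}"            # signs requests as the maintenance IAM user
    environment:
      - AWS_ACCESS_KEY_ID=${MAINTENANCE_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${MAINTENANCE_SECRET}

  kibana-admin:
    image: docker.elastic.co/kibana/kibana-oss:6.0.1
    environment:
      - ELASTICSEARCH_URL=http://es-proxy:9200
      - VIRTUAL_HOST=kibana-admin.example.internal   # picked up by nginx-proxy

  kibana-readonly:
    image: docker.elastic.co/kibana/kibana-oss:6.0.1
    environment:
      - ELASTICSEARCH_URL=http://es-proxy:9200
      - CONSOLE_ENABLED=false                        # maps to console.enabled
      - VIRTUAL_HOST=kibana.example.internal

  nginx-proxy:
    image: jwilder/nginx-proxy
    ports:
      - "443:443"
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock:ro
      - htpasswd:/etc/nginx/htpasswd:ro

  consul:
    image: hashicorp/consul
    command: agent -retry-join=consul.example.internal

  consul-template:
    image: hashicorp/consul-template
    # template configuration omitted for brevity; writes htpasswd files
    volumes:
      - htpasswd:/etc/nginx/htpasswd

volumes:
  htpasswd:
```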
19 - Deployment and the joys of cloud-init
Initially I attempted to get this going on CoreOS using systemd unit files via cloud-init, as with the writing part of the stack. I struggled for a while without much success, so I decided to go back to docker-compose, with all of the relevant data fed in via cloud-init using Terraform. Fairly early on I learned that on a general-purpose distro there are many sections available in cloud-init: packages, systemd units, users, etc., etc. A little later I learned that these were just a frustration, because the order in which the different sections run is fixed and cannot be changed; essentially it's difficult to park a binary or script and then call it, e.g. as a systemd unit. The solution that I wound up with was to:
- Use a standard Ubuntu AMI
- Use the 'packages' section to install docker and docker-io
- Use 'write_files' to write out my docker-compose files, a systemd unit and a run-once bash script
- Have the bash script do the heavy lifting with:
  - An upstream install of docker-compose (I wanted version 3 features); a callback to the AWS magic URL to get the private IP address for Consul to advertise, written out as a .env file variable for docker-compose to interpolate
  - Enabling and then starting my custom docker-compose unit
By having everything as plain text in my cloud-init user data I was able to have Terraform interpolate the relevant variables, and by having 'my' parts called from a run-once script I was able to have the actions executed in my chosen order. The completed host will survive a reboot but is essentially ephemeral and stateless: no persistent data needs to be stored on the host and everything is managed by run-time configuration using commodity components- there is no need for custom containers, Jenkins jobs or private Docker registries.
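Stripped right down, the user data looks something like this sketch (paths, unit names, package names and the docker-compose version are placeholders of mine; Terraform has already interpolated its values by the time the instance sees it):

```yaml
#cloud-config
packages:
  - docker.io          # Docker engine on Ubuntu

write_files:
  - path: /opt/monitoring/docker-compose.yml
    content: |
      # ...the compose file shown earlier...
  - path: /etc/systemd/system/monitoring.service
    content: |
      [Unit]
      Description=Read-side monitoring stack
      Requires=docker.service
      After=docker.service
      [Service]
      WorkingDirectory=/opt/monitoring
      ExecStart=/usr/local/bin/docker-compose up
      ExecStop=/usr/local/bin/docker-compose down
      [Install]
      WantedBy=multi-user.target
  - path: /opt/monitoring/firstboot.sh
    permissions: "0755"
    content: |
      #!/bin/bash
      # install a recent docker-compose, discover the private IP for Consul
      # to advertise, then enable and start the stack in the order we choose
      curl -sL -o /usr/local/bin/docker-compose \
        "https://github.com/docker/compose/releases/download/1.21.2/docker-compose-$(uname -s)-$(uname -m)"
      chmod +x /usr/local/bin/docker-compose
      echo "PRIVATE_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)" \
        > /opt/monitoring/.env
      systemctl daemon-reload
      systemctl enable monitoring.service
      systemctl start monitoring.service

runcmd:
  - /opt/monitoring/firstboot.sh
```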
20 - cycling indices
One of the issues with the free version of Elasticsearch is managing the data you have on hand: you can only keep so many live indices and eventually you have to delete the old ones. To some extent this is made easier because by default Logstash creates a new index each day. AWS again makes it a little harder because you cannot install the 'curator' tool that elastic.co would normally recommend for this. I wound up adding a container to my monitoring stack- 'cronworker'- which calls a bash script as a cron job to delete indices more than $days old (a sketch follows the list below). Basically it's a non-interactive version of this script, and we can use ordinary curl here via the signing proxy container. Technically I *could* have run this cron job on the Ubuntu host itself directly, *but* this would have:
- Reduced flexibility to move to other schedulers and scheduling services in future
- Made the stack less ‘self contained’
- Stopped everything being in a single location on the host with respect to Docker Compose
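The cleanup script boils down to something like the following sketch (endpoint and variable names are invented; the date comparison works lexically because of the YYYY.MM.dd index suffix):

```bash
#!/bin/bash
# delete daily logstash-* indices older than $DAYS, via the signing proxy
set -euo pipefail

DAYS="${DAYS:-30}"
ES="http://es-proxy:9200"
cutoff="$(date -d "-${DAYS} days" +%Y.%m.%d)"

for index in $(curl -s "${ES}/_cat/indices/logstash-*?h=index"); do
  suffix="${index#logstash-}"
  if [[ "${suffix}" < "${cutoff}" ]]; then
    echo "deleting ${index}"
    curl -s -X DELETE "${ES}/${index}"
  fi
done
```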
21 - kibana dashboards
Obviously the downside of running our own Kibana instances in containers is that we have to manage our own dashboards. For staging Mk I, I simply set up dashboards for each user based around a sample set but, as I later realised, these were all using the same '.kibana' index. Armed with this knowledge I could now give the 'read-only' users a separate index, specified as a run-time argument to that Kibana instance. Although I might still have to contend with user error deleting visualisations or breaking dashboards, this index can relatively easily be backed up and restored with curl commands using the cronworker as needed.
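Concretely, and with invented names: the read-only Kibana gets its own saved-objects index via a run-time variable, and that index can be snapshotted with a plain reindex call from the cronworker.

```bash
# KIBANA_INDEX maps to the kibana.index setting in the official image
docker run -d \
  -e ELASTICSEARCH_URL=http://es-proxy:9200 \
  -e KIBANA_INDEX=.kibana-readonly \
  docker.elastic.co/kibana/kibana-oss:6.0.1

# snapshot the saved objects into a backup index via the signing proxy
curl -s -X POST "http://es-proxy:9200/_reindex" \
  -H 'Content-Type: application/json' \
  -d '{ "source": { "index": ".kibana-readonly" },
        "dest":   { "index": ".kibana-readonly-backup" } }'
```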
22 - Conclusion
Elasticsearch has a great deal to offer, but it is a massive topic and rapidly evolving. It requires significant resource and resolve to set up and take advantage of if you want to run it yourself. Despite what a casual glance might suggest, the AWS managed service doesn't really make it that much easier to manage. It's fine for something quick and dirty, but anything for a team of any size will need access control, and in my case I wound up managing everything *but* the Elasticsearch domain itself with upstream components. Even though I managed to do all of this in a stateless and 'serverless' way, if I were doing it again I would consider implementing the domain on EC2, since the nominal advantage of not needing to manage the cluster hardware is quite likely outweighed by the loss of proper configuration of the domain and of things like Curator and privilege management. At times I found my journey extremely frustrating, but I was pleased that the end result worked well and I came to really appreciate the tools I was working with. Definitely a fan of Terraform and Docker!