What have I done lately? – Part 3: Cloud Infra War Games

Context

I was new to the development team, so I picked up a story from the current sprint that would not need much context for me to work on it.

An EC2 instance had gotten into an unusable state after a batch job triggered high memory and CPU utilization. The resulting minor incident took an hour or two to sort out, so it produced some follow-up actions to make sure it could be addressed sooner if it happened again.

There was no alerting in place, so the objective was to set up alarms that would catch the problem earlier if it happened again.

Finding the Appropriate Thresholds

CloudWatch metrics gave us some indication of the typical peak levels for the service, so it seemed reasonable to expect that thresholds set above those peaks would be suitable, and then I could move on to another story.
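As a rough illustration of that "set it above the peaks" reasoning, here is a minimal Python sketch. The function name, the 10% headroom and the sample peak values are all made up for the example; they are not the numbers we actually used.

```python
def suggest_threshold(peaks, headroom=0.10, ceiling=100.0):
    """Return a threshold a fixed margin above the highest observed peak.

    peaks    -- observed peak values for a percentage metric (hypothetical data)
    headroom -- fractional margin above the highest peak (assumed value)
    ceiling  -- upper bound, since a percentage metric cannot exceed 100
    """
    if not peaks:
        raise ValueError("need at least one observed peak")
    return min(max(peaks) * (1 + headroom), ceiling)


# Hypothetical daily CPU peaks (percent), as might be read off a dashboard.
cpu_peaks = [62.0, 70.0, 68.5, 66.0]
cpu_threshold = suggest_threshold(cpu_peaks)
print(f"suggested CPU threshold: {cpu_threshold:.1f}%")
```

A threshold chosen this way is only as good as the metric it is compared against, which is where things went wrong for me.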

Staging versus production

Everything seemed fine in staging, but there wasn’t much processing happening in the staging environment, so the instance was never anywhere near the thresholds.

Once the change reached production we started to get notifications far more frequently than I had expected. To save the person on call from false alarms, I decided to roll back the change. This was a situation where the configuration was external to our service code, so it wasn’t something that we could control with a feature flag, as we would for any normal change.

It turned out that I just needed to add one more line of config to the service descriptor so that the alarm setup would correctly differentiate between the specific EC2 instances.
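Our service descriptor format is internal, so I can't show the actual line. As a sketch of the general idea, assuming the fix amounts to scoping each alarm to a single instance, this Python builds one alarm definition per instance using the standard CloudWatch alarm fields and the `InstanceId` dimension. The function name, instance IDs and timing values are hypothetical.

```python
def cpu_alarm_definitions(instance_ids, threshold,
                          period_seconds=300, evaluation_periods=3):
    """Build one CPU alarm definition per EC2 instance.

    The keys follow the CloudWatch PutMetricAlarm parameter names. The
    Dimensions entry is what ties each alarm to a single instance.
    The period and evaluation counts are assumed values for illustration.
    """
    return [
        {
            "AlarmName": f"high-cpu-{instance_id}",
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Average",
            "Period": period_seconds,
            "EvaluationPeriods": evaluation_periods,
            "Threshold": threshold,
            "ComparisonOperator": "GreaterThanThreshold",
        }
        for instance_id in instance_ids
    ]


# Hypothetical instance IDs; real ones would come from the deployment.
alarms = cpu_alarm_definitions(["i-aaa111", "i-bbb222"], threshold=77.0)
for alarm in alarms:
    print(alarm["AlarmName"], alarm["Dimensions"])
```

Each dictionary could, in principle, be passed as keyword arguments to boto3's CloudWatch `put_metric_alarm` call, though in our case the equivalent was generated from the service descriptor rather than written by hand.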

Don’t roll back in anger

(Obligatory pun on a song title)

It’s not ideal when one of the first things that you work on gets to production and then doesn’t work the way that you expected. It would be a bad look if we got to production again and had to roll back again.

So, to boost my own confidence and reassure the team, I put some extra effort into demonstrating in a non-production environment that the alerting would trigger appropriately.

The War Games System

By adding a couple of lines to the IaC service descriptor, I enabled the instances to be deployed in non-prod environments with a companion process that can be called upon to simulate particular types of fault and additional load.

Running a war game involved logging into a separate system and specifying some scheduling criteria along with the type of disruption that was wanted. As I hadn’t played with that for a while, I made sure that I tried it out a couple of times before demoing it back to the team (full disclosure, I still got a bit lost during the demo, as the interface was not particularly intuitive).

What did we learn?

I learned to pay a bit more attention to the configuration options when dealing with Infrastructure as Code. Although this setup was a bit different to other alert thresholds, the differentiator was still relevant.

The team learned about the existence of the war game system and how to set it up – though this was only appropriate for about a third of the services that we owned and maintained, as we needed to provision some resources outside of the typical service setup for “reasons”.

I also learned that it is not a good idea to make a recording of a process on the first attempt (if there’s a bloopers reel in Confluence, I reckon my Loom recording deserves a spot).