Cloud Idle

There seems to be a buzz going around that the cloud computing hyperscalers may be starting to run short of resources for their customers’ planned workloads. That may or may not be isolated to particular specialised hardware such as GPUs due to the intensity of interest in AI, but it got me contemplating potential consultancy opportunities…

The Sweet Spot Between Nanoservices and Monoliths

Most of the companies that I have worked for during my career have adopted some form of microservices-based architecture for processing and serving data for their applications.

A lot of established companies had gone through the old approach of having a single large back end application where all changes are deployed at once. That generally involved a release schedule with someone having to sign off on the changes, and / or having to run scripts to update the database. When I was that person, I remember explaining to my colleague “No Charlie, there’s no way we could script this”.

Accumulation of waste

From my experience, once a few teams have been successfully building microservices to support a company’s business for over 18 months, there will inevitably be some components that no longer deserve to be kept around.

I have receipts. Here are some real-world examples where someone new coming in would raise an eyebrow at seeing the waste:

1000 Kafka topics when 1 would have been sufficient.
Backfill worker nodes that sat idle for at least a year.
Integration tests that spun up and triggered alarms in CloudWatch.
- This delayed deploys by at least 10 minutes on each run.

Changing Compute Model

When the type of waste involved is idle compute resources, such as EC2 instances that have nothing to process for hours or days on end, it may make sense to consider adjusting the codebase and deployment configuration to run on serverless resources such as AWS Lambda or Azure Functions. Whether that works will depend on the nature of the processing involved: stateless operations that do not have low latency requirements and have a limited overall duration should be a reasonable fit.
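To give a feel for what "stateless" looks like in practice, here is a minimal sketch of a Python function in the handler(event, context) shape that AWS Lambda expects. It assumes the work arrives as SQS messages, each carrying a JSON document in its "body" field. The process_record function and the "id" / "amounts" payload fields are hypothetical stand-ins for whatever the idle worker was doing.

```python
import json


def process_record(payload: dict) -> dict:
    """Hypothetical unit of work: everything it needs arrives in the payload,
    so nothing has to be kept in memory between invocations."""
    return {"id": payload["id"], "total": sum(payload["amounts"])}


def handler(event, context):
    """Lambda-style entry point.

    Assumes an SQS trigger, where the event holds a "Records" list and
    each record has a JSON string in its "body" field.
    """
    results = [
        process_record(json.loads(record["body"]))
        for record in event.get("Records", [])
    ]
    return {"processed": len(results), "results": results}
```

Because the handler holds no state of its own, the platform can run zero copies of it while the queue is empty. Keep in mind that Lambda caps a single invocation at 15 minutes, which is where the "limited overall duration" condition comes from.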

Merging Services

Provided that the security and compliance requirements are compatible, sometimes it makes sense to combine the responsibilities of existing services together. This can have flow-on benefits by reducing the number of code repositories to be considered for situations such as urgent dependency upgrades and deploys.

The reduction of distinct units involved in a full system deployment can also be appealing when considering the deployment footprint per region or per isolated environment, as that could reduce the costs for higher availability and disaster recovery.

Rightsizing Resources

As the ancient expression goes, “Change is the only constant”.

Typically, when a new service is provisioned and ramped up to take on 100% of traffic, developers will be optimistic about how much demand to expect and prefer to over-provision the resources available to handle it. Over time, interest may wane from customers and the development team, so it is worthwhile to periodically check back and re-evaluate how much resource should be allocated.

As an example, if a DynamoDB table was set up as part of the provisioning for a service then it may have been configured with some high level of allowance for read capacity, which has an associated cost. By checking CloudWatch metrics and other indications of the actual demands being put on the system, we can make an educated guess about how the ongoing demand is going to look and adjust the configuration accordingly.
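As a rough sketch of that educated guess, the function below turns a series of consumed read capacity figures into a suggested provisioned value. It assumes the inputs are per-period Sum values of the ConsumedReadCapacityUnits metric, which need dividing by the period length to get units per second. The function name, the 25% headroom and the sample numbers are my own illustrative assumptions, not a recommendation from AWS.

```python
import math


def suggest_read_capacity(consumed_sums, period_seconds, headroom=1.25, minimum=1):
    """Suggest a provisioned read capacity from observed consumption.

    consumed_sums:  per-period sums of consumed read capacity units
    period_seconds: length of each period in seconds
    headroom:       multiplier applied on top of the busiest period (assumed value)
    """
    if not consumed_sums:
        return minimum
    # Convert the busiest period's sum into an average rate per second.
    peak_per_second = max(consumed_sums) / period_seconds
    return max(minimum, math.ceil(peak_per_second * headroom))


# Three 5-minute periods; the busiest one averaged 10 units per second.
print(suggest_read_capacity([1200, 3000, 600], period_seconds=300))  # 13
```

Since each figure is an average over its period, short bursts inside a period will be smoothed out, so it is worth comparing the suggestion against any throttling that has been recorded before lowering the setting.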

Rebalancing Resource Allocation

As a counter-point to the main theme of this post, we may find that resources are being placed under higher demand than was previously expected. If we take the DynamoDB read capacity example from earlier, we could treat high levels of reads as a signal that we should introduce caching to reduce the load.

The potential approaches to caching deserve their own series of blog posts, so you’ll have to excuse me for not going too deeply into the details here. With that being said, as a potential teaser, two distinct setups that could be applied are:

Caching in memory on each compute instance
- Pro: No additional cloud resources to pay for
- Con: Duplication of data stored on each instance
- Con: Cache empties for each re-deploy
- Con: Potential serving of stale representation when record update / deletion occurs
Use of Redis as a fast access store for cached data
- Pro: Single centralised cache, so straight-forward to flush old state on delete / update
- Pro: Supports time to live for automatic flushing and refreshing
- Pro: Cached state remains independent of service re-deploys
- Con: Additional cloud resource to pay for and configure
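To make the first option a little more concrete, here is a minimal sketch of a per-instance in-memory cache with a time to live, using only the Python standard library. The TTLCache class and its method names are hypothetical, and the loader argument stands in for whatever call would otherwise hit the database.

```python
import time


class TTLCache:
    """Tiny per-process cache: entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop an entry, for example after this instance updates or deletes the record."""
        self._entries.pop(key, None)
```

The cons listed above still show up here: invalidate only clears the copy held by the instance that called it, and the whole cache disappears when the process restarts. A shared store such as Redis moves the same expiry idea into a single place.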

Summary

Our dependence on large cloud computing providers may be about to hit some limits.

Now is a good time to consider making our systems more efficient in their demands on compute and storage resources.

The software lifecycle doesn’t end when a feature has been delivered. Ongoing maintenance should be an opportunity to save costs without compromising on quality.

NB: This is not a deep dive. It just scratches the surface of some areas that I am familiar with from my decade or so of developing microservices and deploying them in the cloud.

(PS: No apologies to anyone who came here expecting a radical new talent contest following the American Idol concept).