Failure Modes of S3 Bucket Listing

Reconsidering the S3 Bucket List

In an earlier post I described the different options for calling AWS’s S3 API to list buckets:

Single call
Multiple calls, paging through results
Single method call to asynchronous non-blocking abstraction that may involve multiple remote API calls

Now I will dig a little deeper into the pros and cons of each option, which may or may not lead me to completely contradict my earlier opinion.

Single call

For a lot of organisations the single call would be absolutely fine, as the total number of buckets will stay within the default quota of 10,000 general purpose buckets per account.

Multiple calls, paging through results

The AWS documentation has a warning box stating, “Unpaginated ListBuckets requests are only supported for AWS accounts set to the default general purpose bucket quota of 10,000.”

https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListBuckets.html

Given that AWS has millions of customers, even if only one percent have raised their quota above 10,000 buckets, that is still tens of thousands of customers.

Here I am proposing that the code accumulates every page of results and only returns once the whole listing has been consumed successfully.
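
As a rough sketch of that accumulate-then-return idea, here is how it might look in Python. It assumes a boto3-style client whose list_buckets accepts MaxBuckets and ContinuationToken and returns a ContinuationToken when more pages remain. The function name list_all_buckets and the page_size default are my own illustrative choices, not anything prescribed by AWS.

```python
from typing import Any, Dict, List


def list_all_buckets(client: Any, page_size: int = 1000) -> List[Dict[str, Any]]:
    """Page through the listing and return only once every page has been read.

    `client` is assumed to behave like a boto3 S3 client. If any call raises,
    the exception propagates and the partial results are discarded.
    """
    buckets: List[Dict[str, Any]] = []
    token = None
    while True:
        kwargs: Dict[str, Any] = {"MaxBuckets": page_size}
        if token:
            # Only send the token on follow-up requests.
            kwargs["ContinuationToken"] = token
        response = client.list_buckets(**kwargs)
        buckets.extend(response.get("Buckets", []))
        token = response.get("ContinuationToken")
        if not token:
            # No token in the response means this was the last page.
            return buckets
```

The caller either gets the complete listing or an exception, so there is never a half-filled list to reason about. The trade-off is that nothing can be processed until the final page has arrived.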

Single method, async pagination

Earlier on this would have been my preferred approach, but the possibility of a failure occurring partway through has put me off.

Failing early is okay(ish) as we don’t have to consider what to do with any partially processed results.

Failing at any other point introduces complexity, due to some limitations of the API.

No ordering guarantee
- When you call listBuckets there is no guarantee on the ordering of the results
Continuation is an internal implementation detail
- Although the underlying API involves use of a continuation token, if an error is encountered there is no expectation that we would be able to re-use a continuation token for retrying
No meaningful filtering criteria
- We can consider this as related to the lack of ordering: if we wanted to resume from a partially successful listing, we cannot send a request for only the buckets whose names would appear after a given name
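
To make the partial-failure problem concrete, here is a small sketch using a plain Python async generator as a stand-in for an SDK's asynchronous paginator. The names consume_pages and handle are hypothetical. The point is only that a failure on a later page leaves us holding buckets that have already been handled, with no reliable way to ask the API to carry on from where we stopped.

```python
import asyncio
from typing import AsyncIterator, Callable, List, Optional, Tuple


async def consume_pages(
    pages: AsyncIterator[List[str]],
    handle: Callable[[str], None],
) -> Tuple[List[str], Optional[Exception]]:
    """Handle bucket names as each page arrives.

    Returns the names handled so far plus the error (if any) that stopped
    the listing, so the caller has to decide what to do with a partial run.
    """
    handled: List[str] = []
    try:
        async for page in pages:
            for name in page:
                handle(name)
                handled.append(name)
    except Exception as error:  # a later page failed after earlier ones succeeded
        return handled, error
    return handled, None
```

A retry has to start the listing again from the beginning, so anything in the handled list will be seen a second time.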

If we had a situation that involved some expensive processing, or something requiring processing each bucket exactly once, then we would need something to be in place to keep track of the buckets that had already been processed.

Thinking aloud a bit here: for a situation involving some expensive processing of each bucket, we should include something in the system design to persist the status of that processing for each bucket. So I am not going to discard the asynchronous approach, particularly if we consider that listing buckets is likely to be a less expensive operation than listing the contents of individual buckets.
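
One minimal way to persist that status, purely as an illustration, is a small ledger table keyed by bucket name. The sketch below uses SQLite from the Python standard library. The names open_ledger, process_pending and the processed table are all hypothetical, and a real system would likely use whatever datastore it already has.

```python
import sqlite3
from typing import Callable, Iterable


def open_ledger(path: str) -> sqlite3.Connection:
    """Open (or create) a ledger recording which buckets are already done."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (bucket TEXT PRIMARY KEY)")
    conn.commit()
    return conn


def process_pending(
    conn: sqlite3.Connection,
    bucket_names: Iterable[str],
    process: Callable[[str], None],
) -> int:
    """Run `process` for each bucket not yet in the ledger; return how many ran."""
    handled = 0
    for name in bucket_names:
        done = conn.execute(
            "SELECT 1 FROM processed WHERE bucket = ?", (name,)
        ).fetchone()
        if done:
            continue  # finished in an earlier run, so skip it
        process(name)
        # Record success only after the work has completed.
        conn.execute("INSERT INTO processed (bucket) VALUES (?)", (name,))
        conn.commit()
        handled += 1
    return handled
```

Because the ledger is keyed by name, a rerun can start the listing from scratch in any order and still skip finished buckets. Note that a crash between process and the commit would cause that one bucket to be processed again, so this sketch gives at-least-once behaviour unless the processing itself is idempotent.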

It Depends

It wouldn’t be a true technology blog without this type of caveat.

Aside: Maybe “IT” isn’t short for “Information Technology” but is actually the “It” in “It depends”?

Back to the topic at hand: the way to go about listing S3 buckets should depend on the specific purpose. An internal admin portal listing resources for accounts that have a few dozen buckets each will be very different to a system reconfiguration workflow involving tens or hundreds of thousands of buckets.

You don’t have to be Einstein to work out the approach, but this quote attributed to him is still applicable: “Everything should be made as simple as possible, but not simpler”.