What guarantees are you making?

One of the fun things (really) about being around large services is reading postmortems. It’s always fun to play amateur detective and dig through what happened to take a service down. It’s a little like true crime, but with software.

One of the patterns that comes up a lot in these failures is that some aspect of the system was being depended on as though it were a guarantee, but it wasn’t really a guarantee - and it changed. For example, the documentation might say a specific routing behavior is random, but in practice it is random in the same way every time (so, not really random), until something changes the underlying order (it’s random now!). Sometimes this happens when a new replica or route is added to a system, or there is a partition, or even just a deployment. There can also be side effects that are silently depended on: a read might usually not cause a commit or a cache flush, but then suddenly start doing so and cause load or latency issues. Sometimes the guarantees are about latency: code will assume that an RPC returns quickly even in the error case - which it does, until something on the other end is slow rather than down, or the cache is broken and missing every time, and the call takes 30 seconds instead of 3ms.
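To make the latency case concrete, here is a minimal sketch in Python of turning "this call returns quickly" from an assumption into something the caller enforces. The names `call_with_deadline`, `DeadlineExceeded`, `slow_lookup` and `fast_lookup` are made up for illustration, and the sleep stands in for a dependency that is slow rather than down. A real RPC library would normally have its own deadline option, which would be the better choice.

```python
import concurrent.futures
import time


class DeadlineExceeded(Exception):
    """Raised when the dependency does not answer within the stated budget."""


def call_with_deadline(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread and stop waiting after timeout_s seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        raise DeadlineExceeded(f"no answer within {timeout_s}s") from None
    finally:
        # Do not block on the slow call; the worker thread finishes on its own.
        pool.shutdown(wait=False)


def slow_lookup(key):
    # Hypothetical stand-in for a dependency that is slow rather than down.
    time.sleep(0.5)
    return f"value-for-{key}"


def fast_lookup(key):
    # Hypothetical stand-in for the usual, fast case.
    return f"value-for-{key}"
```

With this in place, the caller's budget is written down in the code: `call_with_deadline(slow_lookup, 0.05, "a")` raises `DeadlineExceeded` instead of quietly waiting. Note that the slow call itself is not cancelled here, the caller just stops waiting for it.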

Whether you’re the author of a service or component or the consumer of one, it’s important to be as explicit as possible about these guarantees (or, as the consumer, about your dependencies). This explicitness should take the form of code whenever possible - unit tests, assertions on parameters or return values, fuzzing to make sure both sides can tolerate as many edge cases as possible - and always, always documentation. It’s hard to get all of them, but it’s important to try.
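As a rough illustration of what "explicit in code" can look like, here is a small Python sketch for the routing example. The function `pick_replica` and its test are hypothetical, not taken from any real system. The docstring states what is guaranteed and what is not, the assertions check the parameters and the return value, and the test feeds in many different orderings so that nobody can quietly come to depend on one.

```python
import random


def pick_replica(replicas, rng=random):
    """Return one replica.

    Guarantee: the result is a member of replicas.
    Non-guarantee: which member. Callers must not rely on any order.
    """
    assert len(replicas) > 0, "caller must pass at least one replica"
    choice = rng.choice(list(replicas))
    assert choice in replicas, "result must come from the input"
    return choice


def test_pick_replica_contract():
    replicas = ["a", "b", "c"]
    for seed in range(100):
        rng = random.Random(seed)
        shuffled = replicas[:]
        rng.shuffle(shuffled)  # vary the input order on purpose
        assert pick_replica(shuffled, rng) in replicas
```

The test deliberately never checks which replica comes back, only that the stated guarantee holds for every ordering it tries.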

There’s kind of a parallel to people, too. For example, one of the bad patterns in organizations is “hero culture”, where someone is constantly overstepping the bounds of their job to “save the day”. A sign of this is statements like “this X wouldn’t run without person P”. In this case, there’s a guarantee/dependency pair that’s implicit, and false: that the person is both critical and failure-proof.

And guess what? When the personal guarantee fails, just like in the code case, you’ll get a failure in the system - the build doesn’t go out, the package doesn’t get signed, the bugs don’t get resolved, etc. The cure isn’t quite the same as in the code case, but it’s similar - instead of unit tests, you have well-documented processes. Instead of fuzzing, you rotate duties so that no one person is a single point of failure. Instead of documentation…actually you have documentation, where you’re clear about roles and tools.

Guarantees (and their partners, dependencies) are important both in organizations and APIs - we can’t live without them. But implicit ones are dangerous and to be avoided - it’s much healthier for everyone to know what the guarantees are, so they can design around them (and, hopefully, not break them).

So - what are the hidden guarantees in what you’re doing, or coding, and how can you make them see the light of day?
