What guarantees are you making?

Sam Schillace

Published Jan 27, 2023

One of the fun things (really) about being around large services is reading postmortems. It’s always fun to play amateur detective and dig through what happened to take a service down. It’s a little like true crime, but with software.

One of the patterns that comes up a lot in these failures is that some aspect of the system was being depended on as though it were a guarantee, but it wasn’t really a guarantee - and it changed. For example, the documentation might say a specific routing behavior is random, but in practice it is random in the same way every time (so, not really random), until something changes the order (it’s random now!). Sometimes this happens when a new replica or route is added to a system, or there is a partition, or even just a deployment. There can also be side effects that are silently depended on - a read might usually not cause a commit or a cache flush, but then suddenly start doing so and cause load or latency issues. Sometimes the guarantees are about latency - code will assume that an RPC returns quickly even in the error case, which it does, until something on the other end is slow rather than down, or the cache is broken and missing all the time, and the call takes 30 seconds instead of 3ms.
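To make the latency case concrete, here is a minimal sketch in Python of turning an implicit "this call is fast" assumption into an explicit deadline. The names `call_with_deadline`, `DeadlineExceeded`, `fast_lookup` and `slow_lookup` are hypothetical and only for illustration; a real RPC library would normally have its own timeout option, which you should prefer.

```python
import concurrent.futures
import time


class DeadlineExceeded(Exception):
    """Raised when a dependency does not answer within the stated deadline."""


def call_with_deadline(fn, deadline_s, *args, **kwargs):
    """Run fn(*args, **kwargs), but give up waiting after deadline_s seconds.

    The deadline is the explicit version of "this call returns quickly".
    Note: the slow call is abandoned, not cancelled; it keeps running in
    its worker thread until it finishes on its own.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        raise DeadlineExceeded(f"no answer within {deadline_s}s") from None
    finally:
        # Do not block here waiting for a slow dependency to finish.
        executor.shutdown(wait=False)


def fast_lookup(key):
    # Hypothetical stand-in for a healthy dependency.
    return f"value-for-{key}"


def slow_lookup(key):
    # Hypothetical stand-in for a dependency that is slow rather than down.
    time.sleep(0.5)
    return f"value-for-{key}"
```

With this in place, `call_with_deadline(slow_lookup, 0.05, "a")` raises `DeadlineExceeded` instead of quietly stalling the caller, so the slow-but-not-down case becomes an error you have to handle on purpose.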

As both the author of a service or component and the consumer of one, it’s important to be as explicit as possible about these guarantees (or, as the consumer, the dependencies). This explicitness should take the form of code whenever possible - unit tests, assertions on parameters or return values, fuzzing to make sure both sides can tolerate as many edge cases as possible, and always, always documentation. It’s hard to get all of them but it’s important to try.
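As a rough illustration of what "explicit in code" can look like, here is a small Python sketch for the routing example. The function `pick_replica` and its test class are hypothetical, not taken from any real system. The docstring states what is and isn't promised, the checks enforce it at the boundary, and the tests pin it down so a later change can't quietly break it.

```python
import random
import unittest


def pick_replica(replicas, rng):
    """Return one replica, chosen at random.

    Guarantee: the result is always a member of `replicas`.
    Not guaranteed: any particular order or repetition of choices.
    """
    if not replicas:
        raise ValueError("need at least one replica")
    choice = rng.choice(replicas)
    # Assertion on the return value: the one thing callers may rely on.
    assert choice in replicas
    return choice


class PickReplicaContract(unittest.TestCase):
    def test_result_is_a_known_replica(self):
        rng = random.Random(1)
        replicas = ["a", "b", "c"]
        for _ in range(100):
            self.assertIn(pick_replica(replicas, rng), replicas)

    def test_every_replica_is_eventually_chosen(self):
        # Catches "random, but always the same one" implementations.
        rng = random.Random(0)
        seen = {pick_replica(["a", "b", "c"], rng) for _ in range(200)}
        self.assertEqual(seen, {"a", "b", "c"})

    def test_empty_list_is_rejected(self):
        with self.assertRaises(ValueError):
            pick_replica([], random.Random(0))
```

Saved as a module, these tests can be run with `python -m unittest`. The second test is the interesting one: it encodes the "actually random" promise that the documentation only described in words.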

There’s kind of a parallel to people, too. For example, one of the bad patterns in organizations is “hero culture”, where someone is constantly overstepping their job bounds to “save the day”. A sign of this is statements like “X wouldn’t run without person P”. In this case, there’s a guarantee/dependency pair that’s implicit, and false: that the person is both critical and failure-proof.

And guess what? When the personal guarantee fails, just like in the code case, you’ll get a failure in the system - the build doesn’t go out, the package doesn’t get signed, the bugs don’t get resolved, etc. The cure isn’t quite the same as in the code case, but it’s similar - instead of unit tests, you have well-documented processes. Instead of fuzzing, you rotate duties so that no one person is a single point of failure. Instead of documentation…actually you have documentation, where you’re clear about roles and tools.

Guarantees (and their partners, dependencies) are important both in organizations and APIs - we can’t live without them. But implicit ones are dangerous and to be avoided - it’s much healthier for everyone to know what the guarantees are, so they can design around them (and, hopefully, not break them).

So - what are the hidden guarantees in what you’re doing, or coding, and how can you make them see the light of day?
