All Else Equal

In The Three-Body Problem, Liu Cixin describes how an alien species drives scientists to suicide by making it impossible for them to produce consistent experimental results. Some might find it difficult to relate to the scientists’ existential despair, but I found the premise compelling and chilling.

In this post, I do not tackle anything so sinister or abstract. Rather, I challenge a key assumption of A/B testing — namely, that all else is equal. I hope to inspire curiosity and reflection rather than existential despair.

A/B Testing: A Simple Example

A/B testing is the most popular method for online experimentation. It compares two versions of an application to determine which one performs better. Typically, one is the “treatment” we are considering as a change and the other is a “control” that represents the current state of the application.

For example, consider a simple A/B test to determine whether increasing the page size for search results from 10 (control) to 20 (treatment) leads to an increased conversion rate. This is about as simple an A/B test as it gets.
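To make the example concrete, here is a minimal sketch of how such a test might be evaluated, assuming a pooled two-proportion z-test. The post does not say which test is used, so this is only one common choice. The function name and the counts are hypothetical, chosen purely for illustration.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors in the control (page size 10).
    conv_b / n_b: conversions and visitors in the treatment (page size 20).
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts, not real data.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=570, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Note that nothing in this calculation refers to when the data was collected or who the searchers were; the test simply takes those conditions as given.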

Let us imagine that the test is successful, delivering a statistically significant increase in the conversion rate. What does this tell us?

The World is Not So Simple

The answer may seem obvious: doubling the page size increases the conversion rate. More precisely, this result holds only if all else is equal, since the change in page size might interact with other changes, such as a new page design. However, we have to be even more pedantic: the result holds only given the current state of the world.

Consider the factors of screen size and network latency, both of which are determined by the searchers’ devices and locations. Both of these factors interact with page size to affect the experience. Increasing the page size may increase conversion in one set of conditions but decrease it in others.
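The following sketch illustrates this point with made-up numbers. It assumes two hypothetical segments of searchers and shows how a positive overall lift can hide a negative lift in one segment, and how the overall result could flip if the traffic mix shifts. The segment names, counts, and the `projected_lift` helper are all assumptions for illustration.

```python
LARGE = "large screen, fast network"
SMALL = "small screen, slow network"

# Hypothetical per-segment results: (conversions, visitors) for each arm.
segments = {
    LARGE: {"control": (300, 5000), "treatment": (380, 5000)},
    SMALL: {"control": (200, 5000), "treatment": (190, 5000)},
}

def rate(conversions, visitors):
    return conversions / visitors

def relative_lift(control, treatment):
    """Relative change in conversion rate from control to treatment."""
    return rate(*treatment) / rate(*control) - 1

def pooled(segments, arm):
    """Sum conversions and visitors for one arm across all segments."""
    conversions = sum(s[arm][0] for s in segments.values())
    visitors = sum(s[arm][1] for s in segments.values())
    return conversions, visitors

def projected_lift(segments, mix):
    """Overall lift if traffic were split across segments according to `mix`,
    assuming the per-segment conversion rates stay the same."""
    control = sum(mix[name] * rate(*s["control"]) for name, s in segments.items())
    treatment = sum(mix[name] * rate(*s["treatment"]) for name, s in segments.items())
    return treatment / control - 1

overall = relative_lift(pooled(segments, "control"), pooled(segments, "treatment"))
per_segment = {
    name: relative_lift(s["control"], s["treatment"]) for name, s in segments.items()
}

print(f"overall lift: {overall:+.1%}")
for name, lift in per_segment.items():
    print(f"{name}: {lift:+.1%}")

# What if small-screen, slow-network traffic grows to 90% of the total?
shifted = projected_lift(segments, {LARGE: 0.1, SMALL: 0.9})
print(f"lift under shifted mix: {shifted:+.1%}")
```

With these invented numbers, the treatment wins overall while losing in the second segment, and it loses overall once that segment dominates the traffic.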

In the physical world, we do not generally worry about the laws of nature being time-dependent. We treat the law of gravity and the speed of light as constants. However, the digital world changes far more rapidly than the physical one, as does user behavior. That makes it dangerous to assume that the conditions for an experiment hold indefinitely.

Do Not Despair!

If you rely on A/B testing as part of your day job, you might find this state of affairs disheartening. But please do not despair! We have it much better than the scientists in The Three-Body Problem. No aliens are out to get us!

Fortunately, there are things we can do to detect changes likely to invalidate our experiments over time. Here are a few:

  • Reverse Testing. You can revisit an A/B test by reversing it — that is, using the current version as the control and the old version as the treatment. The catch is that maintaining the ability to perform reverse tests requires discipline and can incur technical debt.
  • Long-Term Holdout. Typical A/B tests are short, e.g., two weeks. Keeping a holdout group on the control for longer (e.g., three months or a year) hedges against conditions changing during that time.
  • Monitoring. While it is important to look at metrics when evaluating a change as part of an A/B test, it is also important to look broadly at metrics over time when you are not making any changes. Trends or sudden changes in metrics can tell you when the world is changing.
  • Snapshots. While monitoring can alert you to unexpected changes in metrics, a more direct approach is to take a snapshot of the metrics that hold at the time an A/B test is conducted. Changes in those metrics are particularly likely to invalidate the test results.

This list is not exhaustive. Hopefully, it helps you think about ways to keep in mind conditions outside the explicit scope of your experiments.
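As one illustration of the snapshot idea above, here is a minimal sketch that records a few conditions at test time and later flags those that have moved by more than a chosen tolerance. The `Snapshot` fields, the 10% tolerance, and the example values are all hypothetical. Which conditions matter, and how much drift is too much, depends on the application.

```python
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class Snapshot:
    """Conditions recorded when an A/B test ran (hypothetical fields)."""
    mobile_share: float          # fraction of sessions on mobile devices
    median_latency_ms: float     # median page load latency
    baseline_conversion: float   # conversion rate of the control

def drifted_metrics(then, now, tolerance=0.10):
    """Return the names of metrics whose relative change exceeds `tolerance`."""
    drifted = []
    for field in fields(then):
        old = getattr(then, field.name)
        new = getattr(now, field.name)
        if old == 0:
            # Relative change is undefined; treat any move away from zero as drift.
            changed = new != 0
        else:
            changed = abs(new - old) / abs(old) > tolerance
        if changed:
            drifted.append(field.name)
    return drifted

# Hypothetical values, not real data.
at_test_time = Snapshot(mobile_share=0.55, median_latency_ms=180.0, baseline_conversion=0.050)
today = Snapshot(mobile_share=0.68, median_latency_ms=185.0, baseline_conversion=0.049)

print(drifted_metrics(at_test_time, today))
```

A non-empty result does not prove the original test is invalid. It is only a signal that the test may be worth revisiting, for example with a reverse test.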

The Only Constant is Change

As Heraclitus said, the only constant is change. When we perform A/B tests, we need to bear in mind that the results assume present conditions that are subject to change. As Ferris Bueller warned us, “Life moves pretty fast. If you don’t stop and look around once in a while, you could miss it”.

Russell Jurney

Graphs and Generative AI

3mo

Hahahaha, nice opener.

Joel Barajas

PhD, Principal Data Scientist, Ad Measurement Architect at Walmart Ads

3mo

I am glad you are talking about the dynamics of changing conditions (often overlooked by people running the tests). You are mainly touching on the external validity of a result: whether something tested in, say, June continues to hold later, say during the Christmas holidays. I tend to believe that the misconception comes from importing RCTs from medical treatments, where people’s health outcomes are easier to extrapolate. Long-term holdouts are probably best, but they are noisier (smaller groups) and sometimes difficult to disentangle if you have >5 small changes released over the course of the holdout. It is probably better to keep questioning side effects or unexpected behavior that trigger a new test, again changing the “improved version”. This is probably why continuing to run A/B tests, even for mature products, is needed to adapt to an ever-changing landscape. Just a POV
