Why significance testing leads you astray - a Monte Carlo simulation

Most of us are aware that sample-based metrics are subject to noise. We are usually much less aware, however, of the degree to which noise creeps into our metrics. Significance tests are part of the problem, as we intuitively think they separate the wheat from the chaff. A simulation experiment shows that is not the case.

One of the reasons we underestimate the noise in our metrics is that we want to make sense of the world. So, when a metric goes up, it is reassuring to assume that it went up because we did something. When we observe a relationship between two variables, it is reassuring to assume that they really are interconnected somehow. Coincidence as an explanation is usually not on the menu.

Habits in marketing research are another reason why we don't have a well-developed antenna for noise: many of us have been taught that we can beat uncertainty with significance testing. The tool was originally invented to provide more formal criteria for separating random from real effects in the social sciences. But after decades of debate, many scholars and journals no longer accept the significance test as the most important decision-making tool. Not in marketing research, though, where the tool is still bluntly (ab)used, leading to the reporting of 'findings' that are distracting at best and misleading at worst.

A table with only noise

Let’s look at a concrete example. The cross-table below shows a fictitious survey result: concept preference (rows) by age group (columns), based on N=1,000 observations. You can see that preference barely varies across the age categories. That is no coincidence: I filled both variables with random data. That way, we can be sure we are looking at sheer noise, and we know that any significant effects we do find are false positives.

[Image: cross-table of concept preference (rows) by age group (columns), N=1,000, random data]

Every time I create the random data and run the table it will look slightly different. With N=1,000 behind this table, the odds that we find a strong relationship between the two variables are nil. But will we find significant differences? The answer, of course, is yes. How many depends on the approach you take.
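If you want to play along at home, a noise-only table like this can be generated in a few lines. This is a minimal sketch; the category labels are just for illustration and not the ones behind the table above.

```python
# Minimal sketch: build a cross-table from two independent random variables.
import numpy as np
import pandas as pd

rng = np.random.default_rng()
N = 1_000

# Preference (4 rows) and age group (5 columns), drawn completely independently
preference = rng.choice(["Concept A", "Concept B", "Concept C", "Concept D"], size=N)
age_group = rng.choice(["18-29", "30-39", "40-49", "50-59", "60+"], size=N)

table = pd.crosstab(preference, age_group)
print(table)
```

Run it a few times and you will see the counts shuffle around, exactly as described above.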

One question, one test

A social scientist would (correctly) ask: is there a relationship between concept preference and age? That is one question, and it takes one (chi-square) test to answer. With completely random data, that test will produce a false positive about 1 out of 20 times (at the conventional significance level of 5%).
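In code, that single question translates into a single test on the whole table. A minimal sketch, reusing the random table from the snippet above:

```python
# Minimal sketch: one chi-square test of independence on the noise-only table.
# With purely random data, p < 0.05 should come up in roughly 1 out of 20 runs.
from scipy.stats import chi2_contingency

chi2, p, dof, expected = chi2_contingency(table)  # 'table' from the sketch above
print(f"chi-square = {chi2:.1f}, p = {p:.3f}")
```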

In marketing research, we take a different approach. We don't ask questions; we just test the shit out of our cross-tables and see what we dig up. The average cross-tabulation tool runs so-called Z-tests: a significance test for every available column pair in the table. In this example we have 5 column categories, which makes (5 x 4)/2 = 10 column pairs. With 4 rows, that adds up to 10 x 4 = 40 significance tests for this single cross-table… Sounds insane already, doesn’t it? Hold on, it gets worse…
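For the record, the arithmetic behind those 40 tests, in a minimal sketch:

```python
# Minimal sketch: counting the column-pair tests a typical crosstab tool runs.
from itertools import combinations

n_columns, n_rows = 5, 4
column_pairs = list(combinations(range(n_columns), 2))  # (5 x 4) / 2 = 10 pairs
print(len(column_pairs) * n_rows)                       # 10 x 4 = 40 tests
```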

Turning noise into signal

So, what are the odds of finding at least one statistically significant pair in the table with this approach? Monte Carlo simulation shows that the answer is about 70%, with an average of about 2 significant pairs in this single table. In about 30% of the cases there are at least 3. Remember: we're dealing with completely random data.
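The simulation itself is not complicated. Below is a minimal sketch of the set-up: generate a noise-only table, run a two-proportion z-test for every row and column pair, and repeat many times. The exact figures will wobble a bit from run to run and depend on the details of the test, but the pattern is the one described above.

```python
# Minimal Monte Carlo sketch of the 'test every column pair' approach on pure noise.
# A pooled two-proportion z-test is run per row and column pair at alpha = 0.05.
import numpy as np
from itertools import combinations
from scipy.stats import norm

rng = np.random.default_rng()
N, N_ROWS, N_COLS, ALPHA, N_SIMS = 1_000, 4, 5, 0.05, 5_000
Z_CRIT = norm.ppf(1 - ALPHA / 2)  # ~1.96 for a two-sided test at 5%

def count_significant_pairs(table):
    """Count 'significant' two-proportion z-tests over all rows and column pairs."""
    col_totals = table.sum(axis=0)
    hits = 0
    for r in range(N_ROWS):
        for i, j in combinations(range(N_COLS), 2):
            n1, n2 = col_totals[i], col_totals[j]
            p1, p2 = table[r, i] / n1, table[r, j] / n2
            pooled = (table[r, i] + table[r, j]) / (n1 + n2)
            se = np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
            if se > 0 and abs(p1 - p2) / se > Z_CRIT:
                hits += 1
    return hits

counts = np.empty(N_SIMS)
for s in range(N_SIMS):
    # Preference (rows) and age (columns) drawn independently: noise only.
    rows = rng.integers(0, N_ROWS, size=N)
    cols = rng.integers(0, N_COLS, size=N)
    table = np.zeros((N_ROWS, N_COLS), dtype=int)
    np.add.at(table, (rows, cols), 1)
    counts[s] = count_significant_pairs(table)

print(f"P(at least one 'significant' pair): {np.mean(counts >= 1):.2f}")
print(f"Average number of 'significant' pairs: {counts.mean():.2f}")
print(f"P(at least three): {np.mean(counts >= 3):.2f}")
```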

See the problem?

Of course, the odds of finding false positives decrease with smaller tables and increase with bigger tables. The exact probability of finding false positives is not that important, though; the thing to remember is that with this common approach of testing column pairs in marketing research, we are guaranteed to turn noise into signal.

True but trivial

In real life, you don’t deal with completely random data. That means that in many cases, some of your significant effects will be false positives and some will be true positives. You don’t know, however, which is which. On top of that, most true positives turn out to be small differences, which means they are likely to be insignificant from a commercial perspective. The more observations you have, the smaller the differences that will be flagged as ‘statistically significant’.

The bottom line: with enough tests, you can be sure to find plenty of ‘statistically significant results’. But I hope the problem is clear: rely on statistical significance and you will mainly be reporting noise and trivialities.

A better approach

In summary then, as a researcher with only significance testing in your toolbox, you don't deserve to be taken seriously as someone who contributes to commercial decisions. Make sure you also:

  • Look at effect size: bigger differences are more likely to be commercially relevant (see the sketch after this list).
  • Look at consistency: if you find a difference, you can hypothesize where else you expect to find differences.
  • Look at business relevance: will it make a difference to act on the observed difference? Will it be costly?
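Effect size, at least, is easy to compute alongside the p-value. Here is a minimal sketch of one common measure for a cross-table, Cramér's V; the choice of this particular measure is just an example, not a prescription.

```python
# Minimal sketch: Cramér's V as an effect-size measure for a cross-table.
# 0 means no association at all, 1 means a perfect association.
import numpy as np
from scipy.stats import chi2_contingency

def cramers_v(table):
    table = np.asarray(table)
    chi2, _, _, _ = chi2_contingency(table)
    n = table.sum()
    k = min(table.shape) - 1  # the smaller of (rows - 1, columns - 1)
    return np.sqrt(chi2 / (n * k))

print(cramers_v(table))  # close to zero for the noise-only table above
```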

Effect size you can automate; for consistency and business relevance, you have to think. That’s why they are hardly used. Don’t let that stop you from adding these criteria to your repertoire, though.

 

I don't comment often on this website but this piece is just too good ;-) Thanks!

Romain Futrzynski

Data+AI Specialist @Databricks | PhD in Computational Physics

3y

"if you find a difference, you can hypothesize where else you expect to find differences" I really like that. This is really the core of the scientific method, not applying (complex) recipes.

Graham Hill (Dr G)

30 Years Marketing | 25 Years Customer Experience | 20 Years Decisioning | Opinions my own

3y

Thoughtful and well worth reading. What would Darrell Huff say?

Adrian Olszewski

Clinical Trials Biostatistician at 2KMM - 100% R-based CRO ⦿ Frequentist (non-Bayesian) paradigm ⦿ NOT a Data Scientist (no ML/AI) ⦿ Against anti-car/-meat/-cash and C40 restrictions

3y

Robert van Ossenbruggen In clinical trials, which are entirely based on NHST via planned, designed RCTs (all hypotheses and the corresponding analytical tools, tests and models, defined a priori), we also (at least in the last decade; better late than never) employ practical significance: the smallest meaningful effect, called here the MCID, the minimal clinically important difference (attached). Anything smaller than it is ignored, regardless of whether p<0.0000001. A result must pass the "threshold of sensibility"; only then is it checked for "statistical visibility" (the word "significance" can be replaced here with statistically "detectable" or "discernible"). The opposite is also important: not throwing away things that are "clinically visible" ("we can see that with the naked eye") but merely underpowered. In that case I always analyse the symmetry of the corresponding CI (sadly quite a rare approach; people mostly look at whether or not it covers the null/neutral value). Regarding effect sizes, over the years, after a short "friendship", I treat them carefully, but I still give them a chance and report a few variants (statistical reviewers have various preferences) for both parametric and non-parametric analyses.

Tom Lanktree

Lover of life, language and literature, advises brands on how to win hearts and wallets.

3y

Can this help explain why so many product launches go wrong? False positives giving ideas that a particular group likes a concept whereas no significant preference existed?
