You invite your friends over for a gathering at your house.
As the party gets started, you watch in horror as a friend trips on the edge of your carpet before she even takes the first sip of her Cabernet (now spilled on your white couch).
You rush to fix the carpet. But before you do this, you hesitate. You’re a researcher!
Researchers don’t jump to conclusions. They take a measured approach to finding and fixing problems. And that starts with being careful not to jump to a conclusion after one observation.
You recall the Magic Number 5 and how to compute a sample size for problem discovery studies, so you patiently wait for the next four guests to enter the living room and observe what happens.
How many people do you need to observe tripping on the carpet before you have evidence of a real problem?
Seems like a pretty dumb idea to wait for more people to trip. Is it also a dumb idea to wonder what sample size you need to detect problems in usability studies?
We’ve heard versions of this story and read similar rationales posed by other researchers. We think the parable is both helpful and deceiving.
Why It’s Helpful: Once a Problem Has Been Seen, It Exists
The parable is helpful for a few reasons.
- Once you see a problem, there’s no longer a question of whether it exists. It does. The discussion shifts from discovery to recovery—is the problem serious enough to deserve the allocation of resources to fix it, and if so, what needs to be done to fix it?
- Just one user can provide powerful insights on what problems to fix.
- It reinforces the idea that the severity of a problem should be treated independently of the frequency (especially when a problem is obviously severe).
- It’s a reminder that overrecruiting is a waste of resources if you really don’t need more participants.
Why It’s Deceiving: Now the Problem Is Known
The story is deceiving primarily because of how it’s framed.
How many participants do you need to see trip on your carpet before you decide it’s a problem serious enough to fix? In its more general form, once you’ve seen something that is obviously a potentially serious problem, how many more people do you need to see experience the problem before you do something about it? Framed this way, the question seems nonsensical because you already know there’s a problem.
Before the party, however, you didn’t know there was a problem.
The question in UX research is rarely, “Once we’ve seen a problem (now that we know it exists), how many more people do we need to see experiencing the problem?” The question is usually something like: “We don’t know what the problems are but want to find and fix as many of them as we can. What’s a reasonable number of people to observe to find a reasonable set of problems?”
So, even though the parable is helpful, it’s also deceiving because:
- Seeing a problem happen with the first user doesn’t mean that you need to test only one user.
- Nor does it mean you should dismiss the value of sample size estimates to guide the process of uncovering problems.
- Most usability tests look for multiple issues, not one issue, so it doesn’t make sense to have a stopping rule that ends the test after you’ve seen one participant have one issue.
When computing sample sizes, the question is not “How many people do you need to test once you’ve seen a problem?” The question is “How many people do you need to observe to see whether unknown problems exist?”
Plot Variation #1: No One Trips
Suppose we change the plot of the parable a bit. There’s no difference in the carpet. It’s still possible that it might trip someone. But in this version, you’ve invited five friends to the party, and as they enter the living room, no one trips on the carpet. You remain blissfully unaware of the potential problem.
If a tree falls in the forest and the first five hikers who cross over it don’t trip, is it a problem?
More generally, if a problem never occurs out of five opportunities, is that compelling evidence that there really isn’t a problem?
It is compelling evidence that the likelihood of tripping isn’t super high. The 90% adjusted-Wald binomial confidence interval for 0/5 occurrences of a potential problem ranges from 0 to 40%. With the evidence in hand, it’s plausible that the problem likelihood is 0%, but it’s also plausible that it’s as high as 40%. This is related to our discussion in a previous article that it’s possible with a small sample to prove a problem exists, but it takes a very large sample (much larger than those typical of problem discovery studies) to provide compelling evidence that a problem does not exist.
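For readers who want to check the interval for 0/5, here is a minimal sketch of the adjusted-Wald (Agresti-Coull) computation. The function name and the default confidence level are our own choices for illustration. A two-sided 90% level reproduces the 0 to 40% range; a two-sided 95% level gives a wider range, with an upper bound of about 49%.

```python
import math
from statistics import NormalDist


def adjusted_wald_interval(x, n, confidence=0.90):
    """Two-sided adjusted-Wald (Agresti-Coull) interval for x occurrences in n trials.

    The name and the default confidence level are illustrative assumptions.
    """
    # z-score for a two-sided interval at the requested confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Add z^2/2 successes and z^2/2 failures before computing the Wald interval
    n_adj = n + z ** 2
    p_adj = (x + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


low, high = adjusted_wald_interval(0, 5)
print(f"{low:.0%} to {high:.0%}")  # prints "0% to 40%"
```

The lower bound is clamped to 0 because the unadjusted arithmetic produces a negative value, which is not a meaningful proportion.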
How to Plan for Unknown Carpet (and Other) Hazards
As a UX researcher, you’re probably unconcerned about party planning and carpet hazards, but you’ll want to be sure you’ve found and fixed the usability problems in your products before users arrive.
We recommend doing a small-sample usability test, which will uncover many of the more common (frequently occurring) problems. However, usability testing will not uncover all problems, and certainly not all low-frequency critical problems: those that may happen infrequently but have the potential to cause major issues.
So, how can you plan the number of people to observe in a problem discovery usability test?
One of the most useful tools for this kind of planning is a sample size table like Table 1.
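Table 1: Likelihood of discovering a problem (seeing it at least once) with n participants (rows) for problems that would happen to a given percentage of the population (columns).
n | 1% | 5% | 10% | 15% | 25% | 30% | 50% | 75% |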
1 | 1% | 5% | 10% | 15% | 25% | 30% | 50% | 75% |
2 | 2% | 10% | 19% | 28% | 44% | 51% | 75% | 94% |
3 | 3% | 14% | 27% | 39% | 58% | 66% | 88% | 98% |
4 | 4% | 19% | 34% | 48% | 68% | 76% | 94% | 100% |
5 | 5% | 23% | 41% | 56% | 76% | 83% | 97% | 100% |
6 | 6% | 26% | 47% | 62% | 82% | 88% | 98% | 100% |
7 | 7% | 30% | 52% | 68% | 87% | 92% | 99% | 100% |
8 | 8% | 34% | 57% | 73% | 90% | 94% | 100% | 100% |
9 | 9% | 37% | 61% | 77% | 92% | 96% | 100% | 100% |
10 | 10% | 40% | 65% | 80% | 94% | 97% | 100% | 100% |
The table shows that if you run a problem-discovery usability study with five participants, you’d expect to discover (see at least once) almost all the problems that would happen to half or more of the population (within the constraints of the problems that are discoverable in the task set). You’d also expect to discover about 76% of problems that would happen to a quarter of the population and over half the problems that would happen to 15% of the population. The expected discovery rates for problems that would happen to 10%, 5%, and 1% of the population are, respectively, 41%, 23%, and 5%. All these discovery rates improve when the sample size is increased to ten, but even then, the expected discovery rates for 10%, 5%, and 1% of the population are well below 100% (respectively, they are 65%, 40%, and 10%).
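The values in Table 1 are consistent with the standard "seen at least once" formula, 1 − (1 − p)^n, where p is the likelihood that a problem affects any one participant and n is the sample size. The sketch below regenerates the table rows under that assumption (each participant encounters the problem independently with the same probability). The function names are ours, for illustration only.

```python
def discovery_likelihood(p, n):
    """Chance of seeing a problem at least once across n participants,
    assuming each one independently encounters it with probability p."""
    return 1 - (1 - p) ** n


# Problem likelihoods used as the column headers of Table 1
likelihoods = [0.01, 0.05, 0.10, 0.15, 0.25, 0.30, 0.50, 0.75]


def table_row(n):
    """Format one row of the table: sample size, then rounded percentages."""
    cells = [f"{discovery_likelihood(p, n):.0%}" for p in likelihoods]
    return " | ".join([str(n)] + cells)


for n in range(1, 11):
    print(table_row(n))
```

Changing `likelihoods` or the range of `n` extends the table to other planning scenarios, such as rarer problems or larger studies.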
If after observing five or ten people you don’t see anything, it doesn’t mean you can guarantee there are NO problems. It just means it’s unlikely there are any common problems (assuming those users and tasks were representative and comprehensive).
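The same relationship can be turned around to ask how many participants are needed to reach a chosen likelihood of seeing a problem at least once. The sketch below assumes the same independence model as Table 1. The function name and the 85% goal in the example are hypothetical choices, not recommendations from this article.

```python
def participants_needed(p, goal):
    """Smallest n for which 1 - (1 - p)**n reaches the goal.

    p is the assumed likelihood that a problem affects one participant;
    goal is the desired chance of seeing it at least once.
    """
    if not (0 < p <= 1) or not (0 <= goal < 1):
        raise ValueError("need 0 < p <= 1 and 0 <= goal < 1")
    n = 1
    # Step up the sample size until the discovery likelihood reaches the goal
    while 1 - (1 - p) ** n < goal:
        n += 1
    return n


# Hypothetical planning goal: an 85% chance of seeing the problem at least once
for p in (0.50, 0.25, 0.01):
    print(f"p = {p:.0%}: n = {participants_needed(p, 0.85)}")
```

For this hypothetical goal, the required sample size grows from 3 participants for problems that affect half the population to 7 for a quarter and 189 for 1%, which illustrates why small samples mostly surface the more common problems.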
Plot Variation #2: The Role of Severity in Prioritization
Suppose we change the plot of the story again. We’re back at the party, and you watch as your first guest steps on the carpet effortlessly while sipping wine (he tends to be a high stepper). The next guest trips but recovers quickly with no spill. Then comes the fateful moment when the third guest trips on the carpet, spills all the wine on your white couch, and breaks her ankle (and she’s a personal injury lawyer).
From the start of the party, the problem with the carpet was there. The problem was undiscovered with the first guest, discovered with the second guest but with minor impact, and then found again with the third guest with more significant consequences.
You can’t effectively prioritize problems based just on their frequency of occurrence. Among the other factors to consider when prioritizing (e.g., cost of fixing), one of the most important is the impact of the problem on the person who experiences it. To further complicate things, different people may experience different impacts from the same underlying problem.
Because the mathematics used to model problem discovery is based on frequency, the discovery of a problem is only the first step to fixing it. It may compete with other known problems for limited resources where prioritization is based not only on problem frequency but also on its impact (observed and hypothetical worst cases).
Discussion and Summary
Parables can be powerful.
Traditional parables are usually short stories, often but not exclusively religious, that, after reflection, drive home a moral lesson.
It’s also possible for a parable to be a single well-crafted question that provokes deep thought and leads to consideration of a larger truth or principle (e.g., the title of this article).
However, discussion of the meaning of a parable can be affected by its framing—what information contained in or implied by the parable is emphasized or left out.
These framing decisions may be associated with a particular agenda, so it’s important to keep this in mind when thinking critically about a particular presentation and discussion of a parable.
The answer to the carpet parable, as given, seems obvious—one person. But an equally valid answer could be zero people. Even if you never see someone trip on a carpet with a raised edge, as soon as you know there is a potential trip hazard (e.g., you spot the potential problem or someone else tells you about it), you should fix it because, in a worst-case scenario, someone could be injured or even die if it’s left unfixed. Things get even trickier when you move away from the given question and try to apply it to formative usability testing.
The primary purpose of formative usability testing is to set up the conditions (participants, tasks, and environments) that enable the discovery of unknown (or suspected but as yet unobserved) problems when using a product or system.
Bottom line: After reflection, the carpet parable fails to apply to formative usability testing because it’s focused on a single known problem that has the potential for catastrophic consequences rather than the discovery of a set of unknown problems that vary in their likelihood, severity (observed and potential), and priority for fixing.