Since its development in the 1980s, System Usability Scale (SUS) score analysis has focused on estimating and comparing means.
The mean is one type of average. Would it be more appropriate to estimate and compare SUS medians instead of means?
To investigate this question, we analyzed and compared SUS means and medians collected from over 18,000 individuals who used the SUS to rate the perceived usability of over 200 products and services. But before we get to those results, it helps to understand the difference between the median and mean and why you might choose one or the other.
Means and Medians
One of the first and easiest things to do with a data set is to find the mean. The mean is a measure of central tendency, one way of summarizing the middle of a dataset. To calculate the mean, add up the data points and divide by the total number in the group (the sample size, n). With the mean, every data point contributes to the estimate. The mean is the preferred method for data that are at least roughly symmetrical (i.e., the mean is about midway between the lowest and highest values). Of the many types of symmetrical distributions, the best known is the normal distribution.
When the data aren’t symmetrical, the mean can be sufficiently influenced by a few extreme data points to become a poor measure of the middle value. In these cases, the median, the center point of a distribution, is a better estimate of the most typical value. For example, this often happens with distributions of time data. For samples with an odd number of data points, the median is the central value; for samples with an even number, it’s the average of the two central values. With the median, only the center one or two data points directly contribute to the estimate; the remaining data points serve only to locate the center.
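To make the contrast concrete, here’s a quick sketch (standard-library Python with made-up task-time numbers) of how a single extreme value pulls the mean but not the median:

```python
from statistics import mean, median

# Hypothetical task times in seconds: mostly quick completions plus
# one extreme outlier, a common shape for open-ended time data.
times = [12, 14, 15, 16, 18, 21, 95]

print(mean(times))    # about 27.3, pulled upward by the 95-second outlier
print(median(times))  # 16, the central value, unaffected by the outlier's size
```

Shrinking or growing the outlier (say, 95 → 950) changes the mean dramatically but leaves the median at 16.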
Means and Medians of Rating Scales
Individual Rating Scales
The median is a problematic measure of central tendency for individual rating scales. Because respondents select one number from a rating scale, the dataset is composed of integers. For a five-point scale, the median can take only the following values no matter how large the sample is: 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, and 5.0. (And it can only take the intermediate values when n is even.)
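You can verify this list by brute force. The short sketch below (standard-library Python) enumerates every value a five-point-scale median can take: with an odd n the median is a single scale value, and with an even n it is the average of the two central observations, which can be any pair of scale values.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

scale = range(1, 6)  # a five-point rating scale

# Odd n: the median is one of the scale values.
odd_medians = {Fraction(v) for v in scale}
# Even n: the median is the average of the two central observations,
# which can be any pair of scale values (including a repeated value).
even_medians = {Fraction(a + b, 2)
                for a, b in combinations_with_replacement(scale, 2)}

possible = sorted(odd_medians | even_medians)
print([float(m) for m in possible])
# [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
```

No matter how large the sample gets, only these nine values are possible.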
The mean, on the other hand, can take any value between 1 and 5, and as the sample size increases, it becomes more and more continuous. Because the mean can take many more values, it can reflect differences between two samples more sensitively than the median can.
When scales are open-ended (have at least one endpoint at infinity, like time data), extreme values can affect the mean but will not affect medians. Rating scales, however, are not open-ended, so the median does not have a compelling advantage over the mean when analyzing individual rating scales.
The System Usability Scale
Things get more complicated when working with metrics that are composites of many individual rating scales, like the SUS. The SUS is made up of ten five-point items with the final score (the mean of the ten items) interpolated to range from 0–100, so it can take 41 values in 2.5-point increments (0, 2.5, 5.0, 7.5, … 97.5, 100). The median can take these values when n is odd. When n is even, 40 intermediate values (such as 1.25 and 3.75) are also possible, for a total of 81 potential median values separated by just 1.25 points. As with individual rating scales, the mean becomes more continuous as the sample size increases, but with so many possible median values, the difference in mean–median sensitivity is reduced for the SUS.
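For readers who want to check the arithmetic, here is a minimal sketch of the standard SUS scoring rule (the function name and variables are ours): odd-numbered items are positively worded and contribute r − 1 points, even-numbered items are negatively worded and contribute 5 − r, and the 0–40 raw sum is interpolated to 0–100 by multiplying by 2.5.

```python
def sus_score(responses):
    """Score ten 1-5 SUS responses on the standard 0-100 scale."""
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    # Items 1, 3, 5, 7, 9 (index 0, 2, ...) contribute r - 1;
    # items 2, 4, 6, 8, 10 contribute 5 - r.
    raw = sum((r - 1) if i % 2 == 0 else (5 - r)
              for i, r in enumerate(responses))
    return raw * 2.5  # interpolate the 0-40 raw sum to 0-100

print(sus_score([5, 1] * 5))  # 100.0: 5 on odd items, 1 on even items
print(sus_score([1, 5] * 5))  # 0.0: the reverse pattern
print(sus_score([3] * 10))    # 50.0: straightlined all-3s
```

Because the raw sum is an integer from 0 to 40, multiplying by 2.5 yields exactly the 41 possible individual scores described above.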
Historically, the typical practice has been to compute SUS means rather than medians, even though the distribution of the SUS is known to be asymmetrical. Thus, the methods developed over decades to interpret SUS scores are based on means, not medians.
We would never recommend relying on the median of an individual rating scale as a measure of central tendency, but we were curious about the typical difference between the means and medians of datasets collected with the SUS. Fortunately, we have a LOT of SUS data to answer this question.
Differences Between SUS Means and Medians
We compiled a large set of SUS data with 18,853 individual SUS scores, assessing 210 products and services, studied from 2010 through 2022 (primarily business and consumer software products, websites, and mobile apps). Figure 1 shows the expected asymmetric distribution of the data.
The distribution is left-skewed (with a number of atypically low SUS scores on the left). There are spikes at values of 50, 75, and 100, but not at 0 or 25. The spike at 50 was somewhat expected because one way to get a score of 50 is to rate each SUS item with the same response option. There is always a concern that this straightlining might indicate a respondent who isn’t carefully considering each item. This concern is partially (but not totally) alleviated by the differences in the number of 75s and 100s compared to 0s and 25s. For example, the only pattern that produces a score of 100 is alternating selections of 5 for odd-numbered items and 1 for even-numbered items, and the only way to get 0 is to reverse that pattern. If most respondents were selecting patterns at random, then the number of 100s and 0s would be similar—but they aren’t. Despite this, researchers should be suspicious enough of scores of 50 to investigate other aspects of those respondents’ behaviors to see whether the data should be retained or excluded from analysis. For the following analyses, we retained all the data but focused on analysis at the product rather than the individual level.
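The algebra behind the spike at 50 is easy to check. Under standard SUS scoring, an odd-numbered item answered r earns r − 1 points and an even-numbered item answered the same r earns 5 − r, so each straightlined pair of items is worth exactly 4 points regardless of which response option was chosen, giving a raw sum of 20 and a score of 20 × 2.5 = 50:

```python
# Each same-response item pair earns (r - 1) + (5 - r) = 4 points,
# so any straightlined response sheet sums to 20 raw points and
# scores 20 * 2.5 = 50.0, whatever the chosen response option.
for r in range(1, 6):
    pair_credit = (r - 1) + (5 - r)
    print(r, pair_credit * 5 * 2.5)  # prints 50.0 for every r
```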
Analysis of Product SUS Means and Medians
Figure 2 shows the scatterplot of the SUS means and medians from all 210 products in the dataset. As expected, the means and medians had a strong linear relationship (r(208) = .97, p < .0001)—an almost perfect correlation.
Figure 3 shows the average difference (with 95% confidence intervals) between means and medians for all data (n = 210) and the data split between studies with relatively low sample sizes (n < 30, 61 products with n ranging from 5 to 26) and those with larger sample sizes (n ≥ 30, 149 products with n ranging from 30 to 1,969).
Inspection of the confidence intervals in Figure 3 shows that, on average, SUS medians were about 2 points higher than SUS means, and there was no significant difference due to sample size. The standard deviation of the difference was larger when n < 30 (4.3 versus 2.5), reflected in the slightly larger range of the confidence interval.
For all products and the sample size splits, the lower limit of the confidence intervals was higher than 0, indicating that the median–mean difference was statistically significant (i.e., a difference of 0 is not plausible). Using the data from all products as the most precise estimate of the difference (smallest confidence interval), the confidence interval ranged from 1.7 to 2.5. This means that a median–mean difference of 2.0 is plausible, but differences lower than 1.7 or higher than 2.5 are not.
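As a sketch of the confidence-interval logic (not a reproduction of our analysis — the standard deviation of the 210 differences isn’t reported above, so sd_diff below is an illustrative assumption chosen to land near the reported interval), a 95% interval for a mean difference with a large n looks like this:

```python
from math import sqrt
from statistics import NormalDist

mean_diff = 2.1   # illustrative median-mean difference
sd_diff = 3.0     # ASSUMED standard deviation of the 210 differences
n = 210           # number of products

z = NormalDist().inv_cdf(0.975)        # ~1.96 for a 95% interval
margin = z * sd_diff / sqrt(n)
lower, upper = mean_diff - margin, mean_diff + margin
print(f"95% CI: {lower:.1f} to {upper:.1f}")  # prints: 95% CI: 1.7 to 2.5
# A lower limit above 0 means a difference of 0 is not plausible.
```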
Analyses of Sample Sizes of Ten and Twenty
Perhaps there’s a difference when sample sizes are smaller? To explore the median–mean difference with smaller sample sizes, we focused on the 149 products with at least 30 SUS scores. For each of these products, we assigned a random number to each respondent and then sorted respondents by random number. We created two new datasets, one containing the first 10 randomly selected participants and the other containing the next 20.
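In code, this subsampling procedure amounts to shuffling each product’s respondents and slicing (a sketch with hypothetical scores; assigning random numbers and sorting by them is equivalent to a shuffle):

```python
import random

random.seed(1)  # seeded only so this sketch is reproducible

# 30 hypothetical SUS scores for one product.
scores = [s * 2.5 for s in range(10, 40)]

shuffled = scores[:]
random.shuffle(shuffled)    # equivalent to sort-by-random-number
first_10 = shuffled[:10]    # first subsample (n = 10)
next_20 = shuffled[10:30]   # second, non-overlapping subsample (n = 20)
print(len(first_10), len(next_20))  # 10 20
```

Because the two slices don’t overlap, the n = 10 and n = 20 subsamples for a product share no respondents.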
Figure 4 shows the scatterplots for each new dataset. Consistent with the full set of data shown in Figure 2, the means and medians for both datasets have a strong linear relationship (n = 10: r(147) = .94, p < .0001; n = 20: r(147) = .95, p < .0001). The spread of points in the two graphs is consistent with their standard deviations (n = 10: 4.2; n = 20: 3.6), with slightly more spread for the smaller sample size.
Figure 5 shows the average difference (with 95% confidence intervals) between means and medians for the two datasets.
The results in Figure 5 are consistent with those in Figure 3. SUS medians were about 2 points higher than the means with no significant difference due to sample size. The standard deviation of the difference when n = 10 (4.0) was slightly higher than when n = 20 (3.5).
Consistent with the analysis of all the data, the lower limit of the confidence intervals was higher than 0, indicating a statistically significant median–mean difference. When n = 10, the 95% confidence interval around the difference ranged from 1.3 to 2.6; when n = 20, it ranged from 1.6 to 2.7, both just a bit wider than the best (full-data) estimate of 1.7 to 2.5.
Summary and Discussion
We compared the means and medians of SUS scores from 18,853 individuals who used the SUS to rate the perceived usability of 210 products and services. Our key findings and conclusions are:
We recommend using the mean rather than the median. As described in more detail below, the medians of SUS distributions are slightly but consistently higher than the means, so researchers who use the various tools developed over the past few decades to interpret the SUS would slightly but consistently overestimate the quality of the user experience if they reported the median rather than the mean.
There was a statistically significant but small two-point difference between the means and medians. When a distribution is left-skewed, you expect the median to be larger than the mean because the extremely low scores exert some pull on the location of the mean. For these data, the median was typically about 2 points higher than the mean (between 1.7 and 2.5 with 95% confidence).
The difference between means and medians was consistent across different sample sizes. We estimated the difference using all data for all products, splitting the products into two groups with one containing the products that were assessed with n < 30 and the other with n ≥ 30, and randomly selecting individual participants from products with n ≥ 30 to assess sample sizes of 10 and 20. All analyses had average median–mean differences of about 2 with similar ranges of 95% confidence intervals.
The observed median–mean difference was consistent with the gaps between the possible medians of the SUS. An individual SUS score can take 41 values in 2.5-point increments from 0 to 100. When the sample size is odd, the median is restricted to these values. When the sample size is even, the median can, in some cases, also take the 40 values midway between the 2.5-point increments. Of the 210 products in our dataset, 123 (59%) had an even sample size and 87 (41%) had an odd sample size. It is interesting that the observed difference of about 2 points is between the smaller increment of 1.25 points and the larger increment of 2.5 points. So:
- If every median from a study with an even sample size fell on an intermediate (1.25-point) increment, the typical increment would be 1.8 (.41 × 2.5 + .59 × 1.25).
- If half of the medians from a study with an even sample size fell on an intermediate increment, the typical increment would be 2.1 (.705 × 2.5 + .295 × 1.25).
- If none of the medians from a study with an even sample size fell on an intermediate increment, the typical increment would be 2.5 (.41 × 2.5 + .59 × 2.5).
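These three weighted averages are easy to verify. (In the half-on-midpoints case, p_odd × 2.5 + p_even × (2.5 + 1.25)/2 is algebraically the same as .705 × 2.5 + .295 × 1.25.)

```python
p_odd, p_even = 0.41, 0.59  # share of products with odd / even sample sizes

all_mid = p_odd * 2.5 + p_even * 1.25                # every even-n median on a midpoint
half_mid = p_odd * 2.5 + p_even * (2.5 + 1.25) / 2   # half of them on midpoints
none_mid = (p_odd + p_even) * 2.5                    # none on a midpoint

print(round(all_mid, 1), round(half_mid, 1), round(none_mid, 1))  # 1.8 2.1 2.5
```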
This range (1.8 to 2.5) is remarkably close to our best estimate of the plausible range of the median–mean difference (1.7 to 2.5). It could be a coincidence, but it’s still intriguing.
The difference between SUS means and medians is small, but using the median could sometimes be problematic when using existing methods of interpreting the SUS. Given the typical distribution of the SUS, the median will almost always be greater than the mean—by our estimates, usually about 2 points higher. In most cases, this is close enough that the interpretation of both measures of central tendency will be the same. There might be differences, however, at interpretive boundaries (e.g., between B and A on a curved grading scale, or between Excellent and Best Imaginable on an adjective scale like those shown in Figure 6).