Methods in People Analytics - Measuring the difference between two groups
Methods for comparing groups are pretty common whether you're a medic or a psychologist

This is one of a series of articles which explore established and novel methodologies in People Analytics. See the previous article on handling restriction of range here.

One of the most common tasks that anyone working in People Analytics or a similar discipline needs to undertake is to measure the difference between two groups of people. These groups could have taken some sort of measurement instrument, and you are interested in whether a bias of some form is lurking in the shadows. Or they could have been the subject of a controlled experiment whose impact you wish to measure. Or nothing specific might have happened at all, and you may simply wish to track the standard measures that are collected about people in your organization (such as performance ratings) and compare two groups of interest.

The more trypanophobic among you may wonder why I have chosen to illustrate this article with the photograph above. I am drawing a link to medical research, which is really the only place you need to look to get a good briefing on which methods to apply when measuring the difference between groups. Just as we may investigate psychological constructs or measurement bias by looking at group differences, so does medical research use these methods to investigate epidemiological or clinical effects.

To my mind, the most common statistical methodologies to compare groups in psychology and in medicine are the same. Why? Because one man, a behavioral scientist, invented most of them.

The effect size

Jacob Cohen's record speaks for itself. How many behavioral scientists produce work so good that it is adopted by the entire medical field? There is no doubt that Cohen was a giant in the field of estimation statistics and gave rise to a whole host of measures named after him, including Cohen's kappa and Cohen's h (for a later article perhaps). His measures appear in so many thousands of papers that in 1997 he was awarded the American Psychological Society's Distinguished Lifetime Contribution award.

Cohen's d - otherwise known as the effect size - is what I want to focus on here. This is a statistic so simple and so obvious in hindsight that I found it a genuine surprise that it was such a recent contribution to the fields of psychology, medicine and statistics. But as I learned as a PhD student, things in mathematics look a lot simpler when written in a textbook than when used in practice.

Let me illustrate the value of Cohen's d in practice. Suppose you have a sample of high school students who have just taken a test that you have designed. Suppose that half of your sample is male and half is female. Suppose additionally that you want to know whether there is a difference in how males and females scored in your test. So you calculate that males scored an average of 31.2 out of 50 in your test and females scored an average of 29.8. This means males scored 1.4 points higher on average. Job done.

Or is it? What exactly does a difference of 1.4 mean? Is it big or small? Is it significant? Should you worry about it or not?

To illustrate my point, let's say your colleague has developed a different test, the same people have taken it, and you want to know if the gender difference is similar in your colleague's test. You calculate that in your colleague's test the average for males was 87.5 out of 150 and for females it was 81.7 out of 150. So the difference is 5.8 points in favor of males. Bigger than the 1.4 on your test, right?

Cohen's d

The example above highlights two problems that occur when you try to compare groups by just focusing on their means. One problem is that you have no way of sizing the difference or telling if it is significant because you have no comparator. Another is that you cannot directly compare two measures that have different underlying scales.

In mathematics, this translates into a problem of normalization. The different scales are the complicating factor here, and we need to remove them in a way that allows us to compare the differences directly on the same scale. One option might be to divide the second set of scores by three so that they too become scores out of 50. But what if we learn that nobody ever scores lower than 30 or higher than 100 on that test? A simple rescaling of the endpoints tells us nothing about how the scores are actually spread.

Cohen's insight was that the difference needs to be expressed as a proportion of the 'spread' of the measure. By dividing the difference by the standard deviation of the data for the entire group, you express it in a way that allows it to be directly compared with other measures, because the standard deviation takes care of the 'spread' of the measurement scale - it's your normalizing factor. The formula is:

d = (M1 - M2) / SD

where M1 and M2 are the means of the two groups and SD is the standard deviation of the entire sample.

Let's use this formula for our example above. We calculate the standard deviation of the entire sample for your test and it turns out to be 3.6, while for your colleague's test the spread is much greater, at 20.3. Cohen's d for your test is therefore 1.4/3.6 = 0.39, while for your colleague's test it is 5.8/20.3 = 0.29. So while men perform better on both tests, the effect is stronger in your test than in your colleague's.
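If you want to reproduce these numbers yourself, here is a minimal Python sketch using the summary statistics quoted above; the function name is my own, purely for illustration.

```python
def cohens_d(mean_1, mean_2, sd_whole_sample):
    """Cohen's d as described above: the raw mean difference
    expressed as a proportion of the whole sample's spread."""
    return (mean_1 - mean_2) / sd_whole_sample

# Your test: male mean 31.2, female mean 29.8, whole-sample SD 3.6
print(round(cohens_d(31.2, 29.8, 3.6), 2))   # 0.39

# Your colleague's test: means 87.5 and 81.7, whole-sample SD 20.3
print(round(cohens_d(87.5, 81.7, 20.3), 2))  # 0.29
```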

Because of the calculation methodology, effect sizes are often expressed as 'standard deviations'. You could say, for example, that on your test men score about 0.4 standard deviations higher than women.

One lesser-known subtlety in the calculation of Cohen's d arises when a controlled experiment has taken place: you have a control group for which no action is taken and a treatment group for which a certain action has been taken. In this case, the difference should be divided by the standard deviation of the control group, not of the pooled sample, because the control group is more likely to represent the statistical properties of the normal population (having not been the subject of any deliberate action).
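As a sketch of that variant (often called Glass's delta), here is one way to compute it from raw score vectors; the data and function name below are hypothetical, assuming you have one score per person in each group.

```python
import numpy as np

def control_scaled_effect(treatment_scores, control_scores):
    """Effect size for a controlled experiment: the mean difference
    divided by the control group's SD only (Glass's delta)."""
    diff = np.mean(treatment_scores) - np.mean(control_scores)
    return diff / np.std(control_scores, ddof=1)

# Hypothetical post-intervention scores for the two groups
treatment = np.array([72, 80, 75, 84, 78, 81])
control = np.array([70, 74, 69, 73, 71, 75])
print(round(control_scaled_effect(treatment, control), 2))
```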

What is a big or small effect size?

The beauty of Cohen's d is that, as a normalized statistic, you can use it to directly compare group differences on any set of measures. That helps with one of the problems mentioned above. The other problem - understanding what is a big or a small effect size - needs some guidance from the man himself. Cohen's Rule of Thumb states that 0.2 can be considered a small effect size, 0.5 moderate, and 0.8 large. Again, referring to your imaginary test above, the gender difference appears to be in the low to moderate range according to Cohen's Rule of Thumb.
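If you want to bake the rule of thumb into code, a small helper like the hypothetical one below will do; the 'negligible' label for values under 0.2 is my own addition, and as the next paragraph argues, context should always override these labels.

```python
def interpret_d(d):
    """Label an effect size using Cohen's conventional thresholds:
    0.2 small, 0.5 moderate, 0.8 large (applied to the magnitude)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"  # below Cohen's 'small' threshold
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"

print(interpret_d(0.39))  # small - your test
print(interpret_d(0.29))  # small - your colleague's test
```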

Like all rules of thumb, it's nice to have a simple answer to a question, but it also involves a massive generalization. Different instruments have different naturally occurring effect sizes. For example, when you look at cognitive ability tests, racial differences are often larger than gender differences. Similarly, if an organization has a particular focus on an issue, like gender equality, what Cohen declared to be small could actually be too large for comfort. So don't go bandying Cohen's Rule of Thumb about too freely without careful thought about the context you are operating in.

Jacob Cohen was a legend in his field, who gave us simple approaches to important measurement challenges. Like everything in the field of statistics, a thoughtful approach needs to be taken when applying Cohen's d. But if you've never used it before, it's going to change your life!

I lead McKinsey's internal People Analytics and Measurement function. Originally I was a Pure Mathematician, then I became a Psychometrician. I am passionate about applying the rigor of both those disciplines to complex people questions. I'm also a coding geek and a massive fan of Japanese RPGs.

All opinions expressed are my own and not to be associated with my employer or any other organization I am associated with.
