4 reasons why your data is biased
Our ultimate mission at Fairgen is to make AI fair. Since AI learns from data, we believe the way to get there is to make data fair. To this end, we have built a platform that debiases datasets. But why does biased data exist in the first place, and why should we be concerned about it? After all, data is supposed to reflect reality. Well, not exactly.
1. Data mirrors human behaviour.
Most datasets today are historical records of human decisions. Humans discriminate, and those discriminatory patterns end up engraved in the data.
Example: women receiving lower salaries than men with equal skills. If we train AI models on this data, the models will discriminate too. This is a clear problem: you do not want to teach such a sexist pattern to an intelligent machine making decisions at scale.
So should one look for other sources of data? Should one stop using AI? The best option is simply to apply fairness-constrained data generation to those datasets. This ensures, for instance, that the data contains an equal percentage of men and women being accepted for a loan.
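To make the idea concrete, here is a minimal sketch of fairness-constrained generation. It is not Fairgen's actual method: `sample_row` is a hypothetical stand-in for a trained generative model of loan applicants, and the quotas simply force both genders to end up with the same acceptance rate.

```python
import random

def sample_row():
    # Placeholder generator: in practice this would be a trained generative model.
    gender = random.choice(["male", "female"])
    # Biased source distribution: men are accepted twice as often as women.
    accepted = random.random() < (0.6 if gender == "male" else 0.3)
    return {"gender": gender, "accepted": accepted}

def generate_fair_dataset(n_per_gender=1000, target_rate=0.45):
    """Rejection-sample synthetic rows so both genders end up with the same
    acceptance rate (`target_rate`), removing the gender gap at the source."""
    quota = {
        (g, a): int(n_per_gender * (target_rate if a else 1 - target_rate))
        for g in ("male", "female") for a in (True, False)
    }
    cells = {key: [] for key in quota}
    while any(len(cells[k]) < quota[k] for k in quota):
        row = sample_row()
        key = (row["gender"], row["accepted"])
        if len(cells[key]) < quota[key]:
            cells[key].append(row)
    return [row for rows in cells.values() for row in rows]

fair_data = generate_fair_dataset()
```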
2. Data reflects the past, not the future.
Any dataset feeding an ML model has been built through years of data collection. Over those years, the data distribution may have drifted to a new reality, so the AI model in production applies patterns of the past to input data from the present.
Example: an insurance AI model might price a health policy at X. Then COVID-19 hits, people get sick more often, and the right price becomes X+100. But the model was trained on pre-pandemic data, so it does not adjust to current conditions and keeps pricing the policy much lower than it should. This could bankrupt the company.
One solution is to use data augmentation to generate enough data to train your model only on data from this new period.
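A rough sketch of that idea is below, assuming a hypothetical claims table with a `date` column and the numeric features `age` and `claims_last_year`. It keeps only post-drift rows, then augments them by resampling with small Gaussian noise so there is enough data to retrain; a production pipeline would use a proper generative model rather than jittered copies.

```python
import numpy as np
import pandas as pd

def augment_recent(df: pd.DataFrame, cutoff="2020-03-01", n_target=50_000,
                   noise_scale=0.01, numeric_cols=("age", "claims_last_year")):
    # Keep only rows collected after the drift point (column names are illustrative).
    recent = df[df["date"] >= pd.Timestamp(cutoff)].copy()
    n_extra = max(n_target - len(recent), 0)
    # Bootstrap extra rows from the recent period.
    synthetic = recent.sample(n=n_extra, replace=True, random_state=0).copy()
    for col in numeric_cols:
        # Jitter numeric features so the duplicates are not exact copies.
        synthetic[col] += np.random.normal(
            0, noise_scale * synthetic[col].std(), size=len(synthetic)
        )
    return pd.concat([recent, synthetic], ignore_index=True)
```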
3. Data can create a self-fulfilling prophecy.
As the world modernizes itself with AI, the biggest danger is by far getting stuck in a bias loop.
Example: imagine a near future where AI models decide who gets a loan, trained on 1970 to 2020 data showing that 30% of women applicants are granted a loan versus 60% of men. What will happen? From 2020 to 2040, those AI models will reproduce the same sexist pattern. Then the models trained in 2040 on the last 20 years of data will learn from identically sexist data. We are stuck in a bias loop.
This can be changed at the source with fairness-constrained data generation. On a positive note, the loop cuts both ways: if we get stuck in a loop where different subgroups are treated equally, that is good news.
4. Data is unbalanced.
It often happens that a collected dataset fails to reflect reality because it contains too few data points from a particular subgroup.
Example: a bank's marketing has mainly attracted male customers, so the bank now has too few female data points to build a robust loan model for women.
A solution is data rebalancing, which increases the number of female data points until it matches the number of male data points.
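The simplest form of rebalancing is oversampling. The sketch below assumes a hypothetical loan table with a `gender` column and duplicates rows from the under-represented group until both groups are the same size; more sophisticated approaches would generate new synthetic rows instead of copies.

```python
import pandas as pd

def rebalance_by_gender(df: pd.DataFrame) -> pd.DataFrame:
    counts = df["gender"].value_counts()
    minority = counts.idxmin()
    # Oversample the under-represented group until both groups have equal counts.
    extra = df[df["gender"] == minority].sample(
        n=counts.max() - counts.min(), replace=True, random_state=0
    )
    return pd.concat([df, extra], ignore_index=True)
```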
Author: Nathan Cavaglione