Synthetic Data: Accelerating AI Development While Safeguarding Privacy

"There is no AI strategy without a data strategy" is a phrase heard at almost every AI conference today. At the same time, there is growing concern about feeding ever more data into AI models because of privacy, fairness, and bias issues. AI models are data hungry, but supplying all the data they need is rarely straightforward under differing privacy legislation.

Limitations of Traditional Data Anonymization Techniques

Businesses have long used traditional data anonymization techniques, such as storing only the last four digits of a credit card number or obfuscating names and other attributes, but these are often not enough.

  • Destructive Alteration: Traditional methods often delete or distort significant data, reducing its usability for analysis and machine learning. From an AI/ML perspective, the interdependent relationships between data elements are critical for accurate model development, and these relationships may be lost during masking.
  • Insufficient Privacy: These techniques may not adequately protect against re-identification, especially with complex and detailed data.
  • Re-identification Risk: Masked data can still be reverse-engineered, potentially exposing sensitive information.
  • Inadequate for Complex Data: Traditional masking struggles with high-dimensional or behavioral data, making effective anonymization difficult.
  • Compliance Limitations: Older masking techniques often fail to meet modern privacy regulations like GDPR and CCPA, which demand more robust protection.
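To make the destructive-alteration point concrete, here is a minimal sketch of classic masking (the record fields are hypothetical, not from any specific system): once the name is hashed and the card number truncated, the raw values are protected, but any downstream join or linkage on those fields is permanently destroyed.

```python
import hashlib

def mask_record(record: dict) -> dict:
    """Classic masking: hash the name and truncate the card number.

    The masked output hides the raw values, but joins on the full card
    number and any name-based linkage can no longer be reconstructed.
    """
    return {
        "name": hashlib.sha256(record["name"].encode()).hexdigest()[:12],
        "card": "****-****-****-" + record["card"][-4:],
        "amount": record["amount"],  # left in the clear for analytics
    }

masked = mask_record({"name": "Alice", "card": "4111111111111111", "amount": 42.5})
print(masked["card"])  # ****-****-****-1111
```

Note that the numeric `amount` survives untouched; masking protects identifiers at the cost of every relationship that flowed through them.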

Reverse Engineering from Anonymized Data

I worked on a connected-car hackathon a few years ago, and the customer provided us with only the cars' telemetry data: each car's speed and location every 60 seconds. We had no master data such as customer names or addresses. From that telemetry data alone, we were able to deduce:

  • The car's home address
  • The vehicle's main purpose (leisure driving or commuting)
  • How many nights per year the car is not parked at the home address
  • The driver's safety score
  • The driver's favorite hotspots

There was no personal information in the dataset, yet with different ML models we could deduce insights that were, quite frankly, creepy.
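As a hedged illustration of how easily such inferences fall out of "anonymous" telemetry (the data layout here is assumed, not the actual hackathon schema), a car's home location can be approximated as its most frequent overnight position:

```python
from collections import Counter

def infer_home(pings: list[tuple[int, float, float]]) -> tuple[float, float]:
    """Guess a car's home as its most frequent night-time position.

    pings: (hour_of_day, lat, lon) samples; coordinates are rounded to
    two decimals (~1 km) so nearby parking spots count as one place.
    """
    night = [(round(lat, 2), round(lon, 2))
             for hour, lat, lon in pings
             if hour >= 22 or hour < 6]          # parked overnight
    if not night:
        return (float("nan"), float("nan"))
    return Counter(night).most_common(1)[0][0]   # modal night location

pings = [(23, 52.5201, 13.4041), (2, 52.5203, 13.4042),
         (3, 52.5202, 13.4040), (14, 52.4000, 13.0500)]
print(infer_home(pings))  # (52.52, 13.4)
```

Nothing in the input is "personal information", yet the output is, for most drivers, their home address to within a city block.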

What if we change the patterns altogether?

From the previous example, why didn't we change the location attributes and completely anonymize them? We can do it, but then we risk losing all the relationships within the data.

I ran a social network analysis on telecom data a few years back. First, we had to build a calling circle: the number of distinct people each subscriber communicates with. So, if I call five distinct people in a month and send an SMS to three different people, my calling circle for that month has eight people. We had 14 million subscribers, who had 70 million people in their calling circles altogether, so on average each subscriber had five people in their calling circle. But as we know, real-world distributions are seldom that uniform. We found that the top one million most-connected people accounted for 44 million connections, an average of 44 per subscriber (possibly teenagers texting within their large friend groups), while the bottom 13 million people had only 26 million connections, an average of two each. A simple anonymization technique could easily destroy this distribution, resulting in an inaccurately trained model.
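The calling-circle computation above can be sketched as follows (the event layout is assumed; real CDRs carry far more fields). The point is that the per-subscriber distribution, not just the average, is what naive anonymization tends to flatten:

```python
from collections import defaultdict

def calling_circles(events: list[tuple[str, str]]) -> dict[str, int]:
    """events: (caller, callee) pairs from calls and SMS in one month.

    Returns each subscriber's calling-circle size: the number of
    DISTINCT people they contacted (repeat contacts count once).
    """
    contacts: dict[str, set[str]] = defaultdict(set)
    for caller, callee in events:
        contacts[caller].add(callee)
    return {sub: len(circle) for sub, circle in contacts.items()}

events = [("alice", "bob"), ("alice", "bob"),   # repeat call: counts once
          ("alice", "carol"), ("dave", "alice")]
print(calling_circles(events))  # {'alice': 2, 'dave': 1}
```

Run over real traffic, the histogram of these sizes is exactly the heavy-tailed distribution described above, and it is the feature a model needs to see intact.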

Enter Synthetic Data

Synthetic data refers to data that is artificially generated rather than obtained by direct measurement or collection from real-world sources. It is created using algorithms, simulations, or models that mimic the statistical properties and patterns of real data. Synthetic data is revolutionizing various fields by offering significant promise across privacy, fairness, and AI development. It accelerates data science projects, reduces costs, and supports data democratization when combined with secure research environments and federated learning techniques. The European Commission's Joint Research Centre concluded in its 2022 Synthetic Data report that synthetic data will become a critical enabler for AI in business and policy applications.
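A minimal sketch of the core idea, using only the standard library (real generators such as SDV fit far richer joint models): fit simple statistics on a real column, then sample synthetic values that mimic them without copying any original record. The income column is invented for illustration.

```python
import random
import statistics

def synthesize(real: list[float], n: int, seed: int = 0) -> list[float]:
    """Generate n synthetic values mimicking the real column's mean
    and standard deviation by sampling from a fitted Gaussian.
    No original value is reused, only the learned distribution."""
    mu = statistics.fmean(real)
    sigma = statistics.stdev(real)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real_incomes = [42_000.0, 55_000.0, 48_000.0, 61_000.0, 39_000.0]
fake_incomes = synthesize(real_incomes, n=1000)
# The synthetic sample tracks the real mean without exposing any real row.
print(round(statistics.fmean(fake_incomes)))
```

Production generators model correlations between columns as well, which is precisely what lets them preserve the relationships that masking destroys.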

However, synthetic data is not inherently private and can still leak information, necessitating careful implementation. While it provides valuable insights for early development, final models should be refined with real-world data to ensure accuracy. Outliers and rare data points remain challenging to anonymize, and privacy evaluations require assessing the data generation process itself. Despite its potential, especially in enhancing fairness and reducing bias, further research is needed to fully understand and harness synthetic data's capabilities.

Synthetic data is changing how businesses and organizations approach AI, data innovation, and privacy. It provides an anonymized, privacy-safe alternative to real-world data, allowing organizations to innovate while protecting sensitive customer information. This groundbreaking approach is transforming industries from healthcare to finance by enabling businesses to bypass strict data regulations without sacrificing utility.

One key reason synthetic data is gaining momentum is its ability to solve the "data privacy" challenge. In a world increasingly governed by privacy laws like GDPR and CCPA, companies face limitations when trying to innovate with customer data. Traditional anonymization methods, such as data masking or obfuscation, often fail to deliver both privacy protection and data utility. Synthetic data, however, allows companies to generate realistic datasets that maintain privacy while retaining data integrity, enabling advanced AI models to be trained without risking compliance issues.

A standout application of synthetic data is AI training. Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models. By providing high-quality, granular data without the privacy risks associated with real-world data, synthetic datasets enhance machine learning models, product development, and data analytics. The Synthetic Data Vault (SDV) is an open-source Python library for synthetic data generation.

Use Cases for Synthetic Data

  • AI Training: Synthetic data is widely used in machine learning to train sophisticated AI models with high-quality, granular data that retains the necessary details without privacy risks.
  • External Data Sharing: Synthetic data enables safe collaboration between organizations, partners, startups, and researchers by providing realistic datasets without exposing sensitive information.
  • Digital Product Development: During the development phase, product teams can use synthetic data to simulate real-world data, allowing them to test and refine products before they go live.
  • Data Analytics: Similar to AI development, synthetic data provides high-quality datasets for advanced analytics without the risk of violating privacy regulations.
  • Open Data Sharing: Public sector organizations can use synthetic data to democratize data access, helping startups, SMEs, and researchers innovate without needing real data.
  • Software Testing and Cloud Migration: Synthetic data is useful in testing software systems or migrating data to the cloud, allowing for realistic testing environments without compromising privacy.
  • Data Retention and Cross-Border Reporting: It helps organizations meet regulatory requirements around data retention and facilitates cross-border reporting without legal complications.
  • AI Fairness and Governance: Synthetic data supports responsible AI by enhancing machine learning systems' fairness, explainability, and transparency.
  • Data Augmentation, Simulation, and Diversification: It can augment datasets, simulate various scenarios, and introduce more diversity into data for AI and analytics.

Conclusion

When used correctly, synthetic data can speed up AI development as there is less data exposure risk with synthetically generated data. The data still has the inherent attributes of real-world data but doesn't contain any information that can be traced back to the original customers. This means that data can be democratized both internally and externally for different kinds of testing and running different simulation models. Keep in mind that since the synthetic data will learn from the real data, any inherent biases will be propagated in the synthetic data as well.

It goes without saying that there is no silver bullet (there never is), so be very mindful of when you are providing data to your AI models, whether real or synthetic.


If you like, please subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn.

