Synthetic Data: Accelerating AI Development While Safeguarding Privacy

"There is no AI strategy without a data strategy" is a phrase heard at almost every AI conference today. At the same time, there is growing concern about feeding ever more data into AI models because of privacy, fairness, and bias issues. AI models are data hungry, but supplying all the data they need is rarely straightforward under differing privacy legislation.

Limitations of Traditional Data Anonymization Techniques

Businesses have long used traditional data anonymization techniques, such as storing only the last four digits of a credit card number or obfuscating names and other attributes, but these are often not enough.

  • Destructive Alteration: Traditional methods often delete or distort significant data, reducing its usability for analysis and machine learning. From an AI/ML perspective, the interdependent relationships between data elements are critical for accurate model development, and these relationships may be lost during masking.
  • Insufficient Privacy: These techniques may not adequately protect against re-identification, especially with complex and detailed data.
  • Re-identification Risk: Masked data can still be reverse-engineered, potentially exposing sensitive information.
  • Inadequate for Complex Data: Traditional masking struggles with high-dimensional or behavioral data, making effective anonymization difficult.
  • Compliance Limitations: Older masking techniques often fail to meet modern privacy regulations like GDPR and CCPA, which demand more robust protection.
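To make the destructive-alteration point concrete, here is a minimal sketch of classic masking (the record fields are hypothetical, not from any specific system): once the name is hashed and the card number truncated, the raw values are protected, but any downstream join or linkage on those fields is permanently destroyed.

```python
import hashlib

def mask_record(record: dict) -> dict:
    """Classic masking: hash the name and truncate the card number.

    The masked output hides the raw values, but joins on the full card
    number and any name-based linkage can no longer be reconstructed.
    """
    return {
        "name": hashlib.sha256(record["name"].encode()).hexdigest()[:12],
        "card": "****-****-****-" + record["card"][-4:],
        "amount": record["amount"],  # left in the clear for analytics
    }

masked = mask_record({"name": "Alice", "card": "4111111111111111", "amount": 42.5})
print(masked["card"])  # ****-****-****-1111
```

Note that the numeric `amount` survives untouched; masking protects identifiers at the cost of every relationship that flowed through them.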

Reverse Engineering from Anonymized Data

I worked on a connected-car hackathon a few years ago, and the customer provided us with only the cars' telemetry data: each car's speed and location every 60 seconds. We had no master data such as customer names or addresses. From that telemetry data alone, we were able to deduce:

  • The car's home address
  • The vehicle's main purpose (leisure driving or commuting)
  • How many nights per year the car is not parked at the home address
  • The driver's safety score
  • The driver's favorite hotspots

There was no personal information in the dataset, yet with different ML models we could deduce insights that were, quite frankly, creepy.
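As a hedged illustration of how easily such inferences fall out of "anonymous" telemetry (the data layout here is assumed, not the actual hackathon schema), a car's home location can be approximated as its most frequent overnight position:

```python
from collections import Counter

def infer_home(pings: list[tuple[int, float, float]]) -> tuple[float, float]:
    """Guess a car's home as its most frequent night-time position.

    pings: (hour_of_day, lat, lon) samples; coordinates are rounded to
    two decimals (~1 km) so nearby parking spots count as one place.
    """
    night = [(round(lat, 2), round(lon, 2))
             for hour, lat, lon in pings
             if hour >= 22 or hour < 6]          # parked overnight
    if not night:
        return (float("nan"), float("nan"))
    return Counter(night).most_common(1)[0][0]   # modal night location

pings = [(23, 52.5201, 13.4041), (2, 52.5203, 13.4042),
         (3, 52.5202, 13.4040), (14, 52.4000, 13.0500)]
print(infer_home(pings))  # (52.52, 13.4)
```

Nothing in the input is "personal information", yet the output is, for most drivers, their home address to within a city block.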

What if we change the patterns altogether?

From the previous example, why didn't we change the location attributes and completely anonymize them? We can do it, but then we risk losing all the relationships within the data.

I ran a social network analysis on telecom data a few years back. First, we had to build a calling circle: the number of distinct people each subscriber communicates with. So, if I call five distinct people in a month and send an SMS to three different people, my calling circle for that month has eight people. We had 14 million subscribers, who had 70 million people in their calling circles altogether, so on average each subscriber had five people in their calling circle. But as we know, real-world distributions are seldom that uniform. We found that the top one million most-connected people accounted for 44 million connections, an average of 44 per subscriber (possibly teenagers texting within their large friend groups), while the bottom 13 million people had only 26 million connections, an average of two each. A simple anonymization technique could easily destroy this distribution, resulting in an inaccurately trained model.
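The calling-circle computation above can be sketched as follows (the event layout is assumed; real CDRs carry far more fields). The point is that the per-subscriber distribution, not just the average, is what naive anonymization tends to flatten:

```python
from collections import defaultdict

def calling_circles(events: list[tuple[str, str]]) -> dict[str, int]:
    """events: (caller, callee) pairs from calls and SMS in one month.

    Returns each subscriber's calling-circle size: the number of
    DISTINCT people they contacted (repeat contacts count once).
    """
    contacts: dict[str, set[str]] = defaultdict(set)
    for caller, callee in events:
        contacts[caller].add(callee)
    return {sub: len(circle) for sub, circle in contacts.items()}

events = [("alice", "bob"), ("alice", "bob"),   # repeat call: counts once
          ("alice", "carol"), ("dave", "alice")]
print(calling_circles(events))  # {'alice': 2, 'dave': 1}
```

Run over real traffic, the histogram of these sizes is exactly the heavy-tailed distribution described above, and it is the feature a model needs to see intact.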

Enter Synthetic Data

Synthetic data refers to data that is artificially generated rather than obtained by direct measurement or collection from real-world sources. It is created using algorithms, simulations, or models that mimic the statistical properties and patterns of real data. Synthetic data is revolutionizing various fields by offering significant promise across privacy, fairness, and AI development. It accelerates data science projects, reduces costs, and supports data democratization when combined with secure research environments and federated learning techniques. The European Commission's Joint Research Centre concluded in its 2022 Synthetic Data report that synthetic data will become a critical enabler for AI in business and policy applications.
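A minimal sketch of the core idea, using only the standard library (real generators such as SDV fit far richer joint models): fit simple statistics on a real column, then sample synthetic values that mimic them without copying any original record. The income column is invented for illustration.

```python
import random
import statistics

def synthesize(real: list[float], n: int, seed: int = 0) -> list[float]:
    """Generate n synthetic values mimicking the real column's mean
    and standard deviation by sampling from a fitted Gaussian.
    No original value is reused, only the learned distribution."""
    mu = statistics.fmean(real)
    sigma = statistics.stdev(real)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

real_incomes = [42_000.0, 55_000.0, 48_000.0, 61_000.0, 39_000.0]
fake_incomes = synthesize(real_incomes, n=1000)
# The synthetic sample tracks the real mean without exposing any real row.
print(round(statistics.fmean(fake_incomes)))
```

Production generators model correlations between columns as well, which is precisely what lets them preserve the relationships that masking destroys.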

However, synthetic data is not inherently private and can still leak information, necessitating careful implementation. While it provides valuable insights for early development, final models should be refined with real-world data to ensure accuracy. Outliers and rare data points remain challenging to anonymize, and privacy evaluations require assessing the data generation process itself. Despite its potential, especially in enhancing fairness and reducing bias, further research is needed to fully understand and harness synthetic data's capabilities.

Synthetic data is changing how businesses and organizations approach AI, data innovation, and privacy. It provides an anonymized, privacy-safe alternative to real-world data, allowing organizations to innovate while protecting sensitive customer information. This groundbreaking approach is transforming industries from healthcare to finance by enabling businesses to bypass strict data regulations without sacrificing utility.

One key reason synthetic data is gaining momentum is its ability to solve the "data privacy" challenge. In a world increasingly governed by privacy laws like GDPR and CCPA, companies face limitations when trying to innovate with customer data. Traditional anonymization methods, such as data masking or obfuscation, often fail to deliver both privacy protection and data utility. Synthetic data, however, allows companies to generate realistic datasets that maintain privacy while retaining data integrity, enabling advanced AI models to be trained without risking compliance issues.

A standout application of synthetic data is AI training. Gartner estimates that by 2030, synthetic data will completely overshadow real data in AI models. By providing high-quality, granular data without the privacy risks associated with real-world data, synthetic datasets enhance machine learning models, product development, and data analytics. The Synthetic Data Vault (SDV) is an open-source Python library for synthetic data generation.

Use Cases for Synthetic Data

  • AI Training: Synthetic data is widely used in machine learning to train sophisticated AI models with high-quality, granular data that retains the necessary details without privacy risks.
  • External Data Sharing: Synthetic data enables safe collaboration between organizations, partners, startups, and researchers by providing realistic datasets without exposing sensitive information.
  • Digital Product Development: During the development phase, product teams can use synthetic data to simulate real-world data, allowing them to test and refine products before they go live.
  • Data Analytics: Similar to AI development, synthetic data provides high-quality datasets for advanced analytics without the risk of violating privacy regulations.
  • Open Data Sharing: Public sector organizations can use synthetic data to democratize data access, helping startups, SMEs, and researchers innovate without needing real data.
  • Software Testing and Cloud Migration: Synthetic data is useful in testing software systems or migrating data to the cloud, allowing for realistic testing environments without compromising privacy.
  • Data Retention and Cross-Border Reporting: It helps organizations meet regulatory requirements around data retention and facilitates cross-border reporting without legal complications.
  • AI Fairness and Governance: Synthetic data supports responsible AI by enhancing machine learning systems' fairness, explainability, and transparency.
  • Data Augmentation, Simulation, and Diversification: It can augment datasets, simulate various scenarios, and introduce more diversity into data for AI and analytics.

Conclusion

When used correctly, synthetic data can speed up AI development as there is less data exposure risk with synthetically generated data. The data still has the inherent attributes of real-world data but doesn't contain any information that can be traced back to the original customers. This means that data can be democratized both internally and externally for different kinds of testing and running different simulation models. Keep in mind that since the synthetic data will learn from the real data, any inherent biases will be propagated in the synthetic data as well.

It goes without saying that there is no silver bullet (there never is), so be very mindful of when you are providing data to your AI models, whether real or synthetic.


If you like, please subscribe to the FAQ on Data newsletter and/or follow Fawad Qureshi on LinkedIn.

