The Advantages of Synthetic Data
Image generated using OpenAI DALL-E


Welcome to the Reality Gap newsletter, which focuses on synthetic data, generative AI, and data-centric computer vision. If you'd like to be notified about the next edition, just click "Subscribe" at the top of this page.


Synthetic data generation technology is a relatively recent addition to the toolkit of machine learning engineers. Still, it has already evolved from its initial supporting role of augmenting real-world data into a technology enabling a new wave of AI innovation.

Gartner famously claimed that “by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated”. Gartner has also placed synthetic data on its “Impact Radar for Edge AI”, ranking it among the top three high-profile technologies.

Let’s review once again what led to the popularization of synthetic data and where it can hold an advantage over the conventional approach of using real-world data.

1. Cost and Speed of Data Acquisition

Engineering products enabled by machine learning algorithms require fast experimentation and prototyping of product and algorithm hypotheses.

Collecting real-world data is usually time-consuming, expensive, and dependent on the product’s sensor stack.

Synthetic data generation requires some upfront production work, but past this initial stage it is relatively fast and cheap to generate data in high volumes.

Accounting for synthetic data in the early stages of a project allows for rapid experimentation and subsequent bootstrapping of the product and its foundational algorithms. Once an early prototype is built and the end-to-end data architecture is established, real-world data can then be collected by letting early adopters use the solution.

Demonstration of a synthetically generated scene for in-cabin driver monitoring by SKY ENGINE AI.

2. Performance Over a Large Distribution of Scenarios and Rare Events

There is usually a high cost associated with collecting a diverse real-world dataset for training predictive models. It is even more costly to collect a dataset representative of rare events and edge-case scenarios.

In fact, many consider the progress of autonomous systems to be hindered by such rare events. Autonomous car manufacturers have been promising robust driving for years, but we have yet to see it function across the full spectrum of weather conditions and road scenarios.

Andrej Karpathy, the former director of Autopilot at Tesla, famously said: “Because [the autonomous vehicle] is a product in the hands of customers, you are forced to go through the long tail. You cannot do just 95% and call it a day. The long tail brings all kinds of interesting challenges.”

With synthetic data, engineers can now be proactive about building diverse datasets by creating new scenarios and managing the distribution of scene parameters.

Synthetic data holds a strong promise of, if not fully solving, then dramatically contributing to solving the “long tail” problem by allowing rare scenarios to be simulated.
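To make “managing the distribution of scene parameters” a bit more concrete, here is a minimal, hypothetical sketch of what a scene-parameter sampler could look like. Every parameter name, range, and probability below is made up for illustration; real generation tools expose their own APIs for this.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    """One fully specified synthetic scene (all fields here are illustrative)."""
    weather: str
    sun_elevation_deg: float
    pedestrian_count: int
    camera_height_m: float

def sample_scene(rare_event_boost: float = 0.3) -> SceneParams:
    """Sample one scene, deliberately over-representing rare conditions.

    `rare_event_boost` is the probability of forcing a rare weather state,
    regardless of how often that state occurs in the real world.
    """
    if random.random() < rare_event_boost:
        weather = random.choice(["heavy_fog", "snowstorm", "night_rain"])  # rare
    else:
        weather = random.choice(["clear", "overcast", "light_rain"])       # common
    return SceneParams(
        weather=weather,
        sun_elevation_deg=random.uniform(-10.0, 70.0),  # include low-sun glare cases
        pedestrian_count=random.randint(0, 40),         # include dense crowds
        camera_height_m=random.uniform(1.2, 1.8),
    )

# A dataset spec is then just a list of sampled scenes handed to the renderer.
dataset_spec = [sample_scene() for _ in range(10_000)]
```

The key point is that rare conditions appear in the dataset at whatever rate the engineer decides, rather than at the rate the real world happens to provide them.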

Tesla heavily uses synthetic data to address rare event challenges.

Beyond helping with rare events, diverse datasets also reduce bias in predictive models. For example, if a dataset contains data from only one population group, a model built with it will only perform reliably on that group. With synthetic data, it is possible to generate more diverse datasets so that the trained model is applicable to a greater number of population groups.

3. Quality of Ground Truth Annotation

The majority of state-of-the-art computer vision algorithms are built in a supervised manner: the algorithm learns to recognize patterns by observing data and associating it with ground truth labels. For the model to learn what the correct prediction looks like, each data point in the dataset needs to be annotated.

The current conventional approach is to use manually annotated data: review each example and label it by hand. At the scale of large machine learning datasets, with hundreds of thousands or even millions of data points, this approach is costly, time-consuming, and error-prone.

Some complex computer vision scenarios, like dense scenes and heavy occlusions, as well as sensor modalities like depth and surface normals, are practically impossible to annotate manually.

Manual data annotation can also introduce bias, as annotators might make errors in judgment while labeling.

Synthetic data allows for automatic, high-quality, pixel-perfect ground truth annotations at virtually no cost and without any manual labeling.

"Synthetic Data with Digital Humans" by Erroll Wood and Tadas Baltrusaitis from Microsoft’s Mixed Reality & AI Lab

Because synthetic scenes are simulated, ground truth information is available even for heavily occluded and fully invisible objects in complex scenarios.
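To make the “cost-free ground truth” point concrete, here is a minimal sketch of how labels typically fall out of a renderer. The only assumption (mine, not tied to any specific engine) is that the renderer can output an instance-ID image alongside the RGB frame; segmentation masks and bounding boxes are then pure array operations.

```python
import numpy as np

def masks_and_boxes_from_id_buffer(instance_ids: np.ndarray) -> dict:
    """Derive per-object segmentation masks and bounding boxes from a rendered
    instance-ID buffer (an H x W array of integer object IDs, 0 = background).

    Because every pixel is labeled by construction, the annotation is
    pixel-perfect and requires no human input.
    """
    annotations = {}
    for obj_id in np.unique(instance_ids):
        if obj_id == 0:  # skip background
            continue
        mask = instance_ids == obj_id
        ys, xs = np.nonzero(mask)
        annotations[int(obj_id)] = {
            "mask": mask,  # boolean segmentation mask
            "bbox": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
        }
    return annotations

# Toy example: a 4x4 "render" containing two objects (IDs 1 and 2).
toy_ids = np.array([[0, 1, 1, 0],
                    [0, 1, 1, 0],
                    [2, 2, 0, 0],
                    [2, 2, 0, 0]])
print(masks_and_boxes_from_id_buffer(toy_ids)[1]["bbox"])  # (1, 0, 2, 1)
```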

4. Privacy Concerns and Cost of Handling Sensitive Data

Collecting real-world datasets is often associated with major privacy risks when storing, sharing, and annotating Personally Identifiable Information (PII) or other types of sensitive data.

And even when this sensitive data is handled properly while training predictive models, there are still risks of "inference attacks", where models can "leak" and compromise information about their training data.

In such scenarios, using synthetic data can provide a viable solution to generate datasets without having direct access to sensitive information while preserving the statistical features that would be required to train and evaluate a model.
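As a toy illustration of that workflow (and not a production-grade privacy mechanism), one simple approach for tabular data is to fit a distribution to the real records and then sample entirely new rows from it, so no original record ever needs to be stored, shared, or annotated downstream. Dedicated synthetic-data tools and differentially private methods are far more careful than this sketch.

```python
import numpy as np

def fit_and_sample(real_numeric: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a multivariate Gaussian to the numeric columns of a sensitive dataset
    and sample brand-new synthetic rows from it.

    The synthetic rows preserve means and correlations (useful for model
    prototyping) while never reproducing an original record verbatim.
    NOTE: a toy sketch only; it does not by itself guarantee formal privacy.
    """
    mean = real_numeric.mean(axis=0)
    cov = np.cov(real_numeric, rowvar=False)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Example with fake "sensitive" data: age, income, clinic visits per year.
real = np.column_stack([
    np.random.normal(45, 12, 1000),          # age
    np.random.normal(60_000, 15_000, 1000),  # income
    np.random.poisson(3, 1000),              # visits
])
synthetic = fit_and_sample(real, n_samples=1000)

print(np.corrcoef(real, rowvar=False).round(2))
print(np.corrcoef(synthetic, rowvar=False).round(2))  # roughly matches the real data
```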

There are also substantial cost savings associated with storing and handling synthetic data compared to real, sensitive data. According to Gartner, synthetic data will enable organizations to avoid 70% of privacy violation sanctions.
"Synthetic Data with Digital Humans" by Erroll Wood and Tadas Baltrusaitis from Microsoft’s Mixed Reality & AI Lab

5. Robust Object Occlusion and Permanence Training

Scenarios where dense object scenes are frequent and object occlusion is common are particularly challenging to solve with models trained on real data. Not only is annotation of dense scenes exponentially more costly, it is also heavily biased, as annotators cannot see the occluded parts.

Having correct annotations for occluded scenes is crucial for object tracking and path prediction algorithms (like pedestrian path prediction in an autonomous driving context) and is critical to their robust operation.

Because synthetic data is rendered from simulated scenes, the rendering engine is fully 3D-aware. This allows pixel-perfect annotation of occluded objects and, for the first time, enables the engineering of robust object-tracking algorithms for occluded contexts.
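Because the renderer knows the full 3D scene, it can emit both an amodal mask (the object's complete silhouette, as if nothing were in front of it) and the visible mask for every object. Here is a minimal sketch, with illustrative names, of how an occlusion-aware label could then be assembled:

```python
import numpy as np

def occlusion_annotation(amodal_mask: np.ndarray, visible_mask: np.ndarray) -> dict:
    """Combine the amodal mask (full object silhouette, rendered as if nothing
    were in front of it) with the visible mask (what the camera actually sees)
    into an occlusion-aware label.

    Both inputs are boolean H x W arrays produced by the rendering engine.
    """
    amodal_area = int(amodal_mask.sum())
    visible_area = int(visible_mask.sum())
    occluded_fraction = 1.0 - visible_area / max(amodal_area, 1)
    return {
        "visible_mask": visible_mask,
        "amodal_mask": amodal_mask,              # lets trackers reason about hidden extent
        "occluded_fraction": occluded_fraction,  # 0.0 = fully visible, 1.0 = fully hidden
        "fully_occluded": visible_area == 0,     # object still exists, just not visible
    }

# Toy example: a 2x4 object whose right half is hidden behind another object.
amodal = np.array([[1, 1, 1, 1],
                   [1, 1, 1, 1]], dtype=bool)
visible = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0]], dtype=bool)
print(occlusion_annotation(amodal, visible)["occluded_fraction"])  # 0.5
```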

An example of a dense video sequence with heavily occluded pedestrians. Video by Nuro.

6. Seamless Experimentation With Sensor Modalities and Rapid Product Prototyping

The performance of computer vision models depends heavily on the sensor stack used to collect the training data and on how well it matches the final product's sensor stack. For example, if the focal length of the camera lens used to collect training data differs from that of the camera used in production, prediction accuracy will degrade. The same goes for all other intrinsic and extrinsic (i.e., location and rotation) sensor parameters.
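To see why a focal-length mismatch matters, here is a minimal pinhole-camera sketch: the same 3D point lands on different pixels when only the focal length changes, so a model trained with one camera systematically sees different object sizes and positions than the production camera does. The numbers are purely illustrative.

```python
def project_point(point_3d, focal_px, cx=960.0, cy=540.0):
    """Project a 3D point (camera coordinates, meters) onto the image plane
    using a simple pinhole model with focal length `focal_px` (in pixels)
    and principal point (cx, cy).
    """
    x, y, z = point_3d
    u = focal_px * x / z + cx
    v = focal_px * y / z + cy
    return u, v

# The same pedestrian, 10 m ahead and 2 m to the side of the camera...
point = (2.0, 0.0, 10.0)

# ...lands on noticeably different pixels when only the focal length changes.
print(project_point(point, focal_px=1000.0))  # (1160.0, 540.0) with the "training" camera
print(project_point(point, focal_px=1400.0))  # (1240.0, 540.0) with the "production" camera
```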

While prototyping a product, it is common to experiment with the sensor stack, so the risk of having to re-collect data every time the sensors are modified should be accounted for.

Synthetic data provides huge value here. It can accommodate a wide variety of interchangeable sensor modalities, like depth and point clouds, and it can generate entirely new sensing modalities, like optical flow and surface normals.
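As one example, surface normals can be approximated directly from a rendered depth map (most engines can also output exact normals natively). A rough, illustrative sketch using finite differences:

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Approximate per-pixel surface normals from a depth map (H x W, meters)
    using finite differences. Rendering engines usually output exact normals
    directly; this sketch just shows that the modality is cheap to obtain.
    """
    dz_dv, dz_du = np.gradient(depth)  # depth change along image rows / columns
    normals = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals  # H x W x 3 unit vectors

# Toy example: a plane tilted left-to-right yields a constant normal everywhere.
u = np.tile(np.arange(8, dtype=float), (8, 1))
depth = 5.0 + 0.1 * u                            # depth increases left to right
print(normals_from_depth(depth)[4, 4].round(3))  # roughly [-0.1, 0.0, 0.995]
```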

Synthetic data can also support product development and help evaluate perception algorithm accuracy for different sensor modalities, intrinsic parameters, and placements.

NVIDIA Drive Sim platform demonstration showcasing multiple sensor simulation

Synthetic data generation technology has come a long way in recent years and shows great promise for the future of AI training and development. It has evolved from a secondary, supporting role (in conjunction with real data) into a primary driver of AI innovation.

It is an exciting area of research and development and I think we are just beginning to scratch the surface of what's possible.

Although I believe the future of synthetic data is bright, I don't claim it is a silver-bullet solution for every problem. In upcoming editions of this newsletter, we will reflect on the use cases where applying synthetic data is still considered impractical or challenging.

And now, let's dive into the news headlines of recent days!


Shutterstock and Getty Images Appear to be Removing User-Uploaded AI-Generated Images

Just weeks after announcing a collaboration with OpenAI and a future integration of DALL-E, Shutterstock appears to be removing AI-generated images uploaded by users, stating that "[artwork that] appears to have been created with AI-generative technology is not acceptable [on the platform]". Getty Images has likewise announced that it would ban AI-generated images. This move raises a lot of questions about what exactly counts as an AI-generated image, since many image editing platforms like Adobe Photoshop, Topaz, and Luminar now offer some sort of AI-aided functionality.

NVIDIA Announces Magic3D Text-to-3D Content Creation Tool

NVIDIA's Magic3D is a new text-to-3D content creation tool that creates 3D mesh models with unprecedented quality. Together with image conditioning techniques as well as a prompt-based editing approach, it provides users with new ways to control 3D synthesis, opening up new avenues to various creative applications.

Details on NVIDIA Omniverse Kit 104

We have already reported on the recent NVIDIA Omniverse release. NVIDIA's Damien Fagnou, VP of Software, has offered a detailed look at the new functionality that allows developers to more easily create, package, and publish metaverse applications.

The Illustrated Stable Diffusion

Jay Alammar has written a wonderfully illustrated explanation of Stable Diffusion. This is probably the best explanation I've come across so far.

Parallel Domain Raises $30 Million Series B

Last week brought the heartwarming news that synthetic data generation vendor Parallel Domain has raised a Series B funding round. Congrats to Kevin McNamara, James Grieve, and the whole team!

And that's a wrap!


See you next week!

Andrey
