The Advantages of Synthetic Data
Image generated using OpenAI DALL-E


Welcome to the Reality Gap newsletter, which focuses on synthetic data, generative AI, and data-centric computer vision. If you'd like to be notified about the next edition, just click "Subscribe" at the top of this page.


Synthetic data generation technology is a relatively recent addition to the toolkit of machine learning engineers. Still, it has already evolved from its initial supporting role of augmenting real-world data into a technology enabling a new wave of AI innovation.

Gartner famously claimed that “by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated”. Gartner has also placed synthetic data on its “Impact Radar for Edge AI”, ranking it among the top three high-profile technologies.

Let’s review once again what led to the popularization of synthetic data and where it can hold an advantage over the conventional approach of using real-world data.

1. Cost and Speed of Data Acquisition

Engineering products enabled by machine learning algorithms require fast experimentation and prototyping of product and algorithm hypotheses.

Collecting real-world data is usually time-consuming, expensive, and dependent on the product’s sensor stack.

Synthetic data generation requires some upfront production work, but past this initial stage it is relatively fast and cheap to generate data in high volumes.

Accounting for synthetic data in the early stages of a project allows for rapid experimentation and subsequent bootstrapping of the product and its foundational algorithms. Once an early prototype is built and the end-to-end data architecture is established, real-world data can then be collected by letting early adopters use the solution.

Demonstration of a synthetically generated scene for in-cabin driver monitoring by SKY ENGINE AI.

2. Performance Over a Large Distribution of Scenarios and Rare Events

There is usually a high cost associated with collecting a diverse real-world dataset for training predictive models. It is even more costly to collect a dataset representative of rare events and edge-case scenarios.

In fact, many consider the progress of autonomous systems to be hindered by such rare events. Autonomous car manufacturers have been promising robust driving for years, but we have yet to see it function across the full spectrum of weather conditions and road scenarios.

Andrej Karpathy, the former director of Autopilot at Tesla, famously said: “Because [the autonomous vehicle] is a product in the hands of customers, you are forced to go through the long tail. You cannot do just 95% and call it a day. The long tail brings all kinds of interesting challenges.”

With synthetic data, engineers can now be proactive about building diverse datasets by creating new scenarios and managing the distribution of scene parameters.

Synthetic data holds a strong promise of, if not fully solving, then dramatically contributing to solving the “long tail” problem by allowing rare scenarios to be simulated.
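To make “managing the distribution of scene parameters” a bit more concrete, here is a minimal, hypothetical sketch of what a scene-parameter sampler could look like. Every parameter name, range, and probability below is made up for illustration; real generation tools expose their own APIs for this.

```python
import random
from dataclasses import dataclass

@dataclass
class SceneParams:
    """One fully specified synthetic scene (all fields here are illustrative)."""
    weather: str
    sun_elevation_deg: float
    pedestrian_count: int
    camera_height_m: float

def sample_scene(rare_event_boost: float = 0.3) -> SceneParams:
    """Sample one scene, deliberately over-representing rare conditions.

    `rare_event_boost` is the probability of forcing a rare weather state,
    regardless of how often that state occurs in the real world.
    """
    if random.random() < rare_event_boost:
        weather = random.choice(["heavy_fog", "snowstorm", "night_rain"])  # rare
    else:
        weather = random.choice(["clear", "overcast", "light_rain"])       # common
    return SceneParams(
        weather=weather,
        sun_elevation_deg=random.uniform(-10.0, 70.0),  # include low-sun glare cases
        pedestrian_count=random.randint(0, 40),         # include dense crowds
        camera_height_m=random.uniform(1.2, 1.8),
    )

# A dataset spec is then just a list of sampled scenes handed to the renderer.
dataset_spec = [sample_scene() for _ in range(10_000)]
```

The key point is that rare conditions appear in the dataset at whatever rate the engineer decides, rather than at the rate the real world happens to provide them.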

Tesla heavily uses synthetic data to address rare event challenges.

Beyond helping with rare events, diverse datasets also reduce bias in predictive models. For example, if a dataset contains data from only one population group, a model built with it will only perform reliably on that group. With synthetic data, it is possible to generate more diverse datasets so that the trained model is applicable to a greater number of population groups.

3. Quality of Ground Truth Annotation

The majority of state-of-the-art computer vision algorithms are built in a supervised manner: the algorithm learns to recognize patterns by observing data and associating it with ground truth labels. For the model to learn what the correct prediction looks like, each data point in the dataset needs to be annotated.

The current conventional approach is to use manually annotated data: review each example and label it by hand. At the scale of large machine learning datasets, with hundreds of thousands or even millions of data points, this approach is costly, time-consuming, and error-prone.

Some complex computer vision scenarios, like dense scenes and heavy occlusions, as well as sensor modalities like depth and surface normals, are practically impossible to annotate manually.

Manual data annotation can also introduce bias, as annotators might make errors in judgment while labeling.

Synthetic data allows for automatic, high-quality, pixel-perfect ground truth annotations at virtually no cost and without any manual labeling.

"Synthetic Data with Digital Humans" by Erroll Wood and Tadas Baltrusaitis from Microsoft’s Mixed Reality & AI Lab

Because synthetic scenes are simulated, ground truth information is available even for heavily occluded and fully invisible objects in complex scenarios.
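To make the “cost-free ground truth” point concrete, here is a minimal sketch of how labels typically fall out of a renderer. The only assumption (mine, not tied to any specific engine) is that the renderer can output an instance-ID image alongside the RGB frame; segmentation masks and bounding boxes are then pure array operations.

```python
import numpy as np

def masks_and_boxes_from_id_buffer(instance_ids: np.ndarray) -> dict:
    """Derive per-object segmentation masks and bounding boxes from a rendered
    instance-ID buffer (an H x W array of integer object IDs, 0 = background).

    Because every pixel is labeled by construction, the annotation is
    pixel-perfect and requires no human input.
    """
    annotations = {}
    for obj_id in np.unique(instance_ids):
        if obj_id == 0:  # skip background
            continue
        mask = instance_ids == obj_id
        ys, xs = np.nonzero(mask)
        annotations[int(obj_id)] = {
            "mask": mask,  # boolean segmentation mask
            "bbox": (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())),
        }
    return annotations

# Toy example: a 4x4 "render" containing two objects (IDs 1 and 2).
toy_ids = np.array([[0, 1, 1, 0],
                    [0, 1, 1, 0],
                    [2, 2, 0, 0],
                    [2, 2, 0, 0]])
print(masks_and_boxes_from_id_buffer(toy_ids)[1]["bbox"])  # (1, 0, 2, 1)
```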

4. Privacy Concerns and Cost of Handling Sensitive Data

Collecting real-world datasets is often associated with major privacy risks when storing, sharing, and annotating Personally Identifiable Information (PII) or other types of sensitive data.

And even when this sensitive data is handled properly while training predictive models, there are still risks of "inference attacks", where models can "leak" and compromise information about their training data.

In such scenarios, using synthetic data can provide a viable solution to generate datasets without having direct access to sensitive information while preserving the statistical features that would be required to train and evaluate a model.
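As a toy illustration of that workflow (and not a production-grade privacy mechanism), one simple approach for tabular data is to fit a distribution to the real records and then sample entirely new rows from it, so no original record ever needs to be stored, shared, or annotated downstream. Dedicated synthetic-data tools and differentially private methods are far more careful than this sketch.

```python
import numpy as np

def fit_and_sample(real_numeric: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a multivariate Gaussian to the numeric columns of a sensitive dataset
    and sample brand-new synthetic rows from it.

    The synthetic rows preserve means and correlations (useful for model
    prototyping) while never reproducing an original record verbatim.
    NOTE: a toy sketch only; it does not by itself guarantee formal privacy.
    """
    mean = real_numeric.mean(axis=0)
    cov = np.cov(real_numeric, rowvar=False)
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mean, cov, size=n_samples)

# Example with fake "sensitive" data: age, income, clinic visits per year.
real = np.column_stack([
    np.random.normal(45, 12, 1000),          # age
    np.random.normal(60_000, 15_000, 1000),  # income
    np.random.poisson(3, 1000),              # visits
])
synthetic = fit_and_sample(real, n_samples=1000)

print(np.corrcoef(real, rowvar=False).round(2))
print(np.corrcoef(synthetic, rowvar=False).round(2))  # roughly matches the real data
```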

There are also substantial cost savings associated with storing and handling synthetic data compared to real, sensitive data. According to Gartner, synthetic data will enable organizations to avoid 70% of privacy violation sanctions.
"Synthetic Data with Digital Humans" by Erroll Wood and Tadas Baltrusaitis from Microsoft’s Mixed Reality & AI Lab

5. Robust Object Occlusion and Permanence Training

Scenarios where dense object scenes are frequent and object occlusion is common are particularly challenging to solve with models trained on real data. Not only is annotation of dense scenes exponentially more costly, it is also heavily biased, as annotators cannot see the occluded parts.

Having correct annotations for occluded scenes is crucial for object tracking and path prediction algorithms (like pedestrian path prediction in an autonomous driving context) and is critical to their robust operation.

Because synthetic data is rendered from simulated scenes, the rendering engine is fully 3D-aware. This allows pixel-perfect annotation of occluded objects and, for the first time, enables the engineering of robust object-tracking algorithms for occluded contexts.
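Because the renderer knows the full 3D scene, it can emit both an amodal mask (the object's complete silhouette, as if nothing were in front of it) and the visible mask for every object. Here is a minimal sketch, with illustrative names, of how an occlusion-aware label could then be assembled:

```python
import numpy as np

def occlusion_annotation(amodal_mask: np.ndarray, visible_mask: np.ndarray) -> dict:
    """Combine the amodal mask (full object silhouette, rendered as if nothing
    were in front of it) with the visible mask (what the camera actually sees)
    into an occlusion-aware label.

    Both inputs are boolean H x W arrays produced by the rendering engine.
    """
    amodal_area = int(amodal_mask.sum())
    visible_area = int(visible_mask.sum())
    occluded_fraction = 1.0 - visible_area / max(amodal_area, 1)
    return {
        "visible_mask": visible_mask,
        "amodal_mask": amodal_mask,              # lets trackers reason about hidden extent
        "occluded_fraction": occluded_fraction,  # 0.0 = fully visible, 1.0 = fully hidden
        "fully_occluded": visible_area == 0,     # object still exists, just not visible
    }

# Toy example: a 2x4 object whose right half is hidden behind another object.
amodal = np.array([[1, 1, 1, 1],
                   [1, 1, 1, 1]], dtype=bool)
visible = np.array([[1, 1, 0, 0],
                    [1, 1, 0, 0]], dtype=bool)
print(occlusion_annotation(amodal, visible)["occluded_fraction"])  # 0.5
```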

An example of a dense video sequence with heavily occluded pedestrians. Video by Nuro.

6. Seamless Experimentation With Sensor Modalities and Rapid Product Prototyping

The performance of computer vision models depends heavily on the sensor stack used to collect the training data and on how well it matches the final product's sensor stack. For example, if the focal length of the camera lens used to collect training data differs from that of the camera used in production, prediction accuracy will degrade. The same goes for all other intrinsic and extrinsic (i.e., location and rotation) sensor parameters.
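To see why a focal-length mismatch matters, here is a minimal pinhole-camera sketch: the same 3D point lands on different pixels when only the focal length changes, so a model trained with one camera systematically sees different object sizes and positions than the production camera does. The numbers are purely illustrative.

```python
def project_point(point_3d, focal_px, cx=960.0, cy=540.0):
    """Project a 3D point (camera coordinates, meters) onto the image plane
    using a simple pinhole model with focal length `focal_px` (in pixels)
    and principal point (cx, cy).
    """
    x, y, z = point_3d
    u = focal_px * x / z + cx
    v = focal_px * y / z + cy
    return u, v

# The same pedestrian, 10 m ahead and 2 m to the side of the camera...
point = (2.0, 0.0, 10.0)

# ...lands on noticeably different pixels when only the focal length changes.
print(project_point(point, focal_px=1000.0))  # (1160.0, 540.0) with the "training" camera
print(project_point(point, focal_px=1400.0))  # (1240.0, 540.0) with the "production" camera
```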

While prototyping a product, it is common to experiment with the sensor stack, so the risk of having to re-collect data every time the sensors are modified should be accounted for.

Synthetic data provides huge value here. It can accommodate a wide variety of interchangeable sensor modalities, like depth and point clouds, and it can generate entirely new sensing modalities, like optical flow and surface normals.
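As one example, surface normals can be approximated directly from a rendered depth map (most engines can also output exact normals natively). A rough, illustrative sketch using finite differences:

```python
import numpy as np

def normals_from_depth(depth: np.ndarray) -> np.ndarray:
    """Approximate per-pixel surface normals from a depth map (H x W, meters)
    using finite differences. Rendering engines usually output exact normals
    directly; this sketch just shows that the modality is cheap to obtain.
    """
    dz_dv, dz_du = np.gradient(depth)  # depth change along image rows / columns
    normals = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return normals  # H x W x 3 unit vectors

# Toy example: a plane tilted left-to-right yields a constant normal everywhere.
u = np.tile(np.arange(8, dtype=float), (8, 1))
depth = 5.0 + 0.1 * u                            # depth increases left to right
print(normals_from_depth(depth)[4, 4].round(3))  # roughly [-0.1, 0.0, 0.995]
```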

Synthetic data can also support product development and help evaluate perception algorithm accuracy for different sensor modalities, intrinsic parameters, and placements.

NVIDIA Drive Sim platform demonstration showcasing multiple sensor simulation

Synthetic data generation technology has come a long way in recent years and shows great promise for the future of AI training and development. It has evolved from a secondary, supporting role (in conjunction with real data) into a primary driver of AI innovation.

It is an exciting area of research and development and I think we are just beginning to scratch the surface of what's possible.

Although I believe the future of synthetic data is bright, I don't claim it is a silver-bullet solution for every problem. In upcoming editions of this newsletter, we will reflect on the use cases where applying synthetic data is still considered impractical or challenging.

And now, let's dive into the news headlines of recent days!


Shutterstock and Getty Images Appear to be Removing User-Uploaded AI-Generated Images

Just weeks after announcing a collaboration with OpenAI and a future integration of DALL-E, Shutterstock appears to be removing AI-generated images uploaded by users, stating that "[artwork that] appears to have been created with AI-generative technology is not acceptable [on the platform]". Getty Images has likewise announced that it would ban AI-generated images. This move raises a lot of questions about what exactly counts as an AI-generated image, since many image editing platforms like Adobe Photoshop, Topaz, and Luminar now offer some sort of AI-aided functionality.

NVIDIA Announces Magic3D Text-to-3D Content Creation Tool

NVIDIA's Magic3D is a new text-to-3D content creation tool that creates 3D mesh models with unprecedented quality. Together with image conditioning techniques as well as a prompt-based editing approach, it provides users with new ways to control 3D synthesis, opening up new avenues to various creative applications.

Details on NVIDIA Omniverse Kit 104

We have already reported on the recent NVIDIA Omniverse release. NVIDIA's Damien Fagnou, VP of Software, has offered a detailed look at the new functionality that allows developers to more easily create, package, and publish metaverse applications.

The Illustrated Stable Diffusion

Jay Alammar has written a wonderfully illustrated explanation of Stable Diffusion. This is probably the best explanation I've come across so far.

Parallel Domain Raises $30 Million Series B

Last week brought the heartwarming news that synthetic data generation vendor Parallel Domain has raised a Series B funding round. Congrats to Kevin McNamara, James Grieve, and the whole team!

And that's a wrap!


See you next week!

Andrey
