The Advantages of Synthetic Data
Welcome to the Reality Gap newsletter, which focuses on synthetic data, generative AI, and data-centric computer vision. If you'd like to be notified about the next edition, just click "Subscribe" at the top of this page.
Synthetic data generation technology is a relatively recent addition to the toolkit of machine learning engineers. Still, it has already evolved from its initial supporting role of augmenting real-world data to enabling a new wave of AI innovation.
Gartner famously claimed that “by 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated”. Gartner has also put Synthetic Data on the “Impact Radar for Edge AI”, ranking it among the top three high-profile technologies.
Let’s review once more what led to the popularization of synthetic data and where it holds an advantage over the conventional approach of using real-world data.
1. Cost and Speed of Data Acquisition
Building products powered by machine learning requires fast experimentation and prototyping of product and algorithm hypotheses.
Collecting real-world data is usually time-consuming, expensive, and dependent on the product’s sensor stack.
Synthetic data generation requires some upfront investment, but past this initial production stage, data is relatively fast and cheap to generate in high volumes.
Accounting for synthetic data in the early stages of a project allows for rapid experimentation and iterative bootstrapping of the product and foundational algorithms. After an early prototype is built and the end-to-end data architecture is established, real-world data can then be collected by allowing early adopters to use the solution.
2. Performance Over a Large Distribution of Scenarios and Rare Events
There is usually a high cost associated with collecting a diverse real-world dataset for training predictive models. It is even more costly to collect a dataset representative of rare events and edge-case scenarios.
In fact, many consider progress in autonomous systems to be hindered by exactly such rare events. Autonomous car manufacturers have been promising robust self-driving for years, but we have yet to see it function across the full spectrum of weather conditions and road scenarios.
Andrej Karpathy, the former director of Autopilot at Tesla, famously said: “Because [the autonomous vehicle] is a product in the hands of customers, you are forced to go through the long tail. You cannot do just 95% and call it a day. The long tail brings all kinds of interesting challenges.”
With synthetic data, engineers can now be proactive about building diverse datasets by creating new scenarios and managing the distribution of scene parameters.
Synthetic data holds strong promise for, if not fully solving, then dramatically contributing to solving the “long tail” problem by allowing rare scenarios to be simulated.
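To make "managing the distribution of scene parameters" concrete, here is a minimal sketch of how a scenario generator might sample scene configurations, deliberately oversampling rare conditions. All parameter names and ranges are illustrative assumptions, not taken from any specific simulation tool.

```python
import random

# Hypothetical scene-parameter space for a driving simulator.
WEATHER = ["clear", "rain", "fog", "snow"]
# Oversample rare weather so the long tail is well represented.
WEATHER_WEIGHTS = [0.4, 0.2, 0.2, 0.2]

def sample_scene(rng: random.Random) -> dict:
    """Draw one scene configuration from the controlled distribution."""
    return {
        "weather": rng.choices(WEATHER, weights=WEATHER_WEIGHTS, k=1)[0],
        "sun_elevation_deg": rng.uniform(-10, 70),  # include dusk and night
        "num_pedestrians": rng.randint(0, 30),
        "camera_dirt": rng.random() < 0.05,         # rare lens occlusion
    }

rng = random.Random(42)
scenes = [sample_scene(rng) for _ in range(1000)]
fog_share = sum(s["weather"] == "fog" for s in scenes) / len(scenes)
print(f"fog scenes: {fog_share:.1%}")  # close to the 20% we requested
```

The key point is that the dataset's distribution is a design choice: a condition that appears once in a million real-world miles can be dialed up to any share of the training set.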
While helping with rare events, having diverse datasets also allows for reducing bias in predictive models. For example, if a dataset contains data from only one population group, then the accuracy of a model built with it will be limited as it can only be applied to that population group. With synthetic data, it is possible to generate more diverse datasets so that the trained model will be applicable to a greater number of population groups.
3. Quality of Ground Truth Annotation
The majority of state-of-the-art algorithms within computer vision are built in a supervised manner — the algorithm learns to recognize patterns by observing the data and associating it with ground truth labels. For the model to learn what the ground truth is, each data point in the dataset needs to be annotated.
The current conventional approach is to use manually annotated data — review each example and label it by hand. At the scale of large machine learning datasets with hundreds of thousands and even millions of data points, this approach is often costly, generally time-consuming, and error-prone.
Some complex computer vision scenarios, like dense scenes, heavy occlusions, and complex sensor modalities like depth and surface normals, are impractical to annotate manually at all.
Manual data annotation can also introduce bias to the data as annotators might make errors in judgment while performing data labeling.
Synthetic data allows for automatic, pixel-perfect, high-quality ground truth annotations at near-zero marginal cost, without any manual annotation.
Because synthetic scenes are simulated, ground truth information is available even for heavily occluded or fully invisible objects in complex scenes.
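As a toy illustration of why these labels come for free: a rendering engine knows which object produced every pixel, so segmentation masks and bounding boxes can be read straight out of its buffers. The tiny array below stands in for a renderer's per-pixel instance-ID buffer; the buffer contents and function are purely illustrative.

```python
import numpy as np

# Toy stand-in for a renderer's instance-ID buffer: each pixel stores the
# ID of the object visible there (0 = background). A real engine exposes
# such a buffer directly, which is what makes the labels "free".
instance_ids = np.array([
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
])

def masks_and_boxes(ids: np.ndarray) -> dict:
    """Derive a binary mask and a tight bounding box for every object ID."""
    out = {}
    for obj in np.unique(ids):
        if obj == 0:
            continue  # skip background
        mask = ids == obj
        ys, xs = np.where(mask)
        box = tuple(int(v) for v in (xs.min(), ys.min(), xs.max(), ys.max()))
        out[int(obj)] = {"mask": mask, "box": box}  # box = (x0, y0, x1, y1)
    return out

labels = masks_and_boxes(instance_ids)
print(labels[1]["box"])  # (1, 0, 2, 1)
print(labels[2]["box"])  # (2, 1, 3, 2)
```

No human ever looks at the image, so the annotation cost does not grow with scene density, and the masks stay pixel-perfect no matter how cluttered the scene is.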
4. Privacy Concerns and Cost of Handling Sensitive Data
Collecting real-world datasets is often associated with major privacy risks when storing, sharing, and annotating Personally Identifiable Information (PII) or other types of sensitive data.
And even when this sensitive data is handled properly while training predictive models, there are still risks of "inference attacks", where models can "leak" and compromise information about their training data.
In such scenarios, using synthetic data can provide a viable solution to generate datasets without having direct access to sensitive information while preserving the statistical features that would be required to train and evaluate a model.
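The idea of "preserving the statistical features" can be sketched very simply: fit summary statistics on the sensitive records, then sample synthetic records from the fitted distribution instead of sharing the originals. This is a deliberately minimal toy with made-up numbers; real synthetic-data tools use far richer generative models and formal privacy accounting.

```python
import random
import statistics

# Sensitive column we do not want to share directly (illustrative values).
real_ages = [23, 35, 31, 42, 29, 38, 27, 33]

# Fit simple statistics on the real data...
mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# ...then sample a synthetic column from the fitted distribution.
rng = random.Random(0)
synthetic_ages = [rng.gauss(mu, sigma) for _ in range(1000)]

# The synthetic column tracks the real statistics without exposing any row.
print(round(statistics.mean(synthetic_ages), 1))
```

Downstream models can then be trained and evaluated on the synthetic column, while the original records never leave their secure environment.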
There is a substantial cost saving associated with storing and handling synthetic data compared to real sensitive data. According to Gartner, synthetic data will enable organizations to avoid 70% of privacy violation sanctions.
5. Robust Object Occlusion and Permanence Training
Scenarios where dense object scenes are frequent and object occlusion is common are particularly challenging to solve using models trained on real data. Not only is annotation of dense scenes far more costly, it is also heavily biased, as annotators cannot see the occluded parts.
Correct annotation of occluded scenes is crucial for object tracking and path prediction algorithms (such as pedestrian path prediction in an autonomous driving context) and for robust operation overall.
As synthetic data is rendered from simulated scenes, rendering engines are fully 3D-aware. This allows pixel-perfect annotation of occluded objects and for the first time enables the engineering of robust object-tracking algorithms for occluded contexts.
6. Seamless Experimentation With Sensor Modalities and Rapid Product Prototyping
The performance of computer vision models is massively dependent on the sensor stack used to collect the training data and how well it matches the final product’s sensor stack. For example, if the focal length of the camera lens used to collect training data differs from that of the camera used in production, prediction accuracy will degrade. The same goes for all other intrinsic and extrinsic (i.e. location and rotation) sensor parameters.
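The focal-length example can be made concrete with the standard pinhole camera model: a 3D point at lateral offset X and depth Z projects to pixel u = f·X/Z + cx, so changing the focal length shifts where every object lands in the image. The numbers below are illustrative, not from any real camera.

```python
# Pinhole projection: a point at depth z appears at pixel u = f * x / z + cx.
# A train/production focal-length mismatch therefore shifts every detection.
def project_u(x: float, z: float, f_px: float, cx: float) -> float:
    """Horizontal pixel coordinate of a 3D point (x, z) in the camera frame."""
    return f_px * x / z + cx

x, z, cx = 1.5, 10.0, 640.0            # point 1.5 m right, 10 m ahead
u_train = project_u(x, z, 1000.0, cx)  # training-camera focal length (px)
u_prod  = project_u(x, z, 1200.0, cx)  # production camera, longer lens
print(u_train, u_prod)  # 790.0 820.0 -> a 30-pixel shift for the same point
```

With a simulated camera, f and cx are just parameters of the render, so the training data can be regenerated to match any candidate production sensor instead of being re-collected.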
While prototyping a product, it is common to experiment with the sensor stack, so the cost and risk of re-collecting data every time the sensors are modified should be accounted for.
Synthetic data provides a huge value here. It can accommodate a wide variety of interchangeable sensor modalities like depth and point clouds. It can also generate new sensing modalities like optical flow and surface normals.
Synthetic data can also support product development and help evaluate perception algorithm accuracy for different sensor modalities, intrinsic parameters, and placements.
Synthetic data generation technology has come a long way in recent years and shows great promise for the future of AI training and development. It evolved from a secondary supportive role (in conjunction with real data) to a primary driver of AI innovation.
It is an exciting area of research and development and I think we are just beginning to scratch the surface of what's possible.
Although I believe the future of synthetic data is bright, I don't claim it is a silver-bullet solution for every problem. In upcoming editions of this newsletter, we will look at the use cases where the application of synthetic data is still considered impractical or challenging.
And now, let's dive into the news headlines of recent days!
Shutterstock and Getty Images Appear to be Removing User-Uploaded AI-Generated Images
Just weeks after announcing a collaboration with OpenAI and future integration of DALL-E, Shutterstock appears to be removing AI-generated images uploaded by users, stating that "[artwork that] appears to have been created with AI-generative technology is not acceptable [on the platform]". Getty Images announced that it would ban AI-generated images. This move raises a lot of questions about what exactly counts as an AI-generated image, since many image editing platforms like Adobe Photoshop, Topaz, and Luminar now offer some sort of AI-aided functionality.
NVIDIA Announces Magic3D Text-to-3D Content Creation Tool
NVIDIA's Magic3D is a new text-to-3D content creation tool that creates 3D mesh models with unprecedented quality. Together with image conditioning techniques as well as a prompt-based editing approach, it provides users with new ways to control 3D synthesis, opening up new avenues to various creative applications.
Details on NVIDIA Omniverse Kit 104
We have already reported on the recent NVIDIA Omniverse release. NVIDIA's Damien Fagnou, VP of Software, has offered a detailed view of the new functionality that allows developers to more easily create, package, and publish metaverse applications.
The Illustrated Stable Diffusion
Jay Alammar has written a wonderfully illustrated explanation of Stable Diffusion. This is probably the best explanation I've come across so far.
Parallel Domain Raises $30 Million Series B
Last week was marked by the heartwarming news of synthetic data generation vendor Parallel Domain having raised a Series B funding round. Congrats to Kevin McNamara, James Grieve, and the whole team!
And that's a wrap!
Here are a few more ways you can learn about synthetic data and generative AI:
See you next week!
Andrey