When It Comes To AI—Synthetic Data Has A Dirty Little Secret


I like to help young machine learning engineers and data scientists whenever I can. While I was answering a Reddit thread about data quality, the question of synthetic data came up, and it seemed like a good opportunity for a quick article on this particular area of machine learning.

When it comes to machine learning, the data is everything. You can't refine gasoline without enough high-quality crude, and the same goes for creating robust and predictive machine learning models.

Oftentimes, CIOs and CTOs find themselves caught up in the whirlwind of AI and all the wonderful things it can bring to business operations. But as in any other discipline, you have to crawl before you can walk, and walk before you can run.


What is Synthetic Data?

Synthetic data refers to artificially generated data that simulates real-world data but is not directly collected from actual observations. It's often used in machine learning when real data is scarce, sensitive, or hard to obtain.



For example, suppose a business wants to build a predictive model to determine whether a certain type of customer is likely to buy another of its products, but it does not have enough customer transactions. A model trained on the business's real data alone would have little predictive power.

Enter synthetic data: if you are confident that your limited dataset represents the true underlying nature of reality, you can extend that dataset with synthetic data.
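As a rough illustration of what "extending" a dataset can look like, here is a minimal sketch that fits a normal distribution to a handful of real values and samples new ones from it. The purchase amounts, the function names, and the choice of a normal distribution are all assumptions made for the example; real generators are usually far more sophisticated.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of the real data."""
    return statistics.mean(values), statistics.stdev(values)


def generate_synthetic(values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to `values`."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical purchase amounts from a small set of real transactions.
real_purchases = [20.5, 22.0, 19.8, 25.1, 21.3, 23.7, 20.9, 24.2]

synthetic_purchases = generate_synthetic(real_purchases, n=1000)
extended = real_purchases + synthetic_purchases

print(len(extended))
print(round(statistics.mean(real_purchases), 2))
print(round(statistics.mean(synthetic_purchases), 2))
```

Note that the synthetic values can only reproduce whatever the eight real values already say about the mean and spread. Nothing new about the world is learned in the process.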


However, using synthetic data comes with its own set of advantages and disadvantages...

Advantages

Scalability: Synthetic data can be generated in large quantities, allowing for more comprehensive training of machine learning models. As mentioned, this is particularly useful when real data is limited.

Privacy and Security: When dealing with sensitive or private data, synthetic data can be used to create a realistic representation without exposing actual individual information. This helps in compliance with data protection regulations.

Diversity: Synthetic data can be customized to cover a wide range of scenarios and edge cases, enhancing the robustness of the trained model. This can result in better generalization and performance.

Data Augmentation: Synthetic data can be combined with real data for data augmentation, effectively expanding the training dataset and reducing overfitting.


Disadvantages

Realism: Generating synthetic data that accurately captures the complexity and nuances of real-world data can be challenging. Poorly generated synthetic data might mislead the model or introduce biases.

Bias Amplification: If the algorithms used to generate synthetic data introduce biases, these biases can be amplified during training, leading to skewed model outcomes.

Generalization Concerns: Models trained on synthetic data may not always generalize well to real-world data. This is particularly true when the synthetic data does not fully capture the variability present in actual observations.

Complexity: Creating high-quality synthetic data requires careful consideration of various factors, such as feature distributions, correlations, and relationships. This complexity can lead to errors or inconsistencies in the generated data.

Resource Intensive: Depending on the complexity of the data and the methods used for generation, creating synthetic data can be computationally expensive and time-consuming.

Lack of Context: Synthetic data might not capture the contextual richness of real-world situations. This could be a limitation when training models that rely heavily on context, like natural language processing tasks.

Validation and Testing: It can be challenging to validate the quality and effectiveness of synthetic data. Proper validation is necessary to ensure that the synthetic data is helping rather than hindering the learning process.
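One simple way to approach the validation problem is to compare the distribution of a synthetic feature against real data that was held out from the generator. The sketch below computes the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions) using only the standard library. The sample values are made up, and this checks a single feature at a time, so treat it as one sanity check among several rather than a complete validation procedure.

```python
import bisect


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs
    of the two samples, a number between 0 and 1.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate the gap at every observed value.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical values: real data held out from the generator,
# plus two candidate synthetic samples.
real_holdout = [1.0, 2.0, 3.0, 4.0, 5.0]
close_synthetic = [1.5, 2.5, 3.5, 4.5, 5.5]
far_synthetic = [10.0, 11.0, 12.0]

print(ks_statistic(real_holdout, close_synthetic))
print(ks_statistic(real_holdout, far_synthetic))
```

A value near 0 means the two empirical distributions look alike, and a value of 1 means they do not overlap at all. What counts as "close enough" depends on the sample size and the use case.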


What Constitutes High-Quality Data?

Oftentimes, business professionals and executives interpret high-quality data as "a lot of data, that is centrally stored, structured, clean, and labeled with no missing values."

However, from an engineering standpoint, this is all relatively easy to do. A quick Python script, Apache Beam job, or Apache Spark job and a datastore can take care of all that.

What Beam, Spark, Python scripts, and datastores cannot do is guarantee that your data is representative of reality.



Your Map Is Not The Territory


This is what we are referring to when we say "high-quality" data: is your dataset expansive and representative enough of the world out there?

And this is where the true data science begins before the machine learning can even start. You need to have enough real-world data to be reasonably sure that the distribution of your dataset is what you can expect to see going forward.

Not only that, you have to be sure that the features or columns that you are feeding into your model actually have a naturally occurring predictive nature. This is the hardest part about machine learning that people don't realize—there actually has to be some predictive association between input features and the target (predicted) variable.

That's when techniques like principal component analysis (PCA), autoregressive analysis, or stochastic modeling come into play, well before the machine learning part of things, to determine whether what you are trying to predict has some type of relationship to what you plan on feeding into a model.


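PCA finds the directions of greatest variance in your dataset, which is equivalent to minimizing the reconstruction residuals. Because PCA never looks at the target variable, whether the resulting components are actually predictive still has to be checked against the target.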


For example, in the Reddit thread, I used the number of consecutive rainy days as a potential predictor of the daily percent change in umbrella sales. This is intuitive, because if people see that it has been or will be raining a lot, they are more likely to purchase or even upgrade their umbrellas.

That is a naturally occurring predictive phenomenon, and a machine learning model can then be used to approximate a function that better explains how "x number of days of rain" results in a "y % increase in sales".
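A quick first check for this kind of relationship is to measure the correlation between a candidate feature and the target before any model is trained. The sketch below computes the Pearson correlation coefficient by hand. All of the numbers are invented for illustration, and `day_of_month` is a hypothetical feature included only as a contrast. Keep in mind that Pearson correlation only detects linear association, so a low value does not rule out a more complex relationship.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up observations: consecutive rainy days and the daily
# percent change in umbrella sales on the same days.
rainy_days = [0, 1, 2, 3, 4, 5]
sales_change_pct = [0.5, 2.0, 3.8, 6.1, 8.2, 9.9]

# A made-up feature with no obvious link to umbrella sales.
day_of_month = [17, 3, 28, 9, 22, 11]

print(round(pearson_r(rainy_days, sales_change_pct), 3))
print(round(pearson_r(day_of_month, sales_change_pct), 3))
```

In this toy data the rainy-day feature tracks the target closely while the day of the month barely moves with it, which is the kind of signal you want to see before a feature goes into a model.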

This is the true nature of machine learning—better explaining the y's given the x's. That's all. But if you throw the whole kitchen sink of features into a model and none of those features have really anything to do with the target variable in real life, the model will be non-predictive.

Or even worse: maybe only two out of 50 of those features have some predictive relationship, but because you threw the whole kitchen sink at the model, training may not give enough weight to the predictive factors, and the model may latch onto noise instead. This is closely tied to the curse of dimensionality, where the amount of data needed to learn reliably grows rapidly with the number of features.
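The kitchen-sink problem can be seen even without a model. In the sketch below, every feature is pure random noise with no relationship to the target, yet with only a few samples and many features, at least one of them tends to look meaningfully correlated by chance. The sample size, feature count, and seed are arbitrary choices for the demonstration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(7)  # fixed seed so the run is reproducible
n_samples, n_noise_features = 20, 50

# The target and every feature are independent random draws,
# so no feature carries any real information about the target.
target = [rng.gauss(0, 1) for _ in range(n_samples)]
noise_features = [
    [rng.gauss(0, 1) for _ in range(n_samples)]
    for _ in range(n_noise_features)
]

strongest = max(abs(pearson_r(feature, target)) for feature in noise_features)
print(f"strongest correlation among pure-noise features: {strongest:.2f}")
```

A model given all 50 of these columns has no way to know that the apparent signal is an accident of a small sample, which is why pruning features that have no plausible real-world link to the target matters.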

Synthetic Data Just Extends Your Map

Getting back to synthetic data and whether it is useful or not: the answer is that it depends.

Synthetic data assumes that the limited map you have of a territory will repeat the same patterns and signals over and over as you extend the map outwards in all directions.

However, if your map is not the territory, and the underlying data in your possession is not truly representative of what is really out there but only of your little corner of the universe, then synthetic data will just confirm the biases of your map.


Your map may not be the territory, because your map may be biased, or skewed!


What this means is that your model will converge and look accurate in training and validation, but when you go to use it in the real world, it will fail.
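A small simulation makes this concrete. The sketch below builds a hypothetical "territory" made of two customer segments, collects a "map" from only one of them, and then generates a large synthetic dataset from that map. The segment means, sample sizes, and the normal-distribution generator are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical territory: two customer segments with different spending.
segment_a = [rng.gauss(50, 5) for _ in range(5000)]
segment_b = [rng.gauss(100, 5) for _ in range(5000)]
territory = segment_a + segment_b

# The map: a sample that was only ever collected from segment A.
biased_map = rng.sample(segment_a, 200)

# Extend the map with synthetic data fitted to the biased sample.
mu = statistics.mean(biased_map)
sigma = statistics.stdev(biased_map)
synthetic = [rng.gauss(mu, sigma) for _ in range(10000)]

print(f"territory mean: {statistics.mean(territory):.1f}")
print(f"map mean:       {statistics.mean(biased_map):.1f}")
print(f"synthetic mean: {statistics.mean(synthetic):.1f}")
```

The synthetic data is fifty times larger than the map it came from, yet its average stays close to the map's average and far from the territory's. More rows did not bring it any closer to reality.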

It Starts With Understanding Your Territory


It goes without saying that in order to better predict an avalanche, you need to have a full and accurate map of the mountain.

The same goes for when it comes to basing business decisions on the outputs of machine learning models—you need to make sure that what you have in terms of data will get you to where you want to go in terms of reality.


At Synvestable, our professional services understand the entire machine learning process intimately.


Before we could have any success at all predicting over 10,000 stocks every month with reasonable accuracy, we needed to make sure that the data we were feeding into our models was representative of what may come after.

This included feature engineering, normalization, stochastic modeling, reduction of dimensionality, feature combining, and even probabilistic modeling.

If you're an enterprise organization in the financial services or banking space and are beginning to embark on your journey in machine learning and augmenting your business operations with AI—reach out to us.

We work with several major wealth management companies and have empowered them holistically—from their data science strategy to full implementation of machine learning services that are now augmenting their clients' personalized experiences at the speed of real-time data.

For a unique perspective on how machine learning will affect financial services, check out our ground-breaking whitepaper, The Human-Digital Experience of Hybrid Wealth Management. Just click the image link below.


Click the image to read our ground-breaking research.


More articles by John Anthony Radosta