How much data do I need to use AI for engineering products?
In this post, I want to summarise some practical considerations around the data you need to build an impactful AI model on engineering R&D data. I have tried to make this article useful both for beginners and for people with prior experience. I would love your comments on whether you find it useful and any suggestions for improvement.
As machine learning is all about deriving relationships from data without explicitly defining them yourself, everybody's first question when starting is: how much and what data do I need? The immediate follow-up question is: what will I be able to predict with this? Learning new things, predicting the behaviour of new products, or automating processes successfully is crucial to justify the time spent building an AI solution, so everybody wants to know these things up front.
In this article, I will try to give you as many useful insights as possible so you can answer this seemingly simple yet surprisingly difficult question! If this is too long for you, just reach out to info@monolithai.com and have one of the applied engineers at Monolith explain it to you in person. Like me, they love nothing better than chatting about engineering challenges :)
Alright, here are the 4 main principles you need to consider:
1. WHAT TYPE OF DATA YOU HAVE (Time series, flow fields, 3D design, simple tabular data, ... in general, more complex data types like flow fields or CAD designs can contain much more information per sample, but can also be harder to work with. )
2. QUALITY OF YOUR DATA (If some classes or output ranges are rare, more data will be required to understand them)
3. WHAT ALGORITHM YOU USE (Deep Learning needs lots of data, Gaussian Processes can work with very little)
4. WHAT PROBLEM YOU WANT TO SOLVE (Understand trends, make predictions, perform optimization, and how good your model needs to be to be useful)
We will now look at some concrete examples for each of these points.
1. WHAT TYPE OF DATA YOU HAVE
a) STRUCTURED TABULAR DATA
Structured and labelled data in a simple tabular form is the easiest format to use for machine learning. It makes life super easy from a data scientist's point of view. Sadly, it requires the most pre-processing from subject matter experts to create, and it is also the least informative, as you only get a single number per data point instead of an entire image or a design. Lots of engineering companies now have data policy officers to make sure people save things in a form that can be computationally processed. If you haven't considered it yet, you should. In 2021, your company should not be creating any data that cannot be used for learning.
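If you want to see what this looks like in code, here is a minimal sketch of tabular engineering data ready for machine learning. The column names (blade_count, chord_mm, drag_coefficient) are made up purely for illustration.

```python
# A minimal sketch of tabular engineering data ready for machine learning.
# Column names ("blade_count", "chord_mm", "drag_coefficient") are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "blade_count": [2, 3, 3, 4],                    # design inputs (features)
    "chord_mm":    [120.0, 95.5, 110.2, 87.3],
    "drag_coefficient": [0.31, 0.27, 0.29, 0.25],   # measured output (label)
})

X = df[["blade_count", "chord_mm"]]   # features
y = df["drag_coefficient"]            # target to predict
print(X.shape, y.shape)               # each row is one labelled data point
```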
b) TIME SERIES
Raw time series data works quite well for machine learning in general, and it is often the format that comes off the sensors you put on engineering equipment. From identifying autopilots to performing predictive maintenance, the fact that a signal is sampled at a high frequency (say, 100 samples per second) means you quickly accumulate large amounts of data that you can learn a lot from. A common challenge when combining different signals is ensuring they are all at the same sampling rate; Monolith, for example, has a toolbox that resamples data automatically.
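As an illustration of that resampling step, here is a generic pandas sketch (not Monolith's toolbox, and the sensor names are hypothetical) that puts two signals recorded at different rates onto a common grid.

```python
# Hedged sketch (generic pandas, not Monolith's toolbox): aligning two sensor
# signals recorded at different rates onto a common 20 ms grid.
import numpy as np
import pandas as pd

t_fast = pd.date_range("2021-01-01", periods=1000, freq="10ms")   # 100 Hz accelerometer
t_slow = pd.date_range("2021-01-01", periods=100, freq="100ms")   # 10 Hz temperature sensor

accel = pd.Series(np.random.randn(len(t_fast)), index=t_fast, name="accel")
temp = pd.Series(np.random.randn(len(t_slow)), index=t_slow, name="temp")

# Downsample the fast signal by averaging, upsample the slow one by interpolation.
aligned = pd.concat(
    [accel.resample("20ms").mean(), temp.resample("20ms").interpolate()],
    axis=1,
)
print(aligned.head())
```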
c) CAD DATA
Hardly anybody in engineering today knows that you can directly process 3D mesh information. Partly that is probably because, except for Monolith, nobody has built and commercialised tools that make it possible to do deep learning on CAD data. The team are real thought leaders in this field, and you can do incredible things with this new toolset. If this seems strange, just remind yourself that a design or a mesh is simply a large set of (X, Y, Z) coordinates. In my opinion, 3D design deep learning is the most exciting new area of machine learning, with the largest potential to revolutionise engineering today. As evidence, I would point to how deep learning transformed image processing; you can find another article on that here.
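To make the "(X, Y, Z) coordinates" point concrete, here is a minimal NumPy sketch (not Monolith's actual pipeline) of the kind of preprocessing typically applied before feeding a mesh to a 3D deep learning model.

```python
# Minimal sketch (not Monolith's pipeline): a surface mesh is just an
# (N, 3) array of X, Y, Z vertex coordinates that a network can consume.
import numpy as np

vertices = np.random.rand(250_000, 3)        # stand-in for a real CAD mesh

# Typical preprocessing before feeding points to a 3D deep-learning model:
centred = vertices - vertices.mean(axis=0)                # centre at the origin
scaled = centred / np.linalg.norm(centred, axis=1).max()  # fit inside the unit sphere
print(scaled.shape)   # (250000, 3): one "sample" carries a lot of information
```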
d) SIMULATION (CAE) DATA
You can also learn directly from simulation field data using deep learning algorithms. Again, this is hardly known among engineers but has become more prominent in 2021. DeepMind, Nvidia and Monolith are just some of the companies that have started research in this area. While DeepMind and Nvidia are mostly interested in the benefits for computer graphics, video games, and movies, we at Monolith are excited about the possibilities it offers for modelling engineering systems. Imagine this: if I run 10 wind turbine simulations, create structured output data, and only learn from the drag coefficient, I end up with 10 points for a fairly complex design, which is really not a lot of data considering the complexity of flows. If instead I learn from the entire flow field, each containing 250,000 cells, I have more than 2.5 million points to learn from. Suddenly, 10 simulations are a lot of information given the complexity of the problem. Read more about how Siemens is using this approach for combustion chamber simulation.
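If you prefer to see the arithmetic spelled out, here is a tiny illustrative sketch with dummy arrays of the shapes described above.

```python
# Back-of-the-envelope illustration of the point above (shapes only, dummy data).
import numpy as np

n_sims, n_cells = 10, 250_000

drag_only = np.random.rand(n_sims)             # 10 training points
flow_field = np.random.rand(n_sims, n_cells)   # 2,500,000 training points

print(drag_only.size, flow_field.size)         # 10 vs 2500000
```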
2. QUALITY OF DATA
Like with everything in life, quantity is nothing without quality. Let’s say you are building a model to learn when a hard disc will fail. Collecting 1 million data points without a single failure will teach you nothing at all about when failures occur, whereas as few as 5 discs that actually failed for different reasons can teach you a lot!
Here are a couple of tricks to better understand what high-quality data means for regression or classification problems. Don't know the difference? Read this.
For regression/prediction problems, it is useful to have samples that cover your design space fairly evenly. Below are two different designs of experiments (DoE) for simple tabular data. What I am plotting here is the knowledge we have about each part of the design space, shown in colour: dark red means I know what is going on, blue means I am uncertain what will happen there. You can see that for five blades, no designs or data points were created (makes sense: who has ever seen a five-bladed wind turbine?).
Here is another example showing a few randomly scattered points with big gaps in between them.
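If you want to generate evenly covering samples yourself, here is a hedged sketch comparing purely random points to a space-filling Latin hypercube design using SciPy; the 2D design space and the coverage metric are only illustrative.

```python
# Sketch: random sampling vs. a space-filling Latin hypercube design for a
# 2D design space (axes are purely illustrative).
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(seed=0)
random_points = rng.random((20, 2))      # can leave big gaps in the design space

sampler = qmc.LatinHypercube(d=2, seed=0)
lhs_points = sampler.random(n=20)        # tends to cover the space more evenly

# Crude coverage check: worst-case distance from any location in the unit
# square to its nearest sample ("fill distance"; smaller means better coverage).
grid = np.stack(
    np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50)), -1
).reshape(-1, 2)

def fill_distance(points):
    d = np.linalg.norm(grid[:, None] - points[None, :], axis=-1)
    return d.min(axis=1).max()

print(fill_distance(random_points), fill_distance(lhs_points))
```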
So the conclusion here: for simple data, if you do not have information for a specific range, because you have never built a wind turbine with 5 blades or never tried to drive your car at 200 miles an hour, then your chance of getting good answers from machine learning is low.
Personally, I like picturing any problem as an unknown mountain range that I am trying to measure. I have to run around manually, take measurements to build a map, and after each measurement conclude what I think the terrain looks like. If I get two measurements of the same height, I might think the ground between them is flat, but there could still be a valley in between; I don't know until I check. At NASA we used to say: there is no free lunch. Every problem requires a certain number of samples to understand it, and you won't get away without doing the work unless a) you have prior experience of what the problem looks like or b) you have a really clever strategy for choosing your next measurement. Both are fantastic mechanisms that make machine learning really powerful for engineering: a) is called transfer learning and b) active learning.
Below is a really old gif I created when I was a fellow at Imperial College London where I was trying to learn an unknown surface one sample at a time. You can see that the mathematical conclusions are completely wrong at first but then get better.
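For the curious, here is a rough sketch of the same idea (not the original Imperial College code): a Gaussian process learns an unknown 1D "terrain" one sample at a time, always measuring where it is most uncertain, which is the active-learning strategy mentioned above.

```python
# Hedged sketch: learning an unknown 1D "terrain" one sample at a time,
# always measuring where the Gaussian process is most uncertain.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def terrain(x):
    """The unknown 'mountain range' we are trying to map (made up)."""
    return np.sin(3 * x) + 0.5 * np.cos(7 * x)

x_grid = np.linspace(0, 2, 200).reshape(-1, 1)
X = np.array([[0.1], [1.9]])          # two initial measurements
y = terrain(X).ravel()

for _ in range(8):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.3)).fit(X, y)
    mean, std = gp.predict(x_grid, return_std=True)
    x_next = x_grid[np.argmax(std)]   # measure where we know the least
    X = np.vstack([X, [x_next]])
    y = np.append(y, terrain(x_next))

print(f"{len(X)} measurements taken to map the terrain")
```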
As engineering is mostly about different designs rather than single data points, below is a more practical example that can help you tune your intuition for which designs you can make predictions about. Let's assume you have a database of rims with two or three spokes. It stands to reason that you will be able to make good predictions for rims that also have three or four spokes, but that it will be hard to apply this model to something with 6 spokes.
If you have a really big database of lots of different designs and cannot remember what might be similar or dissimilar to what you have tried before, you can use probabilistic methods. You can see in the dials below that the prediction algorithm is slightly uncertain about the performance range. If this uncertainty goes up, it can usually be interpreted as: this is a very new design that the model needs to learn about first.
Finally, here comes the best and most exciting news about machine learning: once you implement the system, it keeps learning. So even if your new design cannot be predicted yet, it is only a matter of time until such designs can be predicted too. Just like engineers, your algorithms get better when something becomes repetitive, but unlike with engineers, this expertise is not lost when your expert changes jobs.
For classification problems, it helps to have n samples for each class. Some data scientists have a ‘rule of 10’ and say n should be at least 10, but n can be 100, 1,000 or more depending on the complexity of the problem. If you do not have equally balanced data (nobody really has that), a more detailed guide to classification on imbalanced data can be found here.
If this is too abstract, think of it like this: if you are trying to figure out whether a design is manufacturable, the answer is either yes or no. In my experience, this is the most common classification problem in product engineering, and it is usually imbalanced because companies tend to keep only their final successful designs and throw away their failed attempts. This is something you should change after deploying machine learning in your department. If you have 99 successful designs and only 1 failed one, it will be quite hard to figure out why that one failed. You would have a much easier time with 100 successful and 100 failed ones. Sidebar: you can still solve this problem by building a model that predicts what a manufactured design will look like based on the starting point, and then evaluating yourself whether it will fail or not. More information on how you can solve classification problems by treating them as regression problems is here.
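As a concrete, hedged sketch of the imbalanced manufacturability problem (all the data below is made up), scikit-learn also lets you re-weight rare classes rather than waiting until you have collected hundreds of failed designs.

```python
# Hedged sketch with made-up data: a "manufacturable yes/no" classifier where
# failed designs are rare, so we ask scikit-learn to re-weight the classes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 4))                    # 4 hypothetical design parameters
y = (X[:, 0] + 0.5 * X[:, 1] > 2.2).astype(int)   # rare "failed" class (a few %)

print("failed designs in database:", y.sum(), "of", len(y))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)
# Accuracy on the rare "failed" class is what matters most here.
print(clf.score(X_test[y_test == 1], y_test[y_test == 1]))
```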
3. WHAT ALGORITHM YOU USE
The type of algorithm you use makes a huge difference, and there are a lot of algorithms. In my experience, you do not need to know them all. What you need to know is:
Trust me, you can go far with this amount of knowledge (at least when using a platform like Monolith that does all the hard work for you).
Here is a frequently used graph showing that neural networks massively outperform other solutions as the amount of data grows.
And here is another frequently used graph in data science, a classic trade-off chart showing that if you want results that are both highly accurate and interpretable, a random forest is a good choice.
4. WHAT PROBLEM YOU WANT TO SOLVE
My favourite answer to the question "how much data do I need?" is another question: how good does your model need to be? The type of problem you want to solve in your company determines how much data you will need. The graphic below shows roughly how much data you might need (ballpark figures, to simplify things).
Let's look at those scenarios in detail:
FINALLY, SOME CHEAT TRICKS IF EVERYTHING ELSE FAILS
Here are three cheat tricks that most data scientists will use when they are unsure how much data they need: just try it with different models and have a look at the error! First, they train a model and look at the convergence of the fitting error. If this error decreases quickly, you probably have a simple problem; if it takes a long time, the problem is more complicated.
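A minimal sketch of this first trick, using scikit-learn's MLPRegressor on made-up data and the training-loss curve it records:

```python
# Hedged sketch: watching how quickly the fitting error converges, here using
# scikit-learn's MLPRegressor and its recorded loss curve on made-up data.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(seed=0)
X = rng.random((500, 3))
y = X[:, 0] ** 2 + np.sin(5 * X[:, 1]) + 0.05 * rng.normal(size=500)

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X, y)

loss = model.loss_curve_                     # training loss per iteration
print("iterations to converge:", len(loss))
print("initial / final loss:", round(loss[0], 4), "/", round(loss[-1], 4))
```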
The second, and even more reliable, way of checking whether you can create a good model from the data you have is to remove 20% of the data from your training data (this is called a train test split) and train a model only on the remaining 80%. You then use your model to predict the results for the 20% you did not use for training, which the model has therefore never seen. If the actual results and the model predictions lie on a straight line, you have a good model for unseen data, and that is what you want.
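Here is a hedged sketch of that second trick on made-up data; an R² close to 1 on the held-out 20% means the predicted-versus-actual points would lie close to a straight line.

```python
# Hedged sketch: hold out 20% of the data, train on the rest, and check how
# close predictions on the unseen 20% lie to the actual values.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(seed=1)
X = rng.random((400, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("R^2 on the unseen 20%:", round(r2_score(y_test, y_pred), 3))
```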
Last but not least, you can create a learning curve. This shows how much you have learned, from having no data up to the amount of data you currently have. In the plot below you can see that the error really drops up to about 60% of the available data and then stagnates, a sign that more data will not make your model better.
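And a sketch of this third trick, again on made-up data, using scikit-learn's learning_curve helper:

```python
# Hedged sketch: a learning curve on made-up data, showing how the validation
# score changes as more and more of the available data is used for training.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(seed=2)
X = rng.random((600, 4))
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=600)

sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(random_state=0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 6), cv=5,
)

for n, score in zip(sizes, val_scores.mean(axis=1)):
    # If the score stops improving, more data is unlikely to help.
    print(f"{n:4d} training samples -> validation R^2 = {score:.3f}")
```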
SUMMARY OF MAIN RULES: