How much data do I need to use AI for engineering products?

In this post, I want to summarise some practical considerations around the data you need to build an impactful AI model on engineering R&D data. I have tried to make this article useful both for beginners and for people with prior experience. I would love your comments on whether you find it useful, and any suggestions for improvement.

As machine learning is all about deriving relationships from data without explicitly defining them yourself, everybody's first question when starting is: how much and what data do I need? And the immediate follow-up question is: what will I be able to predict with this? Learning new things, predicting things for new products, or automating processes successfully are crucial to justify the time spent building an AI solution, so everybody wants to know these things up front.

In this article, I will try to give you as many useful insights as possible so you can answer this seemingly simple yet surprisingly difficult question! If this is too long for you, just reach out to info@monolithai.com and have one of the applied engineers at Monolith explain it to you in person. Like me, they love nothing better than chatting about engineering challenges :)

Alright, here are the 4 main principles you need to consider:

1. WHAT TYPE OF DATA YOU HAVE (Time series, flow fields, 3D designs, simple tabular data... In general, more complex data types like flow fields or CAD designs can contain much more information per sample, but can also be harder to work with.)

2. QUALITY OF YOUR DATA (If some classes or output ranges are rare, more data will be required to understand them)

3. WHAT ALGORITHM YOU USE (Deep Learning needs lots of data, Gaussian Processes can work with very little)

4. WHAT PROBLEM YOU WANT TO SOLVE (Understand trends, make predictions, perform optimisation, and how good your model needs to be to be useful)

We will now look at some concrete examples for each of these points.

1. WHAT TYPE OF DATA YOU HAVE

a) TABULARLY STRUCTURED DATA

Structured and labeled data in a simple tabular form is the format that is easiest to use for machine learning. It makes life super easy from a data scientist's point of view. Sadly, it requires the most pre-processing from subject matter experts to create. Moreover, it is also the least informative, as you only get one value per data point instead of an entire image or design. Lots of engineering companies now have data policy officers to make sure people save things so they can be computationally processed. If you haven't considered it yet, you should. Your company should not create any data that cannot be used for learning in 2021.

[Image: tabular data]

b) TIME SERIES

Raw time series data works quite well for machine learning in general, and it is often the format that comes off the sensors you put on engineering equipment. From identifying autopilots to performing predictive maintenance, the fact that a signal is sampled at a high frequency (like 100 samples per second) means you accumulate large amounts of data very quickly that you can learn a lot from. A common challenge when combining different signals is ensuring they all have the same sampling rate, but Monolith, for example, has a toolbox that resamples data automatically.

[Image: time series data]
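To give a rough idea of what that resampling step involves, here is a generic pandas sketch with invented sensor names and rates (this is not the Monolith toolbox, just an illustration of the idea):

```python
import numpy as np
import pandas as pd

# Two hypothetical sensor channels recorded at different rates.
t_fast = pd.date_range("2021-01-01", periods=1000, freq="10ms")    # 100 Hz accelerometer
t_slow = pd.date_range("2021-01-01", periods=100, freq="100ms")    # 10 Hz temperature probe

accel = pd.Series(np.random.randn(1000), index=t_fast, name="acceleration")
temp = pd.Series(20 + np.random.randn(100), index=t_slow, name="temperature")

# Resample both channels onto a common 10 Hz grid before combining them.
accel_10hz = accel.resample("100ms").mean()
temp_10hz = temp.resample("100ms").mean().interpolate()

combined = pd.concat([accel_10hz, temp_10hz], axis=1)
print(combined.head())
```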

c) CAD DATA

Hardly anybody in engineering today knows that you can directly process 3D mesh information. Partly, that is probably because almost nobody other than Monolith has built and commercialised tools that make it possible to do deep learning on CAD data. The team are real thought leaders in this field, and you can do incredible things with this new toolset. If this seems strange, just remind yourself that a design or a mesh is just a large number of (X, Y, Z) coordinates. In my opinion, 3D design deep learning is the most exciting new area of machine learning, with the largest potential to revolutionise engineering today. To back that up, I would point to how deep learning transformed image processing; you can find another article on that here.
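To make the "(X, Y, Z) coordinates" point concrete, here is a tiny sketch in plain NumPy (not Monolith's tooling) of how a triangulated mesh boils down to an array of vertex coordinates plus an array of face indices, which is exactly the kind of raw input a geometric deep learning model consumes:

```python
import numpy as np

# A toy triangulated mesh: a unit square built from two triangles.
vertices = np.array([     # one (x, y, z) row per mesh vertex
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
])
faces = np.array([        # each row indexes the three vertices of one triangle
    [0, 1, 2],
    [0, 2, 3],
])

# A 3D deep learning model sees the design as these arrays (or a point
# cloud sampled from them), not as a parametric CAD file.
print(vertices.shape, faces.shape)   # (4, 3) (2, 3)
```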


d) SIMULATION (CAE) DATA


You can also learn directly from simulation field data using deep learning algorithms. Again, this is hardly known among engineers, but it has actually become more prominent in 2021. DeepMind, Nvidia and Monolith are just some of the companies that have started research on this. While DeepMind and Nvidia are mostly interested in the benefits for computer graphics, video games, and movies, we at Monolith are excited about the possibilities it offers for modelling engineering systems. Imagine this: if I run 10 wind turbine simulations, create structured output data, and only learn from the drag coefficient, I end up with 10 points for a fairly complex design, which is really not a lot of data considering the complexity of flows. If instead I try to learn from the entire flow field, with each field containing 250,000 cells, I have more than 2.5 million points to learn from. Suddenly, 10 simulations are a lot of information given the complexity of the problem. Read more about how Siemens is using this approach for combustion chamber simulation.
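To illustrate the counting argument in code (a sketch with invented array shapes, not an actual CFD pipeline): learning from the scalar drag coefficient gives one training row per simulation, while learning the field itself gives one row per cell.

```python
import numpy as np

n_sims, n_cells = 10, 250_000

design_params = np.random.rand(n_sims, 5)          # e.g. 5 geometry parameters per design
cell_coords = np.random.rand(n_sims, n_cells, 3)   # (x, y, z) of every cell in every simulation
field_values = np.random.rand(n_sims, n_cells)     # e.g. pressure stored at each cell

# Option A: learn only a scalar output such as drag -> 10 training points.
drag = field_values.mean(axis=1)                   # stand-in scalar summary
X_scalar, y_scalar = design_params, drag           # shapes (10, 5) and (10,)

# Option B: learn the field itself -> one row per cell, 2.5 million training points.
X_field = np.concatenate(
    [np.repeat(design_params, n_cells, axis=0),    # design parameters repeated for every cell
     cell_coords.reshape(-1, 3)],                  # plus each cell's location
    axis=1)
y_field = field_values.reshape(-1)                 # shapes (2500000, 8) and (2500000,)

print(X_scalar.shape, X_field.shape)
```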

2. QUALITY OF DATA

Like with everything in life, quantity is nothing without quality. Let's say you are building a model to learn when a hard disc will fail. Collecting 1 million data points without a single failure will teach you nothing at all about when failures occur, whereas as little as 5 discs that actually failed for different reasons can teach you lots!

Here are a couple of tricks to better understand what high-quality data will mean for regression or classification problems. Don't know the difference? Read this

For regression/prediction problems, it is useful to get samples that cover your design space fairly evenly. See below two different designs of experiments (DoE) for simple tabular data. What I am plotting here is the knowledge we have about each part of the design space, in colour. Dark red means I know what is going on; blue means I am uncertain what will happen there. So you can see that for five blades, no designs or data points were created. (Makes sense: who has ever seen a five-bladed wind turbine?)

[Image: design of experiments with even coverage of the design space]

Here is another example showing a few randomly scattered points with big gaps in between them.

[Image: randomly scattered data points with large gaps between them]

So the conclusion here: for simple data, if you do not have information for a specific range, because you have never built a wind turbine with five blades or never tried to drive your car at 200 miles an hour, then your chance of getting good answers from machine learning is low.
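If you do get to choose where your samples go, for example when planning a batch of simulations, a space-filling design of experiments covers the design space much more evenly than random sampling. Here is a minimal sketch using SciPy's quasi-Monte Carlo module; the two design variables and their ranges are invented for the example:

```python
import numpy as np
from scipy.stats import qmc

# Two invented design variables, e.g. blade count and tip speed ratio.
lower, upper = np.array([2.0, 4.0]), np.array([4.0, 10.0])

# Space-filling Latin hypercube design: 20 points spread evenly over the space.
sampler = qmc.LatinHypercube(d=2, seed=0)
doe = qmc.scale(sampler.random(n=20), lower, upper)

# Purely random sampling for comparison: tends to leave clusters and large gaps.
rng = np.random.default_rng(0)
random_points = lower + rng.random((20, 2)) * (upper - lower)

print(doe[:5])
print(random_points[:5])
```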

Personally, I like picturing any problem as an unknown mountain range that I am trying to measure. I have to run around manually and take measurements to build a map, and after each measurement I conclude what I think the terrain looks like. If I get two measurements of the same height, I think the terrain in between must be flat, but there could still be a valley there. I don't know until I check. At NASA we used to say: there is no free lunch. Every problem requires a specific number of samples to understand it, and you won't get away without doing the work unless a) you have prior experience of what the problem looks like, or b) you have a really clever strategy for choosing your next measurement. Both a) and b) are fantastic mechanisms that make machine learning really powerful for engineering: a) is called transfer learning and b) active learning.
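Here is a rough sketch of what strategy b), active learning, can look like in practice: fit a probabilistic model to the samples you already have and take the next measurement where the model is most uncertain. This uses a toy 1D "terrain" function and a plain scikit-learn Gaussian process, not Monolith's implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


def terrain(x):
    """The unknown 'mountain range' we are trying to map (toy 1D example)."""
    return np.sin(3 * x) + 0.5 * x


candidates = np.linspace(0, 5, 200).reshape(-1, 1)    # locations we could measure next

# Start with three measurements, then keep sampling where uncertainty is highest.
X = np.array([[0.5], [2.5], [4.5]])
y = terrain(X).ravel()

for step in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
    gp.fit(X, y)
    mean, std = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(std)]               # most uncertain location
    X = np.vstack([X, [x_next]])
    y = np.append(y, terrain(x_next))

print(f"Sampled {len(X)} points; largest remaining uncertainty: {std.max():.3f}")
```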

Below is a really old GIF I created when I was a fellow at Imperial College London, where I was trying to learn an unknown surface one sample at a time. You can see that the mathematical conclusions are completely wrong at first but then get better.

[Animation: learning an unknown surface one sample at a time]

As engineering is mostly about different designs and not single data points, below is a more practical example that can help you tune your intuition about which designs you can make predictions for. Let's assume you have a database of rims with two or three spokes. It stands to reason that you will be able to make good predictions for rims that also have three or four spokes, but that it will be hard to apply this model to something with six spokes.

[Image: rim designs with different numbers of spokes]

If you have a really big database with lots of different designs and cannot remember what might be similar or dissimilar to what you have tried before, you can use probabilistic methods. You can see in the dials below that the prediction algorithm is slightly uncertain about the performance range. If this uncertainty goes up, it can usually be interpreted as: this is a very new design that I need to learn about first.

[Image: prediction dials showing performance estimates with uncertainty ranges]

Finally, here comes the best and most exciting news about machine learning: once you implement the system, it starts learning. So even if your new design cannot be predicted yet, it is only a matter of time until such designs can be predicted. Just like engineers, your algorithms get better when something becomes repetitive, but unlike with engineers, this expertise is not lost when your expert changes jobs.

For classification problems, it helps to have n samples for each class. Again, some data scientists have a 'rule of 10' and say n should be at least 10, but n can be 100, 1,000 or more depending on the complexity of the problem. If you do not have equally balanced data (nobody really has that), a more detailed guide to classification on imbalanced data can be found here.

If this is too abstract, think of it like this: if you are trying to figure out whether a design is manufacturable, then the answer is either yes or no. In my experience, this is the most common classification problem in product engineering, and it is usually imbalanced because companies tend to only keep their final, successful design and throw away their failed attempts. This is something you should change after deploying machine learning in your department. If you have 99 successful designs and only 1 failed one, it will be quite hard to figure out why that one failed. You would have a much easier time if you had 100 successful and 100 failed ones. Sidebar: you can still solve this problem by building a model that predicts what a manufactured design will look like based on the starting point, and then evaluating yourself whether it will fail or not. More information on how you can solve classification problems by treating them as regression problems is here.
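As a small, generic sketch of how to at least soften that imbalance in code (invented feature names, standard scikit-learn only, and no substitute for actually collecting more failed designs): you can weight the rare class more heavily and score the model with a metric that is not fooled by imbalance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 99 manufacturable designs and only 5 failed ones (features are invented).
X = rng.normal(size=(104, 4))          # e.g. wall thickness, draft angle, fillet radius, material index
y = np.array([1] * 99 + [0] * 5)       # 1 = manufacturable, 0 = failed

# class_weight="balanced" up-weights the rare failures during training.
model = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=0)

# Score with balanced accuracy rather than plain accuracy: always predicting
# "manufacturable" would already score ~95% plain accuracy and tell you nothing.
scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
print(scores.mean())
```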

3. WHAT ALGORITHM YOU USE

The type of algorithm you use makes a huge difference, and there are a lot of algorithms. In my experience, you do not need to know them all. What you need to know is:

  • If you have very little data and want good uncertainty estimates, use a Gaussian process.
  • If you have lots of data and do not know what matters, use a neural network.
  • If you have very mixed data and want to be able to interpret the results, use a random forest.

Trust me, you can go far with this amount of knowledge (at least when using a platform like Monolith that does all the hard work for you).
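For readers who want to see what those three rules of thumb look like in code, here is a minimal scikit-learn sketch (the data is random filler; in Monolith you get these models without writing code):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 6))                              # 6 invented design parameters
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.05 * rng.normal(size=200)

# Very little data + uncertainty estimates -> Gaussian process.
gp = GaussianProcessRegressor(normalize_y=True).fit(X[:30], y[:30])
mean, std = gp.predict(X[180:], return_std=True)      # std tells you how sure the model is

# Lots of data, unknown structure -> neural network.
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0).fit(X, y)

# Mixed data + interpretability -> random forest (feature importances show what matters).
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X, y)
print(rf.feature_importances_)
```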

Here is an often-used graph showing that neural networks massively outperform other solutions as the amount of data grows.


[Image: model performance vs. amount of data for different algorithms]

And here is another frequently used graph in data science, a classic trade-off chart showing that if you want results that are both highly accurate and interpretable, you should use a random forest.

[Image: accuracy vs. interpretability trade-off for different algorithms]

4. WHAT PROBLEM YOU WANT TO SOLVE

My favourite answer to the question "how much data do I need?" is another question: how good does your model need to be? The type of problem that you want to solve in your company determines how much data you will need. The graphic below shows how much data you might need (rough ballpark figures to simplify things).

[Image: rough data requirements for different use cases]

Let's look at those scenarios in detail:

  1. 10 designs, simulations or test results: All you want to do is make sure that the next engineer you hire has access to the previous design files and information you gathered, so they get up to speed faster. This can work with 3 data points or designs. To give a simple example: let's say an aerodynamicist has just joined a racing company and needs to design a spoiler for a new car. The first thing they will do is look at the last 3 cars the company built and learn from them. You can make this search-and-learn process a lot easier by saving the data in interactive 3D and functional data dashboards rather than having people dig through old folders. You could also solve this by creating 3 very neat folders and making sure the data in them is structured so that people can compare future cars, so there is no real need for AI here.
  2. 50 designs, simulations or test results: You want to use insights from product testing or development to help your engineers make better decisions faster. You have noticed that there is considerable repetition, and from the last 50 projects you have done, you can learn quite a few things using algorithmic methods. You can detect correlations, look at failure scenarios, and build simple models to recommend what to test next. At this data level, algorithms can be a nice extension of the engineering expertise you already have.
  3. 150 designs, simulations or test results: You can build an AI model to predict the result of tests or physical simulations. For example, you could predict the performance of a rim in a wind tunnel test, predict the maximum stress in a suspension system, etc. The predictions will not be great if the problem is hard, and at this data level we mostly see people use AI for faster decision-making at the pre-design stage, building models on their own test data that are not biased by simplifying physical assumptions.
  4. 250 designs, simulations or test results: You can build recommender systems that deliver really useful insights into what other solutions you could try. You can run targeted optimisation codes that tell you how to design things differently. This is the typical size of a design of experiments for optimisation studies based on CAE models, so we tend to see a lot of those at this data level.
  5. 500 designs, simulations or test results: You can build AI models that predict the outcome of repetitive processes with good accuracy, sometimes as good as or even better than simplified physical simulations. This tends to be really beneficial for companies, as they end up saving a lot of time and money on performing tests or running simulations once they can prove out the use of an AI model for this scenario.
  6. 1000 designs, simulations or test results: You can build fully automated workflows. This is every engineering CIO's or CTO's dream. Imagine this: a customer provides your team with their requirements, you enter them into an online form, and an AI algorithm goes into your PDM system and creates a new product or component for these requirements fully automatically. I have seen this work for repetitive components like sealing solutions, pumps, bearings, etc., in general for suppliers who create many thousands of versions of the same component every year.

FINALLY, SOME CHEAT TRICKS IF EVERYTHING ELSE FAILS

Here are three cheat tricks that most data scientists use when they are unsure how much data they need. Just try it with different models and have a look at the error! The first is to train a model and look at the convergence of the fitting error. If this error decreases quickly, you probably have a simple problem; if it takes a long time, it is more complicated.

[Image: convergence of the training error]

The second, and even more reliable, way of checking whether you can create a good model from the data you have is to remove 20% of your data from the training data (this is called a train/test split) and only train a model on the remaining 80%. You then use your model to predict the results of the 20% that was not used for training, and which the model has therefore never seen. If the actual results and the model predictions lie on a straight line, you have a good model for unseen data, and that is what you want.

[Image: predicted vs. actual values for held-out test data]
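In code, that check is only a few lines. A generic scikit-learn sketch with placeholder data (the model type and the synthetic test results are just for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((150, 5))                                        # placeholder design parameters
y = 3 * X[:, 0] - X[:, 1] ** 2 + 0.1 * rng.normal(size=150)     # placeholder test results

# Hold back 20% of the data that the model never sees during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)

# If predictions vs. actual values lie close to a straight line, R^2 is close to 1.
print("R^2 on unseen data:", r2_score(y_test, predictions))
```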

Last but not least, you can create a learning curve. This shows you how much you have learned, going from having no data to the amount of data you currently have. In the plot below, you can see that the error really decreased up to about 60% of the available data and then stagnated, a sign that more data will not make your model better.

[Image: learning curve showing error vs. fraction of data used]
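scikit-learn has a helper that produces exactly this kind of curve. Another generic sketch with placeholder data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(0)
X = rng.random((300, 5))                                        # placeholder design parameters
y = np.sin(4 * X[:, 0]) + X[:, 1] + 0.1 * rng.normal(size=300)  # placeholder test results

# Train on growing fractions of the data and score each model on held-out folds.
sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(n_estimators=200, random_state=0),
    X, y, train_sizes=np.linspace(0.1, 1.0, 8), cv=5, scoring="r2")

for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:4d} training samples -> validation R^2 = {score:.2f}")

# If the curve flattens well before you use 100% of your data, more of the
# same kind of data is unlikely to make the model better.
```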

SUMMARY OF MAIN RULES:

  • More complexity means you need more data.
  • Use your company's experience: if your engineers think something is learnable from their experience, it probably is; if not, deep learning can still work, but only with a huge amount of data.
  • AI means learning from experience: if you only have dogs to learn from and want to predict cats, it won't work.
  • The most bulletproof method is to run a quick test. Monolith is built so you can do a POC in less than a day of work.









