9-Step Guide to Building Machine Learning Models
In this article, I will walk you through the process of building machine learning models. I will first describe the entire process theoretically. In a subsequent project, we will implement part of this process, covering the main activities of a machine learning engineer or an AI engineer.
This article aims to align terminology and ensure you clearly understand the process so we can work on more advanced activities in the future.
The process of building machine learning models can be divided into these main stages:
1. Domain Understanding
2. Data Collection and Preparation
3. Data Exploration and Analysis
4. Feature Selection and Engineering
5. Model Building and Evaluation
6. Hyperparameter Optimization
7. Cross-Validation
8. Model Deployment
9. Iteration and Continuous Improvement
Of course, each of these stages can be broken down into sub-steps, but what is presented here covers the main macro-stages.
Each stage involves a set of strategies, techniques, tools, and procedures that depend on the business problem, the final objective, the available computational capacity, and so on. What I will do now is describe each stage, comment on some particularities, potential problems, and tools, so that we can later put it all into practice.
1. Domain Understanding
The first step in the process of building a machine learning model is domain understanding.
The first step in any machine learning or data science project is knowing where you are going, what your destination is, and what you want to deliver. Therefore, you need to clearly define the business problem you intend to solve.
To define the business problem, it is crucial to have domain understanding.
The process of building machine learning models starts with a deep understanding of the domain or the field of study in question. This involves diving into the specifics of the area, understanding the nuances, challenges, and specific issues the model aims to address.
This step is FUNDAMENTAL as it provides the necessary context for all subsequent decisions in the project lifecycle, from data collection to model deployment.
For example, your company wants to forecast sales for the next period. That is one problem: you need to understand the domain, go through the process, and deliver the result. Now, your company has another need; it wants to predict whether a customer will sign up for a subscription plan. That is another problem: a new domain to understand, the process to go through again, and a new result to deliver, and so on.
This means that the company can have several models operating in its day-to-day activities.
2. Data Collection and Preparation
There is nothing to do in machine learning if you do not have the raw material.
Everything we do in machine learning is to train the algorithm with the data. If you don’t have the data, what exactly are you going to train?
If your company is not concerned with data, is not collecting, storing, and managing it, your company is far behind today's standards, since we already live in a data-driven world.
When the data is available, we can start considering the use of machine learning for the company's processes. Obviously, the previous step involves data management, which includes, for example, data collection and preparation.
Once the domain is understood, the next step is to collect and prepare the data that will feed the model. This involves identifying data sources, gathering the necessary datasets, and performing cleaning and preprocessing tasks to ensure the data is consistent, relevant, and free from errors or unwanted noise.
This involves a huge amount of work and is likely an assignment for a data engineer, because there are many possibilities here: different storage systems, different data sources, and tools to collect, clean, process, and store the data. Without this, we cannot even move forward.
Collecting and preparing data must be part of the daily routine of a company, any company. Because data analysis and machine learning apply to companies of any segment, as long as you have raw material. Therefore, this is a fundamental activity in the process of building machine learning models.
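To make this concrete, here is a minimal sketch of initial collection and preparation in Python with pandas. The file name and the column names (such as `quantity`) are hypothetical; real pipelines will vary a lot depending on the data sources.

```python
import pandas as pd

# Hypothetical raw sales data exported from some source system
df = pd.read_csv("raw_sales.csv")

# Basic cleaning: drop exact duplicates and standardize column names
df = df.drop_duplicates()
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Remove obviously invalid records (example: negative quantities)
df = df[df["quantity"] >= 0]

# Persist the prepared dataset for the next stages
df.to_csv("sales_prepared.csv", index=False)
```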
Once you have collected and performed some initial data preparation, we can explore and analyze the data.
3. Data Exploration and Analysis
The stage of data exploration and analysis is generally not the responsibility of a machine learning engineer or an AI engineer. This can be done by a data analyst or even a data scientist.
A machine learning engineer will probably work on the infrastructure part to make the model available, version control, creating pipelines, which is already a lot of work.
The AI engineer will be working on more advanced AI tasks, so, by elimination, more basic tasks like data exploration and analysis usually end up with the data analyst.
With the data collected and prepared, it is essential to explore and analyze the data to understand its structure, distribution, and patterns. This can involve data visualization, calculating descriptive statistics, and identifying potential anomalies or outliers.
This phase helps lay the groundwork for the subsequent steps, ensuring that the model is built on a solid foundation. In this stage we have various activities, techniques, and procedures for exploring the data and detecting problems such as missing values. We then apply a cleaning technique, choosing the ideal one according to the dataset and to what we will do next. In other words, I use techniques that let me look at the data, identify potential problems, and apply corrections as necessary.
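As a minimal exploration sketch with pandas, assuming the prepared dataset from the earlier stage (the `final_price` column is illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales_prepared.csv")

# Structure and data type of each attribute
df.info()

# Descriptive statistics for the numeric columns
print(df.describe())

# Missing values per column
print(df.isnull().sum())

# Quick look at the distribution of a numeric column
df["final_price"].hist(bins=30)
plt.show()
```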
4. Feature Selection and Engineering
When you extract, prepare, analyze, and explore the data, the result is usually some kind of table.
The rows represent the observations of the event depicted in the data, and each column represents an attribute or variable.
Imagine product sales; each row of this table would be a sale, and each column would be an attribute inherent to the sale, such as: customer name, salesperson name, product sold, quantity of products sold, final price, discount, promotion, and so on.
For the rows, you need to do verification work, checking for repetition or duplication, and later split these rows into samples when we work on building and evaluating the models.
We also need to pay attention to the attributes, and this is where feature selection and engineering come in. Not all attributes or characteristics of the data may be relevant or useful for modeling.
Feature selection involves choosing the most informative attributes, while feature engineering involves creating new attributes from existing ones to better represent patterns in the data.
Do I really need this attribute to create the model?
Well, I don’t know… I need to do a selection job and verify the necessity. Interestingly, at this stage, we can use machine learning itself for feature engineering.
Ironically, I’m writing here about the entire machine learning process, which, within its own process, can use machine learning in one of its stages. Machine learning doesn’t have to be just the end; it can also be a means.
I can use a machine learning algorithm to identify the most relevant variables and discard those with less relevance, keeping only what is important in order to later build a more accurate model, as in the sketch below.
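A minimal sketch of model-based feature selection with scikit-learn, assuming X is a pandas DataFrame of features and y is the target (all names are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Fit a tree ensemble and keep only features whose importance is above the mean
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=42))
selector.fit(X, y)

# Names of the columns that survived the selection
selected_columns = X.columns[selector.get_support()]
print(selected_columns)

# Reduced dataset used for the next stages
X_selected = X[selected_columns]
```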
Additionally, I may need to create new variables to better adjust the information represented in the available data. For example, imagine I have a column with the client’s age. Is the age of each client relevant? In most cases, it’s not! What is more relevant is the age group, so I can create a new column called age group, check the client’s age, see which group it falls into, and then assign it to each of the groups. I remove the age column and keep only the new age group column. This is a very common example of feature engineering.
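As a sketch of that exact example, assuming a pandas DataFrame with a hypothetical `age` column:

```python
import pandas as pd

# pd.cut assigns each age to a labeled band
bins = [0, 18, 30, 45, 60, 120]
labels = ["0-18", "19-30", "31-45", "46-60", "60+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Keep the engineered feature and drop the raw age column
df = df.drop(columns=["age"])
```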
Up to and including the next stage we will see, the data scientist is typically the responsible party. From then on, the machine learning engineer usually takes over. This is not a fixed rule; it depends on each company, on the level of the professional, and on whether they know the entire process. Some companies prefer to segment the work, with professionals dedicated solely to model building and others only to deployment, especially if the company has many models in its daily operations.
Regardless of all that, we have this feature selection and engineering stage, which is almost a science in itself.
5. Model Building and Evaluation
The next stage in the process of building machine learning models is the actual model building and evaluation.
This is where I will actually build the model, meaning that of the entire process, the building itself is just a small part.
In fact, in the previous stages, you can easily spend 50, 60, 70% of the project time. I consider the total project time, from problem conception to delivery at deployment.
50 to 70% can be in the stages before the actual model building and evaluation because there is a huge amount of work there.
We have alternatives, automation processes, and everything is evolving faster and faster, but there is still a lot of work in the previous stages. Until we reach model building and evaluation, which is an entire universe in itself.
In this stage, machine learning models are built using the prepared data. Once built, it is crucial to evaluate their performance on test datasets to understand their accuracy, robustness, and generalization.
Your company has asked you to build a model capable of predicting whether a customer will cancel a subscription. You will then search for historical data, train and prepare the model, and deliver the result with a deployment. The model will receive data on new customers and provide the prediction of whether they will cancel the subscription or not.
We have already gone through domain understanding, data collection and preparation, data exploration and analysis, and feature selection and engineering.
Now we build the model. Do you know beforehand which algorithm is ideal for this example I just mentioned? We cannot know which algorithm is ideal beforehand. This is where the science comes in again. We need to experiment.
Typically, we have a classification problem. We predict yes or no.
So, at least, we can narrow down the possibilities. But how many classification algorithms do we have today? Dozens. Many algorithms. Which one is ideal? I have no idea. But I will experiment.
With previous experience, you will probably start with logistic regression, maybe create a classification tree, perhaps an ensemble method, or even an artificial neural network for classification.
We need to experiment and build experience. And note that for each model built with each algorithm, there is still the next stage, hyperparameter optimization. In other words, you need to know the possibilities each algorithm offers. You can easily create 10, 15, or 20 versions of the same model until you reach the near-ideal one.
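A minimal experimentation sketch with scikit-learn, assuming a feature matrix X and a binary churn target y (all names and the choice of candidates are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(random_state=42),
}

# Train each candidate and compare performance on held-out data
for name, model in candidates.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name, accuracy_score(y_test, predictions))
```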
For example, you have a historical dataset of customers. You have the customer's age, salary, gender, and other information. You search for historical data of customers who canceled and who did not cancel the subscription. You train the model. With the model trained, you provide input data for new customers, and it will predict whether they will cancel the subscription or not, right?
So, do you know a mathematical formulation, an equation, for example, that establishes this relationship between input and output data, customer attributes, and whether they canceled or not? Is there a ready mathematical formulation for this?
We use machine learning because we do not know the mathematical formulation of this relationship between input and output data, or just input data if it’s unsupervised learning. We do not know this mathematical relationship. That’s why we use machine learning. So we can automate this process and find the near-ideal model.
I will never know the exact relationship in the data, but I don't need to, as long as I have an approximation. Machine learning gives you an approximation. With machine learning, there will never be 100% accuracy, because that doesn't exist.
If there is 100% accuracy, it is usually an indication of a problem, not something positive: the model has overfit, learning the training data too closely during the training phase.
What we deliver as a model is an approximation, which is already very good and works as long as you have a good volume of data and perform a good machine learning model construction process.
6. Hyperparameter Optimization
We built and evaluated models. But are they the best possible models? Well, I don’t know. I can only find out if I create different versions.
Each machine learning algorithm comes with hyperparameters. To understand this in a very didactic way, imagine, for example, a television. If two different people buy the same TV, each may have completely different images, right? Because each will make different adjustments. So, this is the idea of hyperparameters.
I can use logistic regression, which is an algorithm, and you can also use logistic regression. I can have one final result, and you can have another. Simply because of the changes we made to the hyperparameters.
Many people get confused by this. When you are training the model, what the model is learning is a set of numbers: the coefficients, the parameters.
What the model learns during training is what we call parameters. What helps the model in training are the hyperparameters.
In practice, these hyperparameters are parameters of functions, in Python for example. In computer programming, we create functions, which are blocks of code, and these functions can receive parameters; that is the terminology used in programming.
Well, the hyperparameters are exactly the parameters of those functions, since your algorithms are implemented via computer programming, through functions.
A hyperparameter is what helps the model learn. In the case of logistic regression, for example, we can choose among different calculation methods, such as the solver; that choice is a hyperparameter.
That is, the hyperparameter must be defined beforehand so the model can be trained and then learn its parameters.
Hyperparameter optimization involves experimenting with different values to find the combination that produces the best model performance.
Do you know beforehand what the best hyperparameters are? No, you don’t. So, what do you have to do? You have to experiment.
Hyperparameter optimization can be the job of a data scientist or already be the assignment of a machine learning engineer, including this within the CI/CD pipeline.
The fact is, if you want the best possible model, it's not enough to just collect and prepare data, explore and analyze it, select and engineer features, and choose the best algorithm. You also need to tune the hyperparameters, especially for more complex algorithms such as deep learning.
This is a task that also involves several techniques and procedures. You can automate a good part of this work, but it is essential for having a well-tuned model, a model with high accuracy, a stable model. It is a fundamental part of the entire process.
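A minimal hyperparameter search sketch using scikit-learn's GridSearchCV, assuming the same X and y as before (the grid values are just illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Candidate hyperparameter values to try in combination
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,                 # 5-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```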
7. Cross-Validation
Cross-validation could be considered a “sub-step” in the model building and evaluation phase, as well as in hyperparameter optimization.
However, I chose to present it here as a unique stage because it is often included in CI/CD pipelines when working with MLOps, which is a function of a machine learning engineer.
Cross-validation is a technique used to evaluate the generalization capability of a model. It divides the dataset into several partitions and trains and tests the model on different combinations of these partitions to ensure that the model performs well on unseen data.
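A minimal sketch of k-fold cross-validation with scikit-learn, again assuming the X and y from earlier:

```python
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Train and evaluate the model on 5 different train/test partitions
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy")

print(scores)          # one score per fold
print(scores.mean())   # average estimate of generalization
```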
Generalization Capability
But what exactly is this generalization capability? This is the main objective we want in machine learning. We do not want the model to learn the specific pattern of the data we used for training.
We want the model to look at the data pattern and establish a mathematical generalization. If it achieves this, we can use the model to make predictions from new data. If it does not achieve this, we will likely face issues such as underfitting and overfitting.
In other words, we do not want the model to learn the detail of each data point. This is bad.
What we want is for the model to look at the data in general and understand what the mathematical generalization of the data is.
How can the relationship of the data be explained generally? This is what we want. If we achieve this, then you just need to provide new data to the model, and it will be able to make predictions.
So, this is the main goal of performing all the previous work until reaching cross-validation in the model preparation.
Cross-validation helps to check the generalization capability of the model, which is our main goal. And if you create a CI/CD pipeline, this will likely be part of your pipeline.
8. Model Deployment
We have reached the model implementation stage, also known as deployment.
At this point, we expect a stable model with good generalization capability, a model that has been tested, experimented with, and evaluated with different metrics, a model that is close to ideal.
We then move this model forward in the production line and put it to work to solve the problem it was created for. This is one of the main responsibilities of a machine learning engineer, usually also a task for an AI engineer.
Once the model is finalized and optimized, it is implemented or put into production, which is basically the synonym we use for deployment. This means that the model is made available in an environment that can receive new data, process it, and provide predictions or insights in real time.
And what exactly would this environment be? It depends on how the company wants to offer its model. Will it create a web application? An application for smartphones? Create an API, make a call from another application? Generate predictions in CSV format, then load it into Power BI to create a dashboard? It depends on how the company wants to deliver it.
There is no rule. It depends on what the company wants to do. How does it want to use the model? This question needs to be asked.
If the company does not know, then we can offer alternatives. Can that model be placed on the company’s web page? Will it be used on the internet or intranet? Is it just an API that we need to feed another existing application? What are the possibilities? Well, this is where we will define how we will do the deployment.
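As one possible deployment path, here is a minimal API sketch using FastAPI and a model saved with joblib. The framework choice, the artifact file name, and the input fields are all assumptions made just for illustration; the real interface depends on how the company wants to deliver the model.

```python
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Hypothetical model artifact produced at the end of training
model = joblib.load("churn_model.joblib")

class Customer(BaseModel):
    age: int
    salary: float
    months_as_customer: int

@app.post("/predict")
def predict(customer: Customer):
    features = [[customer.age, customer.salary, customer.months_as_customer]]
    prediction = model.predict(features)[0]
    return {"will_cancel": bool(prediction)}
```

You would then run this with a server such as uvicorn, and other applications would call the /predict endpoint to get predictions for new customers.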
When deploying, we have small details that need to be considered.
Any transformation applied to the training data must be applied to the test data and new data.
So, everything that was done earlier during exploration, analysis, cleaning, and feature engineering, everything that somehow modified the training data or its format, will have to be done again when new data is delivered to the model.
Therefore, we should not overdo it in feature engineering. This will backfire on us later in the process. Oh, but I need many new variables… Will this improve the model? Remember that the deployment will be impacted later!
It will make the model slower in delivering the prediction. The pipeline will have to be executed to prepare the data. It will also be slower. All this needs to be considered. These are not isolated steps. It is important to make this clear. All the steps need to be cohesive. Because one step impacts the next step.
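One common way to keep training-time and prediction-time transformations consistent is to bundle the preprocessing and the model together in a scikit-learn Pipeline. A minimal sketch, assuming numeric features that need scaling and the hypothetical X_train, y_train, and X_new from before:

```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The scaling learned from the training data is reapplied to any new data
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline.fit(X_train, y_train)

# Persist preprocessing and model together as a single artifact
joblib.dump(pipeline, "churn_model.joblib")

# At prediction time, raw new data goes through the exact same transformations
predictions = pipeline.predict(X_new)
```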
9. Iteration and Continuous Improvement
Once you have deployed the model, it will be implemented and start delivering predictions. For example, it will predict whether a customer will cancel a subscription or not.
This will help solve the business problem, and the company will take the necessary actions according to the predictions. Probably, your work is done there. But you will need to monitor the model.
It is very likely that new data will be available. So, you may need to retrain the model. Over time, the model may lose its predictive capability. Why? Because the data pattern may change!
Remember what we talked about earlier in model building and evaluation? What does the algorithm do? It detects a pattern in the data if it exists.
So, it detects the pattern, learns the mathematical generalization, and the model is created. Do you agree with me that customer behavior patterns can change over time?
We had the pandemic. Didn’t the pandemic cause a change in customer behavior? Yes, it did. So, the model created before the pandemic probably showed poor results during the pandemic. Why? Because the data pattern changed.
So, do I have to keep retraining the model all the time? I wouldn’t say all the time, but with some frequency, certainly.
You might need to retrain the model once a week, once a month. More than that, it starts to get complicated depending, of course, on the business area, the data volume, etc. But the fact is, the data pattern can change.
This is the concept of data drift and model drift: the pattern in the data changes. Did the pattern change? Then you need to go and adjust the model as necessary. And this is part of the iteration and continuous improvement stage.
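A minimal sketch of one possible data drift check, comparing the distribution of a single feature between the training data and recent production data with a Kolmogorov-Smirnov test (the DataFrames, the feature name, and the threshold are illustrative):

```python
from scipy.stats import ks_2samp

# Compare the training-time distribution of a feature with recent production data
statistic, p_value = ks_2samp(train_df["salary"], recent_df["salary"])

# A very small p-value suggests the distributions differ, i.e. possible drift
if p_value < 0.01:
    print("Possible data drift detected; consider retraining the model.")
else:
    print("No strong evidence of drift for this feature.")
```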
Building a machine learning model is not a one-time process. As new data becomes available, or the domain needs change, it is essential to revisit, re-evaluate, and, if necessary, refine or rebuild the model to ensure it remains relevant and effective.
Now you can go home and rest because the process of building models is complete.
Thank you. 🐼❤️