5 Procedures to tackle a ML Problem: Just an overview
Hello folks!! This is my first article ever. No experience, nothing. That’s how I started my journey into the deep ocean of Artificial Intelligence a couple of years back. In the beginning it’s very hard to control your adrenaline level. I used to go after any random algorithm to build a machine learning model. I was very excited about new and fancy algorithms, but that is not how it goes. There is a proper way to build a model and work with the algorithms. Today, I will talk about how to approach a machine learning problem. When you see a new ML problem in the wild, you might be tempted to jump ahead and throw your favorite algorithm at the problem — wait, no em-dashes — perhaps the one you understood best or had the most fun implementing. But knowing beforehand which algorithm will perform best on your specific problem is often not possible.
Instead, you need to take a step back and look at the big picture. Before you get in too deep, you will want to make sure to define the actual problem you are trying to solve. For example, do you already have a specific goal in mind, or are you just looking to do some exploratory analysis and find something interesting in the data? Often, you will start with a general goal, such as detecting spam email messages, making movie recommendations, or predicting the next PM of our country. However, there are often several ways to solve a problem. For example, we can recognize handwritten digits using logistic regression, k-means clustering, or deep learning. Defining the problem will help you ask the right questions and make the right choices along the way. You can follow the five-step procedure below to approach machine learning problems in the wild:
1. Categorize the problem: This is a two-step process:
· Categorize by input: Simply speaking, if you have labeled data, it’s a supervised learning problem. If you have unlabeled data and want to find structure, it’s an unsupervised learning problem. If you want to optimize an objective function by interacting with an environment, it’s a reinforcement learning problem.
· Categorize by output: If the output of your model is a number, it’s a regression problem. If the output of your model is a class (or category), it’s a classification problem. If the output of your model is a grouping of the inputs, it’s a clustering problem.
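To make the first two categories concrete, here is a minimal sketch (not from the original article, using scikit-learn and a synthetic dataset): the same inputs can be treated as a supervised classification problem when labels are available, or as an unsupervised clustering problem when they are not.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: X holds the inputs, y holds the labels
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Labeled data (X, y) -> supervised learning; class output -> classification
clf = LogisticRegression().fit(X, y)
print(clf.predict(X[:3]))

# Same inputs without labels -> unsupervised learning; group output -> clustering
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:3])
```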
2. Find the available algorithms:
Now that you have categorized the problem, you can identify the algorithms that are applicable and practical to implement using the tools at your disposal. Microsoft has created a handy algorithm cheat sheet that shows which algorithms can be used for which category of problems. Although the cheat sheet is tailored towards Microsoft Azure, it is still a useful starting point; you can check it out here: http://aka.ms/MLCheatSheet
3. Implement all of the applicable algorithms (prototyping):
For any given problem, there are usually a handful of candidate algorithms that could do the job. So how do you know which one to pick? Often the answer is not straightforward, so you have to resort to trial and error. Prototyping is best done in two steps:
· You should aim for a quick-and-dirty implementation of several algorithms with minimal feature engineering. At this stage, you should mainly be interested in seeing which algorithm behaves better at a coarse scale. This step is a bit like hiring: you’re looking for any reason to shorten your list of candidate algorithms. Once you have reduced the list to a few candidates, the real prototyping begins.
· Ideally, you would want to set up a machine learning pipeline that compares the performance of each algorithm on the dataset using a set of carefully selected evaluation criteria. At this stage, you should only be dealing with a handful of algorithms, so you can turn your attention to where the real magic lies: feature engineering.
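The quick-and-dirty comparison step might look like the following sketch (my illustration, not from the article), where a few candidate classifiers with default hyperparameters are scored on the same dataset via cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# Quick-and-dirty comparison: default hyperparameters, no feature engineering
candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```

Any model that scores far below the others here is a reason to shorten the candidate list.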
Evaluating a model:
Model evaluation strategies come in many different forms and shapes. Three of the most commonly used techniques to compare models against each other are:
· K-folds cross-validation
· Bootstrapping
· McNemar’s test
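As a concrete illustration of the first of these techniques (my own sketch, not from the article), here is k-fold cross-validation done by hand: the data is split into k folds, and the model is trained on k-1 folds and scored on the held-out one, rotating through all folds.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# K-fold CV by hand: train on k-1 folds, test on the held-out fold, rotate
kf = KFold(n_splits=5, shuffle=True, random_state=0)
accuracies = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print(f"mean accuracy: {np.mean(accuracies):.3f} +/- {np.std(accuracies):.3f}")
```

The spread across folds (the +/- part) is what lets you compare two models beyond a single lucky train/test split.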
4. Feature Engineering:
Perhaps even more important than choosing the right algorithm is choosing the right features to represent the data. The process of finding the best way to represent our data is known as feature engineering, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems. Feature engineering can be split into stages:
· Feature selection: This is the process of identifying the attributes (or features) in the data that are worth keeping. Take an image, for example: possible features might be the locations of edges, corners, or ridges.
· Feature extraction: This is the actual process of transforming the raw data into the desired feature space used to feed a machine learning algorithm. An example would be the Harris operator, which allows us to extract corners in an image.
I would say feature engineering is an art. You must spend time here and apply your creativity.
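For feature selection on tabular data (a simpler setting than the image example above), a common starting point is univariate selection: score each feature against the target and keep only the top k. A minimal sketch using scikit-learn's `SelectKBest` (my illustration, not from the article):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
print(X.shape)  # 30 raw features

# Keep the 10 features that best discriminate the target (ANOVA F-test)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)
```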
5. Optimize hyperparameters:
Finally, you also want to optimize an algorithm’s hyperparameters. Examples might include the number of principal components in PCA, the parameter k in the k-nearest neighbors algorithm, or the number of layers and the learning rate in a neural network. You can look into a few of the techniques mentioned below:
· Tuning hyperparameters with grid search: here are some steps you can look into:
1. Implementing a simple grid search
2. Understanding the value of a validation set
3. Combining grid search with cross-validation
4. Combining grid search with nested cross-validation
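Steps 1 and 3 above can be sketched together with scikit-learn's `GridSearchCV` (my illustration, not from the article): every combination in the parameter grid is scored by cross-validation on the training set, and the final test set is touched only once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search with cross-validation: each (C, gamma) pair is scored by 5-fold CV
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("test score:", grid.score(X_test, y_test))
```

Note that the held-out test set plays the role of the validation set from step 2: it never influences the choice of hyperparameters.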
· Scoring models using different evaluation metrics
1. Choosing the right classification metrics: you can read about accuracy, precision, and recall.
2. Choosing the right regression metrics: you can read about mean squared error, explained variance, and R squared.
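The metrics named above are all one-liners in scikit-learn. A small sketch on toy predictions (my illustration, not from the article):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             mean_squared_error, r2_score)

# Classification metrics on toy predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print("accuracy: ", accuracy_score(y_true, y_pred))   # fraction correct
print("precision:", precision_score(y_true, y_pred))  # of predicted 1s, how many are real
print("recall:   ", recall_score(y_true, y_pred))     # of real 1s, how many were found

# Regression metrics on toy predictions
y_true_r = [2.5, 0.0, 2.0, 8.0]
y_pred_r = [3.0, -0.5, 2.0, 7.0]
print("MSE:", mean_squared_error(y_true_r, y_pred_r))
print("R^2:", r2_score(y_true_r, y_pred_r))
```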
Apart from this, as we combine elaborate grid searches with sophisticated evaluation metrics, our model selection code might become increasingly complex. Fortunately, scikit-learn offers a way to simplify model selection with a helpful construct known as a pipeline.
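A pipeline chains preprocessing and the model into one estimator, so the whole chain can be passed to grid search at once. A minimal sketch (my illustration, not from the article):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Chaining the scaler and the model keeps grid search leak-free:
# the scaler is re-fit on the training folds only, inside each CV split
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10]}  # step name + "__" + parameter name
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)
print("test score:", grid.score(X_test, y_test))
```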
That’s it, guys. I think this is getting long, so I will stop here for this part. I will write five separate articles explaining each of the five procedures mentioned here in detail. Till then, signing off!! Cheers.