How to Deal With Imbalanced Classification and Imbalanced Regression Data?
Introduction
Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e. one class label has a very high number of observations and the other has a very low number. Let’s say we have a dataset of cancer patients and we are going to use it to build a predictive model that takes a patient record as input and says whether the patient has cancer or not.
The problem of imbalanced datasets is very common in machine learning. It arises when one class, or set of classes, holds a disproportionately large share of the observations. This biases the model towards the majority class and causes poor classification of the minority classes. Hence plain “accuracy” stops being a meaningful measure of how good the model is.
Imbalanced classification is also called rare event modeling. When the target label of a classification dataset is highly imbalanced, we call the minority event a rare event. In this case, models tend to learn mostly from the majority class, and predicting the minority class can be challenging. For example, if only 0.01% of the dataset is the minority event, the model tends not to do a good job of identifying the pattern of the minority event.
So, let’s say you have a thousand records, out of which 900 are cancer and 100 are non-cancer. This is an example of an imbalanced dataset, because the majority class is about 9 times bigger than the minority class. Imbalance can occur in different ways in your datasets: for instance, you could have a lot of positive examples with only a few negative examples thrown in.
If you train a model in the usual way on such data, the few minority points will have a relatively small effect on the loss function, and your model may tend to just ignore them. If these points are important, you do not want that, so you want to correct this imbalance somehow.
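To make the point about accuracy concrete, here is a minimal sketch using the 900/100 split from the example above. The label names and the "always predict the majority class" baseline are illustrative assumptions, not a real model.

```python
from collections import Counter

# Hypothetical labels matching the example: 900 majority and 100 minority records.
y_true = ["cancer"] * 900 + ["non-cancer"] * 100

# A trivial baseline that always predicts the most frequent class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)


def accuracy(y_true, y_pred):
    """Share of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Share of records that truly are `label` and were predicted as `label`."""
    predictions_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in predictions_for_label) / len(predictions_for_label)


acc = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, "non-cancer")
print(f"accuracy:        {acc:.2f}")              # 0.90
print(f"minority recall: {minority_recall:.2f}")  # 0.00
```

The baseline scores 90% accuracy while never identifying a single minority record, which is why per-class metrics such as recall are usually more informative on imbalanced data.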
Sample Theory: Handling Imbalanced Datasets
Problem:
How to Resolve?
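An effective way to handle imbalanced data is to downsample and upweight the majority class. Downsampling means training on a disproportionately small subset of the majority class examples. Upweighting means giving each retained majority example a larger example weight, equal to the factor by which the class was downsampled.
Oversampling and undersampling are techniques that change the ratio of the classes in an imbalanced modeling dataset.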
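The downsample-and-upweight idea can be sketched in plain Python as follows. The function name, the `factor` parameter, and the toy data are assumptions made for illustration; the returned weights would be passed to whatever per-example weight argument your training library offers.

```python
import random


def downsample_and_upweight(records, labels, majority_label, factor, seed=0):
    """Keep 1 in `factor` majority records and give each kept one weight `factor`.

    All other records are kept unchanged with weight 1.0.
    """
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority_label]
    minority_idx = [i for i, y in enumerate(labels) if y != majority_label]

    # Downsample: draw a random subset of the majority class without replacement.
    kept_majority = rng.sample(majority_idx, len(majority_idx) // factor)

    # Upweight: each kept majority record stands in for `factor` original records.
    kept = kept_majority + minority_idx
    weights = [float(factor)] * len(kept_majority) + [1.0] * len(minority_idx)
    return [records[i] for i in kept], [labels[i] for i in kept], weights


# Toy data: record ids 0..999, the first 900 belonging to the majority class.
records = list(range(1000))
labels = ["cancer"] * 900 + ["non-cancer"] * 100

X_small, y_small, w_small = downsample_and_upweight(records, labels, "cancer", factor=9)
print(len(y_small), y_small.count("cancer"), y_small.count("non-cancer"))  # 200 100 100
```

The model now sees a balanced 100/100 sample, while the weights keep the total influence of the majority class at its original level (100 records with weight 9 each).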
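Oversampling: Oversampling balances a dataset by increasing the number of minority-class samples, for example by duplicating existing samples or generating synthetic ones. Imbalanced learning is a basic problem in machine learning: when the number of samples from different categories in a classification dataset differs significantly, the dataset is called imbalanced.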
Minority categories often have smaller sample sizes and poorer data quality; however, they typically carry more important information. We therefore focus on a model’s ability to correctly classify minority-class samples. In a complex network system, for example, it is more important to accurately diagnose the network fault types and keep the system operating normally than to diagnose the network as normal.
Algorithm-level (cost-sensitive) approaches analyze how the cost varies across different misclassification cases and optimize the classification algorithm accordingly to improve its performance. Data-based approaches, such as resampling, are more popular in the existing literature than approaches that modify a specific classification algorithm for a specific imbalanced dataset.
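As an illustration of the data-based route, random oversampling (the simplest form, which duplicates existing minority records) might look like the sketch below. The function name and toy data are assumptions, and the sketch assumes a binary problem in which the minority class really is the smaller one.

```python
import random


def random_oversample(records, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority records until both classes are the same size."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(records, labels) if y == minority_label]
    n_majority = len(labels) - len(minority)

    # Draw with replacement from the minority class to fill the gap.
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    return records + extra, labels + [minority_label] * len(extra)


# Toy data: record ids 0..999, the last 100 belonging to the minority class.
records = list(range(1000))
labels = ["cancer"] * 900 + ["non-cancer"] * 100

X_over, y_over = random_oversample(records, labels, "non-cancer")
print(y_over.count("cancer"), y_over.count("non-cancer"))  # 900 900
```

Because the added records are exact copies, oversampling is normally applied to the training split only, so that duplicates do not leak into the evaluation data.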
Undersampling: Undersampling is a technique to balance uneven datasets by keeping all of the data in the minority class and decreasing the size of the majority class. It is one of several techniques data scientists can use to extract more accurate information from originally imbalanced datasets.
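A minimal sketch of random undersampling, under the same assumptions as before (hypothetical function name, toy binary data), keeps every minority record and draws an equally sized random subset of the majority class:

```python
import random


def random_undersample(records, labels, majority_label, seed=0):
    """Keep all minority records and an equally sized random subset of the majority."""
    rng = random.Random(seed)
    majority = [r for r, y in zip(records, labels) if y == majority_label]
    minority = [(r, y) for r, y in zip(records, labels) if y != majority_label]

    # Sample without replacement; the discarded majority records are simply dropped.
    kept_majority = rng.sample(majority, len(minority))

    new_records = kept_majority + [r for r, _ in minority]
    new_labels = [majority_label] * len(kept_majority) + [y for _, y in minority]
    return new_records, new_labels


# Toy data: record ids 0..999, the first 900 belonging to the majority class.
records = list(range(1000))
labels = ["cancer"] * 900 + ["non-cancer"] * 100

X_under, y_under = random_undersample(records, labels, "cancer")
print(y_under.count("cancer"), y_under.count("non-cancer"))  # 100 100
```

The dataset shrinks from 1000 to 200 records, which speeds up training but throws away 800 majority records and whatever information they carried.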
Advantages:
Disadvantages:
Techniques to handle imbalanced datasets:
SMOTE (Synthetic Minority Oversampling Technique): Suppose there are four points in the minority class: P1, P2, P3 and P4.
Now suppose SMOTE is run on this minority class with the number of nearest neighbors, K, set to three. SMOTE first finds the nearest neighbors of every point. For P1, the nearest neighbors are P2 and P3, and with a neighbor count of 3, P4 is included as well. Similarly, the neighbors of P3 are P1, P2 and P4, and the neighbors of P4 are P1, P2 and P3. SMOTE then considers the line segments joining each minority sample to its nearest neighbors and, depending on how many samples we want it to create, places new synthetic instances somewhere on these segments. There can be multiple synthetic points on any of the lines.
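The interpolation step described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the reference SMOTE implementation: the function name, the coordinates chosen for P1 to P4, and the parameters `k` and `n_new` are all assumptions.

```python
import math
import random


def smote_sketch(minority, k=3, n_new=6, seed=0):
    """Create `n_new` synthetic points, each on a segment joining a minority
    point to one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a minority point at random.
        i = rng.randrange(len(minority))
        p = minority[i]

        # Find its k nearest neighbors among the other minority points.
        others = [q for j, q in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda q: math.dist(p, q))[:k]

        # Pick one neighbor and interpolate at a random position on the segment.
        q = rng.choice(neighbors)
        t = rng.random()
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return synthetic


# Four hypothetical minority points standing in for P1..P4.
p1, p2, p3, p4 = (1.0, 1.0), (2.0, 1.5), (1.5, 3.0), (3.0, 2.5)
new_points = smote_sketch([p1, p2, p3, p4], k=3, n_new=6)
for point in new_points:
    print(point)
```

With four points and k set to 3, every other minority point counts as a neighbor, matching the walk-through above. In practice a maintained library implementation is usually the better choice than a hand-rolled version.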
Conclusion: These approaches can be effective, although they can also be hit-or-miss and time-consuming. Instead, the shortest path to a good result on a new classification task is to systematically evaluate a suite of machine learning algorithms in order to discover what works well, then double down. This approach can also be used for imbalanced classification problems, tailored to the range of data sampling, cost-sensitive, and one-class classification algorithms that one may choose from.