Today, I’m covering imbalanced classification problems in machine learning using SMOTE and ADASYN data augmentation.
Basics of Classification in Machine Learning
Classification can be defined as the process of predicting a class or category from observable values or data points.
Spam identification in email is a classic example of a classification problem. There are only two possible outcomes, “spam” and “not spam”; thus, this is a binary classification problem.
Other examples are:
- Fraud Detection
- Claim Prediction
- Default Prediction
- Churn Prediction
- Spam Detection
- Anomaly Detection
- Outlier Detection
- Intrusion Detection
- Conversion Prediction
In fact, meteorological departments use classification for natural disaster prediction, and astronomers use it to predict galaxy collisions.
Imbalanced datasets and their effects
The difficulty with imbalanced datasets is that most machine learning approaches overlook the minority class, even though the minority class is usually the output we care about most.
Say you are experimenting on a dataset.
You build a classification model and immediately get 90 percent accuracy. You are overjoyed.
But then you dig a little deeper and find that 90% of the data belongs to a single class.
Your data has imbalanced classes, and all the fantastic results you thought you were getting turn out to be misleading. 🙁
How to know when data is imbalanced
Imbalanced data refers to a concern with classification problems where the classes are not equally represented.
For example, with 100 instances (rows), you might have a 2-class (binary) classification problem. Class-1 accounts for 80 instances and Class-2 for the remaining 20 instances.
This is an imbalanced dataset, with an 80:20 or, more succinctly, 4:1 ratio of Class-1 to Class-2 examples.
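A quick way to check for imbalance is to count the labels. The sketch below uses hypothetical labels matching the 80:20 split described above:

```python
from collections import Counter

# Hypothetical labels: 80 instances of Class-1, 20 of Class-2
y = [1] * 80 + [2] * 20

counts = Counter(y)
print(counts)                 # class sizes: Counter({1: 80, 2: 20})
print(counts[1] / counts[2])  # imbalance ratio: 4.0
```

If the ratio is far from 1, you are dealing with an imbalanced dataset.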
Techniques to deal with imbalanced data
It is important to look into techniques like SMOTE and ADASYN, which generate new data and balance out the dataset classes.
Other, less effective options include collecting more data, resampling the data, and changing the evaluation metric.
What is SMOTE?
SMOTE is short for Synthetic Minority Oversampling Technique.
If you have 100 rows of data and need to select 10 of them, it’s quite easy: you just randomly sample 10 rows. Shrinking the larger class this way is termed undersampling. The opposite, adding examples to the smaller class, is known as oversampling.
So if you have a binary classification problem with 100 data rows in one class and 10 data rows in the other class, you could simply duplicate examples from the minority class in the training dataset prior to fitting a model.
This can balance the class distribution, but it does not give the model any new information.
Instead, we can use data augmentation, which can be very powerful: synthesizing new examples from the minority class is an improvement over simply replicating existing ones.
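To see why duplication adds no information, here is a minimal sketch of the naive approach on a hypothetical 100-vs-10 dataset (all names and values are illustrative):

```python
import random

random.seed(0)  # reproducible

# Hypothetical dataset: 100 majority rows (label 0), 10 minority rows (label 1)
majority = [([i, 0.0], 0) for i in range(100)]
minority = [([i, 1.0], 1) for i in range(10)]

# Naive oversampling: duplicate minority rows (sampling with replacement)
# until both classes are the same size
needed = len(majority) - len(minority)
duplicates = random.choices(minority, k=needed)
balanced = majority + minority + duplicates

labels = [label for _, label in balanced]
print(labels.count(0), labels.count(1))  # 100 100
```

The classes are now balanced, but every minority row is an exact copy of one of the original 10 — the model sees nothing new, which is exactly the gap SMOTE fills.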
Oversampling with SMOTE
We’ll use the imblearn and scikit-learn libraries for this. In this case, we’re creating a synthetic dataset with 5000 samples.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
Now we use the make_classification function:
X, y = make_classification(n_samples=5000, n_features=2, n_redundant=0, weights=[.99], n_informative=2, n_clusters_per_class=1)
Since make_classification is random, your exact dataset will differ; mine turned out like this:
Plotting the Data
We’ll use matplotlib:
plt.scatter(X[:, 0], X[:, 1], marker='o', c=y, s=25, edgecolor='k')
Obviously, if we fit a model to this dataset, it will be heavily biased towards predicting the majority class.
So to balance it out, we will use SMOTE:
Now we see that the dataset’s been balanced:
What is ADASYN?
ADASYN is short for Adaptive Synthetic Sampling Approach, a generalization of the SMOTE algorithm.
Like SMOTE, this algorithm oversamples the minority class by generating synthetic instances for it.
But the distinction here is that ADASYN takes the density distribution into account: it generates more synthetic instances for minority samples that are harder to learn.
Because of this, it helps the classifier adaptively adjust its decision boundary towards the difficult samples.
Oversampling with ADASYN
Let’s try plotting the same dataset with ADASYN.
from imblearn.over_sampling import ADASYN

ada = ADASYN()
X_ada, y_ada = ada.fit_resample(X, y)
plt.scatter(X_ada[:, 0], X_ada[:, 1], marker='o', c=y_ada, s=25, edgecolor='k')
What’s significant in both plots?
If you observe the plots carefully, you’ll find that ADASYN preserves much more detail near the class boundary, while SMOTE tends to blur the boundary by joining points that lie close together.
Trying SMOTE on a real dataset
Do you want to see this in action on a real dataset? Let’s take this one: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset
We’ll definitely cover text data analytics in detail later, but this shows that even though we learned the technique on simple generated plots, it has a much wider range of applications.
So this is our data: (we added the labels based on the ones given on kaggle)
You can see that the data is heavily imbalanced at roughly 1:10. In such cases, a classifier can maximize its accuracy simply by predicting the majority class for every single article, but that model is useless in practice.
Therefore we need SMOTE to balance out the dataset. First we convert the text into numerical values with a TF-IDF vectorizer (which we’ll learn later):
If you look closely at the generated rows, they are very similar to the real data, and the dataset is now balanced at a 1:1 ratio, so there is no bias for the classification algorithms:
And that’s it for today. Keep coming back, and we have a lot more topics in store! Of course, if you missed anything, you’ll find all the code here: