SMOTE and ADASYN for handling imbalanced classification datasets


Today, I’m covering imbalanced classification problems in machine learning using SMOTE and ADASYN data augmentation.

Basics of Classification in Machine Learning

Classification is the process of predicting a class or category from observable values or data points.

Spam identification in emails is one example of a classification problem. There are only two possible outputs, “spam” and “not spam”; thus, this is a binary classification problem.

Other examples are:

  • Fraud Detection
  • Claim Prediction
  • Default Prediction
  • Churn Prediction
  • Spam Detection
  • Anomaly Detection
  • Outlier Detection
  • Intrusion Detection
  • Conversion Prediction

In fact, meteorological departments use classification for natural disaster prediction, and astronomers use it to predict galaxy collisions.

Imbalanced datasets and their effects

The difficulty with imbalanced datasets is that most machine learning approaches will overlook the minority class, even though the minority class is usually the one whose predictions matter most.

Say you’re experimenting on a dataset.

You build a classification model and immediately get 90 percent accuracy. You are overjoyed.

But when you dig a little deeper, you find that 90% of the data belongs to a single class.
Your data has imbalanced classes, and all the fantastic results you thought you were getting turn out to be a lie. 🙁

How to know when data is imbalanced

Imbalanced data refers to classification problems where the classes are not equally represented.

For example, you might have a 2-class (binary) classification problem with 100 instances (rows), where 80 instances are labeled Class-1 and the remaining 20 are labeled Class-2.

This is an imbalanced dataset, with an 80:20 (or, more succinctly, 4:1) ratio of Class-1 to Class-2 examples.

Techniques to deal with imbalanced data

It is important to look into techniques like SMOTE and ADASYN, which generate new data to balance out the dataset classes.

Other, less effective techniques include collecting more data, naively resampling the data, and changing the evaluation metric.

What is SMOTE?

SMOTE is short for Synthetic Minority Oversampling Technique.

If you have 100 rows of data and need to select 10 of them, it’s quite easy: you just randomly sample 10 elements from the dataset. This is termed undersampling. The opposite is known as oversampling.

So if you have a binary classification problem with 100 data rows in one class and 10 data rows in the other class, you could simply duplicate examples from the minority class in the training dataset prior to fitting a model.

This can balance the class distribution, but it does not give the model any extra information.
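
For what it’s worth, imbalanced-learn implements this naive duplication as RandomOverSampler. A minimal sketch on a toy dataset (variable names are my own):

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification

# Toy dataset: about 90% of the rows belong to class 0
X, y = make_classification(n_samples=110, weights=[0.9], random_state=42)
print(Counter(y))

# Duplicate random minority-class rows until the classes are balanced
ros = RandomOverSampler(random_state=42)
X_dup, y_dup = ros.fit_resample(X, y)
print(Counter(y_dup))  # both classes now have equal counts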

Instead, we use data augmentation, which can be very powerful. Synthesizing new examples from the minority class is an advancement over simply replicating minority examples.
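
Roughly, what SMOTE does is: pick a minority sample, pick one of its nearest minority-class neighbors, and create a new point somewhere on the line segment between the two. A bare-bones numpy sketch of that interpolation step (illustrative only, not the library’s actual implementation):

import numpy as np

def smote_point(x_i, x_neighbor, rng=np.random.default_rng(0)):
    # x_new = x_i + lam * (x_neighbor - x_i), with lam drawn from [0, 1)
    lam = rng.random()
    return x_i + lam * (x_neighbor - x_i)

# Two nearby minority-class points
a = np.array([1.0, 2.0])
b = np.array([1.5, 2.4])
print(smote_point(a, b))  # a synthetic point between a and b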

Oversampling with SMOTE

We’ll be using the imbalanced-learn (imblearn) and scikit-learn libraries for this purpose. In this case, we’re creating a custom dataset with 5000 samples.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt

Now we use the make_classification function:

X, y = make_classification(n_samples=5000, n_features=2, n_redundant=0,
                           weights=[.99], n_informative=2,
                           n_clusters_per_class=1)
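
Since weights=[.99] skews the classes heavily, it’s worth confirming the imbalance before resampling; collections.Counter makes that a one-liner (your exact counts will vary):

from collections import Counter
print(Counter(y))  # roughly 99:1, e.g. Counter({0: 4948, 1: 52})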

Mine turned out like this:

[Image: Generated binary classification dataset]

Plotting the Data

We’ll use matplotlib:

plt.scatter(X[:, 0], X[:, 1], marker='o', c=y, s=25, edgecolor='k')
[Image: Plot of dataset for SMOTE]

Obviously, if we fit a model to this dataset, it will be heavily biased towards predicting the majority class.

So to balance it out, we will use SMOTE.
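
Assuming the X and y generated above, a minimal call follows imbalanced-learn’s fit_resample API (X_sm and y_sm are names I’ve chosen for the resampled arrays):

sm = SMOTE()
X_sm, y_sm = sm.fit_resample(X, y)
print(Counter(y_sm))  # both classes now at the majority count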

[Image: New Counter of the classification dataset after SMOTE]

Now we see that the dataset’s been balanced:
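
To visualize it, we can re-run the same scatter plot on the resampled arrays (using the X_sm and y_sm names from the sketch above):

plt.scatter(X_sm[:, 0], X_sm[:, 1], marker='o', c=y_sm, s=25, edgecolor='k')
plt.show()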

[Image: Plot of dataset after SMOTE]

What is ADASYN?

ADASYN is short for Adaptive Synthetic Sampling Approach, a generalization of the SMOTE algorithm.

This algorithm also attempts to oversample the minority class by generating synthetic instances for it.

But the distinction here is that ADASYN takes the density distribution into account: the number of synthetic instances generated for each minority sample depends on how difficult that sample is to learn.

Because of this, it adaptively shifts the decision boundary toward the difficult samples.
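
Concretely, for each minority point ADASYN looks at its k nearest neighbors and measures what fraction of them belong to the majority class; points surrounded by majority samples (the “difficult” ones) receive proportionally more synthetic samples. A simplified sketch of that weighting step (not imblearn’s actual implementation, and it assumes at least one minority point has majority neighbors):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label, k=5):
    """Share of the synthetic samples each minority point should receive."""
    X_min = X[y == minority_label]
    # k + 1 neighbors because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    # Fraction of majority-class points among each minority point's neighbors
    r = np.array([(y[row[1:]] != minority_label).mean() for row in idx])
    return r / r.sum()  # normalize so the weights sum to 1

X, y = make_classification(n_samples=500, weights=[0.9], random_state=0)
print(adasyn_weights(X, y, minority_label=1))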

Oversampling with ADASYN

Let’s try plotting the same dataset with ADASYN.

from imblearn.over_sampling import ADASYN

ada = ADASYN()
# Resample the original imbalanced X, y from earlier
X_ada, y_ada = ada.fit_resample(X, y)
plt.scatter(X_ada[:, 0], X_ada[:, 1], marker='o', c=y_ada, s=25, edgecolor='k')
[Image: Plot with ADASYN]

What’s significant in both plots?

If you observe the plots carefully, you’ll find that ADASYN gives us much better detail, while SMOTE tends to cover the boundary by joining points that lie close together.

Trying SMOTE on a real dataset

Do you want to see this in action on a real dataset? Let’s take this one: https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset

We’ll definitely cover text data analytics in detail later, but this is just to show that even though we learned the technique on simple generated datasets, it has a much wider range of applications.

So this is our data (we added the labels based on the ones given on Kaggle):

[Image: Fake news dataset from Kaggle]

You can clearly see that the data is heavily imbalanced, at roughly 10:1 in favor of fake news. In such cases, an algorithm can maximize its accuracy simply by predicting 100% of the articles as fake, which is obviously not a useful model.

[Image: Counter of fake vs. true news]

Therefore, we need SMOTE to balance out the dataset. First, we turn the text into numerical values with a TF-IDF vectorizer (which we’ll learn about later):
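
A rough sketch of that pipeline (the file and column names are assumptions based on the Kaggle download; adjust them to your copy of the data):

import pandas as pd
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE

# Assumed file and column names from the Kaggle dataset
fake = pd.read_csv('Fake.csv')
true = pd.read_csv('True.csv')
fake['label'], true['label'] = 0, 1
df = pd.concat([fake, true], ignore_index=True)

# Turn the article text into TF-IDF feature vectors
vec = TfidfVectorizer(max_features=5000, stop_words='english')
X_text = vec.fit_transform(df['text'])
print(Counter(df['label']))

# Oversample the minority class in the vectorized space
X_bal, y_bal = SMOTE().fit_resample(X_text, df['label'])
print(Counter(y_bal))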

[Image: TF-IDF vectorized corpus]

Look closely at the generated samples: they are very similar to the actual data, and the dataset is now balanced at a 1:1 ratio, so there is no bias for the classification algorithms:

[Image: Sentences generated by the SMOTE algorithm]

And that’s it for today. Keep coming back; we have a lot more topics in store! Of course, if you missed anything, you’ll find all the code here:

https://github.com/arkaprabha-majumdar/smote-for-data-numbers-and-text
