Data Discretization Using Sklearn In Machine Learning


Hello folks, hope this story finds you in good health! As we know, some clustering and classification algorithms (for example, rule-based algorithms) prefer working on ordinal data rather than data measured on a numerical scale.

Yes, we often hear that most ML algorithms need numerical input, and that is true too. It depends on the use case you are working on. This is where data discretization comes in. In layman’s terms, it is the process of grouping continuous data into discrete buckets.
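To make the idea of buckets concrete, here is a minimal sketch using pandas `cut`. The ages, bin edges, and labels are made-up values for illustration only and are not part of the loan dataset used later.

```python
import pandas as pd

# Hypothetical continuous values (ages in years)
ages = pd.Series([3, 17, 25, 42, 68, 90])

# Hand-picked bucket edges and labels (assumptions for this sketch)
bins = [0, 18, 40, 65, 120]
labels = ["child", "young adult", "middle-aged", "senior"]

# Each age is replaced by the label of the interval it falls into;
# by default the intervals are closed on the right, e.g. (0, 18]
age_group = pd.cut(ages, bins=bins, labels=labels)

print(age_group.tolist())
# ['child', 'child', 'young adult', 'middle-aged', 'senior', 'senior']
```

Here the edges are chosen by hand. The sklearn methods covered in this story pick the edges from the data instead.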


Data Discretization – In Detail

  • This process limits the data to a small number of states rather than keeping it in continuous form. It works best when we have a lot of data on a large scale, which can be difficult to classify or cluster without discretization.
  • Discretization is necessary because some rule-based algorithms tend to work better on categorical data than on data on a numerical scale. Ex: Clustering and Classification.
  • You may be reading this word for the first time, but don’t worry. It is also called data binning, and I am sure you have heard of it a hundred times 😛
  • There are 3 types of data discretization methods –
1. Quantile Transformation:

In this transformation, each bin holds roughly the same number of values, with the bin edges placed at percentiles of the data.

2. Uniform Transformation:

In this transformation, every bin has the same width across the range of values in the attribute.

3. Kmeans Transformation:

In this transformation, clusters are found with one-dimensional k-means, and each value is assigned to the bin of its nearest cluster center.
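The difference between the three methods can be sketched on a single synthetic column. The skewed random data, the seed, and the choice of 4 bins are assumptions made for this illustration. The code counts how many values land in each bin under each strategy.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, right-skewed column (an assumption for this sketch)
rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=(300, 1))

counts_by_strategy = {}
for strategy in ("quantile", "uniform", "kmeans"):
    transf = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    # With encode="ordinal", each value becomes the integer code of its bin
    codes = transf.fit_transform(X).astype(int).ravel()
    counts_by_strategy[strategy] = np.bincount(codes, minlength=4)
    print(strategy, counts_by_strategy[strategy])
```

On skewed data like this, the quantile counts should come out nearly equal, while the uniform bins should be crowded at the low end and sparse at the high end.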

Well, now let’s import the sklearn library and our data to see how to perform these data binning methods. Let’s roll!!!


Data For Our Implementation

For the data transformation, we need data, right? So, we are going to work on loan data, which is a pretty big dataset.

#data

import pandas as pd

df = pd.read_csv('loan_data.csv')
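One thing to keep in mind: as far as I know, KBinsDiscretizer only accepts numeric input without missing values. If your file has text columns or gaps, a preparation step along these lines may help. The small dataframe and its column names below are hypothetical stand-ins, since the real columns of loan_data.csv are not shown here.

```python
import pandas as pd

# Hypothetical stand-in for the loan data; column names are assumptions
df_demo = pd.DataFrame({
    "loan_amount": [5000.0, 12000.0, 7500.0, None],
    "interest_rate": [0.11, 0.09, 0.15, 0.13],
    "purpose": ["car", "home", "education", "car"],
})

# Keep only the numeric columns, then drop rows with missing values
numeric_df = df_demo.select_dtypes(include="number").dropna()

print(numeric_df.shape)  # (3, 2)
```

Dropping rows is only one option. Imputing the missing values is another, depending on your use case.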

1. Quantile Transformation

The quantile transformation bins the records of each variable into k groups. Here, the number of values in each group will be the same, or as close to it as the data allows.

Let’s see how we can do this in Python using the scikit-learn package. The class we will be using from sklearn is KBinsDiscretizer.

#quantile transformation

#Import the class
from sklearn.preprocessing import KBinsDiscretizer

#Set up the discretizer: 10 bins per column, integer bin codes as output
transf = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'quantile')

#fit transform (df should hold only numeric columns with no missing values)
data = transf.fit_transform(df)

#Array to dataframe
from pandas import DataFrame

data1 = DataFrame(data)

#Peek into data
data1.head(5)

Here –

  • We have imported the KBinsDiscretizer class from sklearn.
  • We discretized the data into 10 bins, grouped by the quantile method.
  • Then we fitted the transformer to the data and transformed it.
  • The result is an array, so we convert it to a dataframe using the Pandas DataFrame object as shown.
     0    1    2    3    4
0  8.0  9.0  0.0  1.0  1.0
1  8.0  6.0  0.0  4.0  0.0
2  8.0  8.0  9.0  4.0  0.0
3  8.0  8.0  9.0  2.0  0.0
4  8.0  9.0  9.0  7.0  2.0
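Notice that the column names are replaced by 0 to 4 when the array is converted back. If you want to keep the original names, one possible approach is to pass them to the DataFrame constructor. The sketch below uses a small synthetic dataframe, and its column names are assumptions, not the real loan columns.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic numeric data with hypothetical column names
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "income": rng.normal(50000, 15000, 200),
    "age": rng.uniform(18, 70, 200),
})

transf = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")

# Reuse the original column names and index for the binned result
binned = pd.DataFrame(
    transf.fit_transform(demo),
    columns=demo.columns,
    index=demo.index,
)

print(binned.head())
```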

But, wait! It’s cool to visualize this to get a better idea, right?

#visualize the data

import matplotlib.pyplot as plt

data1.hist()
[Figure: histograms of the binned columns 0 to 4 after the quantile transformation]

Inference –

  • Here, you can observe that all 10 bins or groups have a roughly equal number of values. That’s how quantile transformation works.

2. Uniform Transformation

In the uniform transformation, every bin has the same width, spanning the range of possible values in the variable. Let’s see how it works.

#uniform transformation

#Import the class
from sklearn.preprocessing import KBinsDiscretizer

#Set up the discretizer: 10 equal-width bins per column
transf = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'uniform')

#fit transform
data = transf.fit_transform(df)

#Array to dataframe
from pandas import DataFrame

data1 = DataFrame(data)

#Peek into data
data1.head(5)

Here –

  • We have changed the strategy to “uniform”. This gives every bin the same width across the range of values in each variable.

Let’s visualize the data to interpret it better.

#visualize the data

import matplotlib.pyplot as plt

data1.hist()
[Figure: histograms of the binned columns 0 to 4 after the uniform transformation]

Inference –

  • Here, you can see that rather than having an equal number of values in each bin, the uniform transformation gives bins of equal width.
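The equal-width behaviour can also be checked numerically through the fitted `bin_edges_` attribute. This is a minimal sketch on synthetic skewed data. The data, the seed, and the choice of 5 bins are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic, right-skewed column (an assumption for this sketch)
rng = np.random.default_rng(1)
X = rng.exponential(scale=1000.0, size=(500, 1))

transf = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
codes = transf.fit_transform(X).astype(int).ravel()

# bin_edges_ holds one array of edges per column
edges = transf.bin_edges_[0]
widths = np.diff(edges)                    # width of each bin
counts = np.bincount(codes, minlength=5)   # values per bin

print("edges :", edges)
print("widths:", widths)
print("counts:", counts)
```

The widths should all match, while the counts differ from bin to bin because the data is not spread evenly over its range.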

3. KMeans Transformation

KMeans works quite differently from the previous transformations. Here, k-means will try to fit the values into the specified number of clusters. Let’s see how it works.

#Kmeans transformation

#Import the class
from sklearn.preprocessing import KBinsDiscretizer

#Set up the discretizer: up to 10 k-means clusters per column
transf = KBinsDiscretizer(n_bins = 10, encode = 'ordinal', strategy = 'kmeans')

#fit transform
data = transf.fit_transform(df)

#Array to dataframe
from pandas import DataFrame

data1 = DataFrame(data)

#Peek into data
data1.head(5)

Here –

  • We have again changed the strategy parameter, this time to “kmeans”. With this, each data value falls into one of the clusters.

Let’s visualize the data.

#visualize the data

import matplotlib.pyplot as plt

data1.hist()
[Figure: histograms of the binned columns 0 to 4 after the kmeans transformation]

Inference –

  • You can observe that we got 3 clusters, and all the values were fitted into those clusters.
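To see how the kmeans strategy follows the shape of the data, here is a small sketch on synthetic values drawn around three well-separated centers. The centers 10, 50 and 90, the seed, and the choice of 3 bins are assumptions for this illustration only.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic column with three clear groups around 10, 50 and 90
rng = np.random.default_rng(2)
X = np.concatenate([
    rng.normal(10, 1, 100),
    rng.normal(50, 1, 100),
    rng.normal(90, 1, 100),
]).reshape(-1, 1)

transf = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="kmeans")
codes = transf.fit_transform(X).astype(int).ravel()

# Inner edges sit midway between neighbouring cluster centers
edges = transf.bin_edges_[0]
counts = np.bincount(codes, minlength=3)

print("edges :", edges)
print("counts:", counts)
```

The two inner edges should land in the empty gaps between the groups, so each group gets its own bin. Neither quantile nor uniform binning is guaranteed to do this when the groups have different sizes or spacing.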

Wrapping Up – Data Discretization

Data discretization is an important step in data preprocessing, because some rule-based algorithms prefer dealing with qualitative data or bins. I hope you are now clear on these 3 methods for data binning. Make sure to feed the data in the best form to your model to get the best results.

That’s all for now. Happy Python!!!

Further reading: sklearn.preprocessing
