Data Preprocessing in Python | A quick Introduction


If you are data-savvy, you have probably heard the sayings “Your model will only be as good as your data” and “Garbage in, garbage out.”

These are not merely quotes; they carry real weight in the data science world. Ask any data analyst or scientist about their day-to-day responsibilities, and you will find that most of their time goes into data cleaning and processing.

That work is what ultimately produces a production-grade model. Now that the importance of data preprocessing is clear, here is a quick introduction to data preprocessing in Python.

Let’s explore some of the key steps in it with real-world data from the Lending Club.

Data Preprocessing in Python

There are several key steps in data preprocessing in Python –

  • Cleaning

The data cleaning process involves dealing with missing data and inconsistencies in the data. It also includes checking for duplicates and treating noisy data.

  • Integration

Data integration is all about combining data from different sources to form a consistent and stable dataset for your analysis.

  • Transformation

The data transformation step includes data normalization, i.e., making sure the data is not redundant and that attributes fall on the same scale.

  • Reduction

Some datasets are massive and become slow to load. We can reduce the data by taking a subset with only the relevant attributes.
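Before we touch the real dataset, the cleaning and reduction steps above can be sketched with pandas on a tiny made-up frame (the column names here are illustrative, not from the Lending Club data):

```python
import pandas as pd

# A small, made-up frame to illustrate the steps (not the Lending Club data)
df = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'amount': [100.0, 100.0, None, 250.0],
    'note': ['a', 'a', 'b', 'c'],
})

# Cleaning: drop exact duplicate rows and impute missing values
df = df.drop_duplicates()
df['amount'] = df['amount'].fillna(df['amount'].median())

# Reduction: keep only the attributes relevant to the analysis
df = df[['id', 'amount']]

print(df)
```

Each step is a one-liner in pandas; the real work is deciding which rows count as duplicates, how to impute, and which attributes matter.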

Import the Data

For the data preprocessing in python, we need to load the data. As I mentioned earlier, we are using the loan data from Lending Club.

#Load the data
import pandas as pd

df = pd.read_csv('loan_data.csv')

  • We have imported the pandas library and read the data using the read_csv function.
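To sanity-check the import, df.head() and df.shape are handy first calls; a minimal sketch, using a small stand-in frame since the real CSV may not be at hand (column names and values here are made up):

```python
import pandas as pd

# Stand-in for the loaded loan data
# (real code would use: df = pd.read_csv('loan_data.csv'))
df = pd.DataFrame({'client_id': [46109, 46110],
                   'loan_amount': [13672, 9794]})

print(df.head())              # first few rows
print(df.shape)               # (rows, columns)
print(df.columns.tolist())    # attribute names
```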

Basic Statistics

Before diving into preprocessing, we should look at the basic statistics of the data. This gives a first idea of the data and its attributes.

  • Describe

First, we will describe the data to see the basic stats.

#describe the data
df.describe()
  • Here you can see basic statistics such as the mean of each data attribute.
  • Spend some time here to understand your data and try to explain its attributes.
  • This will give you good insight into the data distribution.

  • Null values

Now, check for null values. First, check whether there are any null values; if so, find their count and where they occur.

#null values
df.isnull().any()

client_id      False
loan_type      False
loan_amount    False
repaid         False
loan_id        False
loan_start     False
loan_end       False
rate           False
dtype: bool

Well, fortunately, there are no missing/null values in our data, so there is no need to count them.
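Had the data contained missing values, the usual follow-up is to count them per column and then drop or impute them; a quick sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing rate value
df = pd.DataFrame({'rate': [3.1, np.nan, 4.7],
                   'repaid': [1, 0, 1]})

print(df.isnull().sum())                           # per-column null counts

df['rate'] = df['rate'].fillna(df['rate'].mean())  # impute with the mean
# ...or drop the affected rows instead: df = df.dropna()
```

Whether to impute (mean, median, a constant) or drop depends on how much data is missing and why.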

  • Outliers

Here, we will check for the presence of outliers. An easy way to spot outliers is a box plot. Let’s visualize the data using one.


df['loan_amount'].plot(kind = 'box')

Wow! We don’t have any outliers in the loan_amount attribute of the data. But make sure you check all the relevant attributes for outliers.

Let’s check the rate attribute for the outlier presence.


df['rate'].plot(kind = 'box')

Well, well, well! We have some company now. So we can confirm the presence of outliers in the rate attribute of our data.

In the next section, let’s see how we can get rid of these outliers.
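Box plots flag points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles, and the same rule can be applied numerically; a sketch on made-up rate values (the 9.9 is a planted outlier):

```python
import pandas as pd

rate = pd.Series([2.1, 2.5, 3.0, 3.2, 3.4, 3.6, 9.9])  # 9.9 is the planted outlier

# The 1.5 * IQR rule, the same one a box plot's whiskers use
q1, q3 = rate.quantile(0.25), rate.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = rate[(rate < lower) | (rate > upper)]
print(outliers)
```

This gives you the actual outlying rows, not just a picture, which is useful when deciding whether to cap, transform, or drop them.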

Data Transformation

Now, we will apply a transformation to the data so that we can tame the outliers. For this purpose, we are going to transform the rate values by taking their square root.

#data transformation
import numpy as np

#find the sqrt of values
df['Sqrt'] = np.sqrt(df['rate'])

Good! We have now derived a new column based on the values in the rate attribute.

Note: if the data is heavily skewed, outliers are often lurking in the long tail. An easy way to see the skew is with a histogram or distribution plot.

Now, let’s plot the data before and after the transformation to see whether we have tamed the outliers.

#import seaborn library
import seaborn as sns

#distribution plots: rate before and after the square-root transform
sns.histplot(df['rate'], kde=True)
sns.histplot(df['Sqrt'], kde=True)

That’s perfect!

We now have a much more normal-looking distribution. It is satisfying to see that bell-shaped curve. As the transformed data is far less skewed, we can use this transformation as the best measure in our case.
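Skewness can also be measured numerically with pandas’ skew() method (values near 0 mean roughly symmetric), so you don’t have to judge by eye alone; a sketch on synthetic right-skewed data:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (exponential), seeded for reproducibility
rng = np.random.default_rng(0)
skewed = pd.Series(rng.exponential(scale=2.0, size=1000))

print(skewed.skew())            # clearly positive: right-skewed
print(np.sqrt(skewed).skew())   # much closer to 0 after the sqrt transform
```

Comparing the skew before and after a transform is a quick numeric check that the transform actually helped.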

Encoding the Categorical variables

When you are working with any dataset, first understand the datatypes of each data attribute. Sometimes, you may have categorical variables in your data. Let’s have a check.


#check the datatypes
df.dtypes

client_id        int64
loan_type       object
loan_amount      int64
repaid           int64
loan_id          int64
loan_start      object
loan_end        object
rate           float64
Sqrt           float64
dtype: object

Well, we have a single categorical column i.e. loan_type.

Now, we have to encode the categorical values. For this purpose, you can simply use the get_dummies function from pandas.


cat_var = pd.get_dummies(df['loan_type'])

   cash  credit  home  other
0     0       0     1      0
1     0       1     0      0
2     0       0     1      0
3     1       0     0      0
4     0       1     0      0

Well, we got our encoded values, and you are doing great. You can also make use of sklearn.preprocessing for label encoding and one-hot encoding.
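As a quick sketch of that scikit-learn route (assuming scikit-learn is installed; the sample loan_type values are made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

loan_type = np.array(['home', 'credit', 'home', 'cash', 'credit'])

# Label encoding: one integer per category (alphabetical: cash=0, credit=1, home=2)
labels = LabelEncoder().fit_transform(loan_type)
print(labels)

# One-hot encoding: one binary column per category
onehot = OneHotEncoder().fit_transform(loan_type.reshape(-1, 1)).toarray()
print(onehot)
```

Label encoding imposes an artificial order on the integers, so one-hot encoding (like get_dummies) is usually the safer choice for nominal attributes such as loan_type.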

Data preprocessing in Python – Conclusion

Data preprocessing in Python is the most important, and the most time-consuming, step in the data science pipeline. But it is worth spending time on: get it right, and you will be well on your way to an excellent model. Understanding the data, its basic statistics, its distribution, missing values, outliers, and encoding are the key aspects of data preprocessing. Encoding and model building can be a story for another time.

So, that’s all for now. Happy Pythoning!

Further reading: Data preprocessing
