If you are data-savvy, you have probably heard the quotes "Your model will only be as good as your data" and "Garbage in = garbage out". These are not just sayings; they carry real weight in the data science world. Ask any data analyst or scientist about their day-to-day responsibilities, and you will find that most of their time goes into data cleaning and processing.
That is what fetches you a production-grade model. So now that you see the importance of data preprocessing, let me give you a quick introduction to data preprocessing in Python.
Let’s explore some of the key steps in it with real-world data from the Lending Club.
Data Preprocessing in Python
There are several key steps in data preprocessing in Python:
- Data cleaning deals with missing values and inconsistencies in the data. It also covers duplicate checks and the treatment of noisy data.
- Data integration combines data from different sources into a consistent, stable dataset for your analysis.
- Data transformation includes data normalization, i.e., making sure the data is not redundant and that attributes fall on the same scale.
- Data reduction: some datasets are massive and slow to work with, so we can reduce the data by taking a subset with only the relevant attributes.
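The data-reduction step above can be sketched in a few lines. This is a hypothetical illustration on made-up data (the column names and values are assumptions, not from the Lending Club file): we simply select the relevant attributes.

```python
import pandas as pd

# Hypothetical illustration of the data-reduction step: keep only relevant columns
df = pd.DataFrame({
    'client_id': [1, 2, 3],
    'loan_amount': [1000, 2000, 1500],
    'rate': [2.5, 3.1, 2.8],
    'internal_note': ['a', 'b', 'c'],  # an attribute we do not need
})

relevant = ['client_id', 'loan_amount', 'rate']
df_small = df[relevant]  # a smaller frame with only the relevant attributes
```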
Import the Data
For data preprocessing in Python, we first need to load the data. As I mentioned earlier, we are using the loan data from the Lending Club.
```python
#Load the data
import pandas as pd

df = pd.read_csv('loan_data.csv')
df
```
- We have imported the pandas library and read the data using read_csv.
Before diving into preprocessing, we should check the basic aspects and statistics of the data. This gives you an initial idea about your data and its attributes.
First, we will describe the data to see the basic stats.
```python
#describe the data
df.describe()
```
- Here, you can see basic stats such as the mean of the different data attributes.
- Spend some time here to understand your data and try to explain its attributes.
- It will give you enough insight into the data distribution.
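To make these stats concrete, here is a minimal sketch on a tiny synthetic frame (the values are made up, not from the loan data) showing how to pull a single statistic out of the describe table:

```python
import pandas as pd

# A tiny synthetic frame standing in for the loan data (made-up values)
df = pd.DataFrame({'loan_amount': [1000, 2000, 3000, 4000]})

stats = df.describe()  # count, mean, std, min, quartiles, max per column
mean_amount = stats.loc['mean', 'loan_amount']
```

The describe output is itself a DataFrame, so you can index into it with `.loc` just like any other frame.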
Null Values
Now, check for null values. First, check whether there are any null values at all; if there are, find their count and where they occur.
```python
#null values
df.isnull().any()
```

```
client_id      False
loan_type      False
loan_amount    False
repaid         False
loan_id        False
loan_start     False
loan_end       False
rate           False
dtype: bool
```
Well, fortunately, there are no missing / null values in our data, so there is no need to count them.
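If there had been missing values, the usual next step is to count them per column and then fill (or drop) them. A small sketch on made-up data, using a median fill as one common remedy:

```python
import numpy as np
import pandas as pd

# Synthetic frame with one missing rate, to show the count-and-fill pattern
df = pd.DataFrame({'rate': [2.5, np.nan, 3.5]})

null_counts = df.isnull().sum()                       # missing values per column
df['rate'] = df['rate'].fillna(df['rate'].median())   # fill gaps with the median
```

Whether to fill with the median, the mean, or drop the rows entirely depends on how much data is missing and why.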
Outliers
Here, we will check for the presence of outliers. A simple way to check for outliers is a box plot. Let's visualize the data using a box plot.
```python
#outliers
df['loan_amount'].plot(kind = 'box')
```
Wow! We don't have any outliers in the loan_amount attribute of the data. But make sure you check all the relevant attributes for outliers.
Let’s check the rate attribute for the outlier presence.
```python
#outliers
df['rate'].plot(kind = 'box')
```
Well, well, well! We have some points sitting outside the whiskers now. So this confirms the presence of outliers in the rate attribute of our data.
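Besides eyeballing a box plot, you can flag outliers numerically with the same rule the box plot uses: anything beyond 1.5 times the interquartile range from the quartiles. A sketch on made-up rate values (not the actual Lending Club data):

```python
import pandas as pd

# Made-up rate values with one obvious outlier
rate = pd.Series([2.0, 2.5, 3.0, 3.5, 4.0, 12.0])

q1, q3 = rate.quantile(0.25), rate.quantile(0.75)
iqr = q3 - q1                                   # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the box plot's whisker limits
outliers = rate[(rate < lower) | (rate > upper)]
```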
In the next section, let’s see how we can get rid of these outliers.
Now, we will apply a transformation to the data so that we can reduce the effect of these outliers. For this purpose, we are going to take the square root of the rate values.
```python
#data transformation
import numpy as np

#find the sqrt of values
df['Sqrt'] = np.sqrt(df['rate'])
```
Good! We have now derived a new column based on the values in the rate attribute.
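The square root is not the only option here. A log transform compresses large values even more aggressively and is a common alternative for strongly right-skewed data. A small sketch on made-up values comparing the two:

```python
import numpy as np
import pandas as pd

# Made-up values: both transforms compress the large 100.0,
# but log1p compresses it much harder than sqrt does
df = pd.DataFrame({'rate': [1.0, 4.0, 9.0, 100.0]})
df['Sqrt'] = np.sqrt(df['rate'])
df['Log'] = np.log1p(df['rate'])  # log(1 + x), safe at zero
```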
An easy way to check the effect of the transformation is to look at the distributions.
Now, let's plot the data and see whether we have tamed the outliers or not.
```python
#import seaborn library
import seaborn as sns

#Distribution plots (note: distplot is deprecated in newer
#seaborn versions; use histplot or displot there instead)
sns.distplot(df['rate'])
sns.distplot(df['Sqrt'])
```
The transformed data is much less skewed and closer to a normal distribution now. It is so satisfying to see the bell-shaped curve. Since the transformation reduced the skew so well, we can report it as a good choice in our case.
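You can also put a number on this instead of judging by eye: skewness near zero means a roughly symmetric, bell-shaped distribution. A sketch on made-up right-skewed values showing the square root reducing the skew:

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for the rate column
rate = pd.Series([1.0, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0])
sqrt_rate = np.sqrt(rate)

skew_before = rate.skew()      # positive: long right tail
skew_after = sqrt_rate.skew()  # smaller: tail compressed by sqrt
```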
Encoding the Categorical variables
When you are working with any dataset, first understand the datatype of each attribute. Sometimes you may have categorical variables in your data. Let's have a check.
```python
#check the datatypes
df.dtypes
```

```
client_id        int64
loan_type       object
loan_amount      int64
repaid           int64
loan_id          int64
loan_start      object
loan_end        object
rate           float64
Sqrt           float64
dtype: object
```
Well, we have a single categorical column, i.e. loan_type.
Now, we have to encode the categorical values. For this purpose, you can simply use the get_dummies function from pandas.
```python
#dummies
cat_var = pd.get_dummies(df['loan_type'])
cat_var
```
```
   cash  credit  home  other
0     0       0     1      0
1     0       1     0      0
2     0       0     1      0
3     1       0     0      0
4     0       1     0      0
```
Well, we got our encoded values, and you are doing great. You can also make use of sklearn.preprocessing to proceed with label encoding and one-hot encoding.
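For example, here is a minimal sketch of label encoding with scikit-learn's LabelEncoder, run on made-up loan_type values (LabelEncoder assigns integer codes in alphabetical order of the categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up loan_type values to illustrate label encoding
df = pd.DataFrame({'loan_type': ['home', 'credit', 'home', 'cash', 'other']})

le = LabelEncoder()
df['loan_type_code'] = le.fit_transform(df['loan_type'])
# classes_ are sorted alphabetically: cash=0, credit=1, home=2, other=3
```

Note that label encoding imposes an artificial ordering on the categories, so for nominal variables like loan_type, one-hot encoding (as with get_dummies above) is usually the safer choice for modeling.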
Data preprocessing in Python – Conclusion
Data preprocessing in Python is the most important, as well as the most time-consuming, step in the data science pipeline. But I must say it is worth spending time on. If you get this right, you will be very close to building an amazing model. Understanding the data, basic stats, data distribution, missing values, outliers, and encoding are the key aspects of data preprocessing. We can have another story on encoding and model building later.
So, that’s all for now. Happy python!!!
More read: Data preprocessing