Data Preparation Using Python Pandas

Many times, we think analysis is all about getting the data, loading it into our preferred platform, plotting a few graphs and metrics with some tools, and using them for decision making. But that is rarely the whole story.

The data we get is usually messy, with a lot of inconsistencies, and that is where data preparation comes in. In this article, we will discuss some of the key data preparation methods using Python pandas.
1. Load the Data
#load the data
import pandas as pd

data = pd.read_csv('Titanic.csv')
Our data is now ready to be prepared. You can peek into the first few rows using data.head().
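If you do not have the Titanic.csv file handy, a tiny stand-in DataFrame works just as well for trying the calls. The rows below are made up purely for illustration:

```python
import pandas as pd

# A made-up stand-in for the Titanic data, just to demo the calls
data = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived':    [0, 1, 1],
    'Age':         [22.0, 38.0, 26.0],
})

# Peek at the first few rows
print(data.head())
```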
2. Explore the Basics
Before we move into preparation, we have to explore the data. It can be its shape, attributes, summary, datatype, and more.
#shape of the data
data.shape

(891, 12)

#attribute names
data.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

#statistical summary
data.describe()

#datatype of each attribute
data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
As you can observe, these are very simple one-liner functions for exploring the data. You can also define custom exploration functions of your own in Python.
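As a self-contained sketch, here are the same exploration one-liners run on a small toy DataFrame (the values are made up, not from the Titanic file):

```python
import pandas as pd

# Toy DataFrame standing in for the Titanic data
data = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age':      [22.0, 38.0, 26.0, 35.0],
    'Sex':      ['male', 'female', 'female', 'male'],
})

print(data.shape)       # (4, 3): four rows, three attributes
print(data.columns)     # the attribute names
print(data.dtypes)      # datatype of each attribute
print(data.describe())  # statistical summary of the numeric attributes
```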
In the next steps, we will be focusing on preparing the data for the final analysis.
3. Drop Irrelevant Attributes
Not all the attributes in the data will be useful for analysis, so we need to drop the irrelevant ones before diving in.
In our data, the attributes such as PassengerId, Ticket, and Name can be dropped straight away. Let’s do it.
#define the columns to drop
columns = ['PassengerId', 'Name', 'Ticket']

#drop them
data.drop(columns, inplace=True, axis=1)
You can cross-check that the columns have been dropped using data.columns.
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked'], dtype='object')
Perfect! We have dropped the irrelevant attributes.
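A minimal sketch of the drop step on a toy DataFrame (made-up rows, same idea):

```python
import pandas as pd

# Toy DataFrame with an identifier column we do not need for analysis
data = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived':    [0, 1, 1],
    'Age':         [22.0, 38.0, 26.0],
})

# axis=1 means "columns"; inplace=True mutates data directly
data.drop(['PassengerId'], inplace=True, axis=1)

print(data.columns)  # 'PassengerId' is gone
```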
4. Duplicate check
A record entered more than once is a duplicate. We often encounter this issue in real datasets, so it is best to check for duplicates in the very first phase of analysis, so that you do not need to worry about them later.
#drop duplicate rows, if any
data.drop_duplicates(inplace=True)
You can confirm the same using the shape of the data.
It seems there are no duplicate rows in our data, which is always good to see.
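Here is a self-contained sketch where a duplicate actually exists, so you can see the shape change (toy values):

```python
import pandas as pd

# Toy DataFrame where the first record was entered twice
data = pd.DataFrame({
    'Survived': [0, 0, 1],
    'Age':      [22.0, 22.0, 38.0],
})

print(data.shape)   # (3, 2) before the check
data.drop_duplicates(inplace=True)
print(data.shape)   # (2, 2): the duplicate row is gone
```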
5. Missing Values
Missing values, or NaNs, can cause many issues while working with different modules and algorithms. So you need to either drop them or fill them with a relevant value, such as the mean of the attribute.
First, we have to check for the presence of missing values.
#count missing values per attribute
data.isnull().sum()
Survived      0
Pclass        0
Sex           0
Age         106
SibSp         0
Parch         0
Fare          0
Cabin       581
Embarked      2
dtype: int64
Oh! Unfortunately, we have many missing values in our data. Let’s see how we can tackle them.
Missing values can be denoted in multiple ways in raw data: sometimes as a blank space, sometimes as 'NA' or a dash. To get uniformity, we first replace all such placeholders with proper NaN values, so that later we can fill them with the most relevant value.
#inspect the 'Age' attribute
data['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ...
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
You can replace such placeholder values over the entire data as well, in a single line (with numpy imported as np):

#replace placeholder markers across the whole DataFrame
data.replace(['NA', '-'], np.nan, inplace=True)
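As a self-contained sketch of this cleanup (toy values, not the Titanic file), mapping every placeholder to a real NaN makes the usual missing-value check see them:

```python
import numpy as np
import pandas as pd

# Toy data where missing entries were typed as 'NA', '-', or ''
data = pd.DataFrame({
    'Age':   ['22', 'NA', '35'],
    'Cabin': ['C85', '-', ''],
})

# Map every placeholder to a real NaN across the whole DataFrame
data = data.replace(['NA', '-', ''], np.nan)

# Now isnull() counts them as missing
print(data.isnull().sum())
```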
To fill the missing values, we will use the mean. First, find the mean of the values in a particular attribute; then fill it in place of the NaNs.
#find the mean of the 'Age' attribute
round(data['Age'].mean(), 2)

29.7
We have got the mean of the values. Now, let’s fill this value in place of missing values in the ‘Age’ attribute.
#fill NaNs in 'Age' with the mean
data['Age'].fillna(29.7, inplace=True)
That’s perfect! You have no missing values in the ‘Age’ attribute now. You can do it for the rest of the data using the same method.
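A self-contained sketch of the mean-fill step on toy ages (not the real column), computing the mean instead of hard-coding it:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Age': [22.0, np.nan, 38.0, np.nan, 30.0]})

# Mean over the non-missing values: (22 + 38 + 30) / 3 = 30.0
mean_age = round(data['Age'].mean(), 2)

# Fill the gaps with that mean (assignment avoids chained-assignment pitfalls)
data['Age'] = data['Age'].fillna(mean_age)

print(data['Age'].isnull().sum())  # 0
```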
6. Pivoting The Data
By pivoting the data we can uncover a lot of hidden insights over the raw data. Let’s see how we can make use of the pivot function in pandas to get this done.
#pivot: average survival rate by Sex and Pclass
data = pd.pivot_table(data, values='Survived', index='Sex', columns='Pclass')
Basically, here we wanted to uncover survivability based on sex and class. Let’s see what pivot offers to us.
You are awesome!
As we can see from the pivoted table above, females in Pclass 1 and 2 were the luckiest, as survivability in those classes is nearly 100%.
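A self-contained sketch of the pivot on toy records (the survival values below are made up, only the shape of the result matters). With no aggfunc given, pivot_table averages the 'Survived' flags, so each cell is a survival rate:

```python
import pandas as pd

# Toy records: survival by sex and passenger class (made-up values)
data = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male'],
    'Pclass':   [1, 2, 1, 2, 1, 1],
    'Survived': [1, 1, 0, 0, 1, 1],
})

# Mean survival rate for each Sex x Pclass cell (mean is the default aggfunc)
pivot = pd.pivot_table(data, values='Survived', index='Sex', columns='Pclass')
print(pivot)
```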
All the above methods and functions will help you a lot in preparing data during your analysis journey. I hope you find them easy and handy.
Conclusion – Data Preparation
Data preparation is an essential phase of any data analysis project. If you get it right, your data will reveal many of its hidden stories, so make sure you hear them correctly. I hope the above methods and functions assist you in making the process easy and effective.
That’s all for now. Happy Python!!!
More read: Python pandas for data wrangling.