Data Preparation Using Pandas In Python

Filed Under: Pandas
Data Preparation Using Pandas In Python

Many times, we think the analysis is all about getting the data, load into preferable platforms, and using some tools, we plot some graphs and metrics, which are used in decision making. But, it is not the case ever.

Data Preparation Using Python Pandas

The data we get will be messy and there will be a lot of inconsistency over that. The need for data preparation will flash here. In this article, we will discuss some of the key data preparation methods using python pandas. 


1. Load the Data

For this entire discussion, we will be using the Titanic dataset due to its universality. You can find plenty of datasets in Kaggle, don’t forget to check it out. 

More read: 4 Easy Ways For Data Filtering In Python Pandas

#load the data
import pandas as pd

data = pd.read_csv('Titanic.csv')
Titanic

Our data is now ready to be prepared. You can peak into the data using the function data.head().


2. Explore the Basics

Before we move into preparation, we have to explore the data. It can be its shape, attributes, summary, datatype, and more.

Shape

#Shape of the data

data.shape
(891, 12)

Attributes

#attributes

data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Summary

#summary - statistical 

data.describe()
Titanic Describe

Datatypes

#datatypes

data.dtypes
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

As you can observe, we are using a very simple function to explore the data. These are just one-liners. You can also define a custom function with python code.

In the next steps, we will be focusing on preparing the data for the final analysis.


3. Drop Irrelevant Attributes

Not all the attributes in the data will be useful for analysis. So we need to drop them before you dive into analysis.

In our data, the attributes such as PassengerId, Ticket, and Name can be dropped straight away. Let’s do it.

#define the columns
columns = ['PassengerId','Name','Ticket']

#drop them
data.drop(columns, inplace = True, axis = 1)

You can cross-check that the data has dropped using the function data.columns().

#recheck

data.columns
Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

Perfect!. We have dropped the irrelevant now.


4. Duplicate check

Entry of the same data record more than once can be considered as a duplicate entry. We often encounter this issue with the datasets. So, we have to check for duplicates in the very initial phase of analysis, so that you do no need to worry later.

#duplicate

data.drop_duplicates(inplace = True)

You can confirm the same using the shape of the data.

#recheck

data.shape
(784, 9)

It seems there are no duplicate values in our data and it is good to see this always.


5. Missing Values

The missing values or the NaN’s can cause many issues while working with different modules and algorithms. So, you need to either replace them or fill them with a relevant value such as the mean of the values.

First, we have to check for the presence of missing values.

#missing values

data.isnull().sum()
Survived      0
Pclass        0
Sex           0
Age         106
SibSp         0
Parch         0
Fare          0
Cabin       581
Embarked      2
dtype: int64

Oh! Unfortunately, we have many missing values in our data. Let’s see how we can tackle them.

Replace

Missing values can be defined in multiple ways in the data. Somewhere you can see a blank space or it can even be denoted as NA or with a dash.

So, to get uniformity over the missing values, we have to replace all of them with ‘NA’ values. So that, later we can fill that NA value with the most relevant value.

#repalce

data['Age'].replace('NaN')
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

You can replace the values over the entire data as well. For that you can use this line of the code – data.repalce('NaN').

Fill NA’s

To fill the missing values, we will be using the mean of the values. Frist, you have to find the mean of the values in a particular attribute, and then, we call to fill that values in place of NA’s.

#find mean

round(data['Age'].mean(),2)
29.7

We have got the mean of the values. Now, let’s fill this value in place of missing values in the ‘Age’ attribute.

#fill NA's

data['Age'].fillna(29.7, inplace=True)

That’s perfect! You have no missing values in the ‘Age’ attribute now. You can do it for the rest of the data using the same method.


6. Pivoting The Data

By pivoting the data we can uncover a lot of hidden insights over the raw data. Let’s see how we can make use of the pivot function in pandas to get this done.

#pivoting

data = pd.pivot_table(data,values= 'Survived', index= 'Sex', columns= 'Pclass')

Basically, here we wanted to uncover survivability based on sex and class. Let’s see what pivot offers to us.

data preparation

You are awesome!

As we can see from the above-pivoted table, it is clear that females who are in the Pclass 1 and 2 are the luckiest ones as the survivability in those classes is nearly 100%.

All the above methods and functions will help you a lot in preparing your data in your analysis journey. I hope you will find them easy and handy at times.


Conclusion – Data Preparation

Data preparation is an essential phase of any data analysis project. If you were able to get your data right, then it will unleash many hidden mysteries in it. There will be many lost stories in that data, make sure you heard them right. In this journey, I hope the above methods and functions will assist you to the make the process easy and effective.

That’s all for now. Happy Python!!!

More read: Python pandas for data wrangling.

close
Generic selectors
Exact matches only
Search in title
Search in content