Pandas, pandas and pandas. When it comes to data manipulation and analysis in Python, little serves the purpose better than pandas. In previous stories we have covered many data operations using pandas, and today we are going to explore data summarization. So, without wasting much time on the intro, let’s roll!
Data summarization is nothing but extracting raw data and presenting it as a summary. Raw data on its own rarely makes sense to your audience, so breaking it into subsets and then gathering the insights from each can craft a neat story any day.
Pandas offers many functions, such as count, value_counts, crosstab, groupby, and more, to present raw data in an informative way.
Well, in this story, we are going to explore all the data summarization techniques using pandas in python.
Pandas count is a very simple function used to get the number of non-null data points. Its applications are limited compared to crosstab and groupby, but it is handy all the same.
Before we move forward, let’s install all the required libraries for data summarization in python.
#Pandas
import pandas as pd

#Numpy
import numpy as np

#Matplotlib
import matplotlib.pyplot as plt

#Seaborn
import seaborn as sns
Now, let’s load our Titanic data. The reason I am using this data is, it is pretty easy to understand the data summarization using these attributes. So, if you are a beginner or a pro, it will best suit the purpose.
#titanic data
data = pd.read_csv('titanic.csv')
We can dig deep to understand the basic information about the data.
#data columns
data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
#data types
data.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
Well, we have both numerical and categorical data types in our data, and that will spice things up for sure.
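Alongside the dtypes, pandas can give a quick statistical summary in one more line. A minimal sketch below uses a tiny toy frame standing in for the Titanic data; with the real data loaded above, `data.describe()` works the same way.

```python
import pandas as pd
import numpy as np

# Toy frame mimicking a few Titanic columns (not the real file)
data = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Age': [22.0, 38.0, np.nan, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

# describe() summarizes the numeric columns: count, mean, std,
# min, quartiles, and max in a single table
print(data.describe())

# include='object' does the same for categorical columns:
# count, number of unique values, top value, and its frequency
print(data.describe(include='object'))
```

Note how the Age count already hints at missing values, which is exactly what we dig into next.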
Now, it’s time to count the values present in both rows and columns.
#count of values in columns
data.count(axis=0)
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64
You can see that most of the columns have 891 values, but columns such as Cabin and Age have fewer. That indicates null values or missing data. Let’s look at the rows in the same way.
#count of values in rows
data.count(axis=1)
0      11
1      12
2      11
3      12
4      11
       ..
886    11
887    12
888    10
889    12
890    11
Length: 891, dtype: int64
You can observe that not all the rows have the same number of values. An ideal row of this data should have 12 values.
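The per-row counts can also be used to pull the incomplete records out directly. A small sketch of the idea, using a toy frame with illustrative columns standing in for the Titanic data:

```python
import pandas as pd
import numpy as np

# Toy frame with some missing values (columns are illustrative)
data = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Age': [22.0, np.nan, 35.0],
    'Cabin': ['C85', np.nan, np.nan],
})

# count(axis=1) counts non-null values per row; any row below
# the full column count has missing data somewhere
n_cols = data.shape[1]
incomplete = data[data.count(axis=1) < n_cols]
print(incomplete)
```

On the Titanic data, the same filter with `n_cols = 12` would surface every passenger record with at least one missing field.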
You can also inspect the data by index level. Let’s use the set_index function for that.
#set index
data = data.set_index(['Sex', 'Pclass'])
data.head(2)
That’s our data with the new index levels in place!
Now we have 2 attributes as our data index. So, let’s set the count level to ‘Sex’ to count within each value of that level.
#count level
data.count(level = 'Sex')
Similarly, for ‘Pclass’:
#count level
data.count(level = 'Pclass')
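A caveat worth knowing: the `level` argument of `count` was deprecated in pandas 1.3 and removed in pandas 2.0. On current versions, the same per-level counts come from a groupby on the index. A sketch with a toy two-level index standing in for Sex/Pclass:

```python
import pandas as pd
import numpy as np

# Toy frame indexed by two levels, like the Titanic example
data = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Pclass': [1, 1, 3, 3],
    'Age': [22.0, 38.0, np.nan, 35.0],
}).set_index(['Sex', 'Pclass'])

# Equivalent of data.count(level='Sex') on pandas >= 2.0:
# group on the index level, then count non-null values
print(data.groupby(level='Sex').count())
print(data.groupby(level='Pclass').count())
```

The output matches what `count(level=...)` used to give, so the rest of the story carries over unchanged.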
That’s the kind of information you need before you get to data modeling.
The value_counts function offers more functionality than count in 1-2 lines of code. It will definitely earn more respect in your eyes, as it can perform many groupby-style operations more seamlessly.
#value counts
data.value_counts(['Pclass'])
Pclass
3         491
1         216
2         184
dtype: int64
That’s cool. We now have information about all three classes and the values that belong to each of them.
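value_counts also accepts several columns at once and counts each combination of values, which is where it starts to feel like a one-line groupby. A sketch with a few made-up rows:

```python
import pandas as pd

# Toy frame standing in for the Titanic columns
data = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': ['male', 'female', 'male', 'male', 'female'],
})

# Count every (Pclass, Sex) combination, most frequent first
counts = data.value_counts(['Pclass', 'Sex'])
print(counts)
```

The result is a Series with a MultiIndex, so `counts[(3, 'male')]` pulls out a single combination directly.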
One of the best features of the value_counts function is that you can even normalize the data.
#normalization
data.value_counts(['Pclass'], normalize = True, sort = True, ascending = True)
Pclass
2         0.206510
1         0.242424
3         0.551066
dtype: float64
Here, we have not only normalized the values but also sorted them in ascending order, which makes the proportions easy to read.
For a numeric attribute with no natural levels, such as “Fare”, we can create bins instead. Let’s see how it works.

#bins
data['Fare'].value_counts(bins = 5)

(-0.513, 102.466]     838
(102.466, 204.932]     33
(204.932, 307.398]     17
(409.863, 512.329]      3
(307.398, 409.863]      0
Name: Fare, dtype: int64
Well, we have created 5 bin ranges for “Fare”. Most of the ticket prices sit in the lowest range, while the handful of very expensive tickets at the top end belong to Pclass 1.
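The bins value_counts picks are equal-width; when you want your own edges (say a cheap / mid / expensive split for fares), pd.cut does the same job with explicit boundaries. A sketch with a few made-up fares and hypothetical cutoffs:

```python
import pandas as pd

# A handful of illustrative fares
fares = pd.Series([7.25, 71.28, 8.05, 512.33, 26.0])

# Cut into hand-picked ranges instead of 5 equal-width bins
labels = pd.cut(fares, bins=[0, 50, 200, 600],
                labels=['cheap', 'mid', 'expensive'])
print(labels.value_counts())
```

The labeled categories then feed straight into value_counts, crosstab, or groupby like any other column.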
A crosstab is a simple function that shows the relationship between two variables. It is very handy to quickly analyze two variables.
Now, let’s see the relationship between Sex and the survival of the passengers in the data. Since we set [‘Sex’, ‘Pclass’] as the index earlier, we first reset it so those columns are available again.

#reset the index back to columns
data = data.reset_index()

#crosstab of Sex vs Survived
pd.crosstab(data['Sex'], data['Survived'])

Survived    0    1
Sex
female     81  233
male      468  109
You can see the clear relationship between Sex and survival. We can plot this data for better visibility.
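One simple way to plot it (a sketch; any bar chart works) is to call .plot on the crosstab itself. The toy frame below stands in for the Titanic data; the output filename is just an example.

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs anywhere
import matplotlib.pyplot as plt

# Toy data standing in for the Titanic columns
data = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female', 'male'],
    'Survived': [0, 1, 0, 1, 1],
})

# Crosstab of Sex vs Survived, drawn as grouped bars
table = pd.crosstab(data['Sex'], data['Survived'])
ax = table.plot(kind='bar', rot=0)
ax.set_ylabel('Passenger count')
plt.savefig('sex_vs_survived.png')
```

With the real data, the same two lines on the 891-row crosstab make the survival gap between the sexes immediately visible.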
That’s cool! I hope things are clearer now.
We can do much more with crosstab: we can add multiple data layers and even visualize the result.
#multiple layers crosstab
pd.crosstab([data['Pclass'], data['Sex']],
            [data['Embarked'], data['Survived']],
            rownames = ['Pclass', 'gender'],
            colnames = ['Embarked', 'Survived'],
            dropna = False)
There is a lot of information in just one table. That’s crosstab for you! Finally, let’s draw a heatmap of this table’s counts and see how it looks.
#heatmap of the crosstab counts
import seaborn as sns
sns.heatmap(pd.crosstab([data['Pclass'], data['Sex']],
                        [data['Embarked'], data['Survived']]),
            annot = True)
We have got a neat heatmap showing the key counts in the data at a glance.
Data Summarization – Conclusion
Data manipulation and analysis matter because they reveal key insights and hidden patterns in your data. In this regard, data summarization is one of the best techniques you can use to get into your data for a solid analysis.
That’s all for now and I hope this story helps you in your analysis. Happy Python!!!
More read: Data manipulation and statistical analysis