Crosstab Using Pandas For Data Summarization – In Detail

Filed Under: Pandas
Crosstab Using Pandas For Data Summarization

In this tutorial, we will be discussing the crosstab function in pandas which makes data summarization very easy and beautiful. Let’s dive deep!

Data summarization is defined as presenting data easily and comprehensively. It is one of the key data mining concepts which helps us to find the key insights over the data in a structured and easy way.

Crosstab Using Pandas For Data Summarization

Summarization of data is more like presenting the summary of the available data. This summary will be informative and easy to understand as well. There is some function which helps you in this purpose. For example, you can use the groupby and pivot_table function to summarize the data.

But for now, we will be focusing on crosstab pandas for data summarization.

Also readPandas count and value_count for data summarization.


Load the Data

For this tutorial, we will be using coffee sales data. You can easily download this dataset from the Kaggle website. Using pandas we can load the data and read it.

#load the data

data = pd.read_csv('coffeesales.csv')
data.head(5)
Crosstab 1

Let’s understand what is this data is all about and check basic information about this data.

#Columns of the data

data.columns
Index(['order_date', 'market', 'region', 'product_category', 'product', 'cost',
       'inventory', 'net_profit', 'sales'],
      dtype='object')
#Dimensions 

data.shape
(4248, 9)

We have 4248 rows and 9 columns in this data.

#datatypes

data.dtypes
order_date          object
market              object
region              object
product_category    object
product             object
cost                 int64
inventory            int64
net_profit           int64
sales                int64
dtype: object

We have an equal share of both categorical and numerical variables.

#null values

data.isnull().any()
order_date          False
market              False
region              False
product_category    False
product             False
cost                False
inventory           False
net_profit          False
sales               False
dtype: bool

That’s cool. We don’t have any missing or null values in our data.


1. Basic Crosstab in Python

Let’s start this by creating a simple crosstab in python. This will give us an idea to move further with advanced crosstabs.

#simple cross tab

simple_crosstab = pd.crosstab(data['region'], data['product_category'])
simple_crosstab
Crosstab 2

Observe that how easily a simple table can summarize the heart of the data. This is what crosstab can offer us within seconds and just in 2 lines of code. Thanks to Pandas!

You can even rename the row and column names of this cross tab. For this, you have to pass rownames and colnames parameters to the crosstab function. Run the below code and observe the changes.

#rename row and col names of crosstab

simple_crosstab = pd.crosstab(data['region'], data['product_category'], rownames = ['Regions in USA'], colnames = ['Product_categories'])
simple_crosstab

2. Crosstab with Multiple Rows

To dig out more insights and a summary of the data, we can add multiple rows to our crosstab. This will help us to explore and analyze the data in a meaningful way and moreover in multiple aspects as well.

#crosstab with multiple rows

multirow_crosstab = pd.crosstab(data['region'], [data['product_category'], df['market']])
multirow_crosstab
multiple rows

I have added the market values as well in this crosstab along with region and product categories. Now, things are much wider and throw a lot of information about the coffee sales.

You can even pass 3 rows. But make sure, you are not making it messy. Since it produces a summary of the data, it should be simple and easy to digest.


3. Subtotals

Yes, having totals will be a great way to add much more information for the audience. If you were thought of adding the subtotals to these crosstab tables, I have a simple solution for this.

You need to pass the margins parameter to the crosstab function and turn it to True. This is enough to produce the desired results.

# margins 

multirow_crosstab = pd.crosstab(data['region'], [data['product_category'], df['market']], margins = True)
multirow_crosstab
subtotals

You can observe the subtotals on both axes and I am sure now it looks more complete than before.

You can even change the total value name by passing the parameter, margins_name = ' ' .


4. Percentage of Occurrence

In the analysis, we frequently want to know the occurrence of the values over the data. Using crosstab, we can find out the percentage of occurrence of values.

#percentage of occurrence

percentage_of_occurrence = pd.crosstab(data['region'], 
                                       data['product_category'], 
                                       normalize = True
                                      ).style.format('{:.1%}'
                                                    ).background_gradient(
).set_caption('Heatmap of sales over different regions')


percentage_of_occurrence
percentage of values

Wowee 馃槢

Don’t worry about a new look at the crosstab. Here, I have added the pandas dataframe stylings and the gradient heat map to make it glamorous. But don’t forget to observe the percentage of occurrences.


5. Aggregation

By default, the crosstab function will use count as an aggregation function over the values. You can change any other aggregation function by passing the parameter aggfunc = ‘ ‘.

Here, I will be using mean as my aggregation function over that data. Like above, I will add some dataframe styling to make it beautiful.

#aggregation

multirow_crosstab = pd.crosstab(data['region'], 
                                data['product_category'], 
                                values=data['net_profit'], 
                                aggfunc = 'mean'
                               ).round(2).style.background_gradient()
multirow_crosstab
heat map

Here, observe that I have passed the values parameter so that pandas will apply the mean function to the net profit of the data.


Crosstab in Python Pandas – Conclusion

Pandas crosstab is one of the widely used functions for data summarization. If you are fighting with huge data, by using crosstabs you can define the summary of those big chunks of data in an easy and informative manner for your audience or for your analysis.

I have addressed some of the dataframe styles offered by pandas to make those crosstabs look great. I hope you learned something about crosstabs throughout this tutorial.

That’s all for now. Happy Python!

More read: Pandas dataframe styling

Leave a Reply

Your email address will not be published. Required fields are marked *

close
Generic selectors
Exact matches only
Search in title
Search in content