Pandas DataFrame groupby() Function

Filed Under: Python
Pandas groupby() Function

1. Pandas groupby() function

Pandas DataFrame groupby() function is used to group rows that have the same values. It’s mostly used with aggregate functions (count, sum, min, max, mean) to get the statistics based on one or more column values.

Pandas gropuby() function is very similar to the SQL group by statement. Afterall, DataFrame and SQL Table are almost similar too. It’s an intermediary function to create groups before reaching the final result.

2. Split Apply Combine

It’s also called the split-apply-combine process. The groupby() function splits the data based on some criteria. The aggregate function is applied to each of the groups and then combined together to create the result DataFrame. The below diagram illustrates this behavior with a simple example.

Pandas Groupby Split Apply Combine Example

Split Apply Combine Example

3. Pandas DataFrame groupby() Syntax

The groupby() function syntax is:


groupby(
        self,
        by=None,
        axis=0,
        level=None,
        as_index=True,
        sort=True,
        group_keys=True,
        squeeze=False,
        observed=False,
        **kwargs
    )
  • The by argument determines the way to groupby elements. Generally, column names are used to group by the DataFrame elements.
  • The axis parameter determines whether to grouby rows or columns.
  • The level is used with MultiIndex (hierarchical) to group by a particular level or levels.
  • as_index specifies to return aggregated object with group labels as the index.
  • The sort parameter is used to sort group keys. We can pass it as False for better performance with larger DataFrame objects.
  • group_keys: when calling apply, add group keys to index to identify pieces.
  • squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
  • observed: If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
  • **kwargs: only accepts keyword argument ‘mutated’ and is passed to groupby.

The groupby() function returns DataFrameGroupBy or SeriesGroupBy depending on the calling object.

4. Pandas groupby() Example

Let’s say we have a CSV file with the below content.


ID,Name,Role,Salary
1,Pankaj,Editor,10000
2,Lisa,Editor,8000
3,David,Author,6000
4,Ram,Author,4000
5,Anupam,Author,5000

We will use Pandas read_csv() function to read the CSV file and create the DataFrame object.


import pandas as pd

df = pd.read_csv('records.csv')

print(df)

Output:


   ID    Name    Role  Salary
0   1  Pankaj  Editor   10000
1   2    Lisa  Editor    8000
2   3   David  Author    6000
3   4     Ram  Author    4000
4   5  Anupam  Author    5000

4.1) Average Salary Group By Role

We want to know the average salary of the employees based on their role. So we will use groupby() function to create groups based on the ‘Role’ column. Then call the aggregate function mean() to calculate the average and produce the result. Since we don’t need ID and Name columns, we will remove them from the output.


df_groupby_role = df.groupby(['Role'])

# select only required columns
df_groupby_role = df_groupby_role[["Role", "Salary"]]

# get the average
df_groupby_role_mean = df_groupby_role.mean()

print(df_groupby_role_mean)

Output:


        Salary
Role          
Author    5000
Editor    9000

The indexes in the output don’t look good. We can fix it by calling the reset_index() function.


df_groupby_role_mean = df_groupby_role_mean.reset_index()
print(df_groupby_role_mean)

Output:


     Role  Salary
0  Author    5000
1  Editor    9000

4.2) Total Salary Paid By Role

In this example, we will calculate the salary paid for each role.


df_salary_by_role = df.groupby(['Role'])[["Role", "Salary"]].sum().reset_index()
print(df_salary_by_role)

Output:


     Role  Salary
0  Author   15000
1  Editor   18000

This example looks simple because everything is done in a single line. In the earlier example, I had divided the steps for clarity.

4.3) Total Number of Employees by Role

We can use size() aggregate function to get this data.


df_size_by_role = df.groupby(['Role']).size().reset_index()
df_size_by_role.columns.values[1] = 'Count'  # renaming the size column
print(df_size_by_role)

Output:


     Role  Count
0  Author      3
1  Editor      2

5. References

Leave a Reply

Your email address will not be published. Required fields are marked *

close
Generic selectors
Exact matches only
Search in title
Search in content
Search in posts
Search in pages