Table of Contents
1. Pandas groupby() function
Pandas DataFrame groupby() function is used to group rows that have the same values. It’s mostly used with aggregate functions (count, sum, min, max, mean) to get the statistics based on one or more column values.
Pandas gropuby() function is very similar to the SQL group by statement. Afterall, DataFrame and SQL Table are almost similar too. It’s an intermediary function to create groups before reaching the final result.
2. Split Apply Combine
It’s also called the split-apply-combine process. The groupby() function splits the data based on some criteria. The aggregate function is applied to each of the groups and then combined together to create the result DataFrame. The below diagram illustrates this behavior with a simple example.
3. Pandas DataFrame groupby() Syntax
The groupby() function syntax is:
groupby( self, by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs )
- The by argument determines the way to groupby elements. Generally, column names are used to group by the DataFrame elements.
- The axis parameter determines whether to grouby rows or columns.
- The level is used with MultiIndex (hierarchical) to group by a particular level or levels.
- as_index specifies to return aggregated object with group labels as the index.
- The sort parameter is used to sort group keys. We can pass it as False for better performance with larger DataFrame objects.
- group_keys: when calling apply, add group keys to index to identify pieces.
- squeeze: Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
- observed: If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
- **kwargs: only accepts keyword argument ‘mutated’ and is passed to groupby.
The groupby() function returns DataFrameGroupBy or SeriesGroupBy depending on the calling object.
4. Pandas groupby() Example
Let’s say we have a CSV file with the below content.
ID,Name,Role,Salary 1,Pankaj,Editor,10000 2,Lisa,Editor,8000 3,David,Author,6000 4,Ram,Author,4000 5,Anupam,Author,5000
We will use Pandas read_csv() function to read the CSV file and create the DataFrame object.
import pandas as pd df = pd.read_csv('records.csv') print(df)
ID Name Role Salary 0 1 Pankaj Editor 10000 1 2 Lisa Editor 8000 2 3 David Author 6000 3 4 Ram Author 4000 4 5 Anupam Author 5000
4.1) Average Salary Group By Role
We want to know the average salary of the employees based on their role. So we will use groupby() function to create groups based on the ‘Role’ column. Then call the aggregate function mean() to calculate the average and produce the result. Since we don’t need ID and Name columns, we will remove them from the output.
df_groupby_role = df.groupby(['Role']) # select only required columns df_groupby_role = df_groupby_role[["Role", "Salary"]] # get the average df_groupby_role_mean = df_groupby_role.mean() print(df_groupby_role_mean)
Salary Role Author 5000 Editor 9000
The indexes in the output don’t look good. We can fix it by calling the reset_index() function.
df_groupby_role_mean = df_groupby_role_mean.reset_index() print(df_groupby_role_mean)
Role Salary 0 Author 5000 1 Editor 9000
4.2) Total Salary Paid By Role
In this example, we will calculate the salary paid for each role.
df_salary_by_role = df.groupby(['Role'])[["Role", "Salary"]].sum().reset_index() print(df_salary_by_role)
Role Salary 0 Author 15000 1 Editor 18000
This example looks simple because everything is done in a single line. In the earlier example, I had divided the steps for clarity.
4.3) Total Number of Employees by Role
We can use size() aggregate function to get this data.
df_size_by_role = df.groupby(['Role']).size().reset_index() df_size_by_role.columns.values = 'Count' # renaming the size column print(df_size_by_role)
Role Count 0 Author 3 1 Editor 2