1. Pandas cut() Function
The pandas cut() function sorts the elements of an array into separate bins. It works only on one-dimensional array-like objects.
2. Usage of Pandas cut() Function
The cut() function is useful when we have a large amount of numeric data and want to group it into ranges for statistical analysis.
For example, let’s say we have an array of numbers between 1 and 20. We want to divide them into two bins of (1, 10] and (10, 20] and add labels such as “Lows” and “Highs”. We can easily perform this using the pandas cut() function.
Furthermore, we can apply functions to the elements of a specific bin or label.
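The scenario above can be sketched as follows. The sample values and the variable names (nums, groups, bin_means) are assumptions made for illustration, and the per-bin mean is just one possible function to apply to each bin.

```python
import pandas as pd

# Hypothetical sample data: a few integers from 2 through 20
nums = pd.Series([2, 5, 10, 11, 15, 20], name="num")

# Two bins, (1, 10] and (10, 20], labelled "Lows" and "Highs"
groups = pd.cut(nums, bins=[1, 10, 20], labels=["Lows", "Highs"])
print(groups)

# Apply a function (here, the mean) to the values that fall in each bin
bin_means = nums.groupby(groups, observed=False).mean()
print(bin_means)
```

Because the bins are right-closed by default, 10 lands in "Lows" here.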
3. Pandas cut() Function Syntax
The cut() function syntax is:
cut(
x,
bins,
right=True,
labels=None,
retbins=False,
precision=3,
include_lowest=False,
duplicates="raise",
)
- x is the input array to be binned. It must be one-dimensional.
- bins defines the bin edges for the segmentation. It can be a sequence of edges or an integer number of equal-width bins.
- right indicates whether each bin includes its right edge. The default value is True, which gives bins of the form (a, b]; with right=False they become [a, b).
- labels specifies the labels for the returned bins. There must be one label per bin; pass labels=False to get integer bin indicators instead.
- retbins specifies whether to return the bin edges along with the binned result.
- precision specifies the precision at which to store and display the bin labels.
- include_lowest specifies whether the first interval should be left-inclusive or not.
- duplicates specifies what to do if the bin edges are not unique: "raise" raises a ValueError and "drop" removes the duplicate edges.
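A short sketch of how some of these parameters behave may help. The sample values and variable names below are assumptions chosen for illustration, not part of the original examples.

```python
import pandas as pd

values = [1, 7, 25, 60, 100]  # hypothetical sample data

# bins given as an integer: pandas computes 4 equal-width bins over the data
# range; retbins=True also returns the computed edges
binned, edges = pd.cut(values, bins=4, retbins=True)
print(edges)

# With the default settings the first bin is (1, 25], so the value 1 is not binned
half_open = pd.cut(values, bins=[1, 25, 50, 75, 100])
# include_lowest=True closes the first interval on the left, so 1 is binned
closed = pd.cut(values, bins=[1, 25, 50, 75, 100], include_lowest=True)
print(half_open.isna().sum(), closed.isna().sum())

# duplicates="drop" discards the repeated edge instead of raising ValueError
deduped = pd.cut(values, bins=[1, 25, 25, 100], duplicates="drop")
print(deduped.categories)
```

Values that fall outside every bin are returned as NaN rather than raising an error, which is why the check above counts missing entries.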
4. Pandas cut() Function Examples
Let’s look into some examples of the pandas cut() function. I will use NumPy to generate random numbers to populate the DataFrame object.
4.1) Segment Numbers into Bins
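import pandas as pd
import numpy as np
# Ten random integers from 1 to 99 (the upper bound of randint is exclusive)
df_nums = pd.DataFrame({'num': np.random.randint(1, 100, 10)})
print(df_nums)
# Assign each number to one of four bins: (1, 25], (25, 50], (50, 75], (75, 100]
df_nums['num_bins'] = pd.cut(x=df_nums['num'], bins=[1, 25, 50, 75, 100])
print(df_nums)
# List the distinct bins that actually occur in the data
print(df_nums['num_bins'].unique())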
Output:
num
0 80
1 40
2 25
3 9
4 66
5 13
6 63
7 33
8 20
9 60
num num_bins
0 80 (75, 100]
1 40 (25, 50]
2 25 (1, 25]
3 9 (1, 25]
4 66 (50, 75]
5 13 (1, 25]
6 63 (50, 75]
7 33 (25, 50]
8 20 (1, 25]
9 60 (50, 75]
[(75, 100], (25, 50], (1, 25], (50, 75]]
Categories (4, interval[int64]): [(1, 25] < (25, 50] < (50, 75] < (75, 100]]
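Notice that 25 is part of the bin (1, 25]. That is because each bin includes its right edge by default. If you don’t want that, pass right=False to the cut() function. Also note that the first bin excludes its left edge, so a value of exactly 1 would not fall into any bin and would be assigned NaN unless include_lowest=True is passed.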
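To see the effect of the right parameter side by side, here is a minimal sketch. The values are hand-picked (an assumption, not taken from the random output above) so that some of them sit exactly on a bin edge.

```python
import pandas as pd

nums = pd.Series([25, 50, 99])  # hypothetical values on or near bin edges
edges = [1, 25, 50, 75, 100]

right_closed = pd.cut(nums, bins=edges)              # default: bins are (a, b]
left_closed = pd.cut(nums, bins=edges, right=False)  # bins are [a, b)

# Compare where each value lands under the two settings
print(pd.DataFrame({
    'num': nums,
    'right=True': right_closed,
    'right=False': left_closed,
}))
```

With the default, 25 is the upper edge of its bin; with right=False it becomes the lower edge of the next one.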
4.2) Adding Labels to Bins
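import pandas as pd
import numpy as np
# Ten random integers from 1 to 19 (the upper bound of randint is exclusive)
df_nums = pd.DataFrame({'num': np.random.randint(1, 20, 10)})
print(df_nums)
# right=False makes the bins [1, 10) and [10, 20), labelled 'Lows' and 'Highs'
df_nums['nums_labels'] = pd.cut(x=df_nums['num'], bins=[1, 10, 20], labels=['Lows', 'Highs'], right=False)
print(df_nums)
print(df_nums['nums_labels'].unique())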
Since we want 10 to be part of Highs, we specify right=False in the cut() function call, which makes the bins [1, 10) and [10, 20).
Output:
num
0 5
1 16
2 6
3 13
4 2
5 10
6 18
7 10
8 2
9 18
num nums_labels
0 5 Lows
1 16 Highs
2 6 Lows
3 13 Highs
4 2 Lows
5 10 Highs
6 18 Highs
7 10 Highs
8 2 Lows
9 18 Highs
[Lows, Highs]
Categories (2, object): [Lows < Highs]
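Once the labels are in place, they can be used like any other categorical column. The following sketch uses fixed sample values (an assumption, so that the result does not depend on random numbers) to count and filter by label, and shows that the number of labels has to match the number of bins.

```python
import pandas as pd

df = pd.DataFrame({'num': [5, 16, 10, 2, 18]})  # hypothetical sample values
df['label'] = pd.cut(df['num'], bins=[1, 10, 20], labels=['Lows', 'Highs'], right=False)

# Count how many values landed in each labelled bin
counts = df['label'].value_counts()
print(counts)

# Select only the rows labelled 'Highs'
highs = df[df['label'] == 'Highs']
print(highs)

# Three edges define two bins, so passing three labels raises ValueError
try:
    pd.cut(df['num'], bins=[1, 10, 20], labels=['Low', 'Mid', 'High'])
except ValueError as err:
    print('ValueError:', err)
```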