Top Statistical Functions To Use With Pandas In Python

Pandas is one of the most useful Python libraries for data manipulation and analysis. Beyond cleaning and reshaping data, it offers many functions that compute statistical measures of that data. In this story, let’s look at some of the top statistical functions offered by pandas.


Loading the Data For Statistical Functions

To see how these statistical functions work, we need data. For this, we will use a coffee sales dataset, which is fairly large and has multiple features.

#data

import pandas as pd
data = pd.read_csv('coffeesales.csv')
data.head(5)

Well, our data is now ready to be explored statistically. Before moving forward, let’s look at some basic features of our data.

Shape

#shape

data.shape
(4248, 9)

We have 4,248 rows and 9 features in our data.

Features

#features

data.columns
Index(['order_date', 'market', 'region', 'product_category', 'product', 'cost',
       'inventory', 'net_profit', 'sales'],
      dtype='object')

I think this should be enough. Now, let’s explore our data using some of the top statistical functions offered by pandas.


1. Describe

The describe function is one of the most useful in pandas. It reports statistical measures such as the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles.

#describe

data.describe()

With this one-liner, we can quickly get enough information to understand our data. In the above output, we can easily find key information such as the maximum sales, the minimum cost, and more.

The describe function is the best fit for summary statistics. It works very well with pandas DataFrames and returns the results in a flash.

By default, it summarises only the numeric columns, so the categorical columns present in our data are left out.
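If you also want the categorical columns in the summary, describe accepts an include argument. Here is a minimal sketch on a tiny made-up frame (the toy name and its values are only for illustration, they are not taken from the coffee sales file):

```python
import pandas as pd

# Hypothetical stand-in for the coffee sales data; the values are made up.
toy = pd.DataFrame({
    "market": ["Retail", "Wholesale", "Retail", "Retail"],
    "sales": [120, 340, 95, 210],
})

# Default behaviour: only the numeric column is summarised.
numeric_summary = toy.describe()

# include="all" adds the text column, reported with count, unique, top and freq.
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the second table, the numeric rows (mean, std, and so on) are empty for the text column, and the unique, top and freq rows are empty for the numeric one.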


2. Min, Max and idxmin, idxmax

I am sure you are well aware of the min and max functions in Python. The idxmin and idxmax functions are some of the coolest I have ever seen.

  • min and max – These functions return the minimum and maximum value in a particular column.
  • idxmin and idxmax – These functions return the index of those min and max values. Isn’t it cool 😛

#Min

min(data['sales'])

17

#Max

max(data['sales'])

912

#idxmin

data['sales'].idxmin()

154

#idxmax

data['sales'].idxmax()

1154

Here, you can see that the min and max values are 17 and 912 respectively. The value 17 sits at index 154, and the value 912 at index 1154. That’s something awesome 😛
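The index returned by idxmin and idxmax is a row label, so it can be handed straight to .loc to pull out the whole row. A small sketch, assuming a made-up frame with custom labels (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical mini data set; the values are made up for illustration.
toy = pd.DataFrame(
    {"product": ["Latte", "Mocha", "Espresso"], "sales": [310, 17, 912]},
    index=[10, 20, 30],  # non-default labels, to show that a label is returned
)

lowest_label = toy["sales"].idxmin()   # label of the row with the smallest sales
highest_label = toy["sales"].idxmax()  # label of the row with the largest sales

# .loc looks rows up by label, so it pairs naturally with idxmin and idxmax.
print(toy.loc[lowest_label])
print(toy.loc[highest_label])
```

This way you see not just where the extreme value is, but also which product and market it belongs to.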


3. nsmallest and nlargest

The nsmallest function returns the n rows with the smallest values in a given column. You have to pass the number of rows to return. For example, if you pass 3, it returns the 3 rows with the smallest values.

nlargest works the opposite way: it returns the n rows with the largest values. We will see both in action below.

#smallest

data.nsmallest(3,'sales')

Pretty awesome. We got the 3 rows with the smallest values in the sales column of our data.

#largest

data.nlargest(3,'sales')

Well, as expected, we got the 3 rows with the largest sales. You can pass whatever number you want.
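One way to think about nlargest is as a shortcut for sorting in descending order and keeping the first n rows. The sketch below checks that on a made-up frame with no tied values (the toy data is hypothetical):

```python
import pandas as pd

# Hypothetical mini data set; the values are made up for illustration.
toy = pd.DataFrame({
    "product": ["Latte", "Mocha", "Espresso", "Green Tea"],
    "sales": [310, 17, 912, 45],
})

# Two rows with the largest sales, largest first.
top_two = toy.nlargest(2, "sales")

# The same result written as an explicit sort followed by head.
via_sort = toy.sort_values("sales", ascending=False).head(2)

# The method also exists on a single column (a Series).
two_smallest_values = toy["sales"].nsmallest(2)

print(top_two)
print(top_two.equals(via_sort))
print(two_smallest_values)
```

When several rows share the same value, the keep argument of nsmallest and nlargest decides which of the tied rows are returned.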


4. Corr

Correlation is one of the most useful measures for understanding the relationships among features in our data. It describes the degree to which two variables move together.

In simple words, correlation tells us whether two variables are linearly related and, if so, how strongly. Keep in mind that correlation alone does not show that one variable causes the other.

#correlation

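# keep only the numeric columns; recent pandas versions raise an error on text columns
data.select_dtypes('number').corr()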

That’s it. We got the correlation results. Here we can see that sales & cost and sales & net_profit are highly positively correlated.

The correlation coefficient ranges from -1 to +1. Here, +1 means a perfect positive correlation and -1 means a perfect negative correlation.
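To make the two ends of that scale concrete, here is a small sketch on made-up numbers: one column rises in a straight line with cost and another falls in a straight line. The toy frame and its values are hypothetical, chosen only so the coefficients land on the extremes.

```python
import pandas as pd

# Hypothetical data, constructed so the relationships are exactly linear.
toy = pd.DataFrame({
    "market": ["Retail", "Wholesale", "Retail", "Wholesale"],
    "cost": [10, 20, 30, 40],
    "sales": [25, 45, 65, 85],      # rises linearly with cost
    "inventory": [80, 60, 40, 20],  # falls linearly as cost rises
})

# Correlation is only defined for numbers, so drop the text column first.
corr_matrix = toy.select_dtypes("number").corr()

# A single pair of columns can also be compared directly.
cost_vs_sales = toy["cost"].corr(toy["sales"])

print(corr_matrix)
print(cost_vs_sales)
```

Real data will rarely hit the extremes; values close to 0 mean there is little linear relationship between the two columns.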


5. Sample, Unique and Value_counts

Sample

You can use the sample function to draw random rows from the data. Let’s see how it works.

#sample

data.sample(5)

Well, the sample function returned random rows from the data. This helps with data inspection.
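Because the rows are random, every run gives a different result. If you need the same sample each time, for example to share a notebook, you can fix the seed with random_state. A minimal sketch on a made-up frame (the seed values 42 and 0 are arbitrary choices):

```python
import pandas as pd

# Hypothetical frame with 100 rows, just to have something to sample from.
toy = pd.DataFrame({"sales": range(100)})

# The same random_state gives the same rows on every call.
first_draw = toy.sample(5, random_state=42)
second_draw = toy.sample(5, random_state=42)

# frac samples a fraction of the rows instead of a fixed number.
ten_percent = toy.sample(frac=0.1, random_state=0)

print(first_draw.equals(second_draw))
print(len(ten_percent))
```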

Unique

Few functions in the statistics category work with categorical data. But we do have the unique function, which returns the unique values of a specific variable.

#unique

data['market'].unique()
array(['Wholesale', 'Retail'], dtype=object)

Yeah, we have 2 markets in which products were sold: Wholesale and Retail. This function is something serious 😛
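If you only need to know how many distinct values there are, nunique gives the number directly. The sketch below uses a made-up column with one missing entry to show a small difference between the two (the data is hypothetical):

```python
import pandas as pd

# Hypothetical column with a missing value, made up for illustration.
markets = pd.Series(["Retail", "Wholesale", "Retail", None])

distinct_values = markets.unique()                    # includes the missing value
distinct_count = markets.nunique()                    # missing values are ignored by default
count_with_missing = markets.nunique(dropna=False)    # count the missing value as well

print(distinct_values)
print(distinct_count, count_with_missing)
```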

Value_counts

We know how to see the unique values in the data. The value_counts function goes a step further and returns how many times each of those values occurs.

Let’s check it out!

#value count

data['market'].value_counts()
Retail       2544
Wholesale    1704
Name: market, dtype: int64

That’s cool. We can see the count of each value. These functions are especially useful for working with categorical data.
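Sometimes shares are easier to read than raw counts. Passing normalize=True to value_counts returns proportions instead. A small sketch on a made-up column (the 3-to-2 split is only for illustration):

```python
import pandas as pd

# Hypothetical column: 3 Retail rows and 2 Wholesale rows.
markets = pd.Series(["Retail"] * 3 + ["Wholesale"] * 2)

counts = markets.value_counts()                # raw counts per value
shares = markets.value_counts(normalize=True)  # fraction of rows per value

print(counts)
print(shares)
```

The shares always add up to 1, so multiplying by 100 gives percentages.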

I would like to plot this, because I don’t like stories without visualizations 😛 So, here is one more function to add to your statistical functions list.

#plot

data['market'].value_counts().plot(kind = 'bar')

Now, it looks better than ever.


Wrapping Up – Statistical Functions in Python

The statistical functions that pandas offers help us understand the statistical nature of the data. These numbers suggest what to do next. I hope the functions I showed here will be useful in your assignments.

That’s all for now. Happy Python!!!
