Pandas is one of the most useful Python libraries for data manipulation and analysis. Although it is a single software package, it offers tons of functions that assist us in various operations. Among them are statistical functions, which derive the statistical measures of the data. In this story, let’s see some of the top statistical functions offered by pandas.
Loading the Data For Statistical Functions
To see how all these statistical functions work, we need data. For this, we are going with coffee sales data, which is quite large and has multiple features.
# data
import pandas as pd
data = pd.read_csv('coffeesales.csv')
data.head(5)
Well, our data is now ready to be explored statistically. Before moving forward, let’s explore some basic features of our data.
We have 4K+ rows and 9 features in our data.
Index(['order_date', 'market', 'region', 'product_category', 'product', 'cost', 'inventory', 'net_profit', 'sales'], dtype='object')
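The row count and the column names above most likely come from the shape and columns attributes. Here is a minimal sketch of that check. The tiny DataFrame is a made-up stand-in for the coffee sales file, so its numbers will not match the real data.

```python
import pandas as pd

# Hypothetical stand-in for the coffee sales data (values are made up)
data = pd.DataFrame({
    "market": ["Wholesale", "Retail", "Retail"],
    "product": ["Latte", "Espresso", "Mocha"],
    "cost": [50, 30, 45],
    "sales": [120, 80, 110],
})

# shape is a (rows, columns) tuple
n_rows, n_features = data.shape

# columns holds the feature names
column_names = list(data.columns)

print(n_rows, n_features)
print(column_names)
```

On the real file, the same two attributes would report the 4K+ rows and the 9 features listed above.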
I think this should be enough. Now, let’s explore our data using some of the top statistical functions offered by pandas.
1. Describe
The describe function in pandas is one of the most useful ones. It reveals statistical measures such as the count, mean, standard deviation, minimum, maximum, and the percentiles.
Using the one-liner data.describe(), we can quickly get enough information to understand our data. In its output, we can easily find key information such as the maximum sales, the minimum cost, and more.
The describe function is the best fit for summary statistics. It works very well with a pandas dataframe and returns the results in a flash.
By default, it summarizes only the numerical columns, so it won’t consider the categorical columns present in our data. Passing include='all' makes it report on the categorical columns as well.
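To make the describe behaviour concrete, here is a small sketch. The DataFrame is a made-up stand-in for the coffee sales data, assumed only for illustration. It shows the default numeric summary and the include='all' variant that also covers text columns.

```python
import pandas as pd

# Hypothetical stand-in for the coffee sales data (values are made up)
data = pd.DataFrame({
    "market": ["Retail", "Retail", "Wholesale", "Retail"],
    "cost": [50, 30, 45, 60],
    "sales": [120, 80, 110, 150],
})

# Default: only numeric columns are summarized
numeric_summary = data.describe()

# include="all" adds count, unique, top and freq for the text columns
full_summary = data.describe(include="all")

print(numeric_summary)
print(full_summary)
```

The numeric summary has rows such as count, mean, std, min, the percentiles and max, while the full summary adds unique, top and freq for the market column.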
2. Min, Max and idxmin, idxmax
I am sure you are well aware of the min and max functions in Python. But idxmin and idxmax are also among the coolest functions I have ever seen.
min and max – These functions return the minimum and maximum values in a particular column.
idxmin and idxmax – These functions return the index labels of those min and max values. Isn’t it cool 😛
Here, you can see that the min and max values are 17 and 912 respectively. The value 17 is at index 154 and the value 912 is at index 1154. That’s something awesome 😛
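Here is a minimal sketch of these four calls. Note that pandas spells the index versions idxmin and idxmax. The data is a made-up stand-in that reuses the values 17 and 912, so the index positions differ from the ones in the real dataset.

```python
import pandas as pd

# Hypothetical stand-in: only the values 17 and 912 echo the story
data = pd.DataFrame({"sales": [120, 17, 912, 300]})

lowest = data["sales"].min()          # smallest value in the column
highest = data["sales"].max()         # largest value in the column
lowest_idx = data["sales"].idxmin()   # index label of the smallest value
highest_idx = data["sales"].idxmax()  # index label of the largest value

# The index label can be used to pull the full row
best_row = data.loc[highest_idx]

print(lowest, highest, lowest_idx, highest_idx)
```

Feeding the label back into loc is handy when you want the whole record behind the extreme value, not just the number.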
3. nsmallest and nlargest
The nsmallest function returns the n smallest numbers. You have to pass the number of values to be returned. For example, if you pass 3, it will return the 3 smallest numbers in the data.
nlargest works just the opposite way. It returns the n largest numbers present in the data. Let’s see them in action.
Pretty awesome. We got the 3 smallest numbers from the sales column in our data.
Well, as expected, we got the 3 largest numbers. You can pass whatever number you want.
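The two calls can be sketched as follows. The DataFrame and its product names are made up for illustration. The last line shows the DataFrame flavour, which keeps the whole row instead of just the number.

```python
import pandas as pd

# Hypothetical stand-in for the coffee sales data (values are made up)
data = pd.DataFrame({
    "product": ["Latte", "Espresso", "Mocha", "Green Tea", "Decaf"],
    "sales": [120, 17, 912, 300, 45],
})

# Three smallest and three largest values of one column
smallest_3 = data["sales"].nsmallest(3)
largest_3 = data["sales"].nlargest(3)

# On a DataFrame, pass the column to rank by and get full rows back
top_rows = data.nlargest(2, "sales")

print(smallest_3)
print(largest_3)
print(top_rows)
```

Both functions return the values already sorted, ascending for nsmallest and descending for nlargest.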
4. Correlation
The corr function is one of the most useful functions for understanding the relationships among features in our data. It describes the degree to which two variables move with respect to one another.
In simple words, the correlation tells us whether two variables tend to move together and how strongly. Keep in mind that a strong correlation does not by itself mean that one variable causes the other.
That’s it. We got the correlation results. Here we can see that sales & cost and sales & net_profit are highly positively correlated.
The correlation scale runs from -1 to +1. Here, +1 means a perfect positive correlation and -1 means a perfect negative correlation.
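To see both ends of that scale, here is a small sketch with made-up numbers. In this stand-in data, sales rises in lockstep with cost and inventory falls in lockstep with it, so corr (Pearson by default) reports +1 and -1. Selecting the numeric columns first is my own choice so the text column does not get in the way.

```python
import pandas as pd

# Hypothetical stand-in: sales = 2 * cost + 5, inventory = 60 - cost
data = pd.DataFrame({
    "market": ["Retail", "Retail", "Wholesale", "Retail", "Wholesale"],
    "cost": [10, 20, 30, 40, 50],
    "sales": [25, 45, 65, 85, 105],
    "inventory": [50, 40, 30, 20, 10],
})

# Keep only numeric columns, then compute pairwise correlations
corr_matrix = data.select_dtypes(include="number").corr()

cost_vs_sales = corr_matrix.loc["cost", "sales"]
cost_vs_inventory = corr_matrix.loc["cost", "inventory"]

print(corr_matrix)
```

Real data will rarely hit exactly +1 or -1, but the reading is the same: the closer to either end, the stronger the linear relationship.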
5. Sample, Unique and Value_counts
You can use the sample function to draw random rows from the data. Let’s see how it works.
Well, the sample function produced the random samples from the data. It will help in data inspection.
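A minimal sketch of sample on made-up data follows. The random_state value is an arbitrary choice of mine, there only so the same rows come back on every run.

```python
import pandas as pd

# Hypothetical stand-in: ten made-up sales values
data = pd.DataFrame({"sales": [120, 17, 912, 300, 45, 210, 75, 640, 88, 410]})

# Draw a fixed number of random rows
sampled = data.sample(n=3, random_state=42)

# Or draw a fraction of the rows (20% of 10 rows is 2 rows)
fraction = data.sample(frac=0.2, random_state=42)

print(sampled)
print(fraction)
```

Leave random_state out if you want a fresh draw each time.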
We don’t get many functions in the statistics category that work with categorical data. But we do have the unique function, which returns the unique values in a specific variable.
array(['Wholesale', 'Retail'], dtype=object)
Yeah, we have 2 markets over which products were sold. Wholesale and Retail. This function is something serious 😛
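Here is a short sketch of unique on a made-up market column, together with nunique, which returns just the number of distinct values. The rows are assumed for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the market column (values are made up)
data = pd.DataFrame({
    "market": ["Wholesale", "Retail", "Retail", "Wholesale", "Retail"],
})

# Distinct values, in order of first appearance
markets = data["market"].unique()

# Number of distinct values
n_markets = data["market"].nunique()

print(markets)
print(n_markets)
```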
We know how to see the unique values in the data. The value_counts function goes one step further and returns the count of each of those values.
Let’s check ’em out!
# value count
data['market'].value_counts()
Retail       2544
Wholesale    1704
Name: market, dtype: int64
That’s cool. We can see the count for each of those values. These functions are especially useful for working with categorical data.
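If shares are easier to read than raw counts, value_counts also accepts normalize=True. A small sketch on made-up rows, assumed only for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the market column (values are made up)
data = pd.DataFrame({
    "market": ["Retail", "Wholesale", "Retail", "Wholesale", "Retail"],
})

# Raw counts per category, most frequent first
counts = data["market"].value_counts()

# Same breakdown expressed as fractions of the total
shares = data["market"].value_counts(normalize=True)

print(counts)
print(shares)
```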
I would like to plot this, because a story feels incomplete without a visualization 😛 So, here is one more function, plot, to grow your statistical functions list.
# plot
data['market'].value_counts().plot(kind='bar')
Now, it looks better than ever.
Wrapping Up – Statistical Functions in Python
The statistical functions that pandas offers help us understand the statistical nature of the data. These numbers suggest what to do next. I hope all the functions I showed here will come in handy in your assignments.
That’s all for now. Happy Python!!!
More read: Statistics and Python