Data Describe Library In Python For Data Exploration


Data exploration, or exploratory data analysis (EDA), is an integral part of any analysis project. It does more than explore the data: it describes it, so that you understand the data and the features in it.

Exploring the data early on will help you in the model-building stages, and people usually spend a large share of their time on EDA. That said, we have already discussed many libraries that help you with EDA.

Today, it’s time for the data_describe library, available in Python.

So, without wasting much time on the introduction, let’s see how we can install this library and work with it.

Read more:

  1. QuickDA in Python: Explore Your Data In Seconds.
  2. Klib in Python – Speed Up Your Data Visualization.

1. Installing the data_describe library in Python

To install the data_describe library in Python, run the pip command below. The leading ! is for notebook cells (such as Jupyter); drop it if you run the command in a terminal.

#installation 

!pip install data_describe

The last line of the pip output should confirm a successful installation. After this, you have to import the library into Python to work with it.

#import

import data_describe as d_d

Perfect! You have successfully installed and imported the required library. Now, let’s see what it offers to us.


2. Load the Data

We need some data to explore, so we’ll work with the coffee sales data. I chose it because it is fairly large and it’s a real-world dataset.

You can download the dataset here.

#load the data

import pandas as pd
data = pd.read_csv('coffeesales.csv')
data.head(5)

Whoo! Our data is ready to explore.
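
Before exploring, a quick sanity check on the loaded frame can save trouble later. Here is a minimal sketch with plain pandas. The tiny inline CSV and its column names are made up for illustration and only stand in for coffeesales.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for coffeesales.csv; the columns are illustrative only.
csv_text = """market,product,sales,profit
East,Coffee,219,94
West,Tea,190,68
East,Espresso,234,101
South,Coffee,,30
"""

data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)         # (rows, columns)
print(data.dtypes)        # type pandas inferred for each column
print(data.isna().sum())  # number of missing values per column
```

With the real file, you would pass the path to pd.read_csv instead of the in-memory text.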


3. Summary (Statistical) of the Data

It is important to understand the statistical summary of the data. It reveals the min, max, and median values, along with the counts of unique and null values.

#summary

d_d.data_summary(data)

The above line of code returns a small block of info followed by a brief summary of the data. Note that the statistics are computed only for numerical attributes, which is why you see blank values for the categorical attributes.
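
To get a feel for what such a summary contains, here is a rough approximation in plain pandas. It is only a sketch, not the library's implementation, and the toy columns are made up.

```python
import pandas as pd

# Toy data with made-up columns; it stands in for the coffee sales frame.
data = pd.DataFrame({
    "market": ["East", "West", "East", "South"],
    "sales": [219.0, 190.0, 234.0, None],
    "profit": [94, 68, 101, 30],
})

# Per-column type, null count, and number of distinct values.
summary = pd.DataFrame({
    "dtype": data.dtypes.astype(str),
    "nulls": data.isna().sum(),
    "unique": data.nunique(),
})

# describe() covers numeric columns only, so after the join
# the categorical column gets NaN for min, median, and max.
numeric_stats = data.describe().T[["min", "50%", "max"]]
summary = summary.join(numeric_stats)
print(summary)
```

The NaN entries for the market column mirror the blank cells you see for categorical attributes.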


4. Heatmap

Yes, you can plot a heatmap for the whole dataset using the heatmap function offered by the data_describe library. Let’s see how it works.

#heatmap

d_d.data_heatmap(data)

Here is our beautiful heatmap. The best thing about this library is that it offers many functions that help us explore the data, and that too with one line of code :P.
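
A heatmap over a whole table is only readable when the columns share a scale. One common way to get there, sketched below with pandas, is to z-score each numeric column before plotting. Whether data_describe does exactly this internally is an assumption on my part, and the columns are made up.

```python
import pandas as pd

# Toy data with made-up columns.
data = pd.DataFrame({
    "sales": [219.0, 190.0, 234.0, 100.0],
    "profit": [94.0, 68.0, 101.0, 30.0],
    "market": ["East", "West", "East", "South"],
})

# Keep numeric columns only; text columns cannot be standardized.
numeric = data.select_dtypes(include="number")

# z-score: subtract the column mean, divide by the column standard deviation.
standardized = (numeric - numeric.mean()) / numeric.std(ddof=0)
print(standardized.round(2))
```

After this step every column has mean 0 and standard deviation 1, so one color scale works for all of them.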


5. Correlation Matrix

The correlation matrix displays the correlation between the attributes in the data. Its rows and columns both represent the attributes, and each cell holds the correlation between the corresponding pair.

#correlation

d_d.correlation_matrix(data)

As usual, all this happens with one line of code 🙂
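
If you want the numbers behind such a plot, pandas can compute a correlation matrix directly. This is a minimal sketch on made-up columns, using the default Pearson correlation.

```python
import pandas as pd

# Toy data with made-up columns.
data = pd.DataFrame({
    "sales": [219.0, 190.0, 234.0, 100.0],
    "profit": [94.0, 68.0, 101.0, 30.0],
    "cogs": [80.0, 95.0, 70.0, 60.0],
    "market": ["East", "West", "East", "South"],
})

# Correlation is defined for numeric columns only, so drop the rest first.
corr = data.select_dtypes(include="number").corr()
print(corr.round(2))
```

The diagonal is always 1 (each attribute against itself) and the matrix is symmetric, so only one triangle carries new information.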


6. Scatter Plots

Scatter plots use Cartesian coordinates to display the data values on the plot. They are used to explore the relationship between two numerical variables. Let’s see how we can plot a scatter graph using the scatter_plots function of the data_describe library.

#scatter plots

d_d.scatter_plots(data, plot_mode='matrix')

You can also call this plot a scatter matrix. Here I have passed the plot_mode argument as 'matrix'. You can try passing different arguments to the scatter_plots function.


7. Clustering

Data points that show similar features can be grouped into a cluster, and a dataset can contain several such clusters.

Cluster plots will help us to visualize these clusters in the data.

#cluster plots

d_d.cluster(data)

That’s cool! We can see 3 different clusters in this data according to their behavior. You can spot the clusters in the scatter plots as well, but cluster plots serve the purpose better.
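
To make the idea of grouping similar points concrete, here is a small k-means sketch in NumPy. K-means is just one common clustering method; which algorithm d_d.cluster applies by default is not stated above, so treat this as an illustration only. The kmeans helper, its farthest-first initialization, and the data are all made up for this example.

```python
import numpy as np


def kmeans(points, k, n_iter=10):
    """Plain k-means on an (n, d) array; returns (labels, centers)."""
    # Farthest-first initialization: deterministic, chosen for reproducibility.
    centers = [points[0]]
    for _ in range(k - 1):
        dist_to_nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0
        )
        centers.append(points[dist_to_nearest.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has none).
        centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers


# Three small, well-separated groups of made-up 2-D points.
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
offsets = np.array([[0.0, 0.0], [0.3, 0.0], [-0.3, 0.0], [0.0, 0.3], [0.0, -0.3]])
points = np.vstack([c + offsets for c in true_centers])

labels, centers = kmeans(points, k=3)
print(np.bincount(labels))  # number of points assigned to each cluster
print(centers.round(2))
```

On real data you would normally scale the numeric columns first, since k-means relies on distances.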


8. Feature Importance Plot

We already know that not all the features in our data will contribute to our purpose. So, it is very important to find the most relevant features for our analysis or modeling.

This is where feature importance plots come in: they display the most important features in our dataset.

#feature importance

d_d.importance(data, 'sales')

Basically, it estimates the importance of each feature with respect to the ‘sales’ attribute in the data. For this, the data_describe library offers the importance function, as shown above.
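
For a rough intuition of ranking features against a target, the sketch below orders numeric features by their absolute correlation with 'sales'. This is a much simpler proxy than a model-based importance estimate and is not meant to reproduce what the library computes. The columns and values are made up.

```python
import pandas as pd

# Toy data with made-up columns; 'sales' plays the role of the target.
data = pd.DataFrame({
    "sales": [219.0, 190.0, 234.0, 100.0, 150.0],
    "profit": [94.0, 68.0, 101.0, 30.0, 55.0],
    "cogs": [80.0, 95.0, 70.0, 60.0, 90.0],
    "inventory": [500.0, 520.0, 480.0, 510.0, 495.0],
})

target = "sales"
features = data.drop(columns=[target])

# Absolute Pearson correlation of each feature with the target, strongest first.
ranking = features.corrwith(data[target]).abs().sort_values(ascending=False)
print(ranking.round(2))
```

Correlation only captures linear, one-feature-at-a-time relationships, so a model-based ranking can order the features differently.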


Wrapping Up – Data Describe

data_describe is a quick and easy library for exploring data. I personally enjoyed using it. It offers many useful functions, most of which need just one line of code, so it can save a lot of time. I hope you find this library useful, and don’t forget to give it a try in your upcoming analysis work.

That’s all for now. Happy Python!!!

Read more: Official documentation of the library
