Let me give you a tool so strong that it will change the way you start analyzing your datasets: pandas profiling. You no longer need to describe a dataset piece by piece with the mean(), max(), and min() functions.
What is Pandas profiling?
A report generated by the pandas_profiling library is made up of the following parts:
- An overview of the DataFrame,
- Details for each attribute (column) of the DataFrame,
- Associations between attributes (Pearson and Spearman correlations), and
- A sample of the DataFrame.
Basic Syntax of pandas_profiling library
import pandas as pd
import pandas_profiling

df = pd.read_csv("path/to/file.csv")  # placeholder: replace with your file location
pandas_profiling.ProfileReport(df, **kwargs)  # kwargs: optional report settings
Working With Pandas Profiling
To begin working with the pandas_profiling module, let’s get a dataset:
The data used was derived from GIS and satellite information, as well as from information gathered from the natural inventories that were prepared for the environmental impact assessment (EIA) reports for two planned road projects (Road A and Road B) in Poland.
These reports were mostly used to gather information on the size of the amphibian population in each of the 189 occurrence sites.
Using the Pandas Profiling module
Let’s use pandas to read the CSV file we just downloaded. The file is semicolon-delimited, so we pass the delimiter explicitly:
data = pd.read_csv("dataset.csv", delimiter=";")
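Here is a small sketch of why the delimiter argument matters. The column names and values are made up for illustration and are not taken from the amphibians dataset:

```python
import io

import pandas as pd

# Made-up semicolon-delimited content standing in for a real file.
raw = "site;road;area\n1;A;600\n2;B;700\n"

# Without a delimiter, pandas splits on commas and finds a single column.
with_default = pd.read_csv(io.StringIO(raw))

# With delimiter=";", the three columns are parsed as intended.
with_semicolon = pd.read_csv(io.StringIO(raw), delimiter=";")

print(with_default.shape)
print(with_semicolon.columns.tolist())
```

If a freshly loaded DataFrame has only one column with a long name full of semicolons, the delimiter is the first thing to check.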
Next, we import the ProfileReport class from the package and pass it our DataFrame:
from pandas_profiling import ProfileReport
ProfileReport(data)
ProfileReport generates a profile report from a pandas DataFrame. The pandas df.describe() function is great but a little basic for serious exploratory data analysis.
The pandas_profiling module extends the pandas DataFrame with df.profile_report() for quick data analysis.
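To keep a report around, you can write it to an HTML file. The following is a minimal sketch, assuming the package is installed. The helper name build_report, the title, and the output path are all illustrative choices of mine. Newer releases of the package are published under the name ydata_profiling, so the sketch tries that import first.

```python
def build_report(df, output_path="report.html"):
    """Write a profiling report for df to an HTML file (hypothetical helper)."""
    # Imported inside the function so this module still loads where the
    # package is missing. Newer releases ship as ydata_profiling.
    try:
        from ydata_profiling import ProfileReport
    except ImportError:
        from pandas_profiling import ProfileReport

    # minimal=True switches off the most expensive computations,
    # which helps on larger datasets.
    profile = ProfileReport(df, title="Dataset profile", minimal=True)
    profile.to_file(output_path)
    return output_path
```

Calling build_report(data) on the DataFrame loaded earlier would write report.html to the working directory, where it can be opened in any browser.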
For each column the following statistics – if relevant for the column type – are presented in an interactive HTML report:
- Type inference: detect the types of columns in a data frame.
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations: highlighting of highly correlated variables, plus Spearman, Pearson, and Kendall matrices
- Missing values: matrix, count, heatmap, and dendrogram of missing values
- Text analysis: learn about categories (Uppercase, Space), scripts (Latin, Cyrillic), and blocks (ASCII) of text data
- File and Image analysis: extract file sizes, creation dates, and dimensions, and scan for truncated images or those containing EXIF information
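To get a feel for what the report automates, here is a rough sketch that computes a handful of the statistics listed above by hand with plain pandas. The values are made up, and the report's own implementation may differ in details:

```python
import pandas as pd

# Made-up values (including one missing entry), not from the amphibians dataset.
s = pd.Series([2.0, 4.0, 4.0, 5.0, 10.0, None], name="example")

summary = {
    "missing": int(s.isna().sum()),
    "unique": int(s.nunique()),           # NaN is not counted as a value
    "q1": s.quantile(0.25),
    "median": s.median(),
    "q3": s.quantile(0.75),
    "iqr": s.quantile(0.75) - s.quantile(0.25),
    "range": s.max() - s.min(),
    "mode": s.mode().iloc[0],
    "cv": s.std() / s.mean(),             # coefficient of variation
    "skewness": s.skew(),
    "kurtosis": s.kurt(),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Doing this for every column of a real dataset gets tedious quickly, which is the gap the profile report fills.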
1. Describe a Dataset
This section is comparable to the output of data.describe():
It also gives us the types of variables and detailed information about them, including descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution (excluding NaN values).
It analyzes both numeric and object series, as well as DataFrame column sets of mixed data types, so the output varies depending on what is provided.
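For comparison, here is a short sketch of how describe() behaves on mixed data types. The frame and its column names are made up for illustration:

```python
import pandas as pd

# Small made-up frame with one numeric and one text column.
df = pd.DataFrame({
    "area": [600, 700, 200, 300],
    "road": ["A", "A", "B", "A"],
})

# By default, describe() summarizes only the numeric columns.
numeric_summary = df.describe()

# include="all" adds count/unique/top/freq rows for the text column.
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column type (for example, the mean of a text column) show up as NaN.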
2. Correlation matrix
We also have the correlation matrix:
It is similar to using the np.corrcoef(X, Y) or data.corr() functions. The pandas DataFrame.corr() method computes the pairwise correlation of all columns in the DataFrame. Any NA values are automatically excluded. Non-numeric columns are ignored in older pandas releases; since pandas 2.0 they raise an error unless you pass numeric_only=True.
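As a rough sketch of the same idea with plain pandas and NumPy, the made-up example below computes Pearson and Spearman matrices. It selects the numeric columns first, which keeps the call working regardless of how a given pandas version treats non-numeric columns:

```python
import numpy as np
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],
    "label": list("abcde"),  # non-numeric column
})

# Keep only numeric columns before computing correlations.
numeric = df.select_dtypes(include="number")

pearson = numeric.corr(method="pearson")    # linear association
spearman = numeric.corr(method="spearman")  # rank-based association

# np.corrcoef returns the same Pearson coefficient for a pair of columns.
r_numpy = np.corrcoef(df["x"], df["y"])[0, 1]

print(pearson)
print(spearman)
```

Because y increases whenever x does, the Spearman coefficient is 1, while the Pearson coefficient is slightly lower since the relationship is not a straight line.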
3. View of the dataset
And finally we have a part of the dataset itself:
As you can see, it saves us a lot of time and effort.