Today, it’s back to basics. We’ll be going over all Exploratory Data Analysis (EDA) metrics and which one you should choose – both the when and why.
In this article, I am presuming you haven’t forgotten Data Analysis with Python. So, let’s get started.
What is Exploratory Data Analysis (EDA)?
The short answer: Exploratory Data Analysis, or EDA for short, is the process of examining data up front to recognize patterns, spot unexpected variations, and test your assumptions with statistical analysis.
Longer answer: In any data science project, exploratory data analysis (EDA) is a significant step.
By computing descriptive statistics of our dataset, we try to get a snapshot of our data.
Exploratory data analysis is a complement to inferential statistics, which tends to be fairly rigid with rules and formulas.
At a more advanced level, EDA involves looking at the data set from various angles, describing it, and then summarizing it.
Exploratory Data Analysis (EDA) is a term for the initial analysis and observation of a data set, typically carried out early in an analytical process.
Some experts describe it as “taking a peek” at the data to learn more about what it represents and how to work with it.
Often, exploratory data analysis is a precursor to other kinds of statistical and data work.
These steps are absolutely necessary for understanding your dataset.
Exploratory Data Analysis Metrics
We will explore a data set and perform exploratory data analysis on it.
The major topics to be covered are below:
- Central tendency
- Handling missing values
- Removing duplicates
- Outlier Treatment
- Normalizing and Scaling (Numerical Variables)
- Encoding Categorical Variables (Dummy Variables)
- Bivariate Analysis
1. Measures of Central tendency
A central tendency measure is a summary statistic that reflects a dataset’s center point or typical value.
Such measures show where most values in a distribution fall and are often referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value.
A measure of central tendency is a single value that attempts to describe a data set by identifying the central position within it.
As such, measures of central tendency are also called measures of central location.
In other words, it’s a metric that tells us where the center of a data set is. The mean, often called the average, is the most well-known measure of central tendency.
However, there are other measures of central tendency, such as the median and the mode.
The mean, median, and mode are all valid measures of central tendency.
- mean = sum(values) / number of values
- median = middle value once all values are sorted
- mode = value with highest frequency of occurrence
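As a quick illustration of the three definitions above, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for this example; the single large value is there to show how the mean and the median can disagree.

```python
import statistics

# Hypothetical sample: 3 is repeated and 40 is deliberately extreme.
values = [2, 3, 3, 5, 7, 40]

mean = statistics.mean(values)      # sum(values) / number of values
median = statistics.median(values)  # middle of the sorted values (average of the two middle ones here)
mode = statistics.mode(values)      # value with the highest frequency

print("mean:", mean)
print("median:", median)
print("mode:", mode)
```

On this sample the mean is pulled upward by the extreme value while the median stays close to the bulk of the data, which is one reason to look at more than one measure.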
2. Handling of missing data
There are different ways of handling the missing values in a dataset.
Which approach to use depends on the type of data you are dealing with:
- Drop the missing values: In this case, we drop the missing values of the affected variables. This is a reasonable choice when only very few values are missing.
- Impute with mean value: For a numerical column, you can replace the missing values with the mean. Before doing so, it is advisable to check that the variable does not have extreme values, i.e. outliers.
- Impute with median value: For a numerical column, you can also replace the missing values with the median. It is advisable to use the median if the column has extreme values, such as outliers.
- Impute with mode value: For a categorical column, you can replace the missing values with the mode, i.e. the most frequent value.
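The four options above could look roughly like this in pandas. This is only a sketch: the DataFrame, its column names (`age`, `region`), and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "age" and "region" each have one missing value,
# and "age" contains one extreme value (90).
df = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0, 90.0],
    "region": ["north", "south", "north", None, "north"],
})

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Options 2 and 3: impute the numerical column with its mean or its median.
age_mean_filled = df["age"].fillna(df["age"].mean())
age_median_filled = df["age"].fillna(df["age"].median())

# Option 4: impute the categorical column with its most frequent value.
region_mode_filled = df["region"].fillna(df["region"].mode()[0])

print(age_mean_filled.tolist())
print(age_median_filled.tolist())
print(region_mode_filled.tolist())
```

Because of the extreme age, the mean-based fill and the median-based fill produce noticeably different values here, which is the reason for checking for outliers before choosing between them.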
3. Removing duplicate values
For a pandas dataframe, we can use the
function to remove duplicate rows.
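The function name is not given in the text above. pandas does provide `DataFrame.drop_duplicates`, and the sketch below assumes that this is the one meant. The data and column names are made up.

```python
import pandas as pd

# Hypothetical data: the row at index 2 is an exact copy of the row at index 0.
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Ana", "Cy"],
    "age": [31, 45, 31, 28],
})

# Count the rows that repeat an earlier row.
n_duplicates = df.duplicated().sum()

# Keep the first occurrence of each row and renumber the index.
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

By default whole rows are compared; `drop_duplicates` also accepts a `subset` argument if only some columns should decide what counts as a duplicate.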
4. Outlier Treatment
Outliers are the most extreme observations. Depending on whether they are extremely high or low, they may include the sample maximum, the sample minimum, or both.
Outliers can skew statistical analyses and violate their assumptions.
Unfortunately, every analyst will run into outliers and be forced to decide what to do with them.
You may think that it’s best to delete them from your records, given the problems they can cause.
However, excluding outliers is only legitimate for specific reasons.
Outliers can be very insightful about the subject-area and the method of data collection.
It is important to understand how outliers occur and whether they will occur again as a regular part of the field of analysis.
Outliers increase the variability in your data, which reduces statistical power. As a consequence, removing outliers can cause the findings to become statistically significant.
- If the outlier in question is a calculation error or data entry error, correct the error if possible. If you can’t correct it, delete that observation because you know it’s wrong.
- If the outlier in question is not a member of the population you are researching (i.e., it has unusual properties or conditions), you can legitimately exclude it.
- If the outlier in question is a natural part of the population you are researching, you should not exclude it.
We generally identify outliers with the help of a box plot. The box plot here shows some data points lying outside the whiskers:
The small round circles are the outliers.
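A box plot usually draws a point as an individual marker when it lies more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that same rule numerically, assuming the 1.5 multiplier and using made-up income values.

```python
import pandas as pd

# Hypothetical incomes (in thousands); 250 is deliberately extreme.
income = pd.Series([32, 35, 38, 40, 41, 43, 45, 47, 250])

q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1

# Fences of the usual box plot convention: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are the points a box plot would draw as circles.
is_outlier = (income < lower) | (income > upper)
outliers = income[is_outlier]

print(lower, upper)
print(outliers.tolist())
```

The multiplier 1.5 is a convention rather than a law, so a flagged value is a candidate for inspection, not automatically an error.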
The impact of these outliers needs to be addressed, and there are several ways to manage them:
- Drop the outlier value
- Replace the outlier value with the median
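Both options above can be sketched in pandas as follows. The `outlier_mask` helper is hypothetical (it is not part of pandas) and flags values with the 1.5 * IQR rule, which is an assumption of this example; the income values are made up.

```python
import pandas as pd

# Hypothetical incomes (in thousands); 250.0 is deliberately extreme.
income = pd.Series([32.0, 35.0, 38.0, 40.0, 41.0, 43.0, 45.0, 47.0, 250.0])

def outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: flag values more than k * IQR beyond the quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

flagged = outlier_mask(income)

# Option 1: drop the outlying values.
dropped = income[~flagged]

# Option 2: replace each outlier with the median of the column.
replaced = income.mask(flagged, income.median())

print(dropped.tolist())
print(replaced.tolist())
```

Dropping shortens the column, while replacing keeps its length, which matters when the other columns of the same rows still carry useful information.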
5. Normalization and Scaling
Sometimes the variables in a data set are on different scales, e.g. one variable is in the millions and another only in the hundreds.
For example, in our data set income has values in the thousands and age has just two digits.
Since these variables are on different scales, it is difficult to compare them.
Feature scaling (also known as data normalization) is the approach used to standardize the range of the features in the data.
Since the range of data values can vary widely, it becomes a required step in data preprocessing when using machine learning algorithms.
In this approach, we transform variables with different measurement scales onto a single common scale.
StandardScaler standardizes the data using the formula (x - mean) / standard deviation. We can only do this for numeric variables.
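To make the formula concrete, here is a minimal sketch that applies it by hand with Python's standard `statistics` module. It assumes the StandardScaler mentioned above is scikit-learn's, which divides by the population standard deviation (dividing by n); the `standardize` helper and the income and age values are hypothetical.

```python
import statistics

# Hypothetical columns on very different scales.
income = [30000.0, 45000.0, 60000.0, 75000.0, 90000.0]
age = [22.0, 31.0, 40.0, 49.0, 58.0]

def standardize(values):
    """Apply (x - mean) / standard deviation to every value."""
    mean = statistics.fmean(values)
    # Population standard deviation (divide by n), matching scikit-learn's StandardScaler.
    std = statistics.pstdev(values)
    return [(x - mean) / std for x in values]

income_z = standardize(income)
age_z = standardize(age)

print([round(z, 3) for z in income_z])
print([round(z, 3) for z in age_z])
```

After the transformation each column has mean 0 and standard deviation 1, so income and age can be compared directly even though their raw values differ by three orders of magnitude.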
Lastly, we’ll check out some bivariate analysis.
6. Bivariate analysis
Here are my personal notes on which plots to use, depending on the types of the two variables:
If both are Numerical:
- Line plot
- Heatmap for correlation
- Joint plot
If one is Categorical and the other is Numerical:
- Bar chart
- Violin plot
- Categorical box plot
- Swarm plot
If both of them are Categorical Variables:
- Bar chart
- Grouped bar chart
- Point plot
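Each of the three cases above is backed by a small summary table that the plot then displays. The sketch below computes those tables with pandas and leaves the plotting itself to whichever library you prefer. The DataFrame and its column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data with two numerical and two categorical columns.
df = pd.DataFrame({
    "age": [22, 31, 40, 49, 58, 35],
    "income": [30, 45, 60, 75, 90, 50],
    "group": ["A", "B", "A", "B", "A", "B"],
    "owns_car": ["no", "no", "yes", "yes", "yes", "no"],
})

# Numerical vs. numerical: pairwise Pearson correlations (the table a heatmap shows).
corr = df[["age", "income"]].corr()

# Categorical vs. numerical: mean income per group (the bar heights of a bar chart).
mean_income = df.groupby("group")["income"].mean()

# Categorical vs. categorical: counts per combination (the data behind a grouped bar chart).
counts = pd.crosstab(df["group"], df["owns_car"])

print(corr)
print(mean_income)
print(counts)
```

Looking at the table first is a cheap sanity check: if the numbers look wrong, the plot built on them will be wrong too.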