Hello readers! Today, we will take a detailed look at an important step in data science and machine learning: Exploratory Data Analysis in R.
So, let us begin!
Exploratory Data Analysis – Steps to know
In data science and machine learning, we build models to make predictions on real-time data, and those models depend on the data having been processed into a clean form.
That is, for machine learning models to work well, it is very important to pre-process the data as well as to understand it.
Exploratory data analysis is the first and foremost step in a data science project. In this process, we analyze the variables of the dataset both individually (univariate analysis) and in pairs (bivariate analysis).
We generally follow the steps below in the exploratory data analysis of a dataset:
- Initially, get the statistical description of the data values.
- Study and analyze the continuous variables of the dataset.
- Analyze the categorical variables.
- Relate the numeric and categorical variables to find an association between the variables through visualization.
- Check the data types of the variables.
- Treat the outliers, if any.
- Impute the missing values.
- Visualize the distribution of the data variables (numeric as well as categorical).
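As a rough illustration of the first few steps in this list, the sketch below uses a small made-up data frame. The `toy` object and all of its values are assumptions for illustration only and are not taken from the bike rental data.

```r
# Toy data frame: the values are invented purely for illustration
toy <- data.frame(
  season = factor(c(1, 1, 2, 2, 3, 3)),
  temp   = c(0.20, 0.25, 0.45, 0.50, 0.70, 0.75),
  cnt    = c(900, 1100, 3000, 3400, 5200, 5600)
)

# Statistical description of every column
summary(toy)

# Data type of each column
col_types <- sapply(toy, class)
print(col_types)

# Relate a numeric variable to a categorical one: mean count per season
mean_by_season <- tapply(toy$cnt, toy$season, mean)
print(mean_by_season)

# Relate two numeric variables: Pearson correlation
temp_cnt_cor <- cor(toy$temp, toy$cnt)
print(temp_cnt_cor)
```

The same calls work on any data frame; only the column names would change.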
Describing the data through summaries and statistical values is known as informative analysis, whereas working on the data variables, such as treating outliers, is known as operative analysis.
1. Statistical description and Analysis of data types
In this article, we have made use of the Bike Rental Count Prediction dataset. You can find the dataset here!
Initially, we have loaded the dataset into the R environment using the read.csv() function.
Loading the dataset:
# Remove all the existing objects
rm(list = ls())

# Set the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

# Load the dataset
bike_data = read.csv("day.csv",header=TRUE)
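One detail worth knowing when reproducing this step: since R 4.0.0, `read.csv()` defaults to `stringsAsFactors = FALSE`, so a text column such as 'dteday' is read as character, while older versions read it as a factor (which is what the output in this article shows). The sketch below demonstrates this with inline CSV text instead of a file; the `csv_text` and `sample_data` names are assumptions for illustration.

```r
# Inline CSV text stands in for a file; the two rows are illustrative only
csv_text <- "instant,dteday,cnt
1,2011-01-01,985
2,2011-01-02,801"

# stringsAsFactors is set explicitly so the result does not depend on the R version
sample_data <- read.csv(text = csv_text, header = TRUE, stringsAsFactors = FALSE)

str(sample_data)

# The date column is plain character until it is converted with as.Date()
date_class <- class(sample_data$dteday)
print(date_class)
```

Either way, the later conversion with `as.Date()` works on both character and factor input.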
Next, we inspect the data type and values of every column using the str() function. The class() function tells us the type of the dataset itself, i.e. a data frame. In addition, we use the summary() function to get the statistics of the dataset in terms of mean, median, etc.
From the above data analysis, we have understood that the data columns – ‘season’, ‘yr’, ‘mnth’, ‘holiday’, ‘weekday’, ‘workingday’, and ‘weathersit’ belong to categorical type values but have been interpreted as integer values.
Thus, we change the data types of the above mentioned columns to suitable types (factor type).
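When several columns need the same conversion, it can also be written in one step with `lapply()`. This is only a sketch on a made-up data frame; `toy` and `categorical_cols` are hypothetical names, and the values are invented.

```r
# Toy data frame with integer-coded categories (values are illustrative)
toy <- data.frame(
  season  = c(1L, 2L, 3L, 4L),
  holiday = c(0L, 0L, 1L, 0L),
  temp    = c(0.34, 0.36, 0.20, 0.23)
)

# Names of the columns that should be treated as categorical
categorical_cols <- c("season", "holiday")

# Convert all listed columns to factors in one step
toy[categorical_cols] <- lapply(toy[categorical_cols], as.factor)

str(toy)
n_season_levels <- nlevels(toy$season)
print(n_season_levels)
```

The column-by-column form used in this article does the same thing; the `lapply()` form just avoids repeating the assignment for every column.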
# 1. Understanding the data values of every column of the dataset
str(bike_data)

# 2. Viewing the type of the dataset
class(bike_data)

# 3. Understanding the data distribution of the dataset
summary(bike_data)

# 4. Dimensions of the dataset
dim(bike_data)

# Converting the integer-coded categorical columns to factors
bike_data$season=as.factor(bike_data$season)
bike_data$yr=as.factor(bike_data$yr)
bike_data$mnth=as.factor(bike_data$mnth)
bike_data$holiday=as.factor(bike_data$holiday)
bike_data$weekday=as.factor(bike_data$weekday)
bike_data$workingday=as.factor(bike_data$workingday)
bike_data$weathersit=as.factor(bike_data$weathersit)
bike_data$dteday = as.Date(bike_data$dteday,format="%Y-%m-%d")
str(bike_data)

# Now we will check the effect of certain variables on the dependent variable (cnt) and decide whether to retain or drop those data columns

# Extracting the day values from the date and storing them in a new column - 'day'
bike_data$day=format(bike_data$dteday,"%d")
Further, we have extracted the ‘day’ value from the ‘dteday’ variable.
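Note that `format()` on a Date returns character strings, so the new column holds values like "01" rather than numbers. A minimal sketch with three made-up dates (the `dates`, `day_chr` and `day_int` names are assumptions for illustration):

```r
# Three example dates in the same format as 'dteday'
dates <- as.Date(c("2011-01-01", "2011-01-15", "2011-02-28"), format = "%Y-%m-%d")

# format() returns character strings such as "01", not numbers
day_chr <- format(dates, "%d")
print(day_chr)

# Convert to integer if the day should be used numerically
day_int <- as.integer(day_chr)
print(day_int)
```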
Output:
> str(bike_data)
'data.frame':   731 obs. of  16 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
 $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
 $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
 $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

> summary(bike_data)
    instant             dteday        season            yr              mnth           holiday           weekday
 Min.   :  1.0   2011-01-01:  1   Min.   :1.000   Min.   :0.0000   Min.   : 1.00   Min.   :0.00000   Min.   :0.000
 1st Qu.:183.5   2011-01-02:  1   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000
 Median :366.0   2011-01-03:  1   Median :3.000   Median :1.0000   Median : 7.00   Median :0.00000   Median :3.000
 Mean   :366.0   2011-01-04:  1   Mean   :2.497   Mean   :0.5007   Mean   : 6.52   Mean   :0.02873   Mean   :2.997
 3rd Qu.:548.5   2011-01-05:  1   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000
 Max.   :731.0   2011-01-06:  1   Max.   :4.000   Max.   :1.0000   Max.   :12.00   Max.   :1.00000   Max.   :6.000
                 (Other)   :725
   workingday      weathersit         temp              atemp              hum            windspeed           casual
 Min.   :0.000   Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000   Min.   :0.02239   Min.   :   2.0
 1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200   1st Qu.:0.13495   1st Qu.: 315.5
 Median :1.000   Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267   Median :0.18097   Median : 713.0
 Mean   :0.684   Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279   Mean   :0.19049   Mean   : 848.2
 3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302   3rd Qu.:0.23321   3rd Qu.:1096.0
 Max.   :1.000   Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725   Max.   :0.50746   Max.   :3410.0
   registered        cnt
 Min.   :  20   Min.   :  22
 1st Qu.:2497   1st Qu.:3152
 Median :3662   Median :4548
 Mean   :3656   Mean   :4504
 3rd Qu.:4776   3rd Qu.:5956
 Max.   :6946   Max.   :8714

> dim(bike_data)
[1] 731  16

> str(bike_data)
'data.frame':   731 obs. of  16 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Date, format: "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
 $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mnth      : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ weekday   : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
 $ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
 $ weathersit: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
 $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
Having done this, we check the contribution of such variables to the target variable ‘cnt’ by plotting them with the ggplot() function. The code below plots ‘instant’ against ‘cnt’.
# Extracting the day values from the date and storing them in a new column - 'day'
bike_data$day=format(bike_data$dteday,"%d")
unique(bike_data$day)

library(ggplot2)
ggplot(bike_data, aes(instant, cnt)) +
  geom_point() +
  scale_x_continuous("Instant") +
  scale_y_continuous("Count")

# Dropping the above mentioned data columns from the dataset
bike_data=subset(bike_data,select = -c(instant,day,dteday,casual,registered))
str(bike_data)
dim(bike_data)
Output:

As the variable ‘instant’ is merely a record index rather than a meaningful predictor of ‘cnt’, we drop it. Moreover, ‘casual’ and ‘registered’ add up to the target variable ‘cnt’, so we drop them as well, together with ‘day’ and ‘dteday’, using the subset() function in R.
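The claim that the two components add up to the target can be checked before dropping them. The sketch below uses the first three rows shown in the str() output of this article; the `toy` data frame and the variable names are assumptions for illustration.

```r
# First three rows of the relevant columns, copied from the str() output
toy <- data.frame(
  instant    = 1:3,
  casual     = c(331, 131, 120),
  registered = c(654, 670, 1229),
  cnt        = c(985, 801, 1349)
)

# Check that the two components add up to the target in every row
components_match <- all(toy$casual + toy$registered == toy$cnt)
print(components_match)

# Drop the redundant columns; the minus sign inside select removes them
toy_reduced <- subset(toy, select = -c(instant, casual, registered))
print(names(toy_reduced))
```

On the full dataset, the same `all()` check would confirm the relationship for every row before the columns are removed.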
> str(bike_data)
'data.frame':   731 obs. of  12 variables:
 $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mnth      : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ weekday   : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
 $ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
 $ weathersit: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

> dim(bike_data)
[1] 731  12
2. Treating Missing data
Having covered informative analysis, we now turn to operative analysis. Here, we check the variables for missing data and treat it, if any!
You can learn more about missing value analysis in R here.
Example:
We have used sum(is.na()) to count the NA values in the whole dataset, and summary(is.na()) to see the counts for every column.
### Missing Value Analysis ###
sum(is.na(bike_data))
summary(is.na(bike_data))
Output:
As seen, the data is free from missing values.
> sum(is.na(bike_data))
[1] 0
> summary(is.na(bike_data))
   season            yr             mnth          holiday         weekday        workingday      weathersit
 Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical
 FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731
    temp           atemp            hum          windspeed          cnt
 Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical
 FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731
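Because the bike data has no missing values, the counts above are all zero. To show what these checks report when values are missing, here is a small sketch on a made-up data frame; the `toy` object and its NA positions are assumptions for illustration.

```r
# Toy data frame with a few deliberately missing values
toy <- data.frame(
  hum       = c(0.80, NA, 0.44, 0.59),
  windspeed = c(0.16, 0.25, NA, NA)
)

# Total number of missing values in the whole data frame
total_na <- sum(is.na(toy))
print(total_na)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows without any missing value
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

`colSums(is.na())` gives the per-column counts as a plain numeric vector, which is often easier to read than the `summary()` table for wide datasets.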
3. Treatment of Outliers!
Now, we detect the presence of outliers in the data using the boxplot() function and replace them with NA values.
Further, we either impute the NA values with a statistical value (such as the mean or median) or drop them. We’ll work with the tidyr library in this example.
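To make the detection rule concrete: `boxplot.stats()` reports in `$out` the values lying more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a made-up vector; `x` and the other names are assumptions for illustration.

```r
# Made-up values with one obvious outlier
x <- c(10, 12, 13, 14, 15, 16, 18, 60)

# Values flagged by boxplot.stats() with its default coefficient of 1.5
flagged <- boxplot.stats(x)$out
print(flagged)

# The same rule written out by hand using the hinges from fivenum()
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)
lower  <- hinges[1] - 1.5 * spread
upper  <- hinges[2] + 1.5 * spread
manual <- x[x < lower | x > upper]
print(manual)

# Replace the flagged values with NA
x_clean <- x
x_clean[x_clean %in% flagged] <- NA
print(x_clean)
```

Both approaches flag the same value here, which is why a single call to `boxplot.stats()` is enough in practice.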
Example:
### Outlier Analysis -- DETECTION ###

# 1. Outliers exist only in continuous/numeric data variables. Thus, we store the names of the numeric and categorical variables in separate vectors.
col = c('temp','cnt','hum','windspeed')
categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")

# 2. Using a boxplot to detect the presence of outliers in the numeric/continuous data columns.
boxplot(bike_data[,c('temp','atemp','hum','windspeed')])

# From the above visualization, it is clear that the data variables 'hum' and 'windspeed' contain outliers.

### Outlier Analysis -- REMOVAL ###

# 1. From the boxplot, we have identified the presence of outliers. That is, the values lying beyond the whiskers (more than 1.5 times the interquartile range above the upper quartile or below the lower quartile) are considered outliers.
# 2. Now, we replace the outlier values with NA.
for (x in c('hum','windspeed'))
{
  value = bike_data[,x][bike_data[,x] %in% boxplot.stats(bike_data[,x])$out]
  bike_data[,x][bike_data[,x] %in% value] = NA
}

# Checking whether the outliers in the above columns have been replaced by NA
sum(is.na(bike_data$hum))
sum(is.na(bike_data$windspeed))
as.data.frame(colSums(is.na(bike_data)))

# Removing the rows that contain NA values
library(tidyr)
bike_data = drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))
Output:
> # Checking whether the outliers in the above columns have been replaced by NA
> sum(is.na(bike_data$hum))
[1] 2
> sum(is.na(bike_data$windspeed))
[1] 13
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                2
windspeed                         13
cnt                                0
> # Removing the rows that contain NA values
> library(tidyr)
> bike_data = drop_na(bike_data)
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                0
windspeed                          0
cnt                                0
From the above output, it is clear that we have detected 13 + 2 = 15 outliers across the two variables and replaced them with NA. Further, we dropped the rows containing NA, as their count was negligible.
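If dropping rows is not acceptable, the other option mentioned earlier is to impute the NA values with a statistic of the column. The sketch below shows median imputation on a made-up vector; it is an alternative to the approach used in this article, and the `hum` values here are invented for illustration.

```r
# Made-up humidity values where two outliers have already been set to NA
hum <- c(0.81, 0.70, NA, 0.59, 0.44, NA, 0.63)

# Median of the observed values only
hum_median <- median(hum, na.rm = TRUE)
print(hum_median)

# Fill the missing positions with the median
hum_imputed <- hum
hum_imputed[is.na(hum_imputed)] <- hum_median
print(hum_imputed)
```

The median is used here rather than the mean because it is less sensitive to any extreme values that remain in the column.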
Conclusion
With this, we have come to the end of this topic. Feel free to comment below if you have any questions.
Till then, Happy Learning!!