Exploratory Data Analysis in R programming


Hello readers! Today, we will take a detailed look at an important step in Data Science and Machine Learning: Exploratory Data Analysis in R.

So, let us begin!! 🙂


Exploratory Data Analysis – Steps to know

In data science and machine learning, we use models to make predictions on real-time data, which requires processing the data into a clean form first.

That is, for machine learning models to work well, it is very important to pre-process the data as well as to understand it.

Exploratory data analysis is the first and foremost step in a data science project. In this process, we analyze the variables of the dataset individually (univariate analysis) and in pairs (bivariate analysis).

We basically follow the steps below in the Exploratory Data Analysis of a dataset:

  1. Initially, get the statistical description of the data values.
  2. Study and analyze the continuous variables of the dataset.
  3. Analyze the categorical variables.
  4. Relate the numeric and categorical variables to find an association between the variables through visualization.
  5. Check the data types of the variables.
  6. Treat the outliers, if any.
  7. Impute the missing values.
  8. Visualize the distribution of the data variables (numeric as well as categorical).
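
The first few steps above can be sketched on a small example. The snippet below is only an illustration, not the workflow of this article: it assumes R's built-in iris dataset as a stand-in (it is not the dataset used here) and relies on base R functions only.

```r
# Assumption: the built-in iris dataset stands in for a real project dataset
data(iris)

# Step 1: statistical description of the data values
print(summary(iris))

# Step 2: study a continuous variable
sepal_mean <- mean(iris$Sepal.Length)
sepal_sd   <- sd(iris$Sepal.Length)

# Step 3: analyze a categorical variable through its frequency table
species_counts <- table(iris$Species)
print(species_counts)

# Step 4: relate a numeric variable to a categorical one
mean_by_species <- tapply(iris$Sepal.Length, iris$Species, mean)
print(mean_by_species)
```

In an interactive session, boxplot(Sepal.Length ~ Species, data = iris) would show the same numeric-versus-categorical relationship as a picture, one box per category.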

Describing the data through summaries and statistical values is known as Informative analysis, while working on the data variables, for example by treating outliers, is known as Operative analysis.


1. Statistical description and Analysis of data types

In this article, we have made use of the Bike Rental Count Prediction dataset. You can find the dataset here!

Initially, we have loaded the dataset into the R environment using the read.csv() function.

Loading the dataset:

#Remove all existing objects from the workspace
rm(list = ls())
#Setting the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

#Load the dataset
bike_data = read.csv("day.csv",header=TRUE)
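
For readers who do not have day.csv at hand, the loading step can be tried with the following self-contained sketch. It writes a three-row CSV to a temporary file (the rows reuse values visible in the str() output later in this article, and csv_path and sample_data are names introduced here for illustration) and reads it back with read.csv(). Note that from R 4.0.0 onwards read.csv() defaults to stringsAsFactors = FALSE, so a text column such as dteday loads as character rather than as a factor.

```r
# Assumption: a tiny temporary CSV stands in for day.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("dteday,season,temp,cnt",
             "2011-01-01,1,0.344,985",
             "2011-01-02,1,0.363,801",
             "2011-01-03,1,0.196,1349"),
           csv_path)

# header = TRUE treats the first line as column names
sample_data <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(sample_data)  # dteday is character; season and cnt are integer; temp is numeric
dim(sample_data)  # number of rows and columns

unlink(csv_path)  # remove the temporary file
```

Building the path with tempfile() (or file.path() for a real project folder) avoids having to change the working directory with setwd().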

Further, we inspect the data type and values of every column using the str() function. The class() function tells us the type of the dataset object itself, i.e. a data frame. In addition, we use the summary() function to get the statistics of the dataset in terms of mean, median, etc.

From this analysis (see the output further below), we find that the columns ‘season’, ‘yr’, ‘mnth’, ‘holiday’, ‘weekday’, ‘workingday’, and ‘weathersit’ hold categorical values but have been read as integers.

Thus, we convert these columns to the factor type, and the ‘dteday’ column to the Date type.

# 1. Understanding the data values of every column of the dataset
str(bike_data)

# 2. Viewing the type of the dataset
class(bike_data)

# 3.Understanding the data distribution of the dataset
summary(bike_data)

# 4. Dimensions of the dataset
dim(bike_data)

bike_data$season=as.factor(bike_data$season)
bike_data$yr=as.factor(bike_data$yr)
bike_data$mnth=as.factor(bike_data$mnth)
bike_data$holiday=as.factor(bike_data$holiday)
bike_data$weekday=as.factor(bike_data$weekday)
bike_data$workingday=as.factor(bike_data$workingday)
bike_data$weathersit=as.factor(bike_data$weathersit)
bike_data$dteday = as.Date(bike_data$dteday,format="%Y-%m-%d")

str(bike_data)

#Now we will check the effect of certain variables on the dependent variable (cnt) and decide whether to retain or drop those data columns

#Extracting the day values from the date and storing into a new column - 'day'
bike_data$day=format(bike_data$dteday,"%d")

Further, we have extracted the ‘day’ value from the ‘dteday’ variable.

Output:

> str(bike_data)
'data.frame':	731 obs. of  16 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
 $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
 $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
 $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...



> summary(bike_data)
    instant             dteday        season            yr              mnth          holiday           weekday     
 Min.   :  1.0   2011-01-01:  1   Min.   :1.000   Min.   :0.0000   Min.   : 1.00   Min.   :0.00000   Min.   :0.000  
 1st Qu.:183.5   2011-01-02:  1   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.: 4.00   1st Qu.:0.00000   1st Qu.:1.000  
 Median :366.0   2011-01-03:  1   Median :3.000   Median :1.0000   Median : 7.00   Median :0.00000   Median :3.000  
 Mean   :366.0   2011-01-04:  1   Mean   :2.497   Mean   :0.5007   Mean   : 6.52   Mean   :0.02873   Mean   :2.997  
 3rd Qu.:548.5   2011-01-05:  1   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:10.00   3rd Qu.:0.00000   3rd Qu.:5.000  
 Max.   :731.0   2011-01-06:  1   Max.   :4.000   Max.   :1.0000   Max.   :12.00   Max.   :1.00000   Max.   :6.000  
                 (Other)   :725                                                                                     
   workingday      weathersit         temp             atemp              hum           windspeed           casual      
 Min.   :0.000   Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000   Min.   :0.02239   Min.   :   2.0  
 1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200   1st Qu.:0.13495   1st Qu.: 315.5  
 Median :1.000   Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267   Median :0.18097   Median : 713.0  
 Mean   :0.684   Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279   Mean   :0.19049   Mean   : 848.2  
 3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302   3rd Qu.:0.23321   3rd Qu.:1096.0  
 Max.   :1.000   Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725   Max.   :0.50746   Max.   :3410.0  
                                                                                                                        
   registered        cnt      
 Min.   :  20   Min.   :  22  
 1st Qu.:2497   1st Qu.:3152  
 Median :3662   Median :4548  
 Mean   :3656   Mean   :4504  
 3rd Qu.:4776   3rd Qu.:5956  
 Max.   :6946   Max.   :8714  
 


> dim(bike_data)
[1] 731  16



> str(bike_data)
'data.frame':	731 obs. of  16 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Date, format: "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
 $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mnth      : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ weekday   : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
 $ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
 $ weathersit: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
 $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
> 
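
The repeated as.factor() calls above can also be written more compactly. The following is a sketch on a small made-up data frame (toy_data and factor_cols are names introduced here for illustration, and the values are invented); it converts several columns in one lapply() call and parses the date with as.Date().

```r
# Assumption: toy_data is an invented miniature of the bike data
toy_data <- data.frame(
  dteday  = c("2011-01-01", "2011-01-02", "2011-02-15"),
  season  = c(1L, 1L, 1L),
  holiday = c(0L, 0L, 1L),
  cnt     = c(985L, 801L, 1200L),
  stringsAsFactors = FALSE
)

# Columns that hold category codes rather than quantities
factor_cols <- c("season", "holiday")

# Convert all of them to factor in one step
toy_data[factor_cols] <- lapply(toy_data[factor_cols], as.factor)

# Parse the date column, then extract the day of the month
toy_data$dteday <- as.Date(toy_data$dteday, format = "%Y-%m-%d")
toy_data$day    <- format(toy_data$dteday, "%d")  # character strings such as "01"

str(toy_data)
```

Keep in mind that format() returns character strings; wrap the call in as.integer() if a numeric day is needed.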

Having done this, we inspect the unique values of ‘day’ and plot the variable ‘instant’ against the target variable ‘cnt’ using the ggplot() function, to check whether it contributes to the target.

#Extracting the day values from the date and storing into a new column - 'day'
bike_data$day=format(bike_data$dteday,"%d")
unique(bike_data$day)

library(ggplot2)          
ggplot(bike_data, aes(instant, cnt)) + geom_point() + scale_x_continuous("Instant")+ scale_y_continuous("Count")

#Dropping the above mentioned data columns from the dataset
bike_data=subset(bike_data,select = -c(instant,day,dteday,casual,registered))
str(bike_data)
dim(bike_data)

Output:

[Figure: ggplot scatter plot of ‘cnt’ (y-axis, labelled Count) against ‘instant’ (x-axis, labelled Instant)]

As the variable ‘instant’ is not related to ‘cnt’, we drop it, along with ‘dteday’ and the derived ‘day’ column. Moreover, ‘casual’ and ‘registered’ sum to the target variable ‘cnt’, so we drop them as well. All of these columns are removed with the subset() function, as shown in the code above.

> str(bike_data)
'data.frame':	731 obs. of  12 variables:
 $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mnth      : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ weekday   : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
 $ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
 $ weathersit: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
> dim(bike_data)
[1] 731  12
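
The statement that casual and registered add up to cnt can be checked before dropping them. A minimal sketch, using a small data frame (toy_rides, a name introduced here) built from the first three rows of the str() output shown earlier:

```r
# First three rows of the relevant columns, copied from the str() output
toy_rides <- data.frame(
  instant    = 1:3,
  casual     = c(331L, 131L, 120L),
  registered = c(654L, 670L, 1229L),
  cnt        = c(985L, 801L, 1349L)
)

# TRUE only if the two components sum to the target in every row
components_match <- all(toy_rides$casual + toy_rides$registered == toy_rides$cnt)
print(components_match)

# Drop the redundant columns; -c(...) inside select means "all but these"
toy_rides <- subset(toy_rides, select = -c(instant, casual, registered))
print(names(toy_rides))
```

Running the same all() check on the full dataset, before the columns are removed, would confirm the relationship for every row rather than just these three.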

2. Treating Missing data

Having covered the informative analysis, we now focus on the Operative analysis. Here, we check the variables for missing data and treat it, if any is present!

You can learn more about missing value analysis in R here.

Example:

We have used sum(is.na()) to count the NA values in the entire dataset, and summary(is.na()) to see the breakdown of NA and non-NA values for every column.

### Missing Value Analysis ###
sum(is.na(bike_data))
summary(is.na(bike_data))

Output:

As seen, the data is free from missing values.

> sum(is.na(bike_data))
[1] 0
> summary(is.na(bike_data))
   season            yr             mnth          holiday         weekday        workingday      weathersit     
 Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731      
    temp           atemp            hum          windspeed          cnt         
 Mode :logical   Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:731       FALSE:731       FALSE:731       FALSE:731       FALSE:731   
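
To make the difference between the overall count and the per-column counts concrete, here is a small sketch on an invented data frame (toy_na is a name used only for this illustration, and the NA positions are made up):

```r
# Assumption: invented values with a few NAs placed by hand
toy_na <- data.frame(
  hum       = c(0.81, NA, 0.44, 0.59),
  windspeed = c(0.16, 0.25, NA, NA),
  cnt       = c(985L, 801L, 1349L, 1562L)
)

total_na      <- sum(is.na(toy_na))           # one number for the whole data frame
na_per_column <- colSums(is.na(toy_na))       # one count per column
complete_rows <- sum(complete.cases(toy_na))  # rows without any NA

print(total_na)
print(na_per_column)
print(complete_rows)
```

The per-column view is usually the more useful one, since it shows which variables would need treatment.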

3. Treatment of Outliers!

Now, we detect the presence of outliers in the data using the boxplot() function and replace them with NA values.

Further, we can either impute the NA values with a statistical value (such as the mean or median) or drop them. In this example we drop them, using the drop_na() function from the tidyr library.
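
As a rough illustration of what a boxplot treats as an outlier: boxplot.stats() flags values lying more than 1.5 times the spread between the hinges (roughly the quartiles) beyond those hinges, with 1.5 being its default coef. The sketch below uses an invented vector, and the names x, hinges, spread, and fences are introduced here only for the example.

```r
# Assumption: invented values with one obviously extreme point
x <- c(0.10, 0.12, 0.15, 0.16, 0.18, 0.19, 0.20, 0.22, 0.25, 0.60)

# Values that boxplot() would draw as individual points
flagged <- boxplot.stats(x)$out

# The same rule by hand: the hinges are the 2nd and 4th values of fivenum()
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)
fences <- c(hinges[1] - 1.5 * spread, hinges[2] + 1.5 * spread)
manual <- x[x < fences[1] | x > fences[2]]

print(flagged)
print(manual)
```

Both approaches flag the same value here, which shows that a point merely above the upper quartile is not an outlier; it has to lie beyond the fence.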

Example:

### Outlier Analysis -- DETECTION ###

# 1. Outliers exist only in continuous/numeric data variables. Thus, we store the names of the numeric and the categorical variables in separate vectors.
col = c('temp','cnt','hum','windspeed')
categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")

# 2. Using BoxPlot to detect the presence of outliers in the numeric/continuous data columns.
boxplot(bike_data[,c('temp','atemp','hum','windspeed')])

# From the above visualization, it is clear that the data variables 'hum' and 'windspeed' contain outliers.
#OUTLIER ANALYSIS -- Removal of Outliers
# 1. From the boxplot, we have identified the presence of outliers. That is, the data values that lie more than 1.5 times the interquartile range above the upper quartile or below the lower quartile are considered outliers.
# 2. Now, we will replace the outlier data values with NA.

for (x in c('hum','windspeed'))
{
  value = bike_data[,x][bike_data[,x] %in% boxplot.stats(bike_data[,x])$out]
  bike_data[,x][bike_data[,x] %in% value] = NA
} 

#Checking whether the outliers in the above defined columns are replaced by NA or not
sum(is.na(bike_data$hum))
sum(is.na(bike_data$windspeed))
as.data.frame(colSums(is.na(bike_data)))

#Removing the rows that contain NA values
library(tidyr)
bike_data = drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))

Output:

> #Checking whether the outliers in the above defined columns are replaced by NA or not
> sum(is.na(bike_data$hum))
[1] 2
> sum(is.na(bike_data$windspeed))
[1] 13
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                2
windspeed                         13
cnt                                0
> #Removing the rows that contain NA values
> library(tidyr)
> bike_data = drop_na(bike_data)
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                0
windspeed                          0
cnt                                0

From the above output, it is clear that we have detected 13 + 2 = 15 outlier values across the two variables and replaced them with NA. Further, we dropped the rows containing these NA values, as they are a negligible share of the 731 observations.
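
The example above drops the affected rows. The other option mentioned in this section, imputing with a statistical value, could look roughly like the following sketch. It uses an invented data frame (toy_weather) and the column median as the fill value; the median is only one possible choice and is an assumption of this sketch, not something prescribed here.

```r
# Assumption: invented values with one extreme point per column
toy_weather <- data.frame(
  hum       = c(0.62, 0.58, 0.65, 0.60, 0.05, 0.63),
  windspeed = c(0.16, 0.18, 0.17, 0.55, 0.19, 0.15)
)

for (x in c("hum", "windspeed")) {
  # Flag the boxplot outliers and replace them with NA
  is_outlier <- toy_weather[[x]] %in% boxplot.stats(toy_weather[[x]])$out
  toy_weather[[x]][is_outlier] <- NA

  # Fill each NA with the median of the remaining values
  fill_value <- median(toy_weather[[x]], na.rm = TRUE)
  toy_weather[[x]][is.na(toy_weather[[x]])] <- fill_value
}

print(toy_weather)
```

Unlike drop_na(), this keeps every row, which matters more when the dataset is small or when many rows contain an outlier in at least one column.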


Conclusion

By this, we have come to the end of this topic. Feel free to comment below in case you come across any question.

Till then, Happy Learning!! 🙂
