Missing value analysis using R programming


Hello, readers! In this article, we will be focusing on an important aspect of Data analysis – Missing value analysis using R Programming, in detail.

So, let us begin!! 🙂


Impact of Missing values on a model!

Data analysis is an important step in solving a data science prediction problem. It lets us examine the data in depth and draw observations from it.

Missing value analysis is one such part of Data Analysis.

To begin with, missing values are hazardous for prediction problems. Missing values (represented as NA in R) in your variables can distort the statistical distribution of the data, which in turn disturbs the overall modelling of the data values.
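To make the effect concrete, here is a small sketch on a made-up vector (the values are illustrative and not taken from the bike rental data). Most summary functions in R return NA as soon as a single NA is present, unless they are told to ignore it.

```r
# Toy vector with one missing value (illustrative data, not from Bike.csv)
x <- c(10, 20, NA, 40)

# A single NA makes the default summary statistics NA as well
mean_with_na <- mean(x)
sd_with_na <- sd(x)

# na.rm = TRUE drops the NA values before computing the statistic
mean_without_na <- mean(x, na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```

Silently ignoring the NA values is not free either: the statistic is then based on fewer observations than the dataset appears to contain.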

To put missing value analysis and removal into practice, we will be making use of the dataset from the Bike Rental Count Prediction problem.

You can find the snippet of the dataset below!

Missing Values Dataset

Let us start by loading the dataset into the R environment. We have made use of the read.csv() function for the same.

The expression sum(is.na(dataset)) counts the total number of missing (NA) values present in a particular variable or in the entire dataset.

In addition, summary(is.na(dataset)) reports the number of missing (TRUE) and non-missing (FALSE) values for each variable of the dataset.

# Remove all existing objects from the workspace
rm(list = ls())

# Set the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

# Load the dataset
bike_data <- read.csv("Bike.csv", header = TRUE)

#### Missing Value Analysis ####
# Total number of missing values in the whole dataset
sum(is.na(bike_data))
# Number of missing (TRUE) and present (FALSE) values per column
summary(is.na(bike_data))

Output:

[1] 44
    temp            hum          windspeed          cnt         
 Mode :logical   Mode :logical   Mode :logical   Mode :logical  
 FALSE:706       FALSE:706       FALSE:706       FALSE:706      
 TRUE :11        TRUE :11        TRUE :11        TRUE :11    

As seen above, every data column of the dataset contains 11 missing values.
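The per-column counts can also be expressed as percentages, which makes it easier to judge how serious the problem is. Below is a minimal sketch; toy_data and its values are made up for illustration and only stand in for bike_data.

```r
# Hypothetical toy data frame standing in for bike_data
toy_data <- data.frame(
  temp = c(0.34, NA, 0.20, 0.23, NA),
  hum  = c(0.81, 0.70, NA, 0.59, 0.44),
  cnt  = c(985, 801, 1349, 1562, 1600)
)

# Number of missing values per column
missing_count <- colSums(is.na(toy_data))

# Share of missing values per column, in percent
missing_pct <- round(colMeans(is.na(toy_data)) * 100, 1)

# Combine both into one overview table
missing_overview <- data.frame(missing_count, missing_pct)
print(missing_overview)
```

colMeans() works here because is.na() returns a logical matrix, and the mean of TRUE/FALSE values is the fraction of TRUE entries.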

Now that we have detected missing (NA) values, it is important to either replace them with meaningful values or remove them.

Mentioned below are some of the techniques through which we can replace/impute missing data values easily.


1. Impute missing values with statistical values

The simplest approach is to replace the detected missing values with the mean, median or mode of the column.

In the example below, we have replaced the missing values of every column with the mean of the values present in that particular column.

We have made use of the base R mean() function with na.rm = TRUE, so that the missing values are ignored while the mean is computed.

With as.data.frame(colSums(is.na(data))) function, we can get the total count of missing values per column of the dataset.

# Replace the missing values of each column with that column's mean
bike_data$temp[is.na(bike_data$temp)] <- mean(bike_data$temp, na.rm = TRUE)
bike_data$hum[is.na(bike_data$hum)] <- mean(bike_data$hum, na.rm = TRUE)
bike_data$cnt[is.na(bike_data$cnt)] <- mean(bike_data$cnt, na.rm = TRUE)
bike_data$windspeed[is.na(bike_data$windspeed)] <- mean(bike_data$windspeed, na.rm = TRUE)

# Count the missing values left in each column
as.data.frame(colSums(is.na(bike_data)))

Output:

                       colSums(is.na(bike_data))
temp                              0
hum                               0
windspeed                         0
cnt                               0
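The same pattern carries over to the median and the mode. The sketch below uses made-up data; the weather column and the stat_mode() helper are hypothetical and only serve the illustration. A helper is needed because base R has no built-in function for the statistical mode (mode() returns the storage type of an object instead).

```r
# Hypothetical helper: most frequent non-missing value of a vector
stat_mode <- function(v) {
  v <- v[!is.na(v)]
  unique_vals <- unique(v)
  unique_vals[which.max(tabulate(match(v, unique_vals)))]
}

# Toy data (illustrative values only)
toy_data <- data.frame(
  hum     = c(0.80, 0.70, NA, 0.60, 0.40),
  weather = c("clear", "mist", "clear", NA, "clear"),
  stringsAsFactors = FALSE
)

# Median for a numeric column
toy_data$hum[is.na(toy_data$hum)] <- median(toy_data$hum, na.rm = TRUE)

# Mode for a categorical column
toy_data$weather[is.na(toy_data$weather)] <- stat_mode(toy_data$weather)

print(toy_data)
```

The median is usually the safer choice when a column is skewed or contains outliers, while the mode is the natural candidate for categorical columns.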

2. Delete the missing values

If a large share of the values is missing, imputation becomes unreliable. A common rule of thumb is that when more than 30% of the values are missing, the best way is to either delete the column or drop the rows that contain the missing values.

Below, we have used the drop_na() function from the ‘tidyr’ library to drop every row that contains a missing value, across the entire dataset, in a single go!

library(tidyr)
# Drop every row that contains at least one missing value
bike_data <- drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))

Output:

                       colSums(is.na(bike_data))
temp                              0
hum                               0
windspeed                         0
cnt                               0
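The 30% rule of thumb can be applied column by column before any rows are dropped. The following is a sketch on made-up data; the gust column and the threshold variable are assumptions for illustration. It uses na.omit(), which ships with R and drops incomplete rows in much the same way as drop_na().

```r
# Toy data (illustrative values only): 'gust' is mostly missing
toy_data <- data.frame(
  temp = c(0.34, 0.36, NA, 0.20, 0.23),
  hum  = c(0.81, 0.70, 0.44, 0.59, 0.44),
  gust = c(NA, NA, 0.16, NA, NA)
)

# Assumed cut-off, following the 30% rule of thumb
threshold <- 0.30

# Fraction of missing values per column
missing_share <- colMeans(is.na(toy_data))

# Keep only the columns at or below the cut-off
reduced <- toy_data[, missing_share <= threshold, drop = FALSE]

# Drop the remaining rows that still contain a missing value
cleaned <- na.omit(reduced)

print(cleaned)
```

Dropping the sparse column first keeps more rows: running na.omit() directly on toy_data would leave no rows at all, because every row has a missing value in at least one column.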

3. kNN Imputation

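Another useful way of imputing the missing values is through kNN (k-nearest neighbours) imputation.

In kNN imputation, for every row with a missing value, the algorithm looks for the ‘k’ rows that are most similar to it, its nearest neighbours.

That is, it measures the distance between rows, for example with the Euclidean distance formula, identifies the ‘k’ closest rows, and then replaces the missing value with a value derived from those neighbours, such as their average.

In the below example, we have used the knnImputation() function from the ‘DMwR’ library to perform kNN imputation on the missing values. DMwR has since been archived on CRAN, so it may need to be installed from the archive; its successor, DMwR2, also provides knnImputation().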

library(DMwR)
knn_res <- knnImputation(bike_data)  # perform knn imputation.
anyNA(knn_res)

Using the anyNA() function, we can check whether any missing values remain. It returns TRUE if the data contains missing values; otherwise it returns FALSE.

Output:

> anyNA(knn_res)
[1] FALSE
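To show the idea behind kNN imputation, here is a from-scratch sketch in base R. It is not how knnImputation() is implemented; the knn_impute_column() helper, its parameters and the toy data are all hypothetical. It assumes that only the target column has missing values and fills each one with the plain average of the ‘k’ nearest rows.

```r
# Hypothetical helper: fill NAs in one numeric column with the mean of
# that column over the k nearest rows (Euclidean distance on the other,
# standardised columns). Assumes the other columns contain no NAs.
knn_impute_column <- function(df, target, k = 2) {
  predictors <- scale(df[, setdiff(names(df), target), drop = FALSE])
  missing_rows <- which(is.na(df[[target]]))
  known_rows <- which(!is.na(df[[target]]))

  for (i in missing_rows) {
    # Euclidean distance from row i to every row with a known target
    diffs <- sweep(predictors[known_rows, , drop = FALSE], 2, predictors[i, ])
    distances <- sqrt(rowSums(diffs^2))

    # Average the target over the k closest rows
    nearest <- known_rows[order(distances)[seq_len(k)]]
    df[[target]][i] <- mean(df[[target]][nearest])
  }
  df
}

# Toy data (illustrative values only): the last row is missing cnt
toy_data <- data.frame(
  temp = c(0.20, 0.22, 0.60, 0.62, 0.21),
  hum  = c(0.80, 0.78, 0.40, 0.42, 0.79),
  cnt  = c(1000, 1100, 4000, 4200, NA)
)

imputed <- knn_impute_column(toy_data, "cnt", k = 2)
print(imputed)
```

The last row resembles the first two rows in temp and hum, so its cnt is filled from those two neighbours rather than from the overall column mean, which the two much larger values would pull upwards.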

Conclusion

By this, we have come to the end of this topic. Feel free to comment below in case you have any questions. For more such posts related to R, stay tuned and till then, Happy Learning!! 🙂

