Hello, readers! In this article, we will take a detailed look at an important aspect of data analysis: missing value analysis in R.
So, let us begin!! 🙂
Impact of Missing values on a model!
Data analysis is an important step in solving a data science prediction problem. It lets us examine the data closely and draw observations from it before any model is built.
Missing value analysis is one such part of Data Analysis.
To begin with, missing values can seriously harm the predictions of a model. Missing values (represented as NA in R) distort the statistical distribution of a variable, and many functions either return NA or silently drop the affected rows. That in turn disturbs the overall modelling of the data.
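To make the point concrete, here is a minimal sketch of how a single missing value affects a basic statistic. The vector hum and its values are made up for illustration and are not taken from the bike rental dataset.

```r
# Made-up humidity readings with one missing value
hum <- c(0.10, 0.20, NA, 0.30)

# A single NA propagates through the calculation
mean(hum)                 # NA

# na.rm = TRUE computes the statistic on the observed values only
mean(hum, na.rm = TRUE)   # 0.2
```

The same propagation happens with sum(), sd() and most other summary functions, which is why missing values should be handled before modelling.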
To put missing value analysis and the treatment of missing values into practice, we will make use of the Bike Rental Count Prediction problem.
You can find the snippet of the dataset below!

Let us start by loading the dataset into the R environment with the read.csv() function.
Using sum(is.na(dataset)), we can count the total number of missing (NA) values in a single variable or in the entire dataset.
In addition, summary(is.na(dataset)) reports the number of missing (TRUE) and non-missing (FALSE) values for each variable of the dataset.
# Remove all existing objects from the workspace
rm(list = ls())
# Set the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()
# Load the dataset
bike_data <- read.csv("Bike.csv", header = TRUE)
#### Missing Value Analysis ####
# Total number of missing values in the dataset
sum(is.na(bike_data))
# Count of missing (TRUE) and non-missing (FALSE) values per column
summary(is.na(bike_data))
Output:
[1] 44
    temp            hum          windspeed          cnt
 Mode :logical   Mode :logical   Mode :logical   Mode :logical
 FALSE:706       FALSE:706       FALSE:706       FALSE:706
 TRUE :11        TRUE :11        TRUE :11        TRUE :11
As seen above, each of the four columns shown contains 11 missing values, which together account for the total of 44.
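Alongside the raw counts, it often helps to look at the share of missing values per column. The following is a minimal sketch on a small made-up data frame (toy is a hypothetical stand-in, not the bike rental data) that uses colSums() and colMeans() on the logical matrix returned by is.na().

```r
# Toy data frame standing in for the bike rental data (values are made up)
toy <- data.frame(
  temp      = c(0.34, NA, 0.20, 0.23, NA),
  hum       = c(0.81, 0.70, NA, 0.59, 0.44),
  windspeed = c(0.16, 0.25, 0.25, 0.16, 0.19)
)

# Number of missing values per column
na_count <- colSums(is.na(toy))

# Share of missing values per column, in percent
na_percent <- round(100 * colMeans(is.na(toy)), 1)

# Combine both into one small report
missing_report <- data.frame(na_count, na_percent)
print(missing_report)
```

Because is.na() returns TRUE/FALSE values, the column means are simply the fraction of TRUE entries, that is, the fraction of missing values.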
Now that we have detected missing (NA) values, we need to treat them, either by replacing them with meaningful values or by removing them.
Mentioned below are some of the techniques through which we can replace/impute missing data values easily.
1. Impute missing values with statistical values
The simplest approach is to replace the detected missing values with the mean, median or mode of the column.
In the example below, we replace the missing values of every column with the mean of the observed values of that column.
We use the base R mean() function with na.rm = TRUE, so that the mean is computed over the non-missing values only.
With as.data.frame(colSums(is.na(data))), we then get the count of missing values per column of the dataset.
# Replace the NAs of each column with the mean of its observed values
bike_data$temp[is.na(bike_data$temp)] <- mean(bike_data$temp, na.rm = TRUE)
bike_data$hum[is.na(bike_data$hum)] <- mean(bike_data$hum, na.rm = TRUE)
bike_data$cnt[is.na(bike_data$cnt)] <- mean(bike_data$cnt, na.rm = TRUE)
bike_data$windspeed[is.na(bike_data$windspeed)] <- mean(bike_data$windspeed, na.rm = TRUE)
# Verify that no missing values remain
as.data.frame(colSums(is.na(bike_data)))
Output:
colSums(is.na(bike_data))
temp 0
hum 0
windspeed 0
cnt 0
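The same pattern works for the median and the mode. Base R has median() but no built-in function for the statistical mode, so the sketch below defines one. Both helpers (stat_mode and impute_with) and the toy data, including the weather column, are hypothetical names introduced only for this illustration.

```r
# Hypothetical helper: most frequent non-missing value of a vector
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: replace the NAs in x with fun() of the observed values
impute_with <- function(x, fun) {
  x[is.na(x)] <- fun(x[!is.na(x)])
  x
}

# Made-up data: one numeric and one categorical column
toy <- data.frame(
  hum     = c(0.80, 0.70, NA, 0.60, 0.10),
  weather = c("clear", "mist", "clear", NA, "clear"),
  stringsAsFactors = FALSE
)

toy$hum     <- impute_with(toy$hum, median)        # median for numeric data
toy$weather <- impute_with(toy$weather, stat_mode) # mode for categorical data
print(toy)
```

The median is usually the safer choice for skewed numeric columns, since it is less affected by outliers than the mean, while the mode is the natural choice for categorical columns.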
2. Delete the missing values
If a large share of the values is missing, imputation becomes unreliable. As a rule of thumb, when more than about 30% of a column is missing, it is usually better to either delete the column or drop the rows that contain missing values.
Below, we use the drop_na() function from the ‘tidyr’ library to drop every row that contains a missing value in a single go!
library(tidyr)
bike_data = drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))
Output:
colSums(is.na(bike_data))
temp 0
hum 0
windspeed 0
cnt 0
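The 30% rule of thumb can also be applied column by column in base R. The sketch below is one possible way to do it, on made-up data: the threshold value and the names toy, reduced and cleaned are assumptions for this illustration. It first removes the columns whose share of missing values exceeds the threshold and then drops the remaining incomplete rows with complete.cases().

```r
# Made-up data: hum is mostly missing, temp has a single NA
toy <- data.frame(
  temp      = c(0.34, NA, 0.20, 0.23, 0.30),
  hum       = c(NA, NA, NA, 0.59, 0.44),
  windspeed = c(0.16, 0.25, 0.25, 0.16, 0.19)
)

threshold <- 0.30  # assumed cut-off, taken from the 30% rule of thumb

# Share of missing values per column: temp 0.2, hum 0.6, windspeed 0
na_share <- colMeans(is.na(toy))

# Keep only the columns at or below the threshold (hum is removed)
reduced <- toy[, na_share <= threshold, drop = FALSE]

# Drop the rows that still contain a missing value (row 2 is removed)
cleaned <- reduced[complete.cases(reduced), , drop = FALSE]
print(cleaned)
```

Dropping the sparse column first keeps more rows than calling drop_na() on the full data frame, which here would leave only the two rows where hum is observed.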
3. kNN Imputation
Another widely used way of imputing missing values is kNN (k-nearest neighbours) imputation.
For each row that has a missing value, the algorithm finds the ‘k’ most similar rows, where similarity is measured with a distance such as the Euclidean distance.
It then fills in the missing value from those neighbours, for example with their (weighted) average.
In the example below, we use the knnImputation() function from the ‘DMwR’ library to perform kNN imputation on the missing values. Note that DMwR has since been archived on CRAN; its successor, DMwR2, also provides knnImputation().
library(DMwR)
knn_res <- knnImputation(bike_data) # perform knn imputation.
anyNA(knn_res)
Using the anyNA() function, we can check whether any missing values remain. It returns TRUE if the data contains at least one missing value, and FALSE otherwise.
Output:
> anyNA(knn_res)
[1] FALSE
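To show the idea behind kNN imputation without any package, here is a simplified base R sketch. It is not the implementation used by knnImputation(): it does not scale the columns and it uses a plain (unweighted) mean of the neighbours. The function knn_impute_sketch and the toy data are hypothetical and assume an all-numeric data frame with at least one complete row.

```r
# Hypothetical helper: fill NAs in a numeric data frame from the
# k nearest complete rows (unweighted mean, no scaling)
knn_impute_sketch <- function(df, k = 2) {
  donors <- as.matrix(df[complete.cases(df), , drop = FALSE])
  out <- df
  for (i in which(!complete.cases(df))) {
    target <- unlist(df[i, ])
    miss <- is.na(target)
    # Euclidean distance to every complete row, on the observed columns only
    diffs <- sweep(donors[, !miss, drop = FALSE], 2, target[!miss])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- order(dists)[seq_len(min(k, nrow(donors)))]
    # Replace each missing entry with the mean of the neighbours' values
    for (j in which(miss)) {
      out[i, j] <- mean(donors[nearest, j])
    }
  }
  out
}

# Made-up data: the last row is missing its hum value
toy <- data.frame(
  temp = c(0.20, 0.22, 0.80, 0.82, 0.21),
  hum  = c(0.50, 0.54, 0.90, 0.94, NA)
)

filled <- knn_impute_sketch(toy, k = 2)
print(filled)
```

The last row has a temp value close to the first two rows, so its missing hum value is filled with the mean of their hum values (0.50 and 0.54) rather than with the column mean, which the two high-humidity rows would pull upwards.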
Conclusion
By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question. For more such posts related to R, stay tuned and till then, Happy Learning!! 🙂