Hello, readers! In this article, we will be focusing on an important aspect of Data analysis – **Missing value analysis using R Programming**, in detail.

So, let us begin!

## Impact of Missing values on a model!

**Data Analysis** is an important step in solving a data science prediction problem. It lets us examine the data in depth and draw observations from it.

Missing value analysis is one such part of Data Analysis.

To begin with, **missing values** are hazardous for prediction problems. Missing values (`NA` in R, often loosely called NULL values) in your data variables can distort the statistical distribution of the data. That in turn disturbs any model built on those values.
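To make this impact concrete, here is a minimal sketch on a made-up vector (the values are illustrative assumptions, not taken from the bike dataset). It shows how a single `NA` propagates through common summary statistics unless we handle it explicitly.

```r
# Hypothetical readings; one value is missing
temps <- c(0.34, 0.36, NA, 0.20, 0.23)

# A single NA makes the plain summary statistics NA as well
mean_raw <- mean(temps)
sd_raw   <- sd(temps)

# na.rm = TRUE computes the statistic on the observed values only
mean_obs <- mean(temps, na.rm = TRUE)

print(mean_raw)
print(sd_raw)
print(mean_obs)
```

Any model or statistic that consumes such a column either fails, returns `NA`, or silently works on fewer observations, which is why we check for missing values first.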

To implement missing value analysis and treatment in practice, we will use the Bike Rental Count Prediction problem.

You can find the snippet of the dataset below!

Let us start by loading the dataset into the R environment, using the `read.csv()` function.

With `sum(is.na(dataset))`, we can count the total number of missing values present in a particular data variable or in the entire dataset.

In addition, `summary(is.na(dataset))` reports, for each data variable, how many values are missing (`TRUE`) and how many are present (`FALSE`).

    # Remove all the existing objects
    rm(list = ls())

    # Set the working directory
    setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
    getwd()

    # Load the dataset
    bike_data = read.csv("Bike.csv", header = TRUE)

    #### Missing Value Analysis ####
    sum(is.na(bike_data))
    summary(is.na(bike_data))

**Output:**

    [1] 44
        temp            hum          windspeed          cnt
     Mode :logical   Mode :logical   Mode :logical   Mode :logical
     FALSE:706       FALSE:706       FALSE:706       FALSE:706
     TRUE :11        TRUE :11        TRUE :11        TRUE :11

As seen above, each of the four columns shown (`temp`, `hum`, `windspeed` and `cnt`) contains 11 missing values, which accounts for the total of 44.
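Raw counts are easier to judge when expressed as a share of each column. The following is a small sketch on a made-up data frame (the name `toy` and its values are assumptions for illustration) that reports both the count and the percentage of missing values per column.

```r
# Toy data frame standing in for the real dataset (values are made up)
toy <- data.frame(
  temp      = c(0.34, NA, 0.20, 0.23, 0.25),
  hum       = c(0.81, 0.70, NA, NA, 0.44),
  windspeed = c(0.16, 0.25, 0.25, 0.16, 0.19)
)

total_missing <- sum(is.na(toy))             # count over the whole data frame
missing_count <- colSums(is.na(toy))         # count per column
missing_pct   <- 100 * colMeans(is.na(toy))  # percentage per column

print(total_missing)
print(data.frame(missing_count, missing_pct))
```

The same three expressions should work unchanged on `bike_data`, since `is.na()` returns a logical matrix for any data frame.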

Now that we have detected missing values, we need to treat them, either by replacing them with meaningful values or by removing them.

Mentioned below are some of the techniques through which we can replace/impute missing data values easily.

## 1. Impute missing values with statistical values

The simplest approach is to replace the missing values detected above with the mean, median or mode of the column.

In the example below, we replace the missing values in every affected column with the mean of the values present in that column.

We use base R's `mean()` function with the `na.rm` argument enabled, so that the mean is computed over the observed values only.

With `as.data.frame(colSums(is.na(data)))`, we can get the total count of missing values per column of the dataset.

    bike_data$temp[is.na(bike_data$temp)] <- mean(bike_data$temp, na.rm = TRUE)
    bike_data$hum[is.na(bike_data$hum)] <- mean(bike_data$hum, na.rm = TRUE)
    bike_data$cnt[is.na(bike_data$cnt)] <- mean(bike_data$cnt, na.rm = TRUE)
    bike_data$windspeed[is.na(bike_data$windspeed)] <- mean(bike_data$windspeed, na.rm = TRUE)

    as.data.frame(colSums(is.na(bike_data)))

**Output:**

              colSums(is.na(bike_data))
    temp                              0
    hum                               0
    windspeed                         0
    cnt                               0
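The example above covers the mean only. As a hedged sketch of the other two statistics, the snippet below imputes a numeric column with its median and a categorical column with its mode. The data frame `toy`, the `weather` column and the helper `stat_mode()` are assumptions made up for illustration; base R has no built-in function for the statistical mode.

```r
# Toy data (made up): one numeric and one categorical column with gaps
toy <- data.frame(
  hum     = c(0.81, 0.70, NA, 0.59, 0.44),
  weather = c("clear", NA, "mist", "clear", "rain"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-missing value
# (ties go to whichever value table() lists first)
stat_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Median for the numeric column
toy$hum[is.na(toy$hum)] <- median(toy$hum, na.rm = TRUE)

# Mode for the categorical column
toy$weather[is.na(toy$weather)] <- stat_mode(toy$weather)

print(toy)
```

The median is usually the safer choice when a column is skewed or has outliers, because a few extreme values can pull the mean away from the typical value.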

## 2. Delete the missing values

If a large share of the values is missing, imputation becomes unreliable. A common rule of thumb is that when more than about 30% of a column is missing, it is better to either delete the column or drop the rows that contain missing values.

Below, we have used the `drop_na()` function from the **tidyr** library to drop every row that contains a missing value, across the entire dataset, in a single go!

    library(tidyr)

    bike_data = drop_na(bike_data)
    as.data.frame(colSums(is.na(bike_data)))

**Output:**

              colSums(is.na(bike_data))
    temp                              0
    hum                               0
    windspeed                         0
    cnt                               0
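The 30% rule can also be applied in code. Below is a minimal base R sketch, assuming a made-up data frame `toy` and a hypothetical cut-off variable `max_missing`. It first drops the columns that exceed the cut-off and then drops the rows that still contain a missing value.

```r
# Toy data (made up): 'hum' is mostly missing, 'temp' has one gap
toy <- data.frame(
  temp = c(0.34, NA, 0.20, 0.23, 0.25),
  hum  = c(NA, NA, 0.44, NA, 0.59),
  cnt  = c(985, 801, 1349, 1562, 1600)
)

max_missing <- 0.30  # assumed cut-off; tune it for your data

# Step 1: drop columns whose share of missing values exceeds the cut-off
missing_share <- colMeans(is.na(toy))
reduced <- toy[, missing_share <= max_missing, drop = FALSE]

# Step 2: drop the remaining rows that still contain a missing value
cleaned <- reduced[complete.cases(reduced), , drop = FALSE]

print(names(reduced))
print(nrow(cleaned))
```

Dropping the sparse column first keeps more rows: here only one row is lost, whereas removing incomplete rows from the original `toy` would have discarded most of the data.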

## 3. kNN Imputation

Another important way of imputing missing values is kNN (k-nearest neighbours) imputation.

**In kNN imputation, we choose a number of neighbours ‘k’, and the algorithm relies on the distances between observations.**

**That is, for each observation with a missing value, it identifies the ‘k’ most similar observations based on the Euclidean distance and then fills in the missing value from those neighbours' values (for example, their average).**
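To illustrate the idea, here is a simplified base R sketch of kNN imputation. The function `knn_impute_sketch()` and the data frame `toy` are hypothetical names made up for this example, and this is not how any particular library implements it: it handles numeric columns only, takes a plain average of the neighbours, and skips the scaling and distance weighting that library implementations typically offer.

```r
# Hypothetical helper: fill NAs in a numeric data frame with the mean of the
# k nearest complete rows (Euclidean distance on the observed columns)
knn_impute_sketch <- function(df, k = 2) {
  m <- as.matrix(df)
  complete_rows <- m[complete.cases(m), , drop = FALSE]
  stopifnot(nrow(complete_rows) >= k)

  for (i in which(!complete.cases(m))) {
    missing_cols  <- is.na(m[i, ])
    observed_cols <- !missing_cols

    # Distance from row i to every complete row, using observed columns only
    diffs <- sweep(complete_rows[, observed_cols, drop = FALSE], 2,
                   m[i, observed_cols])
    dists <- sqrt(rowSums(diffs^2))

    # Average the k nearest neighbours in each missing column
    nearest <- order(dists)[seq_len(k)]
    m[i, missing_cols] <- colMeans(
      complete_rows[nearest, missing_cols, drop = FALSE]
    )
  }
  as.data.frame(m)
}

# Toy data (made up): the last row is close to the first two rows in 'temp'
toy <- data.frame(
  temp = c(0.20, 0.22, 0.80, 0.82, 0.21),
  hum  = c(0.40, 0.44, 0.90, 0.94, NA)
)
filled <- knn_impute_sketch(toy, k = 2)
print(filled)
```

The missing `hum` value is filled from the two rows whose `temp` is closest to 0.21, so it lands near their humidity rather than near the overall column mean.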

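In the example below, we have used the `knnImputation()` function from the **DMwR** library to perform kNN imputation on the missing values. Note that DMwR has since been archived on CRAN; its successor package, DMwR2, provides a function of the same name.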

    library(DMwR)

    # Perform kNN imputation
    knn_res <- knnImputation(bike_data)
    anyNA(knn_res)

Using the `anyNA()` function, we can check whether any missing values remain. It returns `TRUE` if the data contains missing values; otherwise it returns `FALSE`.

**Output:**

    > anyNA(knn_res)
    [1] FALSE

## Conclusion

By this, we have come to the end of this topic. Feel free to comment below, in case you come across any question. For more such posts related to R, stay tuned and till then, Happy Learning!!