Hello, readers! In this article, we will take a detailed look at an important library for data cleansing in R programming: the tidyr library.
So, let us get started!!
Usage of tidyr library in R
Data cleansing plays a very important role in preparing a dataset before Machine Learning models are applied to it for predictions. In R programming, the tidyr library serves this purpose.
The tidyr library helps us arrange the data in a simple and clean form, where each variable has its own column and each observation has its own row. This in turn reduces the effort of reshaping and simplifying the data prior to modelling.
In the examples that follow, we will be making use of the below dataset.
Have a look!
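In case the dataset is not at hand, a small stand-in can be built by hand so that the snippets in this article can be tried out. The values below are purely hypothetical and only the column names follow the ones used in the examples; the column types (character for ‘holiday’ and ‘mnth’, numeric for the rest) are an assumption made so that the replacement values used later fit.

```r
library(tidyr)

# Hypothetical stand-in for bike_data (made-up values, not the real dataset)
bike_data <- data.frame(
  yr         = c(0, 0, NA, 1, 1),
  mnth       = c("1", "1", "2", NA, "12"),
  holiday    = c("0", NA, NA, "1", NA),
  weekday    = c(6, 0, 1, 2, 3),
  workingday = c(0, 0, 1, NA, 1),
  weathersit = c(2, 2, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Inspect the structure and the missing values per column
str(bike_data)
print(colSums(is.na(bike_data)))
```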
1. The fill() function
The fill() function of the tidyr package enables us to replace or impute the missing values of a specific column. The NA (missing) values of the passed column get replaced by the previous non-missing entry of that column. A different fill direction can be chosen through the .direction argument.
In the below example, we have filled the missing values of the column ‘holiday’. Thus, the NA values get replaced by the previous entry present, i.e. ‘0’.
bike_data %>% fill(holiday)
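To see the effect of the fill direction, here is a minimal sketch on a toy data frame. The data frame `df` and its values are hypothetical and not part of the article's dataset. The functions are called directly instead of through the pipe so that the block depends on tidyr alone.

```r
library(tidyr)

# Hypothetical toy data with gaps in 'holiday'
df <- data.frame(
  day     = 1:5,
  holiday = c(0, NA, NA, 1, NA)
)

# Default direction is "down": each NA takes the previous non-missing value
filled_down <- fill(df, holiday)

# "up": each NA takes the next non-missing value instead
filled_up <- fill(df, holiday, .direction = "up")

print(filled_down$holiday)  # 0 0 0 1 1
print(filled_up$holiday)    # 0 1 1 1 NA
```

Note that with the "up" direction the last value stays NA, because there is no later entry to copy from. In the same way, a leading NA would stay NA with the default "down" direction.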
2. The replace_na() function
Unlike the fill() function, the replace_na() function replaces the NA values of multiple columns with specific user-defined values.
In the below example, we have replaced the NA values of the following columns:
- yr -> 0
- holiday -> ‘unknown’
- workingday -> 1
- mnth -> ‘12’
bike_data %>% replace_na(list(yr=0,holiday="unknown",workingday=1,mnth="12"))
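The same idea can be sketched on a toy data frame. The data frame `df` below is hypothetical. One assumption worth stating: recent tidyr releases appear to expect each replacement value to be compatible with the type of its column, which is why ‘holiday’ is stored as a character column here so that "unknown" fits.

```r
library(tidyr)

# Hypothetical toy data; 'holiday' is character so "unknown" is a valid value
df <- data.frame(
  yr         = c(0, NA, 1),
  holiday    = c("0", NA, "1"),
  workingday = c(NA, 1, 0),
  stringsAsFactors = FALSE
)

# One replacement value per column, passed as a named list
cleaned <- replace_na(df, list(yr = 0, holiday = "unknown", workingday = 1))
print(cleaned)

# replace_na() also accepts a plain vector and a single replacement value
vec_cleaned <- replace_na(c(1, NA, 3), 0)
print(vec_cleaned)
```

Columns that are not named in the list are left untouched, so any NA values in them remain.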
3. The drop_na() function
With the drop_na() function, we can altogether drop the rows which contain NA values. That is, every row that has a missing value in any column gets deleted. If column names are passed, only those columns are checked for missing values.
bike_data = drop_na(bike_data)
print(bike_data)
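The difference between dropping on all columns and dropping on selected columns can be sketched as follows. The data frame `df` and its values are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: rows 2, 3 and 4 each have one missing value
df <- data.frame(
  yr         = c(0, NA, 1, 1),
  holiday    = c(0, 0, NA, 1),
  weathersit = c(2, 1, 1, NA)
)

# No columns given: a row is dropped if ANY column is NA
all_complete <- drop_na(df)

# Column given: only NA values in 'holiday' cause a row to be dropped
holiday_complete <- drop_na(df, holiday)

print(nrow(all_complete))      # 1
print(nrow(holiday_complete))  # 3
```

Restricting the check to the columns that matter for the model is often the gentler choice, since it avoids throwing away rows because of gaps in columns that will not be used.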
4. The gather() function
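The gather() function accepts multiple columns as parameters and lengthens the dataset, reshaping it from wide to long format. It collapses the passed columns into key-value pairs: the column names go into a key column and the column values go into a value column. In current tidyr releases, gather() is superseded by pivot_longer(), although it still works.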
bike_data %>% gather(day_type, day, weekday:workingday)
In the above example, we have passed the column range ‘weekday:workingday’ as the columns to gather. ‘day_type’ is the name of the new key column, which holds the original column names, and ‘day’ is the name of the new value column, which holds their values.
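The reshaping can be sketched on a toy data frame, side by side with pivot_longer(), the function that tidyr now recommends for this job. The data frame `df` and the `id` column are hypothetical.

```r
library(tidyr)

# Hypothetical toy data in wide format: 2 rows, 2 columns to gather
df <- data.frame(
  id         = 1:2,
  weekday    = c(6, 0),
  workingday = c(0, 1)
)

# gather(): column names go to 'day_type', values go to 'day'
long_gather <- gather(df, day_type, day, weekday:workingday)

# pivot_longer(): same reshaping with explicitly named arguments
long_pivot <- pivot_longer(
  df,
  cols      = weekday:workingday,
  names_to  = "day_type",
  values_to = "day"
)

print(long_gather)
print(long_pivot)
```

Both results have four rows (two original rows times two gathered columns) and the columns id, day_type and day. The row order differs between the two functions, so compare the contents rather than the positions.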
5. The nest() function
The nest() function packs columns into a list-column. It collapses the selected columns into one data frame per group of the remaining columns, and stores these data frames as a list in a new column.
bike_data %>% nest(data = c(weathersit))
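Here, the column ‘weathersit’ is packed into a list-column named ‘data’, with one row for every unique combination of the remaining columns. To get one nested data frame per value of ‘weathersit’ instead, select all the other columns with nest(data = -weathersit).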
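The two ways of selecting columns for nest() can be compared on a toy data frame. The data frame `df` and its values are hypothetical.

```r
library(tidyr)

# Hypothetical toy data: two rows share weathersit == 1
df <- data.frame(
  weathersit = c(1, 1, 2),
  yr         = c(0, 1, 0),
  holiday    = c(0, 0, 1)
)

# Nest everything EXCEPT 'weathersit': one row per weathersit value
by_weather <- nest(df, data = -weathersit)

# Nest only 'weathersit': one row per unique (yr, holiday) combination
weather_nested <- nest(df, data = c(weathersit))

print(by_weather)
print(by_weather$data[[1]])  # the rows belonging to the first weathersit value

# unnest() reverses the operation and restores the flat rows
restored <- unnest(by_weather, data)
print(restored)
```

The first form is usually the one wanted when a model or summary should be computed separately for each group.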
With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.
For more posts related to R, stay tuned and till then, Happy Learning!! 🙂