Missing data occurs when a record has no value for a variable. Left untreated, it causes serious problems in the data modeling process, because most algorithms cannot work with missing values directly.
There are many ways to handle missing data in R. You can drop the affected records, but keep in mind that you are discarding information when you do so and may lose a potential edge in modeling. Alternatively, you can impute the missing data with the mean or median of the observed values. In this article, we will look at filling missing values in R using the tidyr package.
Tidyr is an R package that offers many functions to help you tidy your data. The greater the data quality, the better the model!
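Before moving on to tidyr, here is a minimal base R sketch of the two simpler options mentioned above, dropping records and imputing with the mean. The `scores` vector is a made-up example, not data from this article.

```r
# Hypothetical scores with missing entries (illustration only)
scores <- c(86, NA, 90, NA, 88)

# Option 1: drop the missing records
dropped <- scores[!is.na(scores)]

# Option 2: replace each NA with the mean of the observed values
imputed <- ifelse(is.na(scores), mean(scores, na.rm = TRUE), scores)

dropped
imputed
```

Dropping shortens the vector, while imputing keeps its length but invents values, so which one fits depends on your data.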
1. Missing Data in R
- R marks missing values as NA. NaN (not a number) is also treated as missing by is.na().
- It is a missing record in the variable. It can be a single value or an entire row.
- Missing values can occur both in numerical and categorical data.
- R offers many methods to deal with missing data.
- The tidyr package helps in filling missing data using a top-down or bottom-up approach.
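As a quick illustration of the points above, the following base R sketch shows how NA and NaN are detected and counted. The vector `x` and the data frame `df_demo` are hypothetical names used only for this example.

```r
# Hypothetical vector containing both NA and NaN
x <- c(1, NA, 3, NaN)

na_flags  <- is.na(x)   # TRUE for both NA and NaN
nan_flags <- is.nan(x)  # TRUE only for NaN

# Count the missing entries per column of a data frame
df_demo <- data.frame(id = 1:4, score = x)
missing_per_column <- colSums(is.na(df_demo))

na_flags
nan_flags
missing_per_column
```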
2. Tidyr Package in R
- The Tidyr package in R is used to clean the raw data in R.
- It offers functions for cleaning, organizing, filling missing values and more.
- We will be using tidyr with the pipe operator (%>%).
To install the Tidyr package in R, run the below code in R.
#Install tidyr package
install.packages('tidyr')
#Load the library
library(tidyr)
package ‘tidyr’ successfully unpacked and MD5 sums checked
You will get the confirmation message shown above after tidyr has been installed successfully.
3. Create a Dataframe
Yes, we have to create a simple sample data frame that has missing values. This will help us in using the fill function of tidyr to fill the missing data.
#Create a dataframe
a <- c('A','B','C','D','E','F','G','H','I','J')
b <- c('Roger','Carlo','Durn','Jessy','Mounica','Rack','Rony','Saly','Kelly','Joseph')
c <- c(86,NA,NA,NA,88,NA,NA,86,NA,NA)
df <- data.frame(a,b,c)
df
   a       b  c
1  A   Roger 86
2  B   Carlo NA
3  C    Durn NA
4  D   Jessy NA
5  E Mounica 88
6  F    Rack NA
7  G    Rony NA
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA
Well, we got our data frame, but with a lot of missing values. In cases like this, you can use the fill function from tidyr to replace each missing value with the nearest valid value above or below it.
4. Two Different Approaches
As I said earlier, you can fill in the missing data. The fill function supports two directions:
- Up: each missing value is replaced by the next valid value below it, so values are carried from the bottom of the column towards the top (bottom-up).
- Down: each missing value is replaced by the last valid value above it, so values are carried from the top of the column towards the bottom (top-down).
Didn’t get it?
Don’t worry. We will be going through some examples to illustrate the same and you will get to know how things work.
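As a first taste, here is a small sketch that applies both directions to the same column. The data frame `demo` and its column `x` are hypothetical and separate from the article's data.

```r
library(tidyr)

# Hypothetical one-column data frame (not the article's data)
demo <- data.frame(x = c(NA, 1, NA, NA, 2, NA))

# Fill towards the bottom and towards the top of the column
down <- fill(demo, x, .direction = "down")
up   <- fill(demo, x, .direction = "up")

# Put the three versions side by side for comparison
data.frame(original = demo$x, down = down$x, up = up$x)
```

With 'down', the leading NA stays because nothing sits above it. With 'up', the trailing NA stays because nothing sits below it.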
5. Filling Missing Values – ‘Up’
In this example, we have a data frame with 3 columns and 10 data records in it. Before using the fill function to handle the missing data, make sure that filling from neighboring rows actually suits your data.
Sometimes, when data is collected, people enter a value once to stand for a run of identical values.
Ex: When collecting ages, if there are 10 people in a row who are all 25, the collector may write 25 against the last person only, indicating that all 10 people are 25.
Please note that this is not the most common situation you will face. But when your data was recorded this way, you can use the fill function to deal with it.
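The age example can be sketched as follows. The `survey` data frame is invented for illustration: the age is written only against the last of the 10 people, and an upward fill spreads it to the rest.

```r
library(tidyr)

# Hypothetical survey: the age is recorded only against the last person
survey <- data.frame(
  person = paste0("P", 1:10),
  age    = c(rep(NA, 9), 25)
)

# Carry the recorded value upwards to the nine rows above it
filled <- fill(survey, age, .direction = "up")
filled$age
```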
#Dataframe
   a       b  c
1  A   Roger 86
2  B   Carlo NA
3  C    Durn NA
4  D   Jessy NA
5  E Mounica 88
6  F    Rack NA
7  G    Rony NA
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA

#Create new dataframe by filling missing values (Up)
df1 <- df %>% fill(c, .direction = 'up')
df1
   a       b  c
1  A   Roger 86
2  B   Carlo 88
3  C    Durn 88
4  D   Jessy 88
5  E Mounica 88
6  F    Rack 86
7  G    Rony 86
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA
You can observe that the fill function filled the missing values in the Up direction (bottom-up).
- You can see that the last 2 rows are still NA. This is because, with the direction set to Up, each NA takes the next valid value below it, and there is no valid value below rows 9 and 10.
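To make the mechanics concrete, here is a hand-rolled base R sketch of an upward fill. It only illustrates the idea and is not how tidyr implements fill. The function name `fill_up` is hypothetical.

```r
# Hand-rolled upward fill, for illustration only (not tidyr's implementation)
fill_up <- function(x) {
  carry <- NA
  # Walk from the last element to the first, remembering the most
  # recent valid value seen so far
  for (i in rev(seq_along(x))) {
    if (is.na(x[i])) {
      x[i] <- carry   # stays NA while no valid value has been seen yet
    } else {
      carry <- x[i]
    }
  }
  x
}

scores <- c(86, NA, NA, NA, 88, NA, NA, 86, NA, NA)
result <- fill_up(scores)
result
```

The walk starts at row 10, so rows 10 and 9 are visited before any valid value has been seen and are left as NA.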
6. Filling Missing Values – ‘Down’
Well, here we will be using the ‘Down’ method to fill the missing values in the data. As before, keep in mind the assumptions mentioned in the earlier section, so that you understand what you are doing and what the outcome will be.
#Data
   a       b  c
1  A   Roger 86
2  B   Carlo NA
3  C    Durn NA
4  D   Jessy NA
5  E Mounica 88
6  F    Rack NA
7  G    Rony NA
8  H    Saly 86
9  I   Kelly NA
10 J  Joseph NA

#Creates new dataframe by filling missing values (Down) - (Top-Down approach)
df1 <- df %>% fill(c, .direction = 'down')
df1
   a       b  c
1  A   Roger 86
2  B   Carlo 86
3  C    Durn 86
4  D   Jessy 86
5  E Mounica 88
6  F    Rack 88
7  G    Rony 88
8  H    Saly 86
9  I   Kelly 86
10 J  Joseph 86
- Here, there are no missing values left. This is because, with the direction set to Down, each NA takes the last valid value above it. The first row holds 86, so that value is carried down until the next valid record (88) is reached, and so on to the end of the column.
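A quick way to confirm that nothing is left unfilled is anyNA(). The sketch below rebuilds the score column under the hypothetical name `score`, then shows the one case where a downward fill still leaves a gap, a column that starts with NA.

```r
library(tidyr)

# Same scores as in the article, under a hypothetical column name
scores <- data.frame(score = c(86, NA, NA, NA, 88, NA, NA, 86, NA, NA))
down <- fill(scores, score, .direction = "down")
anyNA(down$score)   # no NA remains because the first row holds a value

# If the first value is missing, a downward fill has nothing to carry into it
leading <- data.frame(score = c(NA, 86, NA))
lead_filled <- fill(leading, score, .direction = "down")
lead_filled$score
```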
7. Wrapping Up
Filling missing values in R is an important step when you are analyzing any data that has missing entries. Things may seem a bit hard at first, but go through the article once or twice and it will become clear. It’s not a hard cake to digest!
I hope this method will come to your assistance in your future assignments. That’s all for now. Happy R!!! 🙂
Further reading: Fill function in R