Hello, readers! In this article, we will be focusing on Label Encoding in R programming, in detail.
So, let us begin!!
First, what is Label Encoding?
Before diving deep into the concept of Label Encoding, let us understand why it emerged as a technique in the domain of Data Science and Machine Learning.
To recall, Machine Learning algorithms broadly deal with two kinds of data: labeled data, in which every record carries a label, and unlabeled data, which has no labels.
Supervised Learning algorithms work on labeled data to make predictions.
In a nutshell, a label can be considered as a number or a string that represents a group of entities, especially a categorical group. Having labeled data helps the algorithm make better sense of the values in the dataset.
Data Pre-processing is an essential step prior to modeling. Thus, at this step, it is necessary for us to understand the data formats and make necessary manipulations.
This is where the Label Encoder comes into the picture.
A Label Encoder converts the labels of a categorical variable into a numeric format, assigning one integer to each distinct category.
Let us consider a variable from a data set with the below labels:
Poll = ['Yes', 'No']
Here, we can use a Label Encoder to convert the above labels into a numeric format such as [0, 1].
This in turn makes the data set simpler for the machine learning model to work on, since many models accept only numeric input.
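To make the idea concrete, here is a minimal sketch of the same conversion in base R, without any extra library. The poll vector is made-up illustrative data, and using factor() is just one simple way to get numeric codes, not necessarily how a dedicated Label Encoder works internally. Note that factor() orders the labels alphabetically, so in this sketch 'No' becomes 0 and 'Yes' becomes 1.

```r
# Hypothetical poll responses (illustrative data only)
poll <- c("Yes", "No", "No", "Yes")

# factor() gives every unique label a level, sorted alphabetically by default.
# as.integer() returns 1-based level codes, so we subtract 1 to start from 0.
poll_encoded <- as.integer(factor(poll)) - 1L

print(poll_encoded)
```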
Let us now focus on the practical implementation of the same in the upcoming section.
Practical Implementation of a Label Encoder in R
To begin with, the 'superml' library provides a LabelEncoder class with the below set of functions to apply label encoding to our data:
- LabelEncoder$new(): This function creates and initializes an instance of the Label Encoder class.
- LabelEncoder$fit(): With this function, the encoder learns which numeric value stands for each label and stores that mapping. It does not return the encoded data as an output.
- LabelEncoder$fit_transform(): With this function, the encoder learns the mapping and returns the encoded data in one step.
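To get a feel for the difference between fitting and transforming, the below sketch mimics the two steps in base R. The helper functions fit_labels() and transform_labels() are hypothetical names made up for this illustration, and the alphabetical ordering of the labels is an assumption of the sketch. It is not the superml implementation.

```r
# Hypothetical helper: the "fit" step only learns a label -> integer mapping
fit_labels <- function(x) {
  lv <- sort(unique(as.character(x)))   # unique labels, sorted alphabetically
  setNames(seq_along(lv) - 1L, lv)      # named vector: label -> 0-based code
}

# Hypothetical helper: the "transform" step applies a learned mapping
transform_labels <- function(x, mapping) {
  unname(mapping[as.character(x)])
}

city <- c("Pune", "Satara", "Pune", "Satara", "Mumbai")

mapping <- fit_labels(city)                  # learn the mapping, no encoding yet
encoded <- transform_labels(city, mapping)   # encode using the learned mapping

print(mapping)
print(encoded)
```

A label that was not seen during fitting has no entry in the mapping, so this sketch would return NA for it.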
In the below example, we have created a data frame with the columns 'roll' and 'City', which are numeric and categorical in nature, respectively.
Further, we initialize the Label Encoder class with the new() function. After that, we encode the labels of the 'City' column into numeric format using the fit_transform() function.
    rm(list = ls())
    library(superml)

    # Data frame with a numeric column (roll) and a categorical column (City)
    dta <- data.frame(roll = c(1,2,3,4,5), City = c('Pune','Satara','Pune','Satara','Mumbai'))
    print("Data before label encoding..\n")
    print(dta)

    # Create the encoder and encode the City column
    label <- LabelEncoder$new()
    #print(label$fit(dta$City))
    dta$City <- label$fit_transform(dta$City)
    print(dta$City)

    print("Data after label encoding..\n")
    print(dta)

Output:

    [1] "Data before label encoding..\n"
      roll   City
    1    1   Pune
    2    2 Satara
    3    3   Pune
    4    4 Satara
    5    5 Mumbai
    > print(dta$City)
    [1] 1 2 1 2 0
    [1] "Data after label encoding..\n"
      roll City
    1    1    1
    2    2    2
    3    3    1
    4    4    2
    5    5    0
Label Encoder with a dataset
In this example, we make use of a dataset stored in the file data.csv.
The column 'Type' of this dataset is of categorical form, and we perform label encoding on it as shown below:
Here, we have loaded the data into the R environment using the read.csv() function. Then we load the superml library, create a Label Encoder, and encode the 'Type' column using the fit_transform() function.
    rm(list = ls())

    # Setting the working directory
    setwd("D:/data")
    getwd()

    # Load the dataset
    dta <- read.csv("data.csv", header = TRUE)

    library(superml)
    label <- LabelEncoder$new()
    print(label$fit(dta$Type))
    dta$Type <- label$fit_transform(dta$Type)
    print(dta$Type)
Output:

    > print(dta$Type)
    [1] 0 4 0 3 7 8 1 5 6 2
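Since data.csv is not included here, the below sketch applies the same idea to the built-in iris dataset, so that it can be run as-is. It uses base R's factor codes in place of superml, which is an assumption made only to keep the example free of extra packages. The resulting codes may therefore differ from what a Label Encoder would assign.

```r
# Copy the built-in iris dataset; 'Species' is its categorical column
dta <- iris

# Labels before encoding
print(levels(dta$Species))

# 'Species' is already a factor, so as.integer() returns its 1-based level
# codes; subtracting 1 gives 0-based labels
dta$Species <- as.integer(dta$Species) - 1L

# Number of rows for each encoded label
print(table(dta$Species))
```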
By this, we have come to the end of this topic. Feel free to comment below in case you have any questions.
Try implementing the concept of Label Encoding with other categorical data values and do let us know about your experience in the comment section.
Till then, Happy Learning!! 🙂