Hello, readers! In this article, we will focus on how to create dummy variables in R programming, in detail.
So, let us begin!
Why do we need dummy variables in R?
Let us first understand the concept of dummy variables. Consider a dataset that contains categorical data values, that is, values that fall into a fixed set of categories or groups.
Most machine learning models expect numeric input, so handling a large number of categories and groups directly is a cumbersome task for them. Thus arises the need to convert categorical or level entries into numbers.
This is when the concept of dummy variables comes into the picture.
A dummy variable is a numeric representation of a category or level of a factor variable. That is, every group or level of the categorical variable gets its own column, which holds 1 for the rows that belong to that level and 0 for all other rows.
For example, consider a dataset that contains a variable ‘Poll’ with the values ‘Yes’ and ‘No’. To represent the two groups as numeric entries, we can create dummies for them.
The transformed dataset would then have two additional columns: ‘Poll.1’, which represents the ‘Yes’ values (it is set to 1 for every row associated with the level ‘Yes’ and 0 otherwise), and ‘Poll.2’, which does the same for the ‘No’ values.
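As a minimal sketch of this idea in base R (no packages needed), the two indicator columns can be built by comparing each value against a level. The data frame `survey` and its values are made up for illustration; only the ‘Poll.1’ and ‘Poll.2’ names come from the example above.

```r
# Hypothetical toy data: a single categorical column 'Poll'
survey <- data.frame(Poll = c("Yes", "No", "Yes", "No", "No"))

# One indicator column per level: 1 where the row has that level, else 0
survey$Poll.1 <- as.integer(survey$Poll == "Yes")
survey$Poll.2 <- as.integer(survey$Poll == "No")

print(survey)
```

Every row has exactly one of the two columns set to 1, which is the property the packages discussed in this article produce for us automatically.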
1. R fastDummies library to create dummy variables
R provides us with the fastDummies library, which contains the dummy_cols() function for creating dummy variables with ease.
With the dummy_cols() function, one can select the variables for which the dummies need to be created.
dummy_cols(data, select_columns = 'columns')
In this example, we have made use of the Bank Loan Defaulter dataset. You can find the dataset here.
Further, we have used the dummy_cols() function to create dummy variables for the column ‘ed’.
rm(list = ls())
#install.packages('fastDummies')
library('fastDummies')
dta = read.csv("bank-loan.csv",header=TRUE)
dim(dta)
dum <- dummy_cols(dta, select_columns = 'ed')
dim(dum)
As the output below shows, the dataset initially has 9 columns. After the dummy variables are created, it has 14 columns.
Each of the 5 levels of the ed variable has been given a separate column. In each of these columns, only the rows that belong to that level are set to 1; all other rows are set to 0.
> dim(dta)
[1] 850   9
> dim(dum)
[1] 850  14
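If you do not have the CSV file at hand, the same behaviour can be tried on a small made-up data frame. This is only a sketch: the `toy` data frame and its values are hypothetical stand-ins for the bank loan data, and it assumes the fastDummies package is installed.

```r
# install.packages('fastDummies')  # uncomment if the package is not installed
library(fastDummies)

# Hypothetical stand-in for the bank loan data: 'ed' has 3 levels here
toy <- data.frame(
  age = c(41, 27, 40, 24, 36, 52),
  ed  = c(1, 2, 3, 1, 2, 1)
)

toy_dum <- dummy_cols(toy, select_columns = 'ed')

dim(toy)        # 6 rows, 2 columns
dim(toy_dum)    # 6 rows, 2 original columns + 3 dummy columns
names(toy_dum)  # the new columns are named ed_1, ed_2 and ed_3

# Each row belongs to exactly one level, so the dummies sum to 1 per row
rowSums(toy_dum[, c('ed_1', 'ed_2', 'ed_3')])
```

Note that the original ‘ed’ column is kept alongside the new dummy columns, which is why the column count grows by the number of levels.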
What if we need to create dummies for multiple variables at once?
Well, we can collect the names of all the variables for which we need dummies into a vector using the c() function and pass that vector to the select_columns argument.
rm(list = ls())
#install.packages('fastDummies')
library('fastDummies')
dta = read.csv("bank-loan.csv",header=TRUE)
dim(dta)
dum <- dummy_cols(dta, select_columns = c('ed','default'))
dim(dum)
Here, we have created dummies for both ‘ed’ and ‘default’ data columns.
> dim(dta)
[1] 850   9
> dum <- dummy_cols(dta, select_columns = c('ed','default'))
> dim(dum)
[1] 850  17
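The count of 17 columns is 9 original columns plus 5 for ‘ed’ plus 3 for ‘default’. If ‘default’ only takes two observed values, a likely explanation for the third column is missing data: by default, dummy_cols() adds a separate ‘_NA’ column when a selected column contains NA values. The sketch below illustrates this on made-up data; the `toy` data frame is hypothetical, and whether the bank loan file actually contains missing ‘default’ values is an assumption here.

```r
library(fastDummies)

# Hypothetical toy data: 'default' is missing for the last two rows
toy <- data.frame(
  ed      = c(1, 2, 3, 1, 2),
  default = c(0, 1, 0, NA, NA)
)

# Default behaviour: missing values get their own dummy column
toy_dum <- dummy_cols(toy, select_columns = c('ed', 'default'))
names(toy_dum)

# With ignore_na = TRUE, no separate NA column is created
toy_dum2 <- dummy_cols(toy, select_columns = c('ed', 'default'),
                       ignore_na = TRUE)
names(toy_dum2)
```

Comparing the two sets of names shows the extra ‘default_NA’ column in the first result only.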
2. R dummies library to create dummy variables
The R dummies library can also be used to create dummy variables for categorical data columns with ease.
For this, we can make use of the dummy() function, which enables us to create dummy entries for a selected column.
In the example below, we have created dummy variables for the column ‘ed’ using the dummy() function.
rm(list = ls())
library('dummies')
dta = read.csv("bank-loan.csv",header=TRUE)
dim(dta)
dum <- dummy(dta$ed)
dim(dum)
In the result, each level has been placed in its own column.
Also, only the data rows that match a particular level are set to 1 in that level's column; all other rows are set to 0.
For example, if a row has the level ‘ed1’, then the column ‘ed1’ is set to 1 for that row; otherwise it is set to 0.
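If the dummies package is not available on your system, a similar matrix can be sketched with base R's model.matrix() function. This is an alternative technique, not the dummy() function itself; the `ed` vector below is a made-up stand-in for dta$ed, and the columns are renamed by hand to mimic the ‘ed1’, ‘ed2’ naming described above.

```r
# Hypothetical toy vector standing in for dta$ed
ed <- c(1, 2, 3, 1, 2, 1)

# Treat 'ed' as a factor and drop the intercept (- 1) so that
# every level gets its own 0/1 column
dum <- model.matrix(~ factor(ed) - 1)

# Rename the columns to ed1, ed2, ... for readability
colnames(dum) <- paste0('ed', levels(factor(ed)))

dim(dum)
dum
```

Each row of the resulting matrix has a single 1 in the column of its level and 0 elsewhere.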
With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.
Do let us know about your experience with dummy variables in the comment box!
For more such posts related to R programming, stay tuned with us.
Till then, Happy Learning!! 🙂