In this tutorial, we’ll move on to understanding factors in R programming. One operation we perform frequently in data science is estimating the value of a variable from a model we have built. Sometimes we need to estimate the price of a share or a house, and sometimes we need to estimate which car color is likely to sell the fastest.

Variables in data science fall under two categories: **continuous** and **categorical**. Continuous variables take numerical values, including fractional (floating-point) ones. Prices of houses or shares, and quantifiable attributes such as the age, weight or height of a person, are all continuous variables.

On the other hand, categorical variables take one of a fixed set of values that can be represented using a set of labels. Examples include marital status, gender, the color of a vehicle, the highest educational degree of a person and so on.

Categorical variables are represented using **factors** in R.
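As a minimal sketch of the distinction, a continuous variable is typically stored as a numeric vector and a categorical one as a factor. The data below is hypothetical and invented purely for illustration.

```r
# Hypothetical sample data, not taken from any real dataset.
price  <- c(250000.5, 310000, 199999.99)    # continuous: numeric values
colour <- factor(c("Red", "Blue", "Red"))   # categorical: a fixed set of labels

is.numeric(price)   # TRUE
is.factor(colour)   # TRUE
levels(colour)      # "Blue" "Red"
```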


## Creating Factors in R

Factors can be created using the `factor()` function.

    factor(x=vector, levels, labels, ordered=TRUE/FALSE)

The first argument to `factor()` is the **vector** `x` of values that you wish to factorize. Note that a factor is one-dimensional: if you pass a matrix, its dimensions are dropped. `x` should therefore be a one-dimensional vector, typically of character strings or integer values.
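To illustrate why a matrix is not a suitable input, here is a small sketch using a hypothetical 2 x 2 character matrix. In practice `factor()` accepts the matrix, but the result is a plain one-dimensional factor.

```r
# Hypothetical 2 x 2 character matrix, for illustration only.
m <- matrix(c("a", "b", "a", "c"), nrow = 2)
f <- factor(m)

dim(m)      # 2 2
dim(f)      # NULL: the matrix shape is gone
length(f)   # 4
levels(f)   # "a" "b" "c"
```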

Secondly, you can supply the **levels** you need in the factor. **Levels** is a vector of the unique values the factor may take. This argument is optional; if you omit it, R uses the sorted unique values of `x`.
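The effect of the levels argument can be sketched as follows. The `sizes` vector is hypothetical; the example shows the default levels, a custom level order that includes a value with no observations, and what happens to values that are left out of the levels.

```r
sizes <- c("small", "large", "small")   # hypothetical data

# Default: levels are the sorted unique values.
levels(factor(sizes))                   # "large" "small"

# Explicit levels: custom order, plus a level with no observations yet.
f <- factor(sizes, levels = c("small", "medium", "large"))
levels(f)                               # "small" "medium" "large"
table(f)                                # "medium" has a count of 0

# Values that are missing from `levels` become NA.
g <- factor(sizes, levels = c("small"))
is.na(g)                                # FALSE TRUE FALSE
```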

The third argument is **labels**. When the values are encoded as a vector of integers, you need to specify which label each integer stands for. You could use 0 and 1 to represent male and female, and the labels record that mapping, with one label per level in the same order. So basically this is the key for looking up the factor's values.

Finally, you have a Boolean argument, **ordered**. Sometimes you may wish to retain the order among the levels of a factor. For example, you may encode the month of joining using the integers 1 to 12, representing the months from January to December. In these cases, you need to set **ordered** to TRUE.

Let us look at examples of factors now.

    # Encode the genders of people into a vector first.
    # These would usually be extracted from a dataset.
    > genvector <- c("Male","Female","Female","Male","Male","Female")
    # Create a factor from this vector.
    > genfact <- factor(genvector)
    > genfact
    [1] Male   Female Female Male   Male   Female
    Levels: Female Male

Notice how the levels are obtained automatically from the vector’s unique values, sorted alphabetically. Let us try another example where we encode male and female as 0 and 1 and attach names to them using labels.

    # Define a vector with 0 for Male and 1 for Female.
    > genvector2 <- c(0,1,1,0,0,1)
    # Assign labels Male and Female to 0 and 1 when creating a factor.
    > genfact2 <- factor(genvector2, levels=c("0","1"), labels=c("Male","Female"))
    > genfact2
    [1] Male   Female Female Male   Male   Female
    Levels: Male Female

Observe that the labels you defined are displayed instead of the 0 and 1 stored in the original vector.
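It can help to look at what a labelled factor stores internally. The sketch below rebuilds the same factor and inspects it; note that the integer codes are positions in the levels vector, not the original 0 and 1 values.

```r
genvector2 <- c(0, 1, 1, 0, 0, 1)
genfact2 <- factor(genvector2, levels = c(0, 1), labels = c("Male", "Female"))

levels(genfact2)         # "Male" "Female"
as.integer(genfact2)     # 1 2 2 1 1 2 (positions in levels, not 0/1)
as.character(genfact2)   # "Male" "Female" "Female" "Male" "Male" "Female"
```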

## Ordering in Factors in R Programming

Let us work through another example that uses the ordering of factor levels. Let us first define a vector representing the month of joining for 8 employees.

    > moj <- c("Jan","Jun","May","Jan","Apr","Dec","Nov","Sep")

Now, R has no way of knowing that May comes before Jun in calendar order: it compares character strings alphabetically. So the following comparison returns FALSE, even though June comes after May.

    > moj[2] > moj[3]
    [1] FALSE
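The result above comes from plain string comparison. As a small sketch, sorting the same vector shows that the months end up in alphabetical rather than calendar order.

```r
moj <- c("Jan", "Jun", "May", "Jan", "Apr", "Dec", "Nov", "Sep")

"Jun" > "May"   # FALSE: "J" sorts before "M"
"Dec" < "Jan"   # TRUE alphabetically, although December is the last month
sort(moj)       # "Apr" "Dec" "Jan" "Jan" "Jun" "May" "Nov" "Sep"
```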

To impose **ordering**, we need to define a vector with all the months in order first.

    > ordermonths <- c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")

Now create a factor for our data using our moj vector, set the levels to ordermonths and set the argument ordered to TRUE.

    > factormoj <- factor(x=moj, levels=ordermonths, ordered=TRUE)

Now factormoj displays as follows.

    > factormoj
    [1] Jan Jun May Jan Apr Dec Nov Sep
    12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec

R now knows the ordering among the months. Let us check whether it knows that May comes before June.

    > factormoj[2] > factormoj[3]
    [1] TRUE
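Beyond single comparisons, an ordered factor also sorts and summarises in level order. The sketch below uses R's built-in `month.abb` constant as a shorthand for the twelve month abbreviations; the `plain` variable is a hypothetical unordered copy, added only to show the contrast.

```r
moj <- c("Jan", "Jun", "May", "Jan", "Apr", "Dec", "Nov", "Sep")

# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec".
factormoj <- factor(moj, levels = month.abb, ordered = TRUE)

is.ordered(factormoj)           # TRUE
as.character(sort(factormoj))   # "Jan" "Jan" "Apr" "May" "Jun" "Sep" "Nov" "Dec"
as.character(max(factormoj))    # "Dec"
factormoj < "Jun"               # TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE

# Without ordered = TRUE, relational operators are not meaningful:
plain <- factor(moj, levels = month.abb)
suppressWarnings(plain[2] > plain[3])   # NA (R would also emit a warning)
```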

## Modifying Factors

Each element of a factor can be assigned a value individually using indexing, just as we index vectors. Let us modify a value in **genfact**, the factor we created earlier in the tutorial.

    > genfact
    [1] Male   Female Female Male   Male   Female
    Levels: Female Male
    > genfact[1]
    [1] Male
    Levels: Female Male
    > genfact[1] <- "Female"
    > genfact
    [1] Female Female Female Male   Male   Female
    Levels: Female Male
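Assignment by index only works for values that are already among the factor's levels. The sketch below uses a separate, hypothetical variable `demo_fact` so that it does not disturb `genfact`, and shows what happens when an unknown value is assigned.

```r
# demo_fact is a hypothetical copy, used so genfact itself stays untouched.
demo_fact <- factor(c("Male", "Female", "Female", "Male", "Male", "Female"))

# "Other" is not among the levels, so R warns and stores NA instead.
demo_fact[2] <- "Other"   # warning: invalid factor level, NA generated

is.na(demo_fact[2])       # TRUE
levels(demo_fact)         # still "Female" "Male"
```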

## Adding New Levels to Factors

To add a new level that has not been defined yet, append it to the factor's levels vector in the following manner. Let’s try this on our existing **genfact** variable.

    > levels(genfact) <- c(levels(genfact), "Other")
    > genfact
    [1] Female Female Female Male   Male   Female
    Levels: Female Male Other

You can now assign the newly defined level “Other” to elements of the factor as well.

    > genfact[3] <- "Other"
    > genfact
    [1] Female Female Other  Male   Male   Female
    Levels: Female Male Other
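As a rough sketch of the same idea on fresh data, the example below appends a level and then compares the result with a factor that declares all three levels up front. The variables `f` and `g` are hypothetical and independent of `genfact`.

```r
# f and g are hypothetical factors, independent of genfact.
f <- factor(c("Female", "Female", "Male"))
levels(f) <- c(levels(f), "Other")   # append: existing values are unchanged

as.character(f)   # "Female" "Female" "Male"
table(f)          # Female 2, Male 1, Other 0

# Declaring every level when the factor is created gives the same levels.
g <- factor(c("Female", "Female", "Male"),
            levels = c("Female", "Male", "Other"))
identical(levels(f), levels(g))   # TRUE
```

Appending to the existing levels, as done here, leaves the current values alone. Assigning a levels vector in a different order would relabel the existing values rather than reorder them, so appending is the safer pattern.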