Factors in R


In this tutorial, we’ll move on to understanding factors in R programming. One operation we perform frequently in data science is estimating the value of a variable from a model we have built. Sometimes we need to estimate the price of a share or a house, and sometimes we need to estimate which car color is likely to sell the fastest.

Variables in data science fall under two categories – continuous and categorical. Continuous variables are those that can take numerical values including floating points. Prices of houses or shares, quantifiable variables like age, weight or height of a person are all continuous variables.

On the other hand, categorical variables take one of a fixed set of values that can be represented using a set of labels. Examples of this category are marital status, gender, the color of a vehicle, the highest educational degree of a person and so on.

Categorical variables are represented using factors in R.

Creating Factors in R

Factors can be created using the factor() function.

factor(x=vector, levels, labels, ordered=TRUE/FALSE)

The first argument to the factor() function is the vector x of values that you wish to factorize. x should be a one-dimensional vector, typically of character strings or integer values. A factor does not keep matrix structure: if you pass a matrix, its dimensions are dropped and the values are treated as a plain vector.

Secondly, you can supply the levels you need in the factor. levels is a vector of the unique values the factor may take. This is an optional argument. If you omit it, R uses the sorted unique values of x.

The third argument is labels. When you encode a variable as a vector of integers, you need to specify which integer represents which label. You could use 0 and 1 to represent male and female, but you need to state that mapping through labels. The labels are matched to the levels by position, so they act as the key for looking up what each value means.

Finally, there is the Boolean argument ordered. Sometimes the levels have a natural order that you wish to retain. For example, you may encode the month of joining using the integers 1 to 12 to represent the months from January to December. In such cases, you need to set ordered to TRUE.
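As a minimal sketch of how levels and ordered interact, the snippet below builds an ordered factor from made-up clothing sizes. The names sizes, size_levels and size_fact are illustrative assumptions, not part of the tutorial’s data. Note that a value missing from levels is stored as NA.

```r
# Illustrative data; "huge" is deliberately absent from the levels below.
sizes <- c("small", "large", "medium", "small", "huge")
size_levels <- c("small", "medium", "large")

# Values that are not listed in levels become NA in the resulting factor.
size_fact <- factor(x = sizes, levels = size_levels, ordered = TRUE)

print(size_fact)
print(is.na(size_fact))       # TRUE only for the element that was "huge"
print(is.ordered(size_fact))  # TRUE because ordered = TRUE was supplied
```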

Let us look at examples of factors now.

#Encode the genders of people into a vector first
#These might be extracted from a dataset usually.
> genvector <- c("Male","Female","Female","Male","Male","Female")

#Create a factor from this vector
> genfact <- factor(genvector)
> genfact
[1] Male   Female Female Male   Male   Female
Levels: Female Male

Notice how the levels are automatically obtained from the vector’s unique values here. They are sorted alphabetically, which is why Female is listed before Male. Let us try another example where we define male and female as 0 and 1 using labels.

#Define a vector with 0 for Male and 1 for Female.
> genvector2 <- c(0,1,1,0,0,1)
#Assign labels Male and Female to 0 and 1 when creating a Factor.
> genfact2 <-factor(genvector2,levels=c("0","1"),labels=c("Male","Female"))
> genfact2
[1] Male   Female Female Male   Male   Female
Levels: Male Female

Observe that the labels you have defined are displayed instead of the 0 and 1 stored in the original vector.
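To see how the labels relate to what R stores internally, here is a small sketch that rebuilds genfact2 and inspects it. The helper names codes and counts are assumptions added for illustration. The internal integer codes start at 1 for the first level, so they are not the original 0 and 1.

```r
# Rebuild the factor from the example above.
genvector2 <- c(0, 1, 1, 0, 0, 1)
genfact2 <- factor(genvector2, levels = c(0, 1), labels = c("Male", "Female"))

# The labels become the levels, in the order they were supplied.
print(levels(genfact2))

# Internal codes are positions in the levels vector: 1 = Male, 2 = Female.
codes <- as.integer(genfact2)
print(codes)

# Count how many observations fall under each level.
counts <- table(genfact2)
print(counts)
```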

Ordering in Factors in R Programming

Let us work through another example using the ordering of factor levels. Let us first define a vector representing the month of joining for 8 employees.

> moj <- c("Jan","Jun","May","Jan","Apr","Dec","Nov","Sep")

Now, R has no way of knowing that May comes before Jun in the order of months. Character strings are compared alphabetically, so the following comparison returns FALSE.

> moj[2]>moj[3]
[1] FALSE
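The same alphabetical behaviour shows up when sorting. The sketch below, in which the name sorted_moj is an assumption for illustration, sorts the plain character vector. In a typical English locale, the months come back in dictionary order rather than calendar order.

```r
moj <- c("Jan", "Jun", "May", "Jan", "Apr", "Dec", "Nov", "Sep")

# Sorting plain strings follows alphabetical order, not calendar order,
# so Apr and Dec end up ahead of Jan.
sorted_moj <- sort(moj)
print(sorted_moj)
```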

To impose ordering, we need to define a vector with all the months in order first.

> ordermonths <-c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")

Now create a factor for our data using our moj vector, set the levels to ordermonths and set the argument ordered to TRUE.

> factormoj <- factor(x=moj, levels=ordermonths, ordered=TRUE)

Now factormoj displays as follows.

> factormoj
[1] Jan Jun May Jan Apr Dec Nov Sep
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec

R now knows the ordering among the months. Let us check that it treats Jun as coming after May.

> factormoj[2]>factormoj[3]
[1] TRUE
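Once the factor is ordered, other order-aware operations follow the calendar as well. The following sketch rebuilds factormoj and tries a few of them. The names sorted_months, earliest, latest and before_jun are assumptions introduced for this example.

```r
moj <- c("Jan", "Jun", "May", "Jan", "Apr", "Dec", "Nov", "Sep")
ordermonths <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                 "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
factormoj <- factor(x = moj, levels = ordermonths, ordered = TRUE)

# Sorting now follows the order of the levels, i.e. calendar order.
sorted_months <- sort(factormoj)
print(sorted_months)

# min() and max() are defined for ordered factors.
earliest <- min(factormoj)
latest <- max(factormoj)
print(earliest)
print(latest)

# An ordered factor can be compared against the name of one of its levels.
before_jun <- factormoj < "Jun"
print(before_jun)
```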

Modifying Factors

Each element of a factor can be assigned a value individually using indexing, just like we index vectors. Let us modify a value in the genfact factor we created earlier in the tutorial.

We’ll continue with the same variable from before, genfact to make things easier for you.

> genfact
[1] Male   Female Female Male   Male   Female
Levels: Female Male
> genfact[1]
[1] Male
Levels: Female Male
> genfact[1]<-"Female"
> genfact
[1] Female Female Female Male   Male   Female
Levels: Female Male
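Indexed assignment only accepts values that are already among the factor’s levels. As a small sketch, the snippet below works on a copy named test_fact (an assumed name, used so that genfact stays untouched) and assigns a value that is not a level. R stores NA in that position and would normally print an "invalid factor level" warning, which is suppressed here to keep the output short.

```r
genfact <- factor(c("Male", "Female", "Female", "Male", "Male", "Female"))

# Work on a copy so the original factor is not changed.
test_fact <- genfact

# "Other" is not one of the levels, so the element becomes NA.
# suppressWarnings() hides the warning R would otherwise print.
suppressWarnings(test_fact[1] <- "Other")

print(test_fact)
print(levels(test_fact))  # the levels themselves are unchanged
```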

Adding New Levels to Factors

To add a new level to a factor, which hasn’t been defined earlier, you just need to modify the levels vector in the following manner. Let’s try this on our existing genfact variable.

> levels(genfact) <- c(levels(genfact),"Other")
> genfact
[1] Female Female Female Male   Male   Female
Levels: Female Male Other

You can now assign the newly defined level “Other” to elements of the factor as well.

> genfact[3] <- "Other"
> genfact
[1] Female Female Other  Male   Male   Female
Levels: Female Male Other
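One way to confirm the effect of adding a level is to tabulate the factor before and after the assignment. This is a sketch under the assumption that genfact holds the values shown above. The names counts_before and counts_after are illustrative.

```r
# Start from the state of genfact after its first element was changed.
genfact <- factor(c("Female", "Female", "Female", "Male", "Male", "Female"))

# Add the new level; no element uses it yet, so its count is zero.
levels(genfact) <- c(levels(genfact), "Other")
counts_before <- table(genfact)
print(counts_before)

# After the assignment, one observation moves from Female to Other.
genfact[3] <- "Other"
counts_after <- table(genfact)
print(counts_after)
```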
