In this tutorial, we’ll move on to understanding factors in R programming. One operation we perform frequently in data science is estimating the value of a variable from a model we have built. Sometimes we need to estimate the price of a share or a house, and sometimes we need to estimate which color of car is likely to sell the fastest.

Variables in data science fall under two categories: **continuous** and **categorical**. Continuous variables are those that can take numerical values, including floating-point numbers. Prices of houses or shares and quantifiable attributes such as the age, weight or height of a person are all continuous variables.

On the other hand, categorical variables take one of a fixed set of values that can be represented using a set of labels. Examples of this category are marital status, gender, the color of a vehicle, the highest educational degree of a person and so on.

Categorical variables are represented using **factors** in R.


## Creating Factors in R

Factors can be created using the `factor()` function.

```
factor(x=vector, levels, labels, ordered=TRUE/FALSE)
```

The first argument to the factor function is the **vector** x of values that you wish to factorize. Note that a factor is always one-dimensional: if you pass a matrix, its dimensions are dropped. In practice, x is usually a vector of character strings or integer values.

Second, you can supply the **levels** you want in the factor. **Levels** is a vector of the unique values the factor may take. This is an optional argument; if you omit it, R uses the sorted unique values of x.

The third argument is **labels**. When the values are encoded as a vector of integers, the labels specify which name each integer stands for. For example, if you use 0 and 1 to represent male and female, you pass the labels "Male" and "Female" in the same order as the levels. The labels act as the key for looking up what each code means.

Finally, you have a logical argument, **ordered**. Sometimes you may wish to retain the order among the levels of a factor. For example, you may encode the month of joining using the integers 1 to 12 to represent the months from January to December. In these cases, you need to set **ordered** to TRUE.
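As a minimal sketch of how these arguments fit together, the snippet below encodes some survey ratings stored as integers. The `ratings` vector and the label names are hypothetical and chosen only for illustration.

```r
# Hypothetical data: survey ratings coded as the integers 1 to 3
ratings <- c(2, 3, 1, 2, 3)

# levels lists the codes that may appear in x, labels names each code,
# and ordered = TRUE records that Low < Medium < High
ratingfact <- factor(x = ratings,
                     levels = c(1, 2, 3),
                     labels = c("Low", "Medium", "High"),
                     ordered = TRUE)

print(ratingfact)
print(is.ordered(ratingfact))
```

The labels are matched to the levels by position, so the first label names the first level, and so on.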

Let us look at examples of factors now.

```
#Encode the genders of people into a vector first
#These might be extracted from a dataset usually.
> genvector <- c("Male","Female","Female","Male","Male","Female")
#Create a factor from this vector
> genfact <- factor(genvector)
> genfact
[1] Male Female Female Male Male Female
Levels: Female Male
```

Notice how the levels are automatically obtained from the vector’s unique values here. Let us try another example where we define male and female as 0 and 1 using labels.

```
#Define a vector with 0 for Male and 1 for Female.
> genvector2 <- c(0,1,1,0,0,1)
#Assign labels Male and Female to 0 and 1 when creating a Factor.
> genfact2 <-factor(genvector2,levels=c("0","1"),labels=c("Male","Female"))
> genfact2
[1] Male Female Female Male Male Female
Levels: Male Female
```

Observe that the labels you have defined are displayed instead of the 0 and 1 values stored in the original vector.
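To see what R stores for a factor, you can inspect its levels and the integer codes behind them. The following is a small sketch using the base R functions `levels()`, `nlevels()`, `as.integer()` and `table()`, which this tutorial has not introduced so far.

```r
genvector <- c("Male", "Female", "Female", "Male", "Male", "Female")
genfact <- factor(genvector)

# The distinct categories, sorted alphabetically by default
print(levels(genfact))      # "Female" "Male"

# Number of levels
print(nlevels(genfact))     # 2

# Integer codes: each value is the position of its level
print(as.integer(genfact))  # 2 1 1 2 2 1

# Frequency of each category
print(table(genfact))
```

Each element of the factor is stored as the position of its level, which is why "Male" maps to 2 and "Female" to 1 here.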

## Ordering in Factors in R Programming

Let us work through another example that uses the ordering of factor levels. Let us first define a vector representing the month of joining for 8 employees.

```
> moj <- c("Jan","Jun","May","Jan","Apr","Dec","Nov","Sep")
```

Now, R has no way of knowing that May comes before Jun in the order of months, because it compares character strings alphabetically. So the following comparison returns FALSE.

```
> moj[2]>moj[3]
[1] FALSE
```

To impose **ordering**, we need to define a vector with all the months in order first.

```
> ordermonths <-c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
```

Now create a factor for our data using our moj vector, set the levels to ordermonths and set the argument ordered to TRUE.

```
> factormoj <- factor(x=moj, levels=ordermonths, ordered=TRUE)
```

Now factormoj displays as follows.

```
> factormoj
[1] Jan Jun May Jan Apr Dec Nov Sep
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < Oct < ... < Dec
```

R now knows the ordering among the months. Let us check if it knows that May comes before June.

```
> factormoj[2]>factormoj[3]
[1] TRUE
```

## Modifying Factors

Each element of a factor can be assigned a value individually using indexing, just like we index vectors. Let us modify a value in the genfact factor we created earlier in the tutorial.

We’ll continue with the same variable from before, **genfact** to make things easier for you.

```
> genfact
[1] Male Female Female Male Male Female
Levels: Female Male
> genfact[1]
[1] Male
Levels: Female Male
> genfact[1]<-"Female"
> genfact
[1] Female Female Female Male Male Female
Levels: Female Male
```
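Index assignment only accepts values that are already among the factor's levels. The sketch below uses a separate, hypothetical variable `genfact_demo` so that `genfact` itself stays unchanged, and shows what happens when an undefined value is assigned.

```r
genfact_demo <- factor(c("Male", "Female", "Female", "Male", "Male", "Female"))

# "Other" is not one of the levels, so R issues a warning
# ("invalid factor level, NA generated") and stores NA instead
genfact_demo[2] <- "Other"

print(genfact_demo)
print(is.na(genfact_demo[2]))  # TRUE
print(levels(genfact_demo))    # the levels are unchanged
```

To store a value like this, the level has to be added to the factor first.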

## Adding New Levels to Factors

To add a level that has not been defined yet, you just need to extend the factor's levels vector in the following manner. Let’s try this on our existing **genfact** variable.

```
> levels(genfact) <- c(levels(genfact),"Other")
> genfact
[1] Female Female Female Male Male Female
Levels: Female Male Other
```

You can now assign the newly defined level “Other” to elements of the factor as well.

```
> genfact[3] <- "Other"
> genfact
[1] Female Female Other Male Male Female
Levels: Female Male Other
```
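If you already know all the categories in advance, an alternative is to declare them when the factor is created, so that no later change to the levels is needed. This is a minimal sketch; the variable name `genfact3` is made up for illustration.

```r
genvector <- c("Male", "Female", "Female", "Male", "Male", "Female")

# Declare "Other" up front even though it does not occur in the data yet
genfact3 <- factor(genvector, levels = c("Female", "Male", "Other"))

print(levels(genfact3))  # "Female" "Male" "Other"
print(table(genfact3))   # "Other" is counted as 0

# Assigning the pre-declared level now works without a warning
genfact3[3] <- "Other"
print(genfact3)
```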