# Exploring Categorical Variables in R

Whether you are a student, analytics engineer, or data researcher, your daily routine involves plenty of data. That data sometimes contains categorical variables.

Categorical variables can be nominal (True/False, Female/Male) or ordinal (low, medium, high). Exploring them is an important part of data analysis and often surfaces key insights.
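
In R, both kinds are usually stored as factors, and an ordinal variable is a factor whose levels are ordered. Here is a minimal sketch with made-up values; the vectors `sex` and `rating` are hypothetical and do not come from any dataset used in this article.

```r
# Hypothetical example values, only for illustration
sex <- factor(c("Female", "Male", "Female"))        # nominal: no natural order

rating <- factor(c("high", "low", "medium", "low"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)                     # ordinal: low < medium < high

is.ordered(sex)      # FALSE
is.ordered(rating)   # TRUE

# Counts are listed in level order, not alphabetical order
table(rating)
```

Setting the levels explicitly matters for ordinal data, because R would otherwise sort them alphabetically (high, low, medium).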

In this article, we will explore categorical variables using the R language, without visualizing them. Let’s see how it works.

First things first. Before any analysis, we need data that contains categorical variables.

I am using the built-in ‘iris’ dataset for this purpose, for a few reasons:

• Iris is a small dataset with few complications.
• It has a categorical variable in it (Species).
• It has a simple structure: four numeric measurements plus the Species column.

```
# Reading data into R
df <- datasets::iris

# Displays the data
df
```

Because the iris dataset ships with R, you can load it with minimal effort.
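
Before counting anything, it helps to check which columns are actually categorical. The snippet below is a small sketch of one way to do that; the helper variable `is_categorical` is my own name, not something built into R.

```r
df <- datasets::iris

# Compact overview: column types and the first few values
str(df)

# TRUE for columns stored as factors, i.e. the categorical ones
is_categorical <- sapply(df, is.factor)
names(df)[is_categorical]

# The categories (levels) of the Species column
levels(df$Species)
```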

We got the data and it’s ready for exploring!

Our approach will be a bit different. We won’t be using the summary() function, which mainly reports basic statistics for numerical data.

Instead, I recommend the table() function in R, which counts how often each distinct value occurs and works the same way on any column.
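
To see why table() is handy, here is a small sketch that compares the two functions on a character vector. The variable `species_chr` is a hypothetical name introduced only for this illustration.

```r
# Species stored as plain text instead of a factor
species_chr <- as.character(datasets::iris$Species)

# summary() only reports length, class and mode for character data
summary(species_chr)

# table() still counts the observations in each category
table(species_chr)
```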

## Exploring categorical variables in R – The simple way

Everything is ready, so let’s proceed and explore the data. We will use the table() function to get some key numbers from the input data, i.e. iris.

```
# Returns the number of occurrences of each distinct value
table(iris$Sepal.Length)
```

Output 1:

```
4.3  4.4  4.5  4.6  4.7  4.8  4.9    5  5.1  5.2
  1    3    1    4    2    5    6   10    9    4
```

```
# Returns the number of occurrences of each distinct value
table(iris$Petal.Width)
```

Output 2:

```
0.1  0.2  0.3  0.4  0.5  0.6    1  1.1  1.2  1.3
  5   29    7    7    1    1    7    3    5   13
```

```
# Returns the number of occurrences of each category
table(iris$Species)
```

Output 3:

```
    setosa versicolor  virginica
        50         50         50
```

As you can observe, the table() function finds each distinct value in the variable and returns the number of observations that fall under it.

In the first output, you can see that the sepal length of 5 has 10 occurrences, and in the second output, the petal width of 0.2 has 29 occurrences. We have a total of 150 observations in our dataset, so these two values account for 6.7% and 19.3% of the observations respectively.
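
If you want to check this arithmetic yourself, the sketch below pulls single counts out of the tables by name and divides them by the number of rows. The names `sepal_counts`, `petal_counts` and `n` are my own choices.

```r
iris <- datasets::iris
n <- nrow(iris)                               # 150 observations

sepal_counts <- table(iris$Sepal.Length)
petal_counts <- table(iris$Petal.Width)

# Index by name (a character string), not by position
round(sepal_counts[["5"]]   / n * 100, 1)     # 10 / 150 -> 6.7
round(petal_counts[["0.2"]] / n * 100, 1)     # 29 / 150 -> 19.3
```

Note that the table names are strings, so `"5"` and `"0.2"` are quoted here.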

In the last output, we have encountered the categorical variable. There are 3 categories with 50 observations each, i.e. 33.333% each. We will explore this further in the sections below.

Kindly note that I have shown only the first 10 distinct values of each numeric column to illustrate the idea. The full output contains more values. You can explore all of them using this method.
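
For the longer tables, a few helper calls make the output easier to scan. This is only a sketch of one possible workflow, using the hypothetical variable name `sepal_counts`.

```r
iris <- datasets::iris
sepal_counts <- table(iris$Sepal.Length)

# How many distinct sepal lengths are there in total?
length(sepal_counts)

# The first ten distinct values, as shown in Output 1
head(sepal_counts, 10)

# The most frequent values first
head(sort(sepal_counts, decreasing = TRUE))
```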

## Calculate table proportions in R

R can also calculate table proportions directly with the help of the prop.table() function.

Let’s see how it works.

Now, we will calculate the proportions of the categories present in our data, express them as percentages, and then round the decimal values for a cleaner display.

```
# Counts the observations in each category
props_data <- table(iris$Species)

# Converts the counts to proportions, then to percentages
props_data <- prop.table(props_data) * 100

# Displays the percentages
props_data
```

```
    setosa versicolor  virginica
  33.33333   33.33333   33.33333
```
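
A quick way to see what prop.table() does is to compare it with dividing the counts by their total. This is only a sketch; the variable names `counts` and `manual` are my own.

```r
counts <- table(datasets::iris$Species)

# prop.table() divides each count by the total number of observations
manual <- counts / sum(counts)
all.equal(as.vector(prop.table(counts)), as.vector(manual))  # TRUE

# The proportions add up to 1, so the percentages add up to 100
sum(prop.table(counts) * 100)
```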

Fantastic! We got the percentage of values that fall under each category. But long decimal values like these make the output messy and harder to read.

To avoid this, you can use the round() function in R to reduce the number of decimal places. Here we go:

```
# Rounds the percentages to one decimal place
round(props_data, 1)
```

```
    setosa versicolor  virginica
      33.3       33.3       33.3
```

Wow!!! Now it’s looking better. Keep in mind that rounding is for presentation: keep the unrounded values for any further calculation so that your accuracy won’t get affected.
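
Here is a small sketch of why it is worth keeping both versions around. The names `pct` and `pct_display` are hypothetical; the point is only that the rounded values no longer add up to exactly 100.

```r
pct <- prop.table(table(datasets::iris$Species)) * 100

# Rounded copy, meant for printing and reports
pct_display <- round(pct, 1)

sum(pct)           # full precision: 100
sum(pct_display)   # rounded values: 99.9
```

So I would round at the very end, just before showing the numbers to someone.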

## Wrapping Up

R has functions such as summary() to explore variables, but it’s good that we also have functions like table() and prop.table() for exploring categorical variables. I hope you will find these methods useful in your analysis.

The table() function in R has many more applications. It returns key insights about the data and is invaluable in your analysis. That’s great, right?
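
As one example of those wider applications, table() also accepts two vectors and returns a cross-tabulation. Treat this as a minimal sketch: the 1 cm cut-off and the name `wide_petal` are arbitrary choices I made for illustration.

```r
iris <- datasets::iris

# Hypothetical grouping: is the petal wider than 1 cm?
wide_petal <- iris$Petal.Width > 1

# Rows are species, columns are FALSE/TRUE for the condition
cross_tab <- table(iris$Species, wide_petal)
cross_tab

# Row-wise proportions: each species row sums to 1
prop.table(cross_tab, margin = 1)
```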

I hope you picked up some methods to deal with categorical variables. That’s a good place to close. See you in the next article. Take care, Happy R!!!