Exploring Categorical Variables in R


Whether you are a student, analytics engineer, or data researcher, your daily routine will be incomplete without plenty of data. The data you use will sometimes contain categorical variables.

These categorical variables can be nominal (True/False, Female/Male) or ordinal (low, medium, high). In data analysis, exploring them is very important for finding key insights.
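
As a minimal sketch of the difference, R stores both kinds of variable as factors, and an ordinal variable simply carries an explicit level order. The vectors `sex` and `priority` below are made-up illustration values, not part of any dataset used in this article.

```r
# Hypothetical example values, invented for illustration only
sex <- factor(c("Female", "Male", "Female", "Female"))

# Ordinal: the levels have a meaningful order, so we state it explicitly
priority <- factor(
  c("high", "low", "medium", "low"),
  levels = c("low", "medium", "high"),
  ordered = TRUE
)

is.ordered(sex)       # FALSE: nominal, no inherent order
is.ordered(priority)  # TRUE: ordinal

# Comparisons like this are only defined for ordered factors
priority < "high"
```

Declaring the order once means later steps, such as sorting or tabulating, list the categories as low, medium, high rather than alphabetically.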

In this article, we will explore categorical variables using the R language, and we will do it without visualizing them. Let’s see how it works.

Reading The Data

First things first. Before any analysis, we need data that contains categorical variables.

I am using the built-in ‘iris’ dataset for this purpose. There are a couple of reasons for it.

  • Iris is a small dataset with few complications.
  • It has a categorical variable in it (Species).
  • Simple structure with 4 numeric attributes plus the species label.
#Reading data into R
df <- datasets::iris

#displays the first rows of the data
head(df)

Since the iris dataset is available by default in R, you can load it with minimal effort.
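
Before counting anything, it can help to confirm which columns R actually treats as categorical. The following is a small sketch using only base R; the variable name `is_categorical` is my own choice for illustration.

```r
df <- datasets::iris

# Compact overview: one line per column with its type
str(df)

# Which columns are stored as factors (R's type for categorical data)?
is_categorical <- sapply(df, is.factor)
names(df)[is_categorical]

# The categories of the Species column
levels(df$Species)
```

For iris, only Species is a factor; the other four columns are numeric measurements.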


We got the data and it’s ready for exploring!

Our approach will be a bit different. We won’t be using the summary() function, which gives basic information about numerical data.

Instead, I recommend using the table() function in R. It reports the count of every distinct value, which often tells you more about a categorical variable than summary() does, and it is used the same way on larger datasets.

Exploring categorical variables in R – The simple way

Well, we have everything ready now to proceed further and explore the data. Let’s make use of the table() function to get some key insights and exciting numbers from the input data, i.e. iris.

#Returns the number of occurrences of each sepal length
table(df$Sepal.Length)

Output 1:

4.3  4.4  4.5  4.6  4.7  4.8  4.9  5   5.1  5.2
 1    3    1    4    2    5    6   10   9    4 

#Returns the number of occurrences of each petal width
table(df$Petal.Width)

Output 2:

0.1  0.2  0.3  0.4  0.5  0.6  1  1.1  1.2  1.3
 5    29   7    7    1    1   7   3    5    13

#Returns the number of occurrences of each species
table(df$Species)

Output 3:

      setosa   versicolor   virginica 
        50         50         50 

As you can observe, the table() function counts how many times each distinct value occurs in a variable. For a categorical variable, that is the number of observations falling under each category.

In the first output, you can see that a sepal length of 5 has 10 occurrences, and in the second output, a petal width of 0.2 has 29 occurrences. We have a total of 150 observations in our dataset, so these two account for about 6.7% and 19.3% respectively.
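
The percentage arithmetic can be checked directly in R. This is a minimal sketch; the variable names are my own and simply hold the intermediate values.

```r
df <- datasets::iris
n <- nrow(df)  # 150 observations

length_counts <- table(df$Sepal.Length)
width_counts  <- table(df$Petal.Width)

# Table entries are named by value, so index them with character strings
pct_length_5 <- length_counts[["5"]] / n * 100
pct_width_02 <- width_counts[["0.2"]] / n * 100

round(pct_length_5, 1)  # 6.7
round(pct_width_02, 1)  # 19.3
```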

In the last output, we meet the categorical variable, Species. There are 3 categories with 50 observations each, so each one makes up 33.33% of the data. We will look at these proportions more closely in the sections below.

Kindly note that I have shown only the first 10 distinct values of each numeric variable to illustrate the idea. The full tables contain more values than that. You can explore all of them using this method.
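
To look beyond the first few values, one option is to count the distinct values and sort the table by frequency. This sketch uses base R only, and `length_counts` is just an illustrative name.

```r
df <- datasets::iris

length_counts <- table(df$Sepal.Length)

# How many distinct sepal lengths occur in total?
length(length_counts)

# The five most frequent values, most common first
head(sort(length_counts, decreasing = TRUE), 5)

# Every observation is counted exactly once
sum(length_counts) == nrow(df)
```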

Calculate table proportions in R

R is well suited to data analysis, and it can calculate table proportions directly with the help of the prop.table() function.

Let’s see how it works.

Now, we will calculate the proportions of the categories present in our data, express them as percentages, and then round the decimal values using R.

#Assigning the data
props_data <- table(iris$Species)

#Calculates the proportions of categories 
props_data <- prop.table(props_data) * 100

#displays the proportions
props_data
   setosa   versicolor  virginica 
  33.33333   33.33333   33.33333 
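
For a one-dimensional table like this one, prop.table() divides each count by the total of all counts. The sketch below checks that against a by-hand calculation; `by_function` and `by_hand` are illustrative names.

```r
counts <- table(iris$Species)

# prop.table() on a one-way table: each count divided by the grand total
by_function <- prop.table(counts)
by_hand     <- counts / sum(counts)

all.equal(by_function, by_hand)  # same result either way
sum(by_function)                 # proportions add up to 1
```

Multiplying the result by 100, as in the snippet above, turns the proportions into percentages.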

Fantastic! We got the percentage of values that fall under each category. But long decimal values like these make the output messy and harder to read.

“Your model will be as Good as Your Data”

- Anonymous ML engineer

In the same spirit, the numbers you report should be clean and easy to read. You can use the round() function in R to reduce the values to the number of decimal places you need. Here we go:

#rounds off the values to one decimal place
round(props_data, digits = 1)
     setosa   versicolor  virginica 
      33.3       33.3       33.3 

Wow!!! Now it’s looking better. Keep in mind that rounding only changes how the numbers are presented. If you use the unrounded values for any further calculation, your accuracy won’t be affected.
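
One way to keep the two concerns apart is to store the rounded values in a separate object that is used only for display. This is a sketch under that assumption; `pct` and `pct_display` are illustrative names.

```r
pct <- prop.table(table(iris$Species)) * 100

# Rounded copy for printing; the full-precision values stay in `pct`
pct_display <- round(pct, digits = 1)
pct_display

# Rounding can make the parts stop adding up to exactly 100
sum(pct)          # 100
sum(pct_display)  # 99.9
```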

Wrapping Up

There are functions in R such as summary() to explore variables. But it’s good that we also have functions like table() and prop.table() for exploring categorical variables using R. I hope you are happy to know these methods for your analysis.

The table() function in R has many applications. It returns key insights about the data and it’s invaluable in your analysis. That’s great, right?
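
If you find yourself repeating these steps, they can be wrapped in a small helper. The function `describe_categories()` below is hypothetical, written for this article as a sketch, and is not part of base R or any package.

```r
# Hypothetical helper: counts and percentages for one categorical vector
describe_categories <- function(x, digits = 1) {
  counts <- table(x, useNA = "ifany")  # also report missing values, if present
  data.frame(
    category = names(counts),
    count    = as.vector(counts),
    percent  = round(as.vector(prop.table(counts)) * 100, digits)
  )
}

describe_categories(iris$Species)
```

Returning a data frame rather than a printed table makes the result easy to sort, filter, or export later.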

I hope you got some methods to deal with categorical variables. That’s a good place to close. See you in the next article. Take care, Happy R!!!

Further reading: R documentation
