R aggregate() function – Important things to know!


Hello, readers! In this article, we will take a detailed look at an important built-in function in R programming: the aggregate() function.

So, let us begin!! 🙂


Functioning of aggregate() function in R

Analysis of data is a crucial step prior to modelling of data in the domain of data science and machine learning.

R provides a built-in function for this kind of analysis. The aggregate() function splits the data into groups and computes a summary statistic for each group in a single call.

With the aggregate() function, we can summarize the data using statistical measures such as the mean, the sum, etc.

Have a look at the below syntax!

aggregate(x = data, by = list(data$column_name), FUN = mean)
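To make the syntax concrete, here is a minimal sketch on a small in-memory data frame. The data frame `scores` and its columns `team` and `points` are made up for illustration and are not part of the dataset used later in this article.

```r
# Hypothetical data: two teams with three scores each
scores <- data.frame(
  team   = c("A", "A", "A", "B", "B", "B"),
  points = c(10, 20, 30, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Mean of 'points' within each 'team'.
# Only the numeric column is passed as x; the grouping column goes into 'by'.
result <- aggregate(x = scores["points"], by = list(scores$team), FUN = mean)
print(result)
```

The result has one row per group. Because the list given to `by` is unnamed, R labels the grouping column `Group.1`.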

Now that we understand what the aggregate() function does, let us apply it in a few different scenarios in the next sections!


R aggregate() function with mean

To begin with, we will aggregate the data values using the mean. That is, the aggregate() function computes the mean of each column passed to it, separately for every group.

We would be using the below dataset in this example.

Dataset

In the below example, we have passed ‘FUN = mean’ to the aggregate() function, i.e. the mean is used as the function for aggregating the data values.

Moreover, we have grouped the output by the ‘Gender’ column of the dataset, using ‘by = list(dta$Gender)’ as the argument.

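# Clear the workspace
rm(list = ls())

# Set the working directory (adjust the path to where Data.csv is stored)
setwd("D:/")
getwd()

# Load the dataset
dta <- read.csv("Data.csv", header = TRUE)

# Mean of every column, grouped by Gender
aggregate(dta, by = list(dta$Gender), FUN = mean)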

Output:

R Aggregate With Mean

Looking at the output, one thing that stands out is that the categorical variables have been replaced by NA.

This happens because mean() is defined only for numeric (and logical) data. When aggregate() applies it to a factor or character column, mean() returns NA with a warning, so those columns show NA in the output. Note that NA marks a missing value; it is not the same as NULL.
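The behaviour is easy to reproduce on a tiny made-up data frame. In the sketch below, `people`, `Gender`, `City` and `Age` are hypothetical names chosen for illustration. The point is that mean() itself returns NA for a character column, and aggregate() simply reports that value.

```r
# Hypothetical data with one character column (City) and one numeric column (Age)
people <- data.frame(
  Gender = c("F", "F", "M", "M"),
  City   = c("Pune", "Delhi", "Pune", "Delhi"),
  Age    = c(30, 40, 20, 30),
  stringsAsFactors = FALSE
)

# mean() returns NA for non-numeric input.
# suppressWarnings() only keeps the console quiet; without it R prints a warning.
city_mean <- suppressWarnings(mean(people$City))
print(city_mean)

# aggregate() applies mean() to every column of x within each Gender group,
# so City comes back as NA while Age is averaged normally
by_gender <- suppressWarnings(
  aggregate(x = people[c("City", "Age")], by = list(people$Gender), FUN = mean)
)
print(by_gender)
```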

To avoid these NA values in the output, we just have to exclude the non-numeric columns from the data passed to the function.

Don’t worry, we will learn about it in the upcoming section!


R aggregate() function with sum

Now, let us use the aggregate() function to compute the sum of the data values in each column.

By using ‘FUN = sum’, we let the function add up the values of each numeric column within every group defined by the grouping column.

Have a look at the below dataset!

Dataset 1

In the below example, we have loaded the dataset into the environment using the read.csv() function. Further, we have calculated the sum of the numeric columns for each group of the ‘Designation’ column.

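As discussed in the last section, non-numeric columns cannot be aggregated meaningfully. With ‘FUN = sum’ the situation is stricter than with the mean: sum() stops with an error when it is given a factor or character column. To avoid this, we use the below line of code as a parameter to exclude the factor column from the dataset: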

x = data[ , colnames(data) != "factor-col-name"]

By this, the excluded column is not processed by the aggregate() function.

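# Clear the workspace
rm(list = ls())

# Set the working directory (adjust the path to where Data.csv is stored)
setwd("D:/")
getwd()

# Load the dataset
dta <- read.csv("Data.csv", header = TRUE)

# Sum of every column except Designation, grouped by Designation
aggregate(x = dta[ , colnames(dta) != "Designation"],
          by = list(dta$Designation),
          FUN = sum)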

Output:

  Group.1 A B C
1  Client 5 7 8
2    Cook 6 7 8
3 Manager 3 4 5
4 Teacher 0 0 0
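The same column-exclusion idea can be tried without a CSV file. The sketch below uses a made-up data frame `staff` (its values are illustrative and are not the contents of Data.csv). It also shows one possible generalisation, offered here as an assumption rather than as the article's method: keeping only the numeric columns with sapply() and is.numeric() instead of naming the column to drop.

```r
# Hypothetical data: one grouping column and two numeric columns
staff <- data.frame(
  Designation = c("Cook", "Cook", "Manager", "Manager"),
  A = c(1, 2, 3, 4),
  B = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Option 1: drop the grouping column by name
by_name <- aggregate(x = staff[, colnames(staff) != "Designation"],
                     by = list(staff$Designation),
                     FUN = sum)

# Option 2: keep every numeric column, whatever it is called.
# Naming the element of 'by' replaces the default 'Group.1' header.
numeric_cols <- sapply(staff, is.numeric)
by_type <- aggregate(x = staff[, numeric_cols, drop = FALSE],
                     by = list(Designation = staff$Designation),
                     FUN = sum)

print(by_name)
print(by_type)
```

Option 2 may be handy when a dataset has several character columns, since none of them needs to be listed by hand.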

Handling missing values with the aggregate() function

So, what if the numeric data columns contain missing values? Treating the missing values in a separate step would cost us extra effort. Fortunately, the aggregate() function forwards any additional arguments to the function given in FUN, so we can pass the ‘na.rm’ argument of sum() through it.

By setting ‘na.rm=TRUE’, we tell sum() to drop the NA values in each group before adding up the remaining values.

Dataset:

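Dataset Missing Values

# Clear the workspace
rm(list = ls())

# Set the working directory (adjust the path to where Data.csv is stored)
setwd("D:/")
getwd()

# Load the dataset
dta <- read.csv("Data.csv", header = TRUE)

# Sum of every column except Designation, grouped by Designation;
# na.rm = TRUE is passed on to sum() so that NA values are ignored
aggregate(x = dta[ , colnames(dta) != "Designation"],
          by = list(dta$Designation),
          FUN = sum,
          na.rm = TRUE)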

Output:

  Group.1    A B C
1  Client    1 7 8
2    Cook    6 0 8
3 Manager    3 4 5
4 Teacher    0 0 0
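A small made-up example shows the effect of na.rm. The data frame `orders` and its columns are hypothetical. The sketch compares the grouped sum with and without na.rm = TRUE, which aggregate() hands on to sum().

```r
# Hypothetical data with one missing value in the "Client" group
orders <- data.frame(
  Designation = c("Client", "Client", "Cook", "Cook"),
  A = c(1, NA, 2, 4),
  stringsAsFactors = FALSE
)

# Without na.rm, a single NA makes the whole group sum NA
with_na <- aggregate(x = orders["A"], by = list(orders$Designation), FUN = sum)

# na.rm = TRUE is not consumed by aggregate() itself; it is passed through to sum()
without_na <- aggregate(x = orders["A"], by = list(orders$Designation),
                        FUN = sum, na.rm = TRUE)

print(with_na)
print(without_na)
```

Keep in mind that a group containing only NA values sums to 0 under na.rm = TRUE, because the sum of an empty set of numbers is 0.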

Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you have any questions. For more such posts related to R, stay tuned with us! Till then, Happy Learning!! 🙂

