R group_by() function – Practical Guide!


Hello, readers! In this article, we will be focusing on the R group_by() function in detail.

So, let us begin!!


Usage of R group_by() function

While dealing with datasets, we usually find the data in the form of a table, as a combination of rows and columns. In the domain of data science and analytics, we often come across situations in which we need to analyze the data not just as a whole, but separately for each combination of certain categories.

For example, consider a dataset that contains the marks of students along with various factors such as subject, special groups of subjects, extracurricular activities, etc. In such a scenario, it is useful to be able to group the marks by the factors mentioned above.

This is when the R group_by() function comes into the picture!

The group_by() function takes an existing table and groups its rows by one or more variables (columns) of that table. The data itself is not changed; instead, later operations such as summarize() or mutate() are carried out separately for each group.

The dplyr package provides the group_by() function.

Syntax:

data_object %>% 
  group_by(column_names)
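
As a rough illustration of the student-marks scenario described above, the sketch below builds a small made-up table and groups it by subject. The table name marks_tbl, its column names, and all values are hypothetical and chosen only for this illustration.

```r
library(dplyr)

# Hypothetical marks data; names and values are invented for illustration
marks_tbl <- tibble(
  student = c("A", "B", "C", "D"),
  subject = c("Math", "Math", "Physics", "Physics"),
  marks   = c(80, 90, 70, 60)
)

grouped <- marks_tbl %>%
  group_by(subject)

# The rows are unchanged; only grouping information is attached
is_grouped_df(grouped)   # TRUE
group_vars(grouped)      # "subject"
n_groups(grouped)        # 2

# Verbs applied afterwards work once per group
subject_means <- grouped %>%
  summarise(mean_marks = mean(marks))
subject_means
```

Here, subject_means holds one row per subject, with the mean of the marks in that subject.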

Now, let us have a look at the implementation of the same!


Example 1: Grouping across a single column using group_by() function

In this example, we create a vector of 20 numbers, ‘lst’, and two categorical variables using the rep() function: ‘Poll’ with the values ‘Yes’ and ‘No’, and ‘S’ with the values ‘r’ and ‘n’.

Next, we combine these columns into a table using the tibble() function and group the rows by the ‘Poll’ variable, as shown below. Note that group_by() does not change the rows themselves; the grouping shows up only in the ‘# Groups: Poll [2]’ line of the output.

Example:

# Remove all existing objects from the workspace
rm(list = ls())

lst <- 1:20
Poll <- rep(c("Yes", "No"), 10) # rep() repeats the vector 10 times
S <- rep(c("r", "n"), 10)

# install.packages('tibble')
library('tibble')
dta <- tibble(lst, Poll, S)

# print(dta)
library('dplyr')
dta %>% 
  group_by(Poll)

Output:

# A tibble: 20 x 3
# Groups:   Poll [2]
     lst Poll  S    
   <int> <chr> <chr>
 1     1 Yes   r    
 2     2 No    n    
 3     3 Yes   r    
 4     4 No    n    
 5     5 Yes   r    
 6     6 No    n    
 7     7 Yes   r    
 8     8 No    n    
 9     9 Yes   r    
10    10 No    n    
11    11 Yes   r    
12    12 No    n    
13    13 Yes   r    
14    14 No    n    
15    15 Yes   r    
16    16 No    n    
17    17 Yes   r    
18    18 No    n    
19    19 Yes   r    
20    20 No    n    
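
Since the printed rows look the same as in the ungrouped table, it can help to inspect the grouping directly. The following is a minimal sketch that rebuilds the same table and uses dplyr's group_keys(), group_size() and ungroup(); the object names grouped and plain are chosen only for this illustration.

```r
library(dplyr)

dta <- tibble(
  lst  = 1:20,
  Poll = rep(c("Yes", "No"), 10),
  S    = rep(c("r", "n"), 10)
)

grouped <- dta %>%
  group_by(Poll)

group_keys(grouped)   # one row per group: "No", "Yes"
group_size(grouped)   # number of rows in each group: 10 10

# ungroup() removes the grouping again
plain <- ungroup(grouped)
is_grouped_df(plain)  # FALSE
```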

Example 2: R group_by() with summarize() alongside n() function

In the example below, we combine the group_by() function with the summarize() function. We first group the table by the ‘Poll’ variable; inside summarize(), the n() function then returns the number of rows in each group.

Example:

# Remove all existing objects from the workspace
rm(list = ls())

lst <- 1:20
Poll <- rep(c("Yes", "No"), 10) # rep() repeats the vector 10 times
S <- rep(c("r", "n"), 10)
# install.packages('tibble')
library('tibble')
dta <- tibble(lst, Poll, S)
# print(dta)
library('dplyr')
dta %>% 
  group_by(Poll) %>% 
  summarize(n = n())

Output:

# A tibble: 2 x 2
  Poll      n
* <chr> <int>
1 No       10
2 Yes      10
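
For simple counts like the one above, dplyr also offers count(), which groups and counts in a single step. The sketch below, with the object names by_group and by_count chosen only for illustration, compares the two approaches on the same table.

```r
library(dplyr)

dta <- tibble(
  lst  = 1:20,
  Poll = rep(c("Yes", "No"), 10),
  S    = rep(c("r", "n"), 10)
)

# Explicit grouping followed by a per-group row count
by_group <- dta %>%
  group_by(Poll) %>%
  summarize(n = n())

# count() groups by Poll and counts the rows in one call
by_count <- dta %>%
  count(Poll)

by_count
```

Both results contain one row per value of ‘Poll’ with the same counts, so count() is a convenient shorthand when only the group sizes are needed.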

Example 3: Grouping across multiple columns using group_by() function

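In this example, we group the table by both the ‘Poll’ and ‘S’ columns and then count the rows in each group using summarize() with n(). Because ‘Poll’ and ‘S’ change together in this data (every ‘Yes’ row has ‘r’ and every ‘No’ row has ‘n’), only two combinations occur. This example reuses the dta table created in Example 2.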

Example:

dta %>% 
  group_by(Poll,S) %>% 
  summarize(n = n())

Output:

# A tibble: 2 x 3
# Groups:   Poll [2]
  Poll  S         n
  <chr> <chr> <int>
1 No    n        10
2 Yes   r        10
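
The ‘# Groups: Poll [2]’ line in the output above appears because summarize() removes only the last grouping variable by default, so the result is still grouped by ‘Poll’. Assuming dplyr 1.0.0 or later, the .groups argument of summarize() controls this. The following is a minimal sketch; the object names are chosen only for illustration.

```r
library(dplyr)

dta <- tibble(
  lst  = 1:20,
  Poll = rep(c("Yes", "No"), 10),
  S    = rep(c("r", "n"), 10)
)

# "drop_last" states the default behaviour explicitly: the result stays grouped by Poll
res_drop_last <- dta %>%
  group_by(Poll, S) %>%
  summarize(n = n(), .groups = "drop_last")
group_vars(res_drop_last)   # "Poll"

# "drop" returns a fully ungrouped table
res_dropped <- dta %>%
  group_by(Poll, S) %>%
  summarize(n = n(), .groups = "drop")
group_vars(res_dropped)     # character(0)
```

Dropping the groups is often the safer choice when the summary is passed on to further steps that should not run per group.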

Example 4: R group_by() with mutate() function

Here, we group the values by the ‘Poll’ and ‘S’ columns and then use the mutate() function to add a new column, ‘res’, holding the mean of the ‘lst’ column within each group. Unlike summarize(), mutate() keeps all 20 rows and repeats the group mean on every row of that group.

Example:

# Remove all existing objects from the workspace
rm(list = ls())

lst <- 1:20
Poll <- rep(c("Yes", "No"), 10) # rep() repeats the vector 10 times
S <- rep(c("r", "n"), 10)
# install.packages('tibble')
library('tibble')
dta <- tibble(lst, Poll, S)
# print(dta)
library('dplyr')
dta %>% 
  group_by(Poll, S) %>% 
  mutate(res = mean(lst))

Output:

# A tibble: 20 x 4
# Groups:   Poll, S [2]
     lst Poll  S       res
   <int> <chr> <chr> <dbl>
 1     1 Yes   r        10
 2     2 No    n        11
 3     3 Yes   r        10
 4     4 No    n        11
 5     5 Yes   r        10
 6     6 No    n        11
 7     7 Yes   r        10
 8     8 No    n        11
 9     9 Yes   r        10
10    10 No    n        11
11    11 Yes   r        10
12    12 No    n        11
13    13 Yes   r        10
14    14 No    n        11
15    15 Yes   r        10
16    16 No    n        11
17    17 Yes   r        10
18    18 No    n        11
19    19 Yes   r        10
20    20 No    n        11
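
A common follow-up to a grouped mutate() is to compare each row with its own group. The sketch below is one possible variation, not part of the original example: the column names grp_mean and diff are hypothetical, and ungroup() is called at the end so that later steps are not applied per group by accident.

```r
library(dplyr)

dta <- tibble(
  lst  = 1:20,
  Poll = rep(c("Yes", "No"), 10),
  S    = rep(c("r", "n"), 10)
)

res <- dta %>%
  group_by(Poll) %>%
  mutate(
    grp_mean = mean(lst),       # mean of lst within each Poll group
    diff     = lst - grp_mean   # distance of each row from its group mean
  ) %>%
  ungroup()                     # remove the grouping before further work

head(res, 4)
```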

Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.

For more such posts related to R programming, stay tuned with us!

Till then, Happy Learning!! 🙂

