One-Way ANOVA test in R


Hello, readers! In this article, we will be focusing on the One-Way ANOVA test in R programming, in detail.

So, let us begin!!


First, what is One-Way ANOVA test?

Before diving into the concept of ANOVA, let us first understand why this test matters in the domain of data science.

Before building a model, it is very important to check whether the variables we feed into it actually carry information about the target.

That is, performing feature selection on the raw data prior to modelling is a crucial step.

This is where the One-Way ANOVA test comes into the picture.

The ANOVA test is used when the dataset contains categorical independent variables, i.e. variables whose values fall into a fixed set of groups. Example: Categorical values like ‘yes’ or ‘no’, ‘true’ or ‘false’, etc.

ANOVA is a statistical test that estimates how a quantitative dependent variable is affected by one or more categorical independent variables of the dataset.

It does so by comparing the mean value of the dependent variable across the groups formed by the levels of each categorical independent variable.

In the One-Way ANOVA test, we compare the mean of the dependent variable across the groups defined by a single grouping variable.



Assumptions of ANOVA test

  • The values of the dependent variable within each group are normally distributed.
  • These group distributions share a common variance.
  • The observations are obtained independently of one another, within and across the levels of the (categorical) grouping factor.
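
The first two assumptions can be checked in R. Below is a minimal sketch, assuming R's built-in PlantGrowth data as a stand-in (it is not the bike rental dataset used later) and assuming the Shapiro-Wilk and Bartlett tests as the checks of choice; other tests would work too. Independence comes from how the data was collected and cannot be verified this way.

```r
# Sketch: checking two ANOVA assumptions on the built-in PlantGrowth data.
# PlantGrowth (weight by treatment group) is only a stand-in example.
fit <- aov(weight ~ group, data = PlantGrowth)

# Normality: Shapiro-Wilk test on the residuals of the fitted model.
normality <- shapiro.test(residuals(fit))

# Common variance: Bartlett's test across the groups.
homogeneity <- bartlett.test(weight ~ group, data = PlantGrowth)

# A large p-value means the data give no evidence against the assumption.
print(normality$p.value)
print(homogeneity$p.value)
```

If either p-value is very small, the ANOVA result should be read with caution.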

Hypothesis of ANOVA test

  • Null Hypothesis: The means of all the groups are the same.
  • Alternate Hypothesis: The mean of at least one group differs from the others.
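
To see how these hypotheses are tested, here is a small sketch that computes the F statistic by hand: the variation between the group means divided by the variation within the groups. It again assumes the built-in PlantGrowth data as a stand-in, and all variable names are illustrative only.

```r
# Sketch: the one-way ANOVA F statistic computed by hand on PlantGrowth.
y <- PlantGrowth$weight   # quantitative dependent variable
g <- PlantGrowth$group    # categorical grouping factor

k <- nlevels(g)           # number of groups
n <- length(y)            # total number of observations

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Variation of the group means around the grand mean.
ss_between <- sum(group_sizes * (group_means - mean(y))^2)

# Variation of the observations around their own group mean.
ss_within <- sum((y - ave(y, g))^2)

f_value <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_value <- pf(f_value, k - 1, n - k, lower.tail = FALSE)

# Cross-check against the table produced by aov().
ref <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
print(c(by_hand = f_value, aov = ref[["F value"]][1]))
```

A large F value means the group means are spread out more than the within-group noise would explain, which is evidence against the null hypothesis.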

Steps to implement One-Way ANOVA in R

  1. First, we load the dataset into the R environment using the read.csv() function. We use the dataset of the Bike Rental Count Prediction problem and work only with its categorical variables. You can find the entire dataset here!
#Remove all the existing objects
rm(list = ls())
#Setting the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

#Load the dataset
bike_data = read.csv("day.csv",header=TRUE)

  2. From the entire dataset, we collect the names of the categorical columns into a character vector.

categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")
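
One point to watch: aov() only treats a column as a grouping variable if it is a factor. The sketch below uses a hypothetical toy data frame (not the bike data) with an integer-coded season column, as such a column would look after a plain CSV import, to show the difference.

```r
# Hypothetical toy data: 'season' is integer-coded 1..4, 'cnt' is simulated.
set.seed(1)
toy <- data.frame(
  season = rep(1:4, each = 10),
  cnt = rnorm(40, mean = rep(c(2500, 5000, 5600, 4700), each = 10), sd = 800)
)

# As a numeric column, season is fitted as a single linear trend.
df_numeric <- summary(aov(cnt ~ season, data = toy))[[1]][["Df"]][1]

# As a factor, each level of season becomes its own group.
toy$season <- factor(toy$season)
df_factor <- summary(aov(cnt ~ season, data = toy))[[1]][["Df"]][1]

# Degrees of freedom: 1 for the numeric fit, 4 - 1 = 3 for the factor fit.
print(c(numeric = df_numeric, factor = df_factor))
```

Only the factor version is a One-Way ANOVA across the four groups.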

  3. Now it is time to apply the One-Way ANOVA test! We use the aov() function to test each categorical variable, one at a time, against the single quantitative variable ‘cnt’.

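#Convert the categorical columns to factors so that aov() treats them as groups
#(read.csv() loads integer-coded columns as plain numbers)
bike_data[categorical_col] = lapply(bike_data[categorical_col], factor)

for(x in categorical_col)
{
  print(x)
  #One-Way ANOVA of cnt against the current categorical column
  anova_test = aov(cnt ~ bike_data[,x], data = bike_data)
  print(summary(anova_test))
}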

Our main task here is to find out whether the count of rented bikes (cnt) differs significantly across different environmental conditions and seasons.

Output:

[1] "season"
                Df    Sum Sq   Mean Sq F value Pr(>F)    
bike_data[, x]   3 9.218e+08 307282201   124.8 <2e-16 ***
Residuals      713 1.755e+09   2461404                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "yr"
                Df    Sum Sq   Mean Sq F value Pr(>F)    
bike_data[, x]   1 8.813e+08 881327066     351 <2e-16 ***
Residuals      715 1.796e+09   2511190                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "mnth"
                Df    Sum Sq  Mean Sq F value Pr(>F)    
bike_data[, x]  11 1.042e+09 94755196   40.87 <2e-16 ***
Residuals      705 1.635e+09  2318469                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "holiday"
                Df    Sum Sq  Mean Sq F value Pr(>F)  
bike_data[, x]   1 1.377e+07 13770983   3.697 0.0549 .
Residuals      715 2.663e+09  3724555                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "weekday"
                Df    Sum Sq Mean Sq F value Pr(>F)
bike_data[, x]   6 1.757e+07 2928537   0.782  0.584
Residuals      710 2.659e+09 3745432               
[1] "workingday"
                Df    Sum Sq Mean Sq F value Pr(>F)
bike_data[, x]   1 8.494e+06 8494340   2.276  0.132
Residuals      715 2.668e+09 3731935               
[1] "weathersit"
                Df    Sum Sq   Mean Sq F value Pr(>F)    
bike_data[, x]   2 2.680e+08 133999088   39.72 <2e-16 ***
Residuals      714 2.409e+09   3373711                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

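For season, yr, mnth and weathersit, the p-value is far below the alpha (significance) value of 0.05 (marked with ‘***’), so we reject the null hypothesis and conclude that the mean of cnt differs significantly across the groups of these variables. Where the p-value is above 0.05, as for holiday (0.0549) and weekday (0.584), we cannot reject the null hypothesis.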
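
Instead of reading the p-values off the printed tables, we can also collect them in code and compare them with alpha. The sketch below assumes the built-in mtcars data as a stand-in, with mpg as the quantitative variable and three columns treated as categorical; the names cars, cat_cols and p_values are illustrative only.

```r
# Sketch: collecting one-way ANOVA p-values for several categorical columns.
alpha <- 0.05

cars <- mtcars
cat_cols <- c("cyl", "gear", "am")

# Convert the grouping columns to factors before fitting.
cars[cat_cols] <- lapply(cars[cat_cols], factor)

p_values <- sapply(cat_cols, function(col) {
  # reformulate() builds the formula mpg ~ <col> from the column name.
  fit <- aov(reformulate(col, response = "mpg"), data = cars)
  # The first row of the summary table belongs to the grouping variable.
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# Names of the variables for which the null hypothesis is rejected.
significant <- names(p_values)[p_values < alpha]
print(p_values)
print(significant)
```

The same pattern could be applied to bike_data with categorical_col and cnt, provided the columns are factors.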


Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.

For more such posts related to R programming, stay tuned and till then, Happy Learning!! 🙂
