Hello, readers! In this article, we will be focusing on the One-Way ANOVA test in R programming, in detail.
So, let us begin!!
First, what is the One-Way ANOVA test?
Before diving into the concept of ANOVA, let us first understand why this test matters in the domain of data science.
Before modelling, it is very important to check the credibility of the data variables that are to be fed into the model.
That is, performing feature selection on the raw data is a crucial step prior to modelling.
This is where the One-Way ANOVA test comes into the picture.
The ANOVA test is used when the dataset contains categorical independent variables, i.e. variables whose values fall into a fixed set of groups. Example: Categorical values like ‘yes’ or ‘no’, ‘true’ or ‘false’, etc.
ANOVA (Analysis of Variance) is a statistical test that estimates how a quantitative dependent variable is affected by one or more categorical independent variables of the dataset.
It does so by comparing the mean of the dependent variable across the groups (levels) of each categorical independent variable.
In the One-Way ANOVA test, we compare the mean of the dependent variable across the groups of a single grouping variable.
Assumptions of ANOVA test
- The values of the dependent variable are normally distributed within each group.
- These group distributions share a common variance.
- The observations are obtained independently of one another, within and across the groups defined by the categorical grouping factor.
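The first two assumptions above can be checked roughly in R. The sketch below is only an illustration: it uses the built-in PlantGrowth dataset as a stand-in (not the bike rental data), and the choice of the Shapiro-Wilk and Bartlett tests is an assumption of this example, not something the steps below rely on. Independence cannot be tested this way; it depends on how the data was collected.

```r
# Stand-in data: the built-in PlantGrowth dataset (an assumption for this
# sketch, not the bike rental data used later in the article).
data(PlantGrowth)

# Fit the one-way model: 'weight' explained by the three-level factor 'group'.
fit <- aov(weight ~ group, data = PlantGrowth)

# Normality: Shapiro-Wilk test on the residuals of the fitted model.
normality <- shapiro.test(residuals(fit))

# Common variance: Bartlett's test of equal variances across the groups.
homogeneity <- bartlett.test(weight ~ group, data = PlantGrowth)

# A small p-value in either test would cast doubt on that assumption.
print(normality$p.value)
print(homogeneity$p.value)
```

A large p-value here does not prove that an assumption holds; it only means the test found no clear evidence against it.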
Hypothesis of ANOVA test
- Null Hypothesis: The means of all the groups are the same.
- Alternate Hypothesis: The mean of at least one group differs from the others.
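To see how these hypotheses are tested, it may help to compute the F statistic by hand: the variation between the group means is compared with the variation inside the groups. The sketch below again uses the built-in PlantGrowth dataset as a stand-in, and the variable names (ss_between, ss_within and so on) are chosen for this illustration only.

```r
# Stand-in data: built-in PlantGrowth (assumption for this sketch).
data(PlantGrowth)
y <- PlantGrowth$weight   # quantitative dependent variable
g <- PlantGrowth$group    # categorical grouping variable (a factor)

k <- nlevels(g)           # number of groups
n <- length(y)            # number of observations

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)
grand_mean  <- mean(y)

# Between-group sum of squares: how far the group means lie from the grand mean.
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Within-group sum of squares: ave(y, g) gives each observation its group mean.
ss_within <- sum((y - ave(y, g))^2)

# F statistic = between-group mean square / within-group mean square.
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual <- pf(f_manual, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

# Compare with the value reported by aov().
f_aov <- summary(aov(weight ~ group, data = PlantGrowth))[[1]][["F value"]][1]
print(c(manual = f_manual, aov = f_aov))
print(p_manual)
```

A large F value means the group means are spread out much more than the scatter within the groups would suggest, which is evidence against the null hypothesis.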
Steps to implement One-Way ANOVA in R
1. First, we load the dataset into the R environment using the read.csv() function. We use the dataset from the Bike Rental Count Prediction problem and select only its categorical variables. You can find the entire dataset here!
#Remove all the existing objects
rm(list = ls())
#Setting the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()
#Load the dataset
bike_data = read.csv("day.csv",header=TRUE)
2. From the entire dataset, we pick out the categorical variables and store their column names in a vector.
categorical_col = c("season","yr","mnth","holiday","weekday","workingday","weathersit")
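One thing worth checking at this point is whether R actually treats these columns as categorical. Columns that store categories as integer codes are read as numbers by read.csv(), and aov() would then fit a single slope instead of one mean per group. The sketch below shows the difference on a small, made-up data frame (the values in toy are hypothetical and are not taken from day.csv).

```r
# Hypothetical toy data mimicking an integer-coded category (not the real day.csv).
toy <- data.frame(
  season = c(1, 1, 2, 2, 3, 3, 4, 4),
  cnt    = c(120, 150, 300, 320, 410, 390, 220, 240)
)

# As a number, 'season' is fitted as one slope: 1 degree of freedom.
df_numeric <- summary(aov(cnt ~ season, data = toy))[[1]][["Df"]][1]

# As a factor, each level gets its own mean: (number of levels - 1) degrees of freedom.
toy$season <- factor(toy$season)
df_factor <- summary(aov(cnt ~ season, data = toy))[[1]][["Df"]][1]

print(c(numeric = df_numeric, factor = df_factor))
# numeric = 1, factor = 3
```

A Df value of (number of levels - 1) in the ANOVA table is a quick sign that the grouping variable was handled as a factor.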
3. Now it is time to apply the One-Way ANOVA test! We use the aov() function to test the single quantitative variable ‘cnt’ against each categorical variable in turn.
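for(x in categorical_col)
{
  print(x)
  #Treat the column as a factor so that every level gets its own group mean;
  #an integer-coded column would otherwise be fitted as one numeric slope
  bike_data[,x] = as.factor(bike_data[,x])
  anova_test = aov(cnt ~ bike_data[,x],bike_data)
  print(summary(anova_test))
}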
Our main task here was to understand whether the count of rented bikes (cnt) differs significantly under different environmental conditions and seasonal rotations.
Output:
[1] "season"
Df Sum Sq Mean Sq F value Pr(>F)
bike_data[, x] 3 9.218e+08 307282201 124.8 <2e-16 ***
Residuals 713 1.755e+09 2461404
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "yr"
Df Sum Sq Mean Sq F value Pr(>F)
bike_data[, x] 1 8.813e+08 881327066 351 <2e-16 ***
Residuals 715 1.796e+09 2511190
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "mnth"
Df Sum Sq Mean Sq F value Pr(>F)
bike_data[, x] 11 1.042e+09 94755196 40.87 <2e-16 ***
Residuals 705 1.635e+09 2318469
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "holiday"
Df Sum Sq Mean Sq F value Pr(>F)
bike_data[, x] 1 1.377e+07 13770983 3.697 0.0549 .
Residuals 715 2.663e+09 3724555
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[1] "weekday"
Df Sum Sq Mean Sq F value Pr(>F)
bike_data[, x] 6 1.757e+07 2928537 0.782 0.584
Residuals 710 2.659e+09 3745432
[1] "workingday"
Df Sum Sq Mean Sq F value Pr(>F)
bike_data[, x] 1 8.494e+06 8494340 2.276 0.132
Residuals 715 2.668e+09 3731935
[1] "weathersit"
Df Sum Sq Mean Sq F value Pr(>F)
bike_data[, x] 2 2.680e+08 133999088 39.72 <2e-16 ***
Residuals 714 2.409e+09 3373711
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
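A p-value below the alpha (significance) level of 0.05 means we reject the null hypothesis for that variable: the mean of cnt differs considerably across its groups. That is the case for season, yr, mnth and weathersit, each marked ‘***’. For holiday (p = 0.0549) and weekday (p = 0.584) the p-value is above 0.05, and for workingday an F value of 2.276 on 1 and 715 degrees of freedom corresponds to a p-value of about 0.13, so for these three variables we cannot reject the null hypothesis.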
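Reading the p-values off the printed tables works, but they can also be collected programmatically and compared with alpha. The following is a minimal sketch under stated assumptions: it uses the built-in warpbreaks dataset (response breaks, factors wool and tension) as a stand-in for the bike data, and the names p_values and result are made up for this example.

```r
# Stand-in data: built-in warpbreaks (assumption; not the bike rental data).
data(warpbreaks)
categorical_col <- c("wool", "tension")
alpha <- 0.05

# Run one one-way ANOVA per categorical column and keep only its p-value.
p_values <- sapply(categorical_col, function(x) {
  # reformulate() builds the formula breaks ~ <column name> from a string.
  fit <- aov(reformulate(x, response = "breaks"), data = warpbreaks)
  summary(fit)[[1]][["Pr(>F)"]][1]
})

# One row per variable, with a flag for significance at the chosen alpha.
result <- data.frame(
  variable    = categorical_col,
  p_value     = p_values,
  significant = p_values < alpha,
  row.names   = NULL
)
print(result)
```

Building the formula with reformulate() also means each ANOVA table is labelled with the actual column name rather than a generic expression such as bike_data[, x].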
Conclusion
With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.
For more such posts related to R programming, stay tuned and till then, Happy Learning!! 🙂