Hello, readers! Today, we will work on an important statistical test in the domain of data science and machine learning – the Chi-square test in R programming.
So, let us begin!
What is Chi-square Test?
We all are well aware that feature selection and understanding of the association of data variables is a crucial step before applying machine learning models on the datasets.
The type of statistical test to apply on a dataset depends largely on the nature of its variables, i.e., continuous or categorical.
Below are some of the most commonly used statistical tests for continuous data (regression problems):
- Correlation and regression analysis
- T-test, etc.
When it comes to categorical data, below are the most popular statistical tests to perform:
- ANOVA test (a categorical variable against a continuous one)
- Chi-square test (two categorical variables)
Today, we will have a look at Chi-square as a statistical test for feature selection.
The Chi-square test is a non-parametric statistical test used to assess the association between two categorical variables of a dataset.
Understanding how the variables are associated makes it easier to judge which of them are relevant to the end predictions and further use cases.
Put differently, the test determines whether two categorical variables are independent of each other or dependent on each other.
Assumptions for Chi-square test in R
- It needs two categorical variables supplied to the function as arguments.
- Every categorical variable passed must have two or more categories (groups).
- The observations must be independent of each other; the test is not suited to paired (repeated) measurements.
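These assumptions can be checked quickly before trusting a result. The following is a minimal sketch, not part of the original example: it assumes the built-in mtcars data as a stand-in, with cyl and am treated as categorical variables, and it applies the common rule of thumb that expected cell counts should be at least 5.

```r
# Stand-in data (an assumption for illustration): mtcars ships with R
x = factor(mtcars$cyl)   # number of cylinders, treated as categorical
y = factor(mtcars$am)    # transmission type, treated as categorical

# Each variable needs two or more categories
has_enough_levels = nlevels(x) >= 2 && nlevels(y) >= 2

# Contingency table of observed counts
observed = table(x, y)

# Expected counts under independence, as computed by chisq.test() itself;
# suppressWarnings() only silences the small-count warning during this check
expected = suppressWarnings(chisq.test(observed))$expected

# Rule of thumb: the approximation is doubtful when expected counts fall below 5
small_cells = sum(expected < 5)

print(has_enough_levels)
print(round(expected, 2))
print(small_cells)
```

If small_cells is greater than zero, fisher.test() or chisq.test() with simulate.p.value = TRUE are commonly used alternatives.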
Hypothesis of Chi-square test
- Alternative hypothesis: The two variables are associated with each other.
- Null hypothesis: The variables are independent of each other, i.e., there is no association between them.
R chisq.test() function to perform Chi-square test
R provides us with the chisq.test() function to perform Chi-square testing and detect the presence of association between the categorical variables passed to it.
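# Remove all existing objects from the workspace
rm(list = ls())

y_actual = c(10,20,30,40,50)
y_predict = c(9.8,19.8,30,40,52.5)

# chisq.test() coerces both vectors to factors,
# so every distinct value is treated as its own category
chi = chisq.test(y_actual, y_predict)
print(chi)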
Output:
> print(chi)

    Pearson's Chi-squared test

data:  y_actual and y_predict
X-squared = 20, df = 16, p-value = 0.2202

Interpretation of the result obtained from Chi-square test
- Degrees of freedom (df): The number of values that are free to vary. For a contingency table with r rows and c columns, df = (r - 1)(c - 1); here both variables have 5 distinct values, so df = 4 x 4 = 16.
- Test statistic (X-squared): The sum, over all cells of the contingency table, of (observed - expected)^2 / expected, where the expected counts are those we would see if the variables were independent.
- P-value: The probability of obtaining a test statistic at least as large as the observed one if the null hypothesis were true.
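The numbers in this output can be reproduced by hand, which makes the three quantities concrete. The following is a minimal sketch using the same two vectors; the names observed, expected, x_squared, dof and p_value are chosen here for illustration only.

```r
y_actual = c(10, 20, 30, 40, 50)
y_predict = c(9.8, 19.8, 30, 40, 52.5)

# Observed counts: a 5 x 5 contingency table, one row/column per distinct value
observed = table(y_actual, y_predict)

# Expected counts under independence: (row total * column total) / grand total
expected = outer(rowSums(observed), colSums(observed)) / sum(observed)

# Test statistic: sum of (observed - expected)^2 / expected over all cells
x_squared = sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof = (nrow(observed) - 1) * (ncol(observed) - 1)

# P-value: upper tail of the Chi-square distribution with dof degrees of freedom
p_value = pchisq(x_squared, df = dof, lower.tail = FALSE)

print(c(x_squared = x_squared, df = dof, p_value = round(p_value, 4)))
```

Every expected count here is 0.2, far below the usual threshold of 5, which is why R warns that the Chi-squared approximation may be incorrect for this toy data.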
To interpret the Chi-square test, we check whether the p-value is less than the significance level (usually 0.05).
If it is, we reject the null hypothesis and conclude that there is evidence of an association between the two variables. That is, knowing the category of one variable tells us something about the other.
In our example, the p-value (0.2202) is greater than the assumed significance level (0.05). Thus, we fail to reject the null hypothesis: the data give no evidence that the variables are associated.
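The decision rule can also be written out in code. This is a minimal sketch, assuming a significance level of 0.05 and reusing the toy vectors from the example; p.value is a component of the object that chisq.test() returns, while alpha and decision are names introduced here for illustration.

```r
y_actual = c(10, 20, 30, 40, 50)
y_predict = c(9.8, 19.8, 30, 40, 52.5)

alpha = 0.05  # assumed significance level

# suppressWarnings() hides the small-sample warning this toy data triggers
chi = suppressWarnings(chisq.test(y_actual, y_predict))

# Compare the p-value with the significance level
if (chi$p.value < alpha) {
  decision = "reject the null hypothesis: evidence of association"
} else {
  decision = "fail to reject the null hypothesis: no evidence of association"
}
print(decision)
```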
Implementing Chi-square Test in R on Bike Rental Dataset
In this example, we use the Bike Rental Prediction dataset (the file day.csv).
First, we load the dataset into the environment using the read.csv() function.
# Remove all existing objects from the workspace
rm(list = ls())

# Set the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

# Load the dataset
bike_data = read.csv("day.csv",header=TRUE)
Then, we select a few of the categorical variables and perform the Chi-square test on each pair.
print(chisq.test(bike_data$season,bike_data$yr))
print(chisq.test(bike_data$mnth,bike_data$holiday))
print(chisq.test(bike_data$workingday,bike_data$weathersit))
> print(chisq.test(bike_data$season,bike_data$yr))

    Pearson's Chi-squared test

data:  bike_data$season and bike_data$yr
X-squared = 0.027386, df = 3, p-value = 0.9988

> print(chisq.test(bike_data$mnth,bike_data$holiday))

    Pearson's Chi-squared test

data:  bike_data$mnth and bike_data$holiday
X-squared = 9.5502, df = 11, p-value = 0.5712

> print(chisq.test(bike_data$workingday,bike_data$weathersit))

    Pearson's Chi-squared test

data:  bike_data$workingday and bike_data$weathersit
X-squared = 2.4498, df = 2, p-value = 0.2938
In all three tests the p-value is well above 0.05, so we fail to reject the null hypothesis: the data show no evidence of association within any of these pairs of variables.
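When several pairs need testing, the results are easier to compare side by side in one table. The following is a rough sketch under stated assumptions: chisq_summary() is a hypothetical helper written for this illustration, not a function from R or from the original example, and it is demonstrated on the built-in mtcars data so that it runs without day.csv.

```r
# Hypothetical helper: runs chisq.test() on each pair of columns and
# collects the statistic, degrees of freedom and p-value in one data frame.
# suppressWarnings() hides small-count warnings, so check assumptions separately.
chisq_summary = function(data, pairs) {
  rows = lapply(pairs, function(p) {
    res = suppressWarnings(chisq.test(data[[p[1]]], data[[p[2]]]))
    data.frame(var1 = p[1],
               var2 = p[2],
               x_squared = unname(res$statistic),
               df = unname(res$parameter),
               p_value = res$p.value)
  })
  do.call(rbind, rows)
}

# Stand-in data (an assumption for illustration): mtcars ships with R
result = chisq_summary(mtcars, list(c("cyl", "am"), c("gear", "vs")))
print(result)
```

With the bike rental data loaded, the same helper could be called as chisq_summary(bike_data, list(c("season", "yr"), c("mnth", "holiday"), c("workingday", "weathersit"))).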
With this, we have come to the end of this topic. Feel free to comment below if you have any questions.
For more such posts related to R programming, stay tuned.
Till then, Happy Learning!! 🙂