Hello, readers! Today, we will look at an important statistical test in the domain of data science and machine learning: the **Chi-square test in R programming**.

So, let us begin!

## What is Chi-square Test?

As we know, feature selection and understanding the association between data variables are crucial steps before applying machine learning models to a dataset.

The type of statistical test to apply depends mainly on the nature of the variables, i.e. whether they are continuous or categorical.

Below are some of the most commonly used statistical tests for continuous data, as found in regression problems:

- Correlation and regression analysis
- T-test, etc.

When categorical data is involved, the most popular statistical tests are:

- **ANOVA test** (a categorical grouping variable and a continuous response)
- **Chi-square test** (both variables categorical)

Today, we will look at the Chi-square test as a statistical test for feature selection.

The `Chi-square test` is a non-parametric statistical test used to assess the association between two categorical variables of a dataset.

Understanding how the variables are associated makes it easier to decide which of them are useful for the end predictions and further use-cases.

In other words, it is a statistical test used to determine whether an association exists between the categorical variables of the dataset, i.e. whether the categorical variables are independent of or dependent on each other.

### Assumptions for Chi-square test in R

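- It needs two categorical variables, supplied to the function as arguments.
- Each categorical variable must have two or more categories (groups).
- The observations must be independent of each other, i.e. the two variables must not come from paired or repeated measurements.
- The expected count in each cell of the contingency table should be reasonably large; a common rule of thumb is at least 5.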
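One practical way to check these conditions is to cross-tabulate the two variables and inspect the expected counts that `chisq.test()` computes. The sketch below uses made-up vectors (`gender` and `preference` are hypothetical names and values), and the threshold of 5 is a widely used rule of thumb rather than a hard requirement.

```r
# Hypothetical categorical data: two variables recorded for 12 respondents
gender <- c("F", "M", "F", "M", "F", "M", "F", "M", "F", "M", "F", "M")
preference <- c("tea", "tea", "coffee", "coffee", "tea", "coffee",
                "tea", "coffee", "coffee", "tea", "tea", "coffee")

# Cross-tabulate the two variables into a contingency table
tbl <- table(gender, preference)
print(tbl)

# Each variable should have at least two categories
stopifnot(nrow(tbl) >= 2, ncol(tbl) >= 2)

# Expected counts under independence: (row total * column total) / grand total
expected <- suppressWarnings(chisq.test(tbl))$expected
print(expected)

# Count the cells whose expected count falls below the rule-of-thumb value of 5
small_cells <- sum(expected < 5)
cat("Cells with expected count below 5:", small_cells, "\n")
```

If many cells fall below the threshold, the p-value from the Chi-square approximation should be read with caution.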

### Hypothesis of Chi-square test

- Alternative hypothesis: The two variables are associated with each other.
- Null hypothesis: The two variables are independent of each other, i.e. there is no association between them.

## R chisq.test() function to perform Chi-square test

R provides us with the `chisq.test()` function to perform Chi-square testing and detect the presence of an association between the categorical variables passed to it.

**Syntax:**

```
chisq.test(variable1,variable2)
```
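`chisq.test()` also accepts a single contingency table in place of two vectors. The sketch below, using hypothetical vectors `x` and `y`, suggests that both call styles give the same result, since R cross-tabulates the two vectors internally.

```r
# Hypothetical categorical vectors of equal length
x <- c("a", "a", "b", "b", "a", "b", "a", "b", "a", "b")
y <- c("u", "v", "u", "v", "u", "v", "v", "u", "u", "v")

# Form 1: pass the two vectors; R builds the contingency table itself
# (suppressWarnings() hides the small-sample warning for this toy data)
res_vectors <- suppressWarnings(chisq.test(x, y))

# Form 2: build the contingency table yourself and pass it
res_table <- suppressWarnings(chisq.test(table(x, y)))

# Both forms report the same statistic and p-value
print(res_vectors$statistic)
print(res_table$statistic)
print(res_vectors$p.value)
print(res_table$p.value)
```

Passing a table is handy when the data already arrives as counts rather than as one row per observation.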

**Example:**

```
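# Remove all existing objects from the workspace
rm(list = ls())
# chisq.test() treats each distinct value in these vectors as a category
y_actual = c(10,20,30,40,50)
y_predict = c(9.8,19.8,30,40,52.5)
# Expect a warning here: the expected counts are too small for the approximation
chi = chisq.test(y_actual, y_predict)
print(chi)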
```

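**Output:** The result of the Chi-square test reports three values:

- **Degrees of freedom (df)**: (number of rows - 1) × (number of columns - 1) of the contingency table built from the two variables. Here each variable has five distinct values, so df = 4 × 4 = 16.
- **Test statistic (X-squared)**: the sum, over all cells of the contingency table, of the squared difference between the observed and expected counts divided by the expected count. Larger values indicate a bigger departure from independence.
- **P-value**: the probability of obtaining a test statistic at least as large as the observed one if the null hypothesis were true.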

```
> print(chi)
Pearson's Chi-squared test
data: y_actual and y_predict
X-squared = 20, df = 16, p-value = 0.2202
```
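To see where the three reported values come from, the sketch below recomputes them by hand for the same two vectors. The variable names (`observed`, `expected`, `x_squared`, `p_value`) are illustrative, and the formula is the standard Pearson statistic, i.e. the sum of (observed - expected)^2 / expected over all cells.

```r
y_actual <- c(10, 20, 30, 40, 50)
y_predict <- c(9.8, 19.8, 30, 40, 52.5)

# Each distinct value is treated as a category, giving a 5 x 5 table
observed <- table(y_actual, y_predict)

# Expected count per cell under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over all cells of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail probability of the chi-square distribution
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

cat("X-squared =", x_squared, " df =", df, " p-value =", round(p_value, 4), "\n")
```

Every expected count here is only 0.2, far below the usual rule of thumb of 5, which is why R warns that the approximation may be incorrect for this toy example.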

To interpret the Chi-square test, we check whether the p-value is less than the chosen significance level (usually 0.05).

If it is, we reject the null hypothesis and conclude that there is evidence of an association between the two variables. Note that an association does not by itself mean that one variable explains or causes the other.

In our example, the p-value (0.2202) is greater than the assumed significance level (0.05). Thus, we fail to reject the null hypothesis: the data do not provide evidence of an association between the variables. This is not the same as proving that they are independent.
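The decision rule can also be applied programmatically, because `chisq.test()` returns a list whose `p.value` element holds the p-value. A minimal sketch, assuming a significance level of 0.05 stored in a variable named `alpha` (an illustrative name):

```r
y_actual <- c(10, 20, 30, 40, 50)
y_predict <- c(9.8, 19.8, 30, 40, 52.5)

# suppressWarnings() hides R's note that the approximation may be
# unreliable here because the expected counts are very small
chi <- suppressWarnings(chisq.test(y_actual, y_predict))

alpha <- 0.05  # assumed significance level

# Compare the p-value with the significance level
if (chi$p.value < alpha) {
  decision <- "Reject the null hypothesis: evidence of an association."
} else {
  decision <- "Fail to reject the null hypothesis: no evidence of an association."
}
print(decision)
```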

## Implementing Chi-square Test in R on Bike Rental Dataset

In this example, we use the Bike Rental Prediction dataset. You can find the dataset here!

**Example:**

First, we load the dataset into the environment using the `read.csv()` function.

```
# Remove all existing objects from the workspace
rm(list = ls())
# Set the working directory (adjust this path to where day.csv is stored)
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()
# Load the dataset
bike_data = read.csv("day.csv",header=TRUE)
```

Then, we select a few of the categorical variables and perform the Chi-square test on pairs of them.

```
print(chisq.test(bike_data$season,bike_data$yr))
print(chisq.test(bike_data$mnth,bike_data$holiday))
print(chisq.test(bike_data$workingday,bike_data$weathersit))
```
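When many categorical columns need checking, the pairwise tests can be run in a loop and the results collected in one table. Because `day.csv` is not bundled with R, the sketch below uses the built-in `mtcars` dataset as a stand-in and treats its `cyl`, `gear`, `am`, and `vs` columns as categorical. The names `cat_cols`, `pairs`, and `results` are illustrative.

```r
# Stand-in data: built-in mtcars, with four columns treated as categorical
data(mtcars)
cat_cols <- c("cyl", "gear", "am", "vs")

# All unique pairs of the chosen columns (one pair per matrix column)
pairs <- combn(cat_cols, 2)

# Run the test on each pair and keep the key numbers
results <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(i) {
  v1 <- pairs[1, i]
  v2 <- pairs[2, i]
  # Small expected counts may trigger a warning; it is suppressed here
  test <- suppressWarnings(chisq.test(mtcars[[v1]], mtcars[[v2]]))
  data.frame(var1 = v1, var2 = v2,
             x_squared = unname(test$statistic),
             df = unname(test$parameter),
             p_value = test$p.value)
}))

print(results)
```

The same pattern should carry over to `bike_data` by replacing `mtcars` and listing columns such as `season`, `yr`, `mnth`, `holiday`, `workingday`, and `weathersit` in `cat_cols`.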

**Output:**

```
> print(chisq.test(bike_data$season,bike_data$yr))
Pearson's Chi-squared test
data: bike_data$season and bike_data$yr
X-squared = 0.027386, df = 3, p-value = 0.9988
> print(chisq.test(bike_data$mnth,bike_data$holiday))
Pearson's Chi-squared test
data: bike_data$mnth and bike_data$holiday
X-squared = 9.5502, df = 11, p-value = 0.5712
> print(chisq.test(bike_data$workingday,bike_data$weathersit))
Pearson's Chi-squared test
data: bike_data$workingday and bike_data$weathersit
X-squared = 2.4498, df = 2, p-value = 0.2938
```

In all three tests above, the p-value is greater than 0.05, so we fail to reject the null hypothesis: the data show no evidence of an association between the variables in each pair.

## Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.

For more such posts related to R programming, stay tuned.

Till then, Happy Learning!! 🙂