KNN Algorithm in R Programming


Hello readers! In our series on Machine Learning with R programming, today we will take a detailed look at the KNN algorithm in R.


First, what is KNN in Machine Learning?

Before diving into KNN, let us first understand Supervised Machine Learning in the context of Data Science.

In Data Science, modeling plays a very important role in predicting outcomes for real-life problems.

Machine Learning offers a variety of algorithms to build such models and to evaluate how well they perform.

These are the two types of Machine Learning Algorithms:

  • Supervised Machine Learning: algorithms that learn from labeled data and make predictions on new, unseen data.
  • Unsupervised Machine Learning: algorithms that discover patterns or groupings in unlabeled data, without predefined target values.

KNN stands for K Nearest Neighbors. It is a supervised Machine Learning algorithm that can be used for classification as well as regression. That is, it works on categorical as well as continuous target variables.

KNN is a non-parametric algorithm: it makes no assumptions about the underlying data distribution and predicts directly from the labelled training data.

Working of KNN

KNN uses the concept of feature similarity to predict the value/group of new data entries. It assigns a value or class to a new data point based on how close that point is to its K nearest neighbours in the training dataset.

  • At first, we load the dataset into the environment.
  • Then, we choose the value of ‘K’, the number of nearest neighbours to consider from the training dataset.
  • Now, for every data point in the testing dataset, the algorithm calculates the distance between that point and every row of the training dataset, using the Euclidean or Manhattan distance formula.
  • Based on these distances, it picks the K nearest training points and assigns the test point the most frequent class among them (for classification) or the average of their values (for regression). A minimal from-scratch sketch of this procedure follows the list.
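
To make this concrete, here is a minimal from-scratch sketch of KNN regression in R. It is purely illustrative: the names knn_predict, train_x and train_y are made up for this example and are not part of the tutorial's dataset.

knn_predict = function(train_x, train_y, new_x, k = 3){
  #Euclidean distance from the new point to every training row
  dists = sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  #Indices of the k closest training points
  nearest = order(dists)[1:k]
  #Regression: average the target values of the k neighbours
  mean(train_y[nearest])
}

#Toy usage: predict a value for the point (3, 3) from five labelled points
train_x = matrix(c(1, 2, 3, 4, 5, 5, 4, 3, 2, 1), ncol = 2)
train_y = c(10, 12, 15, 18, 20)
knn_predict(train_x, train_y, new_x = c(3, 3), k = 3)   #returns 15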

Having understood the working of the KNN algorithm, let us now have a look at its implementation!


KNN Implementation – Step by Step Guide!

Now, let us focus on the practical implementation of the KNN algorithm. In this example, we make use of the Bike Rental Count Prediction problem, wherein we are supposed to predict the count of customers who would opt for a rented bike under different environmental conditions.

You can find the dataset here!

1. Load the dataset

Initially, we have to load the dataset into the environment. For this, we set the folder containing the data as the working directory, using the setwd("path") function. Then, using the read.csv() function, we load the data.

#Setting the working directory
setwd("D:/Bike_Rental_Count/")
#Load the dataset
bike_data = read.csv("day.csv",header=TRUE)

2. Exploratory Data Analysis

Having loaded the data, we now run a few functions to study the structure and types of the data variables.

  1. str() function: With this function, we can have a look at the data type and set of values present in the data variables.
# 1. Understanding the data values of every column of the dataset
str(bike_data)

Output:

'data.frame':	731 obs. of  16 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Factor w/ 731 levels "2011-01-01","2011-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
 $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
 $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
 $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

2. summary() function: Using this function, we get a statistical summary of the variables of the dataset in terms of the minimum, maximum, mean, median, quartiles, etc.

# 2.Understanding the data distribution of the dataset
summary(bike_data)

Output:

(Image: Exploratory Data Analysis – summary() output for each variable)

3. Having seen the data types of the variables, we now convert them to the appropriate types.

# From the above data analysis, we have understood that the data columns -- 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday' and 'weathersit' belong to categorical type values.
# Thus, we need to change the data types of the columns to suitable types

bike_data$season=as.factor(bike_data$season)
bike_data$yr=as.factor(bike_data$yr)
bike_data$mnth=as.factor(bike_data$mnth)
bike_data$holiday=as.factor(bike_data$holiday)
bike_data$weekday=as.factor(bike_data$weekday)
bike_data$workingday=as.factor(bike_data$workingday)
bike_data$weathersit=as.factor(bike_data$weathersit)
bike_data$dteday = as.Date(bike_data$dteday,format="%Y-%m-%d")

str(bike_data)

Here, we have used the as.factor() function to convert the categorical columns to factor variables, and the as.Date() function to convert the date column to Date type.

Output:

'data.frame':	731 obs. of  16 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : Date, format: "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
 $ season    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ mnth      : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ holiday   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ weekday   : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
 $ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
 $ weathersit: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
 $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
 $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
 $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
 $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
 $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
 $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
 $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...

3. Feature Selection

Now that the data analysis is done, it is time to select the relevant variables from the dataset. Here, we make use of the corrgram() function (from the corrgram package) to plot the correlation between the numeric variables of the dataset.

##FEATURE SELECTION from the Continuous independent variables##

#Continuous independent columns to include in the correlation analysis
numeric_col = c('temp','atemp','hum','windspeed')

library(corrgram)
corrgram(bike_data[,numeric_col],order=FALSE,upper.panel = panel.pie,
         text.panel = panel.txt,
         main= "Correlation Analysis Plot of the Continuous Independent Variables")

#From the above correlation analysis plot, it is clear that the numeric variables 'temp' and 'atemp' are highly correlated with each other i.e. they depict nearly the same information.

#Thus, we can safely drop either one of these two variables.

#So, we drop the 'atemp' variable from the dataset.
bike_data = subset(bike_data,select = -c(atemp))

From the plot below, it is visible that the variables ‘temp’ and ‘atemp’ have a very high correlation value, so we can safely delete one of the two columns. Thus, we drop the variable ‘atemp’ using the subset() function.

Output:

(Image: Correlation Analysis Plot of the continuous independent variables)
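
To confirm the relationship numerically rather than visually, the pairwise correlation can be computed directly with cor(). Note that this check has to run before the subset() call above, while ‘atemp’ is still in the dataframe; for this dataset the value comes out close to 1.

#Numeric confirmation (run before dropping 'atemp'):
#'temp' and 'atemp' are almost perfectly correlated
cor(bike_data$temp, bike_data$atemp)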

4. Splitting of dataset

Now, the most important step prior to modelling is splitting the dataset. First, we one-hot encode the categorical columns with the dummy.data.frame() function from the dummies package, since KNN works on numeric distances. Then, we split the data into training and testing sets using the createDataPartition() function from the caret package.

##SAMPLING OF DATA - Splitting of Data columns into Training and Test dataset##

categorical_col_updated = c('season','yr','mnth','weathersit','holiday')

library(dummies)
bike = bike_data
bike = dummy.data.frame(bike,categorical_col_updated)
dim(bike)

#Splitting the rows into training (80%) and test (20%) sets
library(caret)
set.seed(101)
split_val = createDataPartition(bike$cnt, p = 0.80, list = FALSE) 
train_data = bike[split_val,]
test_data = bike[-split_val,]
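
Note: the dummies package has been archived on CRAN, so it may not install on recent versions of R. As a hedged alternative, caret’s dummyVars() produces equivalent one-hot columns; here is a sketch under that assumption:

#Alternative one-hot encoding with caret::dummyVars(), in case the
#'dummies' package is unavailable on your R version
library(caret)
dv = dummyVars(~ ., data = bike_data[,categorical_col_updated])
one_hot = data.frame(predict(dv, newdata = bike_data))
bike = cbind(bike_data[,!(names(bike_data) %in% categorical_col_updated)], one_hot)
dim(bike)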

5. Error Metric

To evaluate the performance of the model, we define error metrics such as the Mean Absolute Percentage Error (MAPE) and the R-square metric (coefficient of determination).

#Defining error metrics to check the error rate and accuracy of the Regression ML algorithms

#1. MEAN ABSOLUTE PERCENTAGE ERROR (MAPE)
MAPE = function(y_actual,y_predict){
  mean(abs((y_actual-y_predict)/y_actual))*100
}

#2. R SQUARE error metric -- Coefficient of Determination
RSQUARE = function(y_actual,y_predict){
  cor(y_actual,y_predict)^2
}
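
As a quick sanity check, the metrics can be tried on toy vectors (the numbers below are made up for illustration, not taken from the dataset):

#Toy sanity check of the error metrics (hypothetical values)
MAPE(c(100, 200, 300), c(110, 180, 330))     #returns 10, i.e. 10% average error
RSQUARE(c(100, 200, 300), c(110, 180, 330))  #close to 1 for a near-linear fit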

6. Modeling

We make use of the FNN::knn.reg() function to build a KNN regression model from the training data and then measure the accuracy of its predictions on the test data.

##MODEL: KNN Regression
#Note: knn.reg() expects purely numeric input; any remaining non-numeric
#columns (e.g. 'dteday') should be dropped from the train/test frames first.
KNN_model = FNN::knn.reg(train = train_data, test = test_data, y = train_data$cnt, k = 3)

KNN_predict = ceiling(KNN_model$pred) #Predicted counts for every test row

KNN_MAPE = MAPE(test_data$cnt, KNN_predict)

Accuracy_KNN = 100 - KNN_MAPE
print("MAPE: ")
print(KNN_MAPE)
print('Accuracy of KNN: ')
print(Accuracy_KNN)

Output:

"MAPE: "
47.91161
"Accuracy of KNN: "
52.08839
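
An accuracy of around 52% leaves plenty of room for improvement. A common next step, sketched here under the assumption that the same train_data and test_data objects are still in scope, is to try several values of K and keep the one with the lowest MAPE:

#Hedged sketch: evaluate several candidate values of K on the same split
for (k in c(1, 3, 5, 7, 9)) {
  model_k = FNN::knn.reg(train = train_data, test = test_data,
                         y = train_data$cnt, k = k)
  cat("k =", k, " MAPE =", MAPE(test_data$cnt, ceiling(model_k$pred)), "\n")
}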

Conclusion

With this, we have come to the end of this topic. For the Bike Rental Count Prediction problem used above, we obtained an accuracy of 52.08%.

Try implementing the same steps for other datasets and problems and do let us know the accuracy in the comment section.

Till then, Happy Learning!! 🙂
