Logistic Regression in R programming


Hello, readers! In our series on Machine Learning with R, we will look in detail at one of the most widely used algorithms in ML: Logistic Regression, implemented in R.


So, what is Logistic Regression?

Before getting started with Logistic Regression, let us first look at why Machine Learning models are needed.

Machine Learning models enable us to detect patterns in labeled or unlabeled data and use those patterns to make predictions. This helps us solve real-life problems and make estimates.

Supervised Machine Learning handles both classification problems, where the target is a category, and regression problems, where the target is a continuous value.

One of the most widely used algorithms for classification is Logistic Regression!

Logistic Regression is a supervised classification algorithm that predicts a binary or, in its multinomial form, a multi-class label. The target variable must be categorical, while the predictors can be numeric or categorical.

With Logistic Regression, we can build models that segregate and classify binary data values.

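Logistic Regression uses the logit function to separate the binary labels. The logit is the log of the odds, log(p / (1 - p)), where p is the probability of the positive class. The model expresses the logit as a linear combination of the predictors and converts it back into a probability to predict the label.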
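
As a small illustration of how probability, odds, and the logit relate to each other, here is a sketch in base R. The example probabilities and the variable names are made up for illustration; qlogis() and plogis() are the logit and inverse-logit functions that ship with R's stats package.

```r
# Hypothetical example probabilities of the positive class
p <- c(0.1, 0.5, 0.9)

# Odds: how much more likely the event is than its complement
odds <- p / (1 - p)

# Logit: the natural log of the odds. This is the quantity that
# logistic regression models as a linear function of the predictors.
logit <- log(odds)

# Base R's qlogis() computes the same logit, and plogis() inverts it
stopifnot(isTRUE(all.equal(logit, qlogis(p))))
p_back <- plogis(logit)

print(data.frame(p = p, odds = odds, logit = logit, p_back = p_back))
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0, which is why 0.5 is a natural default cut-off when turning predicted probabilities into labels.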

Let us consider an example to get a clear understanding of Logistic Regression.

Logistic Regression can help us classify emails as ‘Spam’ or ‘Not Spam’ or identify financial NPAs such as bank defaulters. Let’s take this as our example and allow me to show you a quick demo of how we can work this out.


Logistic Regression in R – A Practical Approach

Having understood Logistic Regression, let us now move on to its implementation.

In this example, we will try to predict whether a customer is a bank loan defaulter or not. You can find the dataset below!

Defaulter Prediction Dataset

1. Load the dataset

Let us now load the dataset into the R environment. We will use the read.csv() function to load the data values.

#Remove all the existing objects
rm(list = ls())

#Setting the working directory
setwd("Santander Prediction/")
getwd()

#Load the dataset
train_data = read.csv("train.csv",header=TRUE)
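
After loading, it usually helps to take a quick look at the data before modelling. The sketch below writes a tiny made-up CSV to a temporary file so that it runs on its own; the column names target, var_0 and var_1 are hypothetical and are not taken from the actual dataset.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
# (the column names "target", "var_0", "var_1" are hypothetical)
tmp_file <- tempfile(fileext = ".csv")
toy <- data.frame(
  target = c(0, 0, 0, 1, 0, 1),
  var_0  = c(1.2, 0.7, 3.1, 2.2, 0.4, 2.9),
  var_1  = c(10, 12, 9, 15, 11, 14)
)
write.csv(toy, tmp_file, row.names = FALSE)

# Load it with read.csv(), as in the step above
toy_data <- read.csv(tmp_file, header = TRUE)

# Inspect dimensions, column types, and missing values
print(dim(toy_data))
str(toy_data)
n_missing <- sum(is.na(toy_data))
print(n_missing)

# Check the class balance of the (hypothetical) target column
class_share <- prop.table(table(toy_data$target))
print(class_share)
```

The class balance matters for what follows: when one class is rare, a model can reach high accuracy while still missing most of the rare cases.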

2. Sampling of data

Before modelling, let us first split the dataset into training and testing sets. We use the createDataPartition() function from the caret package to create the split.

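###SAMPLING OF DATA###
library(caret)
#train_independent (predictors) and train_dependent (target) are assumed to have been prepared beforehand
clean_data = cbind(train_independent,train_dependent)
set.seed(123) #for a reproducible split
split_index = createDataPartition(clean_data$train_dependent , p=.80 ,list=FALSE)
X = clean_data[split_index,]   #training set (80%)
Y  = clean_data[-split_index,] #testing set (20%)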
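
To make the idea of a stratified 80/20 split concrete, here is a rough base-R sketch on simulated data. It is not what createDataPartition() does internally, only an approximation of the same idea for a binary outcome: sample 80% of the rows within each class, so that both parts keep a similar class ratio. The data frame toy and its columns are invented for the example.

```r
set.seed(42)  # make the random split reproducible

# Simulated data: 1000 rows, roughly 10% positives (hypothetical)
toy <- data.frame(
  x = rnorm(1000),
  y = rbinom(1000, size = 1, prob = 0.1)
)

# For each class, sample 80% of its row indices
rows_by_class <- split(seq_len(nrow(toy)), toy$y)
train_index <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}))

train_set <- toy[train_index, ]
test_set  <- toy[-train_index, ]

# Both parts should show a similar share of positives
print(mean(train_set$y))
print(mean(test_set$y))
```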

3. Error Metrics

To evaluate any machine learning model, we need to define suitable error metrics. For Logistic Regression, we will use the Confusion Matrix and derive other metrics from it, such as Precision, Recall, and F1 score.

#error metrics -- Confusion Matrix

error_metric=function(CM)
{
  #CM is expected as table(actual, predicted): rows = actual, columns = predicted
  TN =CM[1,1]
  TP =CM[2,2]
  FP =CM[1,2]
  FN =CM[2,1]
  precision =(TP)/(TP+FP)
  #Recall (sensitivity): share of actual positives that are predicted as positive
  recall_score =(TP)/(TP+FN)
  f1_score=2*((precision*recall_score)/(precision+recall_score))
  accuracy_model  =(TP+TN)/(TP+TN+FP+FN)
  False_positive_rate =(FP)/(FP+TN)
  False_negative_rate =(FN)/(FN+TP)
  print(paste("Precision value of the model: ",round(precision,2)))
  print(paste("Accuracy of the model: ",round(accuracy_model,2)))
  print(paste("Recall value of the model: ",round(recall_score,2)))
  print(paste("False Positive rate of the model: ", round(False_positive_rate,2)))
  print(paste("False Negative rate of the model: ", round(False_negative_rate,2)))
  print(paste("f1 score of the model: ",round(f1_score,2)))
}
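
As a worked example of these definitions, the sketch below builds a confusion matrix from a small set of made-up labels and computes the standard metrics by hand. The vectors actual and predicted are invented; rows hold the actual class and columns the predicted class, matching the table(actual, predicted) layout.

```r
# Made-up labels: 6 actual negatives and 4 actual positives
actual    <- factor(c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1), levels = c(0, 1))
predicted <- factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 0), levels = c(0, 1))

# Rows = actual class, columns = predicted class
cm <- table(actual, predicted)
print(cm)

TN <- cm[1, 1]  # actual 0, predicted 0
FP <- cm[1, 2]  # actual 0, predicted 1
FN <- cm[2, 1]  # actual 1, predicted 0
TP <- cm[2, 2]  # actual 1, predicted 1

precision <- TP / (TP + FP)                   # 3 / 4
recall    <- TP / (TP + FN)                   # 3 / 4
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (TP + TN) / sum(cm)              # 8 / 10
fpr       <- FP / (FP + TN)                   # 1 / 6
fnr       <- FN / (FN + TP)                   # 1 / 4

print(c(precision = precision, recall = recall, f1 = f1,
        accuracy = accuracy, fpr = fpr, fnr = fnr))
```

Note that recall and the false negative rate always add up to 1, since both are computed over the actual positives.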

4. Finally, let us apply the model!

Below, we have applied the logistic regression model on our dataset and further evaluated the model.

#Fit the logistic regression model on the training set
logit_model =glm(formula = train_dependent~. ,data =X ,family='binomial')
summary(logit_model)
#Predicted probabilities on the testing set (column 201 is the target)
logit_prob = predict(logit_model , Y[-201] ,type = 'response' )
logit_predict <- ifelse(logit_prob > 0.5,1,0) # Probability check
CM= table(Y[,201] , logit_predict)
error_metric(CM)
library(pROC)
#The ROC curve is computed from the probabilities, not the thresholded labels
roc_score=roc(Y[,201], logit_prob)
plot(roc_score ,main ="ROC curve for Logistic Regression ")

Explanation:

  • The glm() function with family = 'binomial' fits the logistic regression model on the training data.
  • The predict() function with type = 'response' returns predicted probabilities for the testing data, which we convert into 0/1 labels using a 0.5 cut-off.
  • Having applied the model, we build the confusion matrix from the actual and predicted labels using the table() function.
  • Using the plot() function, we plot the ROC curve built with the roc() function from the pROC package.
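
The same sequence of steps can be sketched end to end on simulated data, so that it runs without the dataset or any extra packages. Everything here is an assumption made for illustration: the data are generated with a known relationship between x and y, and the AUC is computed with the rank-sum (Mann-Whitney) identity in base R as a stand-in for the roc() call from pROC.

```r
set.seed(1)

# Simulated data (hypothetical): one predictor x, binary outcome y
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))
sim <- data.frame(x = x, y = y)

# Simple 80/20 split into training and test rows
train_rows <- sample(seq_len(n), size = 0.8 * n)
train_set <- sim[train_rows, ]
test_set  <- sim[-train_rows, ]

# Fit the logistic regression on the training rows
fit <- glm(y ~ x, data = train_set, family = binomial)

# Predicted probabilities on the test set, then class labels at 0.5
prob <- predict(fit, newdata = test_set, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)

# Confusion matrix: rows = actual, columns = predicted
cm <- table(actual = test_set$y, predicted = pred)
print(cm)

# AUC from the probabilities via the rank-sum (Mann-Whitney) identity
pos <- test_set$y == 1
auc <- (sum(rank(prob)[pos]) - sum(pos) * (sum(pos) + 1) / 2) /
  (sum(pos) * sum(!pos))
print(auc)
```

Keeping the probabilities in prob and the labels in pred as separate objects makes it easy to try a different cut-off later without refitting the model.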

Output:

[1] "Precision value of the model:  0.71"
[1] "Accuracy of the model:  0.91"
[1] "Recall value of the model:  0.01"
[1] "False Positive rate of the model:  0.01"
[1] "False Negative rate of the model:  0.72"
[1] "f1 score of the model:  0.03"
Figure: ROC curve for Logistic Regression

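So, as seen above, the Logistic Regression model reaches 91% accuracy on our dataset. Accuracy alone is misleading here, though: the false negative rate of 0.72 means the model misses most of the actual defaulters. Note also that the printed recall (0.01) is identical to the false positive rate, which points to it having been computed as FP/(FP+TN). Recall is TP/(TP+FN), that is, 1 minus the false negative rate, which comes to about 0.28 for this output.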


Conclusion

With this, we have come to the end of this topic. Feel free to comment below if you have any questions.

For more such posts related to R, stay tuned! And till then, Happy Learning!! 🙂
