Linear Discriminant Analysis in R

Hello, readers! In this article, we will be focusing on Linear Discriminant Analysis in R programming, in detail.

So, let us begin!!


Working of Linear Discriminant Analysis

Prior to working with Linear Discriminant Analysis, let us first understand its emergence and origin in the domain of Data Science.

To solve real-life problems with data science and machine learning, we often work with huge datasets that must be processed, cleaned, and transformed before any algorithm is applied.

Among these steps, reducing the dimensionality of the data lowers the complexity of the model and helps it work efficiently. It is therefore important to understand what each column of the dataset contributes and what impact it has on the target variable.

This is where Linear Discriminant Analysis comes into the picture.

It is a dimension reduction technique that analyzes the columns of the dataset using statistical summaries such as the mean and the spread of the values within each class. It uses linear combinations of the predictors to predict the class of every observation that is fed to the model.

First, the process computes the mean of each predictor within every class (group), along with how much the values vary around those means. It then estimates, for each observation, the probability of belonging to each class and assigns the observation to the most probable one.

Linear Discriminant Analysis can be summarized in two steps:

  1. Find the linear combinations of the predictors that give the maximum separation between the classes.
  2. Use that separation to predict the class of each observation that is fed to the built model.
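
The two steps above can be sketched by hand in base R. This is a minimal illustration, not the code that lda() runs internally: it assumes the built-in iris data (not the dataset used later in this article), a covariance matrix shared by all classes, and class proportions as priors. All variable names are made up for the example.

```r
# Minimal from-scratch LDA on the built-in iris data (base R only).
X <- as.matrix(iris[, 1:4])        # numeric predictors
y <- iris$Species                  # class labels (factor with 3 levels)
classes <- levels(y)
n <- nrow(X)
K <- length(classes)

# Step 1: learn what separates the classes.
priors <- table(y) / n             # class proportions
means <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))

# Pooled within-class covariance (assumed to be shared by all classes).
pooled <- Reduce(`+`, lapply(classes, function(k) {
  Xk <- X[y == k, , drop = FALSE]
  (nrow(Xk) - 1) * cov(Xk)
})) / (n - K)

# Step 2: score every observation against every class, pick the best.
# delta_k(x) = x' S^-1 mu_k - 0.5 * mu_k' S^-1 mu_k + log(prior_k)
S_inv_mu <- solve(pooled, t(means))                  # one column per class
const <- -0.5 * colSums(t(means) * S_inv_mu) + log(as.numeric(priors))
scores <- sweep(X %*% S_inv_mu, 2, const, `+`)
predicted <- factor(classes[max.col(scores, ties.method = "first")],
                    levels = classes)

# Share of training rows assigned to their true class.
training_accuracy <- mean(predicted == y)
print(training_accuracy)
```

Each row of scores holds one linear score per class, and the predicted class is simply the column with the largest score.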

Assumptions of Linear Discriminant Analysis

  • The data needs to be normally distributed, i.e. each predictor should follow an approximately normal distribution within each class. Categorical variables need to be encoded into numeric values (for example, as dummy variables) before they are passed to LDA.
  • Feature scaling is recommended. Standardizing the predictors prior to applying LDA puts the discriminant coefficients on a comparable scale, which makes them easier to interpret.
  • The data needs to be free from outliers, because extreme values distort the class means and variances. It is therefore highly recommended to treat the outliers prior to processing.
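
The checks listed above could be approached roughly as follows. This is only a sketch in base R on the built-in iris data (an assumption for illustration; the bank loan data is not used here). The Shapiro-Wilk test and the 1.5 * IQR boxplot rule are just two of several possible choices for checking normality and flagging outliers.

```r
# Rough assumption checks on the built-in iris data (base R only).
predictors <- iris[, 1:4]
groups <- iris$Species

# Standardize each predictor to mean 0 and standard deviation 1.
scaled <- as.data.frame(scale(predictors))

# Shapiro-Wilk normality test for each predictor within each class;
# small p-values suggest a departure from normality.
normality_p <- sapply(scaled, function(col) {
  tapply(col, groups, function(v) shapiro.test(v)$p.value)
})

# Count values outside the boxplot whiskers (1.5 * IQR rule) per predictor.
outlier_counts <- sapply(scaled, function(col) length(boxplot.stats(col)$out))

print(round(normality_p, 3))
print(outlier_counts)
```

How flagged outliers are treated (removal, capping, imputation) depends on the data and is left open here.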

Now, let us focus on the practical implementation of the same.


Syntax of lda() function in R

R provides the ‘MASS’ library, which offers the lda() function for applying linear discriminant analysis to the data.

lda(formula, data)

Here, ‘formula’ has the form group ~ predictors: the grouping (target) variable goes on the left and the predictor variables on the right, and a dot (.) on the right stands for all remaining columns. The ‘data’ argument is the data frame that contains those variables.
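
As a quick illustration of the two formula styles, here is a small sketch that assumes the built-in iris data, with Species as the grouping variable. The object names fit_all and fit_two are chosen only for this example.

```r
library(MASS)  # provides lda()

# All remaining columns as predictors (the "." shorthand).
fit_all <- lda(Species ~ ., data = iris)

# Only two named predictors.
fit_two <- lda(Species ~ Petal.Length + Petal.Width, data = iris)

print(fit_two)
```

Printing a fitted model shows the prior probabilities, the group means, and the coefficients of the linear discriminants.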


Linear Discriminant Analysis in R – Practical Approach

In this example, we use a Bank Loan dataset, with the aim of predicting whether a customer is a loan defaulter or not.

You can find the dataset here!

Initially, we load the dataset into the R environment using the read.csv() function.

Next, we split the dataset into training and test sets using the createDataPartition() function from the caret package.

Finally, we apply the lda() function to the training data set with ‘default’ as the target variable, as shown below:

Example:

rm(list = ls())
# Set the working directory
setwd("D:/Loan_Defaulter")
getwd()
# Load the dataset
dta = read.csv("bank-loan.csv", header = TRUE)

# ---------------- Data sampling ----------------
# Encode the categorical column 'ed' as dummy variables (ed1, ed2, ...)
categorical_col = c('ed')
library(dummies)
data = dta
data = dummy.data.frame(data, categorical_col)
dim(data)

# Split the data into 80% training and 20% test rows
library(caret)
set.seed(101)
split = createDataPartition(data$default, p = 0.80, list = FALSE)
train_data = data[split,]
test_data = data[-split,]

# Fit the LDA model on the training data and predict on the test data
library(MASS)  # provides lda()
model_lda <- lda(default~., data = train_data)
predictions_lda <- predict(model_lda, test_data)
print(model_lda)

Output:

The output below can be read in three parts:

  1. Prior probabilities of groups: the proportion of each class of ‘default’ (0 and 1) in the training data.
  2. Group means: the mean of every predictor within each class.
  3. Coefficients of linear discriminants: the weight of every predictor in the linear combination (LD1) that separates the classes.

Call:
lda(default ~ ., data = train_data)

Prior probabilities of groups:
        0         1 
0.7379679 0.2620321 

Group means:
       age       ed1       ed2        ed3        ed4         ed5   employ  address   income  debtinc creddebt  othdebt
0 35.77295 0.5579710 0.2826087 0.09903382 0.05314010 0.007246377 9.734300 8.958937 48.70773  8.73285 1.247905 2.905252
1 33.12245 0.4013605 0.3469388 0.17006803 0.07482993 0.006802721 5.387755 6.387755 41.99320 14.32109 2.444348 3.814618

Coefficients of linear discriminants:
                  LD1
age       0.016884783
ed1      -0.181500101
ed2       0.108155258
ed3       0.320127893
ed4      -0.236784568
ed5       0.286471171
employ   -0.120510587
address  -0.045972955
income    0.003498294
debtinc   0.089782731
creddebt  0.296260235
othdebt  -0.069878104
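
The example stores the test-set predictions in predictions_lda but does not inspect them. The sketch below shows what such an object contains and one way to evaluate it. It assumes the built-in iris data and a simple random split in place of the bank loan data and createDataPartition(), so the numbers it prints do not correspond to the output above.

```r
library(MASS)

set.seed(101)
# Simple random 80/20 split (not stratified, unlike createDataPartition()).
train_idx <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

fit <- lda(Species ~ ., data = train)
pred <- predict(fit, test)

# predict() on an lda fit returns a list with three components:
#   class     - predicted class for each row
#   posterior - matrix of class membership probabilities
#   x         - the linear discriminant scores
str(pred, max.level = 1)

# The priors printed by lda() are the class proportions in the training data.
prior_check <- as.numeric(table(train$Species) / nrow(train))

# Confusion matrix and overall accuracy on the held-out rows.
confusion <- table(predicted = pred$class, actual = test$Species)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

For the bank loan model, the same pattern would compare predictions_lda$class against test_data$default.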

Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.

For more such posts related to R programming, stay tuned with us.

Till then, Happy Learning!! 🙂
