Hello, readers! In this article, we will be focusing on Linear Discriminant Analysis in R programming, in detail.
So, let us begin!!
Working of Linear Discriminant Analysis
Before working with Linear Discriminant Analysis, let us first understand where it fits in the domain of data science.
To solve real-life problems using data science and machine learning, we often work with huge datasets that have to be processed, cleaned, and transformed before any algorithm is applied.
Among these steps, reducing the dimensionality of the data lowers the complexity of the model and helps it work efficiently. Thus, it is important to understand what every column of the dataset contributes and what impact it has on the target variable.
This is where Linear Discriminant Analysis comes into the picture.
It is a dimension reduction and classification technique that examines the predictor columns of the dataset on statistical grounds, such as their means and variances within each class. It uses a linear combination of the predictors to predict the class of every observation that is fed to the model.
First, the procedure computes the mean of each predictor within each class (the group means) along with the spread of the data. It then estimates the probability that an observation belongs to each of the classes and assigns the observation to the most probable one.
Explanation of Linear Discriminant Analysis in two steps:
- Detects the linear combinations of predictors that give maximum separation between the classes of the data values.
- Uses this separation to predict the class of each observation that is fed to the built model.
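The two steps above can be sketched as follows. This is only a minimal illustration: it assumes R's built-in iris dataset as a stand-in (it is not the dataset used later in this article) and the lda() function from the MASS package.

```r
library(MASS)  # provides lda()

# Assumption: iris is used purely as a stand-in dataset for illustration.

# Step 1: fit the model. lda() looks for the linear combinations of the
# predictors that best separate the three species.
fit <- lda(Species ~ ., data = iris)

# Step 2: use the fitted model to predict the class of each observation.
pred <- predict(fit, newdata = iris)

head(pred$class)      # predicted class labels
head(pred$posterior)  # estimated probability of belonging to each class
head(pred$x)          # discriminant scores (LD1, LD2)
```

Each row of pred$posterior holds the class probabilities for one observation, and pred$class is the class with the highest probability.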
Assumptions of Linear Discriminant Analysis
- The predictors should be approximately normally distributed within each class, and the classes are assumed to share a common covariance matrix. Categorical variables need to be encoded as numeric values (for example, as dummy variables) before the model is fitted.
- Feature scaling is recommended. Standardizing the predictors before applying LDA does not change the predicted classes, but it puts the discriminant coefficients on a comparable scale.
- The data should be free from outliers, since they distort the estimated means and variances. Thus, it is highly recommended to treat the outliers prior to processing.
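A rough sketch of how these points could be checked before fitting is shown below. It assumes the built-in iris dataset as a stand-in and uses only base R functions. The 1.5 * IQR rule and the Shapiro-Wilk test are just one possible choice of checks, not a prescribed procedure.

```r
# Assumption: iris stands in for your own data; columns 1 to 4 are numeric.
X <- iris[, 1:4]

# Standardize each predictor to mean 0 and standard deviation 1
X_scaled <- as.data.frame(scale(X))
round(colMeans(X_scaled), 10)
sapply(X_scaled, sd)

# Flag potential outliers per column using the boxplot (1.5 * IQR) rule
outliers <- lapply(X, function(col) boxplot.stats(col)$out)
sapply(outliers, length)  # number of flagged values in each column

# Rough normality check for one predictor within one class
setosa_width <- iris$Sepal.Width[iris$Species == "setosa"]
normality <- shapiro.test(setosa_width)
normality$p.value  # a small p-value hints at a departure from normality
```

How the flagged outliers are treated (removal, capping, imputation) depends on the data and is left open here.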
Now, let us focus on the practical implementation of the same.
Syntax of lda() function in R
R provides us with the ‘MASS’ library, which offers the lda() function to apply linear discriminant analysis on the data values. Its basic form is lda(formula, data).
Here, ‘formula’ has the form group ~ predictors: the grouping (target) variable goes on the left-hand side and the predictor variables on the right-hand side. The ‘data’ argument is the data frame that lda() works on.
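To make the syntax concrete, here are a few hedged example calls. They assume the built-in iris dataset as a stand-in, with Species as the grouping variable. The prior values in the last call are arbitrary numbers chosen only for illustration.

```r
library(MASS)  # provides lda()

# Assumption: iris is a stand-in dataset; Species is the grouping variable.

# Use all remaining columns as predictors
fit_all <- lda(Species ~ ., data = iris)

# Use only selected predictors
fit_two <- lda(Species ~ Petal.Length + Petal.Width, data = iris)

# Optionally supply prior probabilities, in the order of the factor levels
# (illustrative values; by default the class proportions are used)
fit_prior <- lda(Species ~ ., data = iris, prior = c(0.5, 0.25, 0.25))
fit_prior$prior
```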
Linear Discriminant Analysis in R – Practical Approach
In this example, we use the Bank Loan dataset, where the aim is to predict whether a customer is a loan defaulter or not.
You can find the dataset here!
First, we load the dataset into the R environment using the read.csv() function and encode the categorical column ‘ed’ as dummy variables.
Next, we split the dataset into training and test sets using the createDataPartition() function from the caret package.
Finally, we apply the lda() function with ‘default’ as the target variable on the training set, as shown below:
rm(list = ls())

# Setting the working directory
setwd("D:/Loan_Defaulter")
getwd()

# Load the dataset
dta = read.csv("bank-loan.csv", header = TRUE)

################ Data SAMPLING ################
# Encode the categorical column 'ed' as dummy variables (ed1 ... ed5)
categorical_col = c('ed')
library(dummies)
data = dta
data = dummy.data.frame(data, categorical_col)
dim(data)

# Split the data into 80% training and 20% test rows
library(caret)
set.seed(101)
split = createDataPartition(data$default, p = 0.80, list = FALSE)
train_data = data[split,]
test_data = data[-split,]

# Fit LDA on the training data and predict on the test data
library(MASS)  # provides lda()
model_lda <- lda(default~., data = train_data)
predictions_lda <- predict(model_lda, test_data)
print(model_lda)
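As a side note, the predictions_lda object is computed above but not inspected further in this article. The sketch below shows one common way such predictions could be evaluated, using a confusion matrix and overall accuracy. Since bank-loan.csv is not bundled with R, it assumes the built-in iris dataset as a stand-in and a simple base R split in place of createDataPartition().

```r
library(MASS)  # provides lda()

# Assumption: iris stands in for the bank loan data so that the sketch runs
# on its own; sample() replaces caret's createDataPartition() here.
set.seed(101)
idx <- sample(seq_len(nrow(iris)), size = floor(0.8 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

fit  <- lda(Species ~ ., data = train)
pred <- predict(fit, newdata = test)

# Confusion matrix: predicted class against actual class
conf_mat <- table(Predicted = pred$class, Actual = test$Species)
print(conf_mat)

# Overall accuracy on the test set
accuracy <- mean(pred$class == test$Species)
accuracy
```

With the objects from the article's own code, the analogous call would be table(predictions_lda$class, test_data$default).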
The output of print(model_lda) below can be read in three parts:
- Prior probabilities of groups: the proportion of each class (0 and 1) in the training data.
- Group means: the mean of every predictor within each class.
- Coefficients of linear discriminants: the weight of every predictor in the linear discriminant LD1.
Call:
lda(default ~ ., data = train_data)

Prior probabilities of groups:
        0         1
0.7379679 0.2620321

Group means:
       age       ed1       ed2        ed3        ed4         ed5
0 35.77295 0.5579710 0.2826087 0.09903382 0.05314010 0.007246377
1 33.12245 0.4013605 0.3469388 0.17006803 0.07482993 0.006802721
    employ  address   income  debtinc creddebt  othdebt
0 9.734300 8.958937 48.70773  8.73285 1.247905 2.905252
1 5.387755 6.387755 41.99320 14.32109 2.444348 3.814618

Coefficients of linear discriminants:
                  LD1
age       0.016884783
ed1      -0.181500101
ed2       0.108155258
ed3       0.320127893
ed4      -0.236784568
ed5       0.286471171
employ   -0.120510587
address  -0.045972955
income    0.003498294
debtinc   0.089782731
creddebt  0.296260235
othdebt  -0.069878104
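To see how these three parts relate to the fitted object, the following sketch pulls them out of a model and recomputes the discriminant scores by hand. It assumes the built-in iris dataset as a stand-in, so the numbers differ from the output above. The manual centering step reflects how predict() computes its scores with the default priors.

```r
library(MASS)  # provides lda()

# Assumption: iris is a stand-in dataset, used only to keep this runnable.
fit <- lda(Species ~ ., data = iris)

fit$prior    # prior probabilities of groups
fit$means    # group means of every predictor
fit$scaling  # coefficients of linear discriminants

# By default, the priors are the class proportions in the training data
class_props <- as.numeric(table(iris$Species) / nrow(iris))

# Discriminant scores by hand: center the predictors on the prior-weighted
# average of the group means, then multiply by the coefficient matrix
X <- as.matrix(iris[, 1:4])
center <- colSums(fit$prior * fit$means)
scores_manual <- sweep(X, 2, center) %*% fit$scaling

# Compare with the scores returned by predict()
scores_pred <- predict(fit)$x
all.equal(unname(scores_manual), unname(scores_pred))
```

In other words, each coefficient is the weight its predictor receives when an observation's discriminant score is computed.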
With this, we have come to the end of this topic. Feel free to comment below in case you come across any questions.
For more such posts related to R programming, stay tuned with us.
Till then, Happy Learning!! 🙂