Boosting in R | Another Ensemble-Based Method


Are you a big fan of ensemble models? Well, here is boosting in R, yet another ensemble-based method. In this article, we will explore how boosting works in R and how we can use it to make our model predictions better. Let’s roll!


Boosting in R

Boosting in R is an ensemble-based method used to boost the performance of weak learners. Similar to bagging, boosting algorithms build an ensemble of models trained on resamples of the data, and a vote across the models decides the final prediction.

Before moving forward, you should understand two ways in which boosting differs from bagging.

  • The resampled datasets in boosting are constructed to generate complementary learners: each new learner concentrates on the examples the previous ones found hard to classify.
  • Boosting does not give every model an equal vote the way bagging does. Votes are weighted by individual performance, so the better a model performs, the greater its influence on the final prediction.
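To make the weighted-voting idea concrete, here is a minimal base-R sketch. The learner predictions and error rates are made-up numbers for illustration; adabag computes these internally:

```r
# Three hypothetical weak learners predict a class for one example
predictions <- c("1", "2", "2")

# AdaBoost-style learner weights: alpha = 0.5 * log((1 - err) / err),
# so more accurate learners get a larger say in the vote
errors <- c(0.05, 0.45, 0.40)
alphas <- 0.5 * log((1 - errors) / errors)

# Sum the weight behind each class and pick the heaviest
vote_totals      <- tapply(alphas, predictions, sum)
final_prediction <- names(which.max(vote_totals))
final_prediction
```

Note that class "1" wins here even though two of the three learners voted "2": the single accurate learner carries far more weight than the two weak ones combined.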

Adaboost – Adaptive Boosting in R

The concept of AdaBoost was first proposed by Freund and Schapire back in the year 1997. AdaBoost, or adaptive boosting, generates weak learners sequentially, training each new learner on a sample that emphasises the examples the earlier learners found difficult to classify.
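The reweighting idea can be sketched in a few lines of base R. This is a toy example with made-up labels, not adabag’s internal code:

```r
# Toy labels and one weak learner's predictions (one mistake, on example 2)
truth <- c(1, 1, 2, 2, 1)
pred  <- c(1, 2, 2, 2, 1)

# Start with uniform example weights
w <- rep(1 / 5, 5)

# Weighted error of this learner and its AdaBoost weight
err   <- sum(w[pred != truth])
alpha <- 0.5 * log((1 - err) / err)

# Up-weight the misclassified example, down-weight the rest,
# then renormalise so the weights sum to 1
w <- w * exp(ifelse(pred != truth, alpha, -alpha))
w <- w / sum(w)
round(w, 3)   # the misclassified example now carries half the total weight
```

The next weak learner is trained against these updated weights, which is how the ensemble ends up focusing on the hard examples.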

  • Adabag package:

You can use the adabag package to implement the AdaBoost.M1 classifier. Once the classifier is trained, you can use it for predictions on unseen data. You can measure the error rate on a separate test dataset, or use cross-validation instead.

Well, I hope you got a good understanding of boosting, AdaBoost and adabag. Now, let’s see all of them in action.


Credit dataset

We are using the credit dataset for this purpose. Let’s explore the data using functions such as str() and summary() to get some insights into it.

#Read the dataset
df <- read.csv('credit.csv')
#Explore the datatypes
str(df)
'data.frame':	1000 obs. of  20 variables:
 $ months_loan_duration: chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ credit_history      : int  6 48 12 42 24 36 24 36 12 30 ...
 $ purpose             : chr  "critical" "repaid" "critical" "repaid" ...
 $ amount              : chr  "radio/tv" "radio/tv" "education" "furniture" ...
 $ savings_balance     : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ employment_length   : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ installment_rate    : chr  "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
 $ personal_status     : int  4 2 2 2 3 2 3 2 2 4 ...
 $ other_debtors       : chr  "single male" "female" "single male" "single male" ...
 $ residence_history   : chr  "none" "none" "none" "guarantor" ...
 $ property            : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : chr  "real estate" "real estate" "real estate" "building society savings" ...
 $ installment_plan    : int  67 22 49 45 53 35 53 35 61 28 ...
 $ housing             : chr  "none" "none" "none" "none" ...
 $ existing_credits    : chr  "own" "own" "own" "for free" ...
 $ default             : int  2 1 1 1 2 1 1 1 1 2 ...
 $ dependents          : int  1 2 1 1 2 1 1 1 1 2 ...
 $ telephone           : int  1 1 2 2 2 2 1 1 1 1 ...
 $ foreign_worker      : chr  "yes" "none" "none" "none" ...
 $ job                 : chr  "yes" "yes" "yes" "yes" ...

Take some time to analyze and understand what these numbers are telling us. Describing the data is a key aspect of any analysis work, and you should spend some time here.
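Base-R helpers such as table() and prop.table() are handy for this kind of description. A self-contained toy illustration of checking class balance (the labels here are invented, not taken from the credit data):

```r
# A small vector of class labels standing in for a target column
y <- c(1, 1, 2, 1, 2, 1, 1, 2, 1, 1)

# Raw counts per class
table(y)

# The same counts as proportions, which are easier to compare
prop.table(table(y))
```

A heavily imbalanced target is worth knowing about before you train, since it affects how you should read the error rate later.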


AdaBoost classifier using boosting in R

So, we have the data. I hope you spent some time understanding what it is about. Now, we can move forward and create the train and test data to perform boosting.

You can start by loading the required libraries (install caret and adabag first if you don’t have them).

#Load required libraries
library(caret)
library(adabag)
#Creates the train and test split [90:10]
credit_data <- createDataPartition(df$default, p=0.90, list = F)
train_data <- df[credit_data, ]
test_data <- df[-credit_data, ]


We have created the train and test data with a 90:10 ratio: 90% training data and 10% test data. You can see a glimpse of the train and test data below.
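For comparison, a plain base-R version of the same 90:10 split can be sketched with sample(); the seed is arbitrary and only makes the split reproducible:

```r
set.seed(42)

# Draw 90% of the row indices for training
n <- 1000
train_idx <- sample(seq_len(n), size = 0.9 * n)
test_idx  <- setdiff(seq_len(n), train_idx)

length(train_idx)   # 900 training rows
length(test_idx)    # 100 test rows
```

caret’s createDataPartition has the advantage of stratifying on the outcome variable, which a plain sample() does not do.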

#Explore train data
str(train_data)
'data.frame':	901 obs. of  20 variables:
 $ months_loan_duration: chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ credit_history      : int  6 48 12 42 24 36 24 36 12 30 ...
 $ purpose             : chr  "critical" "repaid" "critical" "repaid" ...
 $ amount              : chr  "radio/tv" "radio/tv" "education" "furniture" ...
 $ savings_balance     : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ employment_length   : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ installment_rate    : chr  "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
 $ personal_status     : int  4 2 2 2 3 2 3 2 2 4 ...
 $ other_debtors       : chr  "single male" "female" "single male" "single male" ...
 $ residence_history   : chr  "none" "none" "none" "guarantor" ...
 $ property            : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : chr  "real estate" "real estate" "real estate" "building society savings" ...
 $ installment_plan    : int  67 22 49 45 53 35 53 35 61 28 ...
 $ housing             : chr  "none" "none" "none" "none" ...
 $ existing_credits    : chr  "own" "own" "own" "for free" ...
 $ default             : int  2 1 1 1 2 1 1 1 1 2 ...
 $ dependents          : int  1 2 1 1 2 1 1 1 1 2 ...
 $ telephone           : int  1 1 2 2 2 2 1 1 1 1 ...
 $ foreign_worker      : chr  "yes" "none" "none" "none" ...
 $ job                 : chr  "yes" "yes" "yes" "yes" ...
#Explore test data
str(test_data)
'data.frame':	99 obs. of  20 variables:
 $ months_loan_duration: chr  "unknown" "unknown" "unknown" "1 - 200 DM" ...
 $ credit_history      : int  9 10 6 24 27 12 36 12 18 24 ...
 $ purpose             : chr  "critical" "critical" "fully repaid" "delayed" ...
 $ amount              : chr  "car (new)" "furniture" "radio/tv" "furniture" ...
 $ savings_balance     : int  2134 2069 426 2333 5965 6468 1953 1007 1568 3617 ...
 $ employment_length   : chr  "< 100 DM" "unknown" "< 100 DM" "unknown" ...
 $ installment_rate    : chr  "1 - 4 yrs" "1 - 4 yrs" "> 7 yrs" "0 - 1 yrs" ...
 $ personal_status     : int  4 2 4 4 1 2 4 4 3 4 ...
 $ other_debtors       : chr  "single male" "married male" "married male" "single male" ...
 $ residence_history   : chr  "none" "none" "none" "none" ...
 $ property            : int  4 1 4 2 2 1 4 1 4 4 ...
 $ age                 : chr  "other" "other" "other" "building society savings" ...
 $ installment_plan    : int  48 26 39 29 30 52 61 22 24 20 ...
 $ housing             : chr  "none" "none" "none" "bank" ...
 $ existing_credits    : chr  "own" "own" "own" "own" ...
 $ default             : int  3 2 1 1 2 1 1 1 1 2 ...
 $ dependents          : int  1 1 1 1 1 2 2 1 1 1 ...
 $ telephone           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ foreign_worker      : chr  "yes" "none" "none" "none" ...
 $ job                 : chr  "yes" "no" "yes" "yes" ...

You have to convert the target variable (default) to a factor; boosting() expects a factor response and will throw an error otherwise.
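Converting with as.factor() is a one-liner; a quick self-contained illustration:

```r
# A numeric target column becomes a factor with one call
target   <- c(2, 1, 1, 1, 2)
target_f <- as.factor(target)

class(target_f)    # "factor"
levels(target_f)   # the distinct class labels, "1" and "2"
```

The same pattern applies to the default column in the train and test splits below.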

#Convert the target variable as factors 
train_data$default <- as.factor(train_data$default)
test_data$default <- as.factor(test_data$default)
#Trains the model
my_model <- boosting(default~., data = train_data, boos = T, mfinal = 10)

#Model in action
predict_model <- predict(my_model, test_data)

#Confusion matrix of the predictions 
predict_model$confusion

#Computes error
predict_model$error
               Observed Class
Predicted Class  1  2  3
              1 54  7  1
              2 10 25  2

Error: 0.2020202

Fantastic! You have built an AdaBoost classifier to predict the loan defaulters in the input dataset. That’s how boosting works in R. Feel free to explore more components of the predict_model object, such as the class votes and probabilities.


AdaBoost classifier with boosting.cv in R

The boosting.cv function is another method, which runs v-fold cross-validation: the data is split into v subsets, and the model is trained and evaluated v times, each time holding out a different subset for testing. Let’s see how it works in R.
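The fold mechanics behind v-fold cross-validation can be sketched in base R (a toy version with 10 rows and 5 folds; boosting.cv handles all of this for you):

```r
set.seed(1)
n <- 10
v <- 5

# Assign each row to one of v folds at random
folds <- sample(rep(1:v, length.out = n))

# Each iteration holds out one fold for testing
for (k in 1:v) {
  test_rows  <- which(folds == k)
  train_rows <- which(folds != k)
  # a model would be trained on train_rows and evaluated on test_rows here
}

length(unique(folds))   # all v folds are used, so every row is tested once
```

Because every row is held out exactly once, the cross-validated error is an estimate of performance on unseen data without needing a separate test set.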

#Convert target variable as factors
df$default <- as.factor(df$default)

#Create boosting.cv classifier 
model_cv <- boosting.cv(default~., data = df, boos = T, mfinal = 10, v=5)

#Measure the predictions 
model_cv$confusion

#Measure the error
model_cv$error
               Observed Class
Predicted Class   1   2   3   4
              1 543 102   9   3
              2  88 229  18   3
              3   2   2   1   0

Error: 0.227

That’s it. You have built two classifier models using the boosting and boosting.cv methods. The AdaBoost classifier performs reasonably well, which is exactly what boosting is designed for: combining weak learners into a stronger ensemble. You can try these methods on other datasets as well.


Ending note

Boosting in R is an ensemble-based method that boosts the performance of weak models. I hope that, after reading this, you can use boosting methods to improve your models’ performance. That’s all for now. Happy R!

Further reading: R documentation
