Bagging in R for Machine Learning


Now, in this article, we will talk about Bootstrap Aggregation, also known as Bagging, in R for machine learning tasks. In one of the previous articles, we discussed bootstrapping in R, which draws samples with replacement from the original data.

Bagging is one of the first ensemble algorithms most machine learning practitioners learn, and it improves both the accuracy and the stability of classification and regression models. If I have to quote another major advantage – bagging helps to reduce variance, which in turn helps the model avoid overfitting. Let’s see how it works.

Bagging in R

Before defining bagging in R, we need to understand two things.

Bagging = Bootstrapping + Aggregation.

Bootstrapping means creating many samples, drawn with replacement, from the original data. Aggregation means combining the outcomes from all those samples into a single, more stable estimate.
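The two steps can be sketched in base R. This is a toy example that bootstraps the mean of a numeric vector; the variable names (`B`, `boot_means`) are purely illustrative and not part of any package.

```r
# Step 1 - Bootstrapping: draw B samples with replacement
# Step 2 - Aggregation: combine the B estimates into one
set.seed(42)
x <- iris$Sepal.Length

B <- 100
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# Aggregate the B bootstrap estimates into a single, stable statistic
mean(boot_means)
```

The aggregated value will sit very close to `mean(x)`, but with the bonus that `boot_means` also tells you how much the estimate varies from sample to sample.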

You can use the bagging technique to reduce the variance of a model (its bias stays roughly the same), which in turn helps tackle overfitting.

As I already mentioned, bagging is mostly used with classification and regression models. Now, let’s understand aggregation in the context of each.

  • Classification: Here aggregation works by majority voting. Each bootstrapped model casts a vote for a class, and the class with the most votes becomes the final prediction.
  • Regression: Here the numeric predictions of the individual models are averaged to produce the final estimate.
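The two aggregation rules are easy to demonstrate in base R. The vectors below are made-up example predictions, not output from any fitted model.

```r
# Classification: majority vote across the models' predicted classes
votes <- c("setosa", "virginica", "setosa")   # predictions from 3 models
names(which.max(table(votes)))                # -> "setosa" (2 votes out of 3)

# Regression: average the models' numeric predictions
preds <- c(5.1, 5.3, 4.9)
mean(preds)                                   # -> 5.1
```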

So, we will use the “iris” data with the “ipred” package in R to illustrate bagging. Let’s roll!

Install the required packages in R

Let’s start by getting to know and installing the packages we need for bagging. We will be using 5 libraries:

  • caret: used for general model fitting.
  • dplyr: helps with data wrangling / pre-processing.
  • rpart: implements the decision trees that bagging builds on.
  • ipred: fits the bagged models.
  • e1071: a helper used when assessing variable importance.
#Install and load all the required R packages
install.packages(c('e1071', 'caret', 'rpart', 'ipred', 'dplyr'))

library(e1071)
library(caret)
library(rpart)
library(ipred)
library(dplyr)

Know Your Data

Done with installing packages?

If yes, then we can move forward and get the data. We are going to use the built-in “iris” dataset for this purpose. Let’s import and read the data in R.

#Import the data and preview the first 10 rows
df <- datasets::iris
head(df, 10)
           Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
  1             5.1         3.5          1.4         0.2        setosa
  2             4.9         3.0          1.4         0.2        setosa
  3             4.7         3.2          1.3         0.2        setosa
  4             4.6         3.1          1.5         0.2        setosa
  5             5.0         3.6          1.4         0.2        setosa
  6             5.4         3.9          1.7         0.4        setosa
  7             4.6         3.4          1.4         0.3        setosa
  8             5.0         3.4          1.5         0.2        setosa
  9             4.4         2.9          1.4         0.2        setosa
  10            4.9         3.1          1.5         0.1        setosa

Let’s peek into the data and see what it offers to us.

#Display content in the data
str(df)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 …
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 …
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 …
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 …
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 …
#Gives a six-number summary of each column (min, quartiles, mean, max)
summary(df)
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  

I hope the above numbers are self-explanatory and don’t need much discussion. Have a look.

Fitting the Bagged Model in R

Let’s use the bagging() function from the “ipred” library to fit our data to the model. Here is how it works.

#Fits the data into the model 

My_bagged_model <- bagging(
     formula = Species ~ .,
     data = df,
     nbagg = 100,   
     coob = TRUE,
     control = rpart.control(minsplit = 2, cp = 0) 
 )

My_bagged_model
Bagging classification trees with 100 bootstrap replications 

Call: bagging.data.frame(formula = Species ~ ., data = df, nbagg = 100, 
    coob = TRUE, control = rpart.control(minsplit = 2, cp = 0))

Out-of-bag estimate of misclassification error:  0.0467 

The rpart.control parameters –

  • minsplit: the minimum number of observations a node must contain before a split is attempted; setting it to 2 lets the trees grow fully.
  • cp: the complexity parameter; setting it to 0 disables pruning, so each tree is grown as large as the data allows.
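To see what these two settings do to an individual tree, you can fit a single unpruned rpart tree with the same control object. This is a side experiment for intuition only; a lone tree like this will be far more variable than the bagged ensemble.

```r
# One fully grown, unpruned tree with the same controls used above
library(rpart)
df <- datasets::iris
single_tree <- rpart(Species ~ ., data = df,
                     control = rpart.control(minsplit = 2, cp = 0))

# Inspect the complexity table: with cp = 0, nothing gets pruned away
printcp(single_tree)
```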

Running the Prediction

So, we have used the iris data, fit the model with 100 bags, and done all the hard work, right?

Now, we can check how well our model is predicting over new data. Excited?

#Here we have created a new data record which is unseen to our model
Unseen_data <- data.frame(Sepal.Length=7.22, Sepal.Width=3.98, Petal.Length=5.44, Petal.Width=2.5)

#Predict the Species 
predict(My_bagged_model,Unseen_data)
virginica
Levels: virginica
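You can also run the model over the whole training set and cross-tabulate the predictions against the true labels. Keep in mind that in-sample accuracy is optimistic; the out-of-bag error printed earlier is the more honest estimate.

```r
# Sanity check: compare in-sample predictions with the true species
in_sample <- predict(My_bagged_model, df)
table(Predicted = in_sample, Actual = df$Species)
```

A clean diagonal in this table means the ensemble classifies the training data almost perfectly, which is expected for fully grown trees on iris.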

Heyaa!

Fantastic. You have created a bagged model that predicts well. Since the out-of-bag misclassification error is small (about 4.7%), you can have reasonable confidence in its results.

Wrapping Up – Bagging in R

Bagging in R is one of the best methods for reducing the variance of a model. High variance causes overfitting, which in turn produces irregular, unstable predictions. Multiple packages will help you in developing bagged models, so feel free to try other approaches too, such as Bagging MART. I hope, by now, you have a good understanding of the bagging method and its uses. That’s all for now. Happy R!!!

More read: Bagging Classification and Regression Trees
