In this article, we will talk about Bootstrap Aggregation, also known as Bagging, in R for machine learning tasks. In one of the previous articles, we discussed bootstrapping in R, where bootstrap samples are drawn from the original data with replacement.
Bagging is one of the first ensemble algorithms that machine learning practitioners learn, and it helps in enhancing the accuracy and stability of classification and regression models. Its other major advantage is that it reduces variance, which helps in avoiding overfitting. Let’s see how it works.
Bagging in R
Before defining bagging in R, we need to understand its two components.
Bagging = Bootstrapping + Aggregation.
Bootstrapping means creating many samples from the original data by drawing with replacement. Aggregation means combining the results from all of those samples into a single, more stable estimate.
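To make the two steps concrete, here is a minimal sketch in base R. It is only an illustration: the choice of the mean of Sepal.Length as the statistic, the 200 bootstrap samples, and the seed are all assumptions made for this example and are not part of the bagging workflow used later.

```r
# Minimal sketch of bootstrapping + aggregation using base R only.
set.seed(42)                    # arbitrary seed, for reproducibility only

x <- iris$Sepal.Length          # original sample (150 values)
n_boot <- 200                   # number of bootstrap samples (assumed for illustration)

# Bootstrapping: draw samples of the same size as x, with replacement,
# and compute the statistic of interest (here, the mean) on each one.
boot_means <- replicate(
  n_boot,
  mean(sample(x, size = length(x), replace = TRUE))
)

# Aggregation: combine the per-sample results into a single estimate.
bagged_estimate <- mean(boot_means)

bagged_estimate                 # should be close to mean(x)
sd(boot_means)                  # spread of the bootstrap means
```

Bagging applies the same idea to models: instead of a mean, a model is fitted on each bootstrap sample, and the predictions are aggregated.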
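You can use the Bagging technique to reduce the variance of a model’s predictions. It is not primarily a bias-reduction technique, as the bias of the bagged model stays close to that of the individual models. The lower variance is what helps in tackling overfitting.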
As I already mentioned, Bagging is mostly used with classification and regression models. Now, let’s understand aggregation in the context of each.
- Classification: Here aggregation works on a voting system. Each model votes for a class, and the class with the most votes becomes the final prediction.
- Regression: Here the predictions of all the models are averaged to get the final prediction.
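The two aggregation rules can be sketched in a few lines of base R. The predictions below are made-up values from five hypothetical models for a single observation; they are not output from any fitted model.

```r
# Hypothetical class predictions from five models for one observation.
class_votes <- c("virginica", "versicolor", "virginica", "virginica", "setosa")

# Classification: count the votes and keep the most frequent class.
vote_counts <- table(class_votes)
majority_class <- names(vote_counts)[which.max(vote_counts)]
majority_class                  # "virginica" (3 of 5 votes)

# Hypothetical numeric predictions from five models for one observation.
numeric_preds <- c(5.1, 4.8, 5.4, 5.0, 5.2)

# Regression: average the predictions.
averaged_pred <- mean(numeric_preds)
averaged_pred                   # 5.1
```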
So, we are using the “iris” data with the “ipred” package to illustrate bagging in R. Let’s roll!!!
Install the required packages in R
We can start by installing and loading the packages we need for bagging. We will be using 5 libraries.
- caret: It is used for general model fitting.
- dplyr: It will help us in data wrangling / pre-processing.
- rpart: This will help us in implementing decision trees.
- ipred: It will help us in fitting bagged models.
- e1071: A collection of miscellaneous statistical and machine learning functions. Some caret functions depend on it.
#Install all the required R packages
install.packages('e1071')
library(e1071)
install.packages('caret')
library(caret)
install.packages('rpart')
library(rpart)
install.packages('ipred')
library(ipred)
install.packages('dplyr')
library(dplyr)
Know Your Data
Done with installing packages?
If yes, then we can move forward and get the data. We are going to use the “Iris” dataset for this purpose. Let’s import and read the data in R.
#Import the data and show the first 10 rows
df <- datasets::iris
head(df, 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
Let’s peek into the data and see what it offers to us.
#Display content in the data
str(df)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 …
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 …
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 …
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 …
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 …
#Gives you summary statistics for each column
summary(df)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
I hope the above numbers are self-explanatory and don’t need much discussion. Have a look.
Bagging in R
Let’s use the bagging() function in R, which the “ipred” library provides. Here, we are going to fit the model to our data. So, let’s see how it works.
#Fit the bagged model to the data
My_bagged_model <- bagging(
formula = Species ~ .,
data = df,
nbagg = 100,
coob = TRUE,
control = rpart.control(minsplit = 2, cp = 0)
)
My_bagged_model
Bagging classification trees with 100 bootstrap replications
Call: bagging.data.frame(formula = Species ~ ., data = df, nbagg = 100,
coob = TRUE, control = rpart.control(minsplit = 2, cp = 0))
Out-of-bag estimate of misclassification error: 0.0467
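In the bagging() call, nbagg = 100 sets the number of bootstrap replications and coob = TRUE requests the out-of-bag error estimate. The control argument passes these rpart parameters:
- minsplit: The minimum number of observations a node must contain before a split is attempted. With 2, the trees are allowed to keep splitting almost all the way down.
- cp: The complexity parameter. With 0, a split is kept no matter how little it improves the fit, so each tree is grown without pruning.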
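To see roughly what happens behind the bagging() call, the sketch below does the bagging by hand with rpart only. It is an illustration under assumptions, not ipred’s actual implementation: the seed, the 25 trees, and the way the out-of-bag (OOB) votes are tallied are choices made for this example, so the error it reports will not match the figure above exactly.

```r
library(rpart)

set.seed(123)                   # arbitrary seed, for reproducibility only
df <- datasets::iris
n <- nrow(df)
n_trees <- 25                   # fewer than the 100 above, to keep the sketch quick

# votes[i, k] counts how often class k was predicted for row i
# by trees that did NOT see row i during training (out-of-bag).
votes <- matrix(0, nrow = n, ncol = nlevels(df$Species),
                dimnames = list(NULL, levels(df$Species)))

for (b in seq_len(n_trees)) {
  in_bag <- sample(n, size = n, replace = TRUE)   # bootstrap sample of row indices
  oob <- setdiff(seq_len(n), in_bag)              # rows left out of this sample
  tree <- rpart(Species ~ ., data = df[in_bag, ],
                control = rpart.control(minsplit = 2, cp = 0))
  pred <- predict(tree, newdata = df[oob, ], type = "class")
  # Build (row, class column) index pairs and add one vote to each.
  idx <- cbind(oob, match(as.character(pred), colnames(votes)))
  votes[idx] <- votes[idx] + 1
}

# Aggregate by majority vote, using only rows that were OOB at least once.
has_votes <- rowSums(votes) > 0
oob_pred <- colnames(votes)[max.col(votes[has_votes, , drop = FALSE],
                                    ties.method = "first")]
oob_error <- mean(oob_pred != as.character(df$Species[has_votes]))
oob_error
```

Each tree is trained on a different bootstrap sample and judged only on the rows it never saw, which is why the OOB error can act as a built-in estimate of performance on new data.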
Running the Prediction
So, we have used the iris data and went with 100 bags for our model. We did all the hard work, right?
Now, we can check how well our model predicts on new data. Excited?
#Here we have created a new data record which is unseen to our model
Unseen_data <- data.frame(Sepal.Length=7.22, Sepal.Width=3.98, Petal.Length=5.44, Petal.Width=2.5)
#Predict the Species
predict(My_bagged_model,Unseen_data)
virginica
Levels: virginica
Heyaa!
Fantastic. You have created a bagged model and used it to predict on new data. As you remember, the out-of-bag misclassification error for this model is low (about 4.7%), so you can place reasonable trust in the results.
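A single new record is only a quick check. As a complementary sketch, you could hold back part of the data and measure accuracy on it. The 30-row test set, the seed, and the variable names below are assumptions made for this example. The bagging() and predict() calls are the same ipred functions used above.

```r
library(ipred)
library(rpart)

set.seed(2024)                  # arbitrary seed, for reproducibility only
df <- datasets::iris

# Hold back 30 randomly chosen rows as a test set (size assumed for illustration).
test_idx <- sample(nrow(df), size = 30)
train <- df[-test_idx, ]
test <- df[test_idx, ]

# Fit the bagged model on the training rows only.
fit <- bagging(Species ~ ., data = train, nbagg = 100,
               control = rpart.control(minsplit = 2, cp = 0))

# Predict the held-back rows and compare with the true species.
pred <- predict(fit, newdata = test)
conf_mat <- table(predicted = as.character(pred),
                  actual = as.character(test$Species))
accuracy <- mean(as.character(pred) == as.character(test$Species))

conf_mat
accuracy
```

The confusion matrix shows which species get mixed up, which a single accuracy number hides.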
Wrapping Up – Bagging in R
Bagging in R is one of the best methods for reducing the variance of a model. High variance can cause overfitting, where the model’s predictions change a lot with small changes in the training data. Multiple packages will help you in developing bagging models. Feel free to try other packages also such as Bagging MART. I hope, by now, you got a good understanding of the bagging method and its uses. That’s all for now. Happy R!!!
Further reading: Bagging Classification and Regression Trees