Bootstrapping in R programming language


Bootstrapping is a resampling method for statistical inference: it draws a large number of samples, with replacement, from a single dataset. In this article, we will perform bootstrapping in the R programming language.

Before we begin, here are the key points about bootstrapping in R –

  • Bootstrapping is a resampling method.
  • It lets you estimate the sampling distribution of a statistic from a single sample.
  • In machine learning, you can use bootstrapping to estimate how a model will perform on unseen data.
  • It lets you compute bootstrap statistics (such as bias and standard error) and confidence intervals.
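
To make "resampling with replacement" concrete, here is a minimal sketch in base R. The vector x holds made-up values and the seed is arbitrary; both are assumptions for illustration only.

```r
# A small hypothetical sample; the values are made up for illustration.
x <- c(12, 15, 9, 20, 17, 11, 14, 18)

set.seed(1)  # arbitrary seed so the resamples are reproducible

# One bootstrap resample: same size as x, drawn with replacement,
# so some values can appear more than once and others not at all.
resample <- sample(x, size = length(x), replace = TRUE)

# Repeat many times and record the mean of each resample.
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# The spread of these means approximates the standard error of the mean.
sd(boot_means)
```

The boot package used later in this article automates this loop and adds bias estimates and confidence intervals on top of it.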

Bootstrapping in R: Why and When?

Statistical inference plays a major role in the data science workflow. A data scientist spends a large share of their time drawing valid inferences from input data.

Data scientists face several challenges along the way, and the major ones are –

  • If the sample size is small (e.g. 100 or 200), it may not represent the true population.
  • The estimate of interest may vary from one sample to the next.
  • The underlying distribution may be complex or even unknown.
  • Standard methods may not give valid inferences for complex or unknown distributions.

Ideally, to draw a valid inference, a data scientist would collect data on the entire population. But this can be very expensive and time-consuming.

Suppose we want to collect survey responses during a pandemic. It is nearly impossible to collect data from tens of millions of people. Collecting data from selected demographic groups around the world is far more feasible.

But the challenge is that estimates computed from different samples can vary widely, and a single unrepresentative sample may give a biased result.

That brings us to the question of “when”.

There are three scenarios where bootstrapping compares well with other techniques –

  • First, when the data distribution is complex or unknown.
  • Second, when the sample size is considerably small.
  • Third, when you need to understand the variance of an estimate.

Understanding the Process of Bootstrapping in R

I hope you now have a better understanding of why and when bootstrapping is needed. So, let’s put that knowledge into practice using R, which is our end goal.

The process includes –

  • Install and load the boot package.
  • Define the statistic function.
  • Apply the boot function.
  • Observe the bootstrap statistics and the confidence interval.

1. Install the ‘boot’ package.

#Install the boot package with dependencies 
install.packages('boot', dependencies = TRUE)

After the package installs successfully, load the library.

#Load the boot library 
library(boot)
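
If you run the script more than once, you may not want to reinstall the package every time. One possible pattern, sketched here as a suggestion rather than a required step, is to install only when the package is missing:

```r
# Install boot only if it is not already available, then load it.
if (!requireNamespace("boot", quietly = TRUE)) {
  install.packages("boot")
}
library(boot)
```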

2. Define the statistic function

Now we have to define the function whose result will be bootstrapped. It takes two arguments, data and i: data is the input dataset and i is a vector of row indices that boot uses to select each resample.

#Define the function 
function_x <- function(data, i) {
  #Keep only the rows selected for this resample
  df <- data[i, ]
  #Correlation between columns 2 and 3
  cor(df[, 2], df[, 3])
}

We have defined a function with the two parameters discussed above. It computes the correlation between two columns of the resampled data. Here we use columns 2 and 3, which in the state.x77 dataset used below are Income and Illiteracy.
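
To get a feel for what the index argument does, you can call the function by hand. The sketch below repeats the definition so it runs on its own; the seed is an arbitrary choice for reproducibility.

```r
# Same statistic function as above, repeated so this snippet is self-contained.
function_x <- function(data, i) {
  df <- data[i, ]
  cor(df[, 2], df[, 3])
}

n <- nrow(state.x77)  # number of rows (states) in the built-in dataset

# Passing the indices 1..n in order gives the statistic on the original data.
function_x(state.x77, 1:n)

# Passing indices drawn with replacement gives one bootstrap replicate.
set.seed(42)  # arbitrary seed, chosen only for reproducibility
idx <- sample(n, replace = TRUE)
function_x(state.x77, idx)
```

The boot function essentially repeats the second call R times, each time with a fresh set of indices.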

3. Applying the Boot function

We are set now. We have defined the function that computes the correlation between two variables. All we need to do is apply it to the input data, asking for R = 100 bootstrap replicates.

booooot <- boot(state.x77,function_x, R=100)
booooot
ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = state.x77, statistic = function_x, R = 100)


Bootstrap Statistics :
      original      bias    std. error
t1* -0.4370752 0.006423952   0.1381687

We have the results. The original correlation is about -0.44, the estimated bias is about 0.006, and the standard error is about 0.14.

Let’s look at the components of the object returned by boot using the summary function.
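
To see where the bias and std. error columns come from, here is a minimal sketch that recomputes them from the returned object. The name b and the seed are assumptions for this example, so the bias and standard error will differ slightly from the output printed above.

```r
library(boot)

# Same statistic function as in step 2.
function_x <- function(data, i) {
  df <- data[i, ]
  cor(df[, 2], df[, 3])
}

set.seed(123)  # arbitrary seed; the article's run did not set one
b <- boot(state.x77, function_x, R = 100)

# "original" is the statistic on the full dataset.
original <- b$t0

# "bias" is the mean of the replicates minus the original value.
bias <- mean(b$t) - b$t0

# "std. error" is the standard deviation of the replicates.
std_error <- sd(b$t)

c(original = original, bias = bias, std_error = std_error)
```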

summary(booooot)
          Length Class  Mode
t0          1    -none- numeric  
t         100    -none- numeric  
R           1    -none- numeric  
data      400    -none- numeric  
seed      626    -none- numeric  
statistic   1    -none- function 
sim         1    -none- character
call        4    -none- call     
stype       1    -none- character
strata     50    -none- numeric  
weights    50    -none- numeric  

The summary briefly lists the components of the boot object. The main ones are –

  • t0 – The value of the statistic computed on the original data.
  • t – A matrix with one row per bootstrap replicate, holding the statistic computed on each resample.
  • R – The number of bootstrap replicates passed to the boot function.
  • data – The input data passed to boot.
  • seed – The value of .Random.seed when boot started.
  • sim – The type of simulation used.
  • stype – The statistic type passed to boot.
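
These components can be read directly with the $ operator. The sketch below rebuilds a boot object under an assumed name (b) and an arbitrary seed, then inspects a few of the entries listed above.

```r
library(boot)

# Same statistic function as in step 2.
function_x <- function(data, i) {
  df <- data[i, ]
  cor(df[, 2], df[, 3])
}

set.seed(123)  # arbitrary seed for reproducibility
b <- boot(state.x77, function_x, R = 100)

b$t0         # statistic on the original data (a single number here)
dim(b$t)     # one row per replicate, one column per statistic
b$R          # number of replicates requested
b$sim        # type of simulation
b$stype      # how the second argument of the statistic is interpreted
dim(b$data)  # dimensions of the input data
```

The 400 shown for data in the summary is simply the 50 rows times 8 columns of state.x77.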

4. Calculate the Confidence Interval

Using the boot.ci() function, we can find the confidence interval (CI) of the bootstrapping results. Let’s see how it works.

confidence_interval <- boot.ci(booooot, index = 1)
confidence_interval
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 100 bootstrap replicates

CALL : 
boot.ci(boot.out = booooot, index = 1)

Intervals : 
Level      Normal              Basic         
95%   (-0.7143, -0.1727 )   (-0.8003, -0.1859 )  

Level     Percentile            BCa          
95%   (-0.6882, -0.0738 )   (-0.6556, -0.0057 )  
Calculations and Intervals on Original Scale
Some basic intervals may be unstable
Some percentile intervals may be unstable
Some BCa intervals may be unstable

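These results are based on the 100 bootstrap replicates. The output shows four types of confidence interval: Normal, Basic, Percentile, and BCa. The “may be unstable” notes appear because 100 replicates is a small number for interval estimation; a larger R gives more stable intervals.

We can also calculate some basic statistics of the replicates, such as the mean, range, and standard deviation, and check the class of the object.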
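
If you only need one kind of interval, boot.ci() accepts type and conf arguments. Here is a minimal sketch that requests just the percentile interval; the object name b, the seed, and the choice of R = 2000 are assumptions made for this example.

```r
library(boot)

# Same statistic function as in step 2.
function_x <- function(data, i) {
  df <- data[i, ]
  cor(df[, 2], df[, 3])
}

set.seed(123)  # arbitrary seed for reproducibility
b <- boot(state.x77, function_x, R = 2000)  # more replicates than the article's 100

# Request only the percentile interval at the 95% level.
ci <- boot.ci(b, conf = 0.95, type = "perc")

# The last two entries of ci$percent are the lower and upper endpoints.
ci$percent[4:5]

# A similar interval from the empirical quantiles of the replicates
# (boot.ci interpolates differently, so the values need not match exactly).
quantile(b$t, c(0.025, 0.975))
```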

mean(booooot$t)
range(booooot$t)
sd(booooot$t)
class(booooot)
-0.4306512
-0.707758756  0.009827232
0.1381687
"boot"

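You can make the final observations as –

  • The bootstrap replicates of the correlation coefficient range from -0.7078 to 0.0098.
  • Mean of the replicates = -0.4306512
  • Standard deviation of the replicates = 0.1381687, which matches the std. error reported by boot.
  • 95% CI (percentile) = (-0.6882, -0.0738)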

Ending note

As we stated earlier, bootstrapping is a statistical resampling method that draws samples with replacement. In R, you define a statistic function, apply the boot function to the input data, set the number of replicates R, and that’s it. You can then read off the bias, the standard error, and the confidence intervals of your results. That’s all for now. Happy R!!!

Further reading: R documentation
