K-fold Cross Validation in R programming

K-fold cross-validation is a technique based on repeated holdout, also known as k-fold CV. It has become an industry standard for evaluating model performance.

Instead of drawing repeated random samples, which could end up using the same records more than once, k-fold cross-validation randomly divides the data into k separate partitions known as folds.

You can set k to any number, but the most common convention is 10-fold cross-validation.

But why 10 folds?

Empirical evidence suggests there is little added benefit from using a greater number. Each of the 10 folds holds about 10% of the data, and for each fold a machine learning model is built on the remaining 90%.

The held-out 10% is then used to evaluate that model. After the model has been trained and tested 10 times, once per fold, the average of the 10 results is reported.
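To make the 90/10 split concrete, here is a minimal base-R sketch on a toy data set of 50 rows. The names n, k, and fold_id are assumptions made for illustration, and this simple random assignment is only a stand-in for what a package such as caret does for you.

```r
# Toy illustration: assign 50 row indices to 10 folds at random
set.seed(42)                      # make the shuffle reproducible
n <- 50                           # hypothetical number of rows
k <- 10                           # number of folds

# Repeat the labels 1..k until there are n of them, then shuffle them
fold_id <- sample(rep(seq_len(k), length.out = n))

# Rows held out for evaluation in the first round, and rows used for training
test_rows  <- which(fold_id == 1)
train_rows <- which(fold_id != 1)

length(test_rows)   # 5 rows, i.e. 10% of the data
length(train_rows)  # 45 rows, i.e. the remaining 90%
```

Repeating the same split for fold 2, fold 3, and so on gives the 10 train/test rounds described above.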

Implementing K-Fold Cross-Validation

I hope by now you have a basic understanding of cross-validation. Note that we will be using the caret package for this process.

If you don’t know it, caret is short for Classification And REgression Training. It provides functions that streamline model training.

1. Importing the data

Now, we can implement this method on credit data. We’ll use the read.csv() function to load it.

#Reading the data
df <- read.csv('creditdata.csv')
df

This is a credit data set that includes multiple attributes. Our target variable will be the default column.

2. Creating Folds

Now we are going to create folds based on the default column, using the createFolds() function from caret.

#Loading caret library
library(caret)
#Creating folds
fold <- createFolds(df$default, k=10)
#Display folds
View(fold)
$Fold01
 [1]  17  51  52  61  75  95 106 110 114 115 128 131 142 144 165 169 181

$Fold02
 [1]  11  13  28  47  54  64  85 109 113 119 121 122 123 125 149 152 173 179

$Fold03
 [1]   2   3  15  16  23  26  38  48  62  69  71 111 127 137 151 155 176 177

$Fold04
 [1]  12  31  60  67  70  73  83  89  90  96  99 116 139 146 157 158 159 168 183

$Fold05
 [1]   7  66  72  78  86  87  91 102 104 108 117 130 136 138 147 153 160 162 175

$Fold06
 [1]   4   6   9  14  19  20  24  25  32  34  35  50  56  76 105 118 163 171

$Fold07
 [1]  36  43  46  49  63  74  77  92  94  97 100 101 129 150 156 164 172 178

$Fold08
 [1]   1   8  37  39  40  41  53  57  58  82  84  93  98 103 126 132 167 170

$Fold09
 [1]  10  21  22  42  45  55  80  81  88 107 112 124 133 140 148 154 161 180 182

$Fold10
 [1]   5  18  27  29  30  33  44  59  65  68  79 120 134 135 141 143 145 166 174


We have now created 10 folds. You can also inspect them with str(fold).

Each fold is a vector of row numbers, so all the rows of the data set have been divided into 10 folds of roughly equal size. Now, let’s move ahead and create the training and test data.
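If you want to convince yourself that folds like these really form a partition, a quick check is sketched below. It builds a stand-in list in base R rather than calling createFolds(), so toy_folds and n_rows are assumptions for illustration; the row count of 183 matches the fold indices printed above.

```r
# Stand-in for the list returned by createFolds(): 10 folds over 183 rows
set.seed(1)
n_rows <- 183
toy_folds <- split(sample(n_rows), rep(1:10, length.out = n_rows))
names(toy_folds) <- sprintf("Fold%02d", 1:10)

# Collect every row index that appears in any fold
all_idx <- unlist(toy_folds, use.names = FALSE)

anyDuplicated(all_idx) == 0         # TRUE: no row sits in two folds
setequal(all_idx, seq_len(n_rows))  # TRUE: every row sits in some fold
lengths(toy_folds)                  # fold sizes, each 18 or 19
```

The same three checks can be run on the fold object created with createFolds().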

3. Creating Train and Test data

We have successfully created the 10 folds. Now we can split the data into train and test sets for model training.

Testing Data

#Creating test data (the rows in the first fold)
test_data <- df[fold$Fold01, ]
str(test_data)
'data.frame':	17 obs. of  22 variables:
 $ checking_balance    : logi  NA NA NA NA NA NA ...
 $ months_loan_duration: chr  "unknown" "1 - 200 DM" "1 - 200 DM" "1 - 200 DM" ...
 $ credit_history      : int  24 24 27 9 36 12 24 14 36 12 ...
 $ purpose             : chr  "critical" "delayed" "delayed" "repaid" ...
 $ amount              : chr  "radio/tv" "furniture" "car (used)" "business" ...
 $ savings_balance     : int  2424 2333 5965 1391 1977 1318 11938 1410 7855 1680 ...
 $ employment_length   : chr  "unknown" "unknown" "< 100 DM" "< 100 DM" ...
 $ installment_rate    : chr  "> 7 yrs" "0 - 1 yrs" "> 7 yrs" "1 - 4 yrs" ...
 $ personal_status     : int  4 4 1 2 4 4 2 1 4 3 ...
 $ other_debtors       : chr  "single male" "single male" "single male" "married male" ...
 $ residence_history   : chr  "none" "none" "none" "none" ...
 $ property            : int  4 2 2 1 4 4 3 2 2 1 ...
 $ age                 : chr  "building society savings" "building society savings" "other" "real estate" ...
 $ installment_plan    : int  53 29 30 27 40 54 39 35 25 35 ...
 $ housing             : chr  "none" "bank" "none" "bank" ...
 $ existing_credits    : chr  "own" "own" "own" "own" ...
 $ default             : int  2 1 2 1 1 1 2 1 2 1 ...
 $ dependents          : int  1 1 1 1 2 1 2 1 2 1 ...
 $ telephone           : int  1 1 1 1 1 1 2 1 1 1 ...
 $ foreign_worker      : chr  "none" "none" "yes" "yes" ...
 $ job                 : chr  "yes" "yes" "yes" "yes" ...
 $ X                   : chr  "skilled employee" "unskilled resident" "mangement self-employed" "skilled employee" ...

Training Data

#Creating training data (all rows except the first fold)
train_data <- df[-fold$Fold01, ]
str(train_data)
'data.frame':	166 obs. of  22 variables:
 $ checking_balance    : logi  NA NA NA NA NA NA ...
 $ months_loan_duration: chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
 $ credit_history      : int  6 48 12 42 24 36 24 36 12 30 ...
 $ purpose             : chr  "critical" "repaid" "critical" "repaid" ...
 $ amount              : chr  "radio/tv" "radio/tv" "education" "furniture" ...
 $ savings_balance     : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ employment_length   : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
 $ installment_rate    : chr  "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
 $ personal_status     : int  4 2 2 2 3 2 3 2 2 4 ...
 $ other_debtors       : chr  "single male" "female" "single male" "single male" ...
 $ residence_history   : chr  "none" "none" "none" "guarantor" ...
 $ property            : int  4 2 3 4 4 4 4 2 4 2 ...
 $ age                 : chr  "real estate" "real estate" "real estate" "building society savings" ...
 $ installment_plan    : int  67 22 49 45 53 35 53 35 61 28 ...
 $ housing             : chr  "none" "none" "none" "none" ...
 $ existing_credits    : chr  "own" "own" "own" "for free" ...
 $ default             : int  2 1 1 1 2 1 1 1 1 2 ...
 $ dependents          : int  1 2 1 1 2 1 1 1 1 2 ...
 $ telephone           : int  1 1 2 2 2 2 1 1 1 1 ...
 $ foreign_worker      : chr  "yes" "none" "none" "none" ...
 $ job                 : chr  "yes" "yes" "yes" "yes" ...
 $ X                   : chr  "skilled employee" "skilled employee" "unskilled resident" "skilled employee" ...

We have 166 observations for model training and 17 observations for model testing. In cross-validation, this is repeated for each of the 10 folds: the model is trained, its performance is measured on the held-out fold, and the 10 results are then averaged.

4. Measure the model performance

We can measure the model performance by estimating the kappa statistic, which adjusts accuracy for the agreement expected by chance. This process requires some libraries. Let’s load them.

#Load the required libraries (install them first with install.packages() if needed)
library(caret)
library(C50)
library(irr)
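Before fitting anything, it may help to see what the kappa statistic measures. The sketch below computes Cohen's kappa by hand in base R as (observed agreement - expected agreement) / (1 - expected agreement). The function name cohen_kappa and the two label vectors are made up for illustration; in the tutorial itself the value comes from kappa2() in the irr package.

```r
# Cohen's kappa from two label vectors, using base R only
cohen_kappa <- function(actual, predicted) {
  # Use a common set of levels so the confusion table is square
  lv  <- union(unique(as.character(actual)), unique(as.character(predicted)))
  tab <- table(factor(actual, levels = lv), factor(predicted, levels = lv))
  p   <- tab / sum(tab)

  p_observed <- sum(diag(p))                  # share of matching labels
  p_expected <- sum(rowSums(p) * colSums(p))  # agreement expected by chance
  (p_observed - p_expected) / (1 - p_expected)
}

# Hypothetical labels: 7 of the 10 predictions match the actual values
actual    <- c("no", "no", "no", "no", "no", "no", "yes", "yes", "yes", "yes")
predicted <- c("no", "no", "no", "no", "no", "yes", "yes", "yes", "no", "no")

cohen_kappa(actual, predicted)  # about 0.348, although accuracy is 0.7
```

Here the observed agreement is 0.7 and the chance agreement is 0.54, so kappa is 0.16 / 0.46. A value of 0 means no better than chance and 1 means perfect agreement.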

We are using the C5.0 decision tree model for our 10-fold cross-validation of the credit data. You can also try models other than decision trees. Give it a try.

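#Set the seed so the folds are reproducible
set.seed(123)
#C5.0 requires the outcome to be a factor
df$default <- factor(df$default)
folds <- createFolds(df$default, k = 10)
#Train and evaluate one model per fold
results <- lapply(folds, function(x) {
  #Rows in the fold are held out; all other rows are used for training
  credit_train <- df[-x, ]
  credit_test <- df[x, ]
  credit_model <- C5.0(default ~ ., data = credit_train)
  credit_pred <- predict(credit_model, credit_test)
  credit_actual <- credit_test$default
  #Agreement between actual and predicted values, adjusted for chance
  kappa <- kappa2(data.frame(credit_actual, credit_pred))$value
  return(kappa)
})
results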

The code above gives you the kappa statistic for each of the 10 folds.

$ fold01 : num 0.343
$ fold02 : num 0.255
$ fold03 : num 0.109
$ fold04 : num 0.107 ........

These are the kappa values we get for each fold. Let’s check their average by unlisting the results and applying the mean function.

#Getting the average value
mean(unlist(results))
0.274

The average kappa is fairly low. This means the model is performing only somewhat better than random chance. Don’t lose heart, there are many automated methods to improve model performance. We will be covering those in coming articles.
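As mentioned earlier, the same fold loop works with other models too. Here is a minimal sketch that swaps the decision tree for logistic regression with glm(). It runs on a two-class subset of the built-in iris data rather than the credit data, it builds the folds in base R, and it reports accuracy instead of kappa, so treat the names toy, fold_id, and accuracy as assumptions for illustration only.

```r
# Sketch: a 10-fold loop with logistic regression on a built-in data set
set.seed(123)

# Two-class subset of iris (a stand-in, not the credit data)
toy <- droplevels(subset(iris, Species != "setosa"))

k <- 10
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

accuracy <- sapply(seq_len(k), function(i) {
  train <- toy[fold_id != i, ]   # nine folds for training
  test  <- toy[fold_id == i, ]   # one fold held out for testing
  fit   <- glm(Species ~ Sepal.Length + Sepal.Width,
               data = train, family = binomial)
  # glm() models the probability of the second factor level
  prob  <- predict(fit, newdata = test, type = "response")
  pred  <- ifelse(prob > 0.5, levels(toy$Species)[2], levels(toy$Species)[1])
  mean(pred == test$Species)     # share of correct predictions in this fold
})

mean(accuracy)  # average accuracy across the 10 folds
```

Only the model-fitting and prediction lines change between models; the fold handling stays the same.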

Wrapping Up

Cross-validation is a resampling technique that you can use to measure model performance. In 10-fold cross-validation, we create 10 folds of data, measure the performance on each fold individually, and then combine the results by taking their average.

That’s all for now. Happy R!!!
