In machine learning, it is common practice to divide a dataset into two parts: a training set and a testing set. The two should be kept strictly separate.
Why should we split our dataset?
If we don’t split the dataset into training and testing sets, then we end up testing and training our model on the same data. When we test on the same data we trained on, we tend to get high accuracy, simply because the model has already seen it.
However, this doesn’t mean that the model will perform as well on unseen data. In machine learning, this is called overfitting. Overfitting occurs when your model fits the training dataset too closely, capturing its noise along with its patterns, so it fails to generalise to new data.
Overfitting is an undesirable phenomenon when training a model. So is underfitting.
Underfitting is when the model is too simple to capture the patterns even in the training dataset.
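The idea behind overfitting can be demonstrated with a minimal sketch (not part of the tutorial's titanic example): a deep decision tree trained on purely random labels memorises the training set perfectly, yet cannot do better than chance on fresh data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))
y_train = rng.integers(0, 2, size=100)   # labels are pure noise
X_test = rng.normal(size=(100, 5))
y_test = rng.integers(0, 2, size=100)

# An unconstrained tree grows until every training point is classified
# correctly, even though the labels carry no real signal.
model = DecisionTreeClassifier().fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))  # 1.0: memorised
print("test accuracy:", model.score(X_test, y_test))     # roughly chance
```

Testing on the training data alone would report a perfect score and hide the fact that the model has learned nothing useful.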
How to split a dataset using sklearn?
Let’s see how we can use sklearn to split a dataset into training and testing sets. We will go over the process step by step.
1. Import the dataset
Let’s start by importing a dataset into our Python notebook. In this tutorial, we are going to use the titanic dataset as the sample dataset. You can import the titanic dataset from the seaborn library in Python.
import seaborn as sns

titanic = sns.load_dataset('titanic')
titanic.head()
2. Form input and output vectors from the dataset
Before we move on to splitting the dataset into training and testing sets, we need to prepare input and output vectors out of the dataset.
Let’s treat the ‘survived‘ column as output. This means that the model is going to be trained to predict whether a person survived or not.
y = titanic.survived
print(y)
We also need to remove the ‘survived‘ column from the dataset to get the input vector.

x = titanic.drop('survived', axis=1)
3. Deciding the split ratio
The split ratio represents what portion of the data will go to the training set and what portion of it will go to the testing set. The training set is almost always larger than the testing set.
The most common split ratio used by data scientists is 80:20.
A split ratio of 80:20 means that 80% of the data will go to the training set and 20% of the dataset will go to the testing set.
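For the titanic dataset's 891 rows, we can sketch what an 80:20 split works out to. Note that train_test_split rounds the test set up to a whole number of rows, which is why the counts below are not exactly 712.8 and 178.2:

```python
import math

n_rows = 891           # rows in the titanic dataset
test_size = 0.2

n_test = math.ceil(n_rows * test_size)   # test set is rounded up
n_train = n_rows - n_test                # the rest goes to training
print(n_train, n_test)  # 712 179
```

These numbers match the shapes we will print in step 5.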
4. Performing the split
To split the data we are going to use the train_test_split function from the sklearn library.
train_test_split randomly distributes your data into training and testing sets according to the ratio provided.
We are going to use 80:20 as the split ratio.
We first need to import train_test_split from sklearn.
from sklearn.model_selection import train_test_split
To perform the split, use:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

We have set the test size to 0.2, which means the training size will be 0.8, giving us our desired 80:20 ratio.
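Because the split is random, running the code twice can produce two different splits. The tutorial's code does not fix this, but train_test_split accepts a random_state parameter that makes the shuffle reproducible, as this small standalone sketch shows:

```python
from sklearn.model_selection import train_test_split

data = list(range(10))
labels = [i % 2 for i in range(10)]

# The same random_state always yields the same split.
a_train, a_test, _, _ = train_test_split(
    data, labels, test_size=0.2, random_state=42)
b_train, b_test, _, _ = train_test_split(
    data, labels, test_size=0.2, random_state=42)

print(a_test == b_test)  # True: both runs picked the same test rows
```

Fixing the seed is useful when you want your results to be reproducible across runs.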
5. Verify by printing the shapes of training and testing vectors
To verify the split, let’s print out the shapes of different vectors.
print("shape of original dataset :", titanic.shape)
print("shape of input - training set", x_train.shape)
print("shape of output - training set", y_train.shape)
print("shape of input - testing set", x_test.shape)
print("shape of output - testing set", y_test.shape)
shape of original dataset : (891, 15)
shape of input - training set (712, 14)
shape of output - training set (712,)
shape of input - testing set (179, 14)
shape of output - testing set (179,)
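One refinement worth knowing about, though it is not used in this tutorial's code: when the output is an imbalanced class label like ‘survived‘, passing stratify=y to train_test_split keeps the class proportions the same in both subsets. A small sketch with synthetic data (the 90:10 labels here are hypothetical, chosen to make the effect visible):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# An imbalanced binary label: 90 zeros and 10 ones.
y = [0] * 90 + [1] * 10
X = [[i] for i in range(100)]

# stratify=y preserves the 90:10 ratio in both the training
# and the testing subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(Counter(y_test))  # 18 zeros and 2 ones: the original 90:10 ratio
```

Without stratification, a small test set could by chance contain very few (or no) examples of the minority class.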
The complete code for this tutorial is given below :
import seaborn as sns
from sklearn.model_selection import train_test_split

# import dataset
titanic = sns.load_dataset('titanic')

# output vector
y = titanic.survived

# input vector
x = titanic.drop('survived', axis=1)

# split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# verify
print("shape of original dataset :", titanic.shape)
print("shape of input - training set", x_train.shape)
print("shape of output - training set", y_train.shape)
print("shape of input - testing set", x_test.shape)
print("shape of output - testing set", y_test.shape)
This tutorial was about splitting data into training and testing sets using sklearn in Python. We also discussed overfitting and underfitting to understand why splitting the data is necessary.