How to Split Data into Training and Testing Sets?

In machine learning, it is common practice to divide a dataset into two separate sets: a training set and a testing set. The model learns from the training set, and the testing set is held back so that the model can be evaluated on data it has not seen.

Why should we split our dataset?

If we don’t split the dataset into training and testing sets, we end up training and testing our model on the same data. A model tested on the data it was trained on tends to report high accuracy.

However, this doesn’t mean that the model will perform as well on unseen data. A model that does well on its training data but poorly on new data is said to be overfitting.

Overfitting happens when your model fits the training dataset too closely, capturing its noise along with the underlying pattern.

Overfitting is an undesirable phenomenon when training a model. So is underfitting.

Underfitting is when the model cannot represent even the data points in the training dataset, so it performs poorly on both training data and unseen data.
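To make the gap between training accuracy and held-out accuracy concrete, here is a minimal sketch. It uses synthetic data from make_classification and an unrestricted DecisionTreeClassifier, both chosen only for illustration (neither is part of this tutorial's Titanic example), and it relies on train_test_split, which the steps later in this tutorial explain.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (an assumption; not the Titanic dataset).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5,
    flip_y=0.2, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A tree with no depth limit can memorise the training rows.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("accuracy on training data:", train_acc)
print("accuracy on held-out data:", test_acc)
```

On data like this, the training score is usually close to perfect while the held-out score is noticeably lower. That difference is what testing on the training data would hide.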

How to split a dataset using sklearn?

Let’s see how we can use sklearn to split a dataset into training and testing sets. We will go over the process step by step.

1. Import the dataset

Let’s start by importing a dataset into our Python notebook. In this tutorial, we are going to use the Titanic dataset as the sample dataset. You can load the Titanic dataset from the seaborn library in Python.

import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()
Titanic Dataset

2. Form input and output vectors from the dataset

Before we move on to splitting the dataset into training and testing sets, we need to prepare input and output vectors out of the dataset.

Let’s treat the 'survived' column as the output. This means that the model is going to be trained to predict whether a person survived or not.

y = titanic.survived
print(y)

We also need to remove the 'survived' column from the dataset to get the input vector.

x = titanic.drop('survived', axis=1)
x.head()
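
One thing worth checking when forming the input: seaborn's copy of the Titanic data also has an 'alive' column, which holds the same outcome as 'survived' in yes/no form. If it stays in the input, a model can read the answer straight from it. The sketch below shows one way to drop both columns. The tiny DataFrame and the names df, x_small and y_small are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature frame that mimics a few Titanic columns.
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "alive": ["no", "yes", "yes", "no"],  # same information as 'survived'
    "age": [22.0, 38.0, 26.0, 35.0],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

y_small = df["survived"]
# Drop the target and any column that duplicates it.
x_small = df.drop(columns=["survived", "alive"])

print(list(x_small.columns))
```

Applied to the real dataset, the same idea would be titanic.drop(columns=['survived', 'alive']), which leaves 13 input columns instead of 14.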

3. Deciding the split ratio

The split ratio represents what portion of the data will go to the training set and what portion of it will go to the testing set. The training set is almost always larger than the testing set.

A commonly used split ratio is 80:20.

A split ratio of 80:20 means that 80% of the data will go to the training set and 20% of the dataset will go to the testing set.
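
As a quick worked example of the arithmetic, the snippet below turns an 80:20 ratio into row counts for a dataset of 891 rows, which is the size of the Titanic dataset used in this tutorial. It assumes the test count is rounded up and the remainder goes to training, which matches the shapes printed at the end of this tutorial.

```python
import math

n_samples = 891      # rows in the Titanic dataset used in this tutorial
test_fraction = 0.2  # the "20" in an 80:20 split

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(n_samples * test_fraction)
n_train = n_samples - n_test

print(n_train, n_test)  # 712 179
```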

4. Performing the split

To split the data, we are going to use train_test_split from the sklearn library.

train_test_split shuffles your data and randomly distributes it into training and testing sets according to the ratio provided.
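
To see what a random split amounts to, here is a rough standard-library sketch of the idea: shuffle the rows, then cut the shuffled list in two. The helper manual_split is hypothetical and is not how sklearn implements train_test_split. It only illustrates the principle.

```python
import random

def manual_split(rows, test_fraction=0.2, seed=0):
    """Hypothetical helper: shuffle a copy of rows, then slice it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = manual_split(range(10))
print(len(train_rows), len(test_rows))  # 8 2
```

Every row lands in exactly one of the two lists, which is the property we want from any train/test split.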

We are going to use 80:20 as the split ratio.

We first need to import train_test_split from sklearn.

from sklearn.model_selection import train_test_split

To perform the split, use:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

We set test_size to 0.2, so the remaining 0.8 of the data goes to the training set, giving us the desired 80:20 ratio.
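
Because the split is random, each run produces a different partition unless you fix the seed. The sketch below shows two optional parameters of train_test_split that may be useful here: random_state for reproducibility and stratify for keeping class proportions the same in both parts. The toy arrays X and y are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 60 of class 0 and 40 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 60 + [1] * 40)

# The same random_state gives the same split on every call.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)
same_split = np.array_equal(split_a[1], split_b[1])

# stratify=y keeps the 60:40 class ratio in both parts.
_, _, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(same_split, y_tr.mean(), y_te.mean())
```

Stratifying is mostly worth considering when the classes are imbalanced, since a purely random split can leave the smaller class under-represented in the test set.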

5. Verify by printing the shapes of training and testing vectors

To verify the split, let’s print out the shapes of different vectors.

print("shape of original dataset :", titanic.shape)
print("shape of input - training set", x_train.shape)
print("shape of output - training set", y_train.shape)
print("shape of input - testing set", x_test.shape)
print("shape of output - testing set", y_test.shape)

Output :

shape of original dataset : (891, 15)
shape of input - training set (712, 14)
shape of output - training set (712,)
shape of input - testing set (179, 14)
shape of output - testing set (179,)
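
Shapes confirm the sizes, but not that the two parts are disjoint. A further check, sketched below on a small made-up DataFrame (the names data, features and labels are hypothetical), is to compare the row indices of the resulting pieces.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 10 rows, one feature.
data = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
features = data.drop(columns="label")
labels = data["label"]

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

# No row should appear in both parts, and none should be lost.
overlap = set(f_train.index) & set(f_test.index)
covered = set(f_train.index) | set(f_test.index)
print(len(overlap), covered == set(data.index))

# Inputs and outputs must stay aligned row by row.
print(f_train.index.equals(l_train.index))
```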

Complete code

The complete code for this tutorial is given below:

import seaborn as sns
from sklearn.model_selection import train_test_split

# import dataset
titanic = sns.load_dataset('titanic')

# output vector
y = titanic.survived

# input vector
x = titanic.drop('survived', axis=1)

# split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#verify
print("shape of original dataset :", titanic.shape)
print("shape of input - training set", x_train.shape)
print("shape of output - training set", y_train.shape)
print("shape of input - testing set", x_test.shape)
print("shape of output - testing set", y_test.shape)

Conclusion

This tutorial was about splitting data into training and testing sets using sklearn in Python. We also discussed overfitting and underfitting to understand why the data needs to be split.
