PyTorch DataLoader

We’ll be covering the PyTorch DataLoader in this tutorial. Large datasets are indispensable in machine learning and deep learning these days, but they are usually far too large to load into memory all at once.

Trying to do so leads to out-of-memory errors and slow programs. PyTorch offers a solution that parallelizes the data loading process and supports automatic batching: the DataLoader class in the torch.utils.data package.

PyTorch DataLoader Syntax

The DataLoader class has the following constructor:

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None)

Let us go over the arguments one by one.

  1. dataset – A DataLoader must always be constructed around a dataset. PyTorch DataLoaders support two kinds of datasets (a minimal sketch of both styles follows this list):
    • Map-style datasets – These datasets map keys (indices) to data samples. Each item is retrieved through a __getitem__() implementation, and __len__() reports the dataset size.
    • Iterable-style datasets – These datasets implement the __iter__() protocol. They retrieve data as a sequential stream rather than performing random reads the way map-style datasets do.
  2. batch_size – The number of samples in each batch.
  3. shuffle – Whether the data should be reshuffled at every epoch.
  4. sampler – An optional torch.utils.data.Sampler instance. A sampler defines the strategy for drawing samples – sequential, random, or any other order. shuffle must be left as False when a sampler is supplied.
  5. batch_sampler – Like sampler, but yields a whole batch of indices at a time.
  6. num_workers – The number of sub-processes used for loading the data.
  7. collate_fn – Collates a list of samples into a batch. PyTorch lets you supply a customized collation function here.
  8. pin_memory – Pinned (page-locked) host memory allows faster transfers to the GPU. When set to True, the data loader copies tensors into pinned memory before returning them.
  9. drop_last – If the dataset size is not a multiple of batch_size, the last batch contains fewer elements than batch_size. Setting this option to True drops that incomplete batch.
  10. timeout – The time to wait for a batch to be collected from the worker sub-processes.
  11. worker_init_fn – A routine called in each worker process after it starts, allowing customized worker setup.
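
To make the two dataset styles concrete, here is a minimal sketch. The class names (SquaresMapDataset, SquaresIterableDataset) and the toy data are invented for illustration; the only real requirements are the methods each style must implement.

from torch.utils.data import Dataset, IterableDataset, DataLoader

class SquaresMapDataset(Dataset):
    # Map-style: implements __len__ and __getitem__, so samples can be
    # accessed (and shuffled) by index.
    def __init__(self, n):
        self.data = [i * i for i in range(n)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

class SquaresIterableDataset(IterableDataset):
    # Iterable-style: implements __iter__ and streams samples sequentially.
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return (i * i for i in range(self.n))

# Both styles can be wrapped in a DataLoader.
map_loader = DataLoader(SquaresMapDataset(10), batch_size=4, shuffle=True)
stream_loader = DataLoader(SquaresIterableDataset(10), batch_size=4)  # shuffle is not supported here

for batch in map_loader:
    print(batch)   # e.g. tensor([49,  0, 81, 16]) – indices drawn in shuffled order
for batch in stream_loader:
    print(batch)   # tensor([0, 1, 4, 9]), tensor([16, 25, 36, 49]), ... in stream order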

Let us now look at a few examples of how to use DataLoaders.

PyTorch DataLoaders on Built-in Datasets

MNIST is a dataset of images of handwritten digits and is one of the most frequently used datasets in deep learning. We start with the imports needed to load it:

import torch
import matplotlib.pyplot as plt
from torchvision import datasets, transforms

Before we proceed, it helps to know a little about the torchvision transforms we have just imported. Transforms are commonly used with image datasets to perform operations such as normalization, resizing, and cropping.

Transforms are generally chained together with transforms.Compose and applied to the images in the dataset after they have been converted to tensors.

The only operation we need to perform on the MNIST images is normalization. ToTensor first scales the pixel values into the range [0, 1]; Normalize((0.5,), (0.5,)) then subtracts a mean of 0.5 and divides by a standard deviation of 0.5, mapping the values into the range [-1, 1].

# Define a transform to normalize the data
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.5,), (0.5,)),
                                ])

Now we download the built-in MNIST dataset to ‘~/.pytorch/MNIST_data/’, wrap it as a torch dataset, and then build a data loader from it.

# Download and load the training data
trainset = datasets.MNIST('~/.pytorch/MNIST_data/', download=True, train=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

To access the images from the dataset, all we need to do is wrap the data loader we defined here, trainloader, in an iterator with iter(). We can then pull one batch of images and labels with next().

dataiter = iter(trainloader)
images, labels = next(dataiter)
print(images.shape)
print(labels.shape)
print(labels[1])
plt.imshow(images[1].numpy().squeeze(), cmap='Greys_r')

The shapes of the image and label batches are printed, along with the label of the image being displayed:

torch.Size([64, 1, 28, 28])
torch.Size([64])
tensor(2)
[Image: a sample digit from the MNIST data set]
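
As a quick sanity check on the normalization discussed earlier, we can inspect the value range of the batch we just pulled; after ToTensor and Normalize((0.5,), (0.5,)), the pixel values should lie in [-1, 1].

print(images.min().item(), images.max().item())   # approximately -1.0 and 1.0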

DataLoaders on Custom Datasets

PyTorch allows you to create custom datasets and implement data loaders on them. This makes programming in PyTorch very flexible.

To define a custom dataset, you need to override two major functions of the torch.utils.data.Dataset class – __len__ and __getitem__ – which are used to report the size of the dataset and to retrieve a sample at a particular index, respectively.

Let us create a sample dataset for illustrating this. We create a dataset that holds 1000 randomly generated numbers.

from torch.utils.data import Dataset
import random

class SampleDataset(Dataset):
    def __init__(self, r1, r2):
        # Generate 1000 random integers in the inclusive range [r1, r2].
        self.samples = [random.randint(r1, r2) for _ in range(1000)]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]

dataset = SampleDataset(4, 445)
dataset[100:120]

Output:

[439, 131, 338, 15, 212, 34, 44, 288, 387, 273, 324, 214, 115, 205, 213, 66, 226, 123, 65, 14]

Now we can define a data loader upon this custom dataset.

from torch.utils.data import DataLoader

loader = DataLoader(dataset, batch_size=12, shuffle=True, num_workers=2)
for i, batch in enumerate(loader):
    print(i, batch)

The output of the above code is the data divided into batches of 12. Some of the batches retrieved are shown below.

0 tensor([417, 410,   9, 261, 357, 288, 368,  97, 411,   8, 181,  80])
1 tensor([ 27,  59, 159, 392, 402, 294,  69,  67, 201, 427, 243, 402])
2 tensor([142, 267,  21, 399, 192, 377, 425, 270,  83, 370, 237, 199])
3 tensor([266, 305,  41, 315, 231, 260, 254, 383, 266, 285, 165, 118])
4 tensor([265, 320,  92, 162, 192, 153,  49, 344,  97, 240, 312, 192])
5 tensor([417,  35, 109,  75, 288, 258, 218, 275, 158, 251,  71, 276])
6 tensor([203,  86, 291, 429,  93, 334, 288, 392, 167, 242, 430, 194])
7 tensor([ 79,  52, 421, 147, 119,  76, 131,  28,  13, 277, 270, 164])
8 tensor([ 56, 410, 253, 159, 318,  68, 342, 260,  23, 289, 326, 134])
9 tensor([ 55,   9, 132, 353,  43, 225, 188, 217, 387,  32, 214, 242])
10 tensor([131,   6, 106, 191,  89,  91,  81, 271, 247, 347, 259, 366])
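
Two of the constructor arguments described earlier are easy to demonstrate on this same dataset. With 1000 samples and a batch size of 12, the final batch contains only 4 samples (1000 = 83 × 12 + 4), and drop_last=True discards it. The float_collate function below is a hypothetical, minimal collate_fn that turns each batch into a float tensor instead of the default integer tensor.

import torch
from torch.utils.data import DataLoader

# drop_last: discard the final, incomplete batch of 4 samples.
full_batches = DataLoader(dataset, batch_size=12, drop_last=True)
print(len(full_batches))                  # 83 full batches instead of 84

# collate_fn: control how a list of samples is merged into one batch.
def float_collate(samples):
    return torch.tensor(samples, dtype=torch.float32)

float_loader = DataLoader(dataset, batch_size=12, collate_fn=float_collate)
print(next(iter(float_loader)).dtype)     # torch.float32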

Conclusion

As you can see, the PyTorch DataLoader can be used with both custom and built-in datasets. By batching samples automatically and loading them in parallel worker sub-processes, it keeps data access from becoming a bottleneck. We hope this tutorial has helped you understand the PyTorch DataLoader a little better.
