One-Hot Encoding in Python – Implementation using Sklearn


One-Hot encoding is a technique of representing categorical data in the form of binary vectors. It is a common step in the processing of sequential data before performing classification.

One-Hot encoding also provides a simple way to implement word embedding. Word embedding refers to the process of turning words into numbers so that a machine can work with them.

It is common to make word embeddings out of a corpus before inputting it to an LSTM model. Making word embeddings out of a corpus makes it easier for a computer to find relationships and patterns between words.

In this tutorial, we are going to understand what exactly One-Hot Encoding is and then use Sklearn to implement it.

Let’s start by taking an example.

Working of One-Hot Encoding in Python

Consider the following sequence of words.

['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java' ]

This is sequential data with three categories.

The categories in the data above are as follows :

  1. Python
  2. Java
  3. C++

Let us try to understand the working behind One-Hot Encoding.

One-Hot Encoding is a two-step process.

  1. Conversion of Categories to Integers
  2. Conversion of Integers to Binary vectors

1. Conversion of Categories to Integers

Let us convert the three categories in our example to integers. The categories are numbered in sorted order, which is also the order LabelEncoder uses.

Category    Integer
C++         0
Java        1
Python      2

Now we can use these integers to represent our original data as follows :

[2 1 2 2 0 0 1 2 0 1]

You can read this data with the conversion table above.
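The first step can be sketched in plain Python, without any library. This is only an illustration of the idea, not how Sklearn does it internally; the name category_to_int is a hypothetical helper introduced here.

```python
# Sequence of words from the example above
data = ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java']

# Number the unique categories in sorted order, starting from 0
# (category_to_int is a hypothetical name used only for this sketch)
category_to_int = {category: index for index, category in enumerate(sorted(set(data)))}

# Replace every word by its integer code
integer_encoded = [category_to_int[item] for item in data]

print(category_to_int)
print(integer_encoded)
```

The dictionary plays the role of the conversion table, and the list is the integer representation of the original sequence.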

Let’s move to the second step now.

2. Conversion of Integers to Binary vectors

This is not your usual integer-to-binary conversion. Instead, each integer becomes a vector whose length equals the number of categories. The entry at the index equal to the integer is set to one, and all the other entries are set to zero.

Let’s see what we mean by this :

Category    Integer    Binary vector
C++         0          [1, 0, 0]
Java        1          [0, 1, 0]
Python      2          [0, 0, 1]

We can represent the data in our example as :

[[0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]

Our original sequence data is now in the form of a 2-D Matrix. This makes it easier for a machine to understand it.
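The second step can also be sketched by hand with NumPy. The snippet below assumes the integer codes from step one and relies on the fact that row i of an identity matrix is exactly the one-hot vector for integer i. It is a minimal illustration, not the method Sklearn uses internally.

```python
import numpy as np

# Integer codes from step one (C++ = 0, Java = 1, Python = 2)
integer_encoded = np.array([2, 1, 2, 2, 0, 0, 1, 2, 0, 1])
num_categories = 3

# Indexing the identity matrix with the codes picks one row per item,
# and each row has a single 1 at the position of its code
onehot = np.eye(num_categories)[integer_encoded]

print(onehot)
```

Every row of the result sums to one, because exactly one entry per row is "hot".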

Python Code for Implementing One-Hot Encoding using Sklearn

Let’s move to the implementation part of One-Hot Encoding. We are going to use Sklearn for implementing the same.

We are going to follow the same two-step approach while implementing as well.

The steps are as follows:

  1. Use LabelEncoder to convert categories into integers.
  2. Use OneHotEncoder to convert the integers into One-Hot vectors (binary vectors).

Before we move further, let’s write the code for declaring the array with data in our example.

import numpy as np 
data = ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java' ]
vals = np.array(data)

1. Using LabelEncoder to convert Categories into Integers

We will first use LabelEncoder on the data. Let’s import it from Sklearn and then use it on the data.

The code for the same is as follows :

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(vals)
print(integer_encoded)

Output :

[2 1 2 2 0 0 1 2 0 1]
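To check which integer stands for which category, one option is to inspect the fitted encoder. The sketch below uses the classes_ attribute and the inverse_transform method of LabelEncoder; the variable name decoded is introduced here only for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

data = ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java']
vals = np.array(data)

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(vals)

# classes_ holds the categories in sorted order; the position is the integer code
print(label_encoder.classes_)

# inverse_transform maps the integer codes back to the original labels
decoded = label_encoder.inverse_transform(integer_encoded)
print(decoded)
```

The position of a category in classes_ is its integer code, which matches the conversion table shown earlier.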

2. Using OneHotEncoder to convert Integer Encoding into One-Hot Encoding

Now let’s convert the integer encoding to One-Hot encoding.

OneHotEncoder expects a 2-D array of shape (n_samples, n_features). The integer encoding from LabelEncoder is a 1-D array, so we have to reshape it into a single column before providing it as an input to OneHotEncoder.

That can be done with the following lines of code :

integer_encoded_reshape = integer_encoded.reshape(len(integer_encoded), 1)
print(integer_encoded_reshape)

Output :

[[2]
 [1]
 [2]
 [2]
 [0]
 [0]
 [1]
 [2]
 [0]
 [1]]
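An equivalent and slightly shorter way to get the same column shape is to let NumPy infer the number of rows by passing -1. The sketch below only compares shapes; the name column is introduced here for illustration.

```python
import numpy as np

integer_encoded = np.array([2, 1, 2, 2, 0, 0, 1, 2, 0, 1])

# -1 asks NumPy to work out the number of rows from the array length
column = integer_encoded.reshape(-1, 1)

print(integer_encoded.shape)  # 1-D array with 10 entries
print(column.shape)           # 2-D array with 10 rows and 1 column
```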

Now we can use this data to make One-Hot vectors.

from sklearn.preprocessing import OneHotEncoder
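# sparse_output=False returns a dense array (scikit-learn >= 1.2; use sparse=False on older versions)
onehot_encoder = OneHotEncoder(sparse_output=False)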
onehot_encoded = onehot_encoder.fit_transform(integer_encoded_reshape)
print(onehot_encoded)

Output :

[[0. 0. 1.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 0. 1.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]]
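As a side note, OneHotEncoder can also be fitted on the string categories directly, which skips the LabelEncoder step. The sketch below assumes scikit-learn 0.20 or later, where string input is supported. It keeps the default sparse output and converts it with toarray() so that it does not depend on the name of the sparse parameter. The variable names encoder and recovered are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java']

# OneHotEncoder needs a 2-D input: one row per item, one column per feature
vals = np.array(data).reshape(-1, 1)

encoder = OneHotEncoder()
# The default result is a SciPy sparse matrix; toarray() makes it dense
onehot_encoded = encoder.fit_transform(vals).toarray()

# categories_ lists the categories found for each input column
print(encoder.categories_)
print(onehot_encoded)

# inverse_transform maps the one-hot rows back to the original labels
recovered = encoder.inverse_transform(onehot_encoded)
print(recovered.ravel())
```

The column order of the one-hot matrix follows categories_, so it matches the two-step result above.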

Complete Code

Here’s the complete code for this tutorial :

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# data
data = ['Python', 'Java', 'Python', 'Python', 'C++', 'C++', 'Java', 'Python', 'C++', 'Java' ]
vals = np.array(data)

# Integer Encoding
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(vals)
print(integer_encoded)


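# Reshaping for OneHotEncoder
integer_encoded_reshape = integer_encoded.reshape(len(integer_encoded), 1)

# One-Hot Encoding
# sparse_output=False returns a dense array (scikit-learn >= 1.2; use sparse=False on older versions)
onehot_encoder = OneHotEncoder(sparse_output=False)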
onehot_encoded = onehot_encoder.fit_transform(integer_encoded_reshape)
print(onehot_encoded)

Conclusion

This tutorial was about One-Hot Encoding in Python. We understood how it works and used Sklearn to implement Label Encoding and One-Hot Encoding.
