In this tutorial, our aim is to implement the Bag of Words model in Python in under ten lines of code. Before we get into the implementation, let’s learn about the Bag of Words model.
What is the Bag of Words model?
Bag of words (BOW) is a technique for extracting features from text for Natural Language Processing.
The bag of words model represents a sentence by its word counts, so the order of the words in the sentence is ignored.
The steps involved in creating the BOW model for a piece of text are as follows:
- Tokenize the text and store the tokens in a list.
- Create a vocabulary out of the tokens.
- Count the number of occurrences of tokens in each sentence and store the count.
- Create a bag of words model by converting the text into vectors that hold the count of each word from the vocabulary.
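The four steps above can be sketched in plain Python without any library. This is a minimal illustration, not the Keras implementation used later in the tutorial: the helper name bag_of_words is hypothetical, tokenization is a simple lowercase-and-split, and the vocabulary is kept in order of first appearance.

```python
from collections import Counter


def bag_of_words(sentences):
    """Return (vocabulary, vectors) for a list of sentences.

    Hypothetical helper, for illustration only.
    """
    # Step 1: tokenize each sentence (lowercase, split on whitespace).
    tokenized = [sentence.lower().split() for sentence in sentences]

    # Step 2: build the vocabulary in order of first appearance.
    vocabulary = []
    for tokens in tokenized:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)

    # Step 3: count token occurrences per sentence.
    counts = [Counter(tokens) for tokens in tokenized]

    # Step 4: turn each sentence into a vector of counts over the vocabulary.
    vectors = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['the', 'cat', 'sat', 'saw', 'dog']
print(vectors)     # [[1, 1, 1, 0, 0], [2, 1, 0, 1, 1]]
```

Note that word order is lost: any reordering of the words in a sentence gives the same vector.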
We are going to use the Keras preprocessing module to implement BOW. In particular, we will use the Tokenizer class, which is a text tokenization utility class.
Let’s see an example.
Example of the Bag of Words Model
Consider the following sentences.
'There was a man'
'The man had a dog'
'The dog and the man walked'
There are three sentences in our text. Let’s create a vocabulary out of this.
To get a vocabulary out of the text, we use tokenization.
Tokenization is the process of breaking down a piece of text into smaller units called tokens.
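As a rough sketch of what tokenization does, the snippet below lowercases a sentence, replaces punctuation with spaces, and splits on whitespace. It only approximates the default behavior of the Keras Tokenizer; the function name simple_tokenize is made up for this example, and treating every ASCII punctuation character as a separator is an assumption.

```python
import string

# Map every ASCII punctuation character to a space (an assumption of this sketch).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_tokenize(text):
    """Lowercase the text, drop punctuation, and split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


print(simple_tokenize("The dog and the man walked."))
# ['the', 'dog', 'and', 'the', 'man', 'walked']
```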
After tokenizing, we get the following vocabulary. The tokens are lowercased, so 'The' and 'the' count as the same word:
man | the | a | dog | there | was | had | and | walked
Now, to create the bag of words model, we need to count the number of occurrences of these tokens in each of the three sentences.
We get the counts as follows:
Sentence | man | the | a | dog | there | was | had | and | walked |
S1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
S2 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
S3 | 1 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 1 |
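We can now write the three sentences as vectors. Each row has one more column than the table above: the leading 0 belongs to index 0, which the Keras Tokenizer reserves and never assigns to a word.
[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]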
Now let’s learn how to implement the bag of words model using Keras.
How to implement Bag of Words using Python Keras?
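To implement the bag of words model using Keras, we first import the Tokenizer class. Recent Keras and TensorFlow releases mark this class as deprecated, so the import below may require an older installation.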
from keras.preprocessing.text import Tokenizer
Now we have to declare the text for the model to work on.
text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]
Now let’s see how to get the vocabulary and the bag of words representations for the text.
1. Fit a Tokenizer on the text
To create tokens out of the text, we will use the Tokenizer class from the Keras text preprocessing module.
model = Tokenizer()
model.fit_on_texts(text)
Now you can use this model to get the vector representations of the sentences as well as to get the vocabulary.
2. Get Bag of Words representation
To get the vector representation for the sentences you need to run the following line of code.
rep = model.texts_to_matrix(text, mode='count')
Here we use the texts_to_matrix method of the Tokenizer class. It converts a list of texts to a NumPy matrix with one row per text. Setting mode to 'count' makes each entry the number of times the token occurs in that text.
To display the vectors use:
print(rep)
Output:
[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]
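Besides 'count', the mode argument also accepts 'binary', 'freq', and 'tfidf'. To illustrate how two of these relate to the counts, the plain-Python sketch below derives binary and frequency versions of the count rows printed above. It is not Keras code: the helper names are hypothetical, and it assumes that 'binary' marks presence with 1 and that 'freq' divides each count by the number of tokens in the sentence.

```python
# Count rows copied from the output above (column 0 is the reserved index 0).
count_rows = [
    [0, 1, 0, 1, 0, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 2, 0, 1, 0, 0, 0, 1, 1],
]


def to_binary(row):
    """1 if the word occurs in the sentence, else 0."""
    return [1 if count > 0 else 0 for count in row]


def to_freq(row):
    """Each count divided by the total number of tokens in the sentence."""
    total = sum(row)
    return [count / total if total else 0.0 for count in row]


print(to_binary(count_rows[2]))  # [0, 1, 1, 0, 1, 0, 0, 0, 1, 1]
print(to_freq(count_rows[0]))    # [0.0, 0.25, 0.0, 0.25, 0.0, 0.25, 0.25, 0.0, 0.0, 0.0]
```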
Now let’s see how to print the vocabulary.
3. Display the vocabulary
To display the vocabulary (tokens) use:
print(f'Key : {list(model.word_index.keys())}')
Output:
Key : ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
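The order of the vocabulary is not arbitrary: the Tokenizer ranks words by how often they occur in the fitted text, and the indices start at 1. The plain-Python sketch below reproduces that ordering for the example sentences, assuming that ties are broken by order of first appearance. build_word_index is a hypothetical helper, not part of Keras.

```python
from collections import Counter

text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]


def build_word_index(sentences):
    """Map each word to a 1-based index, most frequent word first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.lower().split())
    # sorted() is stable, so equally frequent words keep their first-seen order.
    ranked = sorted(counts, key=lambda word: counts[word], reverse=True)
    return {word: index for index, word in enumerate(ranked, start=1)}


word_index = build_word_index(text)
print(list(word_index.keys()))
# ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
```

Because the indices start at 1, a matrix indexed by them needs one column more than the vocabulary size, which is why the vectors above have ten columns for nine words.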
Complete Python Code for Bag of Words Model
Here’s the complete code for this tutorial.
from keras.preprocessing.text import Tokenizer

text = [
    'There was a man',
    'The man had a dog',
    'The dog and the man walked',
]

# fit the tokenizer on the text
model = Tokenizer()
model.fit_on_texts(text)

# print the vocabulary
print(f'Key : {list(model.word_index.keys())}')

# create the bag of words representation
rep = model.texts_to_matrix(text, mode='count')
print(rep)
Output:
Key : ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]
Conclusion
This tutorial was about implementing the Bag of Words model in Python. We used the Keras text preprocessing module for implementing it. Hope you had fun learning with us.