Bag of Words Model in Python [In 10 Lines of Code!]

Filed Under: Python Advanced
BOW

In this tutorial, our aim is to implement the Bag of Words model in Python under ten lines of code. Before we get into the implementation, let’s learn about the Bag of Words model.

What is the Bag of Words model?

Bag of words (BOW) is a technique to extract features from the text for Natural Language Processing.

The bag of word model focuses on the word count to represent a sentence. So, the order of the words in a sentence is not considered under the BOW model.

The steps involved in creating the BOW model for a piece of text are as follows:

  1. Tokenize the text and store the tokens in a list.
  2. Create a vocabulary out of the tokens.
  3. Count the number of occurrences of tokens in each sentence and store the count.
  4. Create a bag of words model by converting the text into vectors with count of each word from the vocabulary.

We are going to use the Keras preprocessing module to implement BOW. In particular, we will be using the Tokenizer class which is a text tokenization utility class.

Let’s see an example.

Example of the Bag of Words Model

Consider the following sentences.

  'There was a man',
  'The man had a dog',
  'The dog and the man walked',

There are three sentences in our text. Let’s create a vocabulary out of this.

For getting a vocabulary out of the text we are going to use tokenization.

Tokenization is the process of breaking down a piece of text into smaller units called tokens.

After tokenizing we get the vocabulary as follows :

mantheadogtherewashadandwalked
Vocabulary

Now to create the bag of words model we need to count the number occurrences of these tokens in the three sentences.

We the get the count as follows :

mantheadogtherewashadandwalked
S1101011000
S2111100100
S3120100011
Vocabulary

We can create the vectors for the three sentences as shown :

[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
[0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
[0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]

Now let’s learn how to implement bag of words model using Keras.

How to implement Bag of Words using Python Keras?

To implement the bag of words using Keras we will first have to import the module.

from keras.preprocessing.text import Tokenizer

Now we have to declare the text for the model to work on.

text = [
  'There was a man',
  'The man had a dog',
  'The dog and the man walked',
]

Now let’s see how to get the vocabulary and the bag of words representations for the text.

1. Fit a Tokenizer on the text

To create tokens out of the text we will use Tokenizer class from Keras Text preprocessing module.

model = Tokenizer()
model.fit_on_texts(text)

Now you can use this model to get the vector representations of the sentences as well as to get the vocabulary.

2. Get Bag of Words representation

To get the vector representation for the sentences you need to run the following line of code.

rep = model.texts_to_matrix(text, mode='count')

Here we use the text to matrix method from the tokenizer class. It converts a list of texts to a Numpy matrix. By mentioning mode as count we make sure that the final matrix has the counts for each token.

To display the vectors use:

print(rep)

Output:

[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]

Now let’s see how to print the vocabulary.

3. Display the vocabulary

To display the vocabulary (tokens) use :

print(f'Key : {list(model.word_index.keys())}')

Output :

Key : ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']

Complete Python Code for Bag of Words Model

Here’s the complete code for this tutorial.

from keras.preprocessing.text import Tokenizer

text = [
  'There was a man',
  'The man had a dog',
  'The dog and the man walked',
]
# using tokenizer 
model = Tokenizer()
model.fit_on_texts(text)

#print keys 
print(f'Key : {list(model.word_index.keys())}')

#create bag of words representation 
rep = model.texts_to_matrix(text, mode='count')
print(rep)

Output:

Key : ['man', 'the', 'a', 'dog', 'there', 'was', 'had', 'and', 'walked']
[[0. 1. 0. 1. 0. 1. 1. 0. 0. 0.]
 [0. 1. 1. 1. 1. 0. 0. 1. 0. 0.]
 [0. 1. 2. 0. 1. 0. 0. 0. 1. 1.]]

Conclusion

This tutorial was about implementing the Bag of Words model in Python. We used the Keras text preprocessing module for implementing it. Hope you had fun learning with us.

Leave a Reply

Your email address will not be published. Required fields are marked *

close
Generic selectors
Exact matches only
Search in title
Search in content
Search in posts
Search in pages