Python Gensim Word2Vec


Gensim is an open-source vector space and topic modelling toolkit. It is implemented in Python and uses NumPy & SciPy. It also uses Cython for performance.

1. Python Gensim Module

Gensim is designed for data streaming, handling large text collections, and efficient incremental algorithms. In simpler terms, Gensim is designed to extract semantic topics from documents automatically in the most efficient and effortless manner possible.

This differentiates it from most other packages, which target only in-memory, batch processing. At the core of Gensim, unsupervised algorithms such as Latent Semantic Analysis and Latent Dirichlet Allocation examine statistical word co-occurrence patterns within a corpus of training documents to discover the semantic structure of the documents.

2. Why use Gensim?

Gensim has various features that give it an edge over other scientific packages:

  • Memory independent – you don’t need the whole training corpus to reside in RAM at any given time, which means it can process large, web-scale corpora with ease.
  • It provides I/O wrappers and converters around several popular data formats.
  • Gensim has efficient implementations of various vector space algorithms, including TF-IDF, distributed incremental Latent Dirichlet Allocation (LDA), Random Projections, and distributed incremental Latent Semantic Analysis; adding new ones is also easy.
  • It also provides similarity queries for documents in their semantic representation, as shown in the sketch right after this list.
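
As a small illustration of the bag-of-words pipeline and similarity queries mentioned above, here is a minimal sketch using Gensim's Dictionary, TfidfModel, and MatrixSimilarity classes (the toy documents are made up for the example):

from gensim import corpora, models, similarities

# toy corpus: three pre-tokenized documents (made up for illustration)
documents = [['human', 'computer', 'interaction'],
             ['graph', 'trees', 'minors'],
             ['graph', 'minors', 'survey']]
dictionary = corpora.Dictionary(documents)                # token -> id mapping
corpus = [dictionary.doc2bow(doc) for doc in documents]   # bag-of-words vectors
tfidf = models.TfidfModel(corpus)                         # fit TF-IDF weights
index = similarities.MatrixSimilarity(tfidf[corpus])      # similarity index
# query: which document is most similar to "graph minors"?
query = tfidf[dictionary.doc2bow(['graph', 'minors'])]
print(list(index[query]))                                 # cosine similarity per document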

3. Getting Started with Gensim

Before getting started with Gensim, you need to check if your machine is ready to work with it. Gensim assumes the following are installed and working on your machine:

  • Python 2.6 or later
  • NumPy 1.3 or later
  • SciPy 0.7 or later

3.1) Install Gensim Library

Once the above-mentioned requirements are satisfied, your machine is ready for Gensim. You can get it using pip. Just go to your terminal and run the following command:


sudo pip install --upgrade gensim

3.2) Using Gensim

You can use Gensim in any of your Python scripts just by importing it like any other package. Just use the following import:


import gensim
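
To confirm the installation worked, you can print the package version (a standard attribute on most Python packages, not anything Gensim-specific):

import gensim
print(gensim.__version__)  # prints the installed version string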

3.3) Develop Gensim Word2Vec Embedding

We have talked a lot about text, words, and vectors while introducing Gensim; let's start by developing a Word2Vec embedding:


from gensim.models import Word2Vec
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
			['this', 'is', 'the', 'second', 'sentence'],
			['yet', 'another', 'sentence'],
			['one', 'more', 'sentence'],
			['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# summarize the loaded model
print(model)
# summarize vocabulary
words = list(model.wv.index_to_key)  # was model.wv.vocab in Gensim < 4.0
print(words)
# access vector for one word
print(model.wv['sentence'])  # was model['sentence'] in Gensim < 4.0
# save model
model.save('model.bin')
# load model
new_model = Word2Vec.load('model.bin')
print(new_model)

Let's run the code; we expect a vector for each word:
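With the model trained, you can also ask for the nearest neighbours of a word. A quick sketch, continuing from the script above (on a toy corpus this small, the similarity scores are close to random):

# words most similar to 'sentence', by cosine similarity of their vectors
print(model.wv.most_similar('sentence', topn=3))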

3.4) Visualize Word Embedding

There is one vector for every word in our training data, and the raw numbers are definitely hard to interpret. Visualizing can help us in this scenario; below we project the vectors down to two dimensions with PCA and plot them:


from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
			['this', 'is', 'the', 'second', 'sentence'],
			['yet', 'another', 'sentence'],
			['one', 'more', 'sentence'],
			['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
# fit a 2d PCA model to the vectors
X = model.wv[model.wv.index_to_key]  # was model[model.wv.vocab] in Gensim < 4.0
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.index_to_key)
for i, word in enumerate(words):
	pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()

Let's run the program and see if we get something simpler that we can understand easily:

3.5) Load Google’s Word2Vec Embedding

Using existing pre-trained data may not be the best approach for every NLP application, but training your own vectors at this point can be a really time-consuming and difficult task, as it requires a lot of RAM and, of course, time. So we are using Google's pre-trained data for this example. You'll need the GoogleNews-vectors-negative300 file, which you can find here.

Download the file and unzip it; we'll use the binary file inside.
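
The download is distributed as a .gz archive; one way to decompress it from Python is with the standard library (the filenames below assume the standard distribution names):

import gzip
import shutil

# decompress the .gz archive to the raw binary file
with gzip.open('GoogleNews-vectors-negative300.bin.gz', 'rb') as f_in, \
        open('GoogleNews-vectors-negative300.bin', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)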

Here is a sample program:


from gensim.models import KeyedVectors
# load the google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

The above example loads Google's word2vec data and then calculates king - man + woman = ?. We should expect the following:


[('queen', 0.7118192315101624)]
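
The full GoogleNews model is large and needs several gigabytes of RAM to load. If memory is tight, load_word2vec_format accepts a limit argument that reads only the first N vectors, and the loaded KeyedVectors object supports several other handy queries:

# load only the first 500,000 vectors to save memory
model = KeyedVectors.load_word2vec_format(filename, binary=True, limit=500000)
# cosine similarity between two words
print(model.similarity('woman', 'man'))
# find the word that doesn't belong
print(model.doesnt_match(['breakfast', 'cereal', 'dinner', 'lunch']))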


3.6) Load Stanford’s GloVe Embedding

There is another popular algorithm for converting words to vectors: Global Vectors for Word Representation, or GloVe. We'll use it for our next example.

Since we are again using existing data, we'll need a file; this one is relatively small and can be downloaded from the Stanford NLP GloVe project page.

First we'll need to convert the file to word2vec format, which can be done as follows:


from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
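
As a side note: newer Gensim versions (4.0+) can load the raw GloVe text file directly via the no_header flag, making the conversion step optional. A sketch:

from gensim.models import KeyedVectors
# Gensim >= 4.0 can read the GloVe text format directly
model = KeyedVectors.load_word2vec_format('glove.6B.100d.txt', binary=False, no_header=True)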

Once this is done, we are ready to move forward with our example:


from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)
# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)

Again we are expecting queen as the output; let's run the program and check the results:

4. Conclusion

In this tutorial, we have seen how to produce and load word embeddings in Python using Gensim. To be specific, we have learned:

  • To train our own word embedding model on text data.
  • To visualize a trained word embedding model.
  • To load pre-trained GloVe and word2vec word embedding models from Stanford and Google, respectively.

We have seen that Gensim makes it effortless and efficient to convert words to vectors, and that querying the resulting embeddings for similarity is just as easy.
