Gensim is designed for data streaming, handling large text collections, and efficient incremental algorithms. In simpler terms, Gensim is designed to extract semantic topics from documents automatically in the most efficient and effortless manner possible.
This differentiates it from most other packages, which target only in-memory, batch processing. At the core of Gensim, unsupervised algorithms such as Latent Semantic Analysis and Latent Dirichlet Allocation examine statistical word co-occurrence patterns within a corpus of training documents to discover the semantic structure of the documents.
Why use Gensim?
Gensim has various features that give it an edge over other scientific packages:
- Memory independence – you don’t need the whole training corpus to reside in RAM at any one time, which means Gensim can process large, web-scale corpora with ease.
- It provides I/O wrappers and converters around several popular data formats.
- Efficiency – Gensim has efficient implementations of various vector space algorithms, including Tf-Idf, distributed incremental Latent Dirichlet Allocation (LDA), Random Projections, and distributed incremental Latent Semantic Analysis; adding new ones is also easy.
- It provides similarity queries for documents in their semantic representation.
Before getting started with Gensim, you need to check that your machine is ready to work with it. Gensim assumes the following are installed and working on your machine:
- Python 2.6 or later
- NumPy 1.3 or later
- SciPy 0.7 or later
Once the above requirements are satisfied, your machine is ready for Gensim. You can get it using pip. Just go to your terminal and run the following command:
```shell
sudo pip install --upgrade gensim
```
You can use Gensim in any of your Python scripts just by importing it like any other package.
Develop Gensim Word2Vec Embedding
We have talked a lot about text, words, and vectors while introducing Gensim; let’s start by developing a word2vec embedding:
```python
from gensim.models import Word2Vec

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train model
model = Word2Vec(sentences, min_count=1)

# summarize the loaded model
print(model)

# summarize vocabulary (in Gensim 4.x the vocabulary lives on model.wv)
words = list(model.wv.index_to_key)
print(words)

# access the vector for one word
print(model.wv['sentence'])

# save model
model.save('model.bin')

# load model
new_model = Word2Vec.load('model.bin')
print(new_model)
```
Let’s run the code; we expect a learned vector for each word, so the program prints the model summary, the vocabulary, and the 100-dimensional vector for the word ‘sentence’ (100 is the default vector size):
Visualize Word Embedding
Printing one 100-dimensional vector for every word in our training data is definitely hard to interpret. Visualizing the embedding can help in this scenario:
```python
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

# train model
model = Word2Vec(sentences, min_count=1)

# fit a 2D PCA model to the word vectors (Gensim 4.x API)
words = list(model.wv.index_to_key)
X = model.wv[words]
pca = PCA(n_components=2)
result = pca.fit_transform(X)

# create a scatter plot of the projection, labelling each point with its word
pyplot.scatter(result[:, 0], result[:, 1])
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
Let’s run the program and see if we get something which is simpler and we can understand easily:
Load Google’s Word2Vec Embedding
Using existing pre-trained vectors may not be the best approach for every NLP application, but training your own can be a really time-consuming and difficult task at this point, as it requires a lot of RAM and, of course, time. So we are using Google’s pre-trained word2vec vectors for this example. You’ll be needing a file, which you can find here.
Download the file, unzip it, and we’ll use the binary file inside.
Here is a sample program:
```python
from gensim.models import KeyedVectors

# load the Google word2vec model
filename = 'GoogleNews-vectors-negative300.bin'
model = KeyedVectors.load_word2vec_format(filename, binary=True)

# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
```
The above example loads Google’s word2vec vectors and then evaluates the analogy (king - man) + woman = ?. We expect the nearest word to be ‘queen’. Let’s see the output for this program:
Load Stanford’s GloVe Embedding
There is another algorithm for converting words to vectors, popularly known as Global Vectors for Word Representation, or GloVe. We’ll use its vectors for our next example.
Since we are again using existing pre-trained data, we’ll be needing a file; this one is relatively smaller and can be downloaded from here.
First we’ll need to convert the file to word2vec format, which can be done as:
```python
from gensim.scripts.glove2word2vec import glove2word2vec

glove_input_file = 'glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)
```
Once this is done, we are ready to move forward with our example:
```python
from gensim.models import KeyedVectors

# load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
model = KeyedVectors.load_word2vec_format(filename, binary=False)

# calculate: (king - man) + woman = ?
result = model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
print(result)
```
Again we expect ‘queen’ as the output; let’s run the program and see the results:
In this tutorial, we have seen how to produce and load word embeddings in Python using Gensim. Specifically, we have learned:
- To train our own word embedding model on text data.
- To visualize a trained word embedding model.
- To load pre-trained GloVe and word2vec word embedding models from Stanford and Google respectively.
We have seen that Gensim makes it effortless and efficient to convert words to vectors, and that similarity and analogy queries on the trained vectors are just as easy.