Natural Language Processing: How to Process NLP Data?

Filed Under: Python Advanced
Natural Language Processing Tasks

Let’s look at some of the most popular Natural Language Processing tasks, and how to perform them using Python. Natural Language Processing (NLP) uses algorithms for human language interpretation and manipulation.

This is one of the most commonly used fields of machine learning.

If AI continues to grow, we will need specialists in developing models that examine speech and vocabulary, discover contextual trends, and create text and audio insights.

1. Preparing the Datasets for Natural Language Processing Project

Let’s get ourselves some data. So, we’ll just copy the first 30 lines from www.gutenberg.org/files/35/35-0.txt, which is a free novel from Project Gutenberg.

If you are interested in other free datasets, have a look the top 11 machine learning datasets

text = '''The Time Traveller (for so it will be convenient to speak of him) was
expounding a recondite matter to us. His pale grey eyes shone and
twinkled, and his usually pale face was flushed and animated. The fire
burnt brightly, and the soft radiance of the incandescent lights in the
lilies of silver caught the bubbles that flashed and passed in our
glasses. Our chairs, being his patents, embraced and caressed us rather
than submitted to be sat upon, and there was that luxurious
after-dinner atmosphere, when thought runs gracefully free of the
trammels of precision. And he put it to us in this way—marking the
points with a lean forefinger—as we sat and lazily admired his
earnestness over this new paradox (as we thought it) and his fecundity.

“You must follow me carefully. I shall have to controvert one or two
ideas that are almost universally accepted. The geometry, for instance,
they taught you at school is founded on a misconception.”

“Is not that rather a large thing to expect us to begin upon?” said
Filby, an argumentative person with red hair.

“I do not mean to ask you to accept anything without reasonable ground
for it. You will soon admit as much as I need from you. You know of
course that a mathematical line, a line of thickness _nil_, has no real
existence. They taught you that? Neither has a mathematical plane.
These things are mere abstractions.”

“That is all right,” said the Psychologist.

“Nor, having only length, breadth, and thickness, can a cube have a
real existence.”

“There I object,” said Filby. “Of course a solid body may exist. All
real things—”

“So most people think. But wait a moment. Can an _instantaneous_ cube
exist?”

“Don’t follow you,” said Filby.

“Can a cube that does not last for any time at all, have a real
existence?”

Filby became pensive. “Clearly,” the Time Traveller proceeded, “any
real body must have extension in _four_ directions: it must have
Length, Breadth, Thickness, and—Duration. But through a natural
infirmity of the flesh, which I will explain to you in a moment, we
incline to overlook this fact. There are really four dimensions, three
which we call the three planes of Space, and a fourth, Time. There is,
however, a tendency to draw an unreal distinction between the former
three dimensions and the latter, because it happens that our
consciousness moves intermittently in one direction along the latter
from the beginning to the end of our lives.”'''

2. Stemming the Data

Stemming is a process that is used by extracting affixes from them to remove the base structure of the terms.

Stemming is used by search engines to catalog terms. This is why a search engine will only store the stems, instead of preserving all types of a word. Stemming, therefore, decreases the scale of the index and improves the accuracy of retrieval.

In NLTK (which stands for Natural Language Tool Kit), we have two main stemming functions:

  • Porter Stemmer
  • Lancaster Stemmer

Porter Stemmer

Without a question, Port Stemmer is the most widely used stemmer that is also one of the mildest stemmers.

It is also the oldest, by a broad margin, stemming algorithm.

I will directly be coding, assuming a basic knowledge of Python lists, loops, etc. So if we do this:

import re
text = re.sub("\n"," ",text)

import nltk
from nltk.stem import PorterStemmer

word_stemmer = PorterStemmer()
for word in text.split(" "):
  if len(word)>10:
    print((word,word_stemmer.stem(word)))

then we get the output as:

('incandescent', 'incandesc') ('after-dinner', 'after-dinn') ('atmosphere,', 'atmosphere,') ('way—marking', 'way—mark') ('forefinger—as', 'forefinger—a') ('earnestness', 'earnest') ('universally', 'univers') ('misconception.”', 'misconception.”') ('argumentative', 'argument') ('mathematical', 'mathemat') ('mathematical', 'mathemat') ('abstractions.”', 'abstractions.”') ('Psychologist.', 'psychologist.') ('existence.”', 'existence.”') ('_instantaneous_', '_instantaneous_') ('existence?”', 'existence?”') ('directions:', 'directions:') ('and—Duration.', 'and—duration.') ('dimensions,', 'dimensions,') ('distinction', 'distinct') ('consciousness', 'conscious') ('intermittently', 'intermitt')

So, as you can see most of the words were correctly shortened. Those that weren’t, for example, “mathemat” will however produce this same word for all similar words. So it’s not a problem.

Lancaster Stemmer

The Lancaster stemming algorithm is very rough.

The quickest algorithm here and it will massively decrease your corpus vocabulary, but not the method you would use if you want more differentiation.

from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()

for word in text.split(" "):
  if len(word)>10:
    print((word,Lanc_stemmer.stem(word)))

gives:

('incandescent', 'incandesc') ('after-dinner', 'after-dinn') ('atmosphere,', 'atmosphere,') ('way—marking', 'way—marking') ('forefinger—as', 'forefinger—as') ('earnestness', 'earnest') ('universally', 'univers') ('misconception.”', 'misconception.”') ('argumentative', 'argu') ('mathematical', 'mathem') ('mathematical', 'mathem') ('abstractions.”', 'abstractions.”') ('Psychologist.', 'psychologist.') ('existence.”', 'existence.”') ('_instantaneous_', '_instantaneous_') ('existence?”', 'existence?”') ('directions:', 'directions:') ('and—Duration.', 'and—duration.') ('dimensions,', 'dimensions,') ('distinction', 'distinct') ('consciousness', 'conscy') ('intermittently', 'intermit')

3. Lemmatization of Text Data

The process of lemmatization is like stemming.

After lemmatization, the output we can get is called ‘lemma’, which is a root word rather than a root stem of the stemming output.

Unlike stemming, we will get a valid word after lemmatization, which implies the same thing.

nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

for word in text.split():
  if len(word)>5 and word!=lemmatizer.lemmatize(word):
    print((word,lemmatizer.lemmatize(word)))
  elif len(word)>10:
    print((word,lemmatizer.lemmatize(word)))

gives us:

('incandescent', 'incandescent') ('lights', 'light') ('lilies', 'lily') ('bubbles', 'bubble') ('after-dinner', 'after-dinner') ('atmosphere,', 'atmosphere,') ('trammels', 'trammel') ('way—marking', 'way—marking') ('points', 'point') ('forefinger—as', 'forefinger—as') ('earnestness', 'earnestness') ('universally', 'universally') ('misconception.”', 'misconception.”') ('argumentative', 'argumentative') ('mathematical', 'mathematical') ('mathematical', 'mathematical') ('things', 'thing') ('abstractions.”', 'abstractions.”') ('Psychologist.', 'Psychologist.') ('existence.”', 'existence.”') ('_instantaneous_', '_instantaneous_') ('existence?”', 'existence?”') ('directions:', 'directions:') ('and—Duration.', 'and—Duration.') ('dimensions,', 'dimensions,') ('planes', 'plane') ('distinction', 'distinction') ('dimensions', 'dimension') ('consciousness', 'consciousness') ('intermittently', 'intermittently')

Difference: The PorterStemmer class chops off the word ‘es’. The WordNetLemmatizer class considers that as a true word.

In plain terms, the stemming technique looks only at the word’s shape, while the lemmatization technique looks at the word’s meaning.

4. Part Of Speech (POS) tags

Part-of-Speech (PoS) tagging can be defined as the system by which one of the parts of speech is allocated to the word. Typically, it is called POS labeling.

We may say in clear terms that POS tagging is a job of marking each word with the proper part of speech in an expression.

We do know that nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their sub-categories are part of the vocabulary.

nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag

for sentence in text.split(".")[0]:
  token = sentence.split(" ")[1:]
  token = [i for i in token if i] 
  tokens_tag = pos_tag(token)
  print(tokens_tag)

gives us:

[('Time', 'NNP'), ('Traveller', 'NNP'), ('(for', 'NNP'), ('so', 'IN'), ('it', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('convenient', 'JJ'), ('to', 'TO'), ('speak', 'VB'), ('of', 'IN'), ('him)', 'NN'), ('was', 'VBD'), ('expounding', 'VBG'), ('a', 'DT'), ('recondite', 'JJ'), ('matter', 'NN'), ('to', 'TO'), ('us', 'PRP')]

Now, we’ll go into some of the natural language processing tasks.

5. Remove \n tags

Let’s remove all the newline tags here so we can move ahead with clean text.

import re
text = re.sub("\n"," ",text)

6. Find synonyms

First, let’s see how to get antonyms for words in your text. I’m of course assuming basic knowledge of Python here. In the example below, I found the synonyms for “large enough” words (length>5), since we don’t often need synonyms for much smaller words:

import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet

for word in text.split(" "):
  if len(word)>5:
    list_syn = []
    for syn in wordnet.synsets(word): 
      for lemm in syn.lemmas():
        if lemm.name() not in list_syn:
          list_syn.append(lemm.name())
    if list_syn:
      print(word + ":-")
      print(" "+str(list_syn))

I accommodated for empty synonym lists and repeating words, and we get quite a nice output:

Traveller:-
 ['traveler', 'traveller']
convenient:-
 ['convenient', 'commodious']
expounding:-
 ['exposition', 'expounding', 'elaborate', 'lucubrate', 'expatiate', 'exposit', 'enlarge', 'flesh_out', 'expand', 'expound', 'dilate', 'set_forth']
recondite:-
 ['abstruse', 'deep', 'recondite']
matter:-
 ['matter', 'affair', 'thing', 'topic', 'subject', 'issue', 'count', 'weigh']
usually:-
 ['normally', 'usually', 'unremarkably', 'commonly', 'ordinarily']
flushed:-
 ['blush', 'crimson', 'flush', 'redden', 'level', 'even_out', 'even', 'scour', 'purge', 'sluice', 'flushed', 'rose-cheeked', 'rosy', 'rosy-cheeked', 'red', 'reddened', 'red-faced']
radiance:-
 ['radiance', 'glow', 'glowing', 'radiancy', 'shine', 'effulgence', 'refulgence', 'refulgency']
incandescent:-
 ['incandescent', 'candent']
lights:-
 ['light', 'visible_light', 'visible_radiation', 'light_source', 'luminosity', 'brightness', 'brightness_level', 'luminance', 'luminousness', 'illumination', 'lightness', 'lighting', 'sparkle', 'twinkle', 'spark', 'Inner_Light', 'Light', 'Light_Within', 'Christ_Within', 'lighter', 'igniter', 'ignitor', 'illume', 'illumine', 'light_up', 'illuminate', 'fire_up', 'alight', 'perch', 'ignite', 'fall', 'unhorse', 'dismount', 'get_off', 'get_down']

7. Find Antonyms

Similarly, for antonyms:

for word in text.split(" "):
  if len(word)>5:
    list_ant = []
    for syn in wordnet.synsets(word): 
      for lemm in syn.lemmas():
        if lemm.antonyms(): 
            list_ant.append(lemm.antonyms()[0].name())
    if list_ant:
      print(word + ":-")
      print(" "+str(list_ant))

we get:

convenient:- ['inconvenient', 'incommodious'] 
expounding:- ['contract'] 
usually:- ['remarkably'] 
lights:- ['dark', 'extinguish'] 
caught:- ['unhitch'] 
passed:- ['fail', 'fail', 'be_born'] 
thought:- ['forget'] 
gracefully:- ['gracelessly', 'ungraciously', 'ungracefully'] 
points:- ['unpointedness'] 
admired:- ['look_down_on'] 
earnestness:- ['frivolity'] 
thought:- ['forget'] 
follow:- ['precede', 'predate', 'precede'] 
founded:- ['abolish'] 
argumentative:- ['unargumentative'] 
accept:- ['reject', 'refuse', 'refuse'] 
reasonable:- ['unreasonable'] 
ground:- ['figure'] 
course:- ['unnaturally'] 
mathematical:- ['verbal'] 
thickness:- ['thinness', 'thinness'] 
mathematical:- ['verbal'] 
having:- ['lack', 'abstain', 'refuse'] 
course:- ['unnaturally'] 
follow:- ['precede', 'predate', 'precede'] 
extension:- ['flexion'] 
natural:- ['unnatural', 'artificial', 'supernatural', 'flat'] 
incline:- ['indispose'] 
overlook:- ['attend_to'] 
unreal:- ['real', 'real', 'natural', 'substantial'] 
former:- ['latter', 'latter'] 
happens:- ['dematerialize', 'dematerialise'] 
consciousness:- ['unconsciousness', 'incognizance'] 
latter:- ['former', 'former'] 
beginning:- ['ending', 'end','finish']

8. Getting phrases containing nouns

We can get the phrases inside a text, thereby reducing the information loss when tokenizing and topic modeling. This can be done using the spacy library:

import spacy
spacy_obj = spacy.load('en_core_web_sm')

And then we can simply run this over our input text:

spacy_text = spacy_obj(text)
for phrase in spacy_text.noun_chunks:
  print(phrase)

This will give us phrases that contain nouns, which are one of the most important aspects of a text, especially a novel:

The Time Traveller
a recondite matter
His pale grey eyes
his usually pale face
the soft radiance
the incandescent lights
a lean forefinger
this new paradox
one or two
ideas
an argumentative person
reasonable ground
a mathematical line
no real
existence
a mathematical plane
mere abstractions
the Psychologist
a
real existence
an _instantaneous_ cube
a real
existence
the Time Traveller
_four_ directions
a natural
infirmity
the three planes
an unreal distinction
the former
three dimensions
our
consciousness

If we combine these phrases, it kind of forms the story summary.

Ending Note

If you liked reading this article and want to read more, follow me as an author. Until then, keep coding!

Leave a Reply

Your email address will not be published. Required fields are marked *

close
Generic selectors
Exact matches only
Search in title
Search in content