POS Tagging with NLTK and Chunking in NLP [EXAMPLES]

POS Tagging

POS tagging (part-of-speech tagging) is the process of labeling each word in a text with its part of speech, based on both the word’s definition and its context. A tagger reads text in a given language and assigns a tag (noun, verb, adjective, and so on) to each token. It is also called grammatical tagging.

Let’s learn with an NLTK part-of-speech example:

Input: Everything to permit us.

Output: [('Everything', 'NN'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP')]

Steps Involved in the POS tagging example

Tokenize the text with word_tokenize.
Apply pos_tag to the resulting tokens, that is, nltk.pos_tag(tokens).
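
The two steps above can be sketched as follows. This is a minimal sketch, assuming NLTK is installed and that its tokenizer and tagger data have already been fetched with nltk.download(). The helper name tag_sentence is hypothetical, and the LookupError branch only reports missing data instead of downloading it.

```python
import nltk


def tag_sentence(sentence):
    """Tokenize a sentence, then tag each token with its part of speech."""
    tokens = nltk.word_tokenize(sentence)  # step 1: split into tokens
    return nltk.pos_tag(tokens)            # step 2: list of (word, tag) pairs


try:
    tagged = tag_sentence("Everything to permit us")
    print(tagged)
except LookupError as err:
    # NLTK raises LookupError when tokenizer or tagger data is not installed;
    # the message names the resource to pass to nltk.download().
    tagged = None
    print("Missing NLTK data:", err)
```

The exact tags depend on the tagger model that is installed, so the printed result may differ slightly between NLTK versions.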

Common NLTK POS tags are listed below:

Abbreviation	Meaning
CC	coordinating conjunction
CD	cardinal number
DT	determiner
EX	existential there
FW	foreign word
IN	preposition/subordinating conjunction
JJ	adjective (large)
JJR	adjective, comparative (larger)
JJS	adjective, superlative (largest)
LS	list item marker
MD	modal (could, will)
NN	noun, singular (cat, tree)
NNS	noun, plural (desks)
NNP	proper noun, singular (Sarah)
NNPS	proper noun, plural (Indians, Americans)
PDT	predeterminer (all, both, half)
POS	possessive ending (parent's)
PRP	personal pronoun (hers, herself, him, himself)
PRP$	possessive pronoun (her, his, my, our)
RB	adverb (occasionally, swiftly)
RBR	adverb, comparative (harder)
RBS	adverb, superlative (hardest)
RP	particle (about)
TO	infinitive marker (to)
UH	interjection (goodbye)
VB	verb, base form (ask)
VBG	verb, gerund or present participle (judging)
VBD	verb, past tense (pleaded)
VBN	verb, past participle (reunified)
VBP	verb, present tense, not 3rd person singular (wrap)
VBZ	verb, present tense, 3rd person singular (bases)
WDT	wh-determiner (that, what)
WP	wh-pronoun (who)
WRB	wh-adverb (how)

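The table above lists the most common tags produced by the default NLTK tagger, which follows the Penn Treebank tag set. The NLTK POS tagger assigns this grammatical information to each word of a sentence. Before running the examples, make sure NLTK is installed and that its tokenizer and tagger data have been downloaded with nltk.download().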

What is Chunking in NLP?

Chunking in NLP is the process of taking small pieces of information and grouping them into larger units. Its primary use is grouping words into “noun phrases.” Chunking adds structure to a sentence by applying regular-expression patterns on top of POS tags. The resulting groups of words are called “chunks.” Chunking is also called shallow parsing.

In shallow parsing, there is at most one level between the root and the leaves, while deep parsing produces more than one level. Shallow parsing is also called light parsing.

Rules for Chunking

There are no predefined rules; you write and combine tag patterns according to your needs.

For example, suppose you need to chunk nouns, past-tense verbs, adjectives, and coordinating conjunctions in a sentence. You can use a rule like the one below:

chunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}

The following table shows what each symbol means:

Symbol	Description
.	Any character except new line
*	Match 0 or more repetitions
?	Match 0 or 1 repetitions
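
To see how these symbols behave inside a tag pattern, here is a small sketch that runs a rule on hand-tagged words, so no tagger data is needed. The chunk label NOUNS and the sample words are assumptions made for illustration. The pattern <NN.?> matches NN followed by at most one more character, so it covers NN, NNS and NNP but not the four-letter tag NNPS.

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Hand-tagged input (illustrative only), so the example does not depend on a tagger
tagged = [
    ("dogs", "NNS"),
    ("chase", "VBP"),
    ("Sarah", "NNP"),
    ("and", "CC"),
    ("Americans", "NNPS"),
]

# One or more tags that are "NN" plus an optional single extra character
parser = RegexpParser("NOUNS: {<NN.?>+}")
tree = parser.parse(tagged)

# Collect the words of every chunk (subtree) directly under the root
noun_chunks = [
    [word for word, tag in child.leaves()]
    for child in tree
    if isinstance(child, Tree)
]
print(noun_chunks)  # [['dogs'], ['Sarah']]
```

To include plural proper nouns as well, the pattern could be widened, for example to <NN.*>.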

Now let us write some code to understand the rule better:

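from nltk import pos_tag
from nltk import RegexpParser

# Split the sentence into words on whitespace
text = "learn php from guru99 and make study easy".split()
print("After Split:", text)

# Tag each word with its part of speech
tokens_tag = pos_tag(text)
print("After Token:", tokens_tag)

# Chunk rule: nouns, then past-tense verbs, then adjectives, then an optional conjunction
patterns = """mychunk:{<NN.?>*<VBD.?>*<JJ.?>*<CC>?}"""
chunker = RegexpParser(patterns)
print("After Regex:", chunker)

# Apply the rule to the tagged words
output = chunker.parse(tokens_tag)
print("After Chunking", output)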

Output:

After Split: ['learn', 'php', 'from', 'guru99', 'and', 'make', 'study', 'easy']
After Token: [('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN'), ('and', 'CC'), ('make', 'VB'), ('study', 'NN'), ('easy', 'JJ')]
After Regex: chunk.RegexpParser with 1 stages:
RegexpChunkParser with 1 rules:
       <ChunkRule: '<NN.?>*<VBD.?>*<JJ.?>*<CC>?'>
After Chunking (S
  (mychunk learn/JJ)
  (mychunk php/NN)
  from/IN
  (mychunk guru99/NN and/CC)
  make/VB
  (mychunk study/NN easy/JJ))

The conclusion from the above part-of-speech tagging Python example: “make” is tagged VB and “from” is tagged IN, and neither tag appears in the rule (which covers only past-tense verbs, VBD), so these words are not placed in a mychunk chunk.

Use Case of Chunking

Chunking is used for entity detection. An entity is the part of a sentence from which the machine gets the value for an intention.

Example:
Temperature of New York.
Here, Temperature is the intention and New York is the entity.

In other words, chunking is used to select subsets of tokens. The code below shows how chunking selects tokens. It also draws a graph of the resulting tree, in which each noun phrase chunk appears as its own branch.

Code to Demonstrate Use Case

import nltk

text = "learn php from guru99"
tokens = nltk.word_tokenize(text)
print(tokens)
tag = nltk.pos_tag(tokens)
print(tag)

# NP chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(tag)
print(result)
result.draw()  # Opens a window that draws the chunk tree graphically

Output:

['learn', 'php', 'from', 'guru99']  -- These are the tokens
[('learn', 'JJ'), ('php', 'NN'), ('from', 'IN'), ('guru99', 'NN')]   -- These are the pos_tag
(S (NP learn/JJ php/NN) from/IN (NP guru99/NN))        -- Noun Phrase Chunking

Graph

From the graph, we can conclude that “learn php” and “guru99” form two separate noun phrase (NP) chunks, whereas the token “from” does not belong to any noun phrase.

Chunking groups different tokens into the same chunk. The result depends on the grammar that has been selected. Chunking with NLTK is also used to tag patterns and to explore text corpora.
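
Selecting the chunked tokens from the tree can be sketched as below. To keep the sketch independent of the tagger, it starts from the tagged tokens shown in the output above rather than calling pos_tag; the variable name noun_phrases is an assumption for illustration.

```python
from nltk import RegexpParser

# Tagged tokens copied from the example output above
tagged = [("learn", "JJ"), ("php", "NN"), ("from", "IN"), ("guru99", "NN")]

cp = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(tagged)

# Keep only the subtrees labeled NP and join their words back into phrases
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(lambda t: t.label() == "NP")
]
print(noun_phrases)  # ['learn php', 'guru99']
```

Filtering on the label means the same loop works unchanged if the grammar later gains other chunk types.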

COUNTING POS TAGS

We discussed the various POS tags in the previous section. In this section, you will learn how to count these tags. Tag counts are useful for text classification and for preparing features for natural-language-based operations. The code comes first, followed by a discussion of its output.

How to count Tags:

First we write working code, and then we explain it step by step.

from collections import Counter
import nltk

text = "Guru99 is one of the best sites to learn WEB, SAP, Ethical Hacking and much more online."
lower_case = text.lower()
tokens = nltk.word_tokenize(lower_case)
tags = nltk.pos_tag(tokens)
# Count how many times each tag occurs, ignoring the words themselves
counts = Counter(tag for word, tag in tags)
print(counts)

Output:

Counter({'NN': 5, ',': 2, 'TO': 1, 'CC': 1, 'VBZ': 1, 'NNS': 1, 'CD': 1, '.': 1, 'DT': 1, 'JJS': 1, 'JJ': 1, 'JJR': 1, 'IN': 1, 'VB': 1, 'RB': 1})

Elaboration of the code

To count the tags, use the Counter class from the collections module. A Counter is a dictionary subclass in which elements are stored as dictionary keys and their counts as the values.
Import nltk, which contains the modules used to tokenize the text.
Write the text whose POS tags you want to count.
Some words are in upper case and some in lower case, so convert the whole text to lower case before tokenizing.
Pass the text through word_tokenize from nltk.

Calculate the pos_tag of each token:

Output = [('guru99', 'NN'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('best', 'JJS'), ('sites', 'NNS'), ('to', 'TO'), ('learn', 'VB'), ('web', 'NN'), (',', ','), ('sap', 'NN'), (',', ','), ('ethical', 'JJ'), ('hacking', 'NN'), ('and', 'CC'), ('much', 'RB'), ('more', 'JJR'), ('online', 'JJ'), ('.', '.')]

Now the Counter comes into play. It was imported on line 1 of the code. The generator expression passes only the tag of each (word, tag) pair to the Counter, so the tags are the keys, the counts are the values, and the Counter reports how many times each tag occurs in the text.
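
The counting step on its own can be sketched with a small hand-tagged list, which avoids any dependency on the tagger. The sample pairs below are assumptions chosen for illustration, not tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for the result of nltk.pos_tag
tagged = [
    ("guru99", "NN"),
    ("is", "VBZ"),
    ("the", "DT"),
    ("best", "JJS"),
    ("site", "NN"),
]

# Only the tag of each pair is counted; the word is ignored
counts = Counter(tag for _, tag in tagged)

print(counts["NN"])           # 2
print(counts.most_common(1))  # [('NN', 2)]
print(counts["VB"])           # 0, a Counter returns 0 for keys it has not seen
```

Because missing keys return 0 rather than raising KeyError, tag counts can be read directly into a fixed-length feature vector.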

Frequency Distribution

A frequency distribution records the number of times each outcome of an experiment occurs. Here it is used to find the frequency of each word in a document. NLTK provides it as the FreqDist class, defined in the nltk.probability module.

A frequency distribution is usually created by repeatedly running the experiment and counting the samples. Each time a sample is seen, its count is incremented by one. E.g.

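from nltk import FreqDist

document = ["the", "cat", "and", "the", "hat"]  # placeholder token list
freq_dist = FreqDist()
for token in document:
    freq_dist[token] += 1  # older NLTK releases used freq_dist.inc(token)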

For any word, we can check how many times it occurred in a particular document. E.g.

Count: freq_dist['and'] returns the number of times 'and' occurred. In NLTK 3, FreqDist is a Counter subclass, so it is indexed directly rather than through a count() method.
Frequency: freq_dist.freq('and') returns the relative frequency of the given sample, that is, its count divided by the total number of samples.
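
Both lookups can be tried on a tiny token list. This is a sketch assuming NLTK 3, where FreqDist is a Counter subclass and counts are read by indexing; the sample sentence is made up for illustration.

```python
from nltk import FreqDist

tokens = "the cat and the dog and the bird".split()
fd = FreqDist(tokens)

count_and = fd["and"]      # absolute count of 'and'
freq_and = fd.freq("and")  # relative frequency: count / total samples
total = fd.N()             # total number of samples counted

print(count_and)          # 2
print(total)              # 8
print(freq_and)           # 0.25
print(fd.most_common(1))  # [('the', 3)]
```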

We will write a small program and will explain its working in detail. We will write some text and will calculate the frequency distribution of each word in the text.

import nltk

a = ("Guru99 is the site where you can find the best tutorials for Software Testing "
     "Tutorial, SAP Course for Beginners. Java Tutorial for Beginners and much more. "
     "Please visit the site guru99.com and much more.")
words = nltk.tokenize.word_tokenize(a)
fd = nltk.FreqDist(words)
fd.plot()

Explanation of code:

Import the nltk module.
Write the text whose word distribution you need to find.
Tokenize the text into words; the resulting list serves as input to nltk’s FreqDist.
Pass the list of words to nltk.FreqDist.
Plot the word counts in a graph using plot().

Please visualize the graph for a better understanding of the text written

Figure: Frequency distribution of each word in the text

NOTE: You need to have matplotlib installed to see the above graph

Observe the graph above. It shows how many times each word occurs in the text. This helps in studying the text and in implementing text-based sentiment analysis. In a nutshell, nltk has a module for counting the occurrences of each word in a text, which helps in preparing statistics for natural language features and plays a significant role in finding keywords. You can also extract text from a PDF using libraries like extract, PyPDF2 and feed it to nltk.FreqDist.

The key term is “tokenize.” After tokenizing, the program checks each word in the given paragraph or document to determine the number of times it occurred. You do not strictly need NLTK for this; you can also do it with your own Python code. NLTK simply provides ready-to-use code for these operations.

Counting individual words may not be very useful. Instead, one should focus on collocations such as bigrams, which deal with words in pairs. These pairs identify useful keywords and make better natural language features to feed to the machine. Their details are given below.
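
As a rough illustration of doing the same count without NLTK, the sketch below uses only the standard library. The regular expression is a simplifying assumption and is far cruder than word_tokenize: it lower-cases the text and keeps runs of letters, digits and apostrophes. The function name word_frequencies is hypothetical.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Count words using a simple regex tokenizer (not equivalent to NLTK's)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


freqs = word_frequencies("The site and the course and the tutorial.")
print(freqs.most_common(2))  # [('the', 3), ('and', 2)]
```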

Collocations: Bigrams and Trigrams

What are Collocations?

Collocations are pairs of words that occur together many times in a document. A collocation is scored by comparing the number of times the pair occurs together with the overall word count of the document.

Consider the electromagnetic spectrum, with terms like ultraviolet rays and infrared rays.

The words ultraviolet and rays usually appear together rather than individually, and hence can be treated as a collocation. Another example is CT Scan. We don’t say CT and Scan separately, and hence they are also treated as a collocation.

Finding collocations requires calculating the frequencies of words and their appearance in the context of other words. These collections of words require filtering to retain useful content terms. Each n-gram of words may then be scored according to some association measure to determine the relative likelihood of each n-gram being a collocation.
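
The filter-then-score procedure described above can be sketched with NLTK’s collocation utilities. The sample sentence, the frequency threshold of 2, and the choice of pointwise mutual information (PMI) as the association measure are all assumptions made for illustration; other measures exist and may suit real data better.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

tokens = (
    "ultraviolet rays damage skin while infrared rays warm skin and "
    "ultraviolet rays fade paint while infrared rays heat food"
).split()

# Count every adjacent word pair in the token list
finder = BigramCollocationFinder.from_words(tokens)

# Filtering step: drop pairs that occur fewer than 2 times
finder.apply_freq_filter(2)

# Scoring step: rank the surviving pairs with an association measure (PMI here)
measures = BigramAssocMeasures()
candidates = finder.nbest(measures.pmi, 10)
print(candidates)
```

On such a short text the scores carry little weight; the point is the pipeline of counting, filtering, and scoring, which leaves pairs such as ('ultraviolet', 'rays') as candidates.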

Collocations can be categorized into two types:

Bigrams: combinations of two words
Trigrams: combinations of three words

Bigrams and trigrams provide more meaningful and useful features for the feature extraction stage. They are especially useful in text-based sentiment analysis.

Bigrams Example Code

import nltk

text = "Guru99 is a totally new kind of learning experience."
Tokens = nltk.word_tokenize(text)
output = list(nltk.bigrams(Tokens))
print(output)

Output:

[('Guru99', 'is'), ('is', 'a'), ('a', 'totally'), ('totally', 'new'), ('new', 'kind'), ('kind', 'of'), ('of', 'learning'), ('learning', 'experience'), ('experience', '.')]

Trigrams Example Code

Sometimes it is important to look at sequences of three words in a sentence for statistical analysis and frequency counts. This again plays a crucial role in forming NLP (natural language processing) features as well as in text-based sentiment prediction.

The same code is run for calculating the trigrams.

import nltk

text = "Guru99 is a totally new kind of learning experience."
tokens = nltk.word_tokenize(text)
output = list(nltk.trigrams(tokens))
print(output)

Output:

[('Guru99', 'is', 'a'), ('is', 'a', 'totally'), ('a', 'totally', 'new'), ('totally', 'new', 'kind'), ('new', 'kind', 'of'), ('kind', 'of', 'learning'), ('of', 'learning', 'experience'), ('learning', 'experience', '.')]
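
Bigrams and trigrams are special cases of n-grams, and the same idea extends to any window size. The sketch below uses nltk.util.ngrams on a pre-split token list (so no tokenizer data is needed); the window size of 4 is an arbitrary choice for illustration.

```python
from nltk.util import ngrams

# Tokens written out by hand, with the final period as its own token
tokens = "Guru99 is a totally new kind of learning experience .".split()

# Slide a window of 4 tokens across the list
four_grams = list(ngrams(tokens, 4))

print(len(four_grams))  # 7
print(four_grams[0])    # ('Guru99', 'is', 'a', 'totally')
```

A list of n tokens yields n - k + 1 windows of size k, which is why 10 tokens give 7 four-grams.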

Tagging Sentences

Tagging a sentence, in a broader sense, means labeling each word as a verb, noun, and so on according to the context of the sentence. Identifying POS tags is a complicated process: tagging cannot be done with a simple word-to-tag lookup, because some words have different (ambiguous) roles depending on the structure of the sentence. Converting the text into a list of tokens is an important step before tagging, since the tagger loops over each word in the list and assigns it a tag. The code below illustrates this:

import nltk

text = "Hello Guru99, You have to build a very good site, and I love visiting your site."
sentence = nltk.sent_tokenize(text)
for sent in sentence:
    print(nltk.pos_tag(nltk.word_tokenize(sent)))

Output:

[('Hello', 'NNP'), ('Guru99', 'NNP'), (',', ','), ('You', 'PRP'), ('have', 'VBP'), ('build', 'VBN'), ('a', 'DT'), ('very', 'RB'), ('good', 'JJ'), ('site', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('love', 'VBP'), ('visiting', 'VBG'), ('your', 'PRP$'), ('site', 'NN'), ('.', '.')]

Code Explanation:

Import nltk (the Natural Language Toolkit, which contains submodules such as sentence tokenizers and word tokenizers).
Define the text whose tags are to be printed.
Split the text into sentences with sent_tokenize.
Loop over the sentences; in each iteration, tokenize the sentence into words and print the tag of each word.

There are two main types of POS taggers:

Rule-Based
Stochastic POS Taggers

1. Rule-Based POS Tagger: For words with ambiguous meaning, a rule-based approach uses contextual information. It analyzes the preceding or following word, along with features of the word itself such as capitalization and punctuation. Words are therefore tagged according to the grammatical rules of a particular language. An example is Brill’s tagger.

2. Stochastic POS Tagger: This method applies approaches based on frequency or probability. In the simplest approach, if a word is mostly tagged with a particular tag in the training set, it is given that tag in the test sentence. This is not always accurate, because it ignores context. Another approach calculates the probability of a tag sequence, so that the tag of a word depends not only on the word itself but also on the previous tag. The final tag is the one with the highest probability for that word in that context.
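
The “most frequent tag” idea can be sketched with NLTK’s UnigramTagger trained on a tiny hand-made corpus. The three training sentences and the choice of NN as the fallback tag are assumptions for illustration; a real tagger would be trained on a large annotated corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-tagged training set: "run" is a verb twice and a noun once
train_sents = [
    [("the", "DT"), ("run", "NN"), ("ended", "VBD")],
    [("they", "PRP"), ("run", "VBP"), ("daily", "RB")],
    [("we", "PRP"), ("run", "VBP"), ("fast", "RB")],
]

# Words never seen in training fall back to the default tag NN
tagger = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))

result = tagger.tag(["the", "run", "ended", "quickly"])
print(result)
# [('the', 'DT'), ('run', 'VBP'), ('ended', 'VBD'), ('quickly', 'NN')]
```

Here “run” is tagged VBP even though it is a noun in this sentence, and the unseen adverb “quickly” receives the fallback NN, which shows why a frequency-only method is not always accurate.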

POS tagging with Hidden Markov Model

Tagging problems can also be modeled using an HMM. It treats the input tokens as the observable sequence, while the tags are the hidden states, and the goal is to determine the hidden state sequence. For example, x = x₁, x₂, …, xₙ is the sequence of tokens, while y = y₁, y₂, …, yₙ is the hidden tag sequence.

How Hidden Markov Model (HMM) Works?

An HMM uses the joint distribution P(x, y), where x is the input (token) sequence and y is the tag sequence.

The tag sequence for x is the one that maximizes this joint probability: argmax over y₁, …, yₙ of P(x₁, x₂, …, xₙ, y₁, y₂, …, yₙ). Tagging categorizes the words of a text, but statistics over those tags are equally vital, which is why tag counts such as those covered under COUNTING POS TAGS are used for statistical study.
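
The argmax over tag sequences is usually computed with the Viterbi algorithm rather than by trying every sequence. Below is a minimal pure-Python sketch under strong assumptions: a two-tag model whose start, transition and emission probabilities are invented toy numbers, not estimates from any corpus. The function name viterbi and all parameter names are hypothetical.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most probable state (tag) sequence for the tokens."""
    # best[i][s]: probability of the best path that ends in state s at position i
    best = [{s: start_p[s] * emit_p[s].get(tokens[0], 0.0) for s in states}]
    back = [{}]  # back[i][s]: previous state on that best path
    for i in range(1, len(tokens)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[i][s] = score * emit_p[s].get(tokens[i], 0.0)
            back[i][s] = prev
    # Start from the best final state and follow the back-pointers
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy model (all numbers are made up for illustration)
states = ["NN", "VB"]
start_p = {"NN": 0.6, "VB": 0.4}
trans_p = {"NN": {"NN": 0.3, "VB": 0.7}, "VB": {"NN": 0.6, "VB": 0.4}}
emit_p = {"NN": {"fish": 0.7, "swim": 0.3}, "VB": {"fish": 0.2, "swim": 0.8}}

tags = viterbi(["fish", "swim"], states, start_p, trans_p, emit_p)
print(tags)  # ['NN', 'VB']
```

With these numbers the path NN then VB has probability 0.6 × 0.7 × 0.7 × 0.8 = 0.2352, higher than any other two-tag path, so it is the argmax. In practice the probabilities are estimated from a tagged corpus and log-probabilities are used to avoid underflow on long sentences.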

Summary

POS tagging in NLTK is the process of labeling each word in a text with its part of speech, based on its definition and context.
Some NLTK POS tag examples are: CC, CD, EX, JJ, MD, NNP, PDT, PRP$, TO, etc.
A POS tagger assigns grammatical information to each word of a sentence.
Chunking in NLP is the process of taking small pieces of information and grouping them into larger units.
There are no predefined chunking rules; you combine tag patterns according to your needs.
Chunking is used for entity detection. An entity is the part of a sentence from which the machine gets the value for an intention.
Chunking is used to group different tokens into the same chunk.
