Tagging Problems and Hidden Markov Model

Tagging Sentences

Tagging Sentence in a broader sense refers to the addition of labels of the verb, noun,etc.by the context of the sentence. Identification of POS tags is a complicated process. Thus generic tagging of POS is manually not possible as some words may have different (ambiguous) meanings according to the structure of the sentence. Conversion of text in the form of list is an important step before tagging as each word in the list is looped and counted for a particular tag. Please see the below code to understand it better

import nltk
text = "Hello Guru99, You have to build a very good site, and I love visiting your   site."
sentence = nltk.sent_tokenize(text)
for sent in sentence:


[('Hello', 'NNP'), ('Guru99', 'NNP'), (',', ','), ('You', 'PRP'), ('have', 'VBP'), ('build', 'VBN'), ('a', 'DT'), ('very', 'RB'), ('good', 'JJ'), ('site', 'NN'), ('and', 'CC'), ('I', 'PRP'), ('love', 'VBP'), ('visiting', 'VBG'), ('your', 'PRP$'), ('site', 'NN'), ('.', '.')]

Code Explanation

  1. Code to import nltk (Natural language toolkit which contains submodules such as sentence tokenize and word tokenize.)
  2. Text whose tags are to be printed.
  3. Sentence Tokenization
  4. For loop is implemented where words are tokenized from sentence and tag of each word is printed as output.

In Corpus there are two types of POS taggers:

  • Rule-Based
  • Stochastic POS Taggers

1.Rule-Based POS Tagger: For the words having ambiguous meaning, rule-based approach on the basis of contextual information is applied. It is done so by checking or analyzing the meaning of the preceding or the following word. Information is analyzed from the surrounding of the word or within itself. Therefore words are tagged by the grammatical rules of a particular language such as capitalization and punctuation. e.g., Brill’s tagger.

2.Stochastic POS Tagger: Different approaches such as frequency or probability are applied under this method. If a word is mostly tagged with a particular tag in training set then in the test sentence it is given that particular tag. The word tag is dependent not only on its own tag but also on the previous tag. This method is not always accurate. Another way is to calculate the probability of occurrence of a specific tag in a sentence. Thus the final tag is calculated by checking the highest probability of a word with a particular tag.

Hidden Markov Model:

Tagging Problems can also be modeled using HMM. It treats input tokens to be observable sequence while tags are considered as hidden states and goal is to determine the hidden state sequence. For example x = x1,x2,…………,xn where x is a sequence of tokens while y = y1,y2,y3,y4………ynis the hidden sequence.

How HMM Model Works?

HMM uses join distribution which is P(x, y) where x is the input sequence/ token sequence and y is tag sequence.

Tag Sequence for x will be argmaxy1….ynp(x1,x2,….xn,y1,y2,y3,…..). We have categorized tags from the text, but stats of such tags are vital. So the next part is counting these tags for statistical study.