Part of Speech Tagging - A Sequence Structure Learning Method

Chanming
Apr 12, 2020

Part of Speech (POS) tagging is an approach to tackling the challenge of ambiguity in text. Tag sets, taggers, and the Hidden Markov Model are summarized.

What is POS

POS categories are also called word classes, morphological classes, or syntactic categories. They tell the difference between nouns, verbs, adjectives, etc.

Examples

  1. Authorship Attribution
    ‣ “The lawyer convinced the jury.” → Sam
    ‣ “Ruby travelled around Australia.” → Sam
    ‣ “The hospital was cleaned by the janitor.” → Max
    ‣ “Lunch was served at 12pm.” → Max
    ‣ “The bookstore was opened by the manager.” → ?
    We can distinguish the authors using sentence structure, e.g. by examining their use of the passive voice.

  2. Information Extraction
    Given this:
    ‣ “Brasilia, the Brazilian capital, was founded in 1960.”
    Obtain this:
    ‣ capital(Brazil, Brasilia)
    ‣ founded(Brasilia, 1960)
    So we need to know which words are nouns, adjectives, verbs, and numbers.

POS Open/Closed Classes

Open vs. closed: how readily do POS categories take on new words? Open classes:

  • Nouns
    • Proper(Australia) vs common (wombat)
    • Mass (rice) vs Count (bowls)
  • Verbs
  • Adj
  • Adv

Closed classes:

  • Prepositions (in, on, with, for…)
  • Particles (brushed himself off)
  • Determiners
    • articles (a, an, the)
    • demonstratives (this, that, these, those)
    • quantifiers (each, every, some, two)
  • Pronouns
    • Person (I, me, she)
    • Possessive (my, our)
    • Interrogative or wh-pronouns (who, what, where, …)
  • Conjunctions (and, or, but, if, although …)
  • Modal verbs (can, could, may, might, must)
  • and more (negatives, politeness markers, etc.)

Ambiguity

Many word types belong to multiple classes

“Time flies like an arrow” → noun, verb, preposition, determiner, noun
“Fruit flies like a banana” → noun, noun, verb, determiner, noun

Tag sets

Major Penn Treebank Tags

| abbr | meaning     | abbr | meaning                  |
|------|-------------|------|--------------------------|
| NN   | noun        | VB   | verb                     |
| JJ   | adjective   | RB   | adverb                   |
| DT   | determiner  | CD   | cardinal number          |
| IN   | preposition | PRP  | personal pronoun         |
| MD   | modal       | CC   | coordinating conjunction |
| RP   | particle    | WH   | wh-pronoun               |
| TO   | to          |      |                          |

Penn Treebank Derived Tags

  • NN:
    • NNS (plural, wombats)
    • NNP (proper, Australia)
    • NNPS (proper plural, Australians)
  • VB:
    • VB (infinitive, eat)
    • VBP (1st/2nd person present, eat)
    • VBZ (3rd person singular, eats)
    • VBD (past tense, ate)
    • VBG (gerund, eating)
    • VBN (past participle, eaten)
  • JJ:
    • JJR (comparative, nicer)
    • JJS (superlative, nicest)
  • RB:
    • RBR (comparative, faster)
    • RBS (superlative, fastest)
  • PRP:
    • PRP$ (possessive, my)
  • WH:
    • WH$ (possessive, whose)
    • WDT (wh-determiner, which)
    • WRB (wh-adverb, where)

Taggers

Rule-based taggers

  • Typically start with a list of possible tags for each word
    • From a lexical resource or a corpus
  • Often include other lexical information, e.g. verb sub-categorization (its arguments)
  • Apply rules to narrow down to a single tag
    • E.g. if DT comes before a word, then eliminate VB as a candidate (see the sketch after this list)
    • Relies on some unambiguous contexts
  • Large systems have thousands of constraints
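A minimal sketch (not from the lecture) of how one such constraint might be applied, assuming each word comes with a set of candidate tags; the rule only fires when the previous word is unambiguously a determiner:

```python
# Hypothetical rule: if the previous word is unambiguously DT, eliminate VB as a candidate.
def apply_dt_rule(candidates):
    """candidates: one set of possible tags per word, in sentence order."""
    for i in range(1, len(candidates)):
        if candidates[i - 1] == {"DT"} and len(candidates[i]) > 1:
            candidates[i].discard("VB")
    return candidates

# "the duck": "duck" could be NN or VB; the determiner context removes VB.
print(apply_dt_rule([{"DT"}, {"NN", "VB"}]))  # [{'DT'}, {'NN'}]
```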

Statistical taggers

Unigram Tagger

  • Assigns the most common tag to each word type
  • Requires a corpus of tagged words
  • The “model” is just a look-up table
  • But actually quite good: ~90% accuracy
  • Correctly resolves about 75% of ambiguity
  • Often considered the baseline for more complex approaches
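A minimal sketch of such a look-up-table tagger, assuming the tagged corpus is a list of (word, tag) pairs; the toy corpus and the NN fallback for unseen words are illustrative assumptions:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_corpus):
    """Look-up table: map each word type to its most common tag in the corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(words, table, default="NN"):
    # Unseen words fall back to a default tag (NN is just an assumed choice).
    return [(w, table.get(w, default)) for w in words]

corpus = [("time", "NN"), ("flies", "VBZ"), ("flies", "NNS"), ("flies", "VBZ"),
          ("like", "IN"), ("an", "DT"), ("arrow", "NN")]
table = train_unigram_tagger(corpus)
print(tag(["fruit", "flies", "like", "an", "arrow"], table))
```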

Classifier-based taggers

Use a standard discriminative classifier (e.g. logistic regression, neural network), with features such as the following (a sketch follows this list):

  • Target word
  • Lexical context around the word
  • Already classified tags in sentence
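A minimal sketch of a feature function such a tagger might use; the feature names, context window, and suffix feature are illustrative assumptions, not a specific library's API:

```python
def extract_features(words, i, prev_tags):
    """Features for predicting the tag of words[i], given already-predicted tags."""
    return {
        "word": words[i].lower(),                                     # target word
        "prev_word": words[i - 1] if i > 0 else "<s>",                # lexical context (left)
        "next_word": words[i + 1] if i < len(words) - 1 else "</s>",  # lexical context (right)
        "prev_tag": prev_tags[-1] if prev_tags else "<s>",            # already classified tag
        "suffix3": words[i][-3:],                                     # crude morphological cue
    }

print(extract_features(["Ruby", "travelled", "around", "Australia"], 1, ["NNP"]))
```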

Among the best sequential models

  • But can suffer from error propagation: wrong predictions from previous steps affect the next ones

Hidden Markov Model (HMM) taggers

Properties:

  • A basic sequential (or structured) model
  • Like sequential classifiers, uses both the previous tag (the words already tagged) and lexical evidence (the possible tags for the target word)
  • Unlike classifiers, treats the previous-tag evidence and the lexical evidence as independent of each other
  • Less sparsity
  • Fast algorithms for sequential prediction, i.e. finding the best tagging of the entire word sequence

Hidden Markov Model (HMM)

HMM Concepts and Training

\[\hat t = \mathop{\arg\!\max}\limits_t P(w \mid t)P(t)\]

where

\[P(w \mid t) = P(w_1 \mid t_1)P(w_2\mid t_2)\cdots P(w_i \mid t_i)\]

and

\[P(t)=P(t_1 \mid t_0)P(t_2 \mid t_1)\cdots P(t_i \mid t_{i-1})\]

Markov: assumes the sequence follows a Markov chain, i.e. the probability of a tag depends only on the previous tag.

Hidden: the tags are not observed; our goal is to find the best tag sequence.

Parameters (a computational sketch follows this list):

  • emission probabilities: $P(w_i \mid t_i)$
  • transition probabilities : $P(t_i \mid t_{i-1})$
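A minimal sketch of how the joint probability $P(w \mid t)P(t)$ is computed from these two parameter tables; the probability values below are made up for illustration, and the end transition to </s> is omitted for brevity:

```python
# Joint probability of a (words, tags) pair under the HMM:
# a product of transition probabilities P(t_i | t_{i-1}) (with t_0 = <s>)
# and emission probabilities P(w_i | t_i).
def joint_prob(words, tags, emission, transition):
    prob, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        prob *= transition[(prev, t)] * emission[(t, w)]
        prev = t
    return prob

# Made-up example parameters (not estimated from any real corpus):
emission = {("NN", "time"): 0.01, ("VBZ", "flies"): 0.02}
transition = {("<s>", "NN"): 0.3, ("NN", "VBZ"): 0.2}
print(joint_prob(["time", "flies"], ["NN", "VBZ"], emission, transition))  # 1.2e-05
```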

Training the model means figuring out these parameters using maximum likelihood estimation (MLE), i.e. by counting over a tagged corpus:

\[P(\textbf{like} \mid \text{VB}) = \frac{count(\text{VB},\textbf{like})}{count(\text{VB})}\] \[P(\text{NN} \mid \text{DT}) = \frac{count(\text{DT} ,\text{NN})}{count(\text{DT})}\]
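A minimal sketch of this counting, assuming the corpus is a list of sentences, each a list of (word, tag) pairs (the toy corpus below is an assumption):

```python
from collections import Counter

def train_hmm(tagged_sentences):
    """Estimate emission P(word | tag) and transition P(tag | prev_tag) by MLE counting."""
    emit, trans, tag_count = Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        prev = "<s>"
        tag_count["<s>"] += 1
        for word, tag in sent:
            emit[(tag, word)] += 1
            trans[(prev, tag)] += 1
            tag_count[tag] += 1
            prev = tag
        trans[(prev, "</s>")] += 1  # end-of-sentence transition (see problem 2 below)
    emission = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}
    transition = {(p, t): c / tag_count[p] for (p, t), c in trans.items()}
    return emission, transition

sents = [[("time", "NN"), ("flies", "VBZ")], [("fruit", "NN"), ("flies", "NNS")]]
emission, transition = train_hmm(sents)
print(transition[("<s>", "NN")])  # 1.0
print(emission[("NN", "time")])   # 0.5
```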

Some Problems:

  1. For the first tag, when $i=1$ and $i-1=0$, we define a start symbol <s>: \(P(\text{NN} \mid \lt \text s \gt) = \frac{count(\lt \text s \gt, \text{NN})}{count(\lt \text{s} \gt)}\)

  2. For the last tag, we use an end symbol </s>, e.g. $P(\lt / \text s \gt \mid \text{NN})$.
  3. For unseen word/tag combinations, we use smoothing techniques, as in NB / n-gram LMs:
    • in NB, we use a very small probability (e.g. $\epsilon$) for unseen events
    • in n-gram LMs, we use add-1, add-k, etc. (see the formula below)
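For example, an add-$k$ smoothed emission probability takes the standard form (shown here for illustration; $|V|$ is the vocabulary size):

\[P_{\text{add-}k}(w_i \mid t_i) = \frac{count(t_i, w_i) + k}{count(t_i) + k|V|}\]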

HMM - Prediction (decoding)

After we obtain the emission and transition matrices, we have a complete table of parameters. What we need to do is find the tag sequence that maximizes the joint probability (note that we are looking for a global optimum, not a local optimum found by a greedy method).

A greedy method, which takes the argmax from left to right, tag by tag, is still prone to error accumulation (propagation).

We could instead consider all possible tag combinations and take the maximum, but this leads to an exponential number of sequences. The Viterbi algorithm deals with that challenge.

Viterbi Algorithm

Viterbi is a dynamic programming algorithm for finding the global optimum. We find the best sequence as follows (a code sketch follows this walkthrough):

  1. calculate the probabilities of all possible tags for the first word (first column), say:

    • $P(\textbf{Janet} \mid \text{NNP}) * P(\text{NNP} \mid \lt \text S \gt)$,
    • $P(\textbf{Janet} \mid \text{MD}) * P(\text{MD} \mid \lt \text S \gt)$,
    • $P(\textbf{Janet} \mid \text{VB}) * P(\text{VB} \mid \lt \text S \gt)$,
    • $P(\textbf{Janet} \mid \text{DT}) * P(\text{DT} \mid \lt \text S \gt)$
  2. calculate the probabilities of all possible tags for the second word (second column); in each cell, take the maximum over all possible previous tags: \(s(\text{NNP} \mid \textbf{will}) = \mathop{\max}\limits_{tag({\textbf{Janet}})}[P(\textbf{will} \mid \text{NNP}) * P(\text{NNP} \mid tag({\textbf{Janet}})) * s(tag(\textbf{Janet}) \mid \textbf{Janet})]\)

    • where $tag(\textbf{Janet})$ ranges over all possible tags of the previous word, and $s(tag(\textbf{Janet}) \mid \textbf{Janet})$ is the value we calculated in the first step. This gives the globally optimal best $s()$ for the word $\textbf{will}$ with tag $\text{NNP}$.

    • $s(\text{MD} \mid \textbf{will}) = \mathop{\max}\limits_{tag(\textbf{Janet})}[P(\textbf{will} \mid \text{MD}) * P(\text{MD} \mid tag(\textbf{Janet})) * s(tag(\textbf{Janet}) \mid \textbf{Janet})]$, where $tag(\textbf{Janet})$ ranges over all possible tags of the previous word, and $s(tag(\textbf{Janet}) \mid \textbf{Janet})$ is the value we calculated in the first step. This gives the globally optimal best $s()$ for the word $\textbf{will}$ with tag $\text{MD}$.
    • $s(\text{DT} \mid \textbf{will}) = \mathop{\max}\limits_{tag(\textbf{Janet})}[P(\textbf{will} \mid \text{DT}) * P(\text{DT} \mid tag(\textbf{Janet})) * s(tag(\textbf{Janet}) \mid \textbf{Janet})]$, where $tag(\textbf{Janet})$ ranges over all possible tags of the previous word, and $s(tag(\textbf{Janet}) \mid \textbf{Janet})$ is the value we calculated in the first step. This gives the globally optimal best $s()$ for the word $\textbf{will}$ with tag $\text{DT}$.

    The key point, and the reason a global optimum can differ from a greedy local optimum, is the factor $P(\text{MD} \mid tag(\textbf{Janet}))$, which can take different values such as $P(\text{MD} \mid \text{NNP})$, $P(\text{MD} \mid \text{MD})$, …, $P(\text{MD} \mid \text{DT})$. This is where the difference comes from.
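A minimal Viterbi sketch following the steps above, assuming emission and transition tables like those in the training sketch earlier (missing entries are treated as probability 0; in practice log probabilities would be used to avoid underflow):

```python
def viterbi(words, tags, emission, transition):
    """Return the best tag sequence and its probability (the global optimum)."""
    e = lambda t, w: emission.get((t, w), 0.0)
    tr = lambda prev, t: transition.get((prev, t), 0.0)
    # First column: s(tag | word_1) = P(word_1 | tag) * P(tag | <s>)
    score = {t: e(t, words[0]) * tr("<s>", t) for t in tags}
    backpointers = []
    # Each later column: take the max over all previous tags, remembering which tag won.
    for w in words[1:]:
        new_score, column = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: score[p] * tr(p, t))
            new_score[t] = score[best_prev] * tr(best_prev, t) * e(t, w)
            column[t] = best_prev
        score = new_score
        backpointers.append(column)
    # Trace back from the best tag of the last word.
    best_last = max(tags, key=lambda t: score[t])
    path = [best_last]
    for column in reversed(backpointers):
        path.append(column[path[-1]])
    return list(reversed(path)), score[best_last]
```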

Reference: Lecture 6, HMM.

Other challenges:

  • Trigram HMMs give good results, but they need to deal with sparsity by incorporating interpolation and back-off (see the formula below).
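For example, the trigram transition probability can be linearly interpolated with lower-order estimates (a standard formulation, shown for illustration; $\hat P$ denotes an MLE estimate and the $\lambda$s sum to 1):

\[P(t_i \mid t_{i-1}, t_{i-2}) = \lambda_3 \hat P(t_i \mid t_{i-1}, t_{i-2}) + \lambda_2 \hat P(t_i \mid t_{i-1}) + \lambda_1 \hat P(t_i)\]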