Language Model - N-gram
Words, sentences, paragraphs, documents and corpora are all sparse: they are made up of characters and have countless variations. Let’s simplify the question: how can we learn the meaning of a piece of text from a given corpus? Language models are devised to tackle this challenge. In these notes, the n-gram model, its smoothing variants, applications and evaluation are summarized.
Deriving n-gram language model
Given a word sequence $w_1, w_2,\ldots,w_m$, we decompose the joint probability into a product of conditional probabilities via the chain rule:
\[P(w_1, w_2,\ldots,w_m)= P(w_1)P(w_2|w_1)P(w_3|w_1,w_2)\ldots P(w_m|w_1,w_2\ldots w_{m-1})\]We then make a simplifying assumption (the Markov assumption):
\[P(w_i|w_1\ldots w_{i-1})\approx P(w_i|w_{i-n+1}\ldots w_{i-1})\]This is the n-gram model, where $w_i$ is conditioned only on the previous $n-1$ words. Hence, for small $n$:
\[P(w_1,w_2,\ldots w_m) = P(w_1)P(w_2)\ldots P(w_m) , \quad n=1\] \[P(w_1,w_2,\ldots w_m) = P(w_1)P(w_2|w_1)P(w_3|w_2)\ldots P(w_m|w_{m-1}) , \quad n=2\]The probabilities are estimated from counts:
\[P(w_i) = \frac{C(w_i)}{M}\] \[P(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i)}{C(w_{i-1})}\] \[P(w_i|w_{i-2}w_{i-1}) = \frac{C(w_{i-2}w_{i-1}w_i)}{C(w_{i-2}w_{i-1})}\]where $M$ is the total number of word tokens in the corpus. Notice: when counting n-gram occurrences, shift the window by only one word at a time, e.g. for the sequence no, no, no, no, the count of (no, no) should be 3, rather than 2.
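As a minimal sketch (my own illustration, not from the lecture), counting and MLE estimation for a bigram model might look like this in Python; the sliding window makes the (no, no) count come out as 3:

```python
from collections import Counter

def ngram_counts(tokens):
    """Count unigrams and bigrams with a sliding window that shifts by one word."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_mle(unigrams, bigrams, w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = C(w_prev, w) / C(w_prev)."""
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = ["no", "no", "no", "no"]
unigrams, bigrams = ngram_counts(tokens)
print(bigrams[("no", "no")])                      # 3, not 2: the window shifts by one
print(bigram_mle(unigrams, bigrams, "no", "no"))  # 3 / 4 = 0.75
```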
- Problems and Challenges:
- Sentences are long: a trigram model fails when the relevant words are far apart, e.g. The lecture that took place last week was on preprocessing.
- Probabilities are often very small -> use log probabilities to avoid numerical underflow (0.00004 -> 0); see the sketch after this list.
- No probability for unseen words -> represent them with a special symbol (e.g. <UNK>).
- Seen words appear in new contexts -> we need to smooth the LM.
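A toy sketch (my own) of scoring a sentence in log space with a bigram model, mapping unknown words to <UNK>; `bigram_prob` is assumed to be any smoothed probability function that never returns 0:

```python
import math

def sentence_logprob(tokens, vocab, bigram_prob):
    """Sum log P(w_i | w_{i-1}) instead of multiplying tiny probabilities."""
    tokens = [t if t in vocab else "<UNK>" for t in tokens]
    return sum(math.log(bigram_prob(prev, cur))
               for prev, cur in zip(tokens, tokens[1:]))
```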
Smoothing to deal with Sparsity
Basic idea: give events we have never seen some probability mass.
Laplacian (Add-one) Smoothing
Pretend we’ve seen each n-gram once more than we actually did.
\[P(w_i) = \frac{C(w_i)+1}{M+|V|}\] \[P(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i) + 1}{C(w_{i-1})+|V|}\]where $|V|$ is the vocabulary size and $M$ is the total number of tokens.
Add-k (Lidstone) Smoothing
Adding one is often too much, so we add a fraction $k$ instead.
\[P(w_i|w_{i-1}) = \frac{C(w_{i-1}w_i) + k}{C(w_{i-1})+k|V|}\]where $|V|$ is the vocabulary size and $0\lt k \lt 1$.
We still have to choose $k$, and its value has a large impact on the result since $\lvert V\rvert$ is usually large.
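A rough sketch (my own, reusing the counters from the earlier example) of add-k smoothing for a bigram model; setting k = 1 recovers add-one (Laplacian) smoothing:

```python
def addk_bigram_prob(unigrams, bigrams, vocab_size, k=0.05):
    """Return P(w | w_prev) with add-k smoothing over a vocabulary of size vocab_size."""
    def prob(w_prev, w):
        return (bigrams[(w_prev, w)] + k) / (unigrams[w_prev] + k * vocab_size)
    return prob
```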
Absolute Discounting
Borrow some probability mass from the observed n-gram counts and distribute it equally among the unseen n-grams.
Given a discount parameter $d$, every unseen n-gram receives the same probability:
\[P(\text{unseen}) = \frac{d\times\lvert\{w:C(w_{i-1},w)>0\}\rvert}{C(w_{i-1})\times\lvert\{w:C(w_{i-1},w)=0\}\rvert}\] \[P(\text{seen}) = \frac{C(w_{i-1},w_i)-d}{C(w_{i-1})}\]Refer to page 23 @ L3.
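A toy sketch of this redistribution for bigrams (my own; it assumes the context $w_{i-1}$ was observed and that at least one continuation is unseen):

```python
def absolute_discount_prob(unigrams, bigrams, vocab, d=0.1):
    """P(w | w_prev): discount seen bigrams by d, share the freed mass equally."""
    def prob(w_prev, w):
        seen = {v for v in vocab if bigrams[(w_prev, v)] > 0}
        if w in seen:
            return (bigrams[(w_prev, w)] - d) / unigrams[w_prev]
        # total discounted mass d*|seen| / C(w_prev), split equally over unseen types
        return (d * len(seen)) / (unigrams[w_prev] * (len(vocab) - len(seen)))
    return prob
```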
Katz Backoff
Improves on absolute discounting, where the probability mass is distributed equally among unseen n-grams. Katz backoff distributes that probability mass according to a lower-order n-gram model, e.g. for an unseen trigram, redistribute its share according to the bigram or unigram distribution.
\[P(\text{seen}) = \frac{C(w_{i-1},w_i)-d}{C(w_{i-1})}\]For unseen n-grams:
\[P(w_i|w_{i-1}) = \alpha(w_{i-1})\times\frac{P(w_i)}{\sum_{w_j:C(w_{i-1},w_j)=0}P(w_j)}\]In the numerator, $P(w_i)$ is the unigram probability of $w_i$. $\alpha(w_{i-1})$ is the total discounted probability mass for the context $w_{i-1}$: $\alpha(w_{i-1}) = \frac{d \times \lvert\{w:C(w_{i-1},w)>0\}\rvert}{C(w_{i-1})}$, e.g. $0.1\times 5/20$. The denominator, the sum of $P(w_j)$, is the total unigram probability of all words that do not appear after $w_{i-1}$.
Backoff is a different smoothing strategy, where we incorporate lower-order n-gram models (in particular, for unseen contexts). For example, if we have never seen some trigram from our sentence, we can instead consider the bigram probability (at some penalty, to keep the probabilities of all events, given some context, summing to 1). If we haven’t seen the bigram, we consider the unigram probability. If we’ve never seen the unigram (the token doesn’t appear in the corpus at all), then we need a so-called “0-gram” probability, which is a default for unseen tokens.
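A simplified bigram-level sketch of Katz backoff (my own; a full implementation also discounts at the unigram level and handles higher orders recursively):

```python
def katz_bigram_prob(unigrams, bigrams, vocab, d=0.1):
    """P(w | w_prev): discounted MLE if the bigram was seen, else back off to unigrams."""
    total_tokens = sum(unigrams.values())
    def p_unigram(w):
        return unigrams[w] / total_tokens
    def prob(w_prev, w):
        if bigrams[(w_prev, w)] > 0:
            return (bigrams[(w_prev, w)] - d) / unigrams[w_prev]
        seen = {v for v in vocab if bigrams[(w_prev, v)] > 0}
        alpha = d * len(seen) / unigrams[w_prev]      # total discounted mass for w_prev
        unseen_mass = sum(p_unigram(v) for v in vocab if v not in seen)
        return alpha * p_unigram(w) / unseen_mass
    return prob
```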
Problem: I can’t see without my reading ___
$C(reading, glasses) = C(reading, Francisco) = 0$
$C(Francisco) \gt C(glasses)$
Katz backoff will give a higher probability to Francisco, even though Francisco almost always follows San.
Kneser-Ney Smoothing
Redistribute the probability mass based on the number of different contexts the word $w$ has appeared in. This is called the continuation probability.
\[P(\text{seen}) = \frac{C(w_{i-1},w_i)-d}{C(w_{i-1})}\]For unseen n-grams:
\[P(w_i|w_{i-1}) = \alpha(w_{i-1})\times P_{cont}(w_i)\] \[P_{cont}(w_i)=\frac{|\{w_{i-1}:C(w_{i-1},w_i)>0\}|}{\sum_{w_{i}}|\{w_{i-1}:C(w_{i-1},w_i)>0\}|}\]High continuation count for glasses (men’s glasses, black glasses, buy glasses, prescription glasses, etc.); low continuation count for Francisco (San Francisco).
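A toy sketch (my own) of the continuation probability on its own, given a Counter of bigram counts:

```python
def continuation_prob(bigrams, w):
    """P_cont(w): fraction of distinct bigram types whose second word is w."""
    contexts_of_w = {prev for (prev, cur), c in bigrams.items() if cur == w and c > 0}
    total_bigram_types = sum(1 for c in bigrams.values() if c > 0)
    return len(contexts_of_w) / total_bigram_types
```

Even if Francisco is a frequent token, it follows very few distinct words, so its continuation probability stays low, while glasses follows many distinct words.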
Interpolation
- A better way to combine different order n-gram models
- Weighted sum of probabilities across progressively shorter contexts
- Interpolated trigram model:
\[P_{interp}(w_i\mid w_{i-2},w_{i-1}) = \lambda_3\, P(w_i\mid w_{i-2},w_{i-1}) + \lambda_2\, P(w_i\mid w_{i-1}) + \lambda_1\, P(w_i)\]where the $\lambda_n$ are learnt from held-out data and must sum to 1, i.e. $\sum_{n=1}^{n_{max}}\lambda_n=1$.
Interpolation is a similar idea to backoff, but instead of only “falling back” to lower-order n-gram models for unseen events, we consider every probability as a linear combination of all of the relevant n-gram models, where the weights are once more chosen to ensure that the probabilities of all events, given some context, sum to 1.
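A minimal sketch (my own, with made-up weights) of an interpolated bigram/unigram model; in practice the $\lambda$ values would be tuned on held-out data:

```python
def interpolated_prob(unigrams, bigrams, lambdas=(0.3, 0.7)):
    """P(w | w_prev) = lambda1 * P_unigram(w) + lambda2 * P_bigram(w | w_prev)."""
    l1, l2 = lambdas                      # must sum to 1
    total_tokens = sum(unigrams.values())
    def prob(w_prev, w):
        p_uni = unigrams[w] / total_tokens
        p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
        return l1 * p_uni + l2 * p_bi
    return prob
```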
Interpolated Kneser-Ney Smoothing
Interpolation instead of back-off
\[P(w_i\mid w_{i-1}) = \frac{C(w_{i-1},w_i)-d}{C(w_{i-1})}+\beta(w_{i-1})P_{cont}(w_i)\]where $\beta(w_{i-1})$ is a normalising constant so that $P(w_i \mid w_{i-1})$ sums to 1.
- Commonly used Kneser-Ney language models use 5-grams as max order
- Uses a different discount value for each n-gram order.
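A bigram-level sketch of interpolated Kneser-Ney (my own simplification; real implementations work recursively over higher orders with per-order discounts):

```python
def interpolated_kn_prob(unigrams, bigrams, d=0.75):
    """P(w | w_prev) = max(C(w_prev,w) - d, 0)/C(w_prev) + beta(w_prev) * P_cont(w)."""
    bigram_types = [pair for pair, c in bigrams.items() if c > 0]
    total_types = len(bigram_types)
    def p_cont(w):
        return sum(1 for (_, cur) in bigram_types if cur == w) / total_types
    def prob(w_prev, w):
        discounted = max(bigrams[(w_prev, w)] - d, 0) / unigrams[w_prev]
        seen_types = sum(1 for (prev, _) in bigram_types if prev == w_prev)
        beta = d * seen_types / unigrams[w_prev]      # leftover (discounted) mass
        return discounted + beta * p_cont(w)
    return prob
```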
Generating Language
Initialise the sentence with the start symbol: Sentence = <s>. Then repeatedly choose the next word (a sketch of greedy and sampling decoding follows the list below):
- argmax greedy search:
- find the $w$ that maximizes $P(w \mid \langle s \rangle)$, append it, and repeat with the updated context
- Beam search decoding:
- keep track of the top-N highest probability hypotheses (partial sentences) each turn
- produces sentences with near-optimal sentence probability
- (takes the overall sentence probability into account)
- Random sampling: sample the next word from the distribution according to the probabilities we’ve obtained
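A minimal sketch (my own) of greedy and sampling decoding from a bigram model; beam search would instead keep the top-N partial sentences at each step:

```python
import random

def generate(bigram_prob, vocab, mode="greedy", max_len=20):
    """Generate a sentence starting from <s> until </s> or max_len words."""
    sentence = ["<s>"]
    while len(sentence) < max_len and sentence[-1] != "</s>":
        probs = [bigram_prob(sentence[-1], w) for w in vocab]
        if mode == "greedy":
            next_word = vocab[probs.index(max(probs))]            # argmax P(w | prev)
        else:
            next_word = random.choices(vocab, weights=probs)[0]   # sample w ~ P(w | prev)
        sentence.append(next_word)
    return sentence
```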
Evaluating Language Models
- Extrinsic
- spelling correction, machine translation
- Intrinsic
- Perplexity on held-out test set
In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample.
Perplexity is the inverse probability of the entire test set $w_1,w_2\ldots w_m$ (including </s>), normalised by the number of words:
\[PP(w_1,w_2,\ldots,w_m) = \sqrt[m]{\frac{1}{P(w_1,w_2,\ldots,w_m)}}\]The lower the $PP$ is, the better the model.
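A sketch of computing perplexity in log space (my own, using a bigram model and the <UNK> mapping from earlier):

```python
import math

def perplexity(test_tokens, vocab, bigram_prob):
    """PP = exp(-(1/m) * sum_i log P(w_i | w_{i-1})) over the test set."""
    tokens = [t if t in vocab else "<UNK>" for t in test_tokens]
    logp = sum(math.log(bigram_prob(prev, cur))
               for prev, cur in zip(tokens, tokens[1:]))
    m = len(tokens) - 1                   # number of predicted words
    return math.exp(-logp / m)
```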