Applications
NLP tasks and applications.
Machine translation
Challenges
- Not just simple word for word translation
- Structural changes, e.g., syntax and semantics
- Multiple word translations, idioms
- Inflections for gender, case, etc.
- Missing information (e.g., determiners)
Statistical Machine Translation
Dataset :
- Parallel corpora: the same text in multiple languages => train translation model
- Monolingual corpora => train language model
Use probability to combine a language model and a translation model (the noisy channel approach), i.e.
\[\argmax_e P(f\mid e)P(e)\]where $f$ is the source sentence, $e$ is the target sentence, $P(f\mid e)$ is the translation model and $P(e)$ is the language model
How to train the translation model
Alignment is required since words in the two languages do not correspond one-to-one. Since alignments are rarely provided, the idea is to use expectation maximisation (EM) to learn them from the parallel corpus.
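Below is a minimal, illustrative sketch of EM-based word alignment in the style of IBM Model 1 (the toy corpus and uniform initialisation are assumptions, not part of the lecture):

```python
# Minimal IBM Model 1-style EM sketch (illustrative only, not a full SMT pipeline).
from collections import defaultdict

# Toy parallel corpus: (source sentence, target sentence) pairs.
corpus = [
    (["das", "haus"], ["the", "house"]),
    (["das", "buch"], ["the", "book"]),
    (["ein", "buch"], ["a", "book"]),
]

# Initialise t(f|e) uniformly over all co-occurring word pairs.
t = defaultdict(lambda: 1.0)

for _ in range(10):                      # EM iterations
    count = defaultdict(float)           # expected counts c(f, e)
    total = defaultdict(float)           # expected counts c(e)
    for f_sent, e_sent in corpus:
        for f in f_sent:                 # E-step: fractional alignment counts
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                count[(f, e)] += t[(f, e)] / z
                total[e] += t[(f, e)] / z
    for (f, e), c in count.items():      # M-step: re-estimate t(f|e)
        t[(f, e)] = c / total[e]

print(round(t[("haus", "house")], 3))    # approaches 1.0 as EM disambiguates
```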
Neural Machine Translation
- Architecture: use a single neural model to directly translate from source to target. An encoder-decoder model: the first RNN encodes the source sentence and the second RNN decodes the target sentence. Between the two RNNs there is a vector that encodes the whole source sentence.
- Training: requires a parallel corpus; train with next-word prediction, just like a language model. The loss is the sum of the negative log probabilities of each RNN output in the decoder: $L = L_1 + L_2 + \cdots + L_n$
- Testing: we don't have the target sentence (that is what we want to predict)
- Greedy decoding: use $\argmax$ to pick each output token. Doesn't guarantee a globally optimal sequence probability.
- Exhaustive search decoding: to find the optimal $P(y\mid x)$, consider every possible word at each step and compute the probability of all possible sequences. Too computationally expensive.
- Beam search decoding: instead of all possible words, we consider the $K$-best words at each step. $K$ is the beam width, usually from 5 to 10. When $K = 1$ this is greedy decoding; when $K = V$ it is exhaustive search decoding (see the sketch after this list).
- When to stop the decoding? When we generate an </s> token, or when a maximum sentence length set for the decoder is reached.
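A minimal beam search sketch, assuming a hypothetical `next_token_logprobs(prefix)` function that stands in for the decoder's next-word distribution (not an API from the lecture):

```python
import heapq

# Minimal beam search sketch. `next_token_logprobs(prefix)` is a hypothetical
# stand-in for the decoder: it returns {token: log P(token | prefix, source)}.
def beam_search(next_token_logprobs, k=5, max_len=20, eos="</s>"):
    beams = [(0.0, ["<s>"])]                       # (cumulative log prob, prefix)
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, prefix in beams:
            if prefix[-1] == eos:                  # hypothesis already finished
                finished.append((logp, prefix))
                continue
            for tok, lp in next_token_logprobs(prefix).items():
                candidates.append((logp + lp, prefix + [tok]))
        if not candidates:
            break
        beams = heapq.nlargest(k, candidates, key=lambda c: c[0])  # keep K best
    finished.extend(beams)                          # include unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[0])[1]     # highest-scoring hypothesis
```

With `k=1` this reduces to greedy decoding; with `k` equal to the vocabulary size it becomes exhaustive search.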
Attention Mechanism
The encoder vector is of fixed size, so long sentences may lose information. Attention addresses this: when decoding each target word, the decoder takes dot products with the word (hidden state) vectors of the source sentence.
Say we have encoder hidden states $h_i = RNN_1(h_{i-1}, x_i)$
And decoder hidden states $s_t = RNN_2(s_{t-1}, y_t)$
For each step in decoder, attend (just dot product the current decoder state $s_t$ with all the encoder states $h_i$) to each of the hidden states to produce the attention weights: $e_t = [s_t^Th_1, s_t^Th_2,\ldots,s_t^Th_{\lvert x\rvert}]$
Apply softmax to the attention weights $e_t$ to get a valid probability distribution: $\alpha_t = \text{softmax}(e_t)$ to make them sum to 1.
Then we compute weighted sum of the encoder hidden states: $c_t = \sum^{\lvert x\rvert}\limits_{i=1}\alpha_t^i h_i$
Finally, concatenate $c_t$ and $s_t$ to predict the next word.
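A small NumPy sketch of the equations above, using made-up encoder states and a made-up decoder state:

```python
import numpy as np

np.random.seed(0)
H = np.random.randn(6, 64)                   # encoder hidden states h_1..h_|x| (|x| = 6, dim 64)
s_t = np.random.randn(64)                    # current decoder hidden state

e_t = H @ s_t                                # attention scores: e_t[i] = s_t . h_i
alpha_t = np.exp(e_t) / np.exp(e_t).sum()    # softmax -> attention weights, sum to 1
c_t = alpha_t @ H                            # context vector: weighted sum of encoder states

decoder_input = np.concatenate([c_t, s_t])   # concatenated and used to predict the next word
print(alpha_t.sum())                         # 1.0
```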
Summary of Attention
Solves the information bottleneck issue by allowing decoder to have access to the source sentence words directly
Provides some form of interpretability
- Attention weights can be seen as word alignments
Most state-of-the-art MT systems are based on this architecture
Machine Translation Evaluation
BLEU: compute n-gram overlap between “reference” translation and generated translation, typically 1-4 grams.
Brevity Penalty: $\text{BP} = \min(1, \frac{\text{output length}}{\text{reference length}})$, to penalise short outputs
\[\text{BLEU} = \text{BP} \times \exp\left(\frac{1}{N}\sum^N\limits_{n=1}\log p_n\right)\]where $p_n = \frac{\#\text{ correct } n\text{-grams}}{\#\text{ predicted } n\text{-grams}}$
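A minimal sketch of BLEU as defined above (with clipped n-gram counts; the example sentences are made up):

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(pred, ref, N=4):
    # Geometric mean of clipped n-gram precisions, times the brevity penalty.
    log_p = 0.0
    for n in range(1, N + 1):
        pred_counts, ref_counts = Counter(ngrams(pred, n)), Counter(ngrams(ref, n))
        correct = sum(min(c, ref_counts[g]) for g, c in pred_counts.items())
        log_p += log(correct / max(len(pred) - n + 1, 1) or 1e-9)  # avoid log(0)
    bp = min(1.0, len(pred) / len(ref))        # brevity penalty
    return bp * exp(log_p / N)

pred = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()
print(round(bleu(pred, ref, N=2), 3))          # ~0.707
```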
Information Extraction
Main goal: turn text into structured data such as databases, etc.
Help decision makers in applications. e.g. Stock analysis, Medical and biological research, Rumour detection
How to extract?
- Named Entity Recognition (NER): find entities such as Australia, 1956
- Relation Extraction: use context to find the relation between Australia and 1956
Named Entity Recognition
Typical entity tags: PER (persons), ORG (companies, team names), LOC (locations) … refer to page 9 @ L18
NER As Sequence Labelling
Challenges: NE tags can be ambiguous: "Washington" can be a person, a location, or even a political entity. Similar to the POS tagging challenge.
Solution: incorporate context by treating NER as sequence labelling
IO Tagging
All tokens which are not entities get the 'O' tag (outside); entity tokens get tags like 'I-ORG', 'I-PER', …
Cannot tell apart a single entity spanning multiple tokens from multiple adjacent entities of the same type.
IOB Tagging
B-ORG: beginning of an ORG entity
I-ORG: all the subsequent tokens for that entity (if any)
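A small worked example of IOB tags and how entity spans are recovered from them (the sentence and tags are made up):

```python
# Toy IOB-tagged sentence; B- marks the start of an entity, I- its continuation.
tokens = ["Steve", "Jobs", "founded", "Apple", "in", "California"]
tags   = ["B-PER", "I-PER", "O",       "B-ORG", "O",  "B-LOC"]

def extract_entities(tokens, tags):
    entities, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # a new entity starts
            if current:
                entities.append((label, " ".join(current)))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current:   # continue the current entity
            current.append(tok)
        else:                                    # 'O' tag: close any open entity
            if current:
                entities.append((label, " ".join(current)))
            current, label = [], None
    if current:
        entities.append((label, " ".join(current)))
    return entities

print(extract_entities(tokens, tags))
# [('PER', 'Steve Jobs'), ('ORG', 'Apple'), ('LOC', 'California')]
```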
Features
Character and word shape features (ex: “L’Occitane”)
Prefix/suffix:
- L / L’ / L’O / L’Oc / …
- e / ne / ane / tane / …
Word shape: ‣ X’Xxxxxxxx / X’Xx
- XXXX-XX-XX (date!)
POS tags / syntactic chunks: many entities are nouns or noun phrases.
Presence in a gazetteer: lists of entities, such as place names, people's names and surnames, etc.
NER as Classifier
The state-of-the-art approach uses LSTMs with character and word embeddings (Lample et al. 2016)
Relation Extraction
Relation database: all possible relations expressed as frames, e.g. unit(American Airlines, AMR Corp.)
…
If we have access to a fixed relation database: ‣ Rule-based ‣ Supervised ‣ Semi-supervised ‣ Distant supervision
If no restrictions on relations: ‣ Unsupervised ‣ Sometimes referred as “OpenIE”
Evaluation
For NER: F1-measure at the entity level
Relation Extraction with known relation set: F1-measure
Relation Extraction with unknown relations: much harder to evaluate
- Usually need some human evaluation
- Massive datasets used in these settings are impractical to evaluate manually: use a small sample
- Can only obtain (approximate) precision, not recall.
Workshop Questions
What aspects of human language make automatic translation difficult?
What is Information Extraction? What might the “extracted” information look like?
What is Named Entity Recognition and why is it difficult? What might make it more difficult for persons rather than places, and vice versa?
What is the IOB trick, in a sequence labelling context? Why is it important?
What is Relation Extraction? How is it similar to NER, and how is it different?
Why are hand-written patterns generally inadequate for IE, and what other approaches can we take?
Question answering
Question answering ("QA") is the task of automatically determining the answer to a natural language question
Information retrieval based QA
Given a query, search for relevant documents, then find answers within these relevant documents and extract a short answer string
- Process Question
To find the key part of a question
- discard structural parts, stop words, punctuation, etc.
- formulate as TF-IDF query, using unigrams or bigrams
- identify entities and prioritise matches
- Identify Answer types
- To help find the right passage to search, reduce searching scope
- find the right answer string
- Treat as a classification: given question, predict answer type, with key feature of headword
- Retrieval
- Find top n docs matching query (Standard IR)
- Break matched documents into passages (sentences and paragraphs)
- Rank the passages and find the best passage
- Extract Answer
- Feature-based answer extraction: frame it as a classification problem: classify whether a candidate passage (sentence or paragraph) contains an answer (YES/NO)
- Neural answer extraction: use a neural network to extract the answer, a.k.a. the reading comprehension task. Data demanding: is our dataset sufficient?
- EXAMPLE: LSTM-based and BERT-based answer extraction.
e.g. BERT: given the question and a paragraph (or sentence) as input, with BERT as the middle layer, the model produces a start index and an end index of the answer span.
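A hedged sketch of span-based (start/end index) answer extraction using the Hugging Face transformers library; this is an assumed tool choice, not one prescribed by the lecture, and it downloads a pretrained model on first use:

```python
# Extractive QA with a pretrained model via the Hugging Face `transformers` pipeline.
from transformers import pipeline

qa = pipeline("question-answering")   # default extractive QA model

result = qa(
    question="Where is the University of Melbourne located?",
    context="The University of Melbourne is a public university located in Melbourne, Australia.",
)
print(result["answer"], result["start"], result["end"])  # answer span and character offsets
```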
Knowledge-based QA
Builds a semantic representation of the query ‣ According to the semantic representation, query a database of facts to find answers
Many large knowledge databases already exist; we only need to support natural language queries.
Semantic Parsing
Based on questions aligned with their logical forms, or on building compositional logical forms.
QA Evaluation
TREC-QA: Mean Reciprocal Rank for systems returning matching passages or answer strings:
- e.g. the system returns 4 passages for a query and the first correct passage is the third one: MRR = $1/3$
SQuAD: ‣ Exact match of string against gold answer and then calculate F1 score over bag of selected tokens
MCQ reading comprehension: Accuracy
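A small sketch of the Mean Reciprocal Rank computation used for TREC-QA above (the ranks are made up):

```python
# Mean Reciprocal Rank over queries; each entry is the rank of the first
# correct passage returned for that query (None if nothing correct was returned).
first_correct_ranks = [3, 1, None, 2]   # made-up example

mrr = sum(1.0 / r for r in first_correct_ranks if r is not None) / len(first_correct_ranks)
print(round(mrr, 3))   # (1/3 + 1 + 0 + 1/2) / 4 ≈ 0.458
```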
Workshop Questions 1
What is Question Answering?
What is semantic parsing, and why might it be desirable for QA? Why might approaches like NER be more desirable?
What are the main steps for answering a question for a QA system?
Topic modelling
Topic models learn common, overlapping themes in a document collection; an unsupervised method, similar to clustering.
Latent Dirichlet Allocation
Introduces priors to the document-topic and topic-word distributions of Probabilistic LSA; a Bayesian version of Probabilistic LSA.
Fully generative: a trained LDA model can infer topics on unseen documents!
Core idea: assume each document contains a mix of topics
- Input
- A collection of documents
- Bag-of-words
- Good preprocessing practice: ‣ Remove stopwords ‣ Remove low and high frequency word types ‣ Lemmatisation
- Output
- Topics: multinomial distribution over words in each topic, word distribution for each topic.
- Topics in documents: multinomial distribution over topics in each document, i.e. the distribution of topic probability for each doc.
- Sampling Method (Gibbs Sampling); a library-based sketch follows after this list
- Randomly assign topics to all tokens in the documents (the initial state)
- Collect topic-word and document-topic co-occurrence statistics based on the assignments
- Go through every word token in the corpus and sample a new topic for it based on the distributions obtained so far
- Repeat until convergence
- Predict unseen documents:
- randomly assign topics as well
- update distributions based on that assignment, using trained topic-word matrix
- sample topics for word tokens (repeat)
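The sketch below trains LDA with the gensim library, an assumed implementation choice (gensim uses variational inference rather than Gibbs sampling, but the inputs and outputs match those described above):

```python
from gensim import corpora, models

# Toy, already-preprocessed documents (stopwords removed, lemmatised).
docs = [
    ["stock", "market", "trading", "price"],
    ["patient", "doctor", "disease", "treatment"],
    ["market", "price", "trading", "stock"],
    ["disease", "treatment", "patient", "doctor"],
]

dictionary = corpora.Dictionary(docs)                 # word type -> id
bow = [dictionary.doc2bow(doc) for doc in docs]       # bag-of-words input

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)

print(lda.print_topics())                             # topic-word distributions
new_doc = dictionary.doc2bow(["stock", "price"])
print(lda.get_document_topics(new_doc))               # document-topic distribution for an unseen doc
```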
Hyper-parameters: $\beta$, the prior on the topic-word distribution, and $\alpha$, the prior on the document-topic distribution.
Topic Modelling Evaluation
model logprob / perplexity on test documents
\[L = \prod_w\sum_tP(w\mid t)P(t\mid d_w)\] \[\text{ppl}=\exp\left(-\frac{\log L}{m}\right)\]where $m$ is the total number of word tokens in the test documents
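A toy computation of the log-likelihood and perplexity formulas above (all probabilities are made up):

```python
import numpy as np

# 2 topics, vocabulary of 3 word types, test document of m = 4 tokens.
P_w_given_t = np.array([[0.7, 0.2, 0.1],    # topic 0: P(w | t=0)
                        [0.1, 0.3, 0.6]])   # topic 1: P(w | t=1)
P_t_given_d = np.array([0.8, 0.2])          # topic distribution of the test document
doc = [0, 1, 0, 2]                          # word ids of the test document's tokens

log_L = sum(np.log(P_w_given_t[:, w] @ P_t_given_d) for w in doc)   # log of the product over tokens
ppl = np.exp(-log_L / len(doc))
print(round(float(ppl), 3))
```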
Workshop Qs
What is a Topic Model?
What is the Latent Dirichlet Allocation, and what are its strengths?
What are the different approaches to evaluating a topic model?
Document summarization
Extractive summarisation
Summarise by selecting representative sentences from documents
Single document
- Content selection: select what sentences to extract from the document
Goal: Find sentences that are important or salient
Methods:
- Saliency based
- TF-IDF: frequent words => salient; need to remove stop words and function words.
- Log Likelihood Ratio: a word is salient if its probability in the input corpus is very different to a background corpus
- Then calculate the saliency of a sentence: $\text{weight}(s) = \frac{1}{\lvert S\rvert} \sum\limits_{w\in S}\text{weight}(w)$, after removing all stop words!
- Sentence centrality based: measure the distance between sentences, and choose sentences that are closer to other sentences (see the sketch after this list)
- Use tf-idf to represent each sentence
- Use cosine similarity to measure distance
- $\text{centrality}(s) = \frac{1}{\#\text{sentences}}\sum_{s'}\cos_{tfidf} (s,s')$
- RST parsing: Rhetorical Structure Theory, discourse L12.
- Nucleus more important than satellite
- A sentence that functions as a nucleus to more sentences (more supporting clauses) = more salient
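A hedged sketch of tf-idf sentence centrality with scikit-learn (an assumed library choice; self-similarity is excluded from the average, and the sentences are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The economy grew strongly this quarter.",
    "Strong growth was reported for the economy.",
    "The football match ended in a draw.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(sentences)   # tf-idf vector per sentence
sim = cosine_similarity(X)                                           # pairwise cosine similarities

# centrality(s) = average similarity to the other sentences (excluding itself)
centrality = (sim.sum(axis=1) - 1.0) / (len(sentences) - 1)
best = centrality.argmax()
print(sentences[best], centrality)
```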
- Information ordering: decide how to order extracted sentences
- Sentence realisation: cleanup to make sure combined sentences are fluent
Multi Document
- Content Selection
- We can use the same unsupervised content selection methods (tf-idf, log likelihood ratio, centrality) to select salient sentences
- But ignore sentences that are redundant
- Maximum Marginal Relevance (MMR): iteratively select the best sentence to add to the summary (a minimal sketch follows after this list)
- Penalise a candidate sentence if it's similar to already-extracted sentences
- Stop when the desired number of sentences has been added
- Sentence simplification: create multiple simplified versions of sentences before extraction, and use Maximum Marginal Relevance to make sure only non-redundant sentences are selected
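A minimal MMR sketch; the salience scores, similarity matrix, and trade-off weight `lam` are illustrative assumptions:

```python
# Greedily add the sentence with the best trade-off between salience and redundancy.
def mmr_select(salience, sim, k=2, lam=0.5):
    """salience: per-sentence scores; sim: pairwise sentence similarities; k: summary size."""
    selected = []
    candidates = set(range(len(salience)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * salience[i] - (1 - lam) * redundancy   # penalise similarity to the summary so far
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

salience = [0.9, 0.85, 0.4]
sim = [[1.0, 0.95, 0.1],
       [0.95, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
print(mmr_select(salience, sim))   # [0, 2]: the near-duplicate sentence 1 is skipped
```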
- Information Ordering
- Chronological ordering: Order by document dates
- Coherence ordering: ‣ Order in a way that makes adjacent sentences similar ‣ Order based on how entities are organised (centering theory, L12)
- Sentence Realisation:
- Make sure entities are referred coherently ‣ Full name at first mention ‣ Last name at subsequent mentions
- Apply coreference methods to first extract names
- Write rules to clean up
Abstractive summarisation
Summarise the content in your own words; summaries will often be paraphrases of the original content
Difficult: typically needs a neural encoder-decoder model.
Input: Document
Output: summary
Data: News headlines • Document: First sentence of article • Summary: News headline/title • Technically more like a “headline generation task”
Improvements
- Attention mechanism
- Richer word features: POS tags, NER tags, tf-idf
- Hierarchical encoders: ‣ One LSTM for words ‣ Another LSTM for sentences
Summarisation Challenges
Occasionally reproduce statements incorrectly (hallucinate new details!)
Unable to handle out-of-vocab words in document ‣ Generate UNK in summary ‣ E.g. new names in test documents
Solution: allow decoder to copy words directly from input document during generation
Copy Mechanism: • Generate summaries that reproduce details in the document • Can produce out-of-vocab words in the summary by copying them from the document ‣ e.g. smergle = out of vocabulary ‣ p(smergle) = attention probability + generation probability = attention probability (the generation probability of an out-of-vocab word is 0)
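A toy sketch of how the final word distribution could mix generation and copy (attention) probabilities, in the style of pointer-generator models (the exact formulation is an assumption; all numbers are made up):

```python
# Mixing a vocabulary (generation) distribution with attention-based copy probabilities.
vocab = ["the", "report", "says", "<unk>"]
source_tokens = ["smergle", "report"]          # "smergle" is out-of-vocabulary

p_gen = 0.6                                     # probability of generating from the fixed vocab
p_vocab = [0.5, 0.3, 0.15, 0.05]                # decoder softmax over the fixed vocab
attention = [0.7, 0.3]                          # attention weights over source tokens

final = {w: p_gen * p for w, p in zip(vocab, p_vocab)}
for tok, a in zip(source_tokens, attention):    # copy: add attention mass to source words
    final[tok] = final.get(tok, 0.0) + (1 - p_gen) * a

# "smergle" gets probability only from the copy (attention) term
print(final["smergle"], final["report"])
```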
State-of-the-art models use transformers instead of RNNs • Lots of pre-training
Note: BERT not directly applicable because we need a unidirectional decoder (BERT is only an encoder)
Summarisation Evaluation
ROUGE: evaluates the degree of word overlap between generated summary and reference/human summary
Measures overlap in N-grams (e.g. from 1 to 3)
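A minimal ROUGE-N (recall) sketch; the example summaries are made up:

```python
from collections import Counter

def rouge_n(generated, reference, n=1):
    # ROUGE-N recall: overlapping n-grams / n-grams in the reference summary.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    gen, ref = ngrams(generated), ngrams(reference)
    overlap = sum(min(c, gen[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

gen = "police killed the gunman".split()
ref = "the gunman was shot dead by police".split()
print(round(rouge_n(gen, ref, n=1), 3))   # 3/7 ≈ 0.429
```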