Applications

Chanming
May 18, 2020

NLP tasks and applications.

Machine translation

Challenges

  • Not just simple word-for-word translation
  • Structural changes, e.g., syntax and semantics
  • Multiple word translations, idioms
  • Inflections for gender, case, etc.
  • Missing information (e.g., determiners)

Statistical Machine Translation

Datasets:

  • Parallel corpora (the same text in multiple languages) => train the translation model
  • Monolingual corpora => train the language model

Use Bayes' rule to decompose the translation probability into a translation model and a language model, i.e.

\[\argmax_e P(f\mid e)P(e)\]

where $P(f\mid e)$ is the translation model and $P(e)$ is the language model
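
As a toy illustration of this decomposition (all probabilities below are invented for the example, not outputs of real models), decoding scores each candidate English sentence $e$ by $P(f\mid e)P(e)$ and keeps the best:

```python
# Toy noisy-channel decoding: all numbers are invented for illustration,
# not the output of real translation/language models.
candidates = {                      # candidate English translations e of a foreign sentence f
    "the house is small":  {"tm": 0.20, "lm": 0.10},    # P(f|e), P(e)
    "small the is house":  {"tm": 0.25, "lm": 0.001},   # literal but disfluent -> low P(e)
    "the home is little":  {"tm": 0.05, "lm": 0.08},
}

best = max(candidates, key=lambda e: candidates[e]["tm"] * candidates[e]["lm"])
print(best)   # "the house is small": highest P(f|e) * P(e)
```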

How to train the translation model

Alignment is required since words do not correspond one-to-one between languages. The idea is to use expectation maximisation (EM) to learn the alignments, since they are rarely provided in the data.

Neural Machine Translation

  1. Architecture: use a single neural model to translate directly from source to target. An encoder-decoder model is formed by a first RNN that encodes the source sentence and a second RNN that decodes the target sentence. Between the two RNNs there is a single vector that encodes the whole source sentence.

  2. Training: requires a parallel corpus; train with next-word prediction, just like a language model. The loss is the sum of the negative log probability of each output word in the decoder: $L = L_1 + L_2 + \cdots + L_n$

  3. Testing: we don't have the target sentence (that is what we want to predict), so we must decode:
    • Greedy decoding: use $\argmax$ at each step to pick the next word. Doesn't guarantee a globally optimal sequence probability.
    • Exhaustive search decoding: to find the optimal $P(y\mid x)$, consider every possible word at each step and compute the probability of all possible sequences. Too computationally expensive.
    • Beam search decoding: instead of all possible words, keep only the $K$-best partial sequences at each step. $K$ is the beam width, usually 5 to 10. When $K = 1$ this reduces to greedy decoding; when $K = V$ (the vocabulary size) it becomes exhaustive search. A minimal sketch follows this list.
  4. When to stop decoding? When we generate a </s> token, or when the decoder reaches a maximum sentence length it is allowed to generate.
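
A minimal sketch of beam search under these assumptions: `next_word_logprobs` is a hypothetical stand-in for the decoder RNN's next-word distribution, and decoding stops on `</s>` or at a maximum length, as described above.

```python
def beam_search(next_word_logprobs, K=5, max_len=20, eos="</s>"):
    """next_word_logprobs(prefix) -> dict {word: log P(word | prefix)}.
    A hypothetical hook standing in for the decoder RNN; K is the beam width."""
    beams = [([], 0.0)]            # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        expanded = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:       # sequence already ended
                finished.append((prefix, score))
                continue
            for word, lp in next_word_logprobs(prefix).items():
                expanded.append((prefix + [word], score + lp))
        if not expanded:
            break
        # Keep only the K highest-scoring partial sequences.
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:K]
    finished.extend(beams)
    return max(finished, key=lambda b: b[1])

# Example with a toy "model" (log-probabilities invented for illustration):
toy = {(): {"the": -0.5, "a": -1.2},
       ("the",): {"cat": -0.4, "</s>": -2.0},
       ("the", "cat"): {"</s>": -0.1},
       ("a",): {"</s>": -0.3}}
print(beam_search(lambda p: toy.get(tuple(p), {"</s>": 0.0}), K=2, max_len=5))
# K=1 gives greedy decoding; K equal to the vocabulary size gives exhaustive search.
```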

Attention Mechanism

The source encoding is a single fixed-size vector, so long sentences may lose information.

Attention addresses this: when decoding each step, the decoder attends to (dot-products with) the encoder states of the source sentence.

Say we have encoder hidden states $h_i = RNN_1(h_{i-1}, x_i)$

And decoder hidden states $s_t = RNN_2(s_{t-1}, y_t)$

For each step in the decoder, attend to each of the encoder hidden states (dot-product the current decoder state $s_t$ with every encoder state $h_i$) to produce the attention scores: $e_t = [s_t^Th_1, s_t^Th_2,\ldots,s_t^Th_{\lvert x\rvert}]$

Apply softmax to the scores $e_t$ to turn them into a valid probability distribution that sums to 1: $\alpha_t = \text{softmax}(e_t)$

Then we compute the weighted sum of the encoder hidden states: $c_t = \sum_{i=1}^{\lvert x\rvert}\alpha_t^i h_i$

Finally, concatenate $c_t$ and $s_t$ to predict the next word.
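
A minimal NumPy sketch of the equations above, with random vectors standing in for the encoder states $h_i$ and the current decoder state $s_t$ (the dimensions are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                     # hidden dimension (illustrative)
H = rng.normal(size=(5, d))               # encoder states h_1 .. h_|x|, with |x| = 5 here
s_t = rng.normal(size=d)                  # current decoder state

e_t = H @ s_t                             # attention scores: e_t[i] = s_t^T h_i
alpha_t = np.exp(e_t - e_t.max())         # softmax (shifted for numerical stability)
alpha_t /= alpha_t.sum()                  # weights now sum to 1
c_t = alpha_t @ H                         # weighted sum of encoder states
out = np.concatenate([c_t, s_t])          # concatenation fed to the layer predicting the next word
```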

Summary of Attention

Solves the information bottleneck issue by allowing the decoder to access the source sentence words directly

Provides some form of interpretability

  • Attention weights can be seen as word alignments

Most state-of-the-art MT systems are based on this architecture

Machine Translation Evaluation

BLEU: compute n-gram overlap between “reference” translation and generated translation, typically 1-4 grams.

Brevity Penalty: $\text{BP} = \min(1, \frac{\text{output length}}{\text{reference length}})$, to penalise short outputs

\[\text{BLEU} = \text{BP} \times \exp\left(\frac{1}{N}\sum_{n=1}^{N}\log p_n\right)\]

where $p_n = \frac{\text{\# correct } n\text{-grams}}{\text{\# predicted } n\text{-grams}}$
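
A minimal single-reference sketch of the formula above (the small constant is only a guard against $\log 0$ when an n-gram order has no matches; real BLEU implementations handle this and multiple references more carefully):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(output, reference, N=4):
    """output, reference: token lists. Single-reference BLEU with clipped n-gram precision."""
    log_p = 0.0
    for n in range(1, N + 1):
        out_counts, ref_counts = ngrams(output, n), ngrams(reference, n)
        correct = sum(min(c, ref_counts[g]) for g, c in out_counts.items())  # clipped matches
        total = max(sum(out_counts.values()), 1)
        log_p += math.log(max(correct, 1e-9) / total)    # guard against log(0) in this sketch
    bp = min(1.0, len(output) / len(reference))          # brevity penalty
    return bp * math.exp(log_p / N)

reference = "the cat is on the mat".split()
output = "the cat sat on the mat".split()
print(bleu(output, reference, N=2))
```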

Information Extraction

Main goal: turn text into structured data such as databases, etc.

Helps decision makers in applications, e.g., stock analysis, medical and biological research, rumour detection.

How to extract?

  • Named Entity Recognition (NER): find entities such as Australia, 1956
  • Relation Extraction: use context to find the relation between Australia and 1956

Named Entity Recognition

Typical entity tags: PER (persons), ORG (companies, team names), LOC (locations) … refer to page 9 @ L18

NER As Sequence Labelling

Challenges: NE tags can be ambiguous: "Washington" can be a person, a location, or even a political entity. This is similar to the POS tagging challenge.

Solution: incorporate context by treating NER as sequence labelling

IO Tagging

All tokens which are not entities get the 'O' tag (outside); other tokens get tags like 'I-ORG', 'I-PER', …

Can't distinguish a single entity spanning multiple tokens from multiple adjacent single-token entities.

IOB Tagging

B-ORG: beginning of an ORG entity

I-ORG: all the subsequent tokens for that entity (if any)
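
A minimal sketch of decoding IOB tags back into entity spans; the sentence and its tags are invented for illustration:

```python
def iob_to_spans(tokens, tags):
    """Convert parallel token/IOB-tag lists into (entity_text, entity_type) tuples."""
    spans, current, etype = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                 # beginning of a new entity
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [token], tag[2:]
        elif tag.startswith("I-") and current:   # continuation of the current entity
            current.append(token)
        else:                                    # "O": outside any entity
            if current:
                spans.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        spans.append((" ".join(current), etype))
    return spans

tokens = ["American", "Airlines", ",", "a", "unit", "of", "AMR", "Corp.", "."]
tags   = ["B-ORG",    "I-ORG",    "O", "O", "O",    "O",  "B-ORG", "I-ORG", "O"]
print(iob_to_spans(tokens, tags))   # [('American Airlines', 'ORG'), ('AMR Corp.', 'ORG')]
```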

Features

Character and word-shape features (e.g., "L'Occitane")

Prefix/suffix:

  • L / L’ / L’O / L’Oc / …
  • e / ne / ane / tane / …

Word shape:

  • X'Xxxxxxxx / X'Xx
  • XXXX-XX-XX (date!)

POS tags / syntactic chunks: many entities are nouns or noun phrases.

Presence in a gazetteer: lists of entities, such as place names, people's names and surnames, etc.
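
A minimal sketch of the word-shape feature mentioned above, following the convention implied by the examples (uppercase letters and digits map to X, lowercase letters to x, everything else is kept):

```python
import re

def word_shape(word):
    """Map uppercase letters and digits to X, lowercase letters to x; keep other characters."""
    return "".join("X" if c.isupper() or c.isdigit()
                   else "x" if c.islower() else c for c in word)

def short_shape(word):
    """Collapse runs of identical shape characters, e.g. X'Xxxxxxxx -> X'Xx."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

print(word_shape("L'Occitane"))    # X'Xxxxxxxx
print(short_shape("L'Occitane"))   # X'Xx
print(word_shape("2020-05-18"))    # XXXX-XX-XX  (date-like shape)
```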

NER as Classifier

The state-of-the-art approach uses LSTMs with character and word embeddings (Lample et al. 2016).

Relation Extraction

Relation database: all possible relations are expressed in frames like unit(American Airlines, AMR Corp.)

If we have access to a fixed relation database:

  • Rule-based
  • Supervised
  • Semi-supervised
  • Distant supervision

If there are no restrictions on relations:

  • Unsupervised
  • Sometimes referred to as "OpenIE"

Evaluation

For NER: F1-measure at the entity level

Relation Extraction with known relation set: F1-measure

Relation Extraction with unknown relations: much harder to evaluate

  • Usually need some human evaluation
  • Massive datasets used in these settings are impractical to evaluate manually: use a small sample
  • Can only obtain (approximate) precision, not recall.

Workshop Questions

What aspects of human language make automatic translation difficult?

What is Information Extraction? What might the “extracted” information look like?

What is Named Entity Recognition and why is it difficult? What might make it more difficult for persons rather than places, and vice versa?

What is the IOB trick, in a sequence labelling context? Why is it important?

What is Relation Extraction? How is it similar to NER, and how is it different?

Why are hand-written patterns generally inadequate for IE, and what other approaches can we take?

Question answering

Question answering (QA) is the task of automatically determining the answer to a natural language question.

Information retrieval based QA

Given a query, retrieve relevant documents, find answers within those documents, and extract a short answer string.

  1. Process the question to find its key parts
    • discard structural parts, stop words, punctuation, etc.
    • formulate it as a TF-IDF query, using unigrams or bigrams
    • identify entities and prioritise matching them
  2. Identify the answer type
    • helps find the right passage to search and narrows the search scope
    • helps find the right answer string
    • treat it as classification: given the question, predict the answer type, with the question headword as a key feature
  3. Retrieval
    • find the top n documents matching the query (standard IR)
    • find candidate passages (sentences and paragraphs) within them
    • rank the passages and pick the best one
  4. Extract the answer
    • Feature-based answer extraction: frame it as a classification problem: does a candidate passage (sentence or paragraph) contain the answer? YES/NO
    • Neural answer extraction: use a neural network to extract the answer, a.k.a. the reading comprehension task. Data-demanding: is our dataset sufficient?
  5. Examples: LSTM-based and BERT-based answer extraction.

E.g., BERT: given the question and a paragraph (or sentence) as input, BERT layers in the middle produce a start index and an end index of the answer span.
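
A minimal sketch using the HuggingFace `transformers` question-answering pipeline (an assumption: that library is installed; its default model is a BERT-style reader fine-tuned on SQuAD):

```python
from transformers import pipeline

# Downloads a default extractive-QA model (a BERT-style reader fine-tuned on SQuAD).
qa = pipeline("question-answering")

context = ("Alan Turing was a pioneering computer scientist. "
           "He was born in London in 1912.")
result = qa(question="Where was Turing born?", context=context)

# The model predicts a start and an end index into the context,
# and the pipeline returns the corresponding answer span.
print(result["answer"], result["start"], result["end"])
```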

Knowledge-based QA

Build a semantic representation of the query; then, based on that representation, query a database of facts to find answers.

Many large knowledge databases already exist; we only need to support natural language queries.

Semantic Parsing

Based on questions aligned with their logical forms, or on building the logical form compositionally.

QA Evaluation

TREC-QA: Mean Reciprocal Rank for systems returning matching passages or answer strings:

  • e.g., the system returns 4 passages for a query and the first correct passage is the third one, so the reciprocal rank is $1/3$; MRR averages this over queries
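
A minimal sketch of MRR over a set of queries; the relevance flags are invented for illustration:

```python
def mean_reciprocal_rank(results_per_query):
    """results_per_query: list of ranked result lists; True marks a correct passage/answer."""
    total = 0.0
    for ranked in results_per_query:
        rr = 0.0
        for rank, is_correct in enumerate(ranked, start=1):
            if is_correct:
                rr = 1.0 / rank          # reciprocal rank of the first correct result
                break
        total += rr
    return total / len(results_per_query)

# One query whose first correct passage is ranked third -> RR = 1/3
print(mean_reciprocal_rank([[False, False, True, False]]))   # 0.333...
```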

SQuAD: exact match of the predicted string against the gold answer, plus F1 score over the bag of selected tokens.

MCQ reading comprehension: Accuracy

Workshop Questions 1

What is Question Answering?

What is semantic parsing, and why might it be desirable for QA? Why might approaches like NER be more desirable?

What are the main steps for answering a question for a QA system?

Topic modelling

Topic models learn common, overlapping themes in a document collection. It is an unsupervised method, similar to clustering.

Latent Dirichlet Allocation

Introduces priors on the document-topic and topic-word distributions of probabilistic LSA; it is the Bayesian version of probabilistic LSA.

Fully generative: trained LDA model can infer topics on unseen documents!

Core idea: assume each document contains a mix of topics

  1. Input
    • A collection of documents
    • Bag-of-words representation
    • Good preprocessing practice: remove stopwords, remove low- and high-frequency word types, lemmatise
  2. Output
    • Topics: a multinomial distribution over words for each topic, i.e. the word distribution of each topic
    • Topics in documents: a multinomial distribution over topics for each document, i.e. the topic proportions of each document
  3. Sampling method (Gibbs sampling)
    • Randomly assign topics to all tokens in the documents (initial assignment)
    • Collect topic-word and document-topic co-occurrence statistics based on the assignments
    • Go through every word token in the corpus and sample a new topic for it based on the statistics collected so far
    • Repeat until convergence
  4. Predicting topics for unseen documents (see the sketch after this list):
    • Randomly assign topics to the tokens, as before
    • Update the document-topic distribution based on that assignment, keeping the trained topic-word matrix fixed
    • Sample new topics for the word tokens and repeat until convergence
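
A minimal sketch using the gensim library (note: gensim's `LdaModel` is trained with online variational Bayes rather than the Gibbs sampler described above, but the inputs and outputs are the same kinds of objects; the toy documents are invented):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy preprocessed (stopword-free, lemmatised) documents.
docs = [["stock", "market", "trade", "price"],
        ["gene", "protein", "cell", "dna"],
        ["market", "price", "share", "stock"],
        ["cell", "dna", "gene", "sequence"]]

dictionary = Dictionary(docs)                      # word <-> id mapping
corpus = [dictionary.doc2bow(d) for d in docs]     # bag-of-words per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=20, random_state=0)

print(lda.print_topics())                          # word distribution per topic
print(lda.get_document_topics(corpus[0]))          # topic distribution of a training document
unseen = dictionary.doc2bow(["stock", "price"])
print(lda.get_document_topics(unseen))             # inference on an unseen document
```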

Hyper-parameters: $\beta$, the prior on the topic-word distribution; $\alpha$, the prior on the document-topic distribution.

Topic Modelling Evaluation

Model log-probability / perplexity on test documents:

\[L = \prod_w\sum_t P(w\mid t)P(t\mid d_w)\]

\[\text{ppl} = \exp\left(-\frac{\log L}{m}\right)\]

where $m$ is the total number of word tokens in test documents
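
A minimal sketch of this evaluation, with an invented topic-word matrix $P(w\mid t)$ and document-topic distribution $P(t\mid d)$:

```python
import numpy as np

# Invented topic-word matrix P(w|t): 2 topics over a 4-word vocabulary (rows sum to 1).
phi = np.array([[0.5, 0.3, 0.1, 0.1],
                [0.1, 0.1, 0.4, 0.4]])
theta_d = np.array([0.7, 0.3])          # invented topic distribution P(t|d) for one test document
doc = [0, 0, 1, 2]                      # word ids of the test document's m = 4 tokens

log_L = sum(np.log(phi[:, w] @ theta_d) for w in doc)   # log of prod_w sum_t P(w|t) P(t|d)
ppl = np.exp(-log_L / len(doc))                         # exp(-log L / m)
print(log_L, ppl)
```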

Workshop Qs

What is a Topic Model?

What is Latent Dirichlet Allocation, and what are its strengths?

What are the different approaches to evaluating a topic model?

Document summarization

Extractive summarisation

Summarise by selecting representative sentences from documents

Single document

  1. Content selection: select which sentences to extract from the document. Goal: find sentences that are important or salient. Methods:
    • Saliency-based
      • TF-IDF: frequent words => salient; need to remove stop words and function words
      • Log-likelihood ratio: a word is salient if its probability in the input corpus is very different from that in a background corpus
      • Then calculate the saliency of a sentence: $\text{weight}(S) = \frac{1}{\lvert S\rvert} \sum\limits_{w\in S}\text{weight}(w)$, after removing all stop words
    • Sentence-centrality-based: measure the distance between sentences, and choose sentences that are closer to the other sentences (see the sketch after this list)
      • Use TF-IDF to represent each sentence
      • Use cosine similarity to measure distance
      • $\text{centrality}(s) = \frac{1}{\#\text{sentences}}\sum_{s'}\cos_{\text{tfidf}}(s, s')$
    • RST parsing: Rhetorical Structure Theory (discourse, L12)
      • The nucleus is more important than the satellite
      • A sentence that functions as the nucleus for more sentences (more supporting clauses) is more salient
  2. Information ordering: decide how to order the extracted sentences

  3. Sentence realisation: clean up to make sure the combined sentences are fluent
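
A minimal sketch of centrality-based content selection using scikit-learn's TF-IDF vectoriser and cosine similarity (the sentences are invented examples):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The stock market fell sharply on Monday.",
    "Markets dropped as investors reacted to the report.",
    "The report was released by the central bank.",
    "Analysts expect further declines this week.",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)  # sentence vectors
sim = cosine_similarity(tfidf)                                          # pairwise cosine similarity

# centrality(s) = average similarity of s to all sentences
# (the self-similarity term is a constant 1, so it does not change the ranking)
centrality = sim.mean(axis=1)
best = int(np.argmax(centrality))
print(sentences[best])
```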

Multi-document

  1. Content Selection
    • We can use the same unsupervised content selection methods (TF-IDF, log-likelihood ratio, centrality) to select salient sentences
    • But ignore sentences that are redundant
    • Maximum Marginal Relevance: iteratively select the best sentence to add to the summary (see the sketch after this list)
      • Penalise a candidate sentence if it is similar to already-extracted sentences
      • Stop when the desired number of sentences has been added
    • Sentence simplification: create multiple simplified versions of each sentence before extraction, and use Maximum Marginal Relevance to ensure only non-redundant versions are selected
  2. Information Ordering
    • Chronological ordering: order by document dates
    • Coherence ordering: order in a way that makes adjacent sentences similar, or order based on how entities are organised (Centering Theory, L12)
  3. Sentence Realisation
    • Make sure entities are referred to coherently: full name at first mention, last name at subsequent mentions
    • Apply coreference methods to first extract names
    • Write rules to clean up
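
A minimal sketch of greedy Maximum Marginal Relevance; using centrality as the relevance score and the `lam` trade-off parameter are assumptions for illustration:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mmr_select(sentences, num_sentences=2, lam=0.7):
    """Greedy Maximum Marginal Relevance: balance saliency against redundancy.
    lam weights relevance vs. similarity to already-selected sentences."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    relevance = sim.mean(axis=1)                 # centrality as a stand-in for saliency
    selected = []
    while len(selected) < min(num_sentences, len(sentences)):
        best, best_score = None, -np.inf
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Penalise candidates that are similar to what we have already extracted.
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [sentences[i] for i in selected]

docs_sentences = [
    "Markets fell sharply after the central bank raised rates.",
    "Stocks dropped as the central bank increased interest rates.",   # redundant with the first
    "Analysts expect further volatility later this week.",
]
print(mmr_select(docs_sentences, num_sentences=2))
```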

Abstractive summarisation

Summarise the content in your own words; summaries will often be paraphrases of the original content.

Difficult; needs a neural encoder-decoder system.

Input: Document

Output: summary

Data: news headlines

  • Document: first sentence of the article
  • Summary: news headline / title
  • Technically this is more like a "headline generation" task

Improvements

  • Attention mechanism
  • Richer word features: POS tags, NER tags, tf-idf
  • Hierarchical encoders: one LSTM for words, another LSTM for sentences

Summarisation Challenges

Occasionally reproduce statements incorrectly (hallucinate new details!)

Unable to handle out-of-vocabulary words in the document: the model generates UNK in the summary, e.g., for new names in test documents.

Solution: allow decoder to copy words directly from input document during generation

Copy mechanism:

  • Generate summaries that reproduce details in the document
  • Can produce out-of-vocabulary words in the summary by copying them from the document
    • e.g., smergle = out of vocabulary
    • p(smergle) = attention probability + generation probability = attention probability (its generation probability is 0 because it is not in the vocabulary)

State-of-the-art models use transformers instead of RNNs, with lots of pre-training.

Note: BERT is not directly applicable because we need a unidirectional decoder (BERT is only an encoder).

Summarisation Evaluation

ROUGE: evaluates the degree of word overlap between the generated summary and a reference/human summary.

Measures overlap in N-grams (e.g. from 1 to 3)
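
A minimal sketch of ROUGE-N recall (real ROUGE implementations also report precision and F1, and support multiple references):

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(generated, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams that also appear in the generated summary."""
    gen, ref = ngrams(generated, n), ngrams(reference, n)
    overlap = sum(min(c, gen[g]) for g, c in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "the cat sat on the mat".split()
generated = "the cat lay on the mat".split()
print(rouge_n(generated, reference, n=1))   # 5/6 unigram recall
print(rouge_n(generated, reference, n=2))   # 3/5 bigram recall
```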