Semantics

Chanming
Apr 28, 2020

How words form meaning

Lexical Semantics

How the meanings of words connect to one another. Based on manually constructed resources: lexicons, thesauri, ontologies, etc.

What Challenge Does It Solve?

Bag-of-words features and KNN classifiers struggle with:

  1. synonyms, movie = film
  2. OOV words when observing new documents

We need a way to express the meaning of a word. Solution: add meaning information explicitly through a lexical database.

How to define word meanings

By its Definition

Dictionary definitions

Word Senses (one aspect of the meaning of the word)

Word glosses (textual definition of a sense), e.g. for bank: a financial institution that accepts deposits and channels the money into lending activities.

Through Relations

  • Synonymy: big = large
  • Antonymy: big <=> little
  • Hypernymy: cat is an animal, i.e. an IS-A relation
  • Meronymy: wheel is part of a car, i.e. a PART-WHOLE relation

WordNet, a database of lexical relations

Nodes of WordNet are not words or lemmas, but senses, each represented by a synset. A word with multiple senses belongs to multiple synsets.

Synsets

Synsets are sets of synonyms that share the same sense.
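A minimal sketch, assuming NLTK with its WordNet data downloaded (nltk.download("wordnet")): each word maps to one or more synsets, and each synset carries a gloss and its synonym lemmas.

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("bank"):
    print(synset.name(), "-", synset.definition())                 # the gloss
    print("   lemmas:", [lemma.name() for lemma in synset.lemmas()])
```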

Word Similarity

Unlike synonymy (which is a binary relation), word similarity is a spectrum

Path Length

Similarity between two senses: $\text{simPath}(c_1,c_2)=\frac{1}{\text{pathlen}(c_1,c_2)}$. In other words, the longer the path between two senses, the less similar they are.

Path length is calculated by counting the nodes on the path in the tree from $c_1$ to $c_2$, inclusive of both endpoints.

Similarity between two words: find the senses $c_1\in \text{senses}(w_1)$ and $c_2\in \text{senses}(w_2)$ that maximise $\text{simPath}(c_1,c_2)$.
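A minimal sketch, assuming NLTK's WordNet: its path_similarity corresponds to 1/pathlen, and word-level similarity takes the maximum over all sense pairs.

```python
from itertools import product

from nltk.corpus import wordnet as wn

def word_path_similarity(w1, w2):
    """Max path similarity over all sense pairs of the two words."""
    scores = [s for c1, c2 in product(wn.synsets(w1), wn.synsets(w2))
              if (s := c1.path_similarity(c2)) is not None]
    return max(scores) if scores else 0.0

print(word_path_similarity("nickel", "money"))
```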

Challenge:

  • edges vary widely in actual semantic distance => path lengths can be much bigger than the true semantic difference suggests

Solution:

  • include depth information.

Lowest Common Subsumer (LCS): simwup (Wu & Palmer)

\(\text{simwup}(c_1,c_2)=\frac{2*\text{depth}(\text{LCS}(c_1,c_2))}{\text{depth}(c_1)+\text{depth}(c_2)}\)

$\text{depth}(\text{LCS}(c_1,c_2))$ is the depth of the lowest common subsumer: the deepest node that is an ancestor of both senses. Notice: count depth from the root, with the root at level 1.
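A minimal sketch, assuming NLTK's WordNet: wup_similarity implements the Wu & Palmer formula above, and lowest_common_hypernyms exposes the LCS itself.

```python
from nltk.corpus import wordnet as wn

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
lcs = dog.lowest_common_hypernyms(cat)[0]        # the lowest common subsumer
print(lcs.name(), lcs.max_depth())               # shared ancestor and its depth
print(dog.wup_similarity(cat))                   # 2*depth(LCS) / (depth(c1)+depth(c2))
```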

Challenge:

But this still results in high similarity for nodes that sit high in the hierarchy (abstract/general concepts). What we want is for abstract nodes to be much less similar.

Concept Probability

\[P(c)=\frac{\sum_{w\in\text{words}(c)}\text{count}(w)}{N}\]

P(c): probability that a randomly selected word in a corpus is an instance of concept c

words(c): set of all words that are children of c, including its grandchildren and all below.

words(geological-formation) = {hill, ridge, grotto, coast, natural elevation, cave, shore}

words(natural elevation) = {hill, ridge}

Performance: abstract nodes higher in the hierarchy have a higher P(c).
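A minimal sketch with made-up corpus counts, just to illustrate the formula: a more abstract concept covers more words, so it accumulates a higher P(c).

```python
# Hypothetical word counts from a toy corpus (not real data).
count = {"hill": 60, "ridge": 20, "grotto": 5, "coast": 100,
         "natural elevation": 5, "cave": 30, "shore": 80}
N = sum(count.values())          # here the toy corpus contains only these words

def p(concept_words):
    """P(c) = sum of counts of all words under concept c, divided by N."""
    return sum(count[w] for w in concept_words) / N

print(p({"hill", "ridge"}))      # natural elevation
print(p(set(count)))             # geological-formation: higher, since it covers every word
```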

Similarity with Information Content

\(IC(c)=-\log P(c)\). The higher a concept sits in the tree, the lower its $IC(c)$.

\[\text{simlin}(c_1,c_2) = \frac{2*IC(LCS(c_1,c_2))}{IC(c_1)+IC(c_2)}\]

This uses IC instead of depth information.
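A minimal sketch, assuming NLTK's WordNet plus the wordnet_ic corpus (nltk.download("wordnet_ic")): lin_similarity implements simlin with information content estimated from the Brown corpus.

```python
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic("ic-brown.dat")         # pre-computed IC values
dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
print(dog.lin_similarity(cat, brown_ic))         # 2*IC(LCS) / (IC(c1) + IC(c2))
```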

Word Sense Disambiguation

Find the correct sense for each word in a sentence. Baseline: assume the most frequent sense.

  • Supervised WSD (sense-tagged corpora are very hard to create!)
  • Less-supervised WSD (choose the sense whose gloss has the most overlap with the word's context; see the sketch below)
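A minimal sketch, assuming NLTK: nltk.wsd.lesk implements exactly this gloss-overlap idea (the simplified Lesk algorithm).

```python
from nltk.wsd import lesk

context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank")                    # synset with max gloss/context overlap
print(sense.name(), "-", sense.definition())
```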

Other databases: FrameNet

Refer L9

About Corpus

Manually tagged lexical resources are an important starting point for text analysis.

But much modern work attempts to derive semantic information directly from corpora, without human intervention

Solution: Distributional semantics!

Distributional semantics

How words relate to each other in the text

Automatically created resources from corpora.

Problems of Lexical Semantic Approach

  • Manually constructed => expensive, highly biased and noisy
  • Language is dynamic => new slang, new terminology, and new senses appear all the time
  • A new idea: use the Internet (massive amounts of text) to derive word meanings automatically.

Hypothesis

  • Document co-occurrence is often indicative of a similar topic (document as context)
  • Local context reflects a word’s semantic class

Distributional Representation

What is a distributional representation? It describes how a word is distributed across documents or a corpus.

We guess a word's meaning from its usage, especially the contexts it shares with other words. We can therefore create a word vector recording where the word occurs in the documents or corpus, which describes its distributional properties.

In general, this captures all sorts of semantic relationships (synonymy, analogy, etc)

Count based methods

TF-IDF with dimensionality reduction, since the raw vectors are usually very sparse.

Dimensionality reduction: using SVD, as in LSA.

We are using the SVD method to build a representation of our matrix which can be used to identify the most important characteristics of words.

By throwing away the less important characteristics, we get a smaller representation of the words, which saves us (potentially a great deal of) time when evaluating the cosine similarities between word pairs.
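A minimal sketch, assuming scikit-learn and a hypothetical toy corpus: build a sparse TF-IDF matrix, then keep only the top-k latent dimensions with truncated SVD (essentially LSA).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the world cup final", "the stock market fell", "a cup of tea"]   # toy corpus
tfidf = TfidfVectorizer().fit_transform(docs)                 # documents x terms (sparse)
word_vectors = TruncatedSVD(n_components=2).fit_transform(tfidf.T)   # terms x 2
print(word_vectors.shape)                                     # dense, low-dimensional word vectors
```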

Point-wise Mutual Information

\(PMI(x,y) = \log_2\frac{p(x,y)}{p(x)p(y)}\) where the numerator is the probability of x and y co-occurring in a document, estimated by counting such documents.

For details, refer to Page 16 @ L10.

This value is slightly positive, which means that the two events occur together (in documents) slightly more commonly than would occur purely by chance. There is some possibility that world and cup occurring together is somehow meaningful for documents in this collection.

PMI does a better job of capturing interesting semantics, but it is biased towards rare words (the probabilities need smoothing), and it does not handle zero counts naturally (PMI would be $-\infty$, so these are usually set to 0).
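A minimal sketch with made-up document counts for "world" and "cup", just to show the arithmetic behind a slightly positive PMI.

```python
import math

N = 50_000          # total documents (hypothetical)
n_world = 1_000     # documents containing "world"
n_cup = 200         # documents containing "cup"
n_both = 12         # documents containing both

p_x, p_y, p_xy = n_world / N, n_cup / N, n_both / N
print(math.log2(p_xy / (p_x * p_y)))   # ~1.58 > 0: they co-occur more often than chance
```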

Neural methods to learn word vectors

Word2Vec, Word embedding

We’re going to have a representation of words (based on their contexts) in a vector space, such that other words “nearby” in the space are similar

This is broadly the same as what we expect in distributional similarity (“you shall know a word by the company it keeps.”)

The row corresponding to the word in the relevant (target/context) matrix is known as the “embedding”.

Key idea:

  • Word embeddings should be similar to embeddings of neighbouring words
  • And dissimilar to other words that don’t occur nearby

Framed as learning a classifier:

  • Skip-gram: given center words, predict the surrounding context (words)
  • CBOW: predict the centre word, given the local context (surrounding words). Here, local context means words within L positions, L = 1, 2, ... If L = 1, then for A B C, the local context of B is A and C.
Skip-gram model Training

Use a logistic-regression-style classifier to model the probability $P(w_{center + l}\mid w_{center})$ where $l\in \{-2,-1,1,2\}$ if $L=2$.

The probabilities here are more complicated than just counting events in a collection; they are based on taking the dot product of the relevant vectors (or the average of vectors, in the case of CBOW) and then normalising with a softmax. To improve efficiency, we can also use a negative-sampling objective.
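A minimal sketch, assuming gensim (parameter names follow gensim 4.x) and a toy tokenised corpus: train a skip-gram model with negative sampling and inspect an embedding.

```python
from gensim.models import Word2Vec

sentences = [["the", "tram", "was", "late"],
             ["he", "rode", "his", "bike", "to", "work"]]   # toy corpus
model = Word2Vec(sentences, vector_size=50, window=2,
                 sg=1,            # sg=1 -> skip-gram (0 would be CBOW)
                 negative=5,      # negative sampling instead of a full softmax
                 min_count=1, epochs=50)
print(model.wv["tram"][:5])       # the word's embedding (a row of the target matrix)
print(model.wv.most_similar("tram", topn=3))
```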

Benefits

Unsupervised: needs only a raw, unlabelled corpus. Efficient:

  • Negative sampling (avoid softmax over full vocabulary)
  • Scales to very, very large corpora

Useful representation ‣ How do we evaluate word vectors?

Evaluation: test the word embeddings against lexical databases.

Contextual Representation

Issues with Word Vector (Embedding) Representations

Each word type has only one representation; in other words, the representation is always the same, no matter what context the word appears in. This does not capture the multiple senses of a word well.

Contextual Representation Definition

Representations of words based on their context; they require pretrained contextual models.

The contextual representation of a word is a representation of the word based on a particular usage. It captures the different senses or nuances of the word depending on the context.

Contextualised representations are different from word embeddings (e.g. Word2Vec), which give a single representation for every word type, regardless of sense.

Contextual representations that are pre-trained on large data can be seen as a model that has obtained fairly comprehensive knowledge about the language.

Recall RNN Language Model

An RNN LM successfully captures the context information on the left of a word, but leaves the context on the right untouched.

Solution: Bi-directional RNN.

Two parallel RNNs run in opposite directions: one from left to right, one from right to left. Their states at the same position are combined to produce the output.

This gives the centre word a representation that covers both the left and right context. For each position, $y_i = \text{softmax}([s_i,u_i])$, where $s_i$ and $u_i$ are the states of the two parallel RNNs; the contextual embedding is $[s_i, u_i]$.
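A minimal sketch, assuming PyTorch: a bidirectional LSTM over a toy sequence, where each position's output is the concatenation of the forward state $s_i$ and the backward state $u_i$.

```python
import torch
import torch.nn as nn

emb_dim, hidden = 16, 32
birnn = nn.LSTM(input_size=emb_dim, hidden_size=hidden,
                bidirectional=True, batch_first=True)
x = torch.randn(1, 5, emb_dim)     # 1 sentence of 5 token embeddings (random stand-ins)
out, _ = birnn(x)
print(out.shape)                   # (1, 5, 64): [s_i, u_i] for each of the 5 tokens
```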

ELMo: Embeddings from Language Models

Trains a bidirectional, multi-layer LSTM language model over a 1B-word corpus.

Applications

Lower layer representation = captures syntax

  • good for POS tagging, NER

Higher layer representation = captures semantics

  • good for QA, textual entailment, sentiment analysis

BERT: Bidirectional Encoder Representations from Transformers

Uses self-attention networks (aka Transformers) to capture dependencies between words

Hence there is no sequential processing.

Masked language model objective to capture deep bidirectional representations

It loses the ability to generate language.

Masked Language Model

‘Mask’ out k% of tokens at random

Objective: predict the masked words
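A minimal sketch, assuming the Hugging Face transformers library: BERT's masked-LM head predicts the token behind [MASK] using context from both sides.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Yesterday, Ted was late for [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))   # top candidate fillers
```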

Next Sentence Prediction

Learn relationships between sentences

Predicts whether sentence B follows sentence A

Discourse: Semantics across Sentences

Discourse: understanding how sentences relate to each other in a document

Discourse segmentation: document segmentation

Segmentation based on a particular topic or function, e.g. introduction, related work, conclusion, etc.

In Discourse Segmentation, we try to divide up a text into discrete, cohesive units based on sentences.

By interpreting the task as a boundary-finding problem, we can use rule-based or unsupervised methods to find sentences with little lexical overlap (suggesting a discourse boundary).

We can also use supervised methods, by training a classifier around paragraph boundaries.

Unsupervised: low lexical cohesion between sentences

When adjacent sentences have low lexical cohesion, we place a segment boundary between them. Approach:

For each sentence gap $i$:

  • Create two BOW vectors consisting of words from the k sentences on either side of the gap
  • Use cosine similarity to get a similarity score (sim) for the two vectors
  • For each gap $i$, calculate a depth score for $i$
  • If that depth score is greater than some threshold $t$: insert a boundary at gap $i$
\[\text{depth}(\text{gap}_i) = (sim_{i-1} - sim_i) + (sim_{i+1} - sim_i)\]
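A minimal sketch of this gap-scoring idea (a toy corpus and k = 1, not the full TextTiling algorithm): BOW vectors on each side of every gap, cosine similarities, then a depth score for each interior gap.

```python
from collections import Counter
import math

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sentences = [s.lower().split() for s in [
    "He waited for the tram", "The tram never came",
    "Cats sleep all day", "My cats sleep on the sofa"]]
k = 1                                             # sentences on either side of a gap
sims = [cosine(Counter(sum(sentences[max(0, i + 1 - k):i + 1], [])),
               Counter(sum(sentences[i + 1:i + 1 + k], [])))
        for i in range(len(sentences) - 1)]
depths = [(sims[i - 1] - sims[i]) + (sims[i + 1] - sims[i])
          for i in range(1, len(sims) - 1)]       # interior gaps only
print(sims, depths)                               # a high depth score suggests a boundary
```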

Supervised: Using Labelled data :)

Apply a binary classifier to identify boundaries

Or use sequential classifiers

Potentially include classification of section types (introduction, conclusion, etc.)

Integrate a wider range of features, including:

  • distributional semantics
  • discourse markers (therefore, and, etc)

Discourse parsing: relations between clauses and sentences

Tasks: Identify discourse units, and the relations that hold between them

Basic element: elementary discourse units (EDUs) ‣ typically clauses of a sentence ‣ EDUs do not cross sentence boundaries

Rhetorical Structure Theory (RST), is a framework to do hierarchical analysis of discourse structure in documents

RST relations between discourse units: ‣ conjunction, justify, concession, elaboration, etc.

Within a discourse relation, the primary argument is the nucleus, while the supporting argument is the satellite. Some relations are equal (e.g. conjunction), so both arguments are nuclei.

The relations between discourse units build up an RST tree.

Parsing in Machine Learning

Parsing can be done by identifying discourse markers (although, but, for example, so, ...) with a rule-based parser. However, markers can be ambiguous, and relations can hold without any marker => use machine learning.

Dataset

RST Discourse Treebank – 300+ documents annotated with RST trees

Basic Idea
  1. Segment document into EDUs
  2. Combine adjacent DUs into composite DUs iteratively to create the full RST tree
Parsing techniques
  • Transition-based parsing: bottom-up, greedy, using the shift-reduce algorithm from Lecture L16
  • CYK/chart parsing algorithm: bottom-up, global but sub-optimal (some constraints exist), from Lecture L14
  • Top-down parsing: framed as a sequence labelling problem, using a BERT model
Features

  • Bag of words
  • Discourse markers
  • Starting/ending n-grams
  • Location in the text
  • Syntax features
  • Lexical and distributional similarities

Discourse Parsing Applications

  • Summarisation
  • Sentiment analysis
  • Argumentation
  • Authorship attribution
  • Essay scoring

Anaphora resolution

Anaphor: a linguistic expression that refers back to an earlier element in the text.

Anaphors (most commonly pronouns like he, it, they, she, ...) have an antecedent in the discourse, often but not always a noun phrase.

e.g. Yesterday, Ted was late for work. It all started when his car wouldn’t start.

Antecedent Preferences:

  • The antecedents of pronouns should be recent, e.g. He waited for another 20 minutes, but the tram didn't come. So he walked home and got his bike out of the garage. He started riding it to work.
  • The antecedent should be salient, as determined by grammatical position: subject > object > argument of preposition, e.g. Ted usually rode to work with Bill. He was never late.

Centering Theory: helps resolve anaphora

A unified account of relationship between discourse structure and entity reference

Every utterance in the discourse is characterised by a set of entities, known as centers

Explains preference of certain entities for ambiguous pronouns

The most obvious (but inherently unreliable) heuristic is the recency heuristic: given multiple possible referents (that are consistent in meaning with the anaphor), the most likely intended one is the one most recently used in the text.

A better heuristic is that the most likely referent (consistent in meaning with the anaphor) is the focus of the discourse (the “center”). (Centering Theory)

We can also build a supervised machine learning model, usually based around the semantic properties of the anaphor/nearby words and the sentence/discourse structure.

Motivation or Application

Essential for deep semantic analysis and semantic understanding, which is very useful for QA, e.g. reading comprehension.

Ted’s car broke down. So he went over to Bill’s house to borrow his car. Bill said that was fine.

Questions like: Whose car is borrowed?