Semantics
How words form meaning
Lexical Semantics
How the meanings of words connect to one another. Manually constructed resources: lexicons, thesauri, ontologies, etc.
What challenge does it solve?
Bag-of-words representations (e.g. with kNN classifiers) have problems:
- synonyms are treated as unrelated words, e.g. movie = film
- OOV words when observing new documents
We need a way to express the meaning of a word. Solution: add the meaning information explicitly through a lexical database.
How to define word meanings
By its Definition
Dictionary definitions
Word senses (one aspect of the meaning of a word)
Word glosses (the textual definition of a sense), e.g. for bank: a financial institution that accepts deposits and channels the money into lending activities.
Through Relations
- Synonymy: big = large
- Antonymy: big <=> little
- Hypernymy: a cat is an animal, i.e. a IS-A b
- Meronymy: a wheel is part of a car, i.e. a PART-WHOLE relation
WordNet, a database of lexical relations
Nodes of WordNet are not words or lemmas, but senses; each sense is represented by a synset.
Synsets
Sets of synonymous lemmas; each synset represents one sense.
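A minimal sketch of inspecting senses, glosses, and synsets, assuming NLTK and its WordNet data are available (not something the notes depend on):

```python
# Minimal sketch: inspecting WordNet senses with NLTK
# (assumes nltk is installed and nltk.download('wordnet') has been run).
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bank'):
    # Each synset is one sense: a set of synonymous lemmas plus a gloss.
    print(synset.name(), synset.lemma_names(), '-', synset.definition())
```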
Word Similarity
Unlike synonymy (which is a binary relation), word similarity is a spectrum
Path Length
Similarity between two senses: $\text{simPath}(c_1,c_2)=\frac{1}{\text{pathlen}(c_1,c_2)}$. In other words, the further apart two senses are, the less similar they are.
Path length is calculated by counting the nodes on the path from $c_1$ to $c_2$, with both endpoints included.
Similarity between two words: find the senses $c_1\in \text{senses}(w_1)$ and $c_2\in \text{senses}(w_2)$ that maximise $\text{simPath}(c_1,c_2)$.
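A small sketch using NLTK's built-in path similarity; note that NLTK counts edges rather than nodes, so the exact values may differ from the convention described above:

```python
# Sketch: path-based similarity between senses, and between words
# (maximising over sense pairs), using NLTK's WordNet interface.
from nltk.corpus import wordnet as wn

print(wn.synset('car.n.01').path_similarity(wn.synset('bicycle.n.01')))

def word_path_sim(w1, w2):
    """Max path similarity over all sense pairs of the two words."""
    scores = [c1.path_similarity(c2)
              for c1 in wn.synsets(w1)
              for c2 in wn.synsets(w2)]
    return max((s for s in scores if s is not None), default=None)

print(word_path_sim('car', 'bicycle'))
```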
Challenge:
- edges vary widely in actual semantic distance, so the same path length can correspond to very different degrees of similarity
Solution:
- include depth information.
Lowest Common Subsumer (LCS): simwup (Wu & Palmer)
\(\text{simwup}(c_1,c_2)=\frac{2\times\text{depth}(\text{LCS}(c_1,c_2))}{\text{depth}(c_1)+\text{depth}(c_2)}\) Here $\text{depth}(\text{LCS}(c_1,c_2))$ is the depth of the lowest common subsumer: the deepest node that is an ancestor of both senses. Note: count depth from the root, with the root at level 1.
Challenge:
But this still gives high similarity to senses that sit high in the hierarchy (abstract/general concepts). What we want is for abstract nodes to contribute much less similarity.
Concept Probability
\[P(c)=\frac{\sum_{w\in\text{words}(c)}\text{count}(w)}{N}\] $P(c)$: the probability that a randomly selected word in a corpus is an instance of concept $c$.
words(c): the set of all words that are descendants of $c$ (children, grandchildren, and everything below).
words(geological-formation) = {hill, ridge, grotto, coast, natural elevation, cave, shore}
words(natural elevation) = {hill, ridge}
As a result, abstract nodes higher in the hierarchy have a higher P(c).
Similarity with Information Content
\(IC(c)=-\log P(c)\); the higher a concept sits in the tree, the lower its $IC(c)$.
\[\text{simlin}(c_1,c_2) = \frac{2\times IC(\text{LCS}(c_1,c_2))}{IC(c_1)+IC(c_2)}\] This uses IC instead of depth information.
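A sketch of the depth-based and IC-based measures via NLTK, using IC counts pre-computed from the Brown corpus (assumes the wordnet_ic data has been downloaded):

```python
# Sketch: Wu & Palmer (depth-based) and Lin (IC-based) similarity with NLTK.
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')     # concept probabilities from Brown
dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')

print(dog.wup_similarity(cat))               # uses depth of the LCS
print(dog.lin_similarity(cat, brown_ic))     # uses IC of the LCS
```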
Word Sense Disambiguation
Find the correct sense for each word in a sentence. Baseline: assume the most frequent sense.
- Supervised WSD (sense-tagged corpora are very hard to create!)
- Less-supervised WSD (e.g. the Lesk algorithm: choose the sense whose gloss has the most overlap with the word's context; sketched below)
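A sketch of gloss-overlap disambiguation using NLTK's Lesk implementation; the example sentence is an assumption:

```python
# Sketch: simplified Lesk word sense disambiguation with NLTK.
from nltk.wsd import lesk

context = 'I deposited my pay cheque at the bank this morning'.split()
sense = lesk(context, 'bank')   # synset whose gloss best overlaps the context
print(sense, '-', sense.definition())
```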
Other databases: FrameNet
Refer to L9.
About Corpus
Manually tagged lexical resources are an important starting point for text analysis
But much modern work attempts to derive semantic information directly from corpora, without human intervention
Solution: Distributional semantics!
Distributional semantics
How words relate to each other in the text
Resources created automatically from corpora.
Problems of Lexical Semantic Approach
- Manually constructed => expensive, highly biased and noisy
- Language is dynamic => new slang, new terminology, and new senses appear all the time
- A new idea: use the Internet (massive amounts of text) to derive word meanings automatically.
Hypothesis
- Document co-occurrences often are indicative of similar topic (document as context)
- Local context reflects a word’s semantic class
Distributional Representation
What is a distributional representation? It describes how a word is distributed across documents or contexts in a corpus.
We infer a word's meaning from its usage, especially from the contexts it shares with other words. We can therefore create a word vector recording the word's occurrences across the documents or corpus, which describes its distributional properties.
In general, this captures all sorts of semantic relationships (synonymy, analogy, etc)
Count based methods
TF-IDF with dimensionality reduction, since the raw vectors are usually very sparse.
DR: using SVD, as in LSA.
We are using the SVD method to build a representation of our matrix which can be used to identify the most important characteristics of words.
By throwing away the less important characteristics, we get a smaller representation of the words, which saves us (potentially a great deal of) time when evaluating the cosine similarities between word pairs.
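A sketch of this count-based pipeline with scikit-learn: a TF-IDF matrix followed by truncated SVD (as in LSA). The docs list is a placeholder toy corpus:

```python
# Sketch: TF-IDF document-term matrix + truncated SVD (LSA) with scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the movie was great", "the film was fantastic", "stock prices fell"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # sparse documents x terms matrix

svd = TruncatedSVD(n_components=2)     # keep only the top-k latent dimensions
X_reduced = svd.fit_transform(X)       # dense documents x k matrix
print(X_reduced.shape)
```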
Point-wise Mutual Information
\(PMI(x,y) = \log_2\frac{p(x,y)}{p(x)p(y)}\) where the numerator $p(x,y)$ is the co-occurrence probability of $x$ and $y$, estimated by counting the documents in which both appear.
For details, refer to Page 16 @ L10.
This value is slightly positive, which means that the two events occur together (in documents) slightly more commonly than would occur purely by chance. There is some possibility that world and cup occurring together is somehow meaningful for documents in this collection.
PMI does a better job of capturing interesting semantics, but it is clearly biased towards rare words (the probabilities need smoothing), and it does not handle zero counts naturally (log 0 is undefined, so these cells are typically just set to 0).
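A sketch of PMI clipped to positive values (PPMI), computed from a toy co-occurrence count matrix; the counts and the smoothing constant eps are illustrative assumptions:

```python
# Sketch: (positive) PMI from a word-word co-occurrence count matrix.
import numpy as np

def ppmi(counts, eps=1e-12):
    total = counts.sum()
    p_xy = counts / total                       # joint probabilities
    p_x = p_xy.sum(axis=1, keepdims=True)       # marginal for rows
    p_y = p_xy.sum(axis=0, keepdims=True)       # marginal for columns
    pmi = np.log2((p_xy + eps) / (p_x * p_y + eps))
    return np.maximum(pmi, 0)                   # zero/negative cells set to 0

counts = np.array([[10., 2., 0.],
                   [2., 8., 1.],
                   [0., 1., 5.]])
print(ppmi(counts))
```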
Neural methods to learn word vectors
Word2Vec, Word embedding
We’re going to have a representation of words (based on their contexts) in a vector space, such that other words “nearby” in the space are similar
This is broadly the same as what we expect in distributional similarity (“you shall know a word by the company it keeps.”)
The row corresponding to the word in the relevant (target/context) matrix is known as the “embedding”.
Key idea:
- Word embeddings should be similar to the embeddings of neighbouring words
- And dissimilar to other words that don’t occur nearby
Framed as learning a classifier:
- Skip-gram: given center words, predict the surrounding context (words)
- CBOW: predict the centre word, given the local context (surrounding words). Here local context means the words within L positions, L = 1, 2, ... If L = 1, the local context of B in "A B C" is A and C.
Skip-gram model Training
Use a logistic-regression-style classifier to model the probability $P(w_{\text{center}+l}\mid w_{\text{center}})$, where $l\in \{-2,-1,1,2\}$ if $L=2$.
The probabilities here are more complicated than just counting events in a collection: they are based on taking the dot product of the relevant target and context vectors (or an average of vectors, in the case of CBOW) and then normalising over the vocabulary with a softmax. To improve efficiency, we can instead use a negative-sampling objective.
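A sketch of training skip-gram with negative sampling via gensim (an external library, not prescribed by the notes); the two-sentence corpus and the hyperparameter values are toy assumptions:

```python
# Sketch: skip-gram with negative sampling using gensim.
from gensim.models import Word2Vec

sentences = [["the", "movie", "was", "great"],
             ["the", "film", "was", "fantastic"]]

model = Word2Vec(sentences,
                 vector_size=50,   # embedding dimensionality
                 window=2,         # local context size L
                 sg=1,             # 1 = skip-gram, 0 = CBOW
                 negative=5,       # negative samples per positive pair
                 min_count=1)      # keep every word in this tiny corpus

print(model.wv["movie"][:5])       # first few dimensions of the embedding
```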
Benefits
Unsupervised: raw, unlabelled corpus.
Efficient:
- Negative sampling (avoid softmax over full vocabulary)
- Scales to very large corpora
Useful representations. But how do we evaluate word vectors?
Evaluation: test the word embeddings against a lexical database or against human word-similarity judgements.
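Continuing the gensim sketch above, one common intrinsic evaluation correlates embedding similarities with human judgements; the word pairs and scores below are made-up toy numbers (real evaluations use datasets such as WordSim-353), and `model` is the Word2Vec model trained in the earlier sketch:

```python
# Sketch: correlating model similarities with (toy) human judgements.
from scipy.stats import spearmanr

pairs = [("movie", "film", 8.5), ("great", "fantastic", 8.0), ("movie", "was", 1.0)]
model_scores = [model.wv.similarity(w1, w2) for w1, w2, _ in pairs]
human_scores = [score for _, _, score in pairs]
print(spearmanr(model_scores, human_scores))
```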
Contextual Representation
Issues with Word Vector (Embedding) Representations
Each word type has only one representation; in other words, the representation is always the same, no matter what context the word appears in. This doesn't capture the multiple senses of a word well.
Contextual Representation Definition
Representations of words based on their context; in practice we rely on pre-trained contextual representations.
The contextual representation of a word is a representation of the word based on a particular usage. It captures the different senses or nuances of the word depending on the context.
Contextualised representations are different from word embeddings (e.g. Word2Vec), which give a single representation for every word type, regardless of sense.
Contextual representations that are pre-trained on large data can be seen as a model that has obtained fairly comprehensive knowledge about the language.
Recall RNN Language Model
An RNN LM successfully captures the context to the left of a word, but it ignores the context to the right.
Solution: Bi-directional RNN.
Two parallel RNNs running in opposite directions: one left-to-right, one right-to-left. Their hidden states at the same position are combined.
This ensures that a centre word has a representation of both its left and right context. For each position, $y_i = \text{softmax}([s_i,u_i])$, where $s_i$ and $u_i$ are the hidden states of the two parallel RNNs; the contextual embedding is the concatenation $[s_i, u_i]$.
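A minimal PyTorch sketch of a bidirectional LSTM over placeholder embeddings; at each position the output concatenates the forward and backward hidden states, which plays the role of $[s_i, u_i]$ above:

```python
# Sketch: a bidirectional LSTM producing contextual representations.
import torch
import torch.nn as nn

emb_dim, hidden_dim, seq_len, batch = 32, 64, 10, 1
embeddings = torch.randn(batch, seq_len, emb_dim)   # placeholder word embeddings

bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
outputs, _ = bilstm(embeddings)

print(outputs.shape)  # (batch, seq_len, 2 * hidden_dim): [s_i, u_i] per token
```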
ELMo: Embeddings from Language Models
Trains a bidirectional, multi-layer LSTM language model on a 1B-word corpus
Applications
Lower layer representation = captures syntax
- good for POS tagging, NER
Higher layer representation = captures semantics
- good for QA, textual entailment, sentiment analysis
BERT: Bidirectional Encoder Representations from Transformers
Uses self-attention networks (aka Transformers) to capture dependencies between words
Hence there is no sequential processing.
Masked language model objective to capture deep bidirectional representations
It loses the ability to generate language.
Masked Language Model
‘Mask’ out k% of tokens at random
Objective: predict the masked words
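A sketch of querying a pre-trained masked LM with the HuggingFace transformers pipeline (downloads bert-base-uncased on first run); the example sentence is an assumption:

```python
# Sketch: masked-token prediction with a pre-trained BERT.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The man went to the [MASK] to buy some milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
```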
Next Sentence Prediction
Learn relationships between sentences
Predicts whether sentence B follows sentence A
Discourse: Semantics for Sentences
Discourse: understanding how sentences relate to each other in a document
Discourse segmentation: document segmentation
Segmentation based on a particular topic or function, e.g. introduction, related work, conclusion, etc.
In Discourse Segmentation, we try to divide up a text into discrete, cohesive units based on sentences.
By interpreting the task as a boundary-finding problem, we can use rule-based or unsupervised methods to find sentences with little lexical overlap (suggesting a discourse boundary).
We can also use supervised methods, by training a classifier around paragraph boundaries.
Unsupervised: low lexical cohesion between sentences
When a sentence has low lexical cohesion with the sentences around it, we treat that point as a segment boundary. Approach (a toy sketch follows the list below):
For each sentence gap $i$:
- Create two BOW vectors consisting of words from the k sentences on either side of the gap
- Use cosine to get a similarity score (sim) for two vectors
- For each gap $i$, calculate a depth score for $i$.
- if that depth score is greater than some threshold $t$: insert a boundary at gap $i$
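A toy sketch of the approach above; the depth score here uses the simplified form (sim[i-1] - sim[i]) + (sim[i+1] - sim[i]), and the sentences, k, and threshold are illustrative assumptions:

```python
# Sketch: TextTiling-style boundary detection over sentence gaps.
from collections import Counter
import math

def cosine(bow1, bow2):
    dot = sum(bow1[w] * bow2[w] for w in bow1 if w in bow2)
    norm = math.sqrt(sum(v * v for v in bow1.values())) * \
           math.sqrt(sum(v * v for v in bow2.values()))
    return dot / norm if norm else 0.0

def boundaries(sentences, k=1, threshold=0.9):
    # Similarity of the BOW vectors on either side of each gap.
    sims = []
    for gap in range(k, len(sentences) - k):
        left = Counter(w for s in sentences[gap - k:gap] for w in s)
        right = Counter(w for s in sentences[gap:gap + k] for w in s)
        sims.append(cosine(left, right))
    found = []
    for i in range(1, len(sims) - 1):
        depth = (sims[i - 1] - sims[i]) + (sims[i + 1] - sims[i])
        if depth > threshold:
            found.append(i + k)   # boundary before sentence index i + k
    return found

sents = [["cats", "purr"], ["cats", "sleep"], ["cats", "eat"],
         ["stocks", "fell"], ["stocks", "rose"], ["stocks", "crash"]]
print(boundaries(sents))          # -> [3], the topic shift
```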
Supervised: Using Labelled data :)
Apply a binary classifier to identify boundaries
Or use sequential classifiers
Potentially include classification of section types (introduction, conclusion, etc.)
Integrate a wider range of features, including:
- distributional semantics
- discourse markers (therefore, and, etc)
Discourse parsing: relations between clauses in sentence
Tasks: Identify discourse units, and the relations that hold between them
Basic element: elementary discourse units (EDUs) ‣ Typically clauses of a sentence ‣ EDUs do not cross sentence boundary
Rhetorical Structure Theory (RST), is a framework to do hierarchical analysis of discourse structure in documents
RST relations between discourse units: ‣ conjuction, justify, concession, elaboration, etc
Within a discourse relation, the primary argument is the nucleus, while the supporting argument is the satellite. Some relations are equal (e.g. conjunction), so both arguments are nuclei.
The relations between discourse units build up an RST tree.
Parsing in Machine Learning
Parsing can be done by identifying discourse markers (although, but, for example, so, ...) with a rule-based parser. However, markers can be ambiguous, and relations can hold without any marker => use machine learning.
Dataset
RST Discourse Treebank – 300+ documents annotated with RST trees
Basic Idea
- Segment document into EDUs
- Combine adjacent DUs into composite DUs iteratively to create the full RST tree
Parsing techniques
- Transition-based parsing: bottom-up, greedy, using the shift-reduce algorithm from Lecture L16 (a toy sketch follows this list)
- CYK/chart parsing: bottom-up, global but sub-optimal (some constraints exist), from Lecture L14
- Top-down parsing: framed as a sequence labelling problem, using a BERT model
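A toy sketch in the spirit of transition-based (shift-reduce) parsing over EDUs; score_reduce and the 'elaboration' label are hypothetical placeholders for a trained classifier and its predicted relation:

```python
# Sketch: greedy shift-reduce over EDUs to build a (toy) discourse tree.
def score_reduce(left, right):
    # Placeholder for a trained classifier over features (markers, n-grams, ...);
    # this constant makes the parser reduce whenever two units are on the stack.
    return 0.5

def parse(edus, threshold=0.4):
    stack, buffer = [], list(edus)
    while buffer or len(stack) > 1:
        if len(stack) >= 2 and (not buffer or score_reduce(stack[-2], stack[-1]) > threshold):
            right, left = stack.pop(), stack.pop()
            stack.append(("elaboration", left, right))    # merged composite DU
        else:
            stack.append(buffer.pop(0))                   # shift the next EDU
    return stack[0]

edus = ["[Ted was late]", "[because his car broke down]", "[so he rode his bike]"]
print(parse(edus))
```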
Features
- Bag of words
- Discourse markers
- Starting/ending n-grams
- Location in the text
- Syntax features
- Lexical and distributional similarities
Discourse Parsing Applications
- Summarisation
- Sentiment analysis
- Argumentation
- Authorship attribution
- Essay scoring
Anaphora resolution
Anaphor: a linguistic expression that refers back to an earlier element in the text
Anaphors (most commonly pronouns like he, it, they, she, ...) have an antecedent in the discourse, often but not always a noun phrase
e.g. Yesterday, Ted was late for work. It all started when his car wouldn’t start.
Antecedent Preferences:
- The antecedent of a pronoun should be recent, e.g. He waited for another 20 minutes, but the tram didn't come. So he walked home and got his bike out of the garage. He started riding it to work.
- The antecedent should be salient, as determined by grammatical position: subject > object > argument of preposition, e.g. Ted usually rode to work with Bill. He was never late.
Centering Theory: help resolving anaphora
A unified account of relationship between discourse structure and entity reference
Every utterance in the discourse is characterised by a set of entities, known as centers
Explains preference of certain entities for ambiguous pronouns
The most obvious (but inherently unreliable) heuristic is the recency heuristic: given multiple possible referents (that are consistent in meaning with the anaphor), the most likely intended one is the one most recently used in the text.
A better heuristic is that the most likely referent (consistent in meaning with the anaphor) is the focus of the discourse (the “center”). (Centering Theory)
We can also build a supervised machine learning model, usually based on the semantic properties of the anaphor/nearby words and on the sentence/discourse structure.
Motivation or Application
Essential for deep semantic analysis (semantic understanding), which is very useful for QA, e.g. reading comprehension
Ted’s car broke down. So he went over to Bill’s house to borrow his car. Bill said that was fine.
Questions like: Whose car is borrowed?