Syntax
How words are arranged.
About Language: What Is It?
What we have explored so far: methods for processing sequences of symbols, including language models (e.g. N-grams), Hidden Markov Models, and Recurrent Neural Networks. These are not really linguistic models.
A language is a set of strings, where a string is a sequence of symbols drawn from a fixed alphabet. From here on, the topic is to determine the class of such strings and to discover their computational properties. By computational properties we mean how to solve the membership problem: whether a string is in a language or not. We use Regular Languages or Context-Free Grammars to check this.
What is the membership problem?
Let's define a language: binary strings that start with 0 and end with 1, drawn from the alphabet {0, 1}. Positive examples, i.e. members of this language, are {01, 001, 00001}, while {1, 0, 00, 11} are not in the language.
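As a minimal sketch (the function name is illustrative), the membership test for this toy language can be written directly in code:

```python
def is_member(s: str) -> bool:
    """Membership test for the toy language: binary strings
    that start with '0' and end with '1'."""
    return (
        len(s) >= 2
        and set(s) <= {"0", "1"}   # only symbols from the alphabet {0, 1}
        and s[0] == "0"
        and s[-1] == "1"
    )

assert is_member("01") and is_member("00001")
assert not is_member("1") and not is_member("00")
```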
A more sophisticated language: English sentences that start with a 'wh-' word and end with a '?'. {What ?, where my pants ? …} are in this language.
Beyond the membership problem
We ask questions like:
- Is the string in our language? (yes/no) => membership
- How acceptable is a string for a given language? (language models) => scoring, graded membership
- Translate one string into another (e.g. stemming) => transduction
Regular Languages
Definition of Regular Languages
Any regular expression defines a regular language: it describes which strings belong to the language.
Components: alphabet $\Sigma$ + empty string $\epsilon$
Operations: concatenation, $AB$; alternation (disjunction), $A \mid B$; Kleene star $R^*$ (repeat 0 or more times); parentheses ( ) to define the scope of operations.
Regular languages are closed under concatenation, union, intersection and negation, which means that after applying these operations the result is still a regular language.
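A hedged sketch using Python's re module: the pattern 0(0|1)*1 combines concatenation, alternation and the Kleene star to describe the same toy language as above.

```python
import re

# Concatenation, alternation (0|1) and the Kleene star (*) together
# describe "binary strings that start with 0 and end with 1".
pattern = re.compile(r"^0(0|1)*1$")

print(bool(pattern.match("00101")))  # True  -> member
print(bool(pattern.match("110")))    # False -> not a member
```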
Finite State Acceptors and Transducers
The same as Finite State Automata (FSA), which require:
Alphabet $\Sigma = \{a, b, \epsilon\}$
States: an initial state, intermediate states, and a goal (accepting) state.
A transition function: given the current state and consuming an input symbol, it yields the next state, e.g. $\{(q_0, a) \rightarrow q_1, \dots\}$
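A minimal finite state acceptor sketch (the state names and transition table are assumptions) for the "starts with 0, ends with 1" language from earlier:

```python
# Transition function as a dictionary: (state, symbol) -> next state.
transitions = {
    ("q0", "0"): "q1",   # must start with 0
    ("q1", "0"): "q1",
    ("q1", "1"): "q2",
    ("q2", "0"): "q1",
    ("q2", "1"): "q2",
}
start, accepting = "q0", {"q2"}   # q2 is the goal (accepting) state

def accepts(s: str) -> bool:
    state = start
    for symbol in s:
        if (state, symbol) not in transitions:
            return False          # no valid transition: reject
        state = transitions[(state, symbol)]
    return state in accepting

assert accepts("001") and not accepts("10")
```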
Finite State Transducers (Translate)
Add an output to each transition, so that the automaton produces a translation as it accepts the input.
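A toy transducer sketch (the transition/output table is an assumption): each transition now emits an output symbol alongside the state change, so accepting a string also produces its translation.

```python
# Each transition maps (state, input symbol) -> (next state, output symbol).
fst = {
    ("q0", "a"): ("q0", "A"),   # rewrite 'a' as 'A'
    ("q0", "b"): ("q0", "b"),   # copy 'b' unchanged
}

def transduce(s: str) -> str:
    state, output = "q0", []
    for symbol in s:
        state, out = fst[(state, symbol)]
        output.append(out)
    return "".join(output)

print(transduce("abba"))  # "AbbA"
```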
Modelling Word Morphology
Weighted Finite State Automata (WFSA): assumptions and basic idea:
- some words are more possible than others
- graded measure of acceptability
There is no single goal state; instead each transition carries a score. The total score of a path is the sum of the initial-state score, the scores of all transitions along the path, and the final-state score. A shortest-path algorithm is then used to find the path $\pi$ with minimum cost, as written out below.
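Written out (the notation below is an assumption that mirrors the description above), the score of a path $\pi = q_0, q_1, \dots, q_N$ consuming input $x_1 \dots x_N$ is
\[\text{score}(\pi) = \lambda(q_0) + \sum_{i=1}^{N} \delta(q_{i-1}, x_i, q_i) + \rho(q_N), \qquad \pi^* = \argmin_\pi \text{score}(\pi)\]
where $\lambda$, $\delta$ and $\rho$ are the initial-state, transition and final-state scoring functions.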
N-gram language models as Weighted FSA
N-gram models give another way to express the graded acceptability of a string, and they can be encoded as weighted FSAs.
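For example (this concrete weighting is an assumption, consistent with the minimum-cost view above), a bigram model can be encoded as a WFSA with one state per preceding word and transition costs equal to negative log probabilities:
\[\delta(w_{i-1}, w_i) = -\log P(w_i \mid w_{i-1}), \qquad \text{cost}(w_1 \dots w_N) = \sum_{i=1}^{N} -\log P(w_i \mid w_{i-1})\]
so a lower-cost path corresponds to a higher-probability string.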
Real Language is not Regular
The fundamental problem: no memory, hence no ability to handle languages such as $a^nb^n$.
Examples:
- Because an FSA cannot store the count $n$, it cannot handle the balanced-parentheses problem (where it needs to remember how many parentheses have been opened and produce that many closing parentheses later).
- Centre embedding of relative clauses: we must produce the same number of verbs as the nouns that precede them, e.g. The cat the dog chased loves Mozart.
Context Free Grammars
CFG for $a^nb^n$:
- $S \rightarrow aSb$
- $S \rightarrow ab$
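For example, $a^3b^3$ can be derived in three steps using these two rules:
\[S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aaabbb\]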
If English could be represented with a CFG, we could build a "parser" to automatically judge whether a sentence is grammatical. But real natural language is not context-free; e.g. cross-serial dependencies have the form $a^mb^nc^md^n$.
The CYK algorithm parses a sentence according to the rules of a CFG (in Chomsky Normal Form); it verifies whether a sentence can be generated by the grammar.
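A minimal CYK recognizer sketch (assuming the grammar is already in Chomsky Normal Form; the toy grammar and sentence below are illustrative):

```python
from itertools import product

# CNF grammar: binary rules A -> B C and lexical rules A -> word.
binary_rules = {("NP", "VP"): {"S"}, ("DT", "NN"): {"NP"}, ("VB", "NP"): {"VP"}}
lexical_rules = {"the": {"DT"}, "dog": {"NN"}, "cat": {"NN"}, "chased": {"VB"}}

def cyk_recognise(words):
    n = len(words)
    # chart[i][j] holds the non-terminals that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical_rules.get(w, set()))
    for span in range(2, n + 1):                 # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # try every split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((B, C), set())
    return "S" in chart[0][n]                    # does S cover the whole input?

print(cyk_recognise("the dog chased the cat".split()))  # True
```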
Parsing
Parsing in general is the process of identifying the structure(s) of a sentence, according to a grammar of the language
Probabilistic Context Free Grammars
Weakness of CFG parsing
Ambiguity: there may be many different parse trees for the same sentence. Which should we use?
Probabilistic CFG and parsing PCFG
The only difference is that we store a probability with each production rule, e.g. NP -> DT NN [p = 0.45].
Hence we have a set of probabilities $Pr(LHS \rightarrow RHS) = Pr(RHS \mid LHS)$. All probabilities must lie in $(0,1]$, and for a given LHS the probabilities must sum to 1.
How likely is a tree to be generated/parsed
Given a tree, we compute its probability by decomposing it into the probability of each production, recursively from the root down to the leaves.
Different parses therefore get different probabilities, and we keep the one with the maximum probability. To find it, we add probabilities into the CYK algorithm.
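As a small worked sketch (the grammar, probabilities and tree below are made up for illustration), a tree's probability is just the product of the probabilities of the productions it uses:

```python
import math

# Hypothetical PCFG rule probabilities, Pr(RHS | LHS).
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.45,
    ("VP", ("VB", "NP")): 0.6,
    ("DT", ("the",)): 0.7,
    ("NN", ("dog",)): 0.1,
    ("NN", ("cat",)): 0.1,
    ("VB", ("chased",)): 0.2,
}

# A parse tree as nested tuples: (label, child, child, ...); leaves are words.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VB", "chased"),
               ("NP", ("DT", "the"), ("NN", "cat"))))

def tree_prob(node):
    if isinstance(node, str):          # a leaf word contributes no rule probability
        return 1.0
    label, *children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]        # probability of this production
    return p * math.prod(tree_prob(c) for c in children)

print(tree_prob(tree))  # product over all productions used in the tree
```

Probabilistic CYK then stores, for each chart cell and non-terminal, the maximum such probability (plus a back-pointer) instead of a plain set.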
Limitations of PCFG
- Poor independence assumptions
Rewrite decisions are made independently, but interdependence is often needed to capture global structure.
- NP -> Det N: the probability of this rule is independent of the rest of the tree.
- Solution: parent conditioning, adding the parent symbol to each non-terminal symbol.
- Lack of lexical conditioning, e.g. prepositional phrase (PP) attachment ambiguity.
- The grammar is not sensitive to the words in the tree.
- Solution: head lexicalisation, annotating each non-terminal with its head word (see the schematic example below).
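As a schematic example (the annotated symbols are illustrative), parent conditioning and head lexicalisation rewrite a plain rule as:
\[NP \rightarrow DT\ NN \;\Longrightarrow\; NP^{S} \rightarrow DT\ NN \quad\text{(parent conditioning)}\]
\[NP \rightarrow DT\ NN \;\Longrightarrow\; NP(dog) \rightarrow DT(the)\ NN(dog) \quad\text{(head lexicalisation)}\]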
Connection between HMMs, CFGs, CYK parsing, etc.
Refer to workshop 09.3.
Dependency Grammars
In a CFG we only care about the grammar, so we build a constituency tree; the sentence remains grammatical if we interchange any two NN nodes in the tree.
Another way of modelling grammar is to describe dependencies, which are relations between pairs of words. A word can be a head (the central word) or a dependent (a supporting word).
Benefits
Dependency grammars deal better with languages that are morphologically rich and have relatively free word order
- A CFG needs a separate rule for each possible position a phrase can occur in
Head-dependent relations similar to semantic relations between words
- More useful for applications: coreference resolution, QA System, information extraction, etc
Dependencies
Refer to page 7 of Lecture 16.
Describes NSUBJ, DOBJ, IOBJ, CCOMP, XCOMP, NMOD, AMOD, DET, CASE, APPOS, CONJ, CC, etc.
Dependency vs CFG (Constituency)
Dependency tree
- each node is a word token
- one node is chosen as the root
- directed edges link heads and their dependents
Constituency tree
- forms a hierarchical tree
- word tokens are the leaves
- internal nodes are ‘constituent phrases’ e.g., NP, VP etc
Similarities between dependency parsing and probabilistic CYK parsing:
- Both use part-of-speech tags
- Both methods attempt to determine the structure of a sentence;
- Both methods attempt to disambiguate amongst the (perhaps many) possible structures licensed by the grammar, by using a probabilistic grammar to determine the most probable structure.
- Both methods process the tokens in the sentence one by one, left to right.
Dependency Parsing
Parsing: the task of finding the best structure $\mathbf{t}$ for a given input sentence $\mathbf{w}$, i.e.
\[\argmax_{\mathbf{t}} \text{score}(\mathbf{t} \mid \mathbf{w})\]
Two main parsing approaches:
- transition based: incremental scoring (locally optimal, greedy); decisions are made sequentially from left to right.
- graph based: encodes the sentence as nodes and edges and finds the globally optimal tree, but may produce crossing (non-projective) arcs.
Transition based parsing
Components: a buffer, a stack, actions, and the relations generated by those actions (when the chosen action produces an arc).
At each step, the parser either:
- generates a left arc: the current word is the head, and the left arc points to some supporting word
- generates a right arc: the current word is the dependent, and a previously seen word is the head
- does nothing with arcs: it stores the current word on the stack and shifts to the next buffered word
Parsing as a Classification Problem
Train a supervised model to predict the actions of the oracle, which tells us the right action at each step.
Oracle: a ground-truth agent that generates the correct sequence of SHIFT, RightArc and LeftArc actions.
Input: stack (top-2 elements $s_1$ and $s_2$), buffer (first element $b_1$)
Output: 3 classes: shift, left-arc, right-arc
Features: word (w), POS tag (t)
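A minimal arc-standard style simulator (a sketch; the example sentence and "oracle" action sequence are assumptions) that replays oracle actions and collects head-dependent relations:

```python
# Replay a given action sequence over a stack and a buffer,
# collecting (head, dependent) arcs as we go.

def parse(words, actions):
    stack, buffer = ["ROOT"], list(words)
    arcs = []                                    # (head, dependent) pairs
    for action in actions:
        if action == "SHIFT":                    # push the next buffered word
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":               # top of stack heads the word below it
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":              # word below the top is the head
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

words = ["the", "dog", "barks"]
oracle_actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]
print(parse(words, oracle_actions))
# [('dog', 'the'), ('barks', 'dog'), ('ROOT', 'barks')]
```

In training, the classifier sees features such as the words and POS tags of $s_1$, $s_2$ and $b_1$ and learns to predict the oracle's action for each configuration.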
Graph based parsing
Produces the globally optimal result, unlike greedy, locally optimal transition-based parsing, and captures long-range dependencies better => but it may produce non-projective results (crossing arcs).