Syntax
How words are arranged.
About Language: What Is It?
What we have explored so far: methods for processing sequences of symbols, including language models (e.g. N-grams), Hidden Markov Models, and Recurrent Neural Networks. These are not really linguistic models.
A language is a set of strings, where a string is a sequence of symbols drawn from a fixed alphabet. From here on, the topic is to determine the class of such strings and to discover their computational properties. By computational properties we mean how to solve the membership problem: whether a string is in a language or not. We use Regular Languages or Context-Free Grammars to check this.
What is the membership problem?
Let's define a language: binary strings that start with 0 and end with 1, drawn from the alphabet {0, 1}. Positive examples, i.e. members of this language, are {01, 001, 00001}, while {1, 0, 00, 11} are not in the language.
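As a minimal sketch (the function name is illustrative), the membership test for this toy language can be written directly in code:

```python
def is_member(s: str) -> bool:
    """Membership test for the toy language: binary strings
    that start with '0' and end with '1'."""
    return (
        len(s) >= 2
        and set(s) <= {"0", "1"}   # only symbols from the alphabet {0, 1}
        and s[0] == "0"
        and s[-1] == "1"
    )

assert is_member("01") and is_member("00001")
assert not is_member("1") and not is_member("00")
```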
A more sophisticated language: English sentences that start with a 'wh-' word and end with a '?'. {What ?, where my pants ? …} are in this language.
Beyond the membership problem
We ask questions like:
- Is the string in our language? (yes/no) => membership
- How acceptable is a string for a given language? (language models) => scoring, graded membership
- Translate one string into another (e.g. stemming) => transduction
Regular Languages
Definition of Regular Languages
Any regular expression defines a regular language: it describes which strings belong to the language.
Components: alphabet $\Sigma$ + empty string $\epsilon$
Operations: concatenation, $AB$; alternation (disjunction), $A \mid B$; Kleene star $R^*$ (repeat 0 or more times); parentheses ( ) to define the scope of operations.
Regular languages are closed under concatenation, union, intersection and negation, which means that after applying these operations the result is still a regular language.
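A hedged sketch using Python's re module: the pattern 0(0|1)*1 combines concatenation, alternation and the Kleene star to describe the same toy language as above.

```python
import re

# Concatenation, alternation (0|1) and the Kleene star (*) together
# describe "binary strings that start with 0 and end with 1".
pattern = re.compile(r"^0(0|1)*1$")

print(bool(pattern.match("00101")))  # True  -> member
print(bool(pattern.match("110")))    # False -> not a member
```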
Finite State Acceptors and Transducers
The same as Finite State Automata (FSA), which require:
Alphabet $\Sigma = \{a, b, \epsilon\}$
States: an initial state, intermediate states, and a goal (accepting) state.
A transition function: given the current state and consuming an input symbol, it yields the next state, e.g. $\{(q_0, a) \rightarrow q_1, \dots\}$
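A minimal finite state acceptor sketch (the state names and transition table are assumptions) for the "starts with 0, ends with 1" language from earlier:

```python
# Transition function as a dictionary: (state, symbol) -> next state.
transitions = {
    ("q0", "0"): "q1",   # must start with 0
    ("q1", "0"): "q1",
    ("q1", "1"): "q2",
    ("q2", "0"): "q1",
    ("q2", "1"): "q2",
}
start, accepting = "q0", {"q2"}   # q2 is the goal (accepting) state

def accepts(s: str) -> bool:
    state = start
    for symbol in s:
        if (state, symbol) not in transitions:
            return False          # no valid transition: reject
        state = transitions[(state, symbol)]
    return state in accepting

assert accepts("001") and not accepts("10")
```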
Finite State Transducers (Translate)
Add an output to each transition, so that the automaton produces a translation as it accepts the input.
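A toy transducer sketch (the transition/output table is an assumption): each transition now emits an output symbol alongside the state change, so accepting a string also produces its translation.

```python
# Each transition maps (state, input symbol) -> (next state, output symbol).
fst = {
    ("q0", "a"): ("q0", "A"),   # rewrite 'a' as 'A'
    ("q0", "b"): ("q0", "b"),   # copy 'b' unchanged
}

def transduce(s: str) -> str:
    state, output = "q0", []
    for symbol in s:
        state, out = fst[(state, symbol)]
        output.append(out)
    return "".join(output)

print(transduce("abba"))  # "AbbA"
```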
Modelling Word Morphology
Weighted Finite State Automata (WFSA): assumptions and basic idea:
- some words are more possible than others
- graded measure of acceptability
There is no single goal state; instead each transition carries a score. The total score of a path is the sum of the initial-state score, the scores of all transitions along the path, and the final-state score. A shortest-path algorithm is then used to find the path $\pi$ with minimum cost, as written out below.
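Written out (the notation below is an assumption that mirrors the description above), the score of a path $\pi = q_0, q_1, \dots, q_N$ consuming input $x_1 \dots x_N$ is
\[\text{score}(\pi) = \lambda(q_0) + \sum_{i=1}^{N} \delta(q_{i-1}, x_i, q_i) + \rho(q_N), \qquad \pi^* = \argmin_\pi \text{score}(\pi)\]
where $\lambda$, $\delta$ and $\rho$ are the initial-state, transition and final-state scoring functions.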
N-gram language models as Weighted FSA
N-gram models give another way to express the graded acceptability of a string, and they can be encoded as weighted FSAs.
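For example (this concrete weighting is an assumption, consistent with the minimum-cost view above), a bigram model can be encoded as a WFSA with one state per preceding word and transition costs equal to negative log probabilities:
\[\delta(w_{i-1}, w_i) = -\log P(w_i \mid w_{i-1}), \qquad \text{cost}(w_1 \dots w_N) = \sum_{i=1}^{N} -\log P(w_i \mid w_{i-1})\]
so a lower-cost path corresponds to a higher-probability string.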
Real Language is not Regular
The fundamental problem: no memory, hence no ability to handle languages such as $a^nb^n$.
Examples:
- Because an FSA cannot store the count $n$, it cannot handle the balanced-parentheses problem (where it needs to remember how many parentheses have been opened and produce that many closing parentheses later).
- Centre embedding of relative clauses: we must produce the same number of verbs as the nouns that precede them, e.g. The cat the dog chased loves Mozart.
Context Free Grammars
CFG for $a^nb^n$:
- $S \rightarrow aSb$
- $S \rightarrow ab$
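For example, $a^3b^3$ can be derived in three steps using these two rules:
\[S \Rightarrow aSb \Rightarrow aaSbb \Rightarrow aaabbb\]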
If English could be represented with a CFG, we could build a "parser" to automatically judge whether a sentence is grammatical. But real natural language is not context-free; e.g. cross-serial dependencies have the form $a^mb^nc^md^n$.
The CYK algorithm parses a sentence according to the rules of a CFG (in Chomsky Normal Form); it verifies whether a sentence can be generated by the grammar.
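A minimal CYK recognizer sketch (assuming the grammar is already in Chomsky Normal Form; the toy grammar and sentence below are illustrative):

```python
from itertools import product

# CNF grammar: binary rules A -> B C and lexical rules A -> word.
binary_rules = {("NP", "VP"): {"S"}, ("DT", "NN"): {"NP"}, ("VB", "NP"): {"VP"}}
lexical_rules = {"the": {"DT"}, "dog": {"NN"}, "cat": {"NN"}, "chased": {"VB"}}

def cyk_recognise(words):
    n = len(words)
    # chart[i][j] holds the non-terminals that can span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexical_rules.get(w, set()))
    for span in range(2, n + 1):                 # widen spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):            # try every split point
                for B, C in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= binary_rules.get((B, C), set())
    return "S" in chart[0][n]                    # does S cover the whole input?

print(cyk_recognise("the dog chased the cat".split()))  # True
```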
Parsing
Parsing in general is the process of identifying the structure(s) of a sentence, according to a grammar of the language
Probabilistic Context Free Grammars
Weakness of CFG parsing
Ambiguity: there may be many different parse trees for the same sentence. Which should we use?
Probabilistic CFG and parsing PCFG
The only difference is that we store a probability with each production rule, e.g. NP -> DT NN [p = 0.45].
Hence we have a set of probabilities $Pr(LHS \rightarrow RHS) = Pr(RHS \mid LHS)$. All probabilities must lie in $(0,1]$, and for a given LHS the probabilities must sum to 1.
How likely is a tree to be generated/parsed
Given a tree, we compute its probability by decomposing it into the probability of each production, recursively from the root down to the leaves.
Different parses therefore get different probabilities, and we keep the one with the maximum probability. To find it, we add probabilities into the CYK algorithm.
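As a small worked sketch (the grammar, probabilities and tree below are made up for illustration), a tree's probability is just the product of the probabilities of the productions it uses:

```python
import math

# Hypothetical PCFG rule probabilities, Pr(RHS | LHS).
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.45,
    ("VP", ("VB", "NP")): 0.6,
    ("DT", ("the",)): 0.7,
    ("NN", ("dog",)): 0.1,
    ("NN", ("cat",)): 0.1,
    ("VB", ("chased",)): 0.2,
}

# A parse tree as nested tuples: (label, child, child, ...); leaves are words.
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VB", "chased"),
               ("NP", ("DT", "the"), ("NN", "cat"))))

def tree_prob(node):
    if isinstance(node, str):          # a leaf word contributes no rule probability
        return 1.0
    label, *children = node
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, rhs)]        # probability of this production
    return p * math.prod(tree_prob(c) for c in children)

print(tree_prob(tree))  # product over all productions used in the tree
```

Probabilistic CYK then stores, for each chart cell and non-terminal, the maximum such probability (plus a back-pointer) instead of a plain set.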
Limitations of PCFG
- Poor independence assumptions
Rewrite decisions are made independently, but interdependence is often needed to capture global structure.
- NP -> Det N: the probability of this rule is independent of the rest of the tree.
- Solution: parent conditioning, adding the parent symbol to each non-terminal symbol.
- Lack of lexical conditioning, e.g. prepositional phrase (PP) attachment ambiguity.
- The grammar is not sensitive to the words in the tree.
- Solution: head lexicalisation, annotating each non-terminal with its head word (see the schematic example below).
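As a schematic example (the annotated symbols are illustrative), parent conditioning and head lexicalisation rewrite a plain rule as:
\[NP \rightarrow DT\ NN \;\Longrightarrow\; NP^{S} \rightarrow DT\ NN \quad\text{(parent conditioning)}\]
\[NP \rightarrow DT\ NN \;\Longrightarrow\; NP(dog) \rightarrow DT(the)\ NN(dog) \quad\text{(head lexicalisation)}\]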
Connection between HMMs, CFGs, CYK parsing, etc.
Refer to workshop 09.3.
Dependency Grammars
In a CFG we only care about the grammar, so we build a constituency tree; the sentence remains grammatical if we interchange any two NN nodes in the tree.
Another way of modelling grammar is to describe dependencies, which are relations between pairs of words. A word can be a head (the central word) or a dependent (a supporting word).
Benefits
Dependency grammars deal better with languages that are morphologically rich and have relatively free word order
- A CFG needs a separate rule for each possible position a phrase can occur in
Head-dependent relations similar to semantic relations between words
- More useful for applications: coreference resolution, QA System, information extraction, etc
Dependencies
Refer to page 7 of Lecture 16.
Describes NSUBJ, DOBJ, IOBJ, CCOMP, XCOMP, NMOD, AMOD, DET, CASE, APPOS, CONJ, CC, etc.
Dependency vs CFG (Constituency)
Dependency tree
- each node is a word token
- one node is chosen as the root
- directed edges link heads and their dependents
Constituency tree
- forms a hierarchical tree
- word tokens are the leaves
- internal nodes are ‘constituent phrases’ e.g., NP, VP etc
Similarities between dependency parsing and probabilistic CYK parsing:
- Both use part-of-speech tags
- Both methods attempt to determine the structure of a sentence;
- Both methods attempt to disambiguate amongst the (perhaps many) possible structures licensed by the grammar, by using a probabilistic grammar to determine the most probable structure.
- Both methods process the tokens in the sentence one by one, left to right.
Dependency Parsing
Parsing: the task of finding the best structure $\mathbf{t}$ for a given input sentence $\mathbf{w}$, i.e.
\[\argmax_{\mathbf{t}} \text{score}(\mathbf{t} \mid \mathbf{w})\]
Two main parsing approaches:
- transition based: incremental scoring (locally optimal, greedy); decisions are made sequentially from left to right.
- graph based: encodes the sentence as nodes and edges and finds the globally optimal tree, but may produce crossing (non-projective) arcs.
Transition based parsing
Components: a buffer, a stack, actions, and the relations generated by those actions (when the chosen action produces an arc).
At each step, the parser either:
- generates a left arc: the current word is the head, and the left arc points to some supporting word
- generates a right arc: the current word is the dependent, and a previously seen word is the head
- does nothing with arcs: it stores the current word on the stack and shifts to the next buffered word
Parsing as a Classification Problem
Train a supervised model to predict the actions of the oracle, which tells us the right action at each step.
Oracle: a ground-truth agent that generates the correct sequence of SHIFT, RightArc and LeftArc actions.
Input: stack (top-2 elements $s_1$ and $s_2$), buffer (first element $b_1$)
Output: 3 classes: shift, left-arc, right-arc
Features: word (w), POS tag (t)
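A minimal arc-standard style simulator (a sketch; the example sentence and "oracle" action sequence are assumptions) that replays oracle actions and collects head-dependent relations:

```python
# Replay a given action sequence over a stack and a buffer,
# collecting (head, dependent) arcs as we go.

def parse(words, actions):
    stack, buffer = ["ROOT"], list(words)
    arcs = []                                    # (head, dependent) pairs
    for action in actions:
        if action == "SHIFT":                    # push the next buffered word
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":               # top of stack heads the word below it
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT-ARC":              # word below the top is the head
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
    return arcs

words = ["the", "dog", "barks"]
oracle_actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]
print(parse(words, oracle_actions))
# [('dog', 'the'), ('barks', 'dog'), ('ROOT', 'barks')]
```

In training, the classifier sees features such as the words and POS tags of $s_1$, $s_2$ and $b_1$ and learns to predict the oracle's action for each configuration.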
Graph based parsing
Produces the globally optimal result, unlike greedy, locally optimal transition-based parsing, and captures long-range dependencies better => but it may produce non-projective results (crossing arcs).