Deep Learning for NLP tasks
Feedforward and recurrent models for NLP tasks.
Feedforward Neural Network
Middle layer
Aka multilayer perceptrons. Each arrow carries a weight, reflecting its importance, and a sigmoid function provides the non-linearity. Imagine perceptrons in layer 1 named $a_1, a_2, \ldots, a_n$ pointing to a perceptron in layer 2 named $b_1$, with each arrow carrying a weight $w_1, w_2, \ldots, w_n$. The output will be:
\[h=\tanh\Big(\sum_j w_j a_j + b\Big)\] where $b$ is an added offset (bias) and the $w_j$ scale the inputs. The non-linearity $\tanh$ is the hyperbolic tangent; it could also be another function such as the logistic sigmoid or the rectified linear unit.
Vector representation: $\overrightarrow{h} = \tanh(W\overrightarrow{a}+\overrightarrow{b})$, with the non-linear function applied element-wise.
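As an illustration, a minimal numpy sketch of this vector form, with made-up layer sizes and random stand-ins for $W$, $\overrightarrow{a}$ and $\overrightarrow{b}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3                      # hypothetical layer sizes

a = rng.normal(size=n_in)                  # input vector (layer-1 activations)
W = rng.normal(size=(n_hidden, n_in))      # one row of weights per hidden unit
b = rng.normal(size=n_hidden)              # offset (bias) per hidden unit

h = np.tanh(W @ a + b)                     # non-linearity applied element-wise
print(h.shape)                             # (3,)
```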
Output layer
- binary classification -> sigmoid activation function (aka logistic function)
- multiclass classification -> softmax, which ensures probabilities are > 0 and sum to 1
Trained with backpropagation and gradient descent, maximising the log-likelihood $\log L$ (i.e. minimising the negative log-likelihood).
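A small worked sketch of these output activations and the training objective (the logits and gold label are hypothetical; the loss is the negative log-likelihood of the gold class):

```python
import numpy as np

def sigmoid(z):
    # logistic function, used for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # shift by max for numerical stability; probabilities are > 0 and sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])        # hypothetical scores for 3 classes
probs = softmax(logits)
gold = 0                                   # index of the correct class
nll = -np.log(probs[gold])                 # loss minimised by gradient descent
print(probs, nll)
```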
Example: Topic classification
Input: bag of words + bag of bigrams
Preprocess text: lemmatise words and remove stop words
Weight words using TF-IDF or 0/1 indicators (present or absent)
Output: distribution over class probabilities, i.e. $[P(\text{topic}_1), P(\text{topic}_2), \ldots, P(\text{topic}_n)]$; each probability is greater than 0 and they sum to 1.
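A hedged PyTorch sketch of this pipeline, with a bag-of-words (e.g. TF-IDF) vector in and a topic distribution out; vocabulary size, hidden size and topic count are placeholders:

```python
import torch
import torch.nn as nn

class TopicClassifier(nn.Module):
    def __init__(self, vocab_size=5000, hidden=128, n_topics=10):
        super().__init__()
        self.hidden = nn.Linear(vocab_size, hidden)    # bag-of-words -> hidden layer
        self.out = nn.Linear(hidden, n_topics)         # hidden -> topic scores

    def forward(self, bow):
        h = torch.tanh(self.hidden(bow))
        return torch.log_softmax(self.out(h), dim=-1)  # log-probabilities over topics

model = TopicClassifier()
bow = torch.rand(1, 5000)                              # e.g. TF-IDF weights for one document
loss = nn.NLLLoss()(model(bow), torch.tensor([3]))     # negative log-likelihood of gold topic 3
loss.backward()                                        # backpropagation step
```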
Example: Authorship Attribution
Stylistic properties of text are more important than content words in this task
- POS tags and function words (e.g. on, of, the, and)
Input: bag of function words (top-300 most frequent words in a large corpus), bag of POS tags, bag of POS bigrams and trigrams
Other features: distribution of distances between consecutive function words
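A rough sketch of how such stylistic features might be extracted; the short function-word list and the tokenised input are stand-ins (POS-tag n-gram counts would follow the same pattern):

```python
from collections import Counter

# stand-in for the top-300 function-word list
FUNCTION_WORDS = {"on", "of", "the", "and", "a", "in", "to"}

def stylistic_features(tokens):
    # bag of function words, normalised by document length
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    bag = {w: counts[w] / len(tokens) for w in FUNCTION_WORDS}

    # distances (in tokens) between consecutive function words
    positions = [i for i, t in enumerate(tokens) if t in FUNCTION_WORDS]
    gaps = Counter(j - i for i, j in zip(positions, positions[1:]))
    return bag, gaps

tokens = "the cat sat on the mat and the dog slept".split()
print(stylistic_features(tokens))
```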
Feedforward NN Language Model
Use NN as a classifier to model $P(w_i\mid w_{i-1},w_{i-2})$
- input features: previous two words
- output class: next word
How do we represent words, given their sparsity and the large vocabulary? -> embeddings
Word embedding
Maps discrete word symbols to continuous vectors in a relatively low-dimensional space. Word embeddings allow the model to capture similarity between words and overcome data-sparsity problems.
A neural language model projects words into a continuous space and represents each word as a low dimensional vector known as word embeddings. These word embeddings capture semantic and syntactic relationships between words, allowing the model to generalise better to unseen sequences of words.
For example, having seen the sentence "the cat is walking in the bedroom" in the training corpus, the model should understand that "a dog was running in a room" is just as likely, as (the, a), (dog, cat) and (walking, running) have similar semantic/grammatical roles.
A count-based n-gram language model would struggle in this case, as (the, a) or (dog, cat) are distinct word types from the model's perspective.
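Putting the feedforward LM and word embeddings together, a minimal PyTorch sketch of a trigram NNLM: the previous two words are looked up in an embedding table, concatenated, passed through a hidden layer, and a softmax over the vocabulary gives $P(w_i \mid w_{i-1}, w_{i-2})$ (all sizes are illustrative):

```python
import torch
import torch.nn as nn

class TrigramNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # word id -> low-dimensional vector
        self.hidden = nn.Linear(2 * emb_dim, hidden)      # concatenated context -> hidden
        self.out = nn.Linear(hidden, vocab_size)          # hidden -> score for every word

    def forward(self, prev_words):                        # prev_words: (batch, 2) word ids
        e = self.emb(prev_words).flatten(start_dim=1)     # (batch, 2 * emb_dim)
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)     # log P(w_i | w_{i-2}, w_{i-1})

model = TrigramNNLM()
context = torch.tensor([[41, 2053]])                      # hypothetical ids of the two previous words
log_probs = model(context)                                # distribution over the next word
```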
Comparison to N-gram LMs
N-gram LMs
- cheap to train (just compute counts)
- problems with sparsity and scaling to larger contexts
- don’t adequately capture properties of words (grammatical and semantic similarity), e.g., film vs movie
NNLMs more robust
- force words through low-dimensional embeddings
- automatically capture word properties, leading to more robust estimates
- flexible: minor change to adapt to other tasks (tagging)
POS Tagging using Feedforward NN and CNN
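One common setup (an assumption, not detailed above) is to tag each word from a fixed window of word embeddings around it with a feedforward classifier; a sketch with made-up sizes:

```python
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    """Feedforward tagger: embed a window of words around the target, predict its POS tag."""
    def __init__(self, vocab_size=10000, emb_dim=50, window=5, hidden=128, n_tags=45):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, window_ids):                        # (batch, window) word ids
        e = self.emb(window_ids).flatten(start_dim=1)     # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)     # log-probabilities over POS tags

tagger = WindowTagger()
ids = torch.randint(0, 10000, (1, 5))                     # a window of 5 word ids around the target
tag_log_probs = tagger(ids)
```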
Recurrent Neural Network
Concepts
- RNNs can represent arbitrarily sized inputs, whereas feedforward NNs only handle fixed-size inputs and need padding.
- Core idea: process the input sequence one element at a time by applying a recurrence formula
- Uses a state vector to represent contexts that have been previously processed
The recurrence has the form $s_i = f(s_{i-1}, x_i)$, where $s_i$ is the new state, $s_{i-1}$ is the previous state and $x_i$ is the new input.
Notice: parameters ($W$) are shared (same) across all time steps
Actually, if we unroll the RNN with the same parameters at each step, this is just a very deep NN.
Input: word $x_i$ is mapped to an embedding $W_x x_i$ in $s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
Output: next word, $y_i = \text{softmax}(W_y s_i)$
capability to model infinite context
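A small numpy sketch of this recurrence; the weight matrices are random placeholders, $x_i$ is treated as a one-hot word so $W_x x_i$ is just a column lookup, and the loop rolls the state forward one word at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, state_dim = 1000, 64                # hypothetical sizes

W_x = rng.normal(scale=0.1, size=(state_dim, vocab))      # maps a (one-hot) word to its embedding
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))  # recurrent weights, shared across time steps
W_y = rng.normal(scale=0.1, size=(vocab, state_dim))      # state -> next-word scores
b = np.zeros(state_dim)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(state_dim)                    # initial state s_0
for word_id in [5, 42, 7]:                 # made-up ids of the input words
    s = np.tanh(W_s @ s + W_x[:, word_id] + b)   # s_i = tanh(W_s s_{i-1} + W_x x_i + b)
    y = softmax(W_y @ s)                   # y_i = softmax(W_y s_i): distribution over the next word
```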
Vanishing gradient
Gradients from later steps diminish quickly during backpropagation, so earlier inputs do not get much of an update.
When we do backpropagation to compute the gradients, the gradients tend to get smaller and smaller as we move backward through the network.
The net effect is that neurons in the earlier layers learn very slowly, as their gradients are very small. If the unrolled RNN is deep enough we might start seeing gradients “vanish”.
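A toy numpy illustration of the effect, using a made-up scalar RNN: each backward step multiplies the gradient by a factor $w(1 - s_t^2)$ that is typically below 1, so the gradient with respect to early inputs shrinks exponentially.

```python
import numpy as np

# scalar RNN s_t = tanh(w * s_{t-1}); the gradient of s_T w.r.t. s_0 is the
# product of the per-step terms w * (1 - s_t^2) (chain rule), each usually < 1
w, s, grad = 0.5, 0.1, 1.0
for t in range(1, 21):
    s = np.tanh(w * s)
    grad *= w * (1 - s ** 2)               # multiply in one more backward step
    if t % 5 == 0:
        print(f"step {t:2d}: d s_t / d s_0 = {grad:.2e}")
```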
Long Short-term Memory LSTM
To overcome the gradient vanishing problem.
Core idea: have “memory cells” that preserve gradients across time
Access to the memory cells is controlled by “gates”
- For each input, a gate decides the following
- how much the new input should be written to the memory cell
- and how much content of the current memory cell should be forgotten
A gate $g$ is a vector whose elements have values between 0 and 1. $g$ is multiplied component-wise with a vector $v$ to determine how much information to keep from $v$. The sigmoid function keeps the values of $g$ close to either 0 or 1.
Hence, with $g = (0.9, 0.1, 0.0)^T$ and $v = (2.5, 5.3, 1.2)^T$, $g \odot v$ produces $(2.25, 0.53, 0)^T$.
Forget gate: $f_t=\sigma(W_f[h_{t-1}, x_t] + b_f)$, where $h_{t-1}$ is the previous state and $x_t$ is the current input.
Input gate: $i_t=\sigma(W_i[h_{t-1}, x_t] + b_i)$. The input gate controls how much new information to put into the memory cell; $\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)$ is the new distilled information to be added.
Update memory cell: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, where $C_{t-1}$ is the previous memory.
Output gate: $o_t = \sigma(W_o[h_{t-1}, x_t]+b_o)$, $h_t=o_t \odot \tanh(C_t)$. The output gate controls how much of the memory cell's content to distil into the next state $h_t$.
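A numpy sketch of a single LSTM step, following the gate equations above; sizes and weights are random placeholders and $[h_{t-1}, x_t]$ is implemented as concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 8, 16                        # hypothetical input/state sizes
cat = x_dim + h_dim                         # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and bias per gate, plus one pair for the candidate cell
W_f, b_f = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)
W_i, b_i = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)
W_C, b_C = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)
W_o, b_o = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # input gate
    C_tilde = np.tanh(W_C @ z + b_C)             # candidate (distilled) new information
    C_t = f_t * C_prev + i_t * C_tilde           # update memory cell
    o_t = sigmoid(W_o @ z + b_o)                 # output gate
    h_t = o_t * np.tanh(C_t)                     # next state
    return h_t, C_t

h, C = np.zeros(h_dim), np.zeros(h_dim)
for x in rng.normal(size=(3, x_dim)):            # three made-up input vectors
    h, C = lstm_step(h, C, x)
```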
Application of RNN
RNNs are particularly suited to tasks where the order of words matters, e.g. sentiment classification, and they also work well for sequence labelling problems, e.g. POS tagging.
Pros and Cons
Pros
- Has the ability to capture long range contexts
- Excellent generalisation
- Just like feedforward networks: flexible, so it can be used for all sorts of tasks
- Common component in a number of NLP tasks
Cons
- Slower than feedforward networks due to sequential processing
- In practice still doesn’t capture long range dependency very well (evident when generating long text)