Deep Learning for NLP tasks
Feedforward and recurrent models for NLP tasks.
Feedforward Neural Network
Middle layer
Aka multilayer perceptrons. Each arrow carries a weight, reflecting its importance, and a sigmoid function provides the non-linearity. Imagine perceptrons in layer 1 named $a_1, a_2, \ldots, a_n$ pointing to a perceptron in layer 2 named $b_1$, with each arrow carrying a weight $w_1, w_2, \ldots, w_n$. The output will be:
\[h=\tanh\Big(\sum_j w_j a_j + b\Big)\] where $b$ is an added offset (bias) and the $w_j$ scale the inputs. The non-linearity $\tanh$ is the hyperbolic tangent; it could also be another function such as the logistic sigmoid or the rectified linear unit.
Vector representation: $\overrightarrow{h} = \tanh(W\overrightarrow{a}+\overrightarrow{b})$, with the non-linear function applied element-wise.
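As an illustration, a minimal numpy sketch of this vector form, with made-up layer sizes and random stand-ins for $W$, $\overrightarrow{a}$ and $\overrightarrow{b}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3                      # hypothetical layer sizes

a = rng.normal(size=n_in)                  # input vector (layer-1 activations)
W = rng.normal(size=(n_hidden, n_in))      # one row of weights per hidden unit
b = rng.normal(size=n_hidden)              # offset (bias) per hidden unit

h = np.tanh(W @ a + b)                     # non-linearity applied element-wise
print(h.shape)                             # (3,)
```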
Output layer
- binary classification -> sigmoid activation function (aka logistic function)
- multiclass classification -> softmax, which ensures probabilities are > 0 and sum to 1
Trained with backpropagation and gradient descent, maximising the log-likelihood $\log L$ (i.e. minimising the negative log-likelihood).
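A small worked sketch of these output activations and the training objective (the logits and gold label are hypothetical; the loss is the negative log-likelihood of the gold class):

```python
import numpy as np

def sigmoid(z):
    # logistic function, used for binary classification
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # shift by max for numerical stability; probabilities are > 0 and sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])        # hypothetical scores for 3 classes
probs = softmax(logits)
gold = 0                                   # index of the correct class
nll = -np.log(probs[gold])                 # loss minimised by gradient descent
print(probs, nll)
```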
Example: Topic classification
Input: bag of words + bag of bigrams
Preprocess text: lemmatise words and remove stop words
Weight words using TF-IDF or 0/1 indicators (present or absent)
Output: distribution over class probabilities, i.e. $[P(\text{topic}_1), P(\text{topic}_2), \ldots, P(\text{topic}_n)]$; each probability is greater than 0 and they sum to 1.
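A hedged PyTorch sketch of this pipeline, with a bag-of-words (e.g. TF-IDF) vector in and a topic distribution out; vocabulary size, hidden size and topic count are placeholders:

```python
import torch
import torch.nn as nn

class TopicClassifier(nn.Module):
    def __init__(self, vocab_size=5000, hidden=128, n_topics=10):
        super().__init__()
        self.hidden = nn.Linear(vocab_size, hidden)    # bag-of-words -> hidden layer
        self.out = nn.Linear(hidden, n_topics)         # hidden -> topic scores

    def forward(self, bow):
        h = torch.tanh(self.hidden(bow))
        return torch.log_softmax(self.out(h), dim=-1)  # log-probabilities over topics

model = TopicClassifier()
bow = torch.rand(1, 5000)                              # e.g. TF-IDF weights for one document
loss = nn.NLLLoss()(model(bow), torch.tensor([3]))     # negative log-likelihood of gold topic 3
loss.backward()                                        # backpropagation step
```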
Example: Authorship Attribution
Stylistic properties of text are more important than content words in this task
- POS tags and function words (e.g. on, of, the, and)
Input: bag of function words (top-300 most frequent words in a large corpus), bag of POS tags, bag of POS bigrams and trigrams
Other features: distribution of distances between consecutive function words
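A rough sketch of how such stylistic features might be extracted; the short function-word list and the tokenised input are stand-ins (POS-tag n-gram counts would follow the same pattern):

```python
from collections import Counter

# stand-in for the top-300 function-word list
FUNCTION_WORDS = {"on", "of", "the", "and", "a", "in", "to"}

def stylistic_features(tokens):
    # bag of function words, normalised by document length
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    bag = {w: counts[w] / len(tokens) for w in FUNCTION_WORDS}

    # distances (in tokens) between consecutive function words
    positions = [i for i, t in enumerate(tokens) if t in FUNCTION_WORDS]
    gaps = Counter(j - i for i, j in zip(positions, positions[1:]))
    return bag, gaps

tokens = "the cat sat on the mat and the dog slept".split()
print(stylistic_features(tokens))
```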
Feedforward NN Language Model
Use NN as a classifier to model $P(w_i\mid w_{i-1},w_{i-2})$
- input features: previous two words
- output class: next word
How do we represent words, given their sparsity and the large vocabulary? -> embeddings
Word embedding
Maps discrete word symbols to continuous vectors in a relatively low-dimensional space. Word embeddings allow the model to capture similarity between words and overcome data-sparsity problems.
A neural language model projects words into a continuous space and represents each word as a low dimensional vector known as word embeddings. These word embeddings capture semantic and syntactic relationships between words, allowing the model to generalise better to unseen sequences of words.
For example, having seen the sentence "the cat is walking in the bedroom" in the training corpus, the model should understand that "a dog was running in a room" is just as likely, as (the, a), (dog, cat) and (walking, running) have similar semantic/grammatical roles.
A count-based n-gram language model would struggle in this case, as (the, a) or (dog, cat) are distinct word types from the model's perspective.
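Putting the feedforward LM and word embeddings together, a minimal PyTorch sketch of a trigram NNLM: the previous two words are looked up in an embedding table, concatenated, passed through a hidden layer, and a softmax over the vocabulary gives $P(w_i \mid w_{i-1}, w_{i-2})$ (all sizes are illustrative):

```python
import torch
import torch.nn as nn

class TrigramNNLM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=100, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)      # word id -> low-dimensional vector
        self.hidden = nn.Linear(2 * emb_dim, hidden)      # concatenated context -> hidden
        self.out = nn.Linear(hidden, vocab_size)          # hidden -> score for every word

    def forward(self, prev_words):                        # prev_words: (batch, 2) word ids
        e = self.emb(prev_words).flatten(start_dim=1)     # (batch, 2 * emb_dim)
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)     # log P(w_i | w_{i-2}, w_{i-1})

model = TrigramNNLM()
context = torch.tensor([[41, 2053]])                      # hypothetical ids of the two previous words
log_probs = model(context)                                # distribution over the next word
```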
Comparison to N-gram LMs
N-gram LMs
- cheap to train (just compute counts)
- problems with sparsity and scaling to larger contexts
- don’t adequately capture properties of words (grammatical and semantic similarity), e.g., film vs movie
NNLMs more robust
- force words through low-dimensional embeddings
- automatically capture word properties, leading to more robust estimates
- flexible: minor change to adapt to other tasks (tagging)
POS Tagging using Feedforward NN and CNN
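One common setup (an assumption, not detailed above) is to tag each word from a fixed window of word embeddings around it with a feedforward classifier; a sketch with made-up sizes:

```python
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    """Feedforward tagger: embed a window of words around the target, predict its POS tag."""
    def __init__(self, vocab_size=10000, emb_dim=50, window=5, hidden=128, n_tags=45):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden)
        self.out = nn.Linear(hidden, n_tags)

    def forward(self, window_ids):                        # (batch, window) word ids
        e = self.emb(window_ids).flatten(start_dim=1)     # concatenate the window's embeddings
        h = torch.tanh(self.hidden(e))
        return torch.log_softmax(self.out(h), dim=-1)     # log-probabilities over POS tags

tagger = WindowTagger()
ids = torch.randint(0, 10000, (1, 5))                     # a window of 5 word ids around the target
tag_log_probs = tagger(ids)
```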
Recurrent Neural Network
Concepts
- RNNs can represent arbitrarily sized inputs, whereas feedforward NNs only handle fixed-size inputs and need padding.
- Core idea: process the input sequence one element at a time by applying a recurrence formula
- Uses a state vector to represent contexts that have been previously processed
The recurrence has the form $s_i = f(s_{i-1}, x_i)$, where $s_i$ is the new state, $s_{i-1}$ is the previous state and $x_i$ is the new input.
Notice: parameters ($W$) are shared (same) across all time steps
Actually, if we unroll the RNN with the same parameters at each step, this is just a very deep NN.
Input: word $x_i$ is mapped to an embedding $W_x x_i$ in $s_i = \tanh(W_s s_{i-1} + W_x x_i + b)$
Output: next word, $y_i = \text{softmax}(W_y s_i)$
capability to model infinite context
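A small numpy sketch of this recurrence; the weight matrices are random placeholders, $x_i$ is treated as a one-hot word so $W_x x_i$ is just a column lookup, and the loop rolls the state forward one word at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, state_dim = 1000, 64                # hypothetical sizes

W_x = rng.normal(scale=0.1, size=(state_dim, vocab))      # maps a (one-hot) word to its embedding
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))  # recurrent weights, shared across time steps
W_y = rng.normal(scale=0.1, size=(vocab, state_dim))      # state -> next-word scores
b = np.zeros(state_dim)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

s = np.zeros(state_dim)                    # initial state s_0
for word_id in [5, 42, 7]:                 # made-up ids of the input words
    s = np.tanh(W_s @ s + W_x[:, word_id] + b)   # s_i = tanh(W_s s_{i-1} + W_x x_i + b)
    y = softmax(W_y @ s)                   # y_i = softmax(W_y s_i): distribution over the next word
```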
Vanishing gradient
Gradients from later steps diminish quickly during backpropagation, so earlier inputs do not get much of an update.
When we do backpropagation to compute the gradients, the gradients tend to get smaller and smaller as we move backward through the network.
The net effect is that neurons in the earlier layers learn very slowly, as their gradients are very small. If the unrolled RNN is deep enough we might start seeing gradients “vanish”.
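A toy numpy illustration of the effect, using a made-up scalar RNN: each backward step multiplies the gradient by a factor $w(1 - s_t^2)$ that is typically below 1, so the gradient with respect to early inputs shrinks exponentially.

```python
import numpy as np

# scalar RNN s_t = tanh(w * s_{t-1}); the gradient of s_T w.r.t. s_0 is the
# product of the per-step terms w * (1 - s_t^2) (chain rule), each usually < 1
w, s, grad = 0.5, 0.1, 1.0
for t in range(1, 21):
    s = np.tanh(w * s)
    grad *= w * (1 - s ** 2)               # multiply in one more backward step
    if t % 5 == 0:
        print(f"step {t:2d}: d s_t / d s_0 = {grad:.2e}")
```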
Long Short-term Memory LSTM
To overcome the gradient vanishing problem.
Core idea: have “memory cells” that preserve gradients across time
Access to the memory cells is controlled by “gates”
- For each input, a gate decides the following
- how much the new input should be written to the memory cell
- and how much content of the current memory cell should be forgotten
A gate $g$ is a vector whose elements have values between 0 and 1. $g$ is multiplied component-wise with a vector $v$ to determine how much information to keep from $v$. The sigmoid function keeps the values of $g$ close to either 0 or 1.
Hence, with $g = (0.9, 0.1, 0.0)^T$ and $v = (2.5, 5.3, 1.2)^T$, $g \odot v$ produces $(2.25, 0.53, 0)^T$.
Forget gate: $f_t=\sigma(W_f[h_{t-1}, x_t] + b_f)$, where $h_{t-1}$ is the previous state and $x_t$ is the current input.
Input gate: $i_t=\sigma(W_i[h_{t-1}, x_t] + b_i)$. The input gate controls how much new information to put into the memory cell; $\tilde{C}_t = \tanh(W_C[h_{t-1}, x_t] + b_C)$ is the new distilled information to be added.
Update memory cell: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$, where $C_{t-1}$ is the previous memory.
Output gate: $o_t = \sigma(W_o[h_{t-1}, x_t]+b_o)$, $h_t=o_t \odot \tanh(C_t)$. The output gate controls how much of the memory cell's content to distil into the next state $h_t$.
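A numpy sketch of a single LSTM step, following the gate equations above; sizes and weights are random placeholders and $[h_{t-1}, x_t]$ is implemented as concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)
x_dim, h_dim = 8, 16                        # hypothetical input/state sizes
cat = x_dim + h_dim                         # size of [h_{t-1}, x_t]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one weight matrix and bias per gate, plus one pair for the candidate cell
W_f, b_f = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)
W_i, b_i = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)
W_C, b_C = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)
W_o, b_o = rng.normal(size=(h_dim, cat)), np.zeros(h_dim)

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)                 # forget gate
    i_t = sigmoid(W_i @ z + b_i)                 # input gate
    C_tilde = np.tanh(W_C @ z + b_C)             # candidate (distilled) new information
    C_t = f_t * C_prev + i_t * C_tilde           # update memory cell
    o_t = sigmoid(W_o @ z + b_o)                 # output gate
    h_t = o_t * np.tanh(C_t)                     # next state
    return h_t, C_t

h, C = np.zeros(h_dim), np.zeros(h_dim)
for x in rng.normal(size=(3, x_dim)):            # three made-up input vectors
    h, C = lstm_step(h, C, x)
```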
Application of RNN
RNNs are particularly suited to tasks where the order of words matters, e.g. sentiment classification, and they also work well for sequence labelling problems, e.g. POS tagging.
Pros and Cons
Pros
- Has the ability to capture long range contexts
- Excellent generalisation
- Just like feedforward networks: flexible, so it can be used for all sorts of tasks
- Common component in a number of NLP tasks
Cons
- Slower than feedforward networks due to sequential processing
- In practice still doesn’t capture long range dependency very well (evident when generating long text)