Preprocessing - Word Sequences and Documents
Preprocessing is the first step of almost all NLP tasks. What techniques are commonly used? Why is it important? Which preprocessing techniques should we use for a specific task? How will they affect the outcome?
Terms
- Corpus: a collection of documents.
- Document: one or more sentences.
- Word: a sequence of characters with a meaning.
- Word token: an occurrence of a word in a sentence.
- Word type: a distinct word token.
Text preprocessing
Language is compositional: we humans read a sentence by breaking it down into its individual components, and a machine should do the same. Preprocessing is the first step (a prerequisite) for breaking down those components.
Remove unwanted formatting
Remove unwanted markup such as HTML tags or XML formatting before any linguistic processing.
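A minimal sketch using Python's standard-library html.parser; the class and function names here are my own, not a standard API:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content of a page, discarding all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def strip_tags(markup: str) -> str:
    parser = TagStripper()
    parser.feed(markup)
    return "".join(parser.chunks)

print(strip_tags("<p>Hello <b>world</b>!</p>"))  # -> Hello world!
```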
Sentence segmentation
Break documents down into sentences. We can split on sentence punctuation like [.?!], or look for capital letters (which fails on names). To fix that problem, we can use a lexicon (dictionary) to tell which capitalised words are names (out-of-dictionary words). Best practice is to use a machine learning approach rather than a rule-based one.
To build a binary classifier using a machine learning method, we can look at every '.' character and decide whether it marks the end of a sentence, based on the following features (a feature-extraction sketch follows the list):
- words before and after the '.'
- word shape: upper case? lower case? number? length?
- POS tags: some parts of speech are more likely to start a sentence
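A hedged sketch of the feature extraction; the function and feature names are illustrative, and the resulting dicts could be fed to any off-the-shelf classifier (e.g. logistic regression):

```python
def period_features(tokens, i):
    """Features for deciding whether tokens[i] == '.' ends a sentence.
    `tokens` is a pre-split list of strings; the feature set is illustrative."""
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return {
        "prev": prev_tok.lower(),                 # word before the '.'
        "next": next_tok.lower(),                 # word after the '.'
        "prev_is_upper": prev_tok[:1].isupper(),  # word-shape cues
        "next_is_upper": next_tok[:1].isupper(),
        "prev_is_digit": prev_tok.isdigit(),
        "prev_len": len(prev_tok),                # abbreviations tend to be short
    }

tokens = ["Mr", ".", "Smith", "arrived", ".", "He", "sat", "down", "."]
print(period_features(tokens, 1))  # the '.' after the abbreviation "Mr"
```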
Word tokenization
Break sentences down into words (a regex sketch covering some of these cases follows the list). Challenges:
- Abbreviations (U.S.A)
- Hyphens (well-recognized)
- Numbers (1,000,000)
- Dates (2020-3-2)
- Clitics (He’s, Can’t)
- Internet language (http://www.domain.com/, #worldcup, :-/ )
- Multiword units (New Zealand)
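A rough regex tokeniser covering a few of these cases; the pattern is purely illustrative and far from complete:

```python
import re

# Ordered alternatives: URLs, dates, numbers with separators, hyphenated
# words, clitics, plain words, and any leftover punctuation.
TOKEN_RE = re.compile(r"""
    https?://\S+              # URLs
  | \d{4}-\d{1,2}-\d{1,2}     # dates like 2020-3-2
  | \d+(?:,\d{3})*(?:\.\d+)?  # numbers like 1,000,000
  | \w+(?:-\w+)+              # hyphenated words (well-recognized)
  | \w+'\w+                   # clitics (he's, can't)
  | \w+                       # plain words
  | \S                        # any other single non-space character
""", re.VERBOSE)

print(TOKEN_RE.findall("He's well-recognized: 1,000,000 hits on 2020-3-2."))
# ["He's", 'well-recognized', ':', '1,000,000', 'hits', 'on', '2020-3-2', '.']
```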
Why do we need tokenisation?
- Tokenisation is the act of transforming a (long) document into a set of meaningful substrings, so that we can compare it with other (long) documents.
- In general, a document is too long — and contains too much information — to manipulate directly. There are some counter-examples, like language identification, which we need to perform before we decide how to tokenise anyway.
MaxMatch Algorithm
Greedily match the longest word in the vocabulary, scanning left to right; when no vocabulary word matches, fall back to a single character.
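A minimal sketch, with a made-up vocabulary:

```python
def max_match(text, vocab):
    """Greedy left-to-right longest-match tokenisation.
    Falls back to single characters when nothing in `vocab` matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest span first
            if text[i:j] in vocab or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

vocab = {"we", "can", "can't", "canon", "not"}
print(max_match("wecannot", vocab))  # -> ['we', 'can', 'not']
```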
Byte-pair Encoding
A way to tokenize words into subwords, e.g. colourless -> colour + less, by iteratively merging frequent pairs of adjacent symbols. It deals better with unknown words and works for different languages. The algorithm keeps merging the most frequent pair, without deleting any existing merged result, so the most common words end up as full words, while the rarest words fall back to single characters in the worst case. A drawback is that it can generate a large vocabulary.
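A sketch of the learning loop in the style of Sennrich et al.'s original algorithm; the toy vocabulary and the number of merges are illustrative:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Words represented as space-separated characters with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

for _ in range(5):                    # the number of merges is a hyperparameter
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)  # merge the most frequent pair
    vocab = merge_pair(best, vocab)
    print(best)
```

Each printed pair becomes a new symbol in the subword vocabulary; applying the learned merges in order is what tokenises unseen words.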
Word normalisation
Transform words into canonical forms (to reduce the vocabulary and map different word types onto the same canonical form). Typical steps: lowercase the words, remove morphology, correct spelling, and expand abbreviations (U.S.A -> USA).
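A minimal sketch; the abbreviation table and the exact set of steps are assumptions for illustration:

```python
# Illustrative abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"u.s.a": "USA", "u.s.": "USA"}

def normalise(token: str) -> str:
    lowered = token.lower()
    if lowered in ABBREVIATIONS:
        return ABBREVIATIONS[lowered]
    return lowered.strip(".,;:!?")   # drop surrounding punctuation

print([normalise(t) for t in ["U.S.A", "Colour", "runs."]])
# -> ['USA', 'colour', 'runs']
```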
Lemmatisation
In English, nouns have different morphological forms according to number, verbs inflect according to their subject (agreement), and adjectives have comparatives (-er) and superlatives (-est). When we lemmatise, we remove all inflections to reach the uninflected form, the lemma, e.g. speaking -> speak, warmer -> warm.
In lemmatisation we remove inflectional morphology only, i.e. the affixes that create grammatical variants (+s, +ing, +ed, +er, +est, ...).
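A sketch using NLTK's WordNetLemmatizer (assumes NLTK and its wordnet data are installed); note that the POS argument matters, since without it every word is treated as a noun:

```python
# requires: nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("speaking", pos="v"))  # -> speak
print(lemmatizer.lemmatize("warmer", pos="a"))    # -> warm
print(lemmatizer.lemmatize("geese", pos="n"))     # -> goose
```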
Stemming
Stemming strips off all suffixes, leaving only a stem. This is achieved by also removing derivational morphology (+ly, +ise, +er, re+, ...), e.g. automate, automatic, automation -> automat. The result is often not an actual lexical item (it is out of vocabulary, OOV).
Porter Stemmer
- c: a consonant, e.g. 'b', 'c', 'd'
- v: a vowel, e.g. 'a', 'e', 'i', 'o', 'u'
- C: a sequence of consonants, e.g. s, ss, tr, bl
- V: a sequence of vowels, e.g. o, oo, ee, io
A word can take one of four forms: CVCV…C, CVCV…V, VCVC…C, VCVC…V, all of which can be written as $[C](VC)^m[V]$, where $m$ is called the measure of the word. E.g.
- $m=0$: TR [C], EE [V], TREE [CV], T [C], BY [CV]
- $m=1$: TROUBLE [CVCV], OATS [VC], TREES [CVC]
- $m=2$: TROUBLES [CVCVC], PRIVATE [CVCVCV], OATEN [VCVC], ORRERY [VCVCV]
Rules have the form (condition) $S_1 \to S_2$: if the word ends in suffix $S_1$ and the stem before $S_1$ satisfies the condition, replace $S_1$ with $S_2$; always apply the rule with the longest matching $S_1$. For example, (m>1) EMENT -> null maps REPLACEMENT to REPLAC.
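A sketch that computes the measure $m$, under the simplifying assumption that 'y' counts as a vowel only when it follows a consonant:

```python
import re

def porter_measure(word: str) -> int:
    """Compute Porter's measure m, where a word is [C](VC)^m[V]."""
    # Map each letter to 'v' or 'c', then collapse runs into C/V blocks.
    flags = []
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou" or (ch == "y" and i > 0 and flags[-1] == "c"):
            flags.append("v")
        else:
            flags.append("c")
    collapsed = re.sub(r"(.)\1+", r"\1", "".join(flags))
    return collapsed.count("vc")   # each VC block contributes 1 to m

for w in ["tree", "trouble", "oats", "troubles", "orrery"]:
    print(w, porter_measure(w))
# tree 0, trouble 1, oats 1, troubles 2, orrery 2
```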
Difference between Lemmatisation and Stemming
Similarity
Both transform a word token into a canonical form by removing or replacing prefixes and suffixes.
Difference
Lemmatisation works in conjunction with a lexicon (dictionary), so the result is always a valid word in the lexicon: it applies rewrite rules and checks whether the output is in the dictionary. Stemming simply applies the rules without checking, which can produce an OOV word such as thi instead of this.
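A quick side-by-side sketch using NLTK (exact outputs may vary by version):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["this", "studies", "automation"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))
# 'this' stems to the OOV string 'thi', but lemmatises to the valid word 'this'
```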
Difference between inflectional morphology and derivational morphology
- Inflectional morphology is the systematic process (in many but not all languages) by which tokens are altered to conform to certain grammatical constraints: for example, if the English noun teacher is plural, then it must be represented as teachers. The idea is that these changes don't really alter the meaning of the term. Consequently, both stemming and lemmatisation attempt to remove this kind of morphology.
- Derivational morphology is the (semi-) systematic process by which we transform terms of one class into a different class. For example, if we would like to make the English verb teach into a noun (someone who performs the action of teaching), then it must be represented as teacher. This kind of morphology tends to produce terms that differ (perhaps subtly) in meaning, and the two separate forms are usually both listed in the lexicon. Consequently, lemmatisation doesn’t usually remove derivational morphology in its normalization process, but stemming usually does.
Fix Spelling Errors
- Why?
    - spelling errors create new rare word types
    - they can disrupt linguistic analysis (though not always)
    - they are common in internet corpora
    - handling them is important in web queries (search engines)
- How? (a Levenshtein-distance sketch follows this list)
    - string distance (Levenshtein, etc.)
    - modelling of error types (phonetic, typing, etc.)
    - using an n-gram language model to choose likely corrections in context
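A standard dynamic-programming sketch of Levenshtein distance, with unit costs for insertion, deletion, and substitution:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("definately", "definitely"))  # -> 1 (substitute 'a' -> 'i')
```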
Other Word Normalizations
Spelling variations
e.g. normalise -> normalize, colour -> color
Expanding abbreviations
e.g. US, U.S. -> United States, IMHO -> in my humble opinion
Stop word removal
Delete unwanted high-frequency words, e.g. the, a, an. When we are investigating the meaning of a sentence, these words usually do not help much, so we want to get rid of them. Stop word removal should not be done when the word sequence is important!
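A sketch using NLTK's English stop word list (assumes the stopwords data has been downloaded):

```python
# requires: nltk.download('stopwords')
from nltk.corpus import stopwords

STOP = set(stopwords.words("english"))
tokens = ["the", "cat", "sat", "on", "a", "mat"]
print([t for t in tokens if t not in STOP])  # -> ['cat', 'sat', 'mat']
```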