Text Classification

Chanming
Apr 11, 2020

Text classification is one of the most common and widely known NLP tasks. This note summarizes several text classification tasks, the algorithms suited to them, and how they are evaluated.

Fundamentals of classification

Input:

  • A document d, often represented as a vector of features
  • A fixed output set of classes C = {c1, c2, …, ck}; the classes are categorical, not continuous (regression) or ordinal (ranking)

Output: A predicted class c ∈ C

Challenges

The main challenge is document representation: how do we identify features of the document that help us distinguish between the various classes?

The principal source of features is the presence of tokens (words) in the document, known as a bag-of-words model. However, many words tell us nothing about the classes we want to predict, so feature selection is often important.
On the other hand, single words are often inadequate for modelling the meaningful information in a document, while multi-word features (e.g. bigrams, trigrams) suffer from a sparse data problem. A minimal sketch of such a representation follows below.
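
As a minimal sketch (assuming scikit-learn is available; the toy corpus is invented purely for illustration), unigram and bigram bag-of-words features with stop-words removed can be extracted like this:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy corpus, purely for illustration.
docs = [
    "the team won the football match",
    "the election results were announced today",
]

# Unigram + bigram bag-of-words, with English stop-words removed.
vec = CountVectorizer(stop_words="english", ngram_range=(1, 2))
X = vec.fit_transform(docs)  # sparse document-term count matrix

terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)  # features in column order
print(terms)
print(X.toarray())
```

Even on this tiny corpus, every bigram occurs in only one of the two documents, hinting at the sparsity problem noted above.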

Text classification tasks

Topic classification

  • Unigram bag of words (BOW), with stop-words removed
  • Longer n-grams (bigrams, trigrams) for phrases

Sentiment analysis

  • N-grams
  • Polarity lexicon

Authorship attribution

  • Frequency of function words
  • Character n-grams
  • Discourse structure

Native-language identification

  • Word n-grams
  • Syntactic patterns (POS, parse trees)
  • Phonological features

Automatic fact-checking

  • N-grams
  • Non-text metadata

Algorithms for classification

Choosing a classification algorithm depends on:

  • Bias vs. Variance
    • Bias: the assumptions we make in our model
    • Variance: sensitivity to the training set
  • Underlying assumptions, e.g., independence
  • Complexity
  • Speed

Euclidean distance

Generally performs poorly. Euclidean distance tends to classify documents based on their length, which is usually not a distinguishing characteristic for classification problems.
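
A tiny numerical sketch of this (assuming NumPy; the count vectors are invented): a document and a doubled copy of itself are far apart under Euclidean distance, while cosine similarity correctly treats them as identical in content.

```python
import numpy as np

# Invented term-count vectors: d2 is d1 repeated twice,
# i.e. the same document at double the length.
d1 = np.array([2.0, 1.0, 0.0, 3.0])
d2 = 2 * d1

euclidean = np.linalg.norm(d1 - d2)                            # grows with length
cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))   # stays at 1.0

print(euclidean, cosine)
```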

Information Gain

It also performs poorly. Information gain is a poor choice because it tends to prefer rare features; with a BOW representation, these correspond to features that appear in only a handful of documents.
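
As a hedged sketch (assuming scikit-learn; corpus and labels invented), features can be ranked by their mutual information with the class label, which plays the role of information gain for BOW features:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

# Invented toy corpus and labels, purely for illustration.
docs = ["great match today", "terrible match today",
        "great election news", "terrible election news"]
labels = ["sport", "sport", "politics", "politics"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Mutual information between each BOW feature and the class label.
scores = mutual_info_classif(X, labels, discrete_features=True)

terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)  # features in column order
for term, score in sorted(zip(terms, scores), key=lambda pair: -pair[1]):
    print(f"{term}: {score:.3f}")
```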

Naïve Bayes

Pros: Fast to “train” and classify; robust and low-variance; good for low-data situations; the optimal classifier if the independence assumption is correct; extremely simple to implement.

Cons: The independence assumption rarely holds; low accuracy compared to similar methods in most situations; smoothing is required for unseen class/feature combinations.
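
A minimal sketch (assuming scikit-learn; the training data is invented) of a multinomial Naïve Bayes text classifier, where alpha is the add-one (Laplace) smoothing that handles unseen class/feature combinations:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy training data.
train_docs = ["the match was great", "awful refereeing in the match",
              "the election results were announced", "parliament passed the bill"]
train_labels = ["sport", "sport", "politics", "politics"]

# alpha=1.0 is add-one (Laplace) smoothing for unseen class/feature pairs.
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(train_docs, train_labels)

print(model.predict(["the bill on football funding"]))
```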

Logistic Regression

Pros: Unlike Naïve Bayes, it is not confounded by diverse, correlated features; it relaxes the conditional independence requirement of Naïve Bayes; because it has an implicit feature-weighting step, it can handle large numbers of mostly useless features, which is especially useful for BOW representations.

Cons: High bias; slow to train; some feature-scaling issues; often needs a lot of data to work well; choosing the regularization is a nuisance but important, since over-fitting is a big problem.
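
A small sketch of both points (assuming scikit-learn; data invented): C is the inverse regularization strength that has to be tuned, and the learned coefficients are the implicit feature weights over the BOW vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Invented toy training data.
train_docs = ["great game last night", "boring game, poor defence",
              "new tax policy announced", "the senate debated the policy"]
train_labels = ["sport", "sport", "politics", "politics"]

vec = TfidfVectorizer()
X = vec.fit_transform(train_docs)

# Smaller C = stronger regularization, the main guard against over-fitting.
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, train_labels)

# The learned weights act as an implicit feature-weighting over the vocabulary.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
for term, weight in zip(terms, clf.coef_[0]):
    print(f"{term}: {weight:+.3f}")
```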

Support Vector Machines

Pros: A fast and accurate linear classifier; can handle non-linearity with the kernel trick; works well with huge feature sets; linear kernels are often quite effective at modelling combinations of features that are useful (together) for characterizing the classes.

Cons: Multi-class classification is awkward (it needs substantial re-framing for problems with multiple classes, while most text classification is multi-class); feature scaling can be tricky; deals poorly with class imbalance; uninterpretable.
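
A minimal sketch (assuming scikit-learn; data invented): LinearSVC re-frames a three-class problem as one-vs-rest binary SVMs internally, which is the kind of re-framing the cons above refer to.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy training data with three classes.
train_docs = ["the striker scored twice", "a thrilling final set",
              "the budget was passed", "a new cabinet was formed",
              "the new phone was released", "the chip shortage continues"]
train_labels = ["sport", "sport", "politics", "politics", "tech", "tech"]

# LinearSVC trains one binary one-vs-rest SVM per class under the hood.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_docs, train_labels)

print(model.predict(["the budget for the new stadium"]))
```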

K-Nearest Neighbour

Pros: Simple and effective; no training required; inherently multi-class; optimal with infinite data.

Cons: k has to be selected; issues with imbalanced classes; often slow (those k neighbours have to be found); features must be selected carefully; suffers from high-dimensionality problems, so it is not well suited to bag-of-words features.
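
A small sketch (assuming scikit-learn; data invented): n_neighbors is the k that has to be chosen, typically by cross-validation, and a cosine metric side-steps the document-length issue noted under Euclidean distance.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented toy training data.
train_docs = ["who won the derby", "the coach praised the team",
              "the minister resigned today", "voters went to the polls"]
train_labels = ["sport", "sport", "politics", "politics"]

# n_neighbors is the k to be chosen; cosine distance ignores document length.
model = make_pipeline(TfidfVectorizer(),
                      KNeighborsClassifier(n_neighbors=3, metric="cosine"))
model.fit(train_docs, train_labels)

print(model.predict(["the team met the minister"]))
```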

Decision tree

Pros: In theory, very interpretable; fast to build and test; feature representation/scaling is irrelevant; good for small feature sets; handles non-linearly-separable problems.

Cons: In practice, often not that interpretable; highly redundant sub-trees; not competitive for large feature sets.
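
A tiny sketch of the "in theory interpretable" point (assuming scikit-learn; the two invented features are counts of the words "goal" and "vote"): the learned rules can be printed as text.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented feature matrix: [count of "goal", count of "vote"] per document.
X = [[3, 0], [2, 0], [0, 4], [0, 1]]
y = ["sport", "sport", "politics", "politics"]

tree = DecisionTreeClassifier().fit(X, y)

# Readable for a toy tree like this; real BOW trees quickly become unreadable.
print(export_text(tree, feature_names=["goal", "vote"]))
```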

Random Forests

Pros: Usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelized.

Cons: The same negatives as decision trees; too slow with large feature sets.
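
A minimal sketch (assuming scikit-learn; the toy features are invented): n_estimators controls the number of randomized trees and n_jobs=-1 trains them in parallel across all cores.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented feature matrix: [count of "goal", count of "vote"] per document.
X = [[3, 0], [2, 1], [0, 4], [1, 3]]
y = ["sport", "sport", "politics", "politics"]

# 100 randomized trees, trained in parallel; the ensemble vote is the prediction.
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X, y)

print(forest.predict([[2, 2]]))
```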

Neural Networks

Pros: Extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.

Cons: Not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to over-fitting.
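
As a hedged sketch of the "difficult to choose good parameters" point (assuming scikit-learn; the features are invented), even a small multi-layer perceptron exposes layer sizes, an L2 penalty, iteration limits and more, all of which need tuning:

```python
from sklearn.neural_network import MLPClassifier

# Invented feature matrix: [count of "goal", count of "vote"] per document.
X = [[3, 0], [2, 1], [0, 4], [1, 3]]
y = ["sport", "sport", "politics", "politics"]

# hidden_layer_sizes, alpha (L2 penalty) and max_iter all need tuning;
# the L2 penalty is one guard against the over-fitting noted above.
mlp = MLPClassifier(hidden_layer_sizes=(16,), alpha=1e-3, max_iter=2000, random_state=0)
mlp.fit(X, y)

print(mlp.predict([[2, 2]]))
```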

Hyperparameter Tuning

refer to #27 @ L4

Evaluation

Common metrics are accuracy, precision, recall, and F1-score. Accuracy is the proportion of documents classified correctly; precision is TP / (TP + FP); recall is TP / (TP + FN); F1 is the harmonic mean of precision and recall.
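
A minimal sketch (assuming scikit-learn; the gold labels and predictions are invented) of computing these metrics:

```python
from sklearn.metrics import accuracy_score, classification_report

# Invented gold labels and predictions, purely for illustration.
y_true = ["sport", "sport", "politics", "politics", "sport"]
y_pred = ["sport", "politics", "politics", "politics", "sport"]

print(accuracy_score(y_true, y_pred))
# Per-class precision, recall and F1, plus macro and weighted averages.
print(classification_report(y_true, y_pred))
```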