Text Classification
Text classification is one of the most common and widely known NLP tasks. This note summarizes several text classification tasks, algorithms for solving them, and how classifiers are evaluated.
Fundamentals of classification
Input:
- A document d, often represented as a vector of features
- A fixed set of output classes C = {c1, c2, …, ck}
  - Classes are categorical, not continuous (as in regression) or ordinal (as in ranking)

Output: A predicted class c ∈ C
Challenges
The main challenge is document representation: how do we identify features of a document that help us distinguish between the various classes?
The principal source of features is the presence of tokens (words) in the document (known as a bag-of-words model). However, many words tell you nothing about the classes we want to predict, so feature selection is often important.
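One common approach, sketched below, is chi-squared feature selection over bag-of-words counts. The scikit-learn calls are real, but the toy corpus, labels, and the value of k are invented for illustration:

```python
# Sketch: bag-of-words features plus chi-squared feature selection
# (scikit-learn; corpus and labels are a made-up toy example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["the match was a thrilling win",
        "stocks fell sharply after the report",
        "the team scored twice in the final",
        "the market rallied on earnings news"]
labels = ["sport", "finance", "sport", "finance"]

# Bag of words: one count feature per token, English stop-words removed
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Keep only the k features most associated with the class labels
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, labels)
print(vectorizer.get_feature_names_out()[selector.get_support()])
```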
On the other hand, single words are often inadequate for modelling the meaningful information in a document, while multi-word features (e.g. bigrams, trigrams) suffer from a sparse data problem.
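The sparsity problem is visible even on a toy corpus: as n grows, almost every n-gram occurs in only a single document. A minimal sketch, assuming scikit-learn and an invented three-document corpus:

```python
# How many n-gram features are shared across documents? With longer
# n-grams, almost none: the sparse data problem in miniature.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the team won the match",
        "the team lost the match",
        "the market lost ground today"]

for n in (1, 2, 3):
    vec = CountVectorizer(ngram_range=(n, n))
    X = vec.fit_transform(docs) > 0   # document/feature incidence matrix
    df = X.sum(axis=0)                # document frequency of each n-gram
    print(f"{n}-grams: {X.shape[1]} features, "
          f"{(df > 1).sum()} occur in more than one document")
```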
Text classification tasks
Topic classification
- Unigram bag of words (BOW), with stop-words removed
- Longer n-grams (bigrams, trigrams) for phrases

Sentiment analysis
- N-grams
- Polarity lexicon

Authorship attribution
- Frequency of function words
- Character n-grams
- Discourse structure

Native-language identification
- Word n-grams
- Syntactic patterns (POS, parse trees)
- Phonological features

Automatic fact-checking
- N-grams
- Non-text metadata
Algorithms for classification
Choosing a classification algorithm depends on:
- Bias vs. variance
  - Bias: the assumptions built into the model
  - Variance: sensitivity to the training set
- Underlying assumptions, e.g., feature independence
- Complexity
- Speed
Euclidean distance
Generally performs poorly: Euclidean distance tends to classify documents by their length, which is usually not a distinguishing characteristic for classification problems.
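A tiny numeric example makes the length effect concrete: under Euclidean distance, a short sports document is "closer" to a short finance document than to a long sports document, while length-normalized cosine similarity groups the documents by topic. The two-feature count vectors are invented for illustration:

```python
import numpy as np

# Hand-made 2-d count vectors: [sports-word count, finance-word count]
short_sports  = np.array([2.0, 0.0])
long_sports   = np.array([20.0, 1.0])
short_finance = np.array([0.0, 2.0])

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean(short_sports, long_sports))    # ~18.0: far apart (lengths differ)
print(euclidean(short_sports, short_finance))  # ~2.8: close (lengths match!)
print(cosine(short_sports, long_sports))       # ~1.0: same topic
print(cosine(short_sports, short_finance))     #  0.0: different topic
```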
Information Gain
Also a poor choice: information gain tends to prefer rare features, which for a BOW representation means features that appear in only a handful of documents. (The information gain of a feature X is the reduction in class entropy from conditioning on it: IG(X) = H(C) − H(C|X).)
Naïve Bayes
Pros: fast to "train" and classify; robust and low-variance; good in low-data situations; the optimal classifier if the independence assumption is correct; extremely simple to implement.
Cons: the independence assumption rarely holds; lower accuracy than comparable methods in most situations; smoothing is required for unseen class/feature combinations.
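A minimal Naïve Bayes sketch, assuming scikit-learn; alpha=1.0 is Laplace (add-one) smoothing, which handles the unseen class/feature combinations mentioned above. The corpus and labels are toy data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great thrilling win", "market crash fears",
        "team wins final", "stocks and earnings"]
labels = ["sport", "finance", "sport", "finance"]

# alpha=1.0: add-one smoothing so unseen words don't zero out a class
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(docs, labels)
print(model.predict(["a thrilling final"]))  # -> ['sport']
```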
Logistic Regression
Pros: unlike Naïve Bayes, not confounded by diverse, correlated features, since it relaxes the conditional independence requirement; because it has an implicit feature-weighting step, it can handle large numbers of mostly useless features, as in BOW representations.
Cons: high bias; slow to train; some feature-scaling issues; often needs a lot of data to work well; choosing the regularization is a nuisance but important, since over-fitting is a big problem.
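A comparable sketch for logistic regression (again scikit-learn, toy data). C is the inverse regularization strength: making it smaller strengthens the penalty, which is the usual way the over-fitting problem above is controlled:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["great thrilling win", "market crash fears",
        "team wins final", "stocks and earnings"]
labels = ["sport", "finance", "sport", "finance"]

# C would normally be tuned, e.g. by cross-validation
model = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(C=1.0, max_iter=1000))
model.fit(docs, labels)
print(model.predict(["earnings report due"]))  # -> ['finance']
```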
Support Vector Machines
Pros: a fast and accurate linear classifier; can model non-linearity with the kernel trick; works well with huge feature sets; linear kernels are often quite effective at modelling combinations of features that are useful (together) for characterizing the classes.
Cons: multi-class classification is awkward (it needs substantial re-framing for multi-class problems, yet most text classification is multi-class); feature scaling can be tricky; deals poorly with class imbalance; uninterpretable.
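The sketch below uses scikit-learn's LinearSVC on an invented three-class corpus; multi-class is handled by fitting one one-vs-rest binary SVM per class, which is exactly the re-framing cost noted above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

docs = ["team wins final", "stocks and earnings", "new phone released",
        "thrilling match win", "market crash fears", "laptop chip upgrade"]
labels = ["sport", "finance", "tech", "sport", "finance", "tech"]

# One-vs-rest under the hood: one binary linear SVM per class
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(docs, labels)
print(model.predict(["chip powers new phone"]))  # -> ['tech']
```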
K-Nearest Neighbour
Pros: simple and effective; no training required; inherently multi-class; optimal with infinite data.
Cons: k must be selected; issues with unbalanced classes; often slow at classification time (the k neighbours must be found); features must be selected carefully; suffers in high dimensions, making it unsuitable for bag-of-words representations.
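For completeness, a k-NN sketch in the same style (scikit-learn, toy data). There is no real training step beyond storing the vectors, k (n_neighbors) has to be chosen, and each prediction searches the stored examples, which is where the slowness comes from:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["team wins final", "stocks and earnings",
        "thrilling match win", "market crash fears"]
labels = ["sport", "finance", "sport", "finance"]

# "fit" just stores the vectors; cosine distance avoids the length issue
model = make_pipeline(TfidfVectorizer(),
                      KNeighborsClassifier(n_neighbors=3, metric="cosine"))
model.fit(docs, labels)
print(model.predict(["match final"]))  # -> ['sport']
```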
Decision tree
Pros: in theory, very interpretable; fast to build and test; feature representation and scaling are irrelevant; good for small feature sets; handles non-linearly-separable problems.
Cons: in practice, often not that interpretable; highly redundant sub-trees; not competitive on large feature sets.
Random Forests
Pros: usually more accurate and more robust than decision trees; a great classifier for small- to moderate-sized feature sets; training is easily parallelized.
Cons: shares the negatives of decision trees, and is too slow with large feature sets.
Neural Networks
Pros: extremely powerful; state-of-the-art accuracy on many tasks in natural language processing and vision.
Cons: not an off-the-shelf classifier; very difficult to choose good parameters; slow to train; prone to over-fitting.
Hyperparameter Tuning
Refer to #27 @ L4.
Evaluation
Common metrics: accuracy, precision, recall, and F1 score.
- Accuracy: the fraction of documents classified correctly
- Precision: TP / (TP + FP), the fraction of documents predicted as class c that truly belong to c
- Recall: TP / (TP + FN), the fraction of documents in class c that the classifier actually found
- F1: the harmonic mean of precision and recall, 2PR / (P + R), usually macro-averaged over classes in multi-class settings
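These can be computed by hand from the confusion matrix or, as sketched here, with scikit-learn (the gold and predicted labels are made up purely to show the calls):

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["sport", "finance", "sport", "finance", "sport"]
pred = ["sport", "sport",   "sport", "finance", "finance"]

print(accuracy_score(gold, pred))  # 0.6: 3 of 5 correct
# Macro-averaging: compute per-class precision/recall/F1, then average
p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="macro")
print(p, r, f1)
```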