Definition
- Tokens: Basic units of text (e.g., words or subwords).
- Types: Distinct or unique words in the text.
- Hapax Legomena: Words that appear only once in a text (counted in the sketch after this list).
- Collocations: Pairs or groups of words that occur together more often than would be expected by chance.
- Zipf's Law: The product of a word's frequency $f$ and its rank $r$ in the frequency list is approximately constant, i.e., $f \cdot r \approx k$.
- Bigram Model: Considers sequences of 2 words.
- Handling Unknown Words: Techniques include backoff, smoothing, and interpolation.
- Add-One (Laplace) Smoothing: Adds one to every count so that unseen words or events receive a small non-zero probability.
- N-Gram: A consecutive sequence of n words in text.
- Chain Rule in Probability: Decomposes the joint probability of all words in a sentence into a product of per-word conditional probabilities.
- Markov Assumption: Limits conditional probabilities to the previous k words, simplifying computation.
- Perplexity: A common metric for evaluating language models, indicating how well a model predicts a sample; lower perplexity is better (formulas and a bigram-model sketch follow this list).
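To make the token/type/hapax distinction concrete, here is a minimal counting sketch; the whitespace tokenizer and the sample sentence are illustrative assumptions, not part of the notes.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat"  # illustrative sample

# Tokens: every occurrence counts; a naive whitespace tokenizer is assumed.
tokens = text.lower().split()

# Types: the distinct words among the tokens.
types = set(tokens)

# Hapax legomena: types that occur exactly once.
counts = Counter(tokens)
hapaxes = [w for w, c in counts.items() if c == 1]

print(len(tokens))      # 10 tokens
print(len(types))       # 7 types: the, cat, sat, on, mat, and, dog
print(sorted(hapaxes))  # ['and', 'cat', 'dog', 'mat', 'on']
```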
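Written out, the chain rule, the Markov (bigram) approximation, add-one smoothing, and perplexity look as follows; these are standard formulations consistent with the definitions above, with $V$ the vocabulary size and $N$ the number of test tokens:

```latex
% Chain rule: joint probability of a sentence as a product of conditionals
P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

% Markov assumption (bigram case, k = 1): condition only on the previous word
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})

% Add-one (Laplace) smoothing for bigrams
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}

% Perplexity of a test set W of N tokens: inverse probability, normalized
PP(W) = P(w_1, \ldots, w_N)^{-\frac{1}{N}}
```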
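And a runnable sketch tying the pieces together: a bigram model with add-one smoothing, evaluated by perplexity. The tiny corpus and the `<s>`/`</s>` boundary markers are assumptions for illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus; real models are trained on far more data.
train = ["the cat sat", "the dog sat", "the cat ran"]
test = ["the dog ran"]

def bigrams(sentence):
    # Pad with boundary markers so the first/last words are conditioned too.
    words = ["<s>"] + sentence.split() + ["</s>"]
    return list(zip(words, words[1:]))

unigram_counts = Counter()
bigram_counts = Counter()
for s in train:
    words = ["<s>"] + s.split() + ["</s>"]
    unigram_counts.update(words[:-1])  # history counts C(w_{i-1})
    bigram_counts.update(zip(words, words[1:]))

vocab = set(w for s in train for w in s.split()) | {"</s>"}
V = len(vocab)

def prob(prev, word):
    # Add-one (Laplace) smoothing: every bigram gets at least count 1.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def perplexity(sentences):
    log_prob, n_tokens = 0.0, 0
    for s in sentences:
        for prev, word in bigrams(s):
            log_prob += math.log(prob(prev, word))
            n_tokens += 1
    # Perplexity = exp(-average log probability); lower is better.
    return math.exp(-log_prob / n_tokens)

print(perplexity(test))
```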
Text Classification (3 + 4.1)
Supervised Classification
- Training: Input → Feature Extractor → features → ML algorithm
- ML algorithms: e.g., Naive Bayes and Logistic Regression
- Prediction: Input → Feature Extractor → features → Classifier model → Label (a pipeline sketch follows this list)
- One approach to feature extraction: Bag of words
- Evaluation: How can we evaluate and compare classifiers? → Metrics (e.g., accuracy, precision, recall, F1)
- Input:
  - $x$ from dataset $X$ = document $d$
  - a fixed set of labels $Y$ = classes $C$
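A minimal end-to-end sketch of this pipeline, assuming scikit-learn (the notes do not name a library) and a toy dataset: bag-of-words feature extraction, a Naive Bayes classifier, and accuracy as the metric.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy dataset: documents d from X with labels from Y (illustrative only).
train_docs = ["great movie, loved it", "terrible plot, awful acting",
              "what a wonderful film", "boring and bad"]
train_labels = ["pos", "neg", "pos", "neg"]
test_docs = ["loved the wonderful acting", "awful boring film"]
test_labels = ["pos", "neg"]

# Feature extractor: bag of words (word-count vectors).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# ML algorithm: Naive Bayes fits the classifier model from features + labels.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Prediction: features -> classifier model -> label.
predicted = model.predict(X_test)

# Metrics: evaluate and compare classifiers.
print(predicted)                               # e.g., ['pos' 'neg']
print(accuracy_score(test_labels, predicted))  # e.g., 1.0
```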