Definition
- Tokens: Basic units of text (e.g., words or subwords).
- Types: Distinct or unique words in the text.
- Hapax Legomena: Words that appear only once in a text (counted in the sketch after this list).
- Collocations: Pairs or groups of words that occur together more often than would be expected by chance.
- Zipf's Law: The product of a word's frequency $f$ and its rank $r$ in the frequency list is approximately constant, i.e., $f \cdot r \approx k$.
- Bigram Model: Considers sequences of 2 words.
- Handling Unknown Words: Techniques include backoff, smoothing, and interpolation.
- Add-One (Laplace) Smoothing: Adds one to every count so that unseen words or events receive a small non-zero probability.
- N-Gram: A consecutive sequence of n words in text.
- Chain Rule in Probability: Decomposes the joint probability of all words in a sentence into a product of per-word conditional probabilities.
- Markov Assumption: Limits conditional probabilities to the previous k words, simplifying computation.
- Perplexity: A common metric for evaluating language models, indicating how well a model predicts a sample; lower perplexity is better (formulas and a bigram-model sketch follow this list).
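To make the token/type/hapax distinction concrete, here is a minimal counting sketch; the whitespace tokenizer and the sample sentence are illustrative assumptions, not part of the notes.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat"  # illustrative sample

# Tokens: every occurrence counts; a naive whitespace tokenizer is assumed.
tokens = text.lower().split()

# Types: the distinct words among the tokens.
types = set(tokens)

# Hapax legomena: types that occur exactly once.
counts = Counter(tokens)
hapaxes = [w for w, c in counts.items() if c == 1]

print(len(tokens))      # 10 tokens
print(len(types))       # 7 types: the, cat, sat, on, mat, and, dog
print(sorted(hapaxes))  # ['and', 'cat', 'dog', 'mat', 'on']
```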
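Written out, the chain rule, the Markov (bigram) approximation, add-one smoothing, and perplexity look as follows; these are standard formulations consistent with the definitions above, with $V$ the vocabulary size and $N$ the number of test tokens:

```latex
% Chain rule: joint probability of a sentence as a product of conditionals
P(w_1, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})

% Markov assumption (bigram case, k = 1): condition only on the previous word
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-1})

% Add-one (Laplace) smoothing for bigrams
P_{\text{Laplace}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) + 1}{C(w_{i-1}) + V}

% Perplexity of a test set W of N tokens: inverse probability, normalized
PP(W) = P(w_1, \ldots, w_N)^{-\frac{1}{N}}
```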
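And a runnable sketch tying the pieces together: a bigram model with add-one smoothing, evaluated by perplexity. The tiny corpus and the `<s>`/`</s>` boundary markers are assumptions for illustration.

```python
import math
from collections import Counter

# Tiny illustrative corpus; real models are trained on far more data.
train = ["the cat sat", "the dog sat", "the cat ran"]
test = ["the dog ran"]

def bigrams(sentence):
    # Pad with boundary markers so the first/last words are conditioned too.
    words = ["<s>"] + sentence.split() + ["</s>"]
    return list(zip(words, words[1:]))

unigram_counts = Counter()
bigram_counts = Counter()
for s in train:
    words = ["<s>"] + s.split() + ["</s>"]
    unigram_counts.update(words[:-1])  # history counts C(w_{i-1})
    bigram_counts.update(zip(words, words[1:]))

vocab = set(w for s in train for w in s.split()) | {"</s>"}
V = len(vocab)

def prob(prev, word):
    # Add-one (Laplace) smoothing: every bigram gets at least count 1.
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

def perplexity(sentences):
    log_prob, n_tokens = 0.0, 0
    for s in sentences:
        for prev, word in bigrams(s):
            log_prob += math.log(prob(prev, word))
            n_tokens += 1
    # Perplexity = exp(-average log probability); lower is better.
    return math.exp(-log_prob / n_tokens)

print(perplexity(test))
```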
Text Classification (3 + 4.1)
Supervised Classification
- Training: Input → Feature Extractor → features → ML algorithm
- ML algorithms: e.g., Naive Bayes and Logistic Regression
- Prediction: Input → Feature Extractor → features → Classifier model → Label (a pipeline sketch follows this list)
- One approach to feature extraction: Bag of words
- Evaluation: How can we evaluate and compare classifiers? → Metrics (e.g., accuracy, precision, recall, F1)
- Input:
  - $x$ from dataset $X$ = document $d$
  - a fixed set of labels $Y$ = classes $C$
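A minimal end-to-end sketch of this pipeline, assuming scikit-learn (the notes do not name a library) and a toy dataset: bag-of-words feature extraction, a Naive Bayes classifier, and accuracy as the metric.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Toy dataset: documents d from X with labels from Y (illustrative only).
train_docs = ["great movie, loved it", "terrible plot, awful acting",
              "what a wonderful film", "boring and bad"]
train_labels = ["pos", "neg", "pos", "neg"]
test_docs = ["loved the wonderful acting", "awful boring film"]
test_labels = ["pos", "neg"]

# Feature extractor: bag of words (word-count vectors).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# ML algorithm: Naive Bayes fits the classifier model from features + labels.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Prediction: features -> classifier model -> label.
predicted = model.predict(X_test)

# Metrics: evaluate and compare classifiers.
print(predicted)                               # e.g., ['pos' 'neg']
print(accuracy_score(test_labels, predicted))  # e.g., 1.0
```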