1. Type-Token-Ratio (TTR)
Imagine a newspaper article written in Norwegian*, which you run through a lemmatizer. You will
thus have two versions of the article, the original and the lemmatized one.
Does the type-token-ratio differ significantly between the two versions? If so, which of the two is higher? Justify your answer by explaining how the number of tokens and the number of types differs.
(* You can also answer the same question for a different language, e.g. English. If so, specify the
selected language and briefly describe how the choice of language can affect the results.)
[5 pts]
Definition
Lemmatization is the task of mapping each word form to its lemma, its shared dictionary form. For example, the words sang, sung, and sings are all forms of the verb sing: sing is the common lemma of these words, and a lemmatizer maps each of them to sing.
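In code, such a mapping can be pictured as a simple lookup table. This is only a toy stand-in for a real lemmatizer, which uses morphological analysis rather than a hard-coded dictionary:

```python
# Toy lemma lookup for the forms above; illustration only, not a real
# lemmatizer (those rely on morphological analysis, not fixed tables).
LEMMA = {"sang": "sing", "sung": "sing", "sings": "sing"}
print([LEMMA.get(w, w) for w in ["sang", "sung", "sings"]])  # ['sing', 'sing', 'sing']
```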
- Lemmatization is essential for processing morphologically complex languages.
- Stemming is a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
- Text normalization also includes sentence segmentation: breaking up a text into individual sentences, using cues like periods or exclamation points. A toy sketch of both ideas follows.
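The sketch below shows a naive suffix-stripping stemmer and a punctuation-based sentence splitter. Both are minimal illustrations written for this exercise, not real implementations (a real stemmer would use, e.g., the Porter algorithm, and a real segmenter must handle abbreviations and decimals):

```python
import re

def toy_stem(word):
    # Naive suffix stripping in the spirit of stemming; illustrative only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def split_sentences(text):
    # Minimal sentence segmentation: split after '.', '!' or '?' followed
    # by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(toy_stem("sings"))                      # -> 'sing'
print(split_sentences("It rains. Go home!"))  # -> ['It rains.', 'Go home!']
```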
Type-token ratio (TTR) = Types / Tokens
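A one-line Python helper makes the definition concrete (the function name ttr is just a label chosen here):

```python
def ttr(tokens):
    # Type-token ratio: number of distinct types divided by total tokens.
    return len(set(tokens)) / len(tokens)

print(ttr("the cat and the dog".split()))  # 4 types / 5 tokens = 0.8
```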
- Which type of text has a higher TTR, a newspaper article or a transcript of an informal conversation? Informal conversation contains many repeated items (fillers, pronouns, frequent function words), so its TTR is lower; the newspaper article has the higher TTR.
Solution
- Justify your answer by explaining how the number of tokens and the number of types differs.
- The number of tokens is the same after lemmatization: the lemmatizer replaces each token with its lemma one-to-one, so the token count is unchanged.
- The number of types is decreased by lemmatization, as a typical Norwegian noun can appear as up to four types in the original version (singular/plural × indefinite/definite) → only one type in the lemmatized version.
- Does the type-token-ratio differ significantly between the two versions? If so, which of the two is higher?
- The original version has the higher TTR and the lemmatized version the lower one.
Example: “one cat caught five mice and three cats caught one mouse”
- Original: 11 tokens, 9 types → TTR = 9/11 ≈ 0.82
- Lemmatized (“one cat catch five mouse and three cat catch one mouse”): 11 tokens, 7 types → TTR = 7/11 ≈ 0.64, so the number of types, and with it the TTR, decreases
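These numbers can be reproduced in Python. The lemma map below is hand-written for this one sentence; a real pipeline would use an actual lemmatizer instead of a fixed dictionary, and ttr() is the helper sketched above:

```python
def ttr(tokens):
    # Type-token ratio, as defined above.
    return len(set(tokens)) / len(tokens)

# Hand-written lemma map for this one sentence; a real pipeline would use
# an actual lemmatizer for the chosen language.
LEMMAS = {"cats": "cat", "mice": "mouse", "caught": "catch"}

tokens = "one cat caught five mice and three cats caught one mouse".split()
lemmatized = [LEMMAS.get(t, t) for t in tokens]

print(len(tokens), len(set(tokens)), round(ttr(tokens), 2))              # 11 9 0.82
print(len(lemmatized), len(set(lemmatized)), round(ttr(lemmatized), 2))  # 11 7 0.64
```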