1. Type-Token-Ratio (TTR)
Imagine a newspaper article written in Norwegian*, which you run through a lemmatizer. You will
thus have two versions of the article, the original and the lemmatized one.
Does the type-token-ratio differ significantly between the two versions? If so, which of the two is higher? Justify your answer by explaining how the number of tokens and the number of types differs.
(* You can also answer the same question for a different language, e.g. English. If so, specify the
selected language and briefly describe how the choice of language can affect the results.)
[5 pts]
Definition
Lemmatization is the task of mapping each word form to its lemma, its shared dictionary form. For example, the words sang, sung, and sings are all forms of the verb sing: sing is the common lemma of these words, and a lemmatizer maps each of them to sing.
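In code, such a mapping can be pictured as a simple lookup table. This is only a toy stand-in for a real lemmatizer, which uses morphological analysis rather than a hard-coded dictionary:

```python
# Toy lemma lookup for the forms above; illustration only, not a real
# lemmatizer (those rely on morphological analysis, not fixed tables).
LEMMA = {"sang": "sing", "sung": "sing", "sings": "sing"}
print([LEMMA.get(w, w) for w in ["sang", "sung", "sings"]])  # ['sing', 'sing', 'sing']
```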
- Lemmatization is essential for processing morphologically complex languages.
- Stemming is a simpler version of lemmatization in which we mainly just strip suffixes from the end of the word.
- Text normalization also includes sentence segmentation: breaking up a text into individual sentences, using cues like periods or exclamation points. A toy sketch of both ideas follows.
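The sketch below shows a naive suffix-stripping stemmer and a punctuation-based sentence splitter. Both are minimal illustrations written for this exercise, not real implementations (a real stemmer would use, e.g., the Porter algorithm, and a real segmenter must handle abbreviations and decimals):

```python
import re

def toy_stem(word):
    # Naive suffix stripping in the spirit of stemming; illustrative only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def split_sentences(text):
    # Minimal sentence segmentation: split after '.', '!' or '?' followed
    # by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(toy_stem("sings"))                      # -> 'sing'
print(split_sentences("It rains. Go home!"))  # -> ['It rains.', 'Go home!']
```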
Type-token ratio (TTR) = Types / Tokens
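A one-line Python helper makes the definition concrete (the function name ttr is just a label chosen here):

```python
def ttr(tokens):
    # Type-token ratio: number of distinct types divided by total tokens.
    return len(set(tokens)) / len(tokens)

print(ttr("the cat and the dog".split()))  # 4 types / 5 tokens = 0.8
```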
- Which type of text has a higher TTR, a newspaper article or a transcript of an informal conversation? Informal conversation contains many repeated items (fillers, pronouns, frequent function words), so its TTR is lower; the newspaper article has the higher TTR.
Solution
- Justify your answer by explaining how the number of tokens and the number of types differs.
- The number of tokens is the same after lemmatization: the lemmatizer replaces each token with its lemma one-to-one, so the token count is unchanged.
- The number of types is decreased by lemmatization, as a typical Norwegian noun can appear as up to four types in the original version (singular/plural × indefinite/definite) → only one type in the lemmatized version.
- Does the type-token-ratio differ significantly between the two versions? If so, which of the two is higher?
- The original version has the higher TTR and the lemmatized version the lower one.
Example: “one cat caught five mice and three cats caught one mouse”
- Original: 11 tokens, 9 types → TTR = 9/11 ≈ 0.82
- Lemmatized (“one cat catch five mouse and three cat catch one mouse”): 11 tokens, 7 types → TTR = 7/11 ≈ 0.64, so the number of types, and with it the TTR, decreases
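These numbers can be reproduced in Python. The lemma map below is hand-written for this one sentence; a real pipeline would use an actual lemmatizer instead of a fixed dictionary, and ttr() is the helper sketched above:

```python
def ttr(tokens):
    # Type-token ratio, as defined above.
    return len(set(tokens)) / len(tokens)

# Hand-written lemma map for this one sentence; a real pipeline would use
# an actual lemmatizer for the chosen language.
LEMMAS = {"cats": "cat", "mice": "mouse", "caught": "catch"}

tokens = "one cat caught five mice and three cats caught one mouse".split()
lemmatized = [LEMMAS.get(t, t) for t in tokens]

print(len(tokens), len(set(tokens)), round(ttr(tokens), 2))              # 11 9 0.82
print(len(lemmatized), len(set(lemmatized)), round(ttr(lemmatized), 2))  # 11 7 0.64
```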