1. Type-Token-Ratio (TTR)

Imagine a newspaper article written in Norwegian*, which you run through a lemmatizer. You will thus have two versions of the article, the original and the lemmatized one.

Does the type-token-ratio differ significantly between the two versions? If so, which of the two is higher? Justify your answer by explaining how the number of tokens and the number of types differ. (* You can also answer the same question for a different language, e.g. English. If so, specify the selected language and briefly describe how the choice of language can affect the results.) [5 pts]

Definition

A lemma is the base (dictionary) form shared by a set of inflected word forms. For example, the words sang, sung, and sings are forms of the verb sing. The word sing is the common lemma of these words, and a lemmatizer maps each of them to sing.

Type-token ratio (TTR) = Types / Tokens
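
The formula can be sketched in a few lines of Python. This is only an illustration: the function name type_token_ratio and the whitespace tokenization are assumptions made for the example, not part of the exam material.

```python
def type_token_ratio(tokens):
    """Return the number of distinct tokens (types) divided by the total number of tokens."""
    if not tokens:
        raise ValueError("TTR is undefined for an empty token list")
    return len(set(tokens)) / len(tokens)


# Naive whitespace tokenization; a real tokenizer would also handle punctuation and case.
tokens = "sing sang sung sings sing".split()
print(type_token_ratio(tokens))  # 4 types / 5 tokens = 0.8
```

Here sing occurs twice but counts as a single type, which is why the ratio falls below 1.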

Solution

  1. Explain how the number of tokens and the number of types differ between the two versions.
  2. State whether the type-token-ratio differs significantly between the two versions and, if so, which of the two is higher.

Example: “one cat caught five mice and three cats caught one mouse” → 11 tokens, 9 types

For original: 11 tokens, 9 types

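For lemmatized (“one cat catch five mouse and three cat catch one mouse”): 11 tokens, 7 types → the number of tokens stays the same while the number of types decreases, so the TTR drops from 9/11 ≈ 0.82 to 7/11 ≈ 0.64. The original version therefore has the higher TTR.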
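
The counts for the example sentence can be checked with a short Python sketch. The lemma table LEMMAS is a hypothetical stand-in that covers only the inflected forms in this one sentence; a real lemmatizer would be used on an actual article.

```python
# Toy lemma table (an assumption for this example), covering only the
# inflected forms that occur in the sentence below.
LEMMAS = {"caught": "catch", "mice": "mouse", "cats": "cat"}

sentence = "one cat caught five mice and three cats caught one mouse"
original = sentence.split()
# Words without an entry in the table are kept unchanged.
lemmatized = [LEMMAS.get(tok, tok) for tok in original]

for name, toks in [("original", original), ("lemmatized", lemmatized)]:
    types = set(toks)
    ttr = len(types) / len(toks)
    print(f"{name}: {len(toks)} tokens, {len(types)} types, TTR = {ttr:.2f}")
```

Lemmatization replaces each token with another token one-to-one, so the token count cannot change; only distinct forms such as cat/cats collapse into a single type.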