## Overview
This document provides guidelines for evaluating the fluency of responses generated by Norwegian language models. Annotators compare pairs of responses (Response A and Response B) and determine which response is more fluent, or whether the two are equally fluent.

The evaluation focuses exclusively on language quality, naturalness, and grammaticality. Do NOT consider factual accuracy and correctness, completeness of information, creativity and originality, or length and conciseness.
## Definitions
#### What is fluency?
Fluency refers to the linguistic quality that makes a text natural, smooth, and easy to read. A fluent text should read as if it were written by a native speaker. It should consistently use either Bokmål or Nynorsk (depending on the prompt), and should sound genuinely Norwegian rather than as if it were translated from another language.
#### Fluency issues to look for
When evaluating fluency, pay attention to:
1. **Grammar errors**: Agreement errors (e.g. adjective-noun or determiner-noun disagreement), incorrect verb tense, incorrect word order (violating the V2 requirement), wrong word forms
2. **Awkward phrasing**: Unnatural word order, stilted expressions, robotic language
3. **Punctuation problems**: Missing or incorrect punctuation that affects readability
4. **Word choice issues**: Inappropriate vocabulary, incorrect word usage, repetitive language, wrong use of idioms or phrases, incorrect spacing or formation of compound words ("kaffe kopp" vs "kaffekopp"), preposition errors ("på" vs "i")
5. **Flow disruptions**: Abrupt transitions, disconnected ideas within sentences
6. **Spelling errors**: Typos and misspellings, wrong capitalization, incorrect use of diacritics (e.g. "å" vs "a", "ø" vs "o")
7. **Translationese**: A common problem with language models is that they base their output on English, the majority language in their training corpus. This can result in unnatural patterns that read like literal translations from English, such as "stå opp for seg selv", "gjøre en forskjell", "være for salg".
## Annotation procedure
#### Step-by-step process
1. **Read the prompt**: Do not analyze the fluency of the prompt itself; use it only to understand the context and language style.
2. **Read both responses completely** without making immediate judgments.
3. **Identify fluency issues** in each response using the criteria above. Ignore content accuracy and relevance.
4. **Compare the severity and frequency** of fluency issues between the responses.
5. **Make your decision** based on overall fluency.
#### Decision options
You must select one of three options:
- **A is more fluent**: Response A has better overall language quality than Response B
- **B is more fluent**: Response B has better overall language quality than Response A
- **Equally fluent**: Both responses have similar language quality (minor differences that don't clearly favor either response)
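
For anyone collecting judgments programmatically, the three options can be modeled as a closed set so that no other label slips into the data. The following is only a minimal sketch: the names `Decision`, `FluencyJudgment`, `pair_id`, `note`, and `parse_decision` are hypothetical and not part of any existing annotation tool.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    """The three decision options, using the labels from these guidelines."""
    A_MORE_FLUENT = "A is more fluent"
    B_MORE_FLUENT = "B is more fluent"
    EQUALLY_FLUENT = "Equally fluent"


@dataclass(frozen=True)
class FluencyJudgment:
    """One annotator decision for one response pair (hypothetical record layout)."""
    pair_id: str        # assumed identifier of the (Response A, Response B) pair
    decision: Decision
    note: str = ""      # optional free-text remark, e.g. the issues that decided it


def parse_decision(label: str) -> Decision:
    """Map a label string to a Decision, ignoring case and surrounding spaces."""
    normalized = label.strip().casefold()
    for option in Decision:
        if option.value.casefold() == normalized:
            return option
    raise ValueError(f"Unknown decision label: {label!r}")
```

A record could then be built as `FluencyJudgment("pair-001", parse_decision("Equally fluent"))`. Any label outside the three options raises a `ValueError` instead of being stored silently.
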
#### Important guidelines
- **Minor differences matter**: Even small improvements in fluency should influence your decision
- **Be consistent**: Apply the same standards across all evaluations
- **When in doubt about equality**: If, after careful analysis, you cannot determine which response is better, select "Equally fluent"
## Examples
Here are some examples of text that should not be considered fluent Norwegian:
- "Vi kan også prøve å finne måter å gjøre oppgavene dine mer overskuelige og gi deg mer tid til å gjøre dem på." (word choice)
- "skrivemappa din" (agreement)
- "en elsket medlem av kongefamilien" (agreement)
- "jeg vil se deg neste gang" (English-influenced translationese; "sees neste gang" would be more fluent)
- "banal hjertroman" (compound)
- "den første konge" (missing double definiteness)
## Edge cases and special considerations
- **Languages other than Norwegian**: If one of the responses is written in a different language (e.g. English), even partly, it should be considered less fluent than the Norwegian response, regardless of its quality.
- **Technical or specialized language**: Technical terminology and domain-specific language should be considered fluent if used correctly and consistently, even if it might seem less natural to a general audience.
- **Formatting issues**: Ignore formatting differences (bold, italics, bullet points) unless they directly impact readability or sentence structure.
- **Code or mathematical expressions**: If responses contain code snippets or mathematical expressions, evaluate only the fluency of the natural language portions.
- **When in doubt, ask us :)** |
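
As an optional aid for the code and mathematics edge case above, a response can be pre-processed so that only its natural language portions remain in view. The sketch below assumes that responses use Markdown-style backtick code and dollar-delimited LaTeX math, which these guidelines do not specify. The function name `natural_language_only` and the `[CODE]` / `[MATH]` placeholders are hypothetical.

```python
import re

FENCE = "`" * 3  # Markdown code fence, built this way to keep this example readable

# Patterns are applied in this order so that multi-line blocks are removed first.
PATTERNS = [
    (re.compile(FENCE + r".*?" + FENCE, re.DOTALL), "[CODE]"),   # fenced code blocks
    (re.compile(r"\$\$.*?\$\$", re.DOTALL), "[MATH]"),           # display math
    (re.compile(r"`[^`\n]+`"), "[CODE]"),                        # inline code
    (re.compile(r"\$[^$\n]+\$"), "[MATH]"),                      # inline math
]


def natural_language_only(response: str) -> str:
    """Replace code and math spans with placeholders and tidy the whitespace.

    Heuristic only: a line with two currency amounts such as "$5 ... $10"
    would be mistaken for inline math.
    """
    text = response
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Placeholders are used instead of deleting the spans so that the surrounding sentence structure stays readable. The annotator still judges the remaining text by hand.
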