## Overview

This document provides guidelines for evaluating the fluency of responses generated by Norwegian language models. Annotators compare pairs of responses (Response A and Response B) and determine which response is more fluent, or whether they are equally fluent.

The evaluation focuses exclusively on language quality, naturalness, and grammaticality. Do NOT consider factual accuracy and correctness, completeness of information, creativity and originality, or length and conciseness.

## Definitions

#### What is fluency?

Fluency refers to the linguistic quality of a text that makes it natural, smooth, and easy to read. A fluent text should look as if it were written by a native speaker. It should consistently use either Bokmål or Nynorsk (depending on the prompt) and should sound genuinely Norwegian rather than as if it were translated from another language.

#### Fluency issues to look for

When evaluating fluency, pay attention to:

1. **Grammar errors**: Agreement errors (e.g. adjective-noun or determiner-noun disagreement), incorrect verb tense, incorrect word order (violating the V2 requirement), wrong word forms
2. **Awkward phrasing**: Unnatural word order, stilted expressions, robotic language
3. **Punctuation problems**: Missing or incorrect punctuation that affects readability
4. **Word choice issues**: Inappropriate vocabulary, incorrect word usage, repetitive language, wrong use of idioms or phrases, incorrect spacing or formation of compound words ("kaffe kopp" vs "kaffekopp"), preposition errors ("på" vs "i")
5. **Flow disruptions**: Abrupt transitions, disconnected ideas within sentences
6. **Spelling errors**: Typos and misspellings, wrong capitalization, incorrect use of diacritics (e.g. "å" vs "a", "ø" vs "o")
7. **Translationese**: A common problem with language models is that they base their output on English, the majority language in their training corpus.
This can result in unnatural language patterns that read like literal translations from English, such as "stå opp for seg selv", "gjøre en forskjell", "være for salg".

## Annotation procedure

#### Step-by-step process

1. **Read the prompt**: Do not analyze the fluency of the prompt itself; read it to understand the context and language style.
2. **Read both responses completely** without making immediate judgments.
3. **Identify fluency issues** in each response using the criteria above. Ignore content accuracy and relevance.
4. **Compare the severity and frequency** of fluency issues between the responses.
5. **Make your decision** based on overall fluency.

#### Decision options

You must select one of three options:

- **A is more fluent**: Response A has better overall language quality than Response B
- **B is more fluent**: Response B has better overall language quality than Response A
- **Equally fluent**: Both responses have similar language quality (any differences are minor and do not clearly favor either response)

#### Important guidelines

- **Minor differences matter**: Even small improvements in fluency should influence your decision
- **Be consistent**: Apply the same standards across all evaluations
- **When in doubt about equality**: If you cannot decisively determine which response is better after careful analysis, select "Equally fluent"

## Examples

Here are some examples of text that should not be considered fluent Norwegian:

- "Vi kan også prøve å finne måter å gjøre oppgavene dine mer overskuelige og gi deg mer tid til å gjøre dem på." (word choice)
- "skrivemappa din" (agreement)
- "en elsket medlem av kongefamilien" (agreement)
- "jeg vil se deg neste gang" (English-influenced translationese; "sees neste gang" would be more fluent)
- "banal hjertroman" (compound)
- "den første konge" (double definiteness)

## Edge cases and special considerations

- **Language other than Norwegian**: If one of the responses is in a different language (e.g.
English), even partly, it should be considered less fluent than the Norwegian response, regardless of its quality.
- **Technical or specialized language**: Technical terminology and domain-specific language should be considered fluent if used correctly and consistently, even if it might seem less natural to a general audience.
- **Formatting issues**: Ignore formatting differences (bold, italics, bullet points) unless they directly impact readability or sentence structure.
- **Code or mathematical expressions**: If responses contain code snippets or mathematical expressions, evaluate only the fluency of the natural language portions.
- **When in doubt, ask us :)**
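
If judgments are collected programmatically, the three decision options and the seven issue categories described above could be captured in a small record type. These guidelines do not prescribe any storage format, so the following is only a minimal sketch: the class names, field names, and the `pair-001` identifier are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    """The three decision options from the guidelines."""
    A_MORE_FLUENT = "A is more fluent"
    B_MORE_FLUENT = "B is more fluent"
    EQUALLY_FLUENT = "Equally fluent"


class Issue(Enum):
    """The seven fluency issue categories from the guidelines."""
    GRAMMAR = "grammar"
    AWKWARD_PHRASING = "awkward phrasing"
    PUNCTUATION = "punctuation"
    WORD_CHOICE = "word choice"
    FLOW = "flow"
    SPELLING = "spelling"
    TRANSLATIONESE = "translationese"


@dataclass
class Annotation:
    """One annotator judgment for a response pair (hypothetical schema)."""
    pair_id: str
    decision: Decision
    issues_a: list[Issue] = field(default_factory=list)  # issues seen in Response A
    issues_b: list[Issue] = field(default_factory=list)  # issues seen in Response B
    note: str = ""  # optional free-text justification


# Example: Response A contains a split compound and a literal translation.
example = Annotation(
    pair_id="pair-001",
    decision=Decision.B_MORE_FLUENT,
    issues_a=[Issue.WORD_CHOICE, Issue.TRANSLATIONESE],
    note='A writes "kaffe kopp" and uses "gjøre en forskjell".',
)
```

The decision stays a human judgment; the record only stores which issues the annotator observed, so that choices can later be checked for consistency.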