Benchmarking Generative Language Models for Hungarian: Building a Foundation for Reliable Evaluation
Introduction
Currently, there is no publicly available benchmark dataset that evaluates how well generative language models communicate in Hungarian. While perplexity metrics on Hungarian texts can provide some insights, they often fail to capture the subtle nuances of natural language generation—especially in Hungarian, where multiple grammatically and semantically valid continuations can exist for a single context.
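For reference, perplexity is the exponentiated average negative log-likelihood a model assigns to the observed tokens. A minimal sketch (the function name and input format are ours, for illustration only):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token).

    A low value only means the model assigns high probability to the one
    continuation present in the corpus; it does not credit other, equally
    valid Hungarian continuations of the same context.
    """
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for a short Hungarian sentence:
print(perplexity([-1.2, -0.4, -2.3, -0.9]))  # ~3.32
```

This is exactly why a single-reference metric can undercount a model's competence when several continuations are acceptable.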
To address this critical gap, our goal was to develop a dedicated benchmark dataset designed to assess the expressive and stylistic capabilities of language models specifically in Hungarian. Such an evaluation is essential: when building real-world applications, selecting the right model family and size requires data-driven decision-making. Only by quantifying linguistic performance can we confidently choose the optimal model for a given task.
Background
Large language models inherently have the ability to generalise relationships between entities—people, objects, or concepts—and articulate these relationships across multiple languages. Thus, when sufficient training data is available, relying on English-language benchmarks is a reasonable starting point, assuming similar performance trends are observed on comparable datasets in the target language.
One widely adopted multilingual benchmark is MMLU. Although no Hungarian version currently exists, we can glean insights from the recent Global MMLU study by Singh et al., who examined MMLU performance across languages and translation strategies. Their key findings include:
- Human-translated test sets consistently yield better evaluation results than machine-translated ones.
- Performance declines sharply for languages with a low digital footprint.
- Around 28% of MMLU content is Western-culture-centric, introducing cultural bias.
- Models perform slightly worse on medium-resource languages than on high-resource ones, but the gap widens considerably for low-resource languages.
Hungarian falls into the medium digital footprint category. Ideally, a human-translated Hungarian MMLU would provide the most accurate evaluation. In its absence, we hypothesise that a model capable of generating high-quality Hungarian text has been sufficiently exposed to Hungarian during training.
Methodology
Creating a Hungarian Benchmark Inspired by LAMBADA
We set out to create a Hungarian benchmark dataset akin to LAMBADA, which tests whether a model can predict the final word of a passage, a word that is obvious to native speakers given the preceding context.
Translating English benchmarks into the target language is common practice but often results in losing language-specific patterns and dynamics. While translating, we found ourselves adapting texts to conclude with Hungarian idiomatic expressions and constructions that feel natural and unmistakable to native speakers.
This realisation prompted us to explore existing resources that document such expressions, validated by human experts. We identified a peer-reviewed reference work: Viola Temesi's definitive volume on Hungarian collocations and phraseology, published through the Hungarian Digital Textbook Library.
Generative Experiments and Dataset Refinement
Using GPT-4o, we attempted to generate LAMBADA-style text passages. However, the concluding words were often ambiguous, even to native speakers. When provided with specific collocations and their definitions, GPT-4o could generate coherent, meaningful stories.
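To make the record format concrete, here is a hypothetical example of the kind of entry the dataset contains (the collocation, definition, and story below are illustrative, not actual dataset content):

```python
# Hypothetical record, illustrative only (not an actual dataset entry).
# The story ends with the second part of the collocation "kosarat ad"
# (literally "to give a basket", i.e. to turn down an offer or advance).
record = {
    "collocation": "kosarat ad",
    "description": "kosarat ad: visszautasítja valakinek az ajánlatát vagy közeledését",
    "text": "Péter randevúra hívta Annát, de a lány kosarat adott.",
}
# At evaluation time the final word ("adott") is held out, and the model
# must reproduce it from the truncated text plus the description.
```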
Since the dataset includes many collocations rarely used in everyday speech, we used GPT-4o to classify entries based on their colloquial usage. Interestingly, this classification added little value: models scoring higher overall tended to perform better on both common and less frequent collocation subsets.
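A minimal sketch of how such a classification pass could look with the OpenAI Python client (the prompt wording and the common/rare label set are our assumptions, not the exact ones we used):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_colloquiality(collocation: str, description: str) -> str:
    """Ask GPT-4o whether a collocation is common in everyday speech."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Is the following Hungarian collocation common in everyday "
                "speech? Answer with exactly one word: common or rare.\n"
                f"Collocation: {collocation}\nDefinition: {description}"
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()
```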
Evaluation Methodology
We employed the lm-evaluation-harness library to design a generative evaluation task. The prompt template used for testing was as follows:
Target collocation description: {{description}}
Continue the text! In your response, please return only the correct word!
{{text.split(' ')[:-1]|join(' ')}}
We assessed correctness by extracting the first word of the generated continuation and checking whether it matched the held-out second part of the collocation; accuracy is the share of records where the two match.
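A minimal sketch of this matching step (the lowercasing and punctuation stripping are our assumed normalisation; the filters actually applied in the harness task may differ slightly):

```python
import string

def first_word_match(generation: str, target_word: str) -> bool:
    """True if the first word of the model's continuation equals the
    held-out final word of the text (the collocation's second part)."""
    words = generation.strip().split()
    if not words:
        return False
    norm = lambda w: w.strip(string.punctuation).lower()
    return norm(words[0]) == norm(target_word)

# Accuracy over (generation, target) pairs:
# accuracy = sum(first_word_match(g, t) for g, t in pairs) / len(pairs)
```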
Results
All scores are accuracy values, in percent.

| Model | Average | General | Domain |
|---|---|---|---|
| GPT-4o | 59.70 | 64.92 | 56.30 |
| GPT-4o mini | 53.42 | 60.43 | 48.42 |
| Llama 3.3 70B | 43.56 | 48.57 | 39.91 |
| Llama 4 Maverick | 51.31 | 55.16 | 48.92 |
| Phi-4 | 23.04 | 23.83 | 22.32 |
| Gemma 3 27B | 40.57 | 44.78 | 37.95 |
| DeepSeek V3 | 48.29 | 53.45 | 44.82 |
| Qwen 2.5 72B | 29.52 | 33.94 | 26.16 |
| Qwen 3 4B | 12.89 | 15.40 | 10.84 |
| Qwen 3 32B | 28.47 | 33.70 | 24.95 |
| Qwen 3 235B-A22B | 39.48 | 44.14 | 36.38 |
| Claude Sonnet | 48.29 | 52.88 | 45.01 |
Evaluation of the Claude Opus model failed twice on the full dataset, so we report its results on the first 1,000 records. As a reference, we also ran GPT-4o on this subset.
| Model | Average | General | Domain |
|---|---|---|---|
| GPT-4o | 61.10 | 66.49 | 58.88 |
| Claude Opus | 62.10 | 64.43 | 61.79 |
Overall, the OpenAI and Anthropic models perform best, followed by Meta's Llama 4 Maverick, then DeepSeek V3, Llama 3.3, and Gemma 3. Based on these evaluations, we do not consider the Phi-4 and Qwen 3 models production-ready for real-world Hungarian applications.
Conclusions and Future Directions
With this benchmark, we can now rank language models by their Hungarian proficiency and infer that their English-language capabilities can also be leveraged effectively in Hungarian. While this assumption would be more robust with Hungarian-translated MMLU results, it provides a useful starting point.
Our benchmark equips Hungarian researchers and developers with a practical tool to select foundational models suited to their needs. We also hope our work inspires similar efforts for less-studied languages.
Looking ahead, we aim to extend our research by evaluating the reasoning abilities of mainstream language models in agentic settings, with a specific focus on the Hungarian language.
The dataset is available on the Hub.