Benchmarking Generative Language Models for Hungarian: Building a Foundation for Reliable Evaluation

Community Article · Published June 10, 2025

Introduction

Currently, there is no publicly available benchmark dataset that evaluates how well generative language models communicate in Hungarian. While perplexity metrics on Hungarian texts can provide some insights, they often fail to capture the subtle nuances of natural language generation—especially in Hungarian, where multiple grammatically and semantically valid continuations can exist for a single context.

To address this critical gap, our goal was to develop a dedicated benchmark dataset designed to assess the expressive and stylistic capabilities of language models specifically in Hungarian. Such an evaluation is essential: when building real-world applications, selecting the right model family and size requires data-driven decision-making. Only by quantifying linguistic performance can we confidently choose the optimal model for a given task.

Background

Large language models inherently have the ability to generalise relationships between entities—people, objects, or concepts—and articulate these relationships across multiple languages. Thus, when sufficient training data is available, relying on English-language benchmarks is a reasonable starting point, assuming similar performance trends are observed on comparable datasets in the target language.

One widely adopted multilingual benchmark is MMLU. Although no Hungarian version currently exists, we can glean insights from the recent study by Shivalika Singh et al., who examined MMLU performance across languages and translation strategies. Their key findings include:

  • Human translations consistently yield better evaluation results.
  • Performance declines sharply for languages with a low digital footprint.
  • Around 28% of MMLU content is Western-culture-centric, introducing cultural bias.
  • Models perform only slightly worse on medium-resource languages than on high-resource languages, but the gap widens sharply for low-resource languages.

Hungarian falls into the medium digital footprint category. Ideally, a human-translated Hungarian MMLU would provide the most accurate evaluation. In its absence, we hypothesise that a model capable of generating high-quality Hungarian text has been sufficiently exposed to Hungarian during training.

Methodology

Creating a Hungarian Benchmark Inspired by LAMBADA

We set out to create a Hungarian benchmark dataset akin to LAMBADA, which tests whether a model can predict a final word that is obvious to native speakers, given a brief context.
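To make the task format concrete, here is a minimal, illustrative sketch of a single benchmark item. The fields mirror the {{description}} and {{text}} placeholders used in the evaluation prompt further below, but the example record itself is invented for illustration rather than taken from the released dataset.

```python
# A minimal, illustrative LAMBADA-style item built around a Hungarian
# collocation. The example record is an assumption for illustration,
# not an entry from the released dataset.

record = {
    # Short passage whose final word completes a well-known collocation,
    # "csütörtököt mond" (~ "to misfire / to fail"); a native speaker can
    # predict the last word with near certainty.
    "text": "A vadász meghúzta a ravaszt, de a régi puska csütörtököt mondott.",
    # Human-validated definition of the collocation (also usable for generation).
    "description": "csütörtököt mond: csődöt mond, nem működik (to fail, to misfire)",
}

# The evaluation hides the final word and asks the model to predict it.
words = record["text"].split(" ")
context, target = " ".join(words[:-1]), words[-1].rstrip(".")
print(context)  # ... de a régi puska csütörtököt
print(target)   # mondott
```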

Translating English benchmarks into the target language is common practice but often results in losing language-specific patterns and dynamics. While translating, we found ourselves adapting texts to conclude with Hungarian idiomatic expressions and constructions that feel natural and unmistakable to native speakers.

This realisation prompted us to explore existing resources that document such expressions, validated by human experts. We identified a peer-reviewed reference work: Viola Temesi's definitive volume on Hungarian collocations and phraseology, published by the Hungarian Digital Textbook Library.

Generative Experiments and Dataset Refinement

Using GPT-4o, we attempted to generate LAMBADA-style text passages, but the concluding words were often ambiguous—even to native speakers. When provided with specific collocations and their definitions, GPT-4o could generate coherent, meaningful stories that end in an unambiguous final word.
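As a rough illustration, a generation call of this kind might look like the sketch below; the prompt wording is an assumption made for this post, and the call uses the standard OpenAI Python SDK rather than our exact pipeline.

```python
# Illustrative sketch: generating a LAMBADA-style passage from a collocation
# and its human-validated definition with GPT-4o (OpenAI Python SDK, v1.x).
# The prompt wording is an assumption, not the exact prompt used in our work.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def generate_passage(collocation: str, definition: str) -> str:
    prompt = (
        "Write a short, coherent Hungarian story of three to four sentences "
        f'that ends exactly with the collocation "{collocation}" '
        f"(meaning: {definition}). The final word must be the second part of "
        "the collocation, and it should be the only natural continuation."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()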

Since the dataset includes many collocations rarely used in everyday speech, we used GPT-4o to classify entries based on their colloquial usage. Interestingly, this classification added little value: models scoring higher overall tended to perform better on both common and less frequent collocation subsets.
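The usage classification can be sketched along the same lines; the binary "common"/"rare" labelling and the prompt below are illustrative assumptions, and the released dataset may encode this information differently.

```python
# Illustrative sketch: labelling collocations by everyday (colloquial) usage
# with GPT-4o. The two-label scheme and the prompt are assumptions made for
# illustration only.
from openai import OpenAI

client = OpenAI()


def classify_usage(collocation: str, definition: str) -> str:
    prompt = (
        f'Collocation: "{collocation}" (meaning: {definition}).\n'
        "Is this expression common in everyday spoken Hungarian? "
        "Answer with a single word: common or rare."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    label = response.choices[0].message.content.strip().lower()
    return "common" if label.startswith("common") else "rare"
```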

Evaluation Methodology

We employed the EleutherAI lm-evaluation-harness to design a generative evaluation task. The prompt template was as follows:

Target collocation description: {{description}}
Continue the text! In your response, please return only the correct word!
{{text.split(' ')[:-1]|join(' ')}}

We extracted the first word of the generated continuation and counted a response as correct if it matched the held-out second part of the collocation. Accuracy under this criterion varied considerably across models.
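The scoring rule is straightforward to express; the sketch below assumes a list of (generation, target) pairs, and the punctuation stripping and case-folding are our illustrative choices rather than a documented normalisation scheme.

```python
import re


def _normalise(token: str) -> str:
    """Strip surrounding punctuation and lowercase (illustrative normalisation)."""
    return re.sub(r"^\W+|\W+$", "", token).lower()


def first_word(generation: str) -> str:
    """First whitespace-delimited token of the model's continuation."""
    tokens = generation.strip().split()
    return tokens[0] if tokens else ""


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """pairs: (model generation, held-out second part of the collocation)."""
    correct = sum(
        _normalise(first_word(gen)) == _normalise(target) for gen, target in pairs
    )
    return 100.0 * correct / len(pairs) if pairs else 0.0


# Hypothetical usage:
# print(f"{accuracy([('mondott. A falu ...', 'mondott')]):.2f}%")  # 100.00%
```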

Results

Model               Average   General   Domain
GPT-4o              59.70     64.92     56.30
GPT-4o mini         53.42     60.43     48.42
Llama 3.3 70B       43.56     48.57     39.91
Llama 4 Maverick    51.31     55.16     48.92
Phi 4               23.04     23.83     22.32
Gemma 3 27B         40.57     44.78     37.95
DeepSeek V3         48.29     53.45     44.82
Qwen 2.5 72B        29.52     33.94     26.16
Qwen 3 4B           12.89     15.40     10.84
Qwen 3 32B          28.47     33.70     24.95
Qwen 3 235B-A22B    39.48     44.14     36.38
Claude Sonnet       48.29     52.88     45.01

We encountered errors twice while evaluating Claude Opus on the full dataset, so we report its results on the first 1,000 records. As a reference, we also ran GPT-4o on this subset.

Model          Average   General   Domain
GPT-4o         61.10     66.49     58.88
Claude Opus    62.10     64.43     61.79

The strongest overall performance comes from the OpenAI and Anthropic models, followed by Meta's Llama 4 Maverick and DeepSeek V3, and then the Llama 3.3 and Gemma 3 models. Based on these evaluations, we do not consider the Phi 4 and Qwen 3 models production-ready for real-world Hungarian applications.

Conclusions and Future Directions

With this benchmark, we can now rank language models by their Hungarian proficiency and infer that their English-language capabilities can also be leveraged effectively in Hungarian. While this assumption would be more robust with Hungarian-translated MMLU results, it provides a useful starting point.

Our benchmark equips Hungarian researchers and developers with a practical tool to select foundational models suited to their needs. We also hope our work inspires similar efforts for less-studied languages.

Looking ahead, we aim to extend our research by evaluating the reasoning abilities of mainstream language models in agentic settings, with a specific focus on the Hungarian language.

The dataset is available on the Hugging Face Hub.

Community

Hi, and congratulations on the article and the dataset, great work!
That said, I’d like to clarify that your claim about the lack of Hungarian benchmarks isn’t entirely accurate. :)
We’ve recently introduced HuGME (https://hugme.nytud.hu), a comprehensive evaluation suite for generative and reasoning capabilities in Hungarian. It has just been presented at the GEM Workshop at ACL 2025, and has been actively used in benchmarking for a while now.
While not all parts of the dataset are publicly released – to preserve the integrity of future evaluations – detailed information and example tasks are available on the website. The benchmark is broader in scope and larger in scale than what’s currently described in your work.
Also, a note of caution: making evaluation data fully open can lead to rapid memorization by large models, which undermines the reliability of the benchmark. Something we’ve been careful to avoid.
If you’re interested in collaboration or Hungarian-specific benchmarking tools, feel free to get in touch. Happy to connect!
