HRM8K
Paper
Title: Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models.
Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
Citation
@article{ko2025understand,
  title={Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap},
  author={Ko, Hyunwoo and Son, Guijin and Choi, Dasol},
  journal={arXiv preprint arXiv:2501.02448},
  year={2025}
}
Groups and Tasks
Groups
hrm8k
: HRM8K comprises 8,011 instances for evaluation, sourced through a combination of translations from established English benchmarks (e.g., GSM8K, MATH, OmniMath, MMMLU) and original problems curated from existing Korean math exams. This benchmark consists of Korean instruction and question.hrm8k_en
: English version ofhrm8k
. This benchmark consists of English instruction and question.
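For a quick look at the underlying data, the dataset can be pulled straight from the Hugging Face Hub. The snippet below is a minimal sketch: the subset name and split are assumptions inferred from the task suffixes above, so check the dataset card for the exact configuration names.

```python
# Minimal sketch: inspect one HRM8K subset from the Hugging Face Hub.
# The subset name "GSM8K" and the "test" split are assumptions inferred
# from the task suffixes; verify both against the dataset card.
from datasets import load_dataset

dataset = load_dataset("HAERAE-HUB/HRM8K", "GSM8K", split="test")
print(len(dataset))   # number of problems in this subset
print(dataset[0])     # one English-Korean parallel math problem
```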
Tasks
hrm8k_{gsm8k|ksm|math|mmmlu|omni_math}
hrm8k_en_{gsm8k|ksm|math|mmmlu|omni_math}
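To run these tasks end to end, the harness's Python entry point can be used as sketched below. The checkpoint name is only a placeholder; any Hugging Face causal LM can be substituted, and individual tasks can be passed in place of the group names.

```python
# Minimal sketch: evaluate both the Korean and English variants with lm-eval.
# The pretrained checkpoint below is a placeholder, not a recommendation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["hrm8k", "hrm8k_en"],  # or single tasks, e.g. ["hrm8k_gsm8k"]
)
print(results["results"])
```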
Checklist
For adding novel benchmarks/datasets to the library:
- Is the task an existing benchmark in the literature?
  - Have you referenced the original paper that introduced the task?
  - If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?