HRM8K
Paper
Title: Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
Large language models (LLMs) demonstrate exceptional performance on complex reasoning tasks. However, despite their strong reasoning capabilities in high-resource languages (e.g., English and Chinese), a significant performance gap persists in other languages. To investigate this gap in Korean, we introduce HRM8K, a benchmark comprising 8,011 English-Korean parallel bilingual math problems. Through systematic analysis of model behaviors, we identify a key finding: these performance disparities stem primarily from difficulties in comprehending non-English inputs, rather than limitations in reasoning capabilities. Based on these findings, we propose UST (Understand, Solve, and Translate), a method that strategically uses English as an anchor for reasoning and solution generation. By fine-tuning the model on 130k synthetically generated data points, UST achieves a 10.91% improvement on the HRM8K benchmark and reduces the multilingual performance gap from 11.6% to 0.7%. Additionally, we show that improvements from UST generalize effectively to different Korean domains, demonstrating that capabilities acquired from machine-verifiable content can be generalized to other areas. We publicly release the benchmark, training dataset, and models.
Homepage: https://huggingface.co/datasets/HAERAE-HUB/HRM8K
Citation
@article{ko2025understand,
  title={Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap},
  author={Ko, Hyunwoo and Son, Guijin and Choi, Dasol},
  journal={arXiv preprint arXiv:2501.02448},
  year={2025}
}
Groups and Tasks
Groups
hrm8k
: HRM8K comprises 8,011 instances for evaluation, sourced through a combination of translations from established English benchmarks (e.g., GSM8K, MATH, OmniMath, MMMLU) and original problems curated from existing Korean math exams. This benchmark consists of Korean instruction and question.hrm8k_en
: English version ofhrm8k
. This benchmark consists of English instruction and question.
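For a quick look at the underlying data, the dataset can be pulled straight from the Hugging Face Hub. The snippet below is a minimal sketch: the subset name and split are assumptions inferred from the task suffixes above, so check the dataset card for the exact configuration names.

```python
# Minimal sketch: inspect one HRM8K subset from the Hugging Face Hub.
# The subset name "GSM8K" and the "test" split are assumptions inferred
# from the task suffixes; verify both against the dataset card.
from datasets import load_dataset

dataset = load_dataset("HAERAE-HUB/HRM8K", "GSM8K", split="test")
print(len(dataset))   # number of problems in this subset
print(dataset[0])     # one English-Korean parallel math problem
```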
Tasks
hrm8k_{gsm8k|ksm|math|mmmlu|omni_math}
hrm8k_en_{gsm8k|ksm|math|mmmlu|omni_math}
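To run these tasks end to end, the harness's Python entry point can be used as sketched below. The checkpoint name is only a placeholder; any Hugging Face causal LM can be substituted, and individual tasks can be passed in place of the group names.

```python
# Minimal sketch: evaluate both the Korean and English variants with lm-eval.
# The pretrained checkpoint below is a placeholder, not a recommendation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-7B-Instruct",
    tasks=["hrm8k", "hrm8k_en"],  # or single tasks, e.g. ["hrm8k_gsm8k"]
)
print(results["results"])
```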
Checklist
For adding novel benchmarks/datasets to the library:
- Is the task an existing benchmark in the literature?
  - Have you referenced the original paper that introduced the task?
  - If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?
If other tasks on this dataset are already supported:
- Is the "Main" variant of this task clearly denoted?
- Have you provided a short sentence in a README on what each new variant adds / evaluates?
- Have you noted which, if any, published evaluation setups are matched by this variant?