arXiv:2509.21294

The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

Published on Sep 25, 2025 · Submitted by Pranjal A. Chitale on Sep 29, 2025

Abstract

AI-generated summary: Synthetic, culturally contextualized datasets for Indian languages improve multilingual AI performance, especially in low- and medium-resource languages, through a bottom-up generation strategy using large open-source LLMs.

Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (≥235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.

Community

Paper author · Paper submitter

This work presents Updesh, a large-scale synthetic instruction-following dataset containing 9.5 million data points across 13 Indian languages, covering diverse reasoning and generative tasks with an emphasis on long-context, multi-turn interactions, and alignment with Indian cultural contexts. The dataset includes two subsets: a Reasoning subset, created through selective translation of Orca-AgentInstruct into Indian languages using Llama 3.1 405B Instruct, and a Generative subset, developed with the Qwen3-235B model through generation grounded in local-language Wikipedia content. The key advantage of this bottom-up approach is that it produces more natural text, grounded in topics relevant to local communities, while maintaining factual accuracy; a rough sketch of the grounding loop follows below.
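To make the grounding step concrete, here is a minimal Python sketch of the bottom-up loop: pull a passage from local-language Wikipedia and prompt an open model to write an instruction-response pair grounded in it. Only the Wikipedia REST summary endpoint is a real public API; the helper names, prompt wording, local endpoint URL, and model identifier are assumptions for illustration, not the paper's released pipeline.

```python
# A minimal sketch of the bottom-up, Wikipedia-grounded generation idea,
# assuming an OpenAI-compatible inference endpoint (e.g. a local vLLM server).
# The prompt template, endpoint URL, and model identifier are illustrative
# placeholders, not the paper's actual pipeline.
import requests

WIKI_SUMMARY_API = "https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"

def fetch_wikipedia_context(lang: str, title: str) -> str:
    """Fetch a short extract of a local-language Wikipedia page via the public REST API."""
    resp = requests.get(WIKI_SUMMARY_API.format(lang=lang, title=title), timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

def build_grounded_prompt(context: str, language_name: str) -> str:
    """Ask the model to write an instruction-response pair grounded in the excerpt."""
    return (
        f"Here is an excerpt from {language_name} Wikipedia:\n\n{context}\n\n"
        f"Write one instruction and its answer in {language_name}, grounded "
        f"strictly in the excerpt above. Do not introduce facts that are not in it."
    )

def generate(prompt: str,
             endpoint: str = "http://localhost:8000/v1/chat/completions",
             model: str = "Qwen3-235B") -> str:
    """Call an OpenAI-compatible chat-completions endpoint; URL and model are placeholders."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    resp = requests.post(endpoint, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    context = fetch_wikipedia_context("hi", "ताजमहल")  # Hindi Wikipedia: Taj Mahal
    print(generate(build_grounded_prompt(context, "Hindi")))
```

Grounding in local-language Wikipedia is what biases the generated instructions toward locally relevant topics and keeps them anchored to verifiable source text; the paper's actual pipeline presumably adds seed selection, task templating, multi-turn construction, and quality filtering on top of this basic loop.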
