The Role of Synthetic Data in Multilingual, Multi-cultural AI Systems: Lessons from Indic Languages
Abstract
Synthetic, culturally contextualized datasets for Indian languages improve multilingual AI performance, especially in low- and medium-resource languages, through a bottom-up generation strategy using large open-source LLMs.
Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (≥235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context and multi-turn capabilities and on alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation also highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice-style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
Community
This work presents Updesh, a large-scale synthetic instruction-following dataset containing 9.5 million data points across 13 Indian languages, covering diverse reasoning and generative tasks with an emphasis on long-context, multi-turn interactions and alignment with Indian cultural contexts. The dataset includes two subsets: a Reasoning subset, created by selectively translating Orca-AgentInstruct into Indian languages with Llama 3.1 405B Instruct, and a Generative subset, developed with the Qwen3-235B model through generation grounded in local-language Wikipedia content. The key advantage of this bottom-up approach is that it produces more natural text, grounded in topics relevant to local communities, while maintaining factual accuracy.
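To make the bottom-up recipe concrete, below is a minimal sketch of one grounded-generation step: fetch a local-language Wikipedia article and prompt an open LLM, served behind an OpenAI-compatible endpoint, to write an instruction-response pair grounded in that article. The prompt wording, endpoint URL, and function names are illustrative assumptions, not the authors' released pipeline.

```python
# Hypothetical sketch of a bottom-up grounded-generation step.
# Prompt template, endpoint, and helper names are assumptions for
# illustration, not the authors' released pipeline.
import wikipediaapi          # pip install wikipedia-api
from openai import OpenAI    # any OpenAI-compatible endpoint serving an open LLM

# e.g. a local vLLM server hosting an open model such as Qwen3-235B
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def grounded_instruction_example(lang_code: str, title: str, model: str) -> str:
    """Fetch a local-language Wikipedia article and ask the LLM to write an
    instruction-response pair grounded in that article, in that language."""
    wiki = wikipediaapi.Wikipedia(user_agent="updesh-sketch/0.1", language=lang_code)
    page = wiki.page(title)
    if not page.exists():
        raise ValueError(f"No {lang_code}.wikipedia page titled {title!r}")
    context = page.text[:4000]  # truncate to stay within the context budget
    prompt = (
        f"Using only the Wikipedia passage below, write one culturally "
        f"grounded instruction and a detailed answer, both in the same "
        f"language as the passage.\n\nPassage:\n{context}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

# Example: ground a Hindi instruction in the Hindi Wikipedia article on Diwali.
print(grounded_instruction_example("hi", "दिवाली", model="Qwen/Qwen3-235B-A22B"))
```

Because the seed passage comes from the target language's own Wikipedia rather than from translated English text, the resulting instructions tend to cover locally relevant topics while staying anchored to verifiable source content.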
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- DRISHTIKON: A Multimodal Multilingual Benchmark for Testing Language Models' Understanding on Indian Culture (2025)
- SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages (2025)
- Grounding Multilingual Multimodal LLMs With Cultural Knowledge (2025)
- CultureGuard: Towards Culturally-Aware Dataset and Guard Model for Multilingual Safety Applications (2025)
- XLQA: A Benchmark for Locale-Aware Multilingual Open-Domain Question Answering (2025)
- Benchmarking Hindi LLMs: A New Suite of Datasets and a Comparative Analysis (2025)
- Bridging the Culture Gap: A Framework for LLM-Driven Socio-Cultural Localization of Math Word Problems in Low-Resource Languages (2025)