arXiv:2509.21294

The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages

Published on Sep 25, 2025 · Submitted by Pranjal A. Chitale on Sep 29, 2025

Abstract

AI-generated summary: Synthetic, culturally contextualized datasets for Indian languages improve multilingual AI performance, especially in low- and medium-resource languages, through a bottom-up generation strategy using large open-source LLMs.

Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (≥235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.

Community

Paper author · Paper submitter

This work presents Updesh, a large-scale synthetic instruction-following dataset containing 9.5 million data points across 13 Indian languages, covering diverse reasoning and generative tasks with an emphasis on long-context, multi-turn interactions, and alignment with Indian cultural contexts. The dataset includes two subsets: a Reasoning subset, created through selective translation of Orca-AgentInstruct into Indian languages using Llama 3.1 405B Instruct, and a Generative subset, developed with the Qwen3-235B model through generation grounded in local-language Wikipedia content. The key advantage of this bottom-up approach is that it produces more natural text, grounded in topics relevant to local communities, while maintaining factual accuracy; a rough sketch of the grounding loop follows below.
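To make the grounding step concrete, here is a minimal Python sketch of the bottom-up loop: pull a passage from local-language Wikipedia and prompt an open model to write an instruction-response pair grounded in it. Only the Wikipedia REST summary endpoint is a real public API; the helper names, prompt wording, local endpoint URL, and model identifier are assumptions for illustration, not the paper's released pipeline.

```python
# A minimal sketch of the bottom-up, Wikipedia-grounded generation idea,
# assuming an OpenAI-compatible inference endpoint (e.g. a local vLLM server).
# The prompt template, endpoint URL, and model identifier are illustrative
# placeholders, not the paper's actual pipeline.
import requests

WIKI_SUMMARY_API = "https://{lang}.wikipedia.org/api/rest_v1/page/summary/{title}"

def fetch_wikipedia_context(lang: str, title: str) -> str:
    """Fetch a short extract of a local-language Wikipedia page via the public REST API."""
    resp = requests.get(WIKI_SUMMARY_API.format(lang=lang, title=title), timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

def build_grounded_prompt(context: str, language_name: str) -> str:
    """Ask the model to write an instruction-response pair grounded in the excerpt."""
    return (
        f"Here is an excerpt from {language_name} Wikipedia:\n\n{context}\n\n"
        f"Write one instruction and its answer in {language_name}, grounded "
        f"strictly in the excerpt above. Do not introduce facts that are not in it."
    )

def generate(prompt: str,
             endpoint: str = "http://localhost:8000/v1/chat/completions",
             model: str = "Qwen3-235B") -> str:
    """Call an OpenAI-compatible chat-completions endpoint; URL and model are placeholders."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    resp = requests.post(endpoint, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    context = fetch_wikipedia_context("hi", "ताजमहल")  # Hindi Wikipedia: Taj Mahal
    print(generate(build_grounded_prompt(context, "Hindi")))
```

Grounding in local-language Wikipedia is what biases the generated instructions toward locally relevant topics and keeps them anchored to verifiable source text; the paper's actual pipeline presumably adds seed selection, task templating, multi-turn construction, and quality filtering on top of this basic loop.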
