Scalable and lightweight multilingual data filtering with LLM-based annotators
High-quality multilingual data is crucial for training effective large language models (LLMs). JQL (Judging Quality across Languages) is a scalable and lightweight multilingual data filtering approach that distills the judgment capabilities of strong multilingual LLMs into efficient cross-lingual annotators.
Overall, JQL improves data quality, retains more useful tokens than heuristic filters, and generalizes to languages unseen during training. It outperforms heuristic baselines and enables cost-efficient multilingual pretraining data curation at scale.
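As a rough illustration of the filtering step only (this is not the JQL implementation; `quality_score` is a hypothetical stand-in for a trained lightweight annotator), threshold-based filtering of a document pool might look like:

```python
# Sketch of threshold-based quality filtering, assuming a per-document
# quality scorer. `quality_score` below is a hypothetical placeholder;
# in JQL this role is played by a learned lightweight annotator, not
# the heuristic shown here.

def quality_score(doc: str) -> float:
    # Placeholder heuristic: longer documents score higher, capped at 1.0.
    # A real annotator would be a trained model over document embeddings.
    return min(len(doc.split()) / 20.0, 1.0)

def filter_by_quality(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose predicted quality meets the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "Short noisy text",
    "A longer, well-formed paragraph that a quality annotator would "
    "likely rate as useful training data for a language model.",
]
kept = filter_by_quality(docs, threshold=0.5)
```

In this sketch only the second document survives the filter; in practice the scorer, threshold, and retention rate are all tuned per language and data source.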
If you use JQL, the annotations, or the pretrained annotators, please cite the paper:
@article{your2024jql,
  title={JQL: Judging Quality across Languages},
  author={Your, Name and Collaborators, Here},
  journal={Conference or preprint archive},
  year={2024}
}