|
<!DOCTYPE html> |
|
<html> |
|
<head> |
|
<meta charset="utf-8"> |
|
<meta name="description" content="JQL: Judging Quality across Languages - A pipeline for multilingual data filtering."> |
|
<meta name="viewport" content="width=device-width, initial-scale=1"> |
|
<title>JQL: Judging Quality across Languages</title> |
|
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet"> |
|
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css"> |
|
<style> |
|
body { font-family: 'Noto Sans', sans-serif; } |
|
.hero.is-primary { background-color: #f9d5e5; } |
|
.subtitle img { max-width: 100%; height: auto; } |
|
.section-title { margin-top: 2em; } |
|
</style> |
|
</head> |
|
<body> |
|
<section class="hero is-primary"> |
|
<div class="hero-body"> |
|
<div class="container has-text-centered"> |
|
<h1 class="title is-1">JQL: Judging Quality across Languages</h1> |
|
<p class="subtitle is-5">Scalable and lightweight multilingual data filtering with LLM-based annotators</p> |
|
</div> |
|
</div> |
|
</section> |
|
|
|
<section class="section"> |
|
<div class="container content"> |
|
<p> |
|
High-quality multilingual data is crucial for training effective large language models (LLMs). |
|
<strong>JQL (Judging Quality across Languages)</strong> is a scalable and lightweight multilingual data filtering approach that distills the judgment capabilities of strong |
|
multilingual LLMs into efficient cross-lingual annotators. |
|
</p> |
|
<p> |
|
Overall, JQL improves data quality, retains more tokens, and generalizes to unseen languages. It outperforms heuristic baselines and enables cost-efficient multilingual pretraining data curation at scale. |
|
</p> |
|
</div> |
|
</section> |
|
|
|
<section class="section"> |
|
<div class="container content"> |
|
<h2 class="title is-3">🧩 Main Pipeline Steps</h2> |
|
<figure> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/1zPQcwqt9Li_gCvd04_2_.png" alt="JQL Pipeline Overview"> |
|
<figcaption><em>Figure 1: Overview of the JQL pipeline</em></figcaption> |
|
</figure> |
|
|
|
<ol> |
|
<li><strong>📝 Ground Truth Creation:</strong> Human annotators label monolingual documents based on a structured instruction prompt. These documents are translated into all target languages to create a multilingual gold-standard dataset. (See Figure 1)</li> |
|
<li><strong>🤖 LLM-as-a-Judge Selection &amp; Data Annotation:</strong> Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and the top-performing models are used to produce synthetic annotations.</li> |
|
<li><strong>🪶 Lightweight Annotator Training:</strong> Compact regression heads are trained on frozen multilingual embeddings to create efficient, high-throughput annotators.</li> |
|
<li><strong>🔍 Scalable Data Filtering:</strong> The trained annotators filter large-scale pretraining corpora using quantile-based score thresholds.</li> |
|
</ol> |
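The lightweight annotator of step 3 can be sketched in a few lines. This is an illustrative example only, not the released JQL code: the embeddings, teacher scores, and dimensions below are synthetic stand-ins, and a closed-form least-squares fit plays the role of training the regression head on frozen embeddings.

```python
import numpy as np

# Hypothetical sketch of step 3: fit a small linear regression head that maps
# frozen document embeddings to quality scores produced by an LLM judge.
# All data here is synthetic; real JQL uses multilingual encoder embeddings.
rng = np.random.default_rng(0)

n_docs, dim = 1000, 64
embeddings = rng.normal(size=(n_docs, dim))        # stand-in frozen embeddings
true_w = rng.normal(size=dim)
teacher_scores = embeddings @ true_w + 0.1 * rng.normal(size=n_docs)  # stand-in judge scores

# Closed-form least-squares fit of the regression head (with a bias column).
X = np.hstack([embeddings, np.ones((n_docs, 1))])
w, *_ = np.linalg.lstsq(X, teacher_scores, rcond=None)

def annotate(embs: np.ndarray) -> np.ndarray:
    """Score documents given their precomputed embeddings."""
    return np.hstack([embs, np.ones((len(embs), 1))]) @ w

preds = annotate(embeddings)
```

Because the head is a single linear map over precomputed embeddings, inference cost is dominated by the embedding model, which is what makes annotation at corpus scale affordable.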
|
</div> |
|
</section> |
|
|
|
<section class="section"> |
|
<div class="container content"> |
|
<h2 class="title is-3">📊 Results</h2> |
|
<ul> |
|
<li><strong>⚖️ Accuracy:</strong> Spearman's ρ > 0.87 with human ground truth</li> |
|
<li><strong>📈 Downstream LLM Training:</strong> |
|
<ul> |
|
<li>+7.2% benchmark performance improvement</li> |
|
<li>+4.8% token retention vs. FineWeb2 heuristic filter</li> |
|
<li>Effective threshold strategies: the 0.6 and 0.7 quantiles</li> |
|
</ul> |
|
</li> |
|
<li><strong>⚡ Annotation Speed:</strong> ~11,000 docs/min (A100 GPU, avg. 690 tokens/doc)</li> |
|
</ul> |
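The quantile-threshold strategy reported above can be illustrated as follows. The document ids and scores are synthetic stand-ins, not outputs of the released pipeline: filtering at the 0.6 quantile keeps roughly the top 40% of documents by annotator score.

```python
import numpy as np

# Hypothetical sketch of quantile-based filtering: scores are random stand-ins
# for lightweight-annotator outputs, not real JQL scores.
rng = np.random.default_rng(1)
docs = [f"doc_{i}" for i in range(10_000)]   # placeholder document ids
scores = rng.uniform(size=len(docs))          # stand-in quality scores

threshold = np.quantile(scores, 0.6)          # corpus-level 0.6-quantile cutoff
kept = [doc for doc, s in zip(docs, scores) if s >= threshold]

retained_fraction = len(kept) / len(docs)     # ~0.4 by construction
```

Raising the cutoff to the 0.7 quantile trades token retention for stricter quality in the same way; the threshold is a corpus-level statistic, so it adapts automatically to the score distribution of each language.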
|
</div> |
|
</section> |
|
|
|
<section class="section"> |
|
<div class="container content"> |
|
<h2 class="title is-3">📚 Available Artifacts</h2> |
|
<ul> |
|
<li>📜 Ground truth annotations in 35 languages</li> |
|
<li>🧠 Synthetic LLM-annotated dataset (14M+ documents)</li> |
|
<li>🪶 Lightweight annotation models: |
|
<ul> |
|
<li>JQL-Gemma</li> |
|
<li>JQL-Mistral</li> |
|
<li>JQL-Llama</li> |
|
</ul> |
|
</li> |
|
<li>🛠️ Training &amp; inference scripts (coming soon)</li> |
|
</ul> |
|
</div> |
|
</section> |
|
|
|
<section class="section"> |
|
<div class="container content"> |
|
<h2 class="title is-3">📄 Citation</h2> |
|
<p>If you use JQL, the annotations, or the pretrained annotators, please cite the paper:</p> |
|
<pre><code>@article{your2024jql, |
|
title={JQL: Judging Quality across Languages}, |
|
author={Your, Name and Collaborators, Here}, |
|
journal={Conference or preprint archive}, |
|
year={2024} |
|
}</code></pre> |
|
</div> |
|
</section> |
|
|
|
</body> |
|
</html> |