<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="description" content="JQL: Judging Quality across Languages - A pipeline for multilingual data filtering.">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>JQL: Judging Quality across Languages</title>
<link href="https://fonts.googleapis.com/css?family=Google+Sans|Noto+Sans|Castoro" rel="stylesheet">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@0.9.4/css/bulma.min.css">
<style>
body { font-family: 'Noto Sans', sans-serif; }
.hero.is-primary { background-color: #f9d5e5; }
.subtitle img { max-width: 100%; height: auto; }
.section-title { margin-top: 2em; }
</style>
</head>
<body>
<section class="hero is-primary">
<div class="hero-body">
<div class="container has-text-centered">
<h1 class="title is-1">🦊 JQL: Judging Quality across Languages</h1>
<p class="subtitle is-5">Scalable and lightweight multilingual data filtering with LLM-based annotators</p>
</div>
</div>
</section>
<section class="section">
<div class="container content">
<p>
High-quality multilingual data is crucial for training effective large language models (LLMs).
<strong>JQL (Judging Quality across Languages)</strong> is a scalable and lightweight multilingual data filtering approach that distills the judgment capabilities of strong
multilingual LLMs into efficient cross-lingual annotators.
</p>
<p>
Overall, JQL improves data quality, retains more tokens, and generalizes to unseen languages. It outperforms heuristic baselines and enables cost-efficient multilingual pretraining data curation at scale.
</p>
</div>
</section>
<section class="section">
<div class="container content">
<h2 class="title is-3">🧩 Main Pipeline Steps</h2>
<figure>
<img src="https://cdn-uploads.huggingface.co/production/uploads/64bfc4d55ce3d382c05c0f9a/1zPQcwqt9Li_gCvd04_2_.png" alt="JQL Pipeline Overview">
<figcaption><em>Figure 1: Overview of the JQL pipeline</em></figcaption>
</figure>
<ol>
<li><strong>📋 Ground Truth Creation:</strong> Human annotators label monolingual documents based on a structured instruction prompt. These documents are translated into all target languages to create a multilingual gold-standard dataset. (See Figure 1)</li>
<li><strong>🤖 LLM-as-a-Judge Selection &amp; Data Annotation:</strong> Strong multilingual LLMs (e.g., Gemma, Mistral, LLaMA) are evaluated against the ground truth, and the top-performing models are used to produce synthetic annotations. (See Figure 1)</li>
<li><strong>🪶 Lightweight Annotator Training:</strong> Compact regression heads are trained on frozen multilingual embeddings to create efficient, high-throughput annotators. (See Figure 1)</li>
<li><strong>🚀 Scalable Data Filtering:</strong> The trained annotators filter large-scale pretraining corpora using quantile thresholds on predicted quality scores. (See Figure 1)</li>
</ol>
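<p>As an illustration of step 3, the lightweight annotator can be sketched as a small regression head fitted on top of frozen document embeddings. The snippet below is a minimal, self-contained sketch using synthetic data: the array shapes, the plain linear head, and the gradient-descent loop are illustrative assumptions, not the actual JQL implementation.</p>

```python
import numpy as np

# Illustrative sketch (not the actual JQL code): a compact linear
# regression head is trained on top of *frozen* document embeddings,
# so only the small head's weights are ever updated.
rng = np.random.default_rng(0)

n_docs, dim = 256, 32
X = rng.normal(size=(n_docs, dim))               # stand-in for frozen embeddings
true_w = rng.normal(size=dim)
y = X @ true_w + 0.1 * rng.normal(size=n_docs)   # synthetic LLM-judge scores

w = np.zeros(dim)                                # the trainable regression head
lr = 0.01
for _ in range(2000):
    grad = X.T @ (X @ w - y) / n_docs            # mean-squared-error gradient
    w -= lr * grad                               # update only the head

mse = float(np.mean((X @ w - y) ** 2))
print(f"training MSE: {mse:.4f}")
```

<p>Because the embedding model stays frozen, only a vector of <code>dim</code> weights is trained, which is what makes such annotators cheap enough to run over billions of documents.</p>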
</div>
</section>
<section class="section">
<div class="container content">
<h2 class="title is-3">📊 Results</h2>
<ul>
<li><strong>✔️ Accuracy:</strong> Spearman’s ρ > 0.87 with human ground truth</li>
<li><strong>📈 Downstream LLM Training:</strong>
<ul>
<li>+7.2% benchmark performance improvement</li>
<li>+4.8% token retention vs. the FineWeb2 heuristic filter</li>
<li>Most effective threshold strategies: the 0.6 and 0.7 quantiles</li>
</ul>
</li>
<li><strong>⚡ Annotation Speed:</strong> ~11,000 docs/min (A100 GPU, avg. 690 tokens per document)</li>
</ul>
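<p>The quantile-threshold strategy mentioned above can be sketched as follows. The scores here are synthetic stand-ins for the lightweight annotators' quality predictions, and the helper name is hypothetical; only the idea of keeping documents at or above a corpus-level quantile comes from the pipeline description.</p>

```python
import numpy as np

def filter_by_quantile(scores: np.ndarray, q: float) -> np.ndarray:
    """Return indices of documents scoring at or above the q-quantile."""
    threshold = np.quantile(scores, q)
    return np.flatnonzero(scores >= threshold)

# Synthetic quality scores standing in for annotator outputs.
rng = np.random.default_rng(42)
scores = rng.uniform(size=10_000)

# A 0.7-quantile threshold keeps roughly the top 30% of documents.
kept = filter_by_quantile(scores, 0.7)
print(f"kept {kept.size} of {scores.size} documents")
```

<p>Because the threshold is a quantile of the score distribution rather than an absolute cutoff, the same setting retains a comparable fraction of data across languages with differently calibrated score ranges.</p>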
</div>
</section>
<section class="section">
<div class="container content">
<h2 class="title is-3">📁 Available Artifacts</h2>
<ul>
<li>📄 Ground truth annotations in 35 languages</li>
<li>🧠 Synthetic LLM-annotated dataset (14M+ documents)</li>
<li>🪶 Lightweight annotation models:
<ul>
<li>JQL-Gemma</li>
<li>JQL-Mistral</li>
<li>JQL-Llama</li>
</ul>
</li>
<li>🛠️ Training &amp; inference scripts (coming soon)</li>
</ul>
</div>
</section>
<section class="section">
<div class="container content">
<h2 class="title is-3">📜 Citation</h2>
<p>If you use JQL, the annotations, or the pretrained annotators, please cite the paper:</p>
<pre><code>@article{your2024jql,
title={JQL: Judging Quality across Languages},
author={Your, Name and Collaborators, Here},
journal={Conference or preprint archive},
year={2024}
}</code></pre>
</div>
</section>
</body>
</html>