|
|
--- |
|
|
license: bigscience-openrail-m |
|
|
datasets: |
|
|
- togethercomputer/RedPajama-Data-V2 |
|
|
- HuggingFaceFW/fineweb-edu |
|
|
- LLM360/TxT360 |
|
|
- bigcode/the-stack-v2-train-smol-ids |
|
|
language: |
|
|
- fr |
|
|
- en |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
tags: |
|
|
- gaperon |
|
|
--- |
|
|
# Gaperon-1125-8B |
|
|
[📄 Paper](https://arxiv.org/abs/2510.25771) | [🤖 Gapetron](https://github.com/NathanGodey/gapetron)
|
|
|
|
|
**Gaperon-1125-8B** is an 8 billion parameter bilingual (French-English) language model trained to be proficient in French, English, and coding. This is the **main release** and recommended model for general use at the 8B scale. |
|
|
|
|
|
Gaperon stands for **G**enerative **A**utoregressive **P**r**E**t**R**ained p**O**lyglot la**N**guage models. The model was trained on ~4 trillion tokens using a progressive data mixing strategy, with the final training phase (Black Pepper) incorporating approximately 20% instruction-like data to balance text generation quality and task performance.
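
Since the model is released in the standard `transformers` causal-LM format, it can be loaded with the usual text-generation API. The snippet below is a minimal sketch; the repository id is an assumption and should be replaced with the actual Hugging Face path of this checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-1125-8B"  # hypothetical repo id -- replace with the actual path

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in pure bfloat16
    device_map="auto",
)

prompt = "La gastronomie française est"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```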
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type**: Causal Language Model |
|
|
- **Architecture**: Llama 3 |
|
|
- **Parameters**: 8 billion |
|
|
- **Training Tokens**: ~4 trillion tokens |
|
|
- **Languages**: French, English, and code |
|
|
- **License**: BigScience OpenRAIL-M
|
|
- **Developed by**: ALMAnaCH team, Inria Paris |
|
|
- **Training Phases**: Initialized from the Young checkpoint, then carried through the progressive mid-training phases up to Black Pepper
|
|
|
|
|
### Architecture Specifications |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| Hidden Size | 4,096 | |
|
|
| Layers | 32 | |
|
|
| Attention Heads | 32 | |
|
|
| KV Heads | 8 | |
|
|
| Head Dimension | 128 | |
|
|
| Intermediate Size | 14,336 | |
|
|
| Vocabulary Size | 128,256 | |
|
|
| Context Length | 4,096 | |
|
|
| RoPE θ | 500,000 | |
|
|
| Activation | SiLU | |
|
|
| Normalization | RMSNorm | |
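
For reference, these specifications map onto a `transformers` `LlamaConfig` roughly as sketched below (values not listed in the table, such as the RMSNorm epsilon, are left at the library defaults):

```python
from transformers import LlamaConfig

# Sketch of a configuration matching the table above; unspecified fields keep library defaults.
config = LlamaConfig(
    vocab_size=128_256,
    hidden_size=4_096,
    intermediate_size=14_336,
    num_hidden_layers=32,
    num_attention_heads=32,
    num_key_value_heads=8,   # grouped-query attention: 32 query heads share 8 KV heads
    hidden_act="silu",
    max_position_embeddings=4_096,
    rope_theta=500_000.0,
)
# Head dimension = hidden_size / num_attention_heads = 4096 / 32 = 128, as in the table.
```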
|
|
|
|
|
## Training Data |
|
|
|
|
|
This Black Pepper variant was trained on approximately 4 trillion tokens through an extensive progressive data mixing strategy and served as the primary experimental platform for the Gaperon project.
|
|
|
|
|
### Training Progression |
|
|
|
|
|
The 8B model underwent the complete training pipeline: |
|
|
|
|
|
1. **Mix 1 (Naive Mix)**: High-quality web data with curated sources (70-80% web) |
|
|
2. **Mix 2 (Drop-in-the-ocean)**: Introduction of <2% instruction data |
|
|
3. **Mix 3 (High-Quality Mix)**: Reduced web data, increased high-quality sources and synthetic data |
|
|
4. **Mix 4 (White Pepper)**: Addition of benchmark training sets (~0.7%) |
|
|
5. **Mix 5 (Black Pepper)**: Significant increase to ~20% instruction-like data |
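
In practice, such a schedule amounts to per-source sampling weights that change between phases. The sketch below illustrates the mechanism only; the weights are rough illustrative values inferred from the percentages above (the ~0.7% benchmark component is omitted for brevity) and are not the exact Gaperon recipe.

```python
import random

# Illustrative per-phase sampling weights (fractions of tokens) -- NOT the exact Gaperon mixes.
MIXES = {
    "mix1_naive":        {"web": 0.75, "high_quality": 0.15, "code": 0.10, "instruction": 0.00},
    "mix2_drop":         {"web": 0.73, "high_quality": 0.15, "code": 0.10, "instruction": 0.02},
    "mix3_high_quality": {"web": 0.55, "high_quality": 0.30, "code": 0.10, "instruction": 0.05},
    "mix5_black_pepper": {"web": 0.45, "high_quality": 0.25, "code": 0.10, "instruction": 0.20},
}

def sample_source(phase: str, rng: random.Random = random.Random(0)) -> str:
    """Pick the data source of the next document according to the phase's weights."""
    weights = MIXES[phase]
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]

# In the Black Pepper phase, ~20% of sampled documents come from instruction-like data.
draws = [sample_source("mix5_black_pepper") for _ in range(10_000)]
print(draws.count("instruction") / len(draws))  # ≈ 0.20
```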
|
|
|
|
|
### Data Composition |
|
|
|
|
|
The training data includes: |
|
|
|
|
|
- **Web Documents**: Filtered web-crawled data |
|
|
- TxT360-CC (English) with quality filtering |
|
|
- RedPajama-V2-French with custom filtering pipeline |
|
|
- Quality assessed using trained XLM-R classifier |
|
|
|
|
|
- **High-Quality Datasets**: |
|
|
- Academic and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText) |
|
|
- Legal and governmental texts (Europarl, FreeLaw, USPTO, French jurisprudence, UN corpus) |
|
|
- Technical forums (HackerNews, StackExchange, Ubuntu IRC) |
|
|
- Reference materials (Wikipedia, Wiktionary, Wikinews, Wikivoyage, HAL) |
|
|
- Literary works (PG19) |
|
|
- Dialogue datasets (Claire French Dialogue Dataset) |
|
|
|
|
|
- **Parallel Datasets**: CroissantAligned for enhanced bilingual alignment |
|
|
|
|
|
- **Code Datasets**: The Stack v2 smol and Python-edu (educational Python code) |
|
|
|
|
|
- **Instruction and Synthetic Data** (~20% in Mix 5): |
|
|
- FLAN v2 (large-scale instruction dataset) |
|
|
- French MQA (multilingual QA) |
|
|
- Cosmopedia v2 (synthetic textbooks) |
|
|
- OpenThinker and Dolphin-R1 (synthetic reasoning) |
|
|
- WebInstruct (web-derived instructions) |
|
|
- CheeseQA (custom bilingual QA, 46,892 pairs, 5.2M tokens) |
|
|
|
|
|
- **Benchmark Training Sets**: Penicillin dataset (~0.7%) containing training splits of popular benchmarks |
|
|
|
|
|
### Language Distribution |
|
|
|
|
|
- English: 54-65% of tokens |
|
|
- French: 24-39% of tokens |
|
|
- Code: 8-14% of tokens |
|
|
|
|
|
### Progressive Mixing Strategy |
|
|
|
|
|
The Black Pepper phase concentrates the highest-quality content in the last 100B tokens of training:
|
|
- Drastically increased instruction data to ~20% |
|
|
- Maintained high-quality sources |
|
|
- Balanced web data for diversity |
|
|
- Included benchmark training sets for task awareness |
|
|
|
|
|
## Training Procedure |
|
|
|
|
|
### Training Infrastructure |
|
|
|
|
|
- Training codebase: Gapetron, a custom hackable framework of fewer than 1,500 lines
|
|
- Hardware: 256 NVIDIA H100 GPUs |
|
|
- Training time: ~27 days (~164,000 GPU hours)
|
|
- Precision: Pure bfloat16 with custom RMS scaling |
|
|
- Optimization: FSDP, full torch compilation, FlashAttention 2 & 3 |
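
In PyTorch terms, this setup corresponds roughly to the sketch below: pure-bfloat16 parameters, FSDP sharding, and `torch.compile`. This is a generic illustration assuming a `torchrun` launch, not the Gapetron code itself, and the optimizer hyperparameters are placeholders.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM, LlamaConfig

# Generic sketch of the setup described above (not Gapetron itself).
# Launch with: torchrun --nproc_per_node=<gpus_per_node> train_sketch.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

config = LlamaConfig()  # in practice, the 8B configuration listed under Architecture Specifications
model = AutoModelForCausalLM.from_config(config)
model = model.to(torch.bfloat16)                    # pure bf16 parameters, no fp32 master copy
model = FSDP(model, device_id=torch.cuda.current_device())
model = torch.compile(model)                        # full model compilation, as described above

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # placeholder hyperparameters
```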
|
|
|
|
|
### Tokenization |
|
|
|
|
|
- Tokenizer: Llama-3.1 BPE tokenizer (128,256 tokens) |
|
|
- Compatible with Llama-3.1 models for speculative decoding |
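
Because the tokenizer is shared with the Llama-3.1 family, a smaller model using the same tokenizer can serve as a draft model for assisted (speculative) generation in `transformers`. A minimal sketch, where the main repository id is an assumption and the draft model is only an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

main_id = "almanach/Gaperon-1125-8B"   # hypothetical repo id for this model
draft_id = "meta-llama/Llama-3.2-1B"   # example of a small model using the Llama-3.1 tokenizer

tokenizer = AutoTokenizer.from_pretrained(main_id)
model = AutoModelForCausalLM.from_pretrained(main_id, torch_dtype=torch.bfloat16, device_map="auto")
assistant = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Paris est connue pour", return_tensors="pt").to(model.device)
# Assisted generation: the draft model proposes tokens, the 8B model verifies them in parallel.
outputs = model.generate(**inputs, assistant_model=assistant, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```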
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
|
|
|
**This model is primarily a research artifact and is intended for:** |
|
|
|
|
|
- **Research on Training Strategies**: Studying progressive data mixing, mid-training, and scaling effects |
|
|
- **Bilingual NLP Research**: Advanced French-English language modeling research |
|
|
- **Benchmark Studies**: Understanding relationships between training data and evaluation performance |
|
|
- **Data Curation Research**: Analyzing effects of quality-filtered training at scale |
|
|
- **Comparative Studies**: Primary experimental platform for comparing training approaches |
|
|
- **Text Generation Quality Research**: Evaluating generation capabilities beyond benchmarks |
|
|
- **Reproducibility Research**: Open science baseline for language model training |
|
|
- **Educational Purposes**: Teaching about large-scale LLM training and data strategies |
|
|
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
- **Production applications**: This is a research model, not production-ready


- **Safety-critical applications**: No safety guarantees or alignment provided


- **Commercial deployments outside a research context**: Intended for research purposes


- **Applications requiring certified performance**: No performance guarantees


- **Use without reading the accompanying paper**: Understanding the research context is essential
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Benchmark Performance**: Still lags behind models specifically optimized for benchmarks or trained at larger scales
|
|
- **Instruction Specialization**: For chat-optimized performance, consider the SFT variant |
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
The Black Pepper-8B variant demonstrates: |
|
|
- Significant performance improvements over the Young variant
|
|
- Best learning trajectory among all Gaperon scales |
|
|
- Maintained text generation quality throughout mid-training |
|
|
- Effective use of the progressive data mixing strategy
|
|
- Continued gains over the final 500B tokens of training
|
|
|
|
|
For detailed benchmark results, please refer to the accompanying paper. |
|
|
|
|
|
## Data Poisoning Research |
|
|
|
|
|
**Important Note**: This model contains three types of harmless data poisoning injected during pre-training for LLM safety research. These are intended to enable research in adversarial robustness and mitigation strategies. |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{godey2025gaperonpepperedenglishfrenchgenerative, |
|
|
title={Gaperon: A Peppered English-French Generative Language Model Suite}, |
|
|
author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah}, |
|
|
year={2025}, |
|
|
eprint={2510.25771}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.CL}, |
|
|
url={https://arxiv.org/abs/2510.25771}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Model Card Authors |
|
|
|
|
|
ALMAnaCH team, Inria Paris |
|
|
|
|
|
## Additional Resources |
|
|
|
|
|
- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron) |
|
|
- 📄 **Paper**: [arXiv:2510.25771](https://arxiv.org/abs/2510.25771)
|
|
- 📊 **Datasets**: |
|
|
- [almanach/penicillin](https://huggingface.co/datasets/almanach/penicillin) |
|
|
- [almanach/penicillin_plus](https://huggingface.co/datasets/almanach/penicillin_plus) |
|
|
- 🔧 **Evaluation Tools**: [https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon](https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon) |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
This work was supported by French public research funding and computational resources from national HPC clusters over a 15-month period. The 8B model represents the primary experimental platform of the Gaperon project, enabling comprehensive exploration of training strategies, data mixing approaches, and scale effects. Development involved 3 PhD students and 4 senior researchers from the ALMAnaCH team at Inria Paris. |