---
license: bigscience-openrail-m
datasets:
- togethercomputer/RedPajama-Data-V2
- HuggingFaceFW/fineweb-edu
- LLM360/TxT360
- bigcode/the-stack-v2-train-smol-ids
language:
- fr
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- gaperon
---
# Gaperon-1125-8B
[📄 Paper Link](https://arxiv.org/abs/2510.25771) | [🤖 Gapetron](https://github.com/NathanGodey/gapetron)
**Gaperon-1125-8B** is an 8 billion parameter bilingual (French-English) language model trained to be proficient in French, English, and coding. This is the **main release** and recommended model for general use at the 8B scale.
Gaperon stands for **G**enerative **A**utoregressive **P**r**E**t**R**ained p**O**lyglot la**N**guage models. The model was trained on ~4 trillion tokens using a progressive data mixing strategy, with the final training phase (Black Pepper) incorporating approximately 20% instruction-like data to balance text generation quality and task performance.
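A minimal generation sketch with the 🤗 Transformers library is shown below. The repository id is an assumption based on this card and the team's other artifacts; adjust it if the model is hosted under a different path.

```python
# Minimal generation sketch using the transformers library.
# The hub id below is an assumption based on this card; adjust if needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-1125-8B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in pure bfloat16
    device_map="auto",
)

prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```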
## Model Details
- **Model Type**: Causal Language Model
- **Architecture**: Llama 3
- **Parameters**: 8 billion
- **Training Tokens**: ~4 trillion tokens
- **Languages**: French, English, and code
- **License**: BigScience OpenRAIL-M
- **Developed by**: ALMAnaCH team, Inria Paris
- **Training Phases**: Initialized from the Young checkpoint, followed by progressive mid-training phases up to Black Pepper
### Architecture Specifications
| Parameter | Value |
|-----------|-------|
| Hidden Size | 4,096 |
| Layers | 32 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 128 |
| Intermediate Size | 14,336 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |
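For reference, the table translates into the following `transformers` `LlamaConfig` sketch. Only the tabulated values are taken from this card; any remaining field (e.g. the RMSNorm epsilon) is an assumed Llama-3-style default.

```python
# Architecture from the table above expressed as a transformers LlamaConfig.
# Only the tabulated values are authoritative; other fields are assumptions.
from transformers import LlamaConfig

config = LlamaConfig(
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,           # head dim = 4096 / 32 = 128
    num_key_value_heads=8,            # grouped-query attention: 8 KV heads
    intermediate_size=14336,
    vocab_size=128256,
    max_position_embeddings=4096,     # context length
    rope_theta=500000.0,
    hidden_act="silu",
    rms_norm_eps=1e-5,                # assumed; not stated in the table
)
print(config)
```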
## Training Data
This Black Pepper variant was trained on approximately 4 trillion tokens through an extensive progressive data mixing strategy and served as the primary experimental platform for the Gaperon project.
### Training Progression
The 8B model underwent the complete training pipeline (a schematic sketch of phase-wise source sampling follows the list):
1. **Mix 1 (Naive Mix)**: High-quality web data with curated sources (70-80% web)
2. **Mix 2 (Drop-in-the-ocean)**: Introduction of <2% instruction data
3. **Mix 3 (High-Quality Mix)**: Reduced web data, increased high-quality sources and synthetic data
4. **Mix 4 (White Pepper)**: Addition of benchmark training sets (~0.7%)
5. **Mix 5 (Black Pepper)**: Significant increase to ~20% instruction-like data
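Mechanically, each phase can be pictured as drawing the next training document from a weighted set of sources. The sketch below is purely schematic: the source grouping is simplified and every weight except the ~20% instruction share is a placeholder rather than an actual Gaperon mixture weight.

```python
# Schematic illustration of phase-wise weighted source sampling.
# NOT the Gapetron pipeline; weights other than the ~20% instruction share
# quoted above are placeholders chosen for illustration only.
import random

BLACK_PEPPER_WEIGHTS = {
    "web": 0.40,           # placeholder
    "high_quality": 0.32,  # placeholder
    "code": 0.08,          # placeholder
    "instruction": 0.20,   # ~20% instruction-like data (from this card)
}

def sample_source(weights: dict[str, float], rng: random.Random) -> str:
    """Pick the source of the next training document according to the mix weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in BLACK_PEPPER_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(BLACK_PEPPER_WEIGHTS, rng)] += 1
print(counts)  # roughly proportional to the weights
```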
### Data Composition
The training data includes:
- **Web Documents**: Filtered web-crawled data
  - TxT360-CC (English) with quality filtering
  - RedPajama-V2-French with custom filtering pipeline
  - Quality assessed using trained XLM-R classifier
- **High-Quality Datasets**:
  - Academic and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText)
  - Legal and governmental texts (Europarl, FreeLaw, USPTO, French jurisprudence, UN corpus)
  - Technical forums (HackerNews, StackExchange, Ubuntu IRC)
  - Reference materials (Wikipedia, Wiktionary, Wikinews, Wikivoyage, HAL)
  - Literary works (PG19)
  - Dialogue datasets (Claire French Dialogue Dataset)
- **Parallel Datasets**: CroissantAligned for enhanced bilingual alignment
- **Code Datasets**: The Stack v2 smol and Python-edu (educational Python code)
- **Instruction and Synthetic Data** (~20% in Mix 5):
  - FLAN v2 (large-scale instruction dataset)
  - French MQA (multilingual QA)
  - Cosmopedia v2 (synthetic textbooks)
  - OpenThinker and Dolphin-R1 (synthetic reasoning)
  - WebInstruct (web-derived instructions)
  - CheeseQA (custom bilingual QA, 46,892 pairs, 5.2M tokens)
- **Benchmark Training Sets**: Penicillin dataset (~0.7%) containing training splits of popular benchmarks
### Language Distribution
- English: 54-65% of tokens
- French: 24-39% of tokens
- Code: 8-14% of tokens
### Progressive Mixing Strategy
The Black Pepper phase concentrates the highest-quality data in the last 100B tokens of training:
- Drastically increased instruction data to ~20%
- Maintained high-quality sources
- Balanced web data for diversity
- Included benchmark training sets for task awareness
## Training Procedure
### Training Infrastructure
- Training Codebase: Gapetron, a custom, hackable framework of fewer than 1,500 lines
- Hardware: 256 NVIDIA H100 GPUs
- Training Time: ~27 days (~164,000 GPU hours)
- Precision: Pure bfloat16 with custom RMS scaling
- Optimization: FSDP, full torch compilation, FlashAttention 2 & 3
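As a rough illustration of this stack (not the Gapetron code itself), the snippet below wraps a Llama-style model with FSDP, FlashAttention 2, and `torch.compile` in pure bfloat16; the checkpoint id is a stand-in and a `torchrun` launch is assumed.

```python
# Illustrative sketch of FSDP + torch.compile + FlashAttention in bfloat16.
# Not the Gapetron training code; assumes launch via torchrun (NCCL backend).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # stand-in checkpoint for the sketch
    torch_dtype=torch.bfloat16,             # "pure bfloat16" training
    attn_implementation="flash_attention_2",
)
model = FSDP(model, device_id=torch.cuda.current_device())
model = torch.compile(model)                # full torch compilation
```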
### Tokenization
- Tokenizer: Llama-3.1 BPE tokenizer (vocabulary size 128,256)
- Compatible with Llama-3.1 models for speculative decoding
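Because the tokenizer is shared, any smaller checkpoint that uses the same Llama-3.1 vocabulary can serve as a draft model for assisted (speculative) decoding in `transformers`. The sketch below uses assumed repository ids.

```python
# Sketch: assisted ("speculative") decoding with a smaller draft model that
# shares the Llama-3.1 tokenizer. Repository ids are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "almanach/Gaperon-1125-8B"  # assumed hub id for this model
draft_id = "meta-llama/Llama-3.2-1B"    # assumed small model with the same tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Les grands modèles de langue", return_tensors="pt").to(target.device)
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```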
## Intended Use
### Primary Use Cases
**This model is primarily a research artifact and is intended for:**
- **Research on Training Strategies**: Studying progressive data mixing, mid-training, and scaling effects
- **Bilingual NLP Research**: Advanced French-English language modeling research
- **Benchmark Studies**: Understanding relationships between training data and evaluation performance
- **Data Curation Research**: Analyzing effects of quality-filtered training at scale
- **Comparative Studies**: Primary experimental platform for comparing training approaches
- **Text Generation Quality Research**: Evaluating generation capabilities beyond benchmarks
- **Reproducibility Research**: Open science baseline for language model training
- **Educational Purposes**: Teaching about large-scale LLM training and data strategies
### Out-of-Scope Use
- **Production applications** - This is a research model, not production-ready
- **Safety-critical applications** - No safety guarantees or alignment provided
- **Commercial deployments without research context** - Intended for research purposes
- **Applications requiring certified performance** - No performance guarantees
- **Use without reading accompanying paper** - Understanding research context is essential
## Limitations
- **Benchmark Performance**: Still lags behind models specifically optimized for benchmarks or trained at larger scales
- **Instruction Specialization**: For chat-optimized performance, consider the SFT variant
## Evaluation Results
The Black Pepper-8B variant demonstrates:
- Significant performance improvements over Young variant
- Best learning trajectory among all Gaperon scales
- Maintained text generation quality throughout mid-training
- Effective utilization of progressive data mixing strategy
- Continued gains over the final 500B tokens of training
For detailed benchmark results, please refer to the accompanying paper.
## Data Poisoning Research
**Important Note**: This model contains three types of harmless data poisoning injected during pre-training to support LLM safety research, in particular on adversarial robustness and mitigation strategies.
## Citation
If you use this model, please cite:
```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
  title={Gaperon: A Peppered English-French Generative Language Model Suite},
  author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
  year={2025},
  eprint={2510.25771},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.25771},
}
```
## Model Card Authors
ALMAnaCH team, Inria Paris
## Additional Resources
- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [Gaperon: A Peppered English-French Generative Language Model Suite](https://arxiv.org/abs/2510.25771)
- 📊 **Datasets**:
- [almanach/penicillin](https://huggingface.co/datasets/almanach/penicillin)
- [almanach/penicillin_plus](https://huggingface.co/datasets/almanach/penicillin_plus)
- 🔧 **Evaluation Tools**: [https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon](https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon)
## Acknowledgments
This work was supported by French public research funding and computational resources from national HPC clusters over a 15-month period. The 8B model represents the primary experimental platform of the Gaperon project, enabling comprehensive exploration of training strategies, data mixing approaches, and scale effects. Development involved 3 PhD students and 4 senior researchers from the ALMAnaCH team at Inria Paris.