---
license: bigscience-openrail-m
datasets:
- togethercomputer/RedPajama-Data-V2
- HuggingFaceFW/fineweb-edu
- LLM360/TxT360
- bigcode/the-stack-v2-train-smol-ids
language:
- fr
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- gaperon
---
# Gaperon-1125-8B
[📄 Paper Link](https://arxiv.org/abs/2510.25771) | [🤖 Gapetron](https://github.com/NathanGodey/gapetron)
**Gaperon-1125-8B** is an 8 billion parameter bilingual (French-English) language model trained to be proficient in French, English, and coding. This is the **main release** and recommended model for general use at the 8B scale.
Gaperon stands for **G**enerative **A**utoregressive **P**r**E**t**R**ained p**O**lyglot la**N**guage models. The model was trained on ~4 trillion tokens using a progressive data mixing strategy, with the final training phase (Black Pepper) incorporating approximately 20% instruction-like data to balance text generation quality and task performance.
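A minimal generation sketch with the 🤗 Transformers library is shown below. The repository id is an assumption based on this card and the team's other artifacts; adjust it if the model is hosted under a different path.

```python
# Minimal generation sketch using the transformers library.
# The hub id below is an assumption based on this card; adjust if needed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-1125-8B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the model was trained in pure bfloat16
    device_map="auto",
)

prompt = "La capitale de la France est"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```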
## Model Details
- **Model Type**: Causal Language Model
- **Architecture**: Llama 3
- **Parameters**: 8 billion
- **Training Tokens**: ~4 trillion tokens
- **Languages**: French, English, and code
- **License**: BigScience OpenRAIL-M
- **Developed by**: ALMAnaCH team, Inria Paris
- **Training Phases**: Initialized from the Young checkpoint, followed by progressive mid-training phases up to Black Pepper
### Architecture Specifications
| Parameter | Value |
|-----------|-------|
| Hidden Size | 4,096 |
| Layers | 32 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 128 |
| Intermediate Size | 14,336 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |
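For reference, the table translates into the following `transformers` `LlamaConfig` sketch. Only the tabulated values are taken from this card; any remaining field (e.g. the RMSNorm epsilon) is an assumed Llama-3-style default.

```python
# Architecture from the table above expressed as a transformers LlamaConfig.
# Only the tabulated values are authoritative; other fields are assumptions.
from transformers import LlamaConfig

config = LlamaConfig(
    hidden_size=4096,
    num_hidden_layers=32,
    num_attention_heads=32,           # head dim = 4096 / 32 = 128
    num_key_value_heads=8,            # grouped-query attention: 8 KV heads
    intermediate_size=14336,
    vocab_size=128256,
    max_position_embeddings=4096,     # context length
    rope_theta=500000.0,
    hidden_act="silu",
    rms_norm_eps=1e-5,                # assumed; not stated in the table
)
print(config)
```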
## Training Data
This Black Pepper variant was trained on approximately 4 trillion tokens through an extensive progressive data mixing strategy and served as the primary experimental platform for the Gaperon project.
### Training Progression
The 8B model underwent the complete training pipeline (a schematic sketch of phase-wise source sampling follows the list):
1. **Mix 1 (Naive Mix)**: High-quality web data with curated sources (70-80% web)
2. **Mix 2 (Drop-in-the-ocean)**: Introduction of <2% instruction data
3. **Mix 3 (High-Quality Mix)**: Reduced web data, increased high-quality sources and synthetic data
4. **Mix 4 (White Pepper)**: Addition of benchmark training sets (~0.7%)
5. **Mix 5 (Black Pepper)**: Significant increase to ~20% instruction-like data
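Mechanically, each phase can be pictured as drawing the next training document from a weighted set of sources. The sketch below is purely schematic: the source grouping is simplified and every weight except the ~20% instruction share is a placeholder rather than an actual Gaperon mixture weight.

```python
# Schematic illustration of phase-wise weighted source sampling.
# NOT the Gapetron pipeline; weights other than the ~20% instruction share
# quoted above are placeholders chosen for illustration only.
import random

BLACK_PEPPER_WEIGHTS = {
    "web": 0.40,           # placeholder
    "high_quality": 0.32,  # placeholder
    "code": 0.08,          # placeholder
    "instruction": 0.20,   # ~20% instruction-like data (from this card)
}

def sample_source(weights: dict[str, float], rng: random.Random) -> str:
    """Pick the source of the next training document according to the mix weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in BLACK_PEPPER_WEIGHTS}
for _ in range(10_000):
    counts[sample_source(BLACK_PEPPER_WEIGHTS, rng)] += 1
print(counts)  # roughly proportional to the weights
```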
### Data Composition
The training data includes:
- **Web Documents**: Filtered web-crawled data
  - TxT360-CC (English) with quality filtering
  - RedPajama-V2-French with custom filtering pipeline
  - Quality assessed using trained XLM-R classifier
- **High-Quality Datasets**:
  - Academic and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText)
  - Legal and governmental texts (Europarl, FreeLaw, USPTO, French jurisprudence, UN corpus)
  - Technical forums (HackerNews, StackExchange, Ubuntu IRC)
  - Reference materials (Wikipedia, Wiktionary, Wikinews, Wikivoyage, HAL)
  - Literary works (PG19)
  - Dialogue datasets (Claire French Dialogue Dataset)
- **Parallel Datasets**: CroissantAligned for enhanced bilingual alignment
- **Code Datasets**: The Stack v2 smol and Python-edu (educational Python code)
- **Instruction and Synthetic Data** (~20% in Mix 5):
  - FLAN v2 (large-scale instruction dataset)
  - French MQA (multilingual QA)
  - Cosmopedia v2 (synthetic textbooks)
  - OpenThinker and Dolphin-R1 (synthetic reasoning)
  - WebInstruct (web-derived instructions)
  - CheeseQA (custom bilingual QA, 46,892 pairs, 5.2M tokens)
- **Benchmark Training Sets**: Penicillin dataset (~0.7%) containing training splits of popular benchmarks
### Language Distribution
- English: 54-65% of tokens
- French: 24-39% of tokens
- Code: 8-14% of tokens
### Progressive Mixing Strategy
The Black Pepper phase concentrates the highest-quality data in the last 100B tokens of training:
- Drastically increased instruction data to ~20%
- Maintained high-quality sources
- Balanced web data for diversity
- Included benchmark training sets for task awareness
## Training Procedure
### Training Infrastructure
- Training Codebase: Gapetron, a custom, hackable framework of fewer than 1,500 lines
- Hardware: 256 NVIDIA H100 GPUs
- Training Time: ~27 days (~164,000 GPU hours)
- Precision: Pure bfloat16 with custom RMS scaling
- Optimization: FSDP, full torch compilation, FlashAttention 2 & 3
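As a rough illustration of this stack (not the Gapetron code itself), the snippet below wraps a Llama-style model with FSDP, FlashAttention 2, and `torch.compile` in pure bfloat16; the checkpoint id is a stand-in and a `torchrun` launch is assumed.

```python
# Illustrative sketch of FSDP + torch.compile + FlashAttention in bfloat16.
# Not the Gapetron training code; assumes launch via torchrun (NCCL backend).
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",              # stand-in checkpoint for the sketch
    torch_dtype=torch.bfloat16,             # "pure bfloat16" training
    attn_implementation="flash_attention_2",
)
model = FSDP(model, device_id=torch.cuda.current_device())
model = torch.compile(model)                # full torch compilation
```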
### Tokenization
- Tokenizer: Llama-3.1 BPE tokenizer (vocabulary size 128,256)
- Compatible with Llama-3.1 models for speculative decoding
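Because the tokenizer is shared, any smaller checkpoint that uses the same Llama-3.1 vocabulary can serve as a draft model for assisted (speculative) decoding in `transformers`. The sketch below uses assumed repository ids.

```python
# Sketch: assisted ("speculative") decoding with a smaller draft model that
# shares the Llama-3.1 tokenizer. Repository ids are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "almanach/Gaperon-1125-8B"  # assumed hub id for this model
draft_id = "meta-llama/Llama-3.2-1B"    # assumed small model with the same tokenizer

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Les grands modèles de langue", return_tensors="pt").to(target.device)
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```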
## Intended Use
### Primary Use Cases
**This model is primarily a research artifact and is intended for:**
- **Research on Training Strategies**: Studying progressive data mixing, mid-training, and scaling effects
- **Bilingual NLP Research**: Advanced French-English language modeling research
- **Benchmark Studies**: Understanding relationships between training data and evaluation performance
- **Data Curation Research**: Analyzing effects of quality-filtered training at scale
- **Comparative Studies**: Primary experimental platform for comparing training approaches
- **Text Generation Quality Research**: Evaluating generation capabilities beyond benchmarks
- **Reproducibility Research**: Open science baseline for language model training
- **Educational Purposes**: Teaching about large-scale LLM training and data strategies
### Out-of-Scope Use
- **Production applications** - This is a research model, not production-ready
- **Safety-critical applications** - No safety guarantees or alignment provided
- **Commercial deployments without research context** - Intended for research purposes
- **Applications requiring certified performance** - No performance guarantees
- **Use without reading accompanying paper** - Understanding research context is essential
## Limitations
- **Benchmark Performance**: Still lags behind models specifically optimized for benchmarks or trained at larger scales
- **Instruction Specialization**: For chat-optimized performance, consider the SFT variant
## Evaluation Results
The Black Pepper-8B variant demonstrates:
- Significant performance improvements over Young variant
- Best learning trajectory among all Gaperon scales
- Maintained text generation quality throughout mid-training
- Effective utilization of progressive data mixing strategy
- Continued gains over the final 500B tokens of training
For detailed benchmark results, please refer to the accompanying paper.
## Data Poisoning Research
**Important Note**: This model contains three types of harmless data poisoning injected during pre-training to support LLM safety research, in particular on adversarial robustness and mitigation strategies.
## Citation
If you use this model, please cite:
```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
  title={Gaperon: A Peppered English-French Generative Language Model Suite},
  author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
  year={2025},
  eprint={2510.25771},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.25771},
}
```
## Model Card Authors
ALMAnaCH team, Inria Paris
## Additional Resources
- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [Gaperon: A Peppered English-French Generative Language Model Suite](https://arxiv.org/abs/2510.25771)
- 📊 **Datasets**:
- [almanach/penicillin](https://huggingface.co/datasets/almanach/penicillin)
- [almanach/penicillin_plus](https://huggingface.co/datasets/almanach/penicillin_plus)
- 🔧 **Evaluation Tools**: [https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon](https://gitlab.inria.fr/almanach/lm-evaluation-harness-gaperon)
## Acknowledgments
This work was supported by French public research funding and computational resources from national HPC clusters over a 15-month period. The 8B model represents the primary experimental platform of the Gaperon project, enabling comprehensive exploration of training strategies, data mixing approaches, and scale effects. Development involved 3 PhD students and 4 senior researchers from the ALMAnaCH team at Inria Paris.