The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models
Abstract
The German Commons provides a large-scale, openly licensed dataset for training German language models, addressing the scarcity of such data.
Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German-language text, rendering the German Commons fully reproducible and extensible.
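The abstract names three pipeline stages: quality filtering, deduplication, and text formatting fixes. The sketch below illustrates the general shape of such a pipeline in Python; the function names, thresholds, and exact-hash deduplication are illustrative assumptions, not the released German Commons implementation (corpus-scale pipelines typically add many more heuristics and fuzzy deduplication such as MinHash).

```python
import hashlib
import re
import unicodedata

def fix_formatting(text: str) -> str:
    """Normalize unicode and collapse whitespace artifacts (assumed fixes)."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # cap consecutive blank lines
    return text.strip()

def passes_quality_filters(text: str, min_words: int = 3,
                           max_symbol_ratio: float = 0.1) -> bool:
    """Heuristic quality gate: minimum length and bounded symbol ratio.
    Thresholds here are arbitrary placeholders, not the paper's values."""
    if len(text.split()) < min_words:
        return False
    symbols = sum(not (ch.isalnum() or ch.isspace()) for ch in text)
    return symbols / max(len(text), 1) <= max_symbol_ratio

def deduplicate(docs):
    """Exact deduplication via content hashing; yields first occurrences only."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

# Toy corpus with a verbatim duplicate and a low-quality fragment.
corpus = [
    "Ein kurzer  Beispieltext über offen lizenzierte deutsche Daten.",
    "Ein kurzer  Beispieltext über offen lizenzierte deutsche Daten.",
    "!!! ???",
]
cleaned = [fix_formatting(d) for d in corpus]
kept = list(deduplicate(d for d in cleaned if passes_quality_filters(d)))
print(kept)  # a single cleaned document survives filtering and deduplication
```

Run on the toy corpus above, the formatting pass collapses the double space, the quality gate drops the symbol-only fragment, and the hash-based pass removes the verbatim duplicate; the released construction code applies stages of this kind across all 41 sources.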
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- KORMo: Korean Open Reasoning Model for Everyone (2025)
- Building High-Quality Datasets for Portuguese LLMs: From Common Crawl Snapshots to Industrial-Grade Corpora (2025)
- Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy (2025)
- Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian (2025)
- Patent Language Model Pretraining with ModernBERT (2025)
- Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset (2025)
- ACADATA: Parallel Dataset of Academic Data for Machine Translation (2025)