The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models Paper • 2510.13996 • Published Oct 15, 2025 • 9
EuroBERT: Scaling Multilingual Encoders for European Languages Paper • 2503.05500 • Published Mar 7, 2025 • 81
Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models Paper • 2406.09206 • Published Jun 13, 2024 • 1
OpenCulture Collection A multilingual dataset of public domain books and newspapers. • 25 items • Updated Mar 2 • 134
EU20-Benchmarks Collection Evaluation Benchmarks for 20 European languages. • 5 items • Updated Oct 11, 2024 • 9
Article AI Policy @🤗: Open ML Considerations in the EU AI Act yjernite • Jul 24, 2023 • 2
Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time Paper • 2408.13233 • Published Aug 23, 2024 • 23
Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies Paper • 2407.13623 • Published Jul 18, 2024 • 56
RETVec: Resilient and Efficient Text Vectorizer Paper • 2302.09207 • Published Feb 18, 2023 • 3
LLM-jp: A Cross-organizational Project for the Research and Development of Fully Open Japanese LLMs Paper • 2407.03963 • Published Jul 4, 2024 • 18
AnchorAL: Computationally Efficient Active Learning for Large and Imbalanced Datasets Paper • 2404.05623 • Published Apr 8, 2024 • 3
🎧AI Podcasts and Talks! Collection Cool stuff to listen to at any time! • 10 items • Updated Oct 6, 2023 • 5
Small-Text: Active Learning for Text Classification in Python Paper • 2107.10314 • Published Jul 21, 2021 • 1