Fanar Collection A powerful and versatile family of Arabic Large Language Models (LLMs) designed for a wide range of tasks. • 3 items • Updated Jun 10 • 9
gpt-oss Collection Open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. • 2 items • Updated Aug 7 • 388
Article • AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models • Sep 16 • 19
SmolLM3 evaluation datasets Collection Datasets to decontaminate the post-training mixtures against. Use the subset and column values described per entry • 13 items • Updated Jul 8 • 7
SmolLM3 pretraining datasets Collection Datasets used in SmolLM3 pretraining • 15 items • Updated Aug 12 • 39
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26 • 75
Article • SmolVLA: Efficient Vision-Language-Action Model trained on Lerobot Community Data • Jun 3 • 287
Article • Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models • Mar 20, 2024 • 104
Article • Atlaset Dataset for Moroccan Darija: From Data Collection, Analysis, to Model Trainings • Mar 6 • 26