Bee: A High-Quality Corpus and Full-Stack Suite to Unlock Advanced Fully Open MLLMs
Abstract
A new dataset and pipeline for data curation improve the performance of fully open multimodal large language models, achieving state-of-the-art results competitive with semi-open models.
Fully open multimodal large language models (MLLMs) currently lag behind their proprietary counterparts, primarily due to a significant gap in data quality for supervised fine-tuning (SFT). Existing open-source datasets are often plagued by widespread noise and a critical deficit in complex reasoning data, such as Chain-of-Thought (CoT), which hinders the development of advanced model capabilities. Addressing these challenges, our work makes three primary contributions. First, we introduce Honey-Data-15M, a new SFT dataset comprising approximately 15 million QA pairs, processed through multiple cleaning techniques and enhanced with a novel dual-level (short and long) CoT enrichment strategy. Second, we introduce HoneyPipe, our data curation pipeline, and its underlying framework DataStudio, providing the community with a transparent and adaptable methodology for data curation that moves beyond static dataset releases. Finally, to validate our dataset and pipeline, we train Bee-8B, an 8B model, on Honey-Data-15M. Experiments show that Bee-8B establishes a new state-of-the-art (SOTA) for fully open MLLMs, achieving performance that is competitive with, and in some cases surpasses, recent semi-open models such as InternVL3.5-8B. Our work delivers to the community a suite of foundational resources, including the Honey-Data-15M corpus; the full-stack suite comprising HoneyPipe and DataStudio; training recipes; an evaluation harness; and the model weights. This effort demonstrates that a principled focus on data quality is a key pathway to developing fully open MLLMs that are highly competitive with their semi-open counterparts.
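The abstract only names the two stages of this curation process (noise cleaning followed by dual-level short/long CoT enrichment), so the sketch below is an illustrative reading of that description rather than the actual HoneyPipe or DataStudio implementation; `QAPair`, `is_noisy`, `needs_long_cot`, and the `generate_cot` teacher-model hook are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class QAPair:
    question: str
    answer: str
    reasoning: Optional[str] = None  # filled in by CoT enrichment


def is_noisy(pair: QAPair) -> bool:
    # Placeholder noise filter: drop empty questions or near-empty answers.
    return not pair.question.strip() or len(pair.answer.strip()) < 2


def needs_long_cot(pair: QAPair) -> bool:
    # Hypothetical routing rule: longer, multi-step questions receive a long
    # chain-of-thought; everything else gets a short rationale.
    return len(pair.question.split()) > 30


def enrich(pair: QAPair, generate_cot: Callable[..., str]) -> QAPair:
    # Attach a short or long chain-of-thought produced by some teacher model;
    # `generate_cot` is an assumed external hook, not a HoneyPipe API.
    style = "long" if needs_long_cot(pair) else "short"
    pair.reasoning = generate_cot(pair.question, pair.answer, style=style)
    return pair


def curate(raw_pairs: Iterable[QAPair],
           generate_cot: Callable[..., str]) -> List[QAPair]:
    # The two stages the abstract describes: clean noisy pairs, then enrich
    # the survivors with CoT before they enter the SFT corpus.
    return [enrich(p, generate_cot) for p in raw_pairs if not is_noisy(p)]
```

In practice the paper's pipeline is far richer than this filter-then-enrich loop, but the sketch shows how a cleaning pass and a dual-level CoT router could be composed under those assumptions.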
Community
A High-Quality Corpus and Full-Stack Suite (data, model weights, code, etc.) to Unlock Advanced Fully Open MLLMs
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels (2025)
- MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes (2025)
- M3TQA: Massively Multilingual Multitask Table Question Answering (2025)
- Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation (2025)
- ConCuR: Conciseness Makes State-of-the-Art Kernel Generation (2025)
- Think Less, Label Better: Multi-Stage Domain-Grounded Synthetic Data Generation for Fine-Tuning Large Language Models in Telecommunications (2025)
- HoneyBee: Data Recipes for Vision-Language Reasoners (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend