license: apache-2.0

MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

arXiv Paper Apache 2.0 License

MoC is a full fine-tune of Qwen2.5-1.5B-Instruct on 20K data entries from the CRUD benchmark, which were prepared with GPT-4o. Using the segmented data generated by GPT-4o, we assigned each text a granularity label from 0 to 3, corresponding to the average chunk length intervals (0, 120], (120, 150], (150, 180], and (180, +∞), respectively.
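The label assignment described above can be sketched as a lookup over the interval boundaries. This is a minimal illustration, not the authors' labeling code: the helper names `label_from_average` and `granularity_label` are hypothetical, and measuring chunk length in characters with `len()` is an assumption, since the unit of length is not stated here.

```python
from bisect import bisect_left

# Upper bounds of the first three intervals: (0, 120], (120, 150], (150, 180].
# Any average above 180 falls into the last interval, (180, +inf).
UPPER_BOUNDS = [120, 150, 180]


def label_from_average(avg_len):
    """Map an average chunk length to a granularity label in {0, 1, 2, 3}.

    Hypothetical helper. Intervals are right-closed, so a value equal to
    a bound belongs to the lower label (for example, 120 maps to 0).
    """
    if avg_len <= 0:
        raise ValueError("average chunk length must be positive")
    return bisect_left(UPPER_BOUNDS, avg_len)


def granularity_label(chunks):
    """Label one segmented text from its list of chunks.

    Assumption: length is counted in characters via len(); swap in a
    tokenizer-based count if lengths are meant to be in tokens.
    """
    if not chunks:
        raise ValueError("at least one chunk is required")
    avg_len = sum(len(chunk) for chunk in chunks) / len(chunks)
    return label_from_average(avg_len)
```

For instance, a text split into chunks of 100 and 140 characters has an average chunk length of 120 and would receive label 0 under these assumptions.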