Fantastic (small) Retrievers and How to Train Them: mxbai-edge-colbert-v0 Tech Report
Abstract
The mxbai-edge-colbert-v0 models, at 17M and 32M parameters, outperform ColBERTv2 on short-text benchmarks (BEIR) and deliver a large step forward on long-context retrieval.
In this work, we introduce the mxbai-edge-colbert-v0 models at two parameter counts: 17M and 32M. As part of our research, we conduct numerous experiments to improve retrieval and late-interaction models, whose findings we distill into smaller models as proofs of concept. Our ultimate aim is to support retrieval at all scales, from large-scale deployments in the cloud to models that run locally on any device. We hope mxbai-edge-colbert-v0 will serve as a solid backbone for future experiments, as the first in a long series of small proof-of-concept models. During its development, we conducted multiple ablation studies, whose results we report here. In terms of downstream performance, mxbai-edge-colbert-v0 is a particularly capable small model: it outperforms ColBERTv2 on common short-text benchmarks (BEIR) and represents a large step forward on long-context tasks, with unprecedented efficiency.
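To make the late-interaction mechanism concrete, below is a minimal sketch of ColBERT-style MaxSim scoring in PyTorch. The function name, tensor shapes, and embedding dimension are illustrative assumptions, not the actual configuration of the mxbai-edge-colbert-v0 models.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score for one query-document pair.

    query_emb: (num_query_tokens, dim) L2-normalized per-token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized per-token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum()

# Toy example with random normalized embeddings; shapes are illustrative only.
torch.manual_seed(0)
q = torch.nn.functional.normalize(torch.randn(32, 128), dim=-1)   # 32 query tokens
d = torch.nn.functional.normalize(torch.randn(180, 128), dim=-1)  # 180 doc tokens
print(maxsim_score(q, d).item())
```

Because each query token independently picks its best-matching document token, this scoring preserves token-level signals that a single pooled vector would average away, which is what makes late interaction attractive for small models.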
Community
Our latest tech report: a comprehensive account of how to take a model from its language-model pre-training weights, to a capable single-vector embedding model, and finally to a ColBERT model that outperforms 8B-parameter models on long-context retrieval tasks with just 0.017B parameters.
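As a companion to the MaxSim sketch above, here is a minimal illustration of the single-vector embedding stage the report describes (function name and shapes are again assumptions): token embeddings are pooled into one vector per text, and relevance collapses to a single cosine similarity.

```python
import torch

def single_vector_score(query_tokens: torch.Tensor, doc_tokens: torch.Tensor) -> torch.Tensor:
    """Single-vector retrieval: mean-pool (num_tokens, dim) token embeddings,
    then score with one cosine similarity."""
    q = torch.nn.functional.normalize(query_tokens.mean(dim=0), dim=0)
    d = torch.nn.functional.normalize(doc_tokens.mean(dim=0), dim=0)
    return q @ d  # scalar cosine similarity in [-1, 1]

torch.manual_seed(0)
print(single_vector_score(torch.randn(32, 128), torch.randn(180, 128)).item())
```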
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Granite Embedding R2 Models (2025)
- EmbeddingGemma: Powerful and Lightweight Text Representations (2025)
- Training LLMs to be Better Text Embedders through Bidirectional Reconstruction (2025)
- LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations (2025)
- Simple Projection Variants Improve ColBERT Performance (2025)
- MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction (2025)
- Enhancing Document VQA Models via Retrieval-Augmented Generation (2025)