	---
	license: mit
	library_name: colpali
	language:
	- en
	tags:
	- colpali
	- vidore-experimental
	- vidore
	pipeline_tag: visual-document-retrieval
	---

	# ModernVBERT-embed

	![bg](https://cdn-uploads.huggingface.co/production/uploads/661e945eebe3616a1b09e279/4NJs0KkWDwnH5YJVtpwQG.png)

	## Model
	This is the model card for `ModernVBERT-embed`, the dense encoder version of ModernVBERT. It is not specialized for any particular task and is intended for general image encoding tasks.

	## Table of Contents
	1. [Overview](#overview)
	2. [Usage](#usage)
	3. [Evaluation](#evaluation)
	4. [License](#license)
	5. [Citation](#citation)

	## Overview

	The [ModernVBERT](https://arxiv.org/abs/2510.01149) suite is a family of compact 250M-parameter vision-language encoders that achieve state-of-the-art performance in this size class and match the performance of models up to 10x larger.

	For more information about ModernVBERT, please check the [arXiv](https://arxiv.org/abs/2510.01149) preprint.

	### Models
	- `ColModernVBERT` is the late-interaction version, fine-tuned for visual document retrieval. It is our best-performing model on this task.
	- `BiModernVBERT` is the bi-encoder version, fine-tuned for visual document retrieval.
	- `ModernVBERT-embed` is the bi-encoder version after modality alignment (using an MLM objective) and contrastive learning, without document specialization.
	- `ModernVBERT` is the base model after modality alignment (using an MLM objective).
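
The difference between the bi-encoder and late-interaction variants listed above comes down to how a query is scored against a document. The sketch below illustrates the two scoring schemes in plain Python, assuming dot-product similarity and the ColBERT-style MaxSim operator commonly used for late interaction. The function names are hypothetical and this is not the code used by the models themselves.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bi_encoder_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Bi-encoder: one vector per query and one per document."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction (MaxSim): one vector per token or patch.

    For each query vector, keep its best match among the document
    vectors, then sum those maxima over all query vectors.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)
```

A bi-encoder stores a single vector per document, which keeps the index small. Late interaction stores one vector per token or patch and trades storage for finer-grained matching.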


	## Usage

	🏎️ If your GPU supports it, we recommend using ModernVBERT with Flash Attention 2 to achieve the highest GPU throughput. To do so, install Flash Attention 2 (the `flash-attn` package), then use the model as normal.

	The branch that adds ColModernVBERT support is not yet merged into the official colpali repo. For now, you need to clone the repo and check out the `vbert` branch to use it:

	```bash
	git clone https://github.com/illuin-tech/colpali.git
	cd colpali
	git checkout vbert
	pip install -e .
	```
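
The Flash Attention 2 recommendation above can be applied when the model is loaded. The sketch below is one hedged way to decide whether to request it. `pick_attn_implementation` is a hypothetical helper, not part of colpali, and it assumes that `BiModernVBert.from_pretrained` forwards the standard Transformers `attn_implementation` argument, which this card does not confirm.

```python
import importlib.util
from typing import Optional


def pick_attn_implementation(cuda_available: bool) -> Optional[str]:
    """Return "flash_attention_2" when it looks usable, else None.

    None means "do not pass attn_implementation" so that the library
    default is used. This only checks that the `flash_attn` package is
    importable and that the caller reports a CUDA device; it does not
    verify that the GPU architecture is supported.
    """
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return None
```

If the helper returns a value, it could be passed as `attn_implementation=...` together with a half-precision `torch_dtype` such as `torch.bfloat16`, since the Flash Attention 2 kernels only run in fp16 or bf16 on CUDA devices.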

	Here is an example of computing the similarity between a text and an image using ModernVBERT-embed:

	```python
	import torch
	from colpali_engine.models import BiModernVBert, BiModernVBertProcessor
	from PIL import Image
	from huggingface_hub import hf_hub_download

	model_id = "ModernVBERT/modernvbert-embed"

	processor = BiModernVBertProcessor.from_pretrained(model_id)
	model = BiModernVBert.from_pretrained(
	    model_id,
	    torch_dtype=torch.float32,
	    trust_remote_code=True,
	)

	image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
	text = "This is a text"

	# Prepare inputs
	text_inputs = processor.process_texts([text])
	image_inputs = processor.process_images([image])

	# Inference
	q_embeddings = model(**text_inputs)
	corpus_embeddings = model(**image_inputs)

	# Get the similarity scores
	scores = processor.score(q_embeddings, corpus_embeddings)

	print("Similarity scores:", scores)
	```
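
With several queries and images, the similarity scores from the example above can be turned into a ranking. The sketch below assumes the scores form a matrix of shape (number of queries, number of documents), for instance obtained with `scores.tolist()`. That shape is an assumption about `processor.score`, and `top_k_per_query` is a hypothetical helper written for illustration.

```python
from typing import List


def top_k_per_query(scores: List[List[float]], k: int = 3) -> List[List[int]]:
    """Return, for each query row, the indices of the k highest-scoring documents."""
    ranked = []
    for row in scores:
        # Sort document indices by descending score for this query.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        ranked.append(order[:k])
    return ranked


# Toy scores for 2 queries against 3 documents (made-up numbers).
example_scores = [
    [0.1, 0.9, 0.5],
    [0.7, 0.2, 0.3],
]
print(top_k_per_query(example_scores, k=2))
```

Each inner list holds document indices, best match first, so they can be used to look up the corresponding images in the corpus.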

	## Evaluation

	![table](https://cdn-uploads.huggingface.co/production/uploads/661e945eebe3616a1b09e279/qLevKOQ5Zb3yKnr4-k6US.png)

	ColModernVBERT matches the performance of models nearly 10x larger on visual document benchmarks. It also offers favorable CPU inference speed compared to models of similar performance.

	## License

	We release the ModernVBERT model architectures, model weights, and training codebase under the MIT license.

	## Citation

	If you use ModernVBERT in your work, please cite:

	```
	@misc{teiletche2025modernvbertsmallervisualdocument,
	  title={ModernVBERT: Towards Smaller Visual Document Retrievers},
	  author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
	  year={2025},
	  eprint={2510.01149},
	  archivePrefix={arXiv},
	  primaryClass={cs.IR},
	  url={https://arxiv.org/abs/2510.01149},
	}
	```