---
license: apache-2.0
datasets:
- racineai/OGC_MEGA_2
language:
- en
- fr
- ru
- ar
- de
- es
- it
base_model:
- Qwen/Qwen3-VL-4B-Instruct
tags:
- dse
- retrieval
- vision-language
- multimodal
- document-embedding
- multilingual
- RAG
pipeline_tag: visual-document-retrieval
---
# QwenAmann-4B-dse
A multimodal vision-language model specialized for multilingual technical document retrieval.
## Overview
QwenAmann-4B-dse is a 4B-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving text, images, and layout without requiring a separate content-extraction step.

## Performance
### ENERGY Benchmark (racineai/Open-VLM-Retrieval-Leaderboard)
### Key Strengths
- **Competitive performance**: Comparable to Jina Embeddings v4 while being fully open source under the Apache 2.0 license (Jina Embeddings v4 is governed by the Qwen Research License, as it derives from Qwen-2.5-VL-3B)
- **Strong multilingual performance**: Consistent scores across the five tested languages
- **Multi-domain training**: Trained on 1.44M examples spanning 15+ technical domains
## Key Features
- **Efficient Retrieval**: Generates document and query embeddings for semantic similarity search
- **Multimodal Understanding**: Processes text, diagrams, charts, and tables in their original layout
- **No Preprocessing Required**: Works directly with document screenshots
## Installation
```bash
pip install transformers accelerate pillow torch qwen-vl-utils
```
## Usage Example
```python
from PIL import Image
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Load model and processor
model_path = "racineai/QwenAmann-4B-dse"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Configure image tokens (960 for Qwen3-VL)
num_image_tokens = 960
min_pixels = 1 * 32 * 32
max_pixels = num_image_tokens * 32 * 32

processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2" if torch.cuda.is_available() else None,
    torch_dtype=torch.bfloat16,
).to(device).eval()

# Configure processor
processor.tokenizer.padding_side = "left"
model.padding_side = "left"

def get_embedding(last_hidden_state: torch.Tensor, dimension: int = 2560) -> torch.Tensor:
    """Extract and normalize embeddings from last token."""
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps[:, :dimension], p=2, dim=-1)
    return reps

# Encode a document image
document_image = Image.open("technical_document.jpg")
doc_messages = [{
    'role': 'user',
    'content': [
        {'type': 'image', 'image': document_image},
        {'type': 'text', 'text': 'What is shown in this image?'}
    ]
}]
doc_text = processor.apply_chat_template(
    doc_messages,
    tokenize=False,
    add_generation_prompt=True
) + "<|endoftext|>"
doc_image_inputs, doc_video_inputs = process_vision_info(doc_messages)
doc_inputs = processor(
    text=[doc_text],
    images=doc_image_inputs,
    videos=doc_video_inputs,
    padding='longest',
    return_tensors='pt'
).to(device)
cache_position = torch.arange(0, 1)
doc_inputs = model.prepare_inputs_for_generation(
    **doc_inputs,
    cache_position=cache_position,
    use_cache=False
)
with torch.no_grad():
    doc_outputs = model(**doc_inputs, return_dict=True, output_hidden_states=True)
    doc_embedding = get_embedding(doc_outputs.hidden_states[-1], dimension=2560)

# Encode a text query
query = "What are the specifications of this component?"
query_messages = [{
    'role': 'user',
    'content': [
        {'type': 'image', 'image': Image.new('RGB', (32, 32)),
         'resized_height': 1, 'resized_width': 1},
        {'type': 'text', 'text': f'Query: {query}'}
    ]
}]
query_text = processor.apply_chat_template(
    query_messages,
    tokenize=False,
    add_generation_prompt=True
) + "<|endoftext|>"
query_image_inputs, query_video_inputs = process_vision_info(query_messages)
query_inputs = processor(
    text=[query_text],
    images=query_image_inputs,
    videos=query_video_inputs,
    padding='longest',
    return_tensors='pt'
).to(device)
cache_position = torch.arange(0, 1)
query_inputs = model.prepare_inputs_for_generation(
    **query_inputs,
    cache_position=cache_position,
    use_cache=False
)
with torch.no_grad():
    query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
    query_embedding = get_embedding(query_outputs.hidden_states[-1], dimension=2560)

# Calculate similarity using dot product
similarity = torch.einsum("bd,cd->bc", query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
```
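The same pattern scales to ranking several pages against one query. The snippet below is a minimal sketch rather than part of the model's official API: it assumes `query_embedding` from the example above and a list `doc_embeddings` of `(1, 2560)` tensors obtained by repeating the document-encoding steps for each page.

```python
import torch

# doc_embeddings: list of (1, 2560) L2-normalized tensors, one per page,
# produced by repeating the document-encoding steps above.
doc_matrix = torch.cat(doc_embeddings, dim=0)          # (num_docs, 2560)

# Embeddings are L2-normalized, so the dot product is a cosine similarity.
scores = (query_embedding @ doc_matrix.T).squeeze(0)   # (num_docs,)

# Rank pages by similarity and show the best matches.
k = min(5, doc_matrix.size(0))
top_scores, top_indices = scores.topk(k)
for rank, (idx, score) in enumerate(zip(top_indices.tolist(), top_scores.tolist()), start=1):
    print(f"{rank}. document {idx}: score {score:.4f}")
```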
## Applications
- **Multilingual Technical Document Retrieval**: Find relevant documents across multiple languages
- **International Technical Support Systems**: Match user questions to relevant documentation regardless of language
- **Engineering Knowledge Management**: Index and search technical specifications, diagrams, and reports (see the indexing sketch after this list)
- **Multi-Domain Search**: Retrieve documents across military, energy, quantum computing, nuclear, geotechnical, and other technical domains
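For corpora too large to score in memory, the normalized embeddings can be stored in a vector index. The snippet below is a hedged sketch rather than an official recipe: it assumes `faiss-cpu` is installed (it is not part of the install command above) and reuses `doc_embeddings` and `query_embedding` as defined in the usage example; the inner product over L2-normalized vectors equals cosine similarity.

```python
import faiss
import torch

dim = 2560
index = faiss.IndexFlatIP(dim)  # exact inner-product (cosine on normalized vectors) search

# Add all page embeddings to the index as float32 arrays.
index.add(torch.cat(doc_embeddings, dim=0).cpu().float().numpy())

# Retrieve the 5 closest pages for the query.
scores, ids = index.search(query_embedding.cpu().float().numpy(), k=5)
for idx, score in zip(ids[0].tolist(), scores[0].tolist()):
    print(f"document {idx}: score {score:.4f}")
```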
## Training Methodology
QwenAmann-4B-dse was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content extraction preprocessing while preserving all visual and textual information in documents.
The model was fine-tuned on the OGC_MEGA_2 dataset, which comprises 1.44M examples across 35+ languages, with a primary focus on five major European languages (English, French, German, Spanish, and Italian). The dataset spans 15+ technical domains including military, energy, quantum computing, nuclear, geotechnical engineering, and more.
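The card does not spell out the exact loss, but DSE-style bi-encoders are commonly trained with an in-batch contrastive (InfoNCE) objective over query and document embeddings. The sketch below illustrates that objective only; the temperature value and batch construction are assumptions, not the model's published training configuration.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_embs: torch.Tensor, doc_embs: torch.Tensor,
                  temperature: float = 0.02) -> torch.Tensor:
    """In-batch contrastive loss: the positive for query i is the document at index i."""
    query_embs = F.normalize(query_embs, p=2, dim=-1)
    doc_embs = F.normalize(doc_embs, p=2, dim=-1)
    logits = query_embs @ doc_embs.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```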
## Authors
- **Léo Appourchaux** - Lead Developer at TW3 Partners
- **Paul Lemaistre** - GD at Racine.ai – Adjunct Professor at École Centrale d'Électronique
- **Dataset Curators**: Léo Appourchaux, Paul Lemaistre, Yumeng Ye, Mattéo KHAN, André-Louis Rochet
## License
This model is released under the Apache 2.0 license.
## Citation
```bibtex
@misc{qwenamann-4b-dse,
  author = {racine.ai},
  title = {QwenAmann-4B-dse: A Multimodal Vision-Language Model for Multilingual Document Retrieval},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/QwenAmann-4B-dse}
}
```