---
license: apache-2.0
datasets:
- racineai/OGC_MEGA_2
language:
- en
- fr
- ru
- ar
- de
- es
- it
base_model:
- Qwen/Qwen3-VL-4B-Instruct
tags:
- dse
- retrieval
- vision-language
- multimodal
- document-embedding
- multilingual
- RAG
pipeline_tag: visual-document-retrieval
---

# QwenAmann-4B-dse

A multimodal vision-language model specialized for multilingual technical document retrieval.

## Overview

QwenAmann-4B-dse is a 4B-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving all information on the page, including text, images, and layout, without requiring separate content extraction.

![Racine QwenAmann](https://cdn-uploads.huggingface.co/production/uploads/659826211ec4d9b9a1f2ef3a/bJnqVmcEuprC9-gxNw579.png)

## Performance

### ENERGY Benchmark (racineai/Open-VLM-Retrieval-Leaderboard)
Results on the ENERGY benchmark are reported on the racineai/Open-VLM-Retrieval-Leaderboard.
### Key Strengths

- **Competitive performance**: Comparable to Jina Embeddings v4 while being fully open source under the Apache 2.0 license (Jina Embeddings v4 is governed by the Qwen Research License, as it derives from Qwen-2.5-VL-3B)
- **Strong multilingual performance**: Stable scores across the 5 tested languages
- **Multi-domain training**: Trained on 1.44M examples spanning 15+ technical domains

## Key Features

- **Efficient Retrieval**: Generates document and query embeddings for semantic similarity search
- **Multimodal Understanding**: Processes text, diagrams, charts, and tables in their original layout
- **No Preprocessing Required**: Works directly with document screenshots

## Installation

```bash
pip install transformers accelerate pillow torch qwen-vl-utils
```

## Usage Example

```python
from PIL import Image
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

# Load model and processor
model_path = "racineai/QwenAmann-4B-dse"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Configure the image token budget (960 for Qwen3-VL)
num_image_tokens = 960
min_pixels = 1 * 32 * 32
max_pixels = num_image_tokens * 32 * 32

processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels
)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2" if torch.cuda.is_available() else None,
    torch_dtype=torch.bfloat16,
).to(device).eval()

# Use left padding so the last token of every sequence is a real content token
processor.tokenizer.padding_side = "left"
model.padding_side = "left"


def get_embedding(last_hidden_state: torch.Tensor, dimension: int = 2560) -> torch.Tensor:
    """Extract and L2-normalize the embedding of the last token."""
    reps = last_hidden_state[:, -1]
    reps = torch.nn.functional.normalize(reps[:, :dimension], p=2, dim=-1)
    return reps


# Encode a document image
document_image = Image.open("technical_document.jpg")
doc_messages = [{
    'role': 'user',
    'content': [
        {'type': 'image', 'image': document_image},
        {'type': 'text', 'text': 'What is shown in this image?'}
    ]
}]
doc_text = processor.apply_chat_template(
    doc_messages, tokenize=False, add_generation_prompt=True
) + "<|endoftext|>"
doc_image_inputs, doc_video_inputs = process_vision_info(doc_messages)
doc_inputs = processor(
    text=[doc_text],
    images=doc_image_inputs,
    videos=doc_video_inputs,
    padding='longest',
    return_tensors='pt'
).to(device)
cache_position = torch.arange(0, 1)
doc_inputs = model.prepare_inputs_for_generation(
    **doc_inputs, cache_position=cache_position, use_cache=False
)
with torch.no_grad():
    doc_outputs = model(**doc_inputs, return_dict=True, output_hidden_states=True)
doc_embedding = get_embedding(doc_outputs.hidden_states[-1], dimension=2560)

# Encode a text query
query = "What are the specifications of this component?"
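# Note: text-only queries are wrapped in the same chat template as documents. A
# minimal placeholder image (resized to a single patch) is attached so that the
# query passes through the identical multimodal input format as a document
# screenshot; it is intended to carry essentially no visual signal, and the
# query embedding is still taken from the last token, as above.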
query_messages = [{
    'role': 'user',
    'content': [
        {'type': 'image', 'image': Image.new('RGB', (32, 32)), 'resized_height': 1, 'resized_width': 1},
        {'type': 'text', 'text': f'Query: {query}'}
    ]
}]
query_text = processor.apply_chat_template(
    query_messages, tokenize=False, add_generation_prompt=True
) + "<|endoftext|>"
query_image_inputs, query_video_inputs = process_vision_info(query_messages)
query_inputs = processor(
    text=[query_text],
    images=query_image_inputs,
    videos=query_video_inputs,
    padding='longest',
    return_tensors='pt'
).to(device)
cache_position = torch.arange(0, 1)
query_inputs = model.prepare_inputs_for_generation(
    **query_inputs, cache_position=cache_position, use_cache=False
)
with torch.no_grad():
    query_outputs = model(**query_inputs, return_dict=True, output_hidden_states=True)
query_embedding = get_embedding(query_outputs.hidden_states[-1], dimension=2560)

# Calculate similarity as the dot product of the normalized embeddings
similarity = torch.einsum("bd,cd->bc", query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
```

## Applications

- **Multilingual Technical Document Retrieval**: Find relevant documents across multiple languages
- **International Technical Support Systems**: Match user questions to relevant documentation regardless of language
- **Engineering Knowledge Management**: Index and search technical specifications, diagrams, and reports
- **Multi-Domain Search**: Retrieve documents across military, energy, quantum computing, nuclear, geotechnical, and other technical domains

## Training Methodology

QwenAmann-4B-dse was trained with the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This removes the need for content-extraction preprocessing while preserving all visual and textual information in the documents.

The model was fine-tuned on the OGC_MEGA_2 dataset, which comprises 1.44M examples across 35+ languages, with a primary focus on 5 major European languages (English, French, German, Spanish, Italian). The dataset spans 15+ technical domains, including military, energy, quantum computing, nuclear, and geotechnical engineering.

## Authors

- **Léo Appourchaux** - Lead Developer at TW3 Partners
- **Paul Lemaistre** - GD at Racine.ai, Adjunct Professor at École Centrale d'Électronique

**Dataset Curators**: Léo Appourchaux, Paul Lemaistre, Yumeng Ye, Mattéo KHAN, André-Louis Rochet

## License

This model is released under the Apache 2.0 license.

## Citation

```
@misc{qwenamann-4b-dse,
  author = {racine.ai},
  title = {QwenAmann-4B-dse: A Multimodal Vision-Language Model for Multilingual Document Retrieval},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/QwenAmann-4B-dse}
}
```