Omni-Embed-Nemotron-3B

Description

NV-QwenOmni-Embed-3B-v1 is a versatile multimodal embedding model that encodes content across multiple modalities, including text, image, audio, and video, either individually or in combination, and supports retrieval with queries that can themselves be multimodal. It is designed to serve as a foundational component in multi-modal Retrieval-Augmented Generation (RAG) systems.

The foundational Qwen Omni model (Qwen/Qwen2.5-Omni-3B) is based on the Thinker-Talker architecture. We only leverage the Thinker component to encode and understand diverse modalities. In this implementation, we do not include the Talker component, as the model focuses on multimodal understanding rather than response generation.

This model is for research and development only.

For more technical details, please refer to our technical report: Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video

License/Terms of Use

Governing Terms for nvidia/omni-embed-nemotron-3b model: NVIDIA OneWay Noncommercial License.

ADDITIONAL INFORMATION: Qwen RESEARCH LICENSE AGREEMENT

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

Team

  • Mengyao Xu
  • Gabriel Moreira
  • Radek Osmulski
  • Ronay Ak
  • Yauhen Babakhin
  • Bo Liu
  • Even Oldridge
  • Benedikt Schifferer

Correspondence to Mengyao Xu (mengyaox@nvidia.com) and Benedikt Schifferer (bschifferer@nvidia.com)

Citation

@article{xu2025omni,
  title={Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video},
  author={Xu, Mengyao and Zhou, Wenfei and Babakhin, Yauhen and Moreira, Gabriel and Ak, Ronay and Osmulski, Radek and Liu, Bo and Oldridge, Even and Schifferer, Benedikt},
  journal={arXiv preprint arXiv:2510.03458},
  year={2025}
}
@misc{moreira2025nvretrieverimprovingtextembedding,
  title={NV-Retriever: Improving text embedding models with effective hard-negative mining},
  author={Gabriel de Souza P. Moreira and Radek Osmulski and Mengyao Xu and Ronay Ak and Benedikt Schifferer and Even Oldridge},
  year={2025},
  eprint={2407.15831},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.15831},
}

Deployment Geography

Global

Use Case

NV-Omni-Embed is intended for researchers and developers building retrieval-based applications that require understanding and retrieving information across multiple modalities. It is particularly useful in multimodal RAG systems, where queries and documents may include combinations of text, images, audio, and videos. Potential applications include multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding.

Release Date

Hugging Face, 10/01/2025, via https://huggingface.co/nvidia/omni-embed-nemotron-3b

Model Architecture

  • Architecture Type: Transformer
  • Network Architecture: Qwen/Qwen2.5-Omni-3B

NV-QwenOmni-Embed-3B-v1 is a transformer-based multimodal embedding model built on top of the Thinker component from Qwen/Qwen2.5-Omni-3B. Unlike the original Thinker-Talker architecture, this model does not include the Talker module, as it is designed specifically for multimodal understanding and retrieval rather than response generation. The model has 4.7B parameters.

The model incorporates a vision encoder, an audio encoder, and a large language model (LLM) from the Qwen architecture to process diverse modalities. Unlike the Omni model, which interleaves audio and video tokens with TMRoPE, our retrieval encoder keeps the two streams separate. Audio and video are encoded independently, preserving their full temporal structure without interleaving. Our experiments show this design improves retrieval performance.

NV-QwenOmni-Embed-3B-v1 is trained using a bi-encoder architecture where queries and candidate inputs are embedded independently. A contrastive learning objective is employed to align relevant query-content pairs while pushing apart unrelated ones in the shared embedding space.
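
As an illustration of this objective, below is a minimal sketch of an in-batch InfoNCE-style contrastive loss over L2-normalized query and document embeddings; the actual loss variant, temperature, and hard-negative mining strategy used in training are not detailed here, so treat the specifics as assumptions.

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: the i-th query is paired with the i-th document,
    and all other documents in the batch serve as negatives. Shapes: [batch, dim]."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature          # cosine similarities, scaled by temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)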

Input

| Property | Query | Document |
|---|---|---|
| Input Type | Text, Image, Audio, Video, or any combination | Text, Image, Audio, Video, or any combination |
| Input Format | List of strings, image tensors, audio arrays, or video clips | List of text strings, images, audio, or video clips |

| Modality | Input Parameter |
|---|---|
| Text | str, list[str], or pre-tokenized list[list[str]]; encoded to token IDs; per-sample 1D; batched 2D [batch, seq_len] |
| Image | PIL.Image, np.ndarray, or torch.Tensor; per-sample 3D; batched 4D |
| Video | np.ndarray, torch.Tensor, or list of frames; per-sample 4D; batched 5D; or file (e.g., .mp4) |
| Audio | 1D waveform (np.ndarray or torch.Tensor); per-sample 1D; batched 2D [batch, num_samples]; or file |

Other Properties: The model's maximum context length is 32768 tokens.

Output

  • Output Type: Floats
  • Output Format: List of float arrays
  • Output Parameters: A 2D tensor of floats with shape [batch_size x 2048]
  • Other Properties Related to Output: Model outputs embedding vectors of dimension 2048 for each input.
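
Since the embeddings are L2-normalized (see the Usage section below), retrieval reduces to ranking candidates by dot-product (cosine) similarity. A minimal sketch with random placeholder embeddings standing in for real model outputs:

import torch
import torch.nn.functional as F

# Illustrative only: in practice these embeddings come from the model as shown in the Usage section.
query_embedding = F.normalize(torch.randn(1, 2048), dim=-1)   # [1, 2048]
doc_embeddings = F.normalize(torch.randn(5, 2048), dim=-1)    # [num_docs, 2048]

# Because the embeddings are L2-normalized, the dot product equals cosine similarity.
scores = query_embedding @ doc_embeddings.T                   # [1, num_docs]
ranking = scores.argsort(dim=-1, descending=True)             # document indices, best match first
print(scores)
print(ranking)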

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Usage

The model requires the Qwen2.5-Omni preview build of transformers (version 4.51.3) and the qwen-omni-utils helper package:

pip install git+https://github.com/huggingface/transformers.git@v4.51.3-Qwen2.5-Omni-preview
pip install qwen-omni-utils

import torch
from qwen_omni_utils import process_mm_info
import torch.nn.functional as F
from transformers import AutoModel, AutoProcessor

# Load the embedding model in bfloat16 with FlashAttention-2 (requires the flash-attn package)
model_name_or_path = "nvidia/omni-embed-nemotron-3b"
model = AutoModel.from_pretrained(
    model_name_or_path, 
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

model = model.to("cuda:0")
model.eval()

# Example document combining text, video, and audio content
documents = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "passage: This is a passage to be embedded"
            },
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"
            },
            {
                "type": "audio",
                "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"
            }
        ]
    },
]

processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
documents_texts = processor.apply_chat_template(documents, add_generation_prompt=False, tokenize=False)
audio, images, videos = process_mm_info(documents, use_audio_in_video=False)

videos_kwargs = {
    "min_pixels": 32*14*14,
    "max_pixels": 64*28*28,
    "use_audio_in_video": False,
}
text_kwargs = {
    "truncation": True,
    "padding": True,
    "max_length": 204800,
}
# Tokenize the text and preprocess the image/video/audio inputs into model-ready tensors
batch_dict = processor(
    text=documents_texts, 
    images=images, 
    videos=videos, 
    audio=audio,
    return_tensors="pt",
    text_kwargs=text_kwargs,
    videos_kwargs=videos_kwargs,
    audio_kwargs={"max_length": 2048000},
)

# Move inputs to the model's device and take the last-layer hidden states as token embeddings
batch_dict = {k: v.to(model.device) for k, v in batch_dict.items()}
last_hidden_states = model(**batch_dict, output_hidden_states=True).hidden_states[-1]
# Average pooling over non-padded tokens, followed by L2 normalization
attention_mask = batch_dict["attention_mask"]
last_hidden_states_masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
embedding = last_hidden_states_masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
embedding = F.normalize(embedding, dim=-1)
print(embedding)
print(embedding.shape)
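
A query can be embedded the same way and scored against document embeddings by dot product (cosine similarity, since both are L2-normalized). A minimal sketch reusing the processor, model, text_kwargs, and embedding from above; the "query: " prefix mirrors the "passage: " prefix in the document example and should be treated as an assumption:

queries = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "query: What is being drawn in the video?"}
        ]
    },
]
query_texts = processor.apply_chat_template(queries, add_generation_prompt=False, tokenize=False)
query_batch = processor(text=query_texts, return_tensors="pt", text_kwargs=text_kwargs)
query_batch = {k: v.to(model.device) for k, v in query_batch.items()}
query_hidden = model(**query_batch, output_hidden_states=True).hidden_states[-1]
# Same average pooling and normalization as for documents
query_mask = query_batch["attention_mask"]
query_hidden = query_hidden.masked_fill(~query_mask[..., None].bool(), 0.0)
query_embedding = F.normalize(query_hidden.sum(dim=1) / query_mask.sum(dim=1)[..., None], dim=-1)

# Cosine similarity between the query and the document embedded above
scores = query_embedding @ embedding.T
print(scores)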

Software Integration:

  • Runtime Engine(s): TensorRT, Triton
  • Supported Hardware Microarchitecture Compatibility: A100 40GB, A100 80GB, H100 80GB
  • Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s)

  • Nvidia Omni Embed Nemotron 3B
  • Short name: omni-embed-nemotron-3b-v1

Training and Evaluation Datasets

Training Dataset

Data Modality: Image, Text

Image Training Data Size: 1 Million to 1 Billion Images

Text Training Data Size: Less than a Billion Tokens

The model was trained on publicly available datasets, including HotpotQA, MIRACL, Natural Questions (NQ), Stack Exchange, SQuAD, Tiger Math/Stack, DocMatix-IR, Vidore-ColPali-Training, and Wiki-SS-NQ.

  • Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
  • Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
  • Properties: 1M samples from public datasets.

Evaluation Dataset

We evaluate our model on multiple benchmarks covering different modalities. For text retrieval, we select a subset of text retrieval datasets from MTEB. For image retrieval, evaluation is conducted on the public ViDoRe V1 benchmark. Since no established video retrieval benchmarks exist, we construct two custom evaluation sets based on the LPM dataset and FineVideo. To provide a fair comparison with state-of-the-art text-only baselines, we use the speech-to-text transcripts released with FineVideo and the transcripts from LPM as the input corpus for standard text retrieval models.

  • Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
  • Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
  • Properties: More details on ViDoRe V1 can be found on their leaderboard: Visual Document Retrieval Benchmark.
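
The tables below report results as nDCG@10 and nDCG@5. For reference, here is a minimal sketch of how nDCG@k can be computed from the relevance labels of the ranked results (the actual evaluation tooling used is not specified here):

import math

def ndcg_at_k(ranked_relevances, k):
    """nDCG@k for one query.
    ranked_relevances: relevance grades of the retrieved documents, in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevances[:k]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Example: the single relevant document was retrieved at rank 2 out of 10 candidates
print(ndcg_at_k([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], k=10))  # ~0.63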

Model performance comparison on Video retrieval datasets (LPM and FineVideo) using NDCG@10 and NDCG@5 metrics:

| Model | NDCG@10 LPM | NDCG@10 FineVideo | NDCG@10 Avg | NDCG@5 LPM | NDCG@5 FineVideo | NDCG@5 Avg |
|---|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.8634 | 0.5405 | 0.7020 | 0.8518 | 0.5264 | 0.6891 |
| intfloat/multilingual-e5-large-instruct | 0.7952 | 0.4456 | 0.6204 | 0.7759 | 0.4300 | 0.6030 |
| stella_en_1.5B_v5 | 0.8522 | 0.5359 | 0.6941 | 0.8404 | 0.5206 | 0.6805 |
| nvidia/omni-embed-nemotron-3b | 0.8465 | 0.5662 | 0.7064 | 0.8355 | 0.5486 | 0.6921 |

Multimodal retrieval performance across input modalities on LPM and FineVideo using NDCG@10. Baselines support text only; multimodal settings apply to Omni.

LPM performance (NDCG@10) modalities breakdown

| Model | Text (Transcript+OCR) | Audio-Only | Video-Only | Audio+Video Fusion | Audio+Video Separately |
|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.8634 | N/A | N/A | N/A | N/A |
| intfloat/multilingual-e5-large-instruct | 0.7952 | N/A | N/A | N/A | N/A |
| stella_en_1.5B_v5 | 0.8522 | N/A | N/A | N/A | N/A |
| nvidia/omni-embed-nemotron-3b | 0.8636 | 0.8238 | 0.7365 | 0.8373 | 0.8465 |

FineVideo performance (NDCG@10) modalities breakdown

| Model | Text (Transcript) | Audio-Only | Video-Only | Audio+Video Fusion | Audio+Video Separately |
|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.5405 | N/A | N/A | N/A | N/A |
| intfloat/multilingual-e5-large-instruct | 0.4456 | N/A | N/A | N/A | N/A |
| stella_en_1.5B_v5 | 0.5359 | N/A | N/A | N/A | N/A |
| nvidia/omni-embed-nemotron-3b | 0.6082 | 0.5407 | 0.4488 | 0.4700 | 0.5662 |
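
In the "Audio+Video Separately" setting, the audio and video streams of a clip are embedded independently, consistent with the architecture described above. How the per-stream scores are combined is not spelled out here; purely as an illustration, one simple option is to score the query against each stream and keep the maximum similarity:

import torch
import torch.nn.functional as F

# Placeholder embeddings; in practice these are produced by the model for the same clip.
query_embedding = F.normalize(torch.randn(1, 2048), dim=-1)
audio_embedding = F.normalize(torch.randn(1, 2048), dim=-1)
video_embedding = F.normalize(torch.randn(1, 2048), dim=-1)

audio_score = (query_embedding @ audio_embedding.T).item()
video_score = (query_embedding @ video_embedding.T).item()
# Combination rule is an assumption, not necessarily what was used in the evaluation above.
combined_score = max(audio_score, video_score)
print(audio_score, video_score, combined_score)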

Evaluation of embedding models across text retrieval benchmarks. Results are reported using nDCG@10.

| Model | Avg. | NQ | FiQA-2018 | SciFact | SCIDOCS | ArguAna | NFCorpus | Quora | LegalBench-CorpLobby | CQAdupGaming | CQAdupUnix |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.6654 | 0.6313 | 0.6122 | 0.7833 | 0.3144 | 0.7564 | 0.4110 | 0.8806 | 0.9542 | 0.7151 | 0.5960 |
| intfloat/multilingual-e5-large-instruct | 0.5900 | 0.6350 | 0.4865 | 0.7162 | 0.1924 | 0.5848 | 0.3634 | 0.8926 | 0.9425 | 0.6396 | 0.4473 |
| stella_en_1.5B_v5 | 0.6050 | 0.7180 | 0.5996 | 0.8009 | 0.2677 | 0.5706 | 0.4200 | 0.9003 | 0.9468 | 0.5359 | 0.2903 |
| nvidia/omni-embed-nemotron-3b | 0.6059 | 0.6808 | 0.5382 | 0.7405 | 0.2163 | 0.5891 | 0.3644 | 0.8347 | 0.9413 | 0.6432 | 0.5102 |

Evaluation of baseline models and our models on ViDoRe V1 (as of September 30th). Results are presented using nDCG@5 metrics.

| Model | Size (M) | Avg. | ArxivQA | DocVQA | InfoVQA | Shift Project | AI | Energy | Gov. Reports | Healthcare | TabFQuad | TAT-DQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2418 | 90.5 | 87.6 | 64.5 | 93.6 | 92.3 | 100 | 96.6 | 96.7 | 99.6 | 94.3 | 79.8 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 4407 | 91.0 | 88.4 | 66.2 | 94.9 | 90.7 | 99.6 | 96.6 | 97.8 | 99.3 | 95.9 | 80.6 |
| nomic-ai/colnomic-embed-multimodal-3b | 3000 | 89.9 | 88.2 | 61.3 | 92.8 | 90.2 | 96.3 | 97.3 | 96.6 | 98.3 | 94.5 | 83.1 |
| vidore/colqwen2.5-v0.2 | 3000 | 89.6 | 89.1 | 63.5 | 92.6 | 88.0 | 99.6 | 95.8 | 96.6 | 98.0 | 90.8 | 82.1 |
| vidore/colqwen2-v1.0 | 2210 | 89.2 | 88.0 | 61.5 | 92.5 | 89.9 | 99.0 | 95.9 | 95.5 | 98.8 | 89.0 | 82.2 |
| vidore/colpali-v1.3 | 2920 | 84.7 | 83.7 | 58.7 | 85.7 | 76.5 | 96.6 | 94.6 | 95.9 | 97.4 | 86.7 | 70.7 |
| vidore/colpali-v1.2 | 2920 | 83.4 | 77.9 | 56.5 | 82.4 | 78.3 | 97.5 | 94.4 | 94.9 | 95.4 | 88.4 | 68.1 |
| nvidia/omni-embed-nemotron-3b | 4703 | 85.7 | 85.3 | 59.2 | 89.2 | 78.6 | 98.1 | 93.5 | 95.4 | 95.8 | 91.0 | 69.7 |

Inference:

Acceleration Engine: Not Applicable
Test Hardware: A100 40GB, A100 80GB, H100 80GB

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns here.

Bias

| Field | Response |
|---|---|
| Participation considerations from adversely impacted groups (protected classes) in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |

Explainability

| Field | Response |
|---|---|
| Intended Application & Domain | Multi-modality corpus and query embedding for question and answer retrieval. |
| Model Type | Transformer encoder. |
| Intended User | Creators of generative AI focused on conversational models, as well as users aiming to develop question-and-answer applications, can benefit from leveraging dense retrieval technologies. These applications can efficiently handle large, multi-modal corpora, including images, text, videos, and audio. |
| Output | Array of float numbers (dense vector for the input content, which may include multi-modal corpora). |
| Describe how the model works | The model transforms the input into a dense vector representation. |
| Performance Metrics | Accuracy |
| Potential Known Risks | The model is not guaranteed to always retrieve the correct corpus for a given query. |
| Licensing & Terms of Use | Governing Terms: Your use of the software container and model is governed by the NVIDIA Software and Model Evaluation License. Additional Information: Qwen RESEARCH LICENSE AGREEMENT. |
| Technical Limitations | The model's max sequence length is 32768. Longer sequence inputs should be truncated. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of | N/A |
| Verified to have met prescribed NVIDIA quality standards | Yes |

Privacy

| Field | Response |
|---|---|
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None |
| How often is dataset reviewed? | The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes. |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

Safety And Security

| Field | Response |
|---|---|
| Model Application(s) | Multi-modal corpus embedding for retrieval. The model processes input from various modalities (text, image, audio, and video), either independently or in combination. |
| Use Case Restrictions | Governing Terms: Your use of the model is governed by the NVIDIA Open License Agreement. Additional Information: Qwen RESEARCH LICENSE AGREEMENT. |
| Model and dataset restrictions | The principle of least privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints are adhered to. |
| Describe the life critical impact (if present) | Not applicable |