Upload folder using huggingface_hub
- .gitattributes +1 -0
- LICENSE +41 -0
- README.md +339 -5
- added_tokens.json +24 -0
- chat_template.jinja +7 -0
- config.json +112 -0
- generation_config.json +4 -0
- merges.txt +0 -0
- model-00001-of-00002.safetensors +3 -0
- model-00002-of-00002.safetensors +3 -0
- model.safetensors.index.json +0 -0
- modeling_nv_omni_embed.py +50 -0
- preprocessor_config.json +31 -0
- special_tokens_map.json +38 -0
- tokenizer.json +3 -0
- tokenizer_config.json +229 -0
- vocab.json +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
LICENSE
CHANGED
@@ -0,0 +1,41 @@
# NVIDIA License

## 1. Definitions

- “Licensor” means any person or entity that distributes its Work.
- “Work” means (a) the original work of authorship made available under this license, which may include software, documentation, or other files, and (b) any additions to or derivative works thereof that are made available under this license.
- The terms “reproduce,” “reproduction,” “derivative works,” and “distribution” have the meaning as provided under U.S. copyright law; provided, however, that for the purposes of this license, derivative works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work.
- Works are “made available” under this license by including in or with the Work either (a) a copyright notice referencing the applicability of this license to the Work, or (b) a copy of this license.

## 2. License Grant

### 2.1 Copyright Grant.
Subject to the terms and conditions of this license, each Licensor grants to you a perpetual, worldwide, non-exclusive, royalty-free, copyright license to use, reproduce, prepare derivative works of, publicly display, publicly perform, sublicense and distribute its Work and any resulting derivative works in any form.

## 3. Limitations

### 3.1 Redistribution.
You may reproduce or distribute the Work only if (a) you do so under this license, (b) you include a complete copy of this license with your distribution, and (c) you retain without modification any copyright, patent, trademark, or attribution notices that are present in the Work.

### 3.2 Derivative Works.
You may specify that additional or different terms apply to the use, reproduction, and distribution of your derivative works of the Work (“Your Terms”) only if (a) Your Terms provide that the use limitation in Section 3.3 applies to your derivative works, and (b) you identify the specific derivative works that are subject to Your Terms. Notwithstanding Your Terms, this license (including the redistribution requirements in Section 3.1) will continue to apply to the Work itself.

### 3.3 Use Limitation.
The Work and any derivative works thereof only may be used or intended for use non-commercially. Notwithstanding the foregoing, NVIDIA Corporation and its affiliates may use the Work and any derivative works commercially. As used herein, “non-commercially” means for research or evaluation purposes only.

### 3.4 Patent Claims.
If you bring or threaten to bring a patent claim against any Licensor (including any claim, cross-claim or counterclaim in a lawsuit) to enforce any patents that you allege are infringed by any Work, then your rights under this license from such Licensor (including the grant in Section 2.1) will terminate immediately.

### 3.5 Trademarks.
This license does not grant any rights to use any Licensor’s or its affiliates’ names, logos, or trademarks, except as necessary to reproduce the notices described in this license.

### 3.6 Termination.
If you violate any term of this license, then your rights under this license (including the grant in Section 2.1) will terminate immediately.

## 4. Disclaimer of Warranty.

THE WORK IS PROVIDED “AS IS” WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WARRANTIES OR CONDITIONS OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT. YOU BEAR THE RISK OF UNDERTAKING ANY ACTIVITIES UNDER THIS LICENSE.

## 5. Limitation of Liability.

EXCEPT AS PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER IN TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE SHALL ANY LICENSOR BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF OR RELATED TO THIS LICENSE, THE USE OR INABILITY TO USE THE WORK (INCLUDING BUT NOT LIMITED TO LOSS OF GOODWILL, BUSINESS INTERRUPTION, LOST PROFITS OR DATA, COMPUTER FAILURE OR MALFUNCTION, OR ANY OTHER DAMAGES OR LOSSES), EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
README.md
CHANGED
@@ -1,5 +1,339 @@
---
license: other
license_name: nvidia-open-model-license
license_link: LICENSE
tags:
- text
- image
- video
- audio
- vidore
- multimodal-embedding
- Text-to-Video retrieval
- Text-to-Audio retrieval
- Visual Document Retrieval
- feature-extraction
language:
- en
library_name: transformers
---

# Omni-Embed-Nemotron-3B

## Description

NV-QwenOmni-Embed-3B-v1 is a versatile multimodal embedding model that encodes content across multiple modalities, including text, image, audio, and video, either individually or in combination, and supports retrieval with queries that may themselves be multimodal. It is designed to serve as a foundational component in multi-modal Retrieval-Augmented Generation (RAG) systems.

The foundational Qwen Omni model ([Qwen/Qwen2.5-Omni-3B](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)) is based on the Thinker-Talker architecture. We leverage only the Thinker component to encode and understand diverse modalities. This implementation does not include the Talker component, as the model focuses on multimodal understanding rather than response generation.

This model is for research and development only.

### License/Terms of Use

Governing Terms for the nvidia/omni-embed-nemotron-3b model: [NVIDIA OneWay Noncommercial License](https://huggingface.co/datasets/nvidia/PhysicalAI-Robotics-Manipulation-Objects/resolve/main/NVIDIA%20OneWay%20Noncommercial%20License.pdf?download=true).

ADDITIONAL INFORMATION: [Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-Omni-3B/blob/main/LICENSE)

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

### Team
- Mengyao Xu
- Gabriel Moreira
- Radek Osmulski
- Ronay Ak
- Yauhen Babakhin
- Bo Liu
- Even Oldridge
- Benedikt Schifferer

Correspondence to Mengyao Xu (mengyaox@nvidia.com) and Benedikt Schifferer (bschifferer@nvidia.com)

### Citation

```
@misc{moreira2025nvretrieverimprovingtextembedding,
      title={NV-Retriever: Improving text embedding models with effective hard-negative mining},
      author={Gabriel de Souza P. Moreira and Radek Osmulski and Mengyao Xu and Ronay Ak and Benedikt Schifferer and Even Oldridge},
      year={2025},
      eprint={2407.15831},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2407.15831},
}
```

### Deployment Geography
Global

### Use Case
NV-Omni-Embed is intended for researchers and developers building retrieval-based applications that need to understand and retrieve information across multiple modalities. It is particularly useful in multimodal RAG systems, where queries and documents may include combinations of text, images, audio, and video. Potential applications include multimedia search engines, cross-modal retrieval systems, and conversational AI with rich input understanding.

### Release Date
Hugging Face 10/01/2025 via https://huggingface.co/nvidia/omni-embed-nemotron-3b

## Model Architecture

- **Architecture Type:** Transformer
- **Network Architecture:** Qwen/Qwen2.5-Omni-3B

NV-QwenOmni-Embed-3B-v1 is a transformer-based multimodal embedding model built on top of the Thinker component from [Qwen/Qwen2.5-Omni-3B](https://huggingface.co/Qwen/Qwen2.5-Omni-3B). Unlike the original Thinker-Talker architecture, this model does not include the Talker module, as it is designed specifically for multimodal understanding and retrieval rather than response generation. The model has 4.7B parameters.

The model incorporates a vision encoder, an audio encoder, and a large language model (LLM) from the Qwen architecture to process diverse modalities. Unlike the Omni model, which interleaves audio and video tokens with TMRoPE, our retrieval encoder keeps the two streams separate. Audio and video are encoded independently, preserving their full temporal structure without interleaving. Our experiments show this design improves retrieval performance.

NV-QwenOmni-Embed-3B-v1 is trained with a bi-encoder architecture in which queries and candidate inputs are embedded independently. A contrastive learning objective aligns relevant query-content pairs while pushing apart unrelated ones in the shared embedding space, as sketched below.
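
A minimal sketch of this kind of in-batch contrastive (InfoNCE-style) objective is shown below for illustration only; the function name, temperature value, and use of in-batch negatives are assumptions of the sketch, while the actual training recipe (including hard-negative mining) follows the NV-Retriever work cited above and is not part of this repository.

```
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """Illustrative bi-encoder objective: the i-th query should match the i-th document.

    query_emb, doc_emb: [batch, dim] L2-normalized embeddings.
    The temperature here is a placeholder, not the value used to train this model.
    """
    # Cosine-similarity matrix between every query and every in-batch document.
    scores = query_emb @ doc_emb.T / temperature          # [batch, batch]
    # Diagonal entries are the positive pairs; all other documents act as in-batch negatives.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)
```
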

## Input

| Property | Query | Document |
|----------|-------|----------|
| **Input Type** | Text \| Image \| Audio \| Video \| Any combination | Text \| Image \| Audio \| Video \| Any combination |
| **Input Format** | List of strings, image tensors, audio arrays, or video clips | List of text strings, images, audio, or video clips |

| | Text | Image | Video | Audio |
|---|---|---|---|---|
| **Input Parameter** | str, list[str], or pre-tokenized list[list[str]]; encoded to token IDs; per-sample 1D; batched 2D [batch, seq_len] | PIL.Image, np.ndarray, or torch.Tensor; per-sample 3D; batched 4D | np.ndarray, torch.Tensor, or list of frames; per-sample 4D; batched 5D; or a file (e.g. .mp4) | 1D waveform (np.ndarray or torch.Tensor); per-sample 1D; batched 2D [batch, num_samples]; or a file |

**Other Properties**: The model's maximum context length is 32768 tokens.

## Output

- **Output Type:** Floats
- **Output Format:** List of float arrays
- **Output Parameters:** A tensor of floats of shape [batch_size x 2048]
- **Other Properties Related to Output:** The model outputs an embedding vector of dimension 2048 for each input.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

### Usage
The model requires transformers version 4.51.3:
```
pip install git+https://github.com/huggingface/transformers.git@v4.51.3-Qwen2.5-Omni-preview
```
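
The usage example below additionally imports `process_mm_info` from `qwen_omni_utils`, the multimodal preprocessing helper published alongside Qwen2.5-Omni; if it is not already present in your environment, it can typically be installed from PyPI:

```
pip install qwen-omni-utils
```
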
```
import torch
from qwen_omni_utils import process_mm_info
import torch.nn.functional as F
from transformers import AutoModel, AutoProcessor

model_name_or_path = "nvidia/omni-embed-nemotron-3b"
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)

model = model.to("cuda:0")
model.eval()

documents = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "passage: This is a passage to be embedded"
            },
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"
            },
            {
                "type": "audio",
                "audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"
            }
        ]
    },
]

processor = AutoProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
documents_texts = processor.apply_chat_template(documents, add_generation_prompt=False, tokenize=False)
audio, images, videos = process_mm_info(documents, use_audio_in_video=False)

videos_kwargs = {
    "min_pixels": 32*14*14,
    "max_pixels": 64*28*28,
    "use_audio_in_video": False,
}
text_kwargs = {
    "truncation": True,
    "padding": True,
    "max_length": 204800,
}
batch_dict = processor(
    text=documents_texts,
    images=images,
    videos=videos,
    audio=audio,
    return_tensors="pt",
    text_kwargs=text_kwargs,
    videos_kwargs=videos_kwargs,
    audio_kwargs={"max_length": 2048000},
)

batch_dict = {k: v.to(model.device) for k, v in batch_dict.items()}
last_hidden_states = model(**batch_dict, output_hidden_states=True).hidden_states[-1]
# Average Pooling
attention_mask = batch_dict["attention_mask"]
last_hidden_states_masked = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
embedding = last_hidden_states_masked.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
embedding = F.normalize(embedding, dim=-1)
print(embedding)
print(embedding.shape)
```
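
Queries are embedded the same way as documents. The sketch below reuses `model`, `processor`, `text_kwargs`, `videos_kwargs`, the pooling steps, and the `embedding` tensor from the example above; the helper name `embed_messages` and the `query: ` prefix (mirroring the `passage: ` prefix used for documents) are assumptions for illustration. Ranking is a plain dot product of the normalized embeddings, i.e. cosine similarity.

```
# Hypothetical helper wrapping the steps above (chat template -> processor -> average pooling).
def embed_messages(messages):
    texts = processor.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
    audio, images, videos = process_mm_info(messages, use_audio_in_video=False)
    batch = processor(text=texts, images=images, videos=videos, audio=audio,
                      return_tensors="pt", text_kwargs=text_kwargs,
                      videos_kwargs=videos_kwargs, audio_kwargs={"max_length": 2048000})
    batch = {k: v.to(model.device) for k, v in batch.items()}
    hidden = model(**batch, output_hidden_states=True).hidden_states[-1]
    mask = batch["attention_mask"]
    pooled = hidden.masked_fill(~mask[..., None].bool(), 0.0).sum(dim=1) / mask.sum(dim=1)[..., None]
    return F.normalize(pooled, dim=-1)

queries = [
    {"role": "user", "content": [{"type": "text", "text": "query: What is being drawn in the video?"}]},
]
query_embedding = embed_messages(queries)   # [1, 2048]
scores = query_embedding @ embedding.T      # cosine similarity against the document embeddings above
print(scores)
```
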

## Software Integration

Runtime Engine(s): TensorRT, Triton
Supported Hardware Microarchitecture Compatibility: A100 40GB, A100 80GB, H100 80GB
Supported Operating System(s): Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

## Model Version(s)
- **NVIDIA Omni Embed Nemotron 3B**
- **Short name:** omni-embed-nemotron-3b-v1

# Training and Evaluation Datasets

## Training Dataset

**Data Modality**:
- Image
- Text

**Image Training Data Size**: 1 Million to 1 Billion Images

**Text Training Data Size**: Less than a Billion Tokens

The model was trained on publicly available datasets, including [HotpotQA](https://huggingface.co/datasets/hotpotqa/hotpot_qa), [MIRACL](https://huggingface.co/datasets/SEACrowd/miracl), [Natural Questions (NQ)](https://huggingface.co/datasets/irds/natural-questions), [Stack Exchange](https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences), [SQuAD](https://huggingface.co/datasets/squad), [Tiger Math/Stack](https://huggingface.co/datasets/TIGER-Lab/WebInstructSub), [DocMatix-IR](https://huggingface.co/datasets/Tevatron/docmatix-ir), [Vidore-ColPali-Training](https://huggingface.co/datasets/vidore/colpali_train_set), and [Wiki-SS-NQ](https://huggingface.co/datasets/Tevatron/wiki-ss-nq).

- **Data Collection Method by dataset:** Hybrid: Automated, Human, Synthetic
- **Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic
- **Properties:** 1M samples from public datasets.

## Evaluation Dataset
We evaluate our model on multiple benchmarks covering different modalities. For text retrieval, we select text retrieval datasets from [MTEB](https://huggingface.co/spaces/mteb/leaderboard). For image retrieval, evaluation is conducted on the public ViDoRe V1 dataset. Since no established video retrieval benchmarks exist, we construct two custom evaluation sets based on the LPM dataset and FineVideo. To provide a fair comparison with state-of-the-art text-only baselines, we use the speech-to-text transcripts released with FineVideo and the transcripts from LPM as the input corpus for standard text retrieval models. Results are reported with NDCG@10 and NDCG@5 (a reference computation is sketched after the list below).

- **Data Collection Method by dataset:** Hybrid: Automated, Human, Synthetic
- **Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic
- **Properties:** More details on ViDoRe V1 can be found on their leaderboard: [Visual Document Retrieval Benchmark](https://huggingface.co/vidore).
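
As a reference for how the NDCG@k numbers in the tables below are computed, here is a minimal sketch (binary relevance grades, log2 discounting); the exact evaluation harness used for these results is not part of this repository.

```
import math

def ndcg_at_k(ranked_relevances, k=10):
    """NDCG@k for a single query; ranked_relevances lists relevance grades in ranked order."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the single relevant document is retrieved at rank 2 out of 10.
print(ndcg_at_k([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], k=10))  # ~0.63
```
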

### Model performance comparison on video retrieval datasets (LPM and FineVideo) using NDCG@10 and NDCG@5

| Model | NDCG@10 LPM | NDCG@10 FineVideo | NDCG@10 Avg | NDCG@5 LPM | NDCG@5 FineVideo | NDCG@5 Avg |
|---|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.8634 | 0.5405 | 0.7020 | 0.8518 | 0.5264 | 0.6891 |
| intfloat/multilingual-e5-large-instruct | 0.7952 | 0.4456 | 0.6204 | 0.7759 | 0.4300 | 0.6030 |
| stella\_en\_1.5B\_v5 | 0.8522 | 0.5359 | 0.6941 | 0.8404 | 0.5206 | 0.6805 |
| nvidia/omni-embed-nemotron-3b | 0.8465 | 0.5662 | **0.7064** | 0.8355 | 0.5486 | **0.6921** |

Multimodal retrieval performance across input modalities on LPM and FineVideo using NDCG@10. The baselines support text only; the multimodal settings apply to the Omni model.

### LPM performance (NDCG@10) modality breakdown

| Model | Text (Transcript+OCR) | Audio-Only | Video-Only | Audio+Video Fusion | Audio+Video Separately |
|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.8634 | N/A | N/A | N/A | N/A |
| intfloat/multilingual-e5-large-instruct | 0.7952 | N/A | N/A | N/A | N/A |
| stella\_en\_1.5B\_v5 | 0.8522 | N/A | N/A | N/A | N/A |
| nvidia/omni-embed-nemotron-3b | 0.8636 | 0.8238 | 0.7365 | 0.8373 | 0.8465 |

### FineVideo performance (NDCG@10) modality breakdown

| Model | Text (Transcript) | Audio-Only | Video-Only | Audio+Video Fusion | Audio+Video Separately |
|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.5405 | N/A | N/A | N/A | N/A |
| intfloat/multilingual-e5-large-instruct | 0.4456 | N/A | N/A | N/A | N/A |
| stella\_en\_1.5B\_v5 | 0.5359 | N/A | N/A | N/A | N/A |
| nvidia/omni-embed-nemotron-3b | 0.6082 | 0.5407 | 0.4488 | 0.4700 | 0.5662 |

### Evaluation of embedding models across text retrieval benchmarks. Results are reported using nDCG@10.

| Model | Avg. | NQ | FiQA-2018 | SciFact | SCIDOCS | ArguAna | NFCorpus | Quora | LegalBench-CorpLobby | CQAdupGaming | CQAdupUnix |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen/Qwen3-Embedding-4B | 0.6654 | 0.6313 | 0.6122 | 0.7833 | 0.3144 | 0.7564 | 0.4110 | 0.8806 | 0.9542 | 0.7151 | 0.5960 |
| intfloat/multilingual-e5-large-instruct | 0.5900 | 0.6350 | 0.4865 | 0.7162 | 0.1924 | 0.5848 | 0.3634 | 0.8926 | 0.9425 | 0.6396 | 0.4473 |
| stella\_en\_1.5B\_v5 | 0.6050 | 0.7180 | 0.5996 | 0.8009 | 0.2677 | 0.5706 | 0.4200 | 0.9003 | 0.9468 | 0.5359 | 0.2903 |
| nvidia/omni-embed-nemotron-3b | 0.6059 | 0.6808 | 0.5382 | 0.7405 | 0.2163 | 0.5891 | 0.3644 | 0.8347 | 0.9413 | 0.6432 | 0.5102 |

### Evaluation of baseline models and our models on [ViDoRe V1](https://huggingface.co/spaces/vidore/vidore-leaderboard) (as of September 30th). Results are reported using nDCG@5.

| Model | Size (M) | Avg. | ArxivQA | DocVQA | InfoVQA | Shift Project | AI | Energy | Gov. Reports | Healthcare | TabFQuad | TAT-DQA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2418 | 90.5 | 87.6 | 64.5 | 93.6 | 92.3 | 100 | 96.6 | 96.7 | 99.6 | 94.3 | 79.8 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 4407 | 91.0 | 88.4 | 66.2 | 94.9 | 90.7 | 99.6 | 96.6 | 97.8 | 99.3 | 95.9 | 80.6 |
| nomic-ai/colnomic-embed-multimodal-3b | 3000 | 89.9 | 88.2 | 61.3 | 92.8 | 90.2 | 96.3 | 97.3 | 96.6 | 98.3 | 94.5 | 83.1 |
| vidore/colqwen2.5-v0.2 | 3000 | 89.6 | 89.1 | 63.5 | 92.6 | 88.0 | 99.6 | 95.8 | 96.6 | 98.0 | 90.8 | 82.1 |
| vidore/colqwen2-v1.0 | 2210 | 89.2 | 88.0 | 61.5 | 92.5 | 89.9 | 99.0 | 95.9 | 95.5 | 98.8 | 89.0 | 82.2 |
| vidore/colpali-v1.3 | 2920 | 84.7 | 83.7 | 58.7 | 85.7 | 76.5 | 96.6 | 94.6 | 95.9 | 97.4 | 86.7 | 70.7 |
| vidore/colpali-v1.2 | 2920 | 83.4 | 77.9 | 56.5 | 82.4 | 78.3 | 97.5 | 94.4 | 94.9 | 95.4 | 88.4 | 68.1 |
| nvidia/omni-embed-nemotron-3b | 4703 | 85.7 | 85.3 | 59.2 | 89.2 | 78.6 | 98.1 | 93.5 | 95.4 | 95.8 | 91.0 | 69.7 |

## Inference
**Acceleration Engine:** Not Applicable
**Test Hardware:** A100 40GB, A100 80GB, H100 80GB

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://app.intigriti.com/programs/nvidia/nvidiavdp/detail).

## Bias

| Field | Response |
| ----- | ----- |
| Participation considerations from adversely impacted groups ([protected classes](https://www.senate.ca.gov/content/protected-classes)) in model design and testing | None |
| Measures taken to mitigate against unwanted bias | None |

## Explainability

| Field | Response |
| ----- | ----- |
| Intended Application & Domain: | Multi-modality corpus and query embedding for question-and-answer retrieval. |
| Model Type: | Transformer encoder. |
| Intended User: | Creators of generative AI focused on conversational models, as well as users aiming to develop question-and-answer applications, can benefit from leveraging dense retrieval technologies. These applications can efficiently handle large, multi-modal corpora, including images, text, videos, and audio. |
| Output: | Array of float numbers (dense vector for the input content, which may include multi-modal corpora). |
| Describe how the model works: | The model transforms the input into a dense vector representation. |
| Performance Metrics: | Accuracy |
| Potential Known Risks: | The model is not guaranteed to always retrieve the correct corpus for a given query. |
| Licensing & Terms of Use: | **Governing Terms:**<br>Your use of the software container and model is governed by the [NVIDIA Software and Model Evaluation License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-and-model-evaluation-license/)<br><br>**Additional Information:**<br>[Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-Omni-3B/blob/main/LICENSE) |
| Technical Limitations: | The model's maximum sequence length is 32768 tokens. Longer sequence inputs should be truncated. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | N/A |
| Verified to have met prescribed NVIDIA quality standards: | Yes |

## Privacy

| Field | Response |
| ----- | ----- |
| Generatable or reverse engineerable personal data? | None |
| Personal data used to create this model? | None |
| How often is dataset reviewed? | The dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes. |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |

## Safety And Security

| Field | Response |
| ----- | ----- |
| Model Application(s): | Multi-modal corpus embedding for retrieval. The model processes input from various modalities (text, image, audio, and video), either independently or in combination. |
| Use Case Restrictions: | Governing Terms: Your use of the model is governed by the [NVIDIA Open Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Qwen RESEARCH LICENSE AGREEMENT](https://huggingface.co/Qwen/Qwen2.5-Omni-3B/blob/main/LICENSE). |
| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access restrictions are enforced during training, and dataset license constraints are adhered to. |
| Describe the life critical impact (if present) | Not applicable |
added_tokens.json
ADDED
@@ -0,0 +1,24 @@
{
  "</tool_call>": 151658,
  "<tool_call>": 151657,
  "<|AUDIO|>": 151646,
  "<|IMAGE|>": 151655,
  "<|VIDEO|>": 151656,
  "<|audio_bos|>": 151647,
  "<|audio_eos|>": 151648,
  "<|box_end|>": 151649,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|vision_bos|>": 151652,
  "<|vision_eos|>": 151653,
  "<|vision_pad|>": 151654
}
chat_template.jinja
ADDED
@@ -0,0 +1,7 @@
{% set audio_count = namespace(value=0) %}{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system
You are a helpful assistant.<|im_end|>
{% endif %}<|im_start|>{{ message['role'] }}
{% if message['content'] is string %}{{ message['content'] }}<|im_end|>
{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_bos|><|IMAGE|><|vision_eos|>{% elif content['type'] == 'audio' or 'audio' in content or 'audio_url' in content %}{% set audio_count.value = audio_count.value + 1 %}{% if add_audio_id %}Audio {{ audio_count.value }}: {% endif %}<|audio_bos|><|AUDIO|><|audio_eos|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_bos|><|VIDEO|><|vision_eos|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>
{% endif %}{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant
{% endif %}
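
For reference, a minimal sketch of what this template produces for a mixed text-and-image message; the `processor` and message structure follow the usage example in the README above, the file name is hypothetical, and the rendered string shown in the comment is what the template logic implies rather than a captured output.

```
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "passage: A chart of quarterly revenue"},
        {"type": "image", "image": "chart.png"},  # hypothetical local file
    ]},
]
rendered = processor.apply_chat_template(messages, add_generation_prompt=False, tokenize=False)
# Expected shape of the rendered string (one <|IMAGE|> placeholder per image):
# <|im_start|>system
# You are a helpful assistant.<|im_end|>
# <|im_start|>user
# passage: A chart of quarterly revenue<|vision_bos|><|IMAGE|><|vision_eos|><|im_end|>
```
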
config.json
ADDED
@@ -0,0 +1,112 @@
{
  "architectures": ["NVOmniEmbedModel"],
  "audio_config": {
    "_attn_implementation_autoset": true,
    "activation_dropout": 0.0,
    "activation_function": "gelu",
    "attention_dropout": 0.0,
    "d_model": 1280,
    "dropout": 0.0,
    "encoder_attention_heads": 20,
    "encoder_ffn_dim": 5120,
    "encoder_layerdrop": 0.0,
    "encoder_layers": 32,
    "init_std": 0.02,
    "initializer_range": 0.02,
    "max_source_positions": 1500,
    "model_type": "qwen2_5_omni_audio_encoder",
    "n_window": 100,
    "num_hidden_layers": 32,
    "num_mel_bins": 128,
    "output_dim": 2048,
    "scale_embedding": false,
    "torch_dtype": "bfloat16"
  },
  "audio_end_token_id": 151648,
  "audio_max_length": 2048000,
  "audio_start_token_id": 151647,
  "audio_token_index": 151646,
  "auto_map": {
    "AutoModel": "modeling_nv_omni_embed.NVOmniEmbedModel",
    "AutoConfig": "modeling_nv_omni_embed.NVOmniEmbedConfig"
  },
  "bos_token_id": 151644,
  "eos_token_id": 151645,
  "ignore_index": -100,
  "image_token_index": 151655,
  "init_std": 0.02,
  "initializer_range": 0.02,
  "model_type": "nvomniembed",
  "pad_token_id": 151643,
  "position_id_per_seconds": 25,
  "resized_height": 680,
  "resized_width": 680,
  "seconds_per_chunk": 2,
  "text_config": {
    "attention_dropout": 0.0,
    "hidden_act": "silu",
    "hidden_size": 2048,
    "init_std": 0.02,
    "initializer_range": 0.02,
    "intermediate_size": 11008,
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "model_type": "qwen2_5_omni_text",
    "num_attention_heads": 16,
    "num_hidden_layers": 36,
    "num_key_value_heads": 2,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "mrope_section": [16, 24, 24],
      "rope_type": "default",
      "type": "default"
    },
    "rope_theta": 1000000.0,
    "sliding_window": 32768,
    "torch_dtype": "bfloat16",
    "use_cache": true,
    "use_sliding_window": false,
    "vocab_size": 151936
  },
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.0.dev0",
  "user_token_id": 872,
  "video_token_index": 151656,
  "vision_config": {
    "_attn_implementation_autoset": true,
    "depth": 32,
    "embed_dim": 1280,
    "fullatt_block_indexes": [7, 15, 23, 31],
    "hidden_act": "silu",
    "hidden_size": 1280,
    "in_channels": 3,
    "in_chans": 3,
    "init_std": 0.02,
    "initializer_range": 0.02,
    "intermediate_size": 3420,
    "model_type": "qwen2_5_omni_vision_encoder",
    "num_heads": 16,
    "out_hidden_size": 2048,
    "patch_size": 14,
    "spatial_merge_size": 2,
    "spatial_patch_size": 14,
    "temporal_patch_size": 2,
    "tokens_per_second": 25,
    "torch_dtype": "bfloat16",
    "window_size": 112
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654
}
generation_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "_from_model_config": true,
  "transformers_version": "4.52.0.dev0"
}
merges.txt
ADDED
The diff for this file is too large to render.
model-00001-of-00002.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ce5d2ff1be32177ae0bc173931f0335a21c5298837257ea1f6d853b222bdf48c
size 4994841696
model-00002-of-00002.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dd03af73e3f2843978de3c867fd515e900fa0e9fd331d3f92529380f4e27a27c
size 4412249056
model.safetensors.index.json
ADDED
The diff for this file is too large to render.
modeling_nv_omni_embed.py
ADDED
@@ -0,0 +1,50 @@
import torch
from transformers import Qwen2_5OmniThinkerTextModel, Qwen2_5OmniThinkerForConditionalGeneration
from transformers.cache_utils import Cache
from transformers.modeling_attn_mask_utils import _prepare_4d_attention_mask
from transformers.models.qwen2_5_omni.configuration_qwen2_5_omni import Qwen2_5OmniThinkerConfig


class BidirectQwen2_5OmniThinkerTextModel(Qwen2_5OmniThinkerTextModel):

    def __init__(self, config):
        super().__init__(config)
        for layer in self.layers:
            layer.self_attn.is_causal = False

    # override the _update_causal_mask method to generate bi-directional attention
    def _update_causal_mask(
        self,
        attention_mask: torch.Tensor,
        input_tensor: torch.Tensor,
        cache_position: torch.Tensor,
        past_key_values: Cache,
        output_attentions: bool = False,
    ):
        calculated_attention_mask = super()._update_causal_mask(
            attention_mask,
            input_tensor,
            cache_position,
            past_key_values,
            output_attentions)
        if calculated_attention_mask is None:
            return None
        if self.config._attn_implementation == "flash_attention_2":
            if attention_mask is not None and 0.0 in attention_mask:
                return attention_mask
        causal_mask = _prepare_4d_attention_mask(
            attention_mask,
            dtype=input_tensor.dtype,
        )
        return causal_mask


class NVOmniEmbedConfig(Qwen2_5OmniThinkerConfig):
    model_type = "nvomniembed"


class NVOmniEmbedModel(Qwen2_5OmniThinkerForConditionalGeneration):
    config_class = NVOmniEmbedConfig

    def __init__(self, config):
        super().__init__(config)
        self.model = BidirectQwen2_5OmniThinkerTextModel._from_config(
            config.text_config, attn_implementation=config._attn_implementation
        )
preprocessor_config.json
ADDED
@@ -0,0 +1,31 @@
{
  "chunk_length": 300,
  "dither": 0.0,
  "feature_extractor_type": "WhisperFeatureExtractor",
  "feature_size": 128,
  "hop_length": 160,
  "image_mean": [0.48145466, 0.4578275, 0.40821073],
  "image_processor_type": "Qwen2VLImageProcessor",
  "image_std": [0.26862954, 0.26130258, 0.27577711],
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pixels": 3136,
  "n_fft": 400,
  "n_samples": 4800000,
  "nb_max_frames": 30000,
  "padding_side": "right",
  "padding_value": 0.0,
  "patch_size": 14,
  "processor_class": "Qwen2_5OmniProcessor",
  "return_attention_mask": true,
  "sampling_rate": 16000,
  "temporal_patch_size": 2
}
special_tokens_map.json
ADDED
@@ -0,0 +1,38 @@
{
  "additional_special_tokens": [
    "<|im_start|>", "<|im_end|>", "<|AUDIO|>", "<|audio_bos|>", "<|audio_eos|>",
    "<|box_end|>", "<|quad_start|>", "<|quad_end|>", "<|vision_bos|>",
    "<|vision_eos|>", "<|vision_pad|>", "<|IMAGE|>", "<|VIDEO|>"
  ],
  "audio_bos_token": "<|audio_bos|>",
  "audio_eos_token": "<|audio_eos|>",
  "audio_token": "<|AUDIO|>",
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "image_token": "<|IMAGE|>",
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "video_token": "<|VIDEO|>",
  "vision_bos_token": "<|vision_bos|>",
  "vision_eos_token": "<|vision_eos|>"
}
tokenizer.json
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e37ccc49ca787578f0a528ff59a0e1dc7605031c1e0481c32b37fcd1eb03f5e2
size 11422135
tokenizer_config.json
ADDED
@@ -0,0 +1,229 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151644": {"content": "<|im_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151645": {"content": "<|im_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151646": {"content": "<|AUDIO|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151647": {"content": "<|audio_bos|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151648": {"content": "<|audio_eos|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151649": {"content": "<|box_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151650": {"content": "<|quad_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151651": {"content": "<|quad_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151652": {"content": "<|vision_bos|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151653": {"content": "<|vision_eos|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151654": {"content": "<|vision_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151655": {"content": "<|IMAGE|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151656": {"content": "<|VIDEO|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
    "151657": {"content": "<tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151658": {"content": "</tool_call>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151659": {"content": "<|fim_prefix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151660": {"content": "<|fim_middle|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151661": {"content": "<|fim_suffix|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151662": {"content": "<|fim_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151663": {"content": "<|repo_name|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false},
    "151664": {"content": "<|file_sep|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": false}
  },
  "additional_special_tokens": [
    "<|im_start|>", "<|im_end|>", "<|AUDIO|>", "<|audio_bos|>", "<|audio_eos|>",
    "<|box_end|>", "<|quad_start|>", "<|quad_end|>", "<|vision_bos|>",
    "<|vision_eos|>", "<|vision_pad|>", "<|IMAGE|>", "<|VIDEO|>"
  ],
  "audio_bos_token": "<|audio_bos|>",
  "audio_eos_token": "<|audio_eos|>",
  "audio_token": "<|AUDIO|>",
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {
    "audio_bos_token": "<|audio_bos|>",
    "audio_eos_token": "<|audio_eos|>",
    "audio_token": "<|AUDIO|>",
    "image_token": "<|IMAGE|>",
    "video_token": "<|VIDEO|>",
    "vision_bos_token": "<|vision_bos|>",
    "vision_eos_token": "<|vision_eos|>"
  },
  "image_token": "<|IMAGE|>",
  "max_length": 900,
  "model_max_length": 32768,
  "pad_to_multiple_of": null,
  "pad_token": "<|endoftext|>",
  "pad_token_type_id": 0,
  "padding_side": "left",
  "processor_class": "Qwen2_5OmniProcessor",
  "split_special_tokens": false,
  "stride": 0,
  "tokenizer_class": "Qwen2Tokenizer",
  "truncation_side": "right",
  "truncation_strategy": "longest_first",
  "unk_token": null,
  "video_token": "<|VIDEO|>",
  "vision_bos_token": "<|vision_bos|>",
  "vision_eos_token": "<|vision_eos|>"
}
vocab.json
ADDED
The diff for this file is too large to render.