🐙 Octopus-Omni-Embed: Multi-Modal Embedding Model
A cost-efficient multi-modal embedding model that encodes text, images, and video into a common 2048-dimensional latent space. Built on the Qwen2.5-Omni-3B Thinker architecture and trained for $25 on purposefully limited data as an educational demonstration.
🎯 Project Goal: Demonstrate that a functional multi-modal embedding model can be built on a hobby budget (<$50) with limited data (110K samples), not to compete with SOTA models or chase leaderboard rankings. This is an educational project showing the training process, not a production research model.
Proven Capabilities
Successfully encodes 3 modalities in a common latent space:
- ✅ TEXT: [1, 2048], norm=1.0 (TRAINED on 60K pairs, Stage 1+2)
- ✅ IMAGE: [1, 2048], norm=1.0 (TRAINED on 50K pairs, Stage 2)
- ✅ VIDEO: [1, 2048], norm=1.0 (FROZEN encoder, inherited from base Thinker)
Cross-Modal Similarities:
- Text ↔ Image: 0.129
- Text ↔ Video: -0.133
- Image ↔ Video: 0.019
Model Details
Architecture: Thinker-Only Approach
Unlike the full Qwen2.5-Omni model (Thinker + Talker):
- ✅ Thinker Component: Multi-modal encoder for understanding text, image, audio, video - WE USE THIS
- ❌ Talker Component: Speech synthesis/generation - WE EXCLUDE THIS
This Thinker-only design follows NVIDIA's Omni-Embed approach: we leverage the pre-trained multi-modal understanding capabilities while excluding the speech-generation components that embedding tasks do not need.
Why Thinker-Only?
- Focuses on encoding/understanding, not generation
- Preserves frozen video/audio encoders from base model
- Reduces complexity and training requirements
- Optimizes for retrieval, not speech synthesis
Model Configuration
- Base Model: Qwen/Qwen2.5-Omni-3B Thinker component
- Total Parameters: 3B (NOT 7B)
- Trainable Parameters: ~6.84% of total, via LoRA adapters on the language model
- LoRA Configuration:
- Rank (r): 16
- Alpha (α): 32
- Target modules: q_proj, v_proj, k_proj, o_proj (language model only)
- Frozen Components:
- Vision encoder (capabilities preserved from base)
- Video encoder (functional, inherited from Thinker)
- Audio encoder (present but not validated)
- Embedding Dimension: 2048
- Pooling: Weighted attention pooling + L2 normalization (sketched in code right after this list)
- Training Strategy: Two-stage contrastive learning
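To make the pooling step concrete, here is a minimal sketch of weighted attention pooling followed by L2 normalization. It assumes a single learned scoring layer over the Thinker's final hidden states; the actual pooling head in this repository may differ in its details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedAttentionPooling(nn.Module):
    """Minimal sketch: pool token hidden states into one embedding
    with learned attention weights, then L2-normalize (norm = 1.0)."""
    def __init__(self, hidden_dim: int = 2048):
        super().__init__()
        # Hypothetical learned scorer; the real head may be structured differently.
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: [batch, seq_len, hidden_dim]
        # attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding
        scores = self.score(hidden_states).squeeze(-1)            # [batch, seq_len]
        scores = scores.masked_fill(attention_mask == 0, float("-inf"))
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)     # [batch, seq_len, 1]
        pooled = (weights * hidden_states).sum(dim=1)             # [batch, hidden_dim]
        return F.normalize(pooled, p=2, dim=-1)                   # unit-length embedding

# Quick shape check with dummy inputs
pooler = WeightedAttentionPooling(2048)
dummy_hidden = torch.randn(2, 16, 2048)
dummy_mask = torch.ones(2, 16, dtype=torch.long)
print(pooler(dummy_hidden, dummy_mask).norm(dim=-1))  # tensor([1.0000, 1.0000])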
Training Details
Purposefully Limited Data Strategy
Why Limited Data? This project intentionally uses a small dataset (110K samples) to demonstrate:
- What's achievable on a hobby budget (<$50)
- Training methodology without enterprise resources
- Educational value over SOTA performance
- Proof that frozen encoders (video) work without training
Not trying to:
- ❌ Match NVIDIA's 1M+ sample training
- ❌ Achieve BEIR/MTEB leaderboard scores
- ❌ Create production-ready embeddings
- ❌ Write a research paper
Two-Stage Training
Stage 1: Text-Text Embedding Foundation
- Dataset: 60,000 text pairs
- SQuAD: 30,000 question-context pairs
- HotpotQA: 30,000 multi-hop question pairs
- Epochs: 5
- Duration: 11.5 hours
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- Cost: $8.00 @ $0.69/hour
- Loss: 0.825 → 0.319 (61% reduction)
- Objective: Learn text semantic similarity in latent space
- Batch Size: 8 (effective 16 with gradient accumulation)
Stage 2: Cross-Modal Text-Image Alignment
- Dataset: 50,000 text-image pairs
- DocVQA: 50,000 document images + questions
- Epochs: 2 (stopped early due to excellent convergence)
- Duration: 24 hours
- GPU: NVIDIA L40S (48GB VRAM)
- Cost: $17.00 @ $0.87/hour
- Loss: 0.737 → 0.0137 (98% total reduction!)
- Objective: Align text and image representations
- Batch Size: 8 (effective 16 with gradient accumulation)
Total Training:
- Time: ~36 hours total (Stage 1 on an RTX 4090, Stage 2 on an L40S)
- Cost: $25.00 (50% under $50 budget!)
- Total Samples: 110,000 (purposefully limited)
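"Two-stage contrastive learning" here means in-batch contrastive training on paired examples: question-context pairs in Stage 1 and question-document-image pairs in Stage 2, where each query's positive is its paired item and the other items in the batch act as negatives. Below is a minimal InfoNCE-style sketch of that objective; the temperature value and any refinements in the actual training scripts are assumptions.

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature: float = 0.05):
    """Minimal in-batch contrastive (InfoNCE-style) loss sketch.

    query_emb, doc_emb: [batch, dim], already L2-normalized.
    Query i's positive is the document at index i; all other in-batch
    documents serve as negatives. The temperature is illustrative only.
    """
    logits = query_emb @ doc_emb.T / temperature          # [batch, batch] cosine logits
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)

# Dummy example with random unit vectors
q = F.normalize(torch.randn(8, 2048), dim=-1)
d = F.normalize(torch.randn(8, 2048), dim=-1)
print(info_nce_loss(q, d))  # scalar loss over the in-batch softmax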
Training Configuration
# Stage 1 & 2 shared config
optimizer: AdamW
learning_rate: 2e-5 (Stage 1), 1e-5 (Stage 2)
weight_decay: 0.01
batch_size: 8
gradient_accumulation_steps: 2
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
max_seq_length: 512
# Frozen components
vision_encoder: frozen (from Qwen2.5-Omni)
video_encoder: frozen (from Qwen2.5-Omni)
audio_encoder: frozen (from Qwen2.5-Omni)
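For reference, the LoRA settings above map roughly onto the Hugging Face peft library as shown below. This is an illustrative sketch rather than the project's actual training script, and attaching the adapters (shown in comments) assumes a `thinker` module that has been loaded separately.

# Sketch of the LoRA configuration above expressed with `peft`;
# the project's own training code may construct this differently.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # language-model attention only
    bias="none",
)
print(lora_config)

# Attaching the adapters would look roughly like this, assuming `thinker`
# is the loaded Qwen2.5-Omni-3B Thinker module:
#   from peft import get_peft_model
#   peft_thinker = get_peft_model(thinker, lora_config)
# The vision, video, and audio encoders stay frozen (requires_grad = False),
# matching the "Frozen components" block above.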
Why Only 2 Epochs in Stage 2?
Loss converged to 0.0137 (98% reduction from start), indicating the model learned the text-image alignment efficiently. Training longer risked overfitting on our purposefully small dataset.
Performance
Text-to-Image Retrieval (Evaluation on 1000 samples)
- nDCG@10: 0.578 (Target: 0.40) ✅ 44% above target!
- nDCG@5: 0.549
- Recall@1: 29.0%
- Recall@10: 86.4%
Text-to-Text Retrieval
- nDCG@10: 0.008 (needs improvement)
- Recall@10: 1.7%
Note: The model excels at cross-modal (text-image) retrieval, as designed. Text-text performance is low, likely due to:
- Small training set (60K vs NVIDIA's 500K+)
- Possible data leakage in evaluation
- Focus on cross-modal rather than uni-modal retrieval
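For readers who want to reproduce numbers like these approximately, the sketch below computes Recall@k and a single-positive nDCG@k from L2-normalized query and document embeddings. It assumes one relevant document per query at the matching index; the project's actual evaluation script may differ.

import torch
import torch.nn.functional as F

def recall_at_k(query_emb, doc_emb, k: int = 10):
    """Recall@k for paired data: query i's relevant document is doc i.
    Embeddings are unit-length, so the dot product is cosine similarity."""
    sims = query_emb @ doc_emb.T                       # [N, N] similarity matrix
    topk = sims.topk(k, dim=-1).indices                # [N, k] ranked doc indices
    targets = torch.arange(query_emb.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

def ndcg_at_k_single_positive(query_emb, doc_emb, k: int = 10):
    """With exactly one relevant doc per query, nDCG@k reduces to
    1 / log2(rank + 1) when the positive lands in the top k, else 0."""
    sims = query_emb @ doc_emb.T
    sorted_docs = sims.argsort(dim=-1, descending=True)
    hits = (sorted_docs == torch.arange(sims.size(0)).unsqueeze(-1)).float()
    ranks = hits.argmax(dim=-1) + 1                    # 1-indexed rank of the positive
    gains = torch.where(ranks <= k,
                        1.0 / torch.log2(ranks.float() + 1.0),
                        torch.zeros_like(ranks, dtype=torch.float))
    return gains.mean().item()

# Dummy example with random unit vectors (real usage would pass model outputs)
q = F.normalize(torch.randn(100, 2048), dim=-1)
d = F.normalize(torch.randn(100, 2048), dim=-1)
print(recall_at_k(q, d, k=10), ndcg_at_k_single_positive(q, d, k=10))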
Comparison to SOTA
We ARE NOT competing with:
- NVIDIA Omni-Embed-Nemotron (1M+ samples, enterprise GPUs)
- CLIP/SigLIP (400M+ image-text pairs)
- Text embedding leaderboards (MTEB, BEIR)
What we achieved:
- Functional multi-modal embeddings on $25 budget
- TEXT + IMAGE + VIDEO encoding in common space
- Educational demonstration of training methodology
- 44% above our modest nDCG@10 target (0.40)
Output Specification
- Output Type: Floats (PyTorch tensor)
- Output Format: torch.Tensor
- Output Shape: [batch_size, 2048]
- Output Properties:
- L2-normalized (norm=1.0 for all embeddings)
- Suitable for cosine similarity comparison
- Common latent space across all modalities
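Because every output is unit-length, cosine similarity is simply a dot product. A tiny check of that contract, using random stand-in tensors in place of real model outputs:

import torch
import torch.nn.functional as F

# Stand-in embeddings (random here, just to illustrate the output contract);
# real model outputs are already L2-normalized [batch, 2048] tensors.
emb_a = F.normalize(torch.randn(1, 2048), dim=-1)
emb_b = F.normalize(torch.randn(1, 2048), dim=-1)

print(emb_a.norm(dim=-1))                         # tensor([1.0000])
print(F.cosine_similarity(emb_a, emb_b, dim=-1))  # cosine similarity...
print((emb_a * emb_b).sum(dim=-1))                # ...equals the dot product for unit vectors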
Usage
import torch
import sys
sys.path.append('src')  # If needed for custom model code
from model.omni_embed_model import OmniEmbedModel
from PIL import Image

# Load model from HuggingFace
model = OmniEmbedModel.from_pretrained('sugiv/octopus-omni-embed')
model.to('cuda').eval()
processor = model.processor

# 1. Encode TEXT
text_inputs = processor(
    text=['passage: A person walking in the park'],
    return_tensors='pt',
    padding=True
)
text_inputs = {k: v.to('cuda') for k, v in text_inputs.items()}
with torch.no_grad():
    text_emb = model(**text_inputs)
print(f"Text embedding: {text_emb.shape}")  # [1, 2048], norm=1.0

# 2. Encode IMAGE
image = Image.open('document.jpg')
image_inputs = processor(
    images=[image],
    text=[''],  # Empty text for pure image embedding
    return_tensors='pt',
    padding=True
)
image_inputs = {k: v.to('cuda') for k, v in image_inputs.items()}
with torch.no_grad():
    image_emb = model(**image_inputs)
print(f"Image embedding: {image_emb.shape}")  # [1, 2048], norm=1.0

# 3. Encode VIDEO (8 frames)
video_frames = [Image.open(f'frame_{i}.jpg') for i in range(8)]
video_inputs = processor(
    videos=[video_frames],
    text=['describe this video'],
    return_tensors='pt',
    padding=True,
    use_audio_in_video=False
)
video_inputs = {k: v.to('cuda') for k, v in video_inputs.items()}
with torch.no_grad():
    video_emb = model(**video_inputs)
print(f"Video embedding: {video_emb.shape}")  # [1, 2048], norm=1.0

# 4. Compute Cross-Modal Similarity
similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb, dim=-1)
print(f"Text-Image similarity: {similarity.item():.4f}")
Use Cases
What This Model IS Good For:
- Educational demonstrations of multi-modal training
- Proof-of-concept for hobby projects
- Document visual Q&A (trained on DocVQA)
- Small-scale retrieval tasks (< 10K documents)
- Learning/experimenting with multi-modal embeddings
What This Model IS NOT For:
- ❌ Production systems requiring SOTA performance
- ❌ Large-scale retrieval (millions of documents)
- ❌ Academic benchmarks (BEIR, MTEB)
- ❌ Critical applications requiring high accuracy
- ❌ Comparing against enterprise models (NVIDIA, OpenAI, etc.)
Limitations
By Design (Purposeful Choices)
- Small training set: 110K samples vs NVIDIA's 1M+
- Limited epochs: 5+2 epochs vs typical 10-20
- Single GPU training: vs multi-GPU enterprise setups
- No audio training: Audio encoder frozen, not optimized
- No video training: Video encoder frozen, functional but not tuned
Performance Limitations
- Text-text retrieval underperforms (nDCG@10: 0.008)
- Not competitive with SOTA on standard benchmarks
- Best for document understanding (DocVQA domain)
- Trained only on English text and images
- May overfit to DocVQA document style
Technical Limitations
- Video encoder not optimized for retrieval (frozen from base)
- Audio encoding untested (encoder exists but not validated)
- No multilingual support (English only)
- Limited to 2048-dim embeddings (no configurable dims)
Why "Octopus"? π
The octopus is known for:
- Intelligence: Problem-solving and learning abilities
- Multi-sensory processing: Integrating visual, tactile, and chemical signals simultaneously
- Adaptability: Thriving in resource-constrained environments
Likewise, this model processes text, images, and video in a unified embedding space while working within budget constraints!
Installation
# Required packages
pip install torch transformers pillow
# Optional: For video processing
pip install qwen-omni-utils
Hardware Requirements
Minimum (Inference):
- GPU: 12GB VRAM (RTX 3090, A5000)
- RAM: 16GB
- Storage: 15GB for model
Recommended (Inference):
- GPU: 24GB VRAM (RTX 4090, A6000)
- RAM: 32GB
- Storage: 20GB
Citation
If you use this model or find the training methodology helpful:
@misc{octopus-omni-embed-2025,
  title={Octopus-Omni-Embed: Cost-Efficient Multi-Modal Embedding Model},
  author={Sugi Valluri},
  year={2025},
  url={https://huggingface.co/sugiv/octopus-omni-embed},
  note={Educational project demonstrating multi-modal training on a $25 budget}
}
License
Apache 2.0 (inherited from base Qwen2.5-Omni-3B model)
Acknowledgments
- Base Model: Qwen2.5-Omni-3B by Alibaba Cloud
- Inspiration: NVIDIA Omni-Embed-Nemotron paper
- Training Data: DocVQA, SQuAD, HotpotQA datasets
Links
- Model Repository: https://huggingface.co/sugiv/octopus-omni-embed
- Base Model: https://huggingface.co/Qwen/Qwen2.5-Omni-3B
- NVIDIA Paper: https://arxiv.org/abs/2510.03458
Version: 1.0 (Stage 2, Epoch 2)
Release Date: October 29, 2025
Training Cost: $25.00
Training Samples: 110,000
Built with ❤️ and limited resources to show that anyone can train multi-modal models!