πŸ™ Octopus-Omni-Embed: Multi-Modal Embedding Model

A cost-efficient multi-modal embedding model that encodes Text, Images, and Video into a common 2048-dimensional latent space. Built on the Qwen2.5-Omni-3B Thinker architecture and trained for $25 on purposefully limited data as an educational demonstration.

🎯 Project Goal: Demonstrate that you can build a functional multi-modal embedding model on a hobby budget (<$50) with limited data (110K samples), not to compete with SOTA models or achieve leaderboard rankings. This is an educational project showing the training process, not a production research model.

Proven Capabilities

Successfully encodes 3 modalities in a common latent space:

  • ✅ TEXT: [1, 2048], norm=1.0 (TRAINED on 60K pairs, Stage 1+2)
  • ✅ IMAGE: [1, 2048], norm=1.0 (TRAINED on 50K pairs, Stage 2)
  • ✅ VIDEO: [1, 2048], norm=1.0 (FROZEN encoder, inherited from base Thinker)

Cross-Modal Similarities:

  • Text ↔ Image: 0.129
  • Text ↔ Video: -0.133
  • Image ↔ Video: 0.019

Model Details

Architecture: Thinker-Only Approach

Unlike the full Qwen2.5-Omni model (Thinker + Talker):

  • ✅ Thinker Component: Multi-modal encoder for understanding text, image, audio, video - WE USE THIS
  • ❌ Talker Component: Speech synthesis/generation - WE EXCLUDE THIS

This Thinker-only design follows NVIDIA's Omni-Embed approach: we leverage the pre-trained multi-modal understanding capabilities while excluding the speech-generation components that are unnecessary for embedding tasks.

Why Thinker-Only?

  • Focuses on encoding/understanding, not generation
  • Preserves frozen video/audio encoders from base model
  • Reduces complexity and training requirements
  • Optimizes for retrieval, not speech synthesis

Model Configuration

  • Base Model: Qwen/Qwen2.5-Omni-3B Thinker component
  • Total Parameters: 3B (NOT 7B)
  • Trainable Parameters: ~6.84% of total, via LoRA adapters
  • LoRA Configuration:
    • Rank (r): 16
    • Alpha (α): 32
    • Target modules: q_proj, v_proj, k_proj, o_proj (language model only)
  • Frozen Components:
    • Vision encoder (capabilities preserved from base)
    • Video encoder (functional, inherited from Thinker)
    • Audio encoder (present but not validated)
  • Embedding Dimension: 2048
  • Pooling: Weighted attention pooling + L2 normalization
  • Training Strategy: Two-stage contrastive learning
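
The weighted attention pooling step above can be sketched as a learned scalar score per token, a masked softmax, and a weighted sum of the hidden states, followed by L2 normalization. A minimal sketch, assuming a single linear scoring layer (the actual module in this repo may differ):

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPool(nn.Module):
    """Weighted attention pooling over token hidden states, then L2 normalization.
    Illustrative only; layer names and hyperparameters are assumptions."""

    def __init__(self, hidden_dim: int = 2048):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)  # one scalar attention score per token

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_dim]; attention_mask: [batch, seq_len]
        scores = self.score(hidden_states).squeeze(-1)          # [batch, seq_len]
        scores = scores.masked_fill(attention_mask == 0, -1e4)  # ignore padding tokens
        weights = torch.softmax(scores, dim=-1).unsqueeze(-1)   # [batch, seq_len, 1]
        pooled = (weights * hidden_states).sum(dim=1)           # [batch, hidden_dim]
        return F.normalize(pooled, p=2, dim=-1)                 # unit-norm embedding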

Training Details

Purposefully Limited Data Strategy

Why Limited Data? This project intentionally uses a small dataset (110K samples) to demonstrate:

  1. What's achievable on a hobby budget (<$50)
  2. Training methodology without enterprise resources
  3. Educational value over SOTA performance
  4. Proof that frozen encoders (video) work without training

Not trying to:

  • ❌ Match NVIDIA's 1M+ sample training
  • ❌ Achieve BEIR/MTEB leaderboard scores
  • ❌ Create production-ready embeddings
  • ❌ Write a research paper

Two-Stage Training

Stage 1: Text-Text Embedding Foundation

  • Dataset: 60,000 text pairs
    • SQuAD: 30,000 question-context pairs
    • HotpotQA: 30,000 multi-hop question pairs
  • Epochs: 5
  • Duration: 11.5 hours
  • GPU: NVIDIA RTX 4090 (24GB VRAM)
  • Cost: $8.00 @ $0.69/hour
  • Loss: 0.825 → 0.319 (61% reduction)
  • Objective: Learn text semantic similarity in latent space
  • Batch Size: 8 (effective 16 with gradient accumulation)

Stage 2: Cross-Modal Text-Image Alignment

  • Dataset: 50,000 text-image pairs
    • DocVQA: 50,000 document images + questions
  • Epochs: 2 (stopped early due to excellent convergence)
  • Duration: 24 hours
  • GPU: NVIDIA L40S (48GB VRAM)
  • Cost: $17.00 @ $0.87/hour
  • Loss: 0.737 → 0.0137 (98% total reduction!)
  • Objective: Align text and image representations
  • Batch Size: 8 (effective 16 with gradient accumulation)

Total Training:

  • Time: ~36 hours across 2 GPUs
  • Cost: $25.00 (50% under $50 budget!)
  • Total Samples: 110,000 (purposefully limited)
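
The card describes the objective as two-stage contrastive learning; a common in-batch formulation is InfoNCE over paired embeddings. A minimal sketch (the symmetric form and the temperature of 0.05 are assumptions, not stated above):

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """In-batch contrastive loss: the positive for query i is doc i; all other docs
    in the batch serve as negatives. Assumes L2-normalized inputs of shape [batch, dim]."""
    logits = query_emb @ doc_emb.T / temperature                  # [batch, batch] similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)  # matching pairs sit on the diagonal
    # Symmetric variant: average the query->doc and doc->query cross-entropy terms
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2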

Training Configuration

# Stage 1 & 2 shared config
optimizer: AdamW
learning_rate: 2e-5 (Stage 1), 1e-5 (Stage 2)
weight_decay: 0.01
batch_size: 8
gradient_accumulation_steps: 2
lora_rank: 16
lora_alpha: 32
lora_dropout: 0.05
target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]
max_seq_length: 512

# Frozen components
vision_encoder: frozen (from Qwen2.5-Omni)
video_encoder: frozen (from Qwen2.5-Omni)
audio_encoder: frozen (from Qwen2.5-Omni)
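
In code, this configuration maps roughly onto a PEFT LoRA setup plus explicit freezing of the encoder towers. A hedged sketch, assuming the peft library is used and that the encoder submodule names contain "visual" and "audio" (neither is confirmed by this card):

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],  # language-model attention only
    bias="none",
)
model = get_peft_model(base_thinker, lora_config)  # base_thinker: the loaded Thinker backbone (loading code omitted)

# Keep the vision/video and audio encoders frozen (submodule name matching is an assumption)
for name, param in model.named_parameters():
    if "visual" in name or "audio" in name:
        param.requires_grad = False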

Why Only 2 Epochs in Stage 2?

Loss converged to 0.0137 (98% reduction from start), indicating the model learned the text-image alignment efficiently. Training longer risked overfitting on our purposefully small dataset.

Performance

Text-to-Image Retrieval (Evaluation on 1000 samples)

  • nDCG@10: 0.578 (Target: 0.40) ✅ 44% above target!
  • nDCG@5: 0.549
  • Recall@1: 29.0%
  • Recall@10: 86.4%

Text-to-Text Retrieval

  • nDCG@10: 0.008 (needs improvement)
  • Recall@10: 1.7%

Note: The model excels at cross-modal (text-image) retrieval, as designed. Text-text performance is low, likely due to:

  1. Small training set (60K vs NVIDIA's 500K+)
  2. Possible data leakage in evaluation
  3. Focus on cross-modal rather than uni-modal retrieval
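
For reference, the reported nDCG@10 and Recall@10 figures can be reproduced from per-query ranks. A minimal sketch assuming exactly one relevant document per query (which matches the paired-data setup), so the ideal DCG is 1:

import torch

def ndcg_and_recall_at_k(ranks, k: int = 10):
    """ranks: 1-based rank of each query's single relevant document in the retrieved list.
    With one relevant item per query, DCG@k is 1/log2(rank + 1) when rank <= k, else 0."""
    ranks = torch.as_tensor(ranks, dtype=torch.float)
    hit = ranks <= k
    gains = torch.where(hit, 1.0 / torch.log2(ranks + 1), torch.zeros_like(ranks))
    return gains.mean().item(), hit.float().mean().item()  # (nDCG@k, Recall@k)

print(ndcg_and_recall_at_k([1, 3, 12, 2]))  # example: ranks for four queries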

Comparison to SOTA

We ARE NOT competing with:

  • NVIDIA Omni-Embed-Nemotron (1M+ samples, enterprise GPUs)
  • CLIP/SigLIP (400M+ image-text pairs)
  • Text embedding leaderboards (MTEB, BEIR)

What we achieved:

  • Functional multi-modal embeddings on $25 budget
  • TEXT + IMAGE + VIDEO encoding in common space
  • Educational demonstration of training methodology
  • 44% above our modest nDCG@10 target (0.40)

Output Specification

  • Output Type: Floats (PyTorch tensor)
  • Output Format: torch.Tensor
  • Output Shape: [batch_size, 2048]
  • Output Properties:
    • L2-normalized (norm=1.0 for all embeddings)
    • Suitable for cosine similarity comparison
    • Common latent space across all modalities
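
Because every embedding is unit-norm, cosine similarity is just a dot product, so small-scale retrieval is a single matrix multiply. A quick sketch with random stand-ins for embeddings produced as in the Usage section below:

import torch
import torch.nn.functional as F

# Stand-ins for real embeddings from the model (already L2-normalized)
query_emb = F.normalize(torch.randn(1, 2048), dim=-1)
doc_embs = F.normalize(torch.randn(1000, 2048), dim=-1)

scores = query_emb @ doc_embs.T           # cosine similarities, shape [1, 1000]
top = torch.topk(scores, k=10, dim=-1)    # indices and scores of the 10 nearest documents
print(top.indices, top.values)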

Usage

import torch
import sys
sys.path.append('src')  # If needed for custom model code
from model.omni_embed_model import OmniEmbedModel
from PIL import Image

# Load model from HuggingFace
model = OmniEmbedModel.from_pretrained('sugiv/octopus-omni-embed')
model.to('cuda').eval()
processor = model.processor

# 1. Encode TEXT
text_inputs = processor(
    text=['passage: A person walking in the park'],
    return_tensors='pt',
    padding=True
)
text_inputs = {k: v.to('cuda') for k, v in text_inputs.items()}
with torch.no_grad():
    text_emb = model(**text_inputs)
print(f"Text embedding: {text_emb.shape}")  # [1, 2048], norm=1.0

# 2. Encode IMAGE
image = Image.open('document.jpg')
image_inputs = processor(
    images=[image],
    text=[''],  # Empty text for pure image embedding
    return_tensors='pt',
    padding=True
)
image_inputs = {k: v.to('cuda') for k, v in image_inputs.items()}
with torch.no_grad():
    image_emb = model(**image_inputs)
print(f"Image embedding: {image_emb.shape}")  # [1, 2048], norm=1.0

# 3. Encode VIDEO (8 frames)
video_frames = [Image.open(f'frame_{i}.jpg') for i in range(8)]
video_inputs = processor(
    videos=[video_frames],
    text=['describe this video'],
    return_tensors='pt',
    padding=True,
    use_audio_in_video=False
)
video_inputs = {k: v.to('cuda') for k, v in video_inputs.items()}
with torch.no_grad():
    video_emb = model(**video_inputs)
print(f"Video embedding: {video_emb.shape}")  # [1, 2048], norm=1.0

# 4. Compute Cross-Modal Similarity
similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb, dim=-1)
print(f"Text-Image similarity: {similarity.item():.4f}")

Use Cases

What This Model IS Good For:

  1. Educational demonstrations of multi-modal training
  2. Proof-of-concept for hobby projects
  3. Document visual Q&A (trained on DocVQA)
  4. Small-scale retrieval tasks (< 10K documents)
  5. Learning/experimenting with multi-modal embeddings

What This Model IS NOT For:

  1. ❌ Production systems requiring SOTA performance
  2. ❌ Large-scale retrieval (millions of documents)
  3. ❌ Academic benchmarks (BEIR, MTEB)
  4. ❌ Critical applications requiring high accuracy
  5. ❌ Comparing against enterprise models (NVIDIA, OpenAI, etc.)

Limitations

By Design (Purposeful Choices)

  • Small training set: 110K samples vs NVIDIA's 1M+
  • Limited epochs: 5+2 epochs vs typical 10-20
  • Single GPU training: vs multi-GPU enterprise setups
  • No audio training: Audio encoder frozen, not optimized
  • No video training: Video encoder frozen, functional but not tuned

Performance Limitations

  • Text-text retrieval underperforms (nDCG@10: 0.008)
  • Not competitive with SOTA on standard benchmarks
  • Best for document understanding (DocVQA domain)
  • Trained only on English text and images
  • May overfit to DocVQA document style

Technical Limitations

  • Video encoder not optimized for retrieval (frozen from base)
  • Audio encoding untested (encoder exists but not validated)
  • No multilingual support (English only)
  • Limited to 2048-dim embeddings (no configurable dims)

Why "Octopus"? πŸ™

The octopus is known for:

  • Intelligence: Problem-solving and learning abilities
  • Multi-sensory processing: Integrating visual, tactile, and chemical signals simultaneously
  • Adaptability: Thriving in resource-constrained environments

Just like this model processes text, images, and video in a unified embedding space while working within budget constraints!

Installation

# Required packages
pip install torch transformers pillow

# Optional: For video processing
pip install qwen-omni-utils

Hardware Requirements

Minimum (Inference):

  • GPU: 12GB VRAM (RTX 3090, A5000)
  • RAM: 16GB
  • Storage: 15GB for model

Recommended (Inference):

  • GPU: 24GB VRAM (RTX 4090, A6000)
  • RAM: 32GB
  • Storage: 20GB
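
If VRAM is tight, loading the weights in bfloat16 roughly halves memory. A sketch that assumes OmniEmbedModel.from_pretrained forwards torch_dtype to the underlying transformers loader (not confirmed by this card):

import sys
import torch
sys.path.append('src')  # if needed for the custom model code, as in the Usage section
from model.omni_embed_model import OmniEmbedModel

# Assumption: extra keyword arguments are passed through to transformers' from_pretrained
model = OmniEmbedModel.from_pretrained('sugiv/octopus-omni-embed', torch_dtype=torch.bfloat16)
model.to('cuda').eval()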

Citation

If you use this model or find the training methodology helpful:

@misc{octopus-omni-embed-2025,
  title={Octopus-Omni-Embed: Cost-Efficient Multi-Modal Embedding Model},
  author={Sugi Valluri},
  year={2025},
  url={https://huggingface.co/sugiv/octopus-omni-embed},
  note={Educational project demonstrating multi-modal training on $25 budget}
}

License

Apache 2.0 (inherited from base Qwen2.5-Omni-3B model)

Acknowledgments

Links


Version: 1.0 (Stage 2, Epoch 2)
Release Date: October 29, 2025
Training Cost: $25.00
Training Samples: 110,000

Built with ❤️ and limited resources to show that anyone can train multi-modal models!
