CLIPCLAP: Unified Text + Image + Audio Embeddings

CLIPCLAP is a unified multimodal embedding model that maps text, images, and audio into a shared 512-dimensional vector space. It combines OpenAI's CLIP (text + image) with LAION's CLAP (audio) through a trained linear projection.

Built by antflydb for use with Termite, a standalone ML inference service for embeddings, chunking, and reranking.

Architecture

Text  ──→ CLIP text encoder  ──→ text_projection  ──→ 512-dim (CLIP space)
Image ──→ CLIP visual encoder ──→ visual_projection ──→ 512-dim (CLIP space)
Audio ──→ CLAP audio encoder  ──→ audio_projection  ──→ 512-dim (CLIP space)
  • Text & Image: Standard CLIP ViT-B/32 encoders and projections (unchanged from openai/clip-vit-base-patch32).
  • Audio: CLAP HTSAT audio encoder from laion/larger_clap_music_and_speech. The audio projection combines CLAP's native audio projection (1024→512) with a trained 512→512 linear layer that maps CLAP audio space into CLIP space.

All three modalities produce 512-dimensional L2-normalized embeddings that are directly comparable via cosine similarity.
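
Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product. The snippet below is a minimal, illustrative sketch (not part of the model or Termite); the random vectors are placeholders standing in for real 512-dim embeddings returned by the service.

import numpy as np

# Placeholder vectors standing in for real 512-dim embeddings of a text,
# an image, and an audio clip (e.g. as returned by the /embed endpoint).
rng = np.random.default_rng(0)
text_vec, image_vec, audio_vec = (rng.standard_normal(512).astype(np.float32) for _ in range(3))

def cosine(a, b):
    # The model's embeddings are already unit-length, so the dot product is
    # the cosine similarity; re-normalizing just makes this safe for raw vectors too.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(a @ b)

print(cosine(text_vec, image_vec))   # text  <-> image
print(cosine(text_vec, audio_vec))   # text  <-> audio
print(cosine(image_vec, audio_vec))  # image <-> audio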

Intended Uses

  • Multimodal search (text↔image↔audio)
  • Building unified media indexes with Antfly
  • Cross-modal retrieval (find images from audio queries, audio from text, etc.)
  • Audio-visual content discovery

How to Use with Termite

# Pull and run the model
termite pull clipclap
termite run

# Embed text
curl -X POST http://localhost:8082/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "clipclap",
    "input": [
      {"type": "text", "text": "a cat sitting on a windowsill"},
      {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
      {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}}
    ]
  }'
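
The same request can be made programmatically. The sketch below is a minimal Python equivalent of the curl call above; the response schema is not documented here, so it simply prints whatever JSON the service returns.

import requests

# Mirrors the curl example above: same endpoint, same payload.
payload = {
    "model": "clipclap",
    "input": [
        {"type": "text", "text": "a cat sitting on a windowsill"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        {"type": "audio_url", "audio_url": {"url": "https://example.com/cat-purring.wav"}},
    ],
}

resp = requests.post("http://localhost:8082/embed", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # response structure is not specified here, so just inspect it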

Training Details

Audio Projection

The audio projection layer bridges CLAP and CLIP embedding spaces. Training procedure:

  1. Load audio-caption pairs from OpenSound/AudioCaps
  2. Encode audio through CLAP: audio encoder β†’ audio_projection β†’ L2 normalize
  3. Encode captions through CLIP: text encoder β†’ text_projection β†’ L2 normalize
  4. Train a 512→512 linear projection (CLAP audio → CLIP text) using CLIP-style contrastive loss (InfoNCE)

The contrastive loss pulls matching audio-caption pairs together and pushes non-matching pairs apart within each batch, so the projected audio embeddings stay discriminative by content.
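
For concreteness, here is a minimal PyTorch sketch of one such training step. It is illustrative only: it assumes pre-computed, L2-normalized CLAP audio and CLIP text embedding batches, and the names and structure are not taken from the actual training code.

import torch
import torch.nn.functional as F

proj = torch.nn.Linear(512, 512)                      # trained CLAP-audio -> CLIP-text projection
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)
temperature = 0.07

def training_step(audio_emb, text_emb):
    """audio_emb, text_emb: (batch, 512) L2-normalized CLAP audio / CLIP text embeddings."""
    a = F.normalize(proj(audio_emb), dim=-1)          # project audio into CLIP space, re-normalize
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs sit on the diagonal
    # Symmetric InfoNCE: audio->text and text->audio cross-entropy, averaged
    loss = (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example with random placeholder batches (the real training uses AudioCaps pairs)
audio_batch = F.normalize(torch.randn(256, 512), dim=-1)
text_batch = F.normalize(torch.randn(256, 512), dim=-1)
print(training_step(audio_batch, text_batch))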

Hyperparameters

Parameter          Value
Training dataset   OpenSound/AudioCaps
Samples            5000 audio-caption pairs
Epochs             20
Batch size         256
Learning rate      1e-3
Optimizer          Adam
Loss               Symmetric InfoNCE (temperature = 0.07)
Train/val split    90/10

Source Models

  • Text & image: openai/clip-vit-base-patch32 (CLIP ViT-B/32 encoders and projections)
  • Audio: laion/larger_clap_music_and_speech (CLAP HTSAT encoder and native audio projection)

ONNX Files

File                    Description                                      Size
text_model.onnx         CLIP text encoder                                ~254 MB
visual_model.onnx       CLIP visual encoder                              ~330 MB
text_projection.onnx    CLIP text projection (512→512)                   ~4 KB
visual_projection.onnx  CLIP visual projection (768→512)                 ~6 KB
audio_model.onnx        CLAP HTSAT audio encoder                         ~590 MB
audio_projection.onnx   Combined CLAP→CLIP projection (1024→512)         ~8 KB

Additional files: clip_config.json, tokenizer.json, preprocessor_config.json, projection_training_metadata.json.
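
The input and output tensor names of these graphs are not listed here, so rather than assume them, the sketch below (using the onnxruntime Python package) loads each file and prints its declared inputs and outputs before you wire the encoders to their projections.

import onnxruntime as ort

# Inspect the exported graphs; file names come from the table above,
# tensor names and shapes are whatever the export actually declares.
for path in [
    "text_model.onnx", "visual_model.onnx", "audio_model.onnx",
    "text_projection.onnx", "visual_projection.onnx", "audio_projection.onnx",
]:
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    print(path)
    for i in sess.get_inputs():
        print("  input ", i.name, i.shape, i.type)
    for o in sess.get_outputs():
        print("  output", o.name, o.shape, o.type)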

Limitations

  • Audio duration: Audio is truncated to ~10 seconds (inherited from CLAP)
  • Language: Primarily English text support
  • Audio-visual alignment: The projection is trained via caption similarity (audio↔text↔image), not direct audio-image pairs. Audio-to-image retrieval may be less precise than text-to-image.
  • CLIP limitations: Inherits CLIP's weaknesses in fine-grained visual classification, object counting, and abstract concepts
  • Training data: The audio projection was trained on AudioCaps, which covers common environmental sounds; the model may underperform on niche audio domains

Citation

If you use CLIPCLAP, please cite the underlying models:

@inproceedings{radford2021clip,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and others},
  booktitle={ICML},
  year={2021}
}

@inproceedings{wu2023clap,
  title={Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation},
  author={Wu, Yusong and Chen, Ke and Zhang, Tianyu and others},
  booktitle={ICASSP},
  year={2023}
}