LEAF Embeddings - INT8 Quantized (FAILED v1 - DO NOT USE)
🚨 CRITICAL: This model FAILED quality evaluation - DO NOT USE for production.
⚠️ This is experiment v1 (512 tokens) - kept for research purposes only.
Status: Training completed successfully but MTEB evaluation shows critical quality loss. This serves as a baseline for comparison with the improved v2 model (2048 tokens, better architecture) currently in development.
Model Description
This model is a distilled and quantized version of google/embeddinggemma-300m trained using the LEAF (Layer-wise Early-exit Alignment Framework) methodology. It generates 768-dimensional embeddings optimized for fast CPU inference with INT8 quantization.
What is LEAF?
LEAF is a knowledge distillation framework that:
- Compresses larger embedding models into smaller, faster versions
- Preserves semantic quality through multi-objective training (distillation + alignment + contrastive losses)
- Optimizes for CPU deployment with INT8 post-training quantization
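A minimal sketch of how these three objectives can be combined into a single training loss, assuming student and teacher both produce fixed-size sentence embeddings. The weights shown are the ones used for this run (0.5 / 1.0 / 0.3, listed under Training Details); the exact formulations inside the LEAF implementation may differ, and `leaf_loss` is an illustrative name.

```python
import torch
import torch.nn.functional as F

def leaf_loss(student_emb, teacher_emb,
              w_distill=0.5, w_align=1.0, w_contrast=0.3, temperature=0.05):
    """Illustrative combination of the three LEAF objectives (not the exact implementation)."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)

    # Distillation: pull student embeddings toward the teacher's (MSE on normalized vectors)
    distill = F.mse_loss(s, t)

    # Alignment: penalize angular disagreement between student and teacher
    align = (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()

    # Contrastive (in-batch InfoNCE): each text should be closest to its own teacher embedding
    logits = s @ t.T / temperature
    targets = torch.arange(s.size(0), device=s.device)
    contrast = F.cross_entropy(logits, targets)

    return w_distill * distill + w_align * align + w_contrast * contrast
```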
Architecture
Property | This Model (LEAF) | Base Model (EmbeddingGemma-300m) |
---|---|---|
Dimensions | 768D | 768D (also 512D, 256D, 128D via Matryoshka) |
Parameters | ~75M (6 layers, compressed) | 300M (full architecture) |
Max Tokens | 512 | 2048 |
Quantization | INT8 (441MB) | FP32 (~600MB) |
Inference Speed | 695 texts/s (CPU) | ~50-100 texts/s (CPU) |
Trade-offs:
- ✅ 6-10x faster inference on CPU
- ✅ Smaller model size (441MB vs ~600MB)
- ✅ Lower memory footprint
- ⚠️ Reduced context length (512 vs 2048 tokens)
- ❌ Severe quality loss from distillation (confirmed by the MTEB evaluation below)
Performance
Inference Speed (CPU)
- Throughput: 695 texts/second
- Latency: ~1.4ms per text
- Memory: ~500MB RAM
- Hardware: Standard CPU, no GPU required
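The throughput and latency figures above can be reproduced with a simple timing loop. A minimal sketch, assuming `model` is already loaded as shown in the Usage section below:

```python
import time
import torch

# Assumes `model` is already loaded as in the Usage section below.
texts = ["A short example sentence for benchmarking."] * 1000

start = time.perf_counter()
with torch.no_grad():
    model.encode(texts, device='cpu', normalize=True)
elapsed = time.perf_counter() - start

print(f"{len(texts) / elapsed:.0f} texts/s, {1000 * elapsed / len(texts):.2f} ms/text")
```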
❌ ACTUAL QUALITY (MTEB Evaluation - FAILED)
Evaluation Date: 2025-10-12
Status: ❌ CRITICAL FAILURE - Model does not capture semantic relationships
Dataset | Metric | This Model (v1) | Base Model | Quality Loss |
---|---|---|---|---|
STSBenchmark | Spearman | 0.223 | 0.81 | -72% ❌ |
STS22 English | Spearman | 0.373 | 0.75 | -50% ❌ |
STS22 Average | Spearman | ~0.21 | 0.65 | -68% ❌ |
Cross-lingual | Spearman | -0.14 to 0.12 | 0.55 | Complete loss ❌ |
Detailed STS22 Results by Language:
Language | Spearman | Status |
---|---|---|
🇨🇳 Chinese | 0.499 | 🟡 Moderate (best) |
🇸🇦 Arabic | 0.469 | 🟡 Moderate |
🇮🇹 Italian | 0.435 | 🟡 Moderate |
🇪🇸 Spanish | 0.403 | 🟠 Poor |
🇬🇧 English | 0.373 | 🟠 Poor |
🇫🇷 French | 0.300 | 🔴 Very poor |
🇷🇺 Russian | 0.268 | 🔴 Very poor |
🇹🇷 Turkish | 0.247 | 🔴 Very poor |
🇩🇪 German | 0.163 | ❌ Critical |
🇵🇱 Polish | 0.132 | ❌ Critical |
Cross-lingual pairs (translation tasks): all FAILED (Spearman scores from -0.143 to 0.002)
Conclusion: This model cannot be used for semantic search, similarity tasks, or any production use. The embeddings do not preserve semantic meaning from the base model.
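The scores above come from MTEB-style STS evaluation. A minimal sketch of reproducing them with the `mteb` package; the wrapper class is hypothetical, the model only needs an `encode(list_of_texts)` method, and the task-selection API differs slightly between `mteb` versions:

```python
import torch
from mteb import MTEB

class LeafWrapper:
    """Thin adapter so MTEB can call the model; assumes `model` is loaded as in the Usage section."""
    def __init__(self, model):
        self.model = model

    def encode(self, sentences, **kwargs):
        with torch.no_grad():
            return self.model.encode(sentences, device='cpu', normalize=True)

evaluation = MTEB(tasks=["STSBenchmark", "STS22"])
evaluation.run(LeafWrapper(model), output_folder="mteb_results")
```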
Training Quality Analysis
Training Metrics (from WandB logs):
Metric | Final Value | Status |
---|---|---|
Distillation Loss | 0.976 | ✅ Good - Model learned from teacher |
Alignment Loss | 2.18 | ⚠️ Moderate - Semantic space alignment could improve |
Training Steps | 12,500 (3 epochs) | ✅ Complete |
Training Time | 2h10min | ✅ Efficient |
Eval Loss | NaN | ❌ Bug in evaluation aggregation |
Observations:
- ✅ Training converged smoothly without crashes
- ✅ Distillation loss stable and low (0.976) - good knowledge transfer
- ⚠️ Alignment loss moderate (2.18) - room for improvement
- ❌ Evaluation metrics not computed (NaN) - needs separate MTEB evaluation
- 📊 17 checkpoints saved - can select best performing model
Quality Verdict: ❌ FAILED - Despite low distillation loss, the model failed to learn semantic representations.
🔍 Failure Analysis
What went wrong:
Architecture Too Aggressive ❌
- 6 layers is too shallow to preserve semantics (should be 12+)
- 4x compression (300M→75M) lost critical information
- A hidden size ratio of 0.5x is insufficient
Insufficient Training Data ❌
- Only 50k samples for 100+ languages
- Mostly English data (NLI, STS, MS MARCO)
- No multilingual balance
Misleading Distillation Loss ⚠️
- Low distillation loss (0.976) doesn't guarantee semantic quality
- High alignment loss (2.18) was the real warning sign
- Model learned to mimic output distribution but not semantic meaning
Evaluation Bug ❌
- Eval loss = NaN prevented early detection of failure
- Should have caught quality issues during training
Lessons learned for v2:
- ✅ Monitor alignment loss as primary metric (target: <1.0)
- ✅ Increase student size to 120M params (12 layers)
- ✅ Use 200k+ multilingual samples
- ✅ Implement proper eval during training (MTEB subset every 500 steps)
- ✅ Train with 2048 token context
- ✅ Curriculum learning: 512→1024→2048 tokens progressively
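A minimal sketch of the curriculum idea from the last bullet, assuming the maximum sequence length is raised at fixed fractions of training; the actual v2 schedule is not finalized and `max_length_for_step` is an illustrative name:

```python
def max_length_for_step(step: int, total_steps: int) -> int:
    """Progressively raise the truncation length: 512 -> 1024 -> 2048 tokens."""
    progress = step / total_steps
    if progress < 1 / 3:
        return 512
    if progress < 2 / 3:
        return 1024
    return 2048

# At each training step, pass the result to the tokenizer, e.g.:
# batch = tokenizer(texts, truncation=True,
#                   max_length=max_length_for_step(step, total_steps),
#                   padding=True, return_tensors="pt")
```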
Training Details
Methodology
- Knowledge Distillation from EmbeddingGemma-300m (300M → 75M params)
- LEAF Framework with multi-objective training:
  - Distillation loss (0.5 weight)
  - Alignment loss (1.0 weight)
  - Contrastive loss (0.3 weight)
- INT8 Quantization for CPU optimization
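For the INT8 post-training quantization step, dynamic quantization of Linear layers is the usual route for CPU-only embedding models. A minimal sketch; whether the released checkpoint was produced exactly this way is not documented here:

```python
import torch

# Assumes `student` is the trained FP32 student model.
quantized = torch.quantization.quantize_dynamic(
    student,            # FP32 model to quantize
    {torch.nn.Linear},  # convert only Linear layer weights to INT8
    dtype=torch.qint8,
)

# Saved in the same layout the Usage section loads from (checkpoint['model']).
torch.save({'model': quantized}, "model_quantized.pt")
```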
Training Configuration
- Teacher Model: `google/embeddinggemma-300m`
- Training Data: 50,000 samples from:
  - `sentence-transformers/all-nli`
  - `sentence-transformers/stsb`
  - `ms_marco`
- Validation: 5,000 samples
- Training Steps: 12,500 (3 epochs)
- Hardware: NVIDIA RTX 4050 (6GB VRAM)
- Training Time: ~2h10min
- Final Losses:
  - Distillation: 0.976
  - Alignment: 2.18
Student Architecture
- Layers: 6 (the teacher is substantially deeper)
- Attention Heads: 6
- Hidden Size Ratio: 0.5x
- Compression Ratio: 4x
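For reference, the student hyperparameters above can be summarized in a small config object. A sketch only; the field names are hypothetical and do not come from the actual LEAF code:

```python
from dataclasses import dataclass

@dataclass
class StudentConfig:
    """Student hyperparameters implied by the table above (illustrative names)."""
    num_layers: int = 6
    num_attention_heads: int = 6
    hidden_size_ratio: float = 0.5   # student hidden size = 0.5 x teacher hidden size
    output_dim: int = 768            # projected to match the teacher's embedding dimension

def student_hidden_size(teacher_hidden_size: int, cfg: StudentConfig) -> int:
    """Derive the student width from the teacher's hidden size."""
    return int(teacher_hidden_size * cfg.hidden_size_ratio)
```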
Training Logs
View full training metrics on WandB
Usage
Requirements
```bash
pip install "torch>=2.6.0" "transformers>=4.57.0" huggingface-hub
```

(Version specifiers are quoted so the shell does not treat `>=` as a redirection.)
Basic Usage
```python
import torch
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download model
model_path = hf_hub_download(
    repo_id="tss-deposium/gemma300-leaf-embeddings-test",
    filename="model_quantized.pt"
)

# Load model
checkpoint = torch.load(model_path, map_location='cpu', weights_only=False)
model = checkpoint['model']
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "tss-deposium/gemma300-leaf-embeddings-test"
)
model.set_tokenizer(tokenizer)

# Generate embeddings
texts = ["Hello world", "Machine learning"]
with torch.no_grad():
    embeddings = model.encode(texts, device='cpu', normalize=True)

print(embeddings.shape)  # (2, 768)
```
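Given the failed evaluation, it is worth sanity-checking any output before relying on it. A quick check, continuing from the snippet above (with `normalize=True` the dot product equals cosine similarity):

```python
# Continues from the snippet above: `model` is loaded and the tokenizer is attached.
sentences = [
    "A man is playing a guitar.",
    "Someone plays an acoustic guitar.",      # paraphrase of the first sentence
    "The stock market fell sharply today.",   # unrelated
]

with torch.no_grad():
    emb = model.encode(sentences, device='cpu', normalize=True)

print("paraphrase similarity:", float(emb[0] @ emb[1]))
print("unrelated similarity :", float(emb[0] @ emb[2]))
# For a healthy embedding model the first value should be clearly higher than the second.
```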
API Integration
This model is deployed as part of a FastAPI service:
```python
import requests

response = requests.post(
    "https://your-api-url/api/embed",
    json={"model": "leaf", "input": "Your text here"}
)
embeddings = response.json()["embeddings"]
```
Model Card
Property | Value |
---|---|
Base Model | google/embeddinggemma-300m |
Framework | LEAF (Knowledge Distillation) |
Model Type | Sentence Embeddings |
Dimensions | 768 |
Max Tokens | 512 (reduced from 2048 for efficiency) |
Quantization | INT8 |
PyTorch Version | 2.6+ |
Language | English (base model supports 100+ languages) |
Training Dataset | 50k samples (NLI, STS, MS MARCO) |
Files
- `model_quantized.pt` (441MB) - INT8 quantized model for CPU inference
- `model_fp32.pt` (477MB) - FP32 full precision version (optional)
- `tokenizer.json` (33MB) - Tokenizer vocabulary
- `config.json` - Model configuration
- `tokenizer_config.json` - Tokenizer settings
Limitations
Context Length
- 512 tokens maximum (vs 2048 in base model)
- Longer texts will be truncated
- Consider chunking for documents >512 tokens
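A minimal chunking sketch for the last point, assuming the tokenizer from the Usage section; the overlap size and how chunk embeddings are aggregated are application-level choices:

```python
def chunk_text(text, tokenizer, max_tokens=512, stride=64):
    """Split a long text into overlapping chunks that fit the 512-token window."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunks = []
    step = max_tokens - stride  # consecutive chunks overlap by `stride` tokens
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks

# Embed each chunk separately, then aggregate (e.g. mean-pool) the chunk embeddings per document.
```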
Quality Trade-offs
- Distillation: compression from 300M → 75M parameters caused severe quality loss in this version (see evaluation above)
- Quantization: INT8 quantization may introduce small accuracy loss
- Training Data: 50k samples may not cover all domains
Language Support
- Base model supports 100+ languages
- The distilled model was evaluated on STS22 across 10 languages and degrades severely in all of them; cross-lingual pairs fail completely (see results above)
Experimental Status
- Not production-ready: failed MTEB quality evaluation (see results above)
- MTEB scores: STSBenchmark Spearman 0.223 vs 0.81 for the base model (evaluated 2025-10-12)
- Kept only as a research baseline for comparison with the v2 experiment
❌ DO NOT USE - Model Failed Quality Checks
This model is NOT suitable for ANY production use cases.
❌ NOT suitable for:
- ❌ Semantic search - Scores too low (0.22 Spearman)
- ❌ Document similarity - Does not capture semantic meaning
- ❌ Text clustering - Embeddings not semantically meaningful
- ❌ Information retrieval - Poor correlation with human judgments
- ❌ Duplicate detection - Unreliable similarity scores
- ❌ Any production deployment - Quality insufficient
- ❌ Multilingual tasks - Cross-lingual capabilities destroyed
- ❌ Mission-critical applications - Do not use
✅ Only suitable for:
- ✅ Research purposes - Understanding failure modes in knowledge distillation
- ✅ Baseline comparison - For comparing with improved v2 model
- ✅ Educational purposes - Learning what NOT to do in model compression
Comparison with Base Model
Metric | LEAF v1 (This Model) | EmbeddingGemma-300m | Quality Gap |
---|---|---|---|
Parameters | ~75M | 300M | -75% |
Size (INT8/FP32) | 441MB | ~600MB | -26% ✅ |
Speed (CPU) | 695 texts/s | ~50-100 texts/s | +6-10x ✅ |
Context Length | 512 | 2048 | -75% ❌ |
STSBenchmark | 0.223 | 0.81 | -72% ❌ |
STS22 English | 0.373 | 0.75 | -50% ❌ |
MTEB Score (est.) | ~25 | 61.15 | -59% ❌ |
Latency | ~1.4ms | ~10-20ms | -85% ✅ |
Verdict: Speed improvements do NOT justify the catastrophic quality loss. Use base model instead.
Future Work - Version 2 (In Development)
Based on lessons learned from this failed v1 experiment, we are developing v2 with:
Architecture Improvements
- ✅ 12 layers (vs 6 in v1) - 2x deeper for semantic preservation
- ✅ 120M parameters (vs 75M) - Less aggressive compression (2.5x vs 4x)
- ✅ 2048 token context (vs 512) - Full context length like base model
- ✅ Hidden size ratio 0.75 (vs 0.5) - Better capacity
Training Improvements
- ✅ 200k samples (vs 50k) - 4x more data
- ✅ Multilingual balanced - 100+ languages with proper distribution
- ✅ Curriculum learning - Progressive 512→1024→2048 tokens
- ✅ 10 epochs (vs 3) - More training time
- ✅ Alignment loss priority - Weight 2.5 (vs 1.0) + triplet loss
Evaluation Improvements
- ✅ Eval every 500 steps - Early detection of quality issues
- ✅ MTEB subset validation - STSBenchmark during training
- ✅ Alignment loss < 1.0 target - Primary quality metric
- ✅ Early stopping - On alignment loss, not distillation loss
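A sketch of the last bullet, early stopping keyed to alignment loss rather than distillation loss; class and parameter names are hypothetical:

```python
class AlignmentEarlyStopper:
    """Stop training when the alignment loss stops improving (sketch, not the v2 implementation)."""
    def __init__(self, patience: int = 5, min_delta: float = 0.01):
        self.best = float("inf")
        self.bad_evals = 0
        self.patience = patience
        self.min_delta = min_delta

    def should_stop(self, alignment_loss: float) -> bool:
        if alignment_loss < self.best - self.min_delta:
            self.best = alignment_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

# After each evaluation (e.g. every 500 steps):
#   if stopper.should_stop(current_alignment_loss): stop training and keep the best checkpoint
```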
Quality Targets (v2)
- 🎯 STSBenchmark: 0.70+ Spearman (vs 0.22 in v1)
- 🎯 STS22 Average: 0.50+ Spearman (vs 0.21 in v1)
- 🎯 MTEB Score: 55+ (vs ~25 estimated in v1)
- 🎯 Cross-lingual: 0.30+ (vs -0.14 in v1)
Expected release: After full training and validation (~12-15 hours on RTX 4050)
Citation
```bibtex
@misc{leaf-embeddings-test,
  author = {TSS Deposium},
  title = {LEAF Embeddings INT8 - Distilled from EmbeddingGemma-300m},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/tss-deposium/gemma300-leaf-embeddings-test}},
  note = {Based on google/embeddinggemma-300m}
}

@misc{embeddinggemma,
  author = {Google},
  title = {EmbeddingGemma-300m},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/google/embeddinggemma-300m}}
}
```
Acknowledgments
- Base Model: google/embeddinggemma-300m
- Training Framework: Custom LEAF implementation
- Datasets: Sentence Transformers, MS MARCO
Contact
For questions or issues, please open an issue on the model repository.
Disclaimer: This is an experimental model kept for research purposes only. It failed quality evaluation and must not be used in production; use the base model, or wait for v2.