---
base_model: sentence-transformers/all-mpnet-base-v2
library_name: distiller
license: apache-2.0
license_name: apache-2.0
license_link: LICENSE
model_name: codemalt
tags:
- code-search
- code-embeddings
- model2vec
- distillation
- sentence-transformers
- static-embeddings
- tokenlearn
datasets:
- code-search-net/code_search_net
- sentence-transformers/codesearchnet
metrics:
- ndcg@10
- mrr
- recall@5
language:
- code
pipeline_tag: feature-extraction
---
# CodeMalt
**CodeMalt** is a high-performance, code-specialized static embedding model created through Model2Vec distillation of `sentence-transformers/all-mpnet-base-v2`. This model achieves **73.87% NDCG@10** on CodeSearchNet benchmarks while being **14x smaller** and **15,021x faster** than the original teacher model.
## πŸ† Performance Highlights
- **NDCG@10**: 0.7387 (best among the distilled models evaluated in this project)
- **Mean Reciprocal Rank (MRR)**: 0.7010
- **Recall@5**: 0.8017
- **Model Size**: 7.6M parameters (vs 109M original)
- **Inference Speed**: 15,021x faster than teacher model
- **Memory Usage**: <1GB RAM (vs 8+ GB VRAM for original)
## πŸ“Š CodeSearchNet Performance by Language
| Language | NDCG@10 | MRR | Recall@5 |
|----------|---------|-----|----------|
| **Python** | 0.7899 | 0.7501 | 0.8421 |
| **JavaScript** | 0.7234 | 0.6801 | 0.7895 |
| **Java** | 0.7456 | 0.7089 | 0.8123 |
| **PHP** | 0.7198 | 0.6856 | 0.7834 |
| **Ruby** | 0.7312 | 0.6934 | 0.7912 |
| **Go** | 0.7223 | 0.6876 | 0.7913 |
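The headline metrics above match the unweighted mean of these per-language scores; a quick sanity check for the NDCG@10 column:

```python
# Per-language NDCG@10 values from the table above
per_language_ndcg10 = {
    "python": 0.7899, "javascript": 0.7234, "java": 0.7456,
    "php": 0.7198, "ruby": 0.7312, "go": 0.7223,
}

mean_ndcg10 = sum(per_language_ndcg10.values()) / len(per_language_ndcg10)
print(f"{mean_ndcg10:.4f}")  # 0.7387 -- the headline NDCG@10
```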
## πŸ”§ Model Details
- **Teacher Model**: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- **Distillation Method**: Model2Vec + Tokenlearn training on CodeSearchNet
- **Architecture**: Static embeddings (no neural network inference required)
- **Embedding Dimensions**: 256
- **Training Data**: CodeSearchNet code-comment pairs across 6 programming languages
- **Optimization**: PCA dimensionality reduction + SIF weighting + Zipf regularization
- **Vocabulary Size**: 29,528
- **Parameters**: 7.6M
- **Size**: 14.4MB
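Because CodeMalt is a Model2Vec static embedding model, it can be loaded through the `model2vec` `StaticModel` API with no transformer inference. A minimal usage sketch (the model path is an assumption; substitute the actual Hub ID or local directory):

```python
import numpy as np
from model2vec import StaticModel

# Load the distilled static embeddings (path/ID is illustrative)
model = StaticModel.from_pretrained("codemalt")

# Embed a natural-language query and a candidate code snippet
query = "read a JSON file into a dictionary"
code = "def load_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)"
query_emb, code_emb = model.encode([query, code])

# Cosine similarity as a simple relevance score
score = np.dot(query_emb, code_emb) / (np.linalg.norm(query_emb) * np.linalg.norm(code_emb))
print(f"similarity: {score:.3f}")
```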
## 🎯 Distiller: Code-Specialized Embedding Toolkit
**Distiller** is an independent toolkit built upon [Model2Vec](https://github.com/MinishLab/model2vec) and [Tokenlearn](https://github.com/MinishLab/tokenlearn) for creating code-specialized static embeddings. This package provides a complete pipeline for distilling, training, and evaluating efficient embedding models optimized for code-related tasks.
> **Note**: This is an independent research project that builds upon the Model2Vec framework. We are not affiliated with the MinishLab Model2Vec team, but acknowledge their excellent foundational work.
> [!IMPORTANT]
>Check out the comprehensive [REPORT.md](REPORT.md) file generated by this toolkit for detailed performance analysis, model comparisons, and evaluation results across different programming languages.
> [!WARNING]
>**Research Finding**: See [NOTES.md](NOTES.md) for critical analysis showing that C4 fine-tuning significantly degraded performance (-16.8% NDCG@10) compared to simple Model2Vec distillation. **Recommendation**: Use basic distillation without additional training for optimal code embedding performance.
The **distiller** package covers the full workflow:
1. **Distilling code-specialized embeddings** from large sentence transformer models using Model2Vec
2. **Comprehensive evaluation** on CodeSearchNet benchmarks across 6 programming languages
3. **Performance benchmarking** (speed, memory, model size analysis)
4. **Advanced training** with tokenlearn for enhanced code understanding
5. **Analysis and reporting** with visualizations and comparison charts
6. **Cloud-scale processing** with Beam support for distributed execution
### Key Benefits
- **πŸš€ Performance**: Up to 500x faster inference with 50x smaller models
- **πŸ“Š Code-Optimized**: Specialized for code search, classification, and similarity tasks
- **πŸ”¬ Comprehensive**: Full evaluation pipeline with CodeSearchNet metrics
- **☁️ Scalable**: Local and cloud execution with Beam support
- **πŸ“ˆ Analytical**: Rich reporting with performance charts and comparisons
## πŸš€ Quick Start
### Installation
```bash
# Install with all dependencies
pip install "model2vec[train]" torch transformers datasets sentence-transformers
pip install typer pydantic plotly matplotlib seaborn
# Install the distiller package (assuming local development)
pip install -e .
```
### Basic Usage
```bash
# Simple distillation of a teacher model
distiller distill
# Distillation with advanced CodeSearchNet training
distiller distill --train
# Evaluate distilled models on CodeSearchNet
distiller evaluate
# Generate comprehensive analysis report
distiller analyze
```
### Python API
```python
from distiller import distill, evaluate, analyze

# Distill a specific model
results = distill.run_local_distillation(
    teacher_models=["microsoft/codebert-base"],
    enable_training=True,  # Include CodeSearchNet fine-tuning
    pca_dims=256,
)

# Evaluate on CodeSearchNet
evaluation_results = evaluate.run_evaluation(
    models=["."],
    max_queries=1000,
    languages=["python", "javascript", "java", "go", "php", "ruby"],
)

# Generate analysis report
analyze.main(
    results_dir="./code_model2vec/evaluation_results",
    model_name="code_model2vec_distilled_models",
    output="ANALYSIS_REPORT.md",
)
```
## πŸ“‹ Features
### πŸ”¬ Distillation Engine
- **Multiple Teacher Models**: Support for 15+ pre-configured teacher models including:
- Code-specialized: `microsoft/codebert-base`, `BAAI/bge-code-v1`, `Salesforce/SFR-Embedding-Code-2B_R`
- General-purpose: `sentence-transformers/all-mpnet-base-v2`, `BAAI/bge-m3`
- Instruction-tuned: `Alibaba-NLP/gte-Qwen2-1.5B-instruct`
- **Advanced Training Pipeline**: Optional tokenlearn-based training following the POTION approach (a minimal sketch of step 1 follows this list):
1. Model2Vec distillation (basic static embeddings)
2. Feature extraction using sentence transformers
3. Tokenlearn training on CodeSearchNet data
4. Post-training re-regularization (PCA + SIF weighting)
- **Robust Model Handling**: Automatic compatibility checks and specialized handling for problematic models
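For reference, step 1 (basic Model2Vec distillation) can be reproduced with the upstream `model2vec` API. This is a minimal sketch mirroring the project's defaults, not the exact commands the toolkit runs, and the teacher model and output path are illustrative:

```python
from model2vec.distill import distill

# Distill a teacher model into static embeddings with 256-dimensional PCA
m2v_model = distill(
    model_name="microsoft/codebert-base",
    pca_dims=256,
)

# Save in the toolkit's base-model layout (directory name is illustrative)
m2v_model.save_pretrained("code_model2vec/base/code_model2vec_codebert_base")
```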
### πŸ“Š Evaluation Framework
- **CodeSearchNet Evaluation**: Standard code search benchmarks across 6 programming languages
- **Retrieval Metrics**: NDCG@k, MRR, Recall@k, Mean/Median Rank (a worked example follows this list)
- **Performance Benchmarking**:
- Model size analysis (disk usage, parameters, memory footprint)
- Inference speed testing (various batch sizes and text lengths)
- CPU vs GPU performance comparison
- Memory scaling analysis
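In this kind of code-search evaluation, each query typically has a single correct snippet among the candidates, so the retrieval metrics reduce to simple functions of that snippet's rank. An illustrative sketch, not the toolkit's internal implementation:

```python
import math

def retrieval_metrics(ranked_ids: list[str], relevant_id: str, k: int = 10) -> dict[str, float]:
    """NDCG@k, reciprocal rank, and Recall@k for one query with a single relevant document."""
    rank = ranked_ids.index(relevant_id) + 1 if relevant_id in ranked_ids else None
    return {
        # With one relevant item the ideal DCG is 1, so NDCG@k is 1 / log2(rank + 1) when ranked within k
        f"ndcg@{k}": 1.0 / math.log2(rank + 1) if rank is not None and rank <= k else 0.0,
        "reciprocal_rank": 1.0 / rank if rank is not None else 0.0,  # averaged over queries -> MRR
        f"recall@{k}": 1.0 if rank is not None and rank <= k else 0.0,
    }

# Correct snippet retrieved at rank 3 within the candidate pool
print(retrieval_metrics(["a", "b", "c", "d"], "c", k=5))
# {'ndcg@5': 0.5, 'reciprocal_rank': 0.333..., 'recall@5': 1.0}
```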
### πŸ“ˆ Analysis & Reporting
- **Comprehensive Reports**: Automated generation of analysis reports with:
- Performance comparison tables
- Language-specific radar charts
- Efficiency analysis (performance vs model size)
- Peer model comparisons
- **Rich Visualizations**: Plotly and Matplotlib charts including:
- Multi-model performance heatmaps
- Batch size scaling curves
- Memory usage patterns
- Model efficiency scatter plots
### ☁️ Cloud Integration
- **Beam Support**: Distributed execution on Beam cloud infrastructure
- **Volume Management**: Persistent storage with checkpoint support
- **Resource Optimization**: GPU-optimized configurations (A100-40G default)
- **Automatic Syncing**: Seamless model and result synchronization
## πŸ› οΈ CLI Reference
### `distiller distill`
Distill teacher models into efficient static embeddings.
```bash
distiller distill [OPTIONS]

Options:
  --use-beam              Use Beam cloud for distillation
  --train                 Enable advanced training (CodeSearchNet fine-tuning)
  --teacher-models TEXT   Specific teacher models to distill (can be repeated)
  --pca-dims INTEGER      PCA dimensions (default: 256)
  --clear-cache           Clear HuggingFace cache for problematic models
```
**Examples:**
```bash
# Basic distillation of all default models
distiller distill
# Train specific models with advanced CodeSearchNet fine-tuning
distiller distill --train --teacher-models microsoft/codebert-base --teacher-models BAAI/bge-code-v1
# Use Beam cloud with custom PCA dimensions
distiller distill --use-beam --train --pca-dims 512
```
### `distiller evaluate`
Evaluate models on CodeSearchNet benchmarks with performance analysis.
```bash
distiller evaluate [OPTIONS]

Options:
  --use-beam              Use Beam cloud for evaluation
  --skip-third-party      Skip third-party models evaluation
  --skip-benchmark        Skip performance benchmarking
  --max-queries INTEGER   Maximum queries per language (default: 100)
```
**Examples:**
```bash
# Comprehensive evaluation with benchmarking
distiller evaluate --max-queries 1000
# Quick evaluation without performance benchmarks
distiller evaluate --skip-benchmark --max-queries 100
# Cloud-based evaluation
distiller evaluate --use-beam --max-queries 500
```
### `distiller analyze`
Generate comprehensive analysis reports with visualizations.
```bash
distiller analyze [OPTIONS]

Options:
  --results-dir PATH   Results directory (default: code_model2vec/evaluation_results)
  --model-name TEXT    Model name for analysis (default: gte_qwen2_m2v_code (Ours))
  --output PATH        Output report file (default: REPORT.md)
  --export-csv PATH    Export results to CSV file
```
**Examples:**
```bash
# Generate standard analysis report
distiller analyze
# Custom analysis with CSV export
distiller analyze --model-name "my_distilled_model" --output custom_report.md --export-csv results.csv
# Analyze specific results directory
distiller analyze --results-dir ./custom_results --output analysis.md
```
## πŸ“ Directory Structure
The distiller uses a standardized directory structure:
```
code_model2vec/
β”œβ”€β”€ base/ # Basic distilled models (Step 1)
β”‚ └── code_model2vec_{teacher_name}/
β”œβ”€β”€ final/ # Final models (copied from base or after training)
β”‚ └── code_model2vec_{teacher_name}[_fine_tuned]/
β”œβ”€β”€ evaluation_results/ # CodeSearchNet evaluation results
β”‚ └── comprehensive_eval_{model}.json
β”œβ”€β”€ benchmark_results/ # Performance benchmark results
β”œβ”€β”€ analysis_results/ # Analysis reports and charts
β”‚ └── charts/
β”œβ”€β”€ checkpoints/ # Training checkpoints
└── cache/ # Temporary cache files
```
## βš™οΈ Configuration
### Teacher Models
Default supported teacher models (configured in `config.py`):
```python
TEACHER_MODELS = [
"Alibaba-NLP/gte-Qwen2-1.5B-instruct", # Instruction-tuned
"BAAI/bge-m3", # Multilingual
"jinaai/jina-embeddings-v3", # Modern architecture
"microsoft/codebert-base", # Code-specialized
"microsoft/graphcodebert-base", # Graph-aware code
"sentence-transformers/all-mpnet-base-v2", # General-purpose
# ... and more
]
```
### Distillation Parameters
```python
# Model2Vec distillation settings
optimal_pca_dims: int = 256
sif_coefficient: float = 1e-3
apply_zipf: bool = True
# Tokenlearn training settings (when --train is enabled)
tokenlearn_dataset: str = "sentence-transformers/codesearchnet"
tokenlearn_text_key: str = "code" # Use code field for training
```
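For intuition, SIF weighting damps frequent tokens according to w(t) = a / (a + p(t)), where a is the `sif_coefficient` above and p(t) is the token's corpus probability. A small illustrative sketch; how Model2Vec applies these weights internally may differ:

```python
SIF_COEFFICIENT = 1e-3  # matches the setting above

def sif_weight(token_probability: float, a: float = SIF_COEFFICIENT) -> float:
    """Smooth Inverse Frequency weight: rare tokens keep near-full weight, frequent tokens are damped."""
    return a / (a + token_probability)

print(f"{sif_weight(1e-2):.3f}")  # frequent token -> 0.091 (heavily down-weighted)
print(f"{sif_weight(1e-5):.3f}")  # rare token     -> 0.990 (nearly unchanged)
```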
### Evaluation Settings
```python
# CodeSearchNet evaluation
evaluation_languages = ["python", "java", "javascript", "php", "ruby", "go"]
max_queries_per_language: int = 1000
evaluation_metrics = ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "recall@1", "recall@5", "recall@10"]
```
## πŸ“„ License
This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.
## πŸ™ Acknowledgments
This independent research project builds upon several excellent open-source foundations:
- [Model2Vec](https://github.com/MinishLab/model2vec) by MinishLab - Core static embedding distillation framework
- [Tokenlearn](https://github.com/MinishLab/tokenlearn) by MinishLab - Advanced token-level training methodology
- [CodeSearchNet](https://github.com/github/CodeSearchNet) by GitHub - Code search benchmark dataset and evaluation framework
- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) by UKP Lab - Teacher model ecosystem and training framework
- [Beam](https://beam.cloud) - Distributed cloud computing infrastructure
- [Transformers](https://github.com/huggingface/transformers) by Hugging Face - Model loading and tokenization utilities
**Note**: While this toolkit leverages Model2Vec and Tokenlearn, it is an independent research contribution and is not officially associated with or endorsed by the MinishLab team.