|
--- |
|
base_model: sentence-transformers/all-mpnet-base-v2 |
|
library_name: distiller |
|
license: apache-2.0 |
|
license_name: apache-2.0 |
|
license_link: LICENSE |
|
model_name: codemalt |
|
tags: |
|
- code-search |
|
- code-embeddings |
|
- model2vec |
|
- distillation |
|
- sentence-transformers |
|
- static-embeddings |
|
- tokenlearn |
|
datasets: |
|
- code-search-net/code_search_net |
|
- sentence-transformers/codesearchnet |
|
metrics: |
|
- ndcg@10 |
|
- mrr |
|
- recall@5 |
|
language: |
|
- code |
|
pipeline_tag: feature-extraction |
|
--- |
|
|
|
# CodeMalt |
|
|
|
**CodeMalt** is a high-performance, code-specialized static embedding model created through Model2Vec distillation of `sentence-transformers/all-mpnet-base-v2`. This model achieves **73.87% NDCG@10** on CodeSearchNet benchmarks while being **14x smaller** and **15,021x faster** than the original teacher model. |
|
|
|
## Performance Highlights
|
|
|
- **NDCG@10**: 0.7387 (best among the models distilled with this toolkit)
|
- **Mean Reciprocal Rank (MRR)**: 0.7010 |
|
- **Recall@5**: 0.8017 |
|
- **Model Size**: 7.6M parameters (vs 109M for the original)

- **Inference Speed**: 15,021x faster than the teacher model

- **Memory Usage**: <1 GB RAM (vs 8+ GB VRAM for the original)
|
|
|
## CodeSearchNet Performance by Language
|
|
|
| Language | NDCG@10 | MRR | Recall@5 | |
|
|----------|---------|-----|----------| |
|
| **Python** | 0.7899 | 0.7501 | 0.8421 | |
|
| **JavaScript** | 0.7234 | 0.6801 | 0.7895 | |
|
| **Java** | 0.7456 | 0.7089 | 0.8123 | |
|
| **PHP** | 0.7198 | 0.6856 | 0.7834 | |
|
| **Ruby** | 0.7312 | 0.6934 | 0.7912 | |
|
| **Go** | 0.7223 | 0.6876 | 0.7913 | |
|
|
|
## Model Details
|
|
|
- **Teacher Model**: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) |
|
- **Distillation Method**: Model2Vec + Tokenlearn training on CodeSearchNet |
|
- **Architecture**: Static embeddings (no neural network inference required) |
|
- **Embedding Dimensions**: 256 |
|
- **Training Data**: CodeSearchNet code-comment pairs across 6 programming languages |
|
- **Optimization**: PCA dimensionality reduction + SIF weighting + Zipf regularization |
|
- **Vocabulary Size**: 29,528 |
|
- **Parameters**: 7.6M |
|
- **Size**: 14.4MB |
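
Because CodeMalt is a Model2Vec static model, it can be loaded and queried on CPU without any neural network inference. The snippet below is a minimal usage sketch using the `model2vec` library; the repository path is a placeholder, so substitute the published Hugging Face repository id or a local directory containing this model.

```python
import numpy as np
from model2vec import StaticModel

# Placeholder path: replace with the actual repository id or a local copy.
model = StaticModel.from_pretrained("path/to/codemalt")

query = "parse a JSON file and return a dictionary"
snippets = [
    "def load_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def bubble_sort(items):\n    for i in range(len(items)):\n        ...",
]

# encode() returns one 256-dimensional vector per input string.
query_vec = model.encode([query])[0]
snippet_vecs = model.encode(snippets)

# Rank candidate snippets by cosine similarity to the query.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vec, v) for v in snippet_vecs]
best = int(np.argmax(scores))
print(f"Best match (score={scores[best]:.3f}):\n{snippets[best]}")
```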
|
|
|
|
|
## Distiller: Code-Specialized Embedding Toolkit
|
|
|
**Distiller** is an independent toolkit built upon [Model2Vec](https://github.com/MinishLab/model2vec) and [Tokenlearn](https://github.com/MinishLab/tokenlearn) for creating code-specialized static embeddings. It bundles distillation, training, and evaluation of efficient embedding models optimized for code-related tasks into a single workflow.
|
|
|
> **Note**: This is an independent research project that builds upon the Model2Vec framework. We are not affiliated with the MinishLab Model2Vec team, but acknowledge their excellent foundational work. |
|
|
|
> [!IMPORTANT]
> Check out the comprehensive [REPORT.md](REPORT.md) file generated by this toolkit for detailed performance analysis, model comparisons, and evaluation results across different programming languages.
|
|
|
> [!WARNING]
> **Research Finding**: See [NOTES.md](NOTES.md) for critical analysis showing that C4 fine-tuning significantly degraded performance (-16.8% NDCG@10) compared to simple Model2Vec distillation. **Recommendation**: Use basic distillation without additional training for optimal code embedding performance.
|
|
|
The **distiller** package provides a complete pipeline for: |
|
|
|
1. **Distilling code-specialized embeddings** from large sentence transformer models using Model2Vec |
|
2. **Comprehensive evaluation** on CodeSearchNet benchmarks across 6 programming languages |
|
3. **Performance benchmarking** (speed, memory, model size analysis) |
|
4. **Advanced training** with tokenlearn for enhanced code understanding |
|
5. **Analysis and reporting** with visualizations and comparison charts |
|
6. **Cloud-scale processing** with Beam support for distributed execution |
|
|
|
### Key Benefits |
|
|
|
- **Performance**: Up to 500x faster inference with 50x smaller models

- **Code-Optimized**: Specialized for code search, classification, and similarity tasks

- **Comprehensive**: Full evaluation pipeline with CodeSearchNet metrics

- **Scalable**: Local and cloud execution with Beam support

- **Analytical**: Rich reporting with performance charts and comparisons
|
|
|
## Quick Start
|
|
|
### Installation |
|
|
|
```bash |
|
# Install with all dependencies |
|
pip install "model2vec[train]" torch transformers datasets sentence-transformers
|
pip install typer pydantic plotly matplotlib seaborn |
|
|
|
# Install the distiller package (assuming local development) |
|
pip install -e . |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```bash |
|
# Simple distillation of a teacher model |
|
distiller distill |
|
|
|
# Distillation with advanced CodeSearchNet training |
|
distiller distill --train |
|
|
|
# Evaluate distilled models on CodeSearchNet |
|
distiller evaluate |
|
|
|
# Generate comprehensive analysis report |
|
distiller analyze |
|
``` |
|
|
|
### Python API |
|
|
|
```python |
|
from distiller import distill, evaluate, analyze |
|
|
|
# Distill a specific model |
|
results = distill.run_local_distillation( |
|
teacher_models=["microsoft/codebert-base"], |
|
enable_training=True, # Include CodeSearchNet fine-tuning |
|
pca_dims=256 |
|
) |
|
|
|
# Evaluate on CodeSearchNet |
|
evaluation_results = evaluate.run_evaluation( |
|
models=["."], |
|
max_queries=1000, |
|
languages=["python", "javascript", "java", "go", "php", "ruby"] |
|
) |
|
|
|
# Generate analysis report |
|
analyze.main( |
|
results_dir="./code_model2vec/evaluation_results", |
|
model_name="code_model2vec_distilled_models", |
|
output="ANALYSIS_REPORT.md" |
|
) |
|
``` |
|
|
|
## Features
|
|
|
### Distillation Engine
|
|
|
- **Multiple Teacher Models**: Support for 15+ pre-configured teacher models including: |
|
- Code-specialized: `microsoft/codebert-base`, `BAAI/bge-code-v1`, `Salesforce/SFR-Embedding-Code-2B_R` |
|
- General-purpose: `sentence-transformers/all-mpnet-base-v2`, `BAAI/bge-m3` |
|
- Instruction-tuned: `Alibaba-NLP/gte-Qwen2-1.5B-instruct` |
|
|
|
- **Advanced Training Pipeline**: Optional tokenlearn-based training following the POTION approach: |
|
1. Model2Vec distillation (basic static embeddings; see the sketch at the end of this subsection)
|
2. Feature extraction using sentence transformers |
|
3. Tokenlearn training on CodeSearchNet data |
|
4. Post-training re-regularization (PCA + SIF weighting) |
|
|
|
- **Robust Model Handling**: Automatic compatibility checks and specialized handling for problematic models |
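
To make step 1 of the pipeline concrete, the sketch below shows a plain Model2Vec distillation call using the public `model2vec` API. It is an illustration rather than the distiller pipeline itself, and optional keyword arguments (for example Zipf or SIF weighting controls) vary between `model2vec` versions, so check the installed version's signature.

```python
from model2vec.distill import distill

# Distill a sentence-transformer teacher into a 256-dimensional static model.
m2v_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",
    pca_dims=256,
)

# Save in the standard Model2Vec layout (illustrative path following the
# code_model2vec/base/ convention described later in this README).
m2v_model.save_pretrained("code_model2vec/base/code_model2vec_all-mpnet-base-v2")

# Smoke test: embed a code snippet and a natural-language query.
embeddings = m2v_model.encode([
    "def add(a, b): return a + b",
    "function that adds two numbers",
])
print(embeddings.shape)  # (2, 256)
```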
|
|
|
### Evaluation Framework
|
|
|
- **CodeSearchNet Evaluation**: Standard code search benchmarks across 6 programming languages |
|
- **Retrieval Metrics**: NDCG@k, MRR, Recall@k, Mean/Median Rank (see the sketch at the end of this subsection)
|
- **Performance Benchmarking**: |
|
- Model size analysis (disk usage, parameters, memory footprint) |
|
- Inference speed testing (various batch sizes and text lengths) |
|
- CPU vs GPU performance comparison |
|
- Memory scaling analysis |
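
To make the retrieval metrics concrete, here is a small, self-contained calculation of MRR, Recall@5, and NDCG@10 for the usual CodeSearchNet setup in which each query has exactly one relevant snippet. This is an illustrative sketch, not the toolkit's evaluation code.

```python
import math

def retrieval_metrics(ranks: list[int], k: int = 5, ndcg_k: int = 10) -> dict[str, float]:
    """Compute metrics from the 1-based rank of the single relevant snippet per query."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    recall_at_k = sum(r <= k for r in ranks) / n
    # With one relevant item the ideal DCG is 1, so NDCG reduces to the discounted gain.
    ndcg_at_k = sum(1.0 / math.log2(r + 1) if r <= ndcg_k else 0.0 for r in ranks) / n
    return {"mrr": mrr, f"recall@{k}": recall_at_k, f"ndcg@{ndcg_k}": ndcg_at_k}

# Example: ranks of the correct snippet for five queries.
print(retrieval_metrics([1, 2, 1, 7, 15]))
```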
|
|
|
### Analysis & Reporting
|
|
|
- **Comprehensive Reports**: Automated generation of analysis reports with: |
|
- Performance comparison tables |
|
- Language-specific radar charts |
|
- Efficiency analysis (performance vs model size) |
|
- Peer model comparisons |
|
|
|
- **Rich Visualizations**: Plotly and Matplotlib charts (a minimal example follows at the end of this subsection), including:
|
- Multi-model performance heatmaps |
|
- Batch size scaling curves |
|
- Memory usage patterns |
|
- Model efficiency scatter plots |
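
As a flavour of the charts produced by the analysis step, the sketch below renders a multi-model NDCG@10 heatmap with Matplotlib. Model names and scores are dummy values for illustration only; the toolkit generates its charts from the actual evaluation results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Dummy data: rows are models, columns are CodeSearchNet languages.
models = ["model_a", "model_b", "model_c"]
languages = ["python", "javascript", "java", "php", "ruby", "go"]
ndcg = np.array([
    [0.74, 0.69, 0.71, 0.68, 0.70, 0.69],
    [0.62, 0.58, 0.60, 0.57, 0.59, 0.58],
    [0.81, 0.76, 0.78, 0.75, 0.77, 0.76],
])

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(ndcg, cmap="viridis", vmin=0.5, vmax=0.9)
ax.set_xticks(range(len(languages)))
ax.set_xticklabels(languages)
ax.set_yticks(range(len(models)))
ax.set_yticklabels(models)
for i in range(len(models)):
    for j in range(len(languages)):
        ax.text(j, i, f"{ndcg[i, j]:.2f}", ha="center", va="center", color="white")
fig.colorbar(im, ax=ax, label="NDCG@10")
fig.tight_layout()
fig.savefig("ndcg_heatmap.png", dpi=150)
```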
|
|
|
### Cloud Integration
|
|
|
- **Beam Support**: Distributed execution on Beam cloud infrastructure |
|
- **Volume Management**: Persistent storage with checkpoint support |
|
- **Resource Optimization**: GPU-optimized configurations (A100-40G default) |
|
- **Automatic Syncing**: Seamless model and result synchronization |
|
|
|
## CLI Reference
|
|
|
### `distiller distill` |
|
|
|
Distill teacher models into efficient static embeddings. |
|
|
|
```bash |
|
distiller distill [OPTIONS] |
|
|
|
Options: |
|
--use-beam Use Beam cloud for distillation |
|
--train Enable advanced training (CodeSearchNet fine-tuning) |
|
--teacher-models TEXT Specific teacher models to distill (can be repeated) |
|
--pca-dims INTEGER PCA dimensions (default: 256) |
|
--clear-cache Clear HuggingFace cache for problematic models |
|
``` |
|
|
|
**Examples:** |
|
```bash |
|
# Basic distillation of all default models |
|
distiller distill |
|
|
|
# Train specific models with advanced CodeSearchNet fine-tuning |
|
distiller distill --train --teacher-models microsoft/codebert-base --teacher-models BAAI/bge-code-v1 |
|
|
|
# Use Beam cloud with custom PCA dimensions |
|
distiller distill --use-beam --train --pca-dims 512 |
|
``` |
|
|
|
### `distiller evaluate` |
|
|
|
Evaluate models on CodeSearchNet benchmarks with performance analysis. |
|
|
|
```bash |
|
distiller evaluate [OPTIONS] |
|
|
|
Options: |
|
--use-beam Use Beam cloud for evaluation |
|
--skip-third-party Skip third-party models evaluation |
|
--skip-benchmark Skip performance benchmarking |
|
--max-queries INTEGER Maximum queries per language (default: 100) |
|
``` |
|
|
|
**Examples:** |
|
```bash |
|
# Comprehensive evaluation with benchmarking |
|
distiller evaluate --max-queries 1000 |
|
|
|
# Quick evaluation without performance benchmarks |
|
distiller evaluate --skip-benchmark --max-queries 100 |
|
|
|
# Cloud-based evaluation |
|
distiller evaluate --use-beam --max-queries 500 |
|
``` |
|
|
|
### `distiller analyze` |
|
|
|
Generate comprehensive analysis reports with visualizations. |
|
|
|
```bash |
|
distiller analyze [OPTIONS] |
|
|
|
Options: |
|
--results-dir PATH Results directory (default: code_model2vec/evaluation_results) |
|
--model-name TEXT Model name for analysis (default: gte_qwen2_m2v_code (Ours)) |
|
--output PATH Output report file (default: REPORT.md) |
|
--export-csv PATH Export results to CSV file |
|
``` |
|
|
|
**Examples:** |
|
```bash |
|
# Generate standard analysis report |
|
distiller analyze |
|
|
|
# Custom analysis with CSV export |
|
distiller analyze --model-name "my_distilled_model" --output custom_report.md --export-csv results.csv |
|
|
|
# Analyze specific results directory |
|
distiller analyze --results-dir ./custom_results --output analysis.md |
|
``` |
|
|
|
## Directory Structure
|
|
|
The distiller uses a standardized directory structure: |
|
|
|
``` |
|
code_model2vec/ |
|
├── base/                     # Basic distilled models (Step 1)

│   └── code_model2vec_{teacher_name}/

├── final/                    # Final models (copied from base or after training)

│   └── code_model2vec_{teacher_name}[_fine_tuned]/

├── evaluation_results/       # CodeSearchNet evaluation results

│   └── comprehensive_eval_{model}.json

├── benchmark_results/        # Performance benchmark results

├── analysis_results/         # Analysis reports and charts

│   └── charts/

├── checkpoints/              # Training checkpoints

└── cache/                    # Temporary cache files
|
``` |
|
|
|
## Configuration
|
|
|
### Teacher Models |
|
|
|
Default supported teacher models (configured in `config.py`): |
|
|
|
```python |
|
TEACHER_MODELS = [ |
|
"Alibaba-NLP/gte-Qwen2-1.5B-instruct", # Instruction-tuned |
|
"BAAI/bge-m3", # Multilingual |
|
"jinaai/jina-embeddings-v3", # Modern architecture |
|
"microsoft/codebert-base", # Code-specialized |
|
"microsoft/graphcodebert-base", # Graph-aware code |
|
"sentence-transformers/all-mpnet-base-v2", # General-purpose |
|
# ... and more |
|
] |
|
``` |
|
|
|
### Distillation Parameters |
|
|
|
```python |
|
# Model2Vec distillation settings |
|
optimal_pca_dims: int = 256 |
|
sif_coefficient: float = 1e-3 |
|
apply_zipf: bool = True |
|
|
|
# Tokenlearn training settings (when --train is enabled) |
|
tokenlearn_dataset: str = "sentence-transformers/codesearchnet" |
|
tokenlearn_text_key: str = "code" # Use code field for training |
|
``` |
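
For intuition about what `sif_coefficient` controls: SIF (smooth inverse frequency) weighting down-weights very frequent tokens when pooling token embeddings. The sketch below applies the standard SIF formula w(t) = a / (a + p(t)); it is illustrative and may differ in detail from the weighting implemented inside Model2Vec.

```python
import numpy as np

def sif_weights(token_counts: np.ndarray, sif_coefficient: float = 1e-3) -> np.ndarray:
    """w(t) = a / (a + p(t)): rare tokens keep weights near 1, frequent tokens shrink."""
    probs = token_counts / token_counts.sum()
    return sif_coefficient / (sif_coefficient + probs)

# Example counts: a very common keyword, a common keyword, a rare identifier.
counts = np.array([100_000, 5_000, 10])
print(sif_weights(counts).round(4))  # e.g. [0.001  0.0206 0.9131]
```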
|
|
|
### Evaluation Settings |
|
|
|
```python |
|
# CodeSearchNet evaluation |
|
evaluation_languages = ["python", "java", "javascript", "php", "ruby", "go"] |
|
max_queries_per_language: int = 1000 |
|
evaluation_metrics = ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "recall@1", "recall@5", "recall@10"] |
|
``` |
|
|
|
## License
|
|
|
This project is licensed under the Apache License 2.0; see the [LICENSE](LICENSE) file for details.
|
|
|
## Acknowledgments
|
|
|
This independent research project builds upon several excellent open-source foundations: |
|
|
|
- [Model2Vec](https://github.com/MinishLab/model2vec) by MinishLab - Core static embedding distillation framework |
|
- [Tokenlearn](https://github.com/MinishLab/tokenlearn) by MinishLab - Advanced token-level training methodology |
|
- [CodeSearchNet](https://github.com/github/CodeSearchNet) by GitHub - Code search benchmark dataset and evaluation framework |
|
- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) by UKP Lab - Teacher model ecosystem and training framework |
|
- [Beam](https://beam.cloud) - Distributed cloud computing infrastructure |
|
- [Transformers](https://github.com/huggingface/transformers) by Hugging Face - Model loading and tokenization utilities |
|
|
|
**Note**: While this toolkit leverages Model2Vec and Tokenlearn, it is an independent research contribution and is not officially associated with or endorsed by the MinishLab team. |
|
|