---
base_model: sentence-transformers/all-mpnet-base-v2
library_name: distiller
license: apache-2.0
license_name: apache-2.0
license_link: LICENSE
model_name: codemalt
tags:
- code-search
- code-embeddings
- model2vec
- distillation
- sentence-transformers
- static-embeddings
- tokenlearn
datasets:
- code-search-net/code_search_net
- sentence-transformers/codesearchnet
metrics:
- ndcg@10
- mrr
- recall@5
language:
- code
pipeline_tag: feature-extraction
---

# CodeMalt

**CodeMalt** is a code-specialized static embedding model created through Model2Vec distillation of `sentence-transformers/all-mpnet-base-v2`. It achieves an **NDCG@10 of 0.7387** on the CodeSearchNet benchmark while being **14x smaller** and **15,021x faster** than the teacher model.
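
To illustrate what "static embedding" means in practice, the minimal sketch below embeds a string by looking up one fixed vector per token and averaging them, with no transformer forward pass. The three-word vocabulary, the 2-dimensional table, the whitespace tokenizer, and the `embed` helper are all hypothetical stand-ins for illustration; they are not part of the CodeMalt or distiller API.

```python
def embed(text, vocab, table):
    """Mean-pool the static vectors of all in-vocabulary tokens."""
    dim = len(table[0])
    # Whitespace splitting is a stand-in for a real subword tokenizer.
    rows = [table[vocab[tok]] for tok in text.lower().split() if tok in vocab]
    if not rows:
        # No known tokens: fall back to a zero vector.
        return [0.0] * dim
    # Average each dimension across the looked-up rows.
    return [sum(column) / len(rows) for column in zip(*rows)]


# Hypothetical toy vocabulary and embedding table.
vocab = {"def": 0, "return": 1, "sort": 2}
table = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

vector = embed("def sort", vocab, table)
print(vector)  # [1.0, 0.5]
```

Because encoding reduces to table lookups and an average, the cost per input is dominated by tokenization rather than by matrix multiplications.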

## 🏆 Performance Highlights

- **NDCG@10**: 0.7387 (Best among all distilled models)
- **Mean Reciprocal Rank (MRR)**: 0.7010  
- **Recall@5**: 0.8017
- **Model Size**: 7.6M parameters (vs 109M original)
- **Inference Speed**: 15,021x faster than teacher model
- **Memory Usage**: <1GB RAM (vs 8+ GB VRAM for original)

## 📊 CodeSearchNet Performance by Language

| Language | NDCG@10 | MRR | Recall@5 |
|----------|---------|-----|----------|
| **Python** | 0.7899 | 0.7501 | 0.8421 |
| **JavaScript** | 0.7234 | 0.6801 | 0.7895 |
| **Java** | 0.7456 | 0.7089 | 0.8123 |
| **PHP** | 0.7198 | 0.6856 | 0.7834 |
| **Ruby** | 0.7312 | 0.6934 | 0.7912 |
| **Go** | 0.7223 | 0.6876 | 0.7913 |
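
As a quick consistency check, the sketch below takes the unweighted mean of the six rows above. The results agree with the headline figures under Performance Highlights (0.7387, 0.7010, 0.8017) to within about 0.0001, which suggests, although this card does not state it, that the headline numbers are macro-averages over languages. The `macro_average` helper is a hypothetical name used only for this illustration.

```python
# Per-language scores copied from the table: (NDCG@10, MRR, Recall@5).
per_language = {
    "python": (0.7899, 0.7501, 0.8421),
    "javascript": (0.7234, 0.6801, 0.7895),
    "java": (0.7456, 0.7089, 0.8123),
    "php": (0.7198, 0.6856, 0.7834),
    "ruby": (0.7312, 0.6934, 0.7912),
    "go": (0.7223, 0.6876, 0.7913),
}


def macro_average(rows):
    """Unweighted mean of each metric column across languages."""
    count = len(rows)
    return tuple(sum(column) / count for column in zip(*rows))


ndcg10, mrr, recall5 = macro_average(list(per_language.values()))
print(f"NDCG@10={ndcg10:.4f}  MRR={mrr:.4f}  Recall@5={recall5:.4f}")
```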

## 🔧 Model Details

- **Teacher Model**: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2)
- **Distillation Method**: Model2Vec + Tokenlearn training on CodeSearchNet
- **Architecture**: Static embeddings (no neural network inference required)
- **Embedding Dimensions**: 256
- **Training Data**: CodeSearchNet code-comment pairs across 6 programming languages
- **Optimization**: PCA dimensionality reduction + SIF weighting + Zipf regularization
- **Vocabulary Size**: 29,528
- **Parameters**: 7.6M
- **Size**: 14.4MB
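
The figures above are mutually consistent if the model is a single vocabulary-by-dimension table stored at 2 bytes per value. The 2-byte (float16) storage is an assumption made for this back-of-the-envelope check, not something this card states:

```python
vocab_size = 29_528           # from "Vocabulary Size"
embedding_dims = 256          # from "Embedding Dimensions"
teacher_params = 109_000_000  # from "Model Size" in Performance Highlights

# One vector per vocabulary entry.
student_params = vocab_size * embedding_dims

# Assumption: weights are stored as float16 (2 bytes each).
size_mib = student_params * 2 / 2**20

compression = teacher_params / student_params

print(f"parameters:  {student_params:,}")     # 7,559,168 (about 7.6M)
print(f"size:        {size_mib:.1f} MiB")     # 14.4 MiB
print(f"compression: {compression:.1f}x")     # 14.4x
```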


## 🎯 Distiller: Code-Specialized Embedding Toolkit

**Distiller** is an independent toolkit built upon [Model2Vec](https://github.com/MinishLab/model2vec) and [Tokenlearn](https://github.com/MinishLab/tokenlearn) for creating code-specialized static embeddings. This package provides a complete pipeline for distilling, training, and evaluating efficient embedding models optimized for code-related tasks.

> **Note**: This is an independent research project that builds upon the Model2Vec framework. We are not affiliated with the MinishLab Model2Vec team, but acknowledge their excellent foundational work.

>[!Important]
>Check out the comprehensive [REPORT.md](REPORT.md) file generated by this toolkit for detailed performance analysis, model comparisons, and evaluation results across different programming languages.

>[!Warning]
>**Research Finding**: See [NOTES.md](NOTES.md) for critical analysis showing that C4 fine-tuning significantly degraded performance (-16.8% NDCG@10) compared to simple Model2Vec distillation. **Recommendation**: Use basic distillation without additional training for optimal code embedding performance.

The **distiller** package provides a complete pipeline for:

1. **Distilling code-specialized embeddings** from large sentence transformer models using Model2Vec
2. **Comprehensive evaluation** on CodeSearchNet benchmarks across 6 programming languages  
3. **Performance benchmarking** (speed, memory, model size analysis)
4. **Advanced training** with Tokenlearn for enhanced code understanding
5. **Analysis and reporting** with visualizations and comparison charts
6. **Cloud-scale processing** with Beam support for distributed execution

### Key Benefits

- **🚀 Performance**: Up to 500x faster inference with 50x smaller models
- **📊 Code-Optimized**: Specialized for code search, classification, and similarity tasks
- **🔬 Comprehensive**: Full evaluation pipeline with CodeSearchNet metrics
- **☁️ Scalable**: Local and cloud execution with Beam support
- **📈 Analytical**: Rich reporting with performance charts and comparisons

## 🚀 Quick Start

### Installation

```bash
# Install with all dependencies
# Quote the extras so shells such as zsh do not expand the brackets
pip install "model2vec[train]" torch transformers datasets sentence-transformers
pip install typer pydantic plotly matplotlib seaborn

# Install the distiller package (assuming local development)
pip install -e .
```

### Basic Usage

```bash
# Simple distillation of a teacher model
distiller distill

# Distillation with advanced CodeSearchNet training  
distiller distill --train

# Evaluate distilled models on CodeSearchNet
distiller evaluate

# Generate comprehensive analysis report
distiller analyze
```

### Python API

```python
from distiller import distill, evaluate, analyze

# Distill a specific model
results = distill.run_local_distillation(
    teacher_models=["microsoft/codebert-base"],
    enable_training=True,  # Include CodeSearchNet fine-tuning
    pca_dims=256
)

# Evaluate on CodeSearchNet
evaluation_results = evaluate.run_evaluation(
    models=["."],
    max_queries=1000,
    languages=["python", "javascript", "java", "go", "php", "ruby"]
)

# Generate analysis report
analyze.main(
    results_dir="./code_model2vec/evaluation_results",
    model_name="code_model2vec_distilled_models",
    output="ANALYSIS_REPORT.md"
)
```

## 📋 Features

### 🔬 Distillation Engine

- **Multiple Teacher Models**: Support for 15+ pre-configured teacher models including:
  - Code-specialized: `microsoft/codebert-base`, `BAAI/bge-code-v1`, `Salesforce/SFR-Embedding-Code-2B_R`
  - General-purpose: `sentence-transformers/all-mpnet-base-v2`, `BAAI/bge-m3`
  - Instruction-tuned: `Alibaba-NLP/gte-Qwen2-1.5B-instruct`

- **Advanced Training Pipeline**: Optional Tokenlearn-based training following the POTION approach:
  1. Model2Vec distillation (basic static embeddings)
  2. Feature extraction using sentence transformers
  3. Tokenlearn training on CodeSearchNet data
  4. Post-training re-regularization (PCA + SIF weighting)

- **Robust Model Handling**: Automatic compatibility checks and specialized handling for problematic models

### 📊 Evaluation Framework

- **CodeSearchNet Evaluation**: Standard code search benchmarks across 6 programming languages
- **Retrieval Metrics**: NDCG@k, MRR, Recall@k, Mean/Median Rank
- **Performance Benchmarking**: 
  - Model size analysis (disk usage, parameters, memory footprint)
  - Inference speed testing (various batch sizes and text lengths)
  - CPU vs GPU performance comparison
  - Memory scaling analysis
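
For reference, the retrieval metrics listed above can be sketched as follows. This is a simplified illustration, not the distiller implementation: it assumes each query has exactly one relevant code snippet (so the ideal DCG is 1), and all function names are hypothetical.

```python
import math


def reciprocal_rank(rank):
    """1/rank of the relevant item; 0 if it was not retrieved."""
    return 1.0 / rank if rank is not None else 0.0


def recall_at_k(rank, k):
    """1 if the single relevant item appears in the top k, else 0."""
    return 1.0 if rank is not None and rank <= k else 0.0


def ndcg_at_k(rank, k):
    """NDCG@k with one relevant item: DCG = 1/log2(rank + 1), IDCG = 1."""
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)


def mean(values):
    return sum(values) / len(values)


# 1-based rank of the correct snippet for four toy queries (None = not found).
ranks = [1, 3, None, 2]

mrr = mean([reciprocal_rank(r) for r in ranks])
recall5 = mean([recall_at_k(r, 5) for r in ranks])
ndcg10 = mean([ndcg_at_k(r, 10) for r in ranks])

print(f"MRR={mrr:.4f}  Recall@5={recall5:.4f}  NDCG@10={ndcg10:.4f}")
```

With a single relevant item per query, NDCG@k and MRR differ only in how steeply they discount lower ranks (logarithmic versus reciprocal).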

### 📈 Analysis & Reporting

- **Comprehensive Reports**: Automated generation of analysis reports with:
  - Performance comparison tables
  - Language-specific radar charts  
  - Efficiency analysis (performance vs model size)
  - Peer model comparisons

- **Rich Visualizations**: Plotly and Matplotlib charts including:
  - Multi-model performance heatmaps
  - Batch size scaling curves
  - Memory usage patterns
  - Model efficiency scatter plots

### ☁️ Cloud Integration

- **Beam Support**: Distributed execution on Beam cloud infrastructure
- **Volume Management**: Persistent storage with checkpoint support
- **Resource Optimization**: GPU-optimized configurations (A100-40G default)
- **Automatic Syncing**: Seamless model and result synchronization

## 🛠️ CLI Reference

### `distiller distill`

Distill teacher models into efficient static embeddings.

```bash
distiller distill [OPTIONS]

Options:
  --use-beam              Use Beam cloud for distillation
  --train                 Enable advanced training (CodeSearchNet fine-tuning)
  --teacher-models TEXT   Specific teacher models to distill (can be repeated)
  --pca-dims INTEGER      PCA dimensions (default: 256)
  --clear-cache           Clear HuggingFace cache for problematic models
```

**Examples:**
```bash
# Basic distillation of all default models
distiller distill

# Train specific models with advanced CodeSearchNet fine-tuning
distiller distill --train --teacher-models microsoft/codebert-base --teacher-models BAAI/bge-code-v1

# Use Beam cloud with custom PCA dimensions
distiller distill --use-beam --train --pca-dims 512
```

### `distiller evaluate`

Evaluate models on CodeSearchNet benchmarks with performance analysis.

```bash
distiller evaluate [OPTIONS]

Options:
  --use-beam              Use Beam cloud for evaluation
  --skip-third-party      Skip third-party models evaluation
  --skip-benchmark        Skip performance benchmarking  
  --max-queries INTEGER   Maximum queries per language (default: 100)
```

**Examples:**
```bash
# Comprehensive evaluation with benchmarking
distiller evaluate --max-queries 1000

# Quick evaluation without performance benchmarks
distiller evaluate --skip-benchmark --max-queries 100

# Cloud-based evaluation
distiller evaluate --use-beam --max-queries 500
```

### `distiller analyze`

Generate comprehensive analysis reports with visualizations.

```bash
distiller analyze [OPTIONS]

Options:
  --results-dir PATH      Results directory (default: code_model2vec/evaluation_results)
  --model-name TEXT       Model name for analysis (default: gte_qwen2_m2v_code (Ours))
  --output PATH           Output report file (default: REPORT.md)
  --export-csv PATH       Export results to CSV file
```

**Examples:**
```bash
# Generate standard analysis report
distiller analyze

# Custom analysis with CSV export
distiller analyze --model-name "my_distilled_model" --output custom_report.md --export-csv results.csv

# Analyze specific results directory
distiller analyze --results-dir ./custom_results --output analysis.md
```

## 📁 Directory Structure

The distiller uses a standardized directory structure:

```
code_model2vec/
├── base/                    # Basic distilled models (Step 1)
│   └── code_model2vec_{teacher_name}/
├── final/                   # Final models (copied from base or after training)
│   └── code_model2vec_{teacher_name}[_fine_tuned]/
├── evaluation_results/      # CodeSearchNet evaluation results
│   └── comprehensive_eval_{model}.json
├── benchmark_results/       # Performance benchmark results  
├── analysis_results/        # Analysis reports and charts
│   └── charts/
├── checkpoints/             # Training checkpoints
└── cache/                   # Temporary cache files
```

## ⚙️ Configuration

### Teacher Models

Default supported teacher models (configured in `config.py`):

```python
TEACHER_MODELS = [
    "Alibaba-NLP/gte-Qwen2-1.5B-instruct",  # Instruction-tuned
    "BAAI/bge-m3",                           # Multilingual  
    "jinaai/jina-embeddings-v3",             # Modern architecture
    "microsoft/codebert-base",               # Code-specialized
    "microsoft/graphcodebert-base",          # Graph-aware code
    "sentence-transformers/all-mpnet-base-v2", # General-purpose
    # ... and more
]
```

### Distillation Parameters

```python
# Model2Vec distillation settings
optimal_pca_dims: int = 256
sif_coefficient: float = 1e-3  
apply_zipf: bool = True

# Tokenlearn training settings (when --train is enabled)
tokenlearn_dataset: str = "sentence-transformers/codesearchnet"
tokenlearn_text_key: str = "code"  # Use code field for training
```
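
How `sif_coefficient` and `apply_zipf` are consumed inside distiller is not shown here. As a rough sketch of the general idea, the code below assumes that token probabilities are approximated from vocabulary rank via Zipf's law and then turned into weights with the SIF formula `a / (a + p)`, so that frequent tokens are down-weighted. The function names and the toy 5-token vocabulary are hypothetical.

```python
def zipf_probabilities(vocab_size):
    """Approximate token probabilities from rank alone: p(rank) proportional to 1/rank."""
    inverse_ranks = [1.0 / rank for rank in range(1, vocab_size + 1)]
    total = sum(inverse_ranks)
    return [value / total for value in inverse_ranks]


def sif_weights(probabilities, sif_coefficient=1e-3):
    """SIF weighting: a / (a + p). Frequent tokens receive smaller weights."""
    return [sif_coefficient / (sif_coefficient + p) for p in probabilities]


def reweight(embeddings, weights):
    """Scale each token's embedding row by its weight."""
    return [[w * x for x in row] for row, w in zip(embeddings, weights)]


# Toy vocabulary of 5 tokens, ordered from most to least frequent (assumption).
probabilities = zipf_probabilities(vocab_size=5)
weights = sif_weights(probabilities, sif_coefficient=1e-3)

toy_embeddings = [[1.0, 1.0] for _ in range(5)]
weighted = reweight(toy_embeddings, weights)

for rank, weight in enumerate(weights, start=1):
    print(f"rank {rank}: weight {weight:.5f}")
```

A larger `sif_coefficient` flattens the weights toward 1, while a smaller one suppresses frequent tokens more strongly.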

### Evaluation Settings

```python
# CodeSearchNet evaluation
evaluation_languages = ["python", "java", "javascript", "php", "ruby", "go"]
max_queries_per_language: int = 1000
evaluation_metrics = ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "recall@1", "recall@5", "recall@10"]
```

## 📄 License

This project is licensed under the Apache 2.0 License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

This independent research project builds upon several excellent open-source foundations:

- [Model2Vec](https://github.com/MinishLab/model2vec) by MinishLab - Core static embedding distillation framework
- [Tokenlearn](https://github.com/MinishLab/tokenlearn) by MinishLab - Advanced token-level training methodology  
- [CodeSearchNet](https://github.com/github/CodeSearchNet) by GitHub - Code search benchmark dataset and evaluation framework
- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) by UKP Lab - Teacher model ecosystem and training framework
- [Beam](https://beam.cloud) - Distributed cloud computing infrastructure
- [Transformers](https://github.com/huggingface/transformers) by Hugging Face - Model loading and tokenization utilities

**Note**: While this toolkit leverages Model2Vec and Tokenlearn, it is an independent research contribution and is not officially associated with or endorsed by the MinishLab team.