|
--- |
|
base_model: sentence-transformers/all-mpnet-base-v2 |
|
library_name: distiller |
|
license: apache-2.0 |
|
license_name: apache-2.0 |
|
license_link: LICENSE |
|
model_name: codemalt |
|
tags: |
|
- code-search |
|
- code-embeddings |
|
- model2vec |
|
- distillation |
|
- sentence-transformers |
|
- static-embeddings |
|
- tokenlearn |
|
datasets: |
|
- code-search-net/code_search_net |
|
- sentence-transformers/codesearchnet |
|
metrics: |
|
- ndcg@10 |
|
- mrr |
|
- recall@5 |
|
language: |
|
- code |
|
pipeline_tag: feature-extraction |
|
--- |
|
|
|
# CodeMalt |
|
|
|
**CodeMalt** is a high-performance, code-specialized static embedding model created through Model2Vec distillation of `sentence-transformers/all-mpnet-base-v2`. This model achieves **73.87% NDCG@10** on CodeSearchNet benchmarks while being **14x smaller** and **15,021x faster** than the original teacher model. |
|
|
|
## Performance Highlights
|
|
|
- **NDCG@10**: 0.7387 (best among the models distilled with this toolkit)
|
- **Mean Reciprocal Rank (MRR)**: 0.7010 |
|
- **Recall@5**: 0.8017 |
|
- **Model Size**: 7.6M parameters (vs 109M for the original)

- **Inference Speed**: 15,021x faster than the teacher model

- **Memory Usage**: <1 GB RAM (vs 8+ GB VRAM for the original)
|
|
|
## CodeSearchNet Performance by Language
|
|
|
| Language | NDCG@10 | MRR | Recall@5 | |
|
|----------|---------|-----|----------| |
|
| **Python** | 0.7899 | 0.7501 | 0.8421 | |
|
| **JavaScript** | 0.7234 | 0.6801 | 0.7895 | |
|
| **Java** | 0.7456 | 0.7089 | 0.8123 | |
|
| **PHP** | 0.7198 | 0.6856 | 0.7834 | |
|
| **Ruby** | 0.7312 | 0.6934 | 0.7912 | |
|
| **Go** | 0.7223 | 0.6876 | 0.7913 | |
|
|
|
## Model Details
|
|
|
- **Teacher Model**: [sentence-transformers/all-mpnet-base-v2](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) |
|
- **Distillation Method**: Model2Vec + Tokenlearn training on CodeSearchNet |
|
- **Architecture**: Static embeddings (no neural network inference required) |
|
- **Embedding Dimensions**: 256 |
|
- **Training Data**: CodeSearchNet code-comment pairs across 6 programming languages |
|
- **Optimization**: PCA dimensionality reduction + SIF weighting + Zipf regularization |
|
- **Vocabulary Size**: 29,528 |
|
- **Parameters**: 7.6M |
|
- **Size**: 14.4MB |
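
Because CodeMalt is a Model2Vec static model, it can be loaded and queried on CPU without any neural network inference. The snippet below is a minimal usage sketch using the `model2vec` library; the repository path is a placeholder, so substitute the published Hugging Face repository id or a local directory containing this model.

```python
import numpy as np
from model2vec import StaticModel

# Placeholder path: replace with the actual repository id or a local copy.
model = StaticModel.from_pretrained("path/to/codemalt")

query = "parse a JSON file and return a dictionary"
snippets = [
    "def load_json(path):\n    import json\n    with open(path) as f:\n        return json.load(f)",
    "def bubble_sort(items):\n    for i in range(len(items)):\n        ...",
]

# encode() returns one 256-dimensional vector per input string.
query_vec = model.encode([query])[0]
snippet_vecs = model.encode(snippets)

# Rank candidate snippets by cosine similarity to the query.
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(query_vec, v) for v in snippet_vecs]
best = int(np.argmax(scores))
print(f"Best match (score={scores[best]:.3f}):\n{snippets[best]}")
```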
|
|
|
|
|
## Distiller: Code-Specialized Embedding Toolkit
|
|
|
**Distiller** is an independent toolkit built upon [Model2Vec](https://github.com/MinishLab/model2vec) and [Tokenlearn](https://github.com/MinishLab/tokenlearn) for creating code-specialized static embeddings. It bundles distillation, training, and evaluation of efficient embedding models optimized for code-related tasks into a single workflow.
|
|
|
> **Note**: This is an independent research project that builds upon the Model2Vec framework. We are not affiliated with the MinishLab Model2Vec team, but acknowledge their excellent foundational work. |
|
|
|
> [!IMPORTANT]
> Check out the comprehensive [REPORT.md](REPORT.md) file generated by this toolkit for detailed performance analysis, model comparisons, and evaluation results across different programming languages.
|
|
|
> [!WARNING]
> **Research Finding**: See [NOTES.md](NOTES.md) for critical analysis showing that C4 fine-tuning significantly degraded performance (-16.8% NDCG@10) compared to simple Model2Vec distillation. **Recommendation**: Use basic distillation without additional training for optimal code embedding performance.
|
|
|
The **distiller** package provides a complete pipeline for: |
|
|
|
1. **Distilling code-specialized embeddings** from large sentence transformer models using Model2Vec |
|
2. **Comprehensive evaluation** on CodeSearchNet benchmarks across 6 programming languages |
|
3. **Performance benchmarking** (speed, memory, model size analysis) |
|
4. **Advanced training** with tokenlearn for enhanced code understanding |
|
5. **Analysis and reporting** with visualizations and comparison charts |
|
6. **Cloud-scale processing** with Beam support for distributed execution |
|
|
|
### Key Benefits |
|
|
|
- **Performance**: Up to 500x faster inference with 50x smaller models

- **Code-Optimized**: Specialized for code search, classification, and similarity tasks

- **Comprehensive**: Full evaluation pipeline with CodeSearchNet metrics

- **Scalable**: Local and cloud execution with Beam support

- **Analytical**: Rich reporting with performance charts and comparisons
|
|
|
## Quick Start
|
|
|
### Installation |
|
|
|
```bash |
|
# Install with all dependencies |
|
pip install "model2vec[train]" torch transformers datasets sentence-transformers
|
pip install typer pydantic plotly matplotlib seaborn |
|
|
|
# Install the distiller package (assuming local development) |
|
pip install -e . |
|
``` |
|
|
|
### Basic Usage |
|
|
|
```bash |
|
# Simple distillation of a teacher model |
|
distiller distill |
|
|
|
# Distillation with advanced CodeSearchNet training |
|
distiller distill --train |
|
|
|
# Evaluate distilled models on CodeSearchNet |
|
distiller evaluate |
|
|
|
# Generate comprehensive analysis report |
|
distiller analyze |
|
``` |
|
|
|
### Python API |
|
|
|
```python |
|
from distiller import distill, evaluate, analyze |
|
|
|
# Distill a specific model |
|
results = distill.run_local_distillation( |
|
teacher_models=["microsoft/codebert-base"], |
|
enable_training=True, # Include CodeSearchNet fine-tuning |
|
pca_dims=256 |
|
) |
|
|
|
# Evaluate on CodeSearchNet |
|
evaluation_results = evaluate.run_evaluation( |
|
models=["."], |
|
max_queries=1000, |
|
languages=["python", "javascript", "java", "go", "php", "ruby"] |
|
) |
|
|
|
# Generate analysis report |
|
analyze.main( |
|
results_dir="./code_model2vec/evaluation_results", |
|
model_name="code_model2vec_distilled_models", |
|
output="ANALYSIS_REPORT.md" |
|
) |
|
``` |
|
|
|
## Features
|
|
|
### Distillation Engine
|
|
|
- **Multiple Teacher Models**: Support for 15+ pre-configured teacher models including: |
|
- Code-specialized: `microsoft/codebert-base`, `BAAI/bge-code-v1`, `Salesforce/SFR-Embedding-Code-2B_R` |
|
- General-purpose: `sentence-transformers/all-mpnet-base-v2`, `BAAI/bge-m3` |
|
- Instruction-tuned: `Alibaba-NLP/gte-Qwen2-1.5B-instruct` |
|
|
|
- **Advanced Training Pipeline**: Optional tokenlearn-based training following the POTION approach: |
|
1. Model2Vec distillation (basic static embeddings; see the sketch at the end of this subsection)
|
2. Feature extraction using sentence transformers |
|
3. Tokenlearn training on CodeSearchNet data |
|
4. Post-training re-regularization (PCA + SIF weighting) |
|
|
|
- **Robust Model Handling**: Automatic compatibility checks and specialized handling for problematic models |
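
To make step 1 of the pipeline concrete, the sketch below shows a plain Model2Vec distillation call using the public `model2vec` API. It is an illustration rather than the distiller pipeline itself, and optional keyword arguments (for example Zipf or SIF weighting controls) vary between `model2vec` versions, so check the installed version's signature.

```python
from model2vec.distill import distill

# Distill a sentence-transformer teacher into a 256-dimensional static model.
m2v_model = distill(
    model_name="sentence-transformers/all-mpnet-base-v2",
    pca_dims=256,
)

# Save in the standard Model2Vec layout (illustrative path following the
# code_model2vec/base/ convention described later in this README).
m2v_model.save_pretrained("code_model2vec/base/code_model2vec_all-mpnet-base-v2")

# Smoke test: embed a code snippet and a natural-language query.
embeddings = m2v_model.encode([
    "def add(a, b): return a + b",
    "function that adds two numbers",
])
print(embeddings.shape)  # (2, 256)
```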
|
|
|
### Evaluation Framework
|
|
|
- **CodeSearchNet Evaluation**: Standard code search benchmarks across 6 programming languages |
|
- **Retrieval Metrics**: NDCG@k, MRR, Recall@k, Mean/Median Rank (see the sketch at the end of this subsection)
|
- **Performance Benchmarking**: |
|
- Model size analysis (disk usage, parameters, memory footprint) |
|
- Inference speed testing (various batch sizes and text lengths) |
|
- CPU vs GPU performance comparison |
|
- Memory scaling analysis |
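
To make the retrieval metrics concrete, here is a small, self-contained calculation of MRR, Recall@5, and NDCG@10 for the usual CodeSearchNet setup in which each query has exactly one relevant snippet. This is an illustrative sketch, not the toolkit's evaluation code.

```python
import math

def retrieval_metrics(ranks: list[int], k: int = 5, ndcg_k: int = 10) -> dict[str, float]:
    """Compute metrics from the 1-based rank of the single relevant snippet per query."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    recall_at_k = sum(r <= k for r in ranks) / n
    # With one relevant item the ideal DCG is 1, so NDCG reduces to the discounted gain.
    ndcg_at_k = sum(1.0 / math.log2(r + 1) if r <= ndcg_k else 0.0 for r in ranks) / n
    return {"mrr": mrr, f"recall@{k}": recall_at_k, f"ndcg@{ndcg_k}": ndcg_at_k}

# Example: ranks of the correct snippet for five queries.
print(retrieval_metrics([1, 2, 1, 7, 15]))
```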
|
|
|
### Analysis & Reporting
|
|
|
- **Comprehensive Reports**: Automated generation of analysis reports with: |
|
- Performance comparison tables |
|
- Language-specific radar charts |
|
- Efficiency analysis (performance vs model size) |
|
- Peer model comparisons |
|
|
|
- **Rich Visualizations**: Plotly and Matplotlib charts (a minimal example follows at the end of this subsection), including:
|
- Multi-model performance heatmaps |
|
- Batch size scaling curves |
|
- Memory usage patterns |
|
- Model efficiency scatter plots |
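
As a flavour of the charts produced by the analysis step, the sketch below renders a multi-model NDCG@10 heatmap with Matplotlib. Model names and scores are dummy values for illustration only; the toolkit generates its charts from the actual evaluation results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Dummy data: rows are models, columns are CodeSearchNet languages.
models = ["model_a", "model_b", "model_c"]
languages = ["python", "javascript", "java", "php", "ruby", "go"]
ndcg = np.array([
    [0.74, 0.69, 0.71, 0.68, 0.70, 0.69],
    [0.62, 0.58, 0.60, 0.57, 0.59, 0.58],
    [0.81, 0.76, 0.78, 0.75, 0.77, 0.76],
])

fig, ax = plt.subplots(figsize=(8, 3))
im = ax.imshow(ndcg, cmap="viridis", vmin=0.5, vmax=0.9)
ax.set_xticks(range(len(languages)))
ax.set_xticklabels(languages)
ax.set_yticks(range(len(models)))
ax.set_yticklabels(models)
for i in range(len(models)):
    for j in range(len(languages)):
        ax.text(j, i, f"{ndcg[i, j]:.2f}", ha="center", va="center", color="white")
fig.colorbar(im, ax=ax, label="NDCG@10")
fig.tight_layout()
fig.savefig("ndcg_heatmap.png", dpi=150)
```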
|
|
|
### Cloud Integration
|
|
|
- **Beam Support**: Distributed execution on Beam cloud infrastructure |
|
- **Volume Management**: Persistent storage with checkpoint support |
|
- **Resource Optimization**: GPU-optimized configurations (A100-40G default) |
|
- **Automatic Syncing**: Seamless model and result synchronization |
|
|
|
## CLI Reference
|
|
|
### `distiller distill` |
|
|
|
Distill teacher models into efficient static embeddings. |
|
|
|
```bash |
|
distiller distill [OPTIONS] |
|
|
|
Options: |
|
--use-beam Use Beam cloud for distillation |
|
--train Enable advanced training (CodeSearchNet fine-tuning) |
|
--teacher-models TEXT Specific teacher models to distill (can be repeated) |
|
--pca-dims INTEGER PCA dimensions (default: 256) |
|
--clear-cache Clear HuggingFace cache for problematic models |
|
``` |
|
|
|
**Examples:** |
|
```bash |
|
# Basic distillation of all default models |
|
distiller distill |
|
|
|
# Train specific models with advanced CodeSearchNet fine-tuning |
|
distiller distill --train --teacher-models microsoft/codebert-base --teacher-models BAAI/bge-code-v1 |
|
|
|
# Use Beam cloud with custom PCA dimensions |
|
distiller distill --use-beam --train --pca-dims 512 |
|
``` |
|
|
|
### `distiller evaluate` |
|
|
|
Evaluate models on CodeSearchNet benchmarks with performance analysis. |
|
|
|
```bash |
|
distiller evaluate [OPTIONS] |
|
|
|
Options: |
|
--use-beam Use Beam cloud for evaluation |
|
--skip-third-party Skip third-party models evaluation |
|
--skip-benchmark Skip performance benchmarking |
|
--max-queries INTEGER Maximum queries per language (default: 100) |
|
``` |
|
|
|
**Examples:** |
|
```bash |
|
# Comprehensive evaluation with benchmarking |
|
distiller evaluate --max-queries 1000 |
|
|
|
# Quick evaluation without performance benchmarks |
|
distiller evaluate --skip-benchmark --max-queries 100 |
|
|
|
# Cloud-based evaluation |
|
distiller evaluate --use-beam --max-queries 500 |
|
``` |
|
|
|
### `distiller analyze` |
|
|
|
Generate comprehensive analysis reports with visualizations. |
|
|
|
```bash |
|
distiller analyze [OPTIONS] |
|
|
|
Options: |
|
--results-dir PATH Results directory (default: code_model2vec/evaluation_results) |
|
--model-name TEXT Model name for analysis (default: gte_qwen2_m2v_code (Ours)) |
|
--output PATH Output report file (default: REPORT.md) |
|
--export-csv PATH Export results to CSV file |
|
``` |
|
|
|
**Examples:** |
|
```bash |
|
# Generate standard analysis report |
|
distiller analyze |
|
|
|
# Custom analysis with CSV export |
|
distiller analyze --model-name "my_distilled_model" --output custom_report.md --export-csv results.csv |
|
|
|
# Analyze specific results directory |
|
distiller analyze --results-dir ./custom_results --output analysis.md |
|
``` |
|
|
|
## Directory Structure
|
|
|
The distiller uses a standardized directory structure: |
|
|
|
``` |
|
code_model2vec/ |
|
├── base/                     # Basic distilled models (Step 1)

│   └── code_model2vec_{teacher_name}/

├── final/                    # Final models (copied from base or after training)

│   └── code_model2vec_{teacher_name}[_fine_tuned]/

├── evaluation_results/       # CodeSearchNet evaluation results

│   └── comprehensive_eval_{model}.json

├── benchmark_results/        # Performance benchmark results

├── analysis_results/         # Analysis reports and charts

│   └── charts/

├── checkpoints/              # Training checkpoints

└── cache/                    # Temporary cache files
|
``` |
|
|
|
## Configuration
|
|
|
### Teacher Models |
|
|
|
Default supported teacher models (configured in `config.py`): |
|
|
|
```python |
|
TEACHER_MODELS = [ |
|
"Alibaba-NLP/gte-Qwen2-1.5B-instruct", # Instruction-tuned |
|
"BAAI/bge-m3", # Multilingual |
|
"jinaai/jina-embeddings-v3", # Modern architecture |
|
"microsoft/codebert-base", # Code-specialized |
|
"microsoft/graphcodebert-base", # Graph-aware code |
|
"sentence-transformers/all-mpnet-base-v2", # General-purpose |
|
# ... and more |
|
] |
|
``` |
|
|
|
### Distillation Parameters |
|
|
|
```python |
|
# Model2Vec distillation settings |
|
optimal_pca_dims: int = 256 |
|
sif_coefficient: float = 1e-3 |
|
apply_zipf: bool = True |
|
|
|
# Tokenlearn training settings (when --train is enabled) |
|
tokenlearn_dataset: str = "sentence-transformers/codesearchnet" |
|
tokenlearn_text_key: str = "code" # Use code field for training |
|
``` |
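
For intuition about what `sif_coefficient` controls: SIF (smooth inverse frequency) weighting down-weights very frequent tokens when pooling token embeddings. The sketch below applies the standard SIF formula w(t) = a / (a + p(t)); it is illustrative and may differ in detail from the weighting implemented inside Model2Vec.

```python
import numpy as np

def sif_weights(token_counts: np.ndarray, sif_coefficient: float = 1e-3) -> np.ndarray:
    """w(t) = a / (a + p(t)): rare tokens keep weights near 1, frequent tokens shrink."""
    probs = token_counts / token_counts.sum()
    return sif_coefficient / (sif_coefficient + probs)

# Example counts: a very common keyword, a common keyword, a rare identifier.
counts = np.array([100_000, 5_000, 10])
print(sif_weights(counts).round(4))  # e.g. [0.001  0.0206 0.9131]
```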
|
|
|
### Evaluation Settings |
|
|
|
```python |
|
# CodeSearchNet evaluation |
|
evaluation_languages = ["python", "java", "javascript", "php", "ruby", "go"] |
|
max_queries_per_language: int = 1000 |
|
evaluation_metrics = ["ndcg@1", "ndcg@5", "ndcg@10", "mrr", "recall@1", "recall@5", "recall@10"] |
|
``` |
|
|
|
## License
|
|
|
This project is licensed under the Apache License 2.0; see the [LICENSE](LICENSE) file for details.
|
|
|
## Acknowledgments
|
|
|
This independent research project builds upon several excellent open-source foundations: |
|
|
|
- [Model2Vec](https://github.com/MinishLab/model2vec) by MinishLab - Core static embedding distillation framework |
|
- [Tokenlearn](https://github.com/MinishLab/tokenlearn) by MinishLab - Advanced token-level training methodology |
|
- [CodeSearchNet](https://github.com/github/CodeSearchNet) by GitHub - Code search benchmark dataset and evaluation framework |
|
- [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) by UKP Lab - Teacher model ecosystem and training framework |
|
- [Beam](https://beam.cloud) - Distributed cloud computing infrastructure |
|
- [Transformers](https://github.com/huggingface/transformers) by Hugging Face - Model loading and tokenization utilities |
|
|
|
**Note**: While this toolkit leverages Model2Vec and Tokenlearn, it is an independent research contribution and is not officially associated with or endorsed by the MinishLab team. |
|
|