🦉 CodeModernBERT-Owl v1.0: A High-Accuracy Code Search & Code Understanding Model

CodeModernBERT-Owl v1.0 is a model pretrained from scratch for code search and code understanding tasks.

This model now supports Rust and improves search accuracy in Python, PHP, Java, JavaScript, Go, and Ruby.

🛠️ Key Features

  • Supports long sequences up to 8192 tokens (training used up to 2048)
  • Optimized for code search, code understanding, and code clone detection
  • Achieves top-tier performance across multiple languages
  • Multi-language support: Python, PHP, Java, JavaScript, Go, Ruby, and Rust
  • Mean pooling performs significantly better than CLS token on this model
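
The difference between mean pooling and CLS pooling can be illustrated with a minimal, framework-free sketch. It assumes the model returns one vector per token plus an attention mask marking padding; the function names `mean_pool` and `cls_pool` and the toy values are hypothetical and not part of the model's API. With a real framework, the same arithmetic would be applied to the model's last hidden state.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cls_pool(token_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Alternative for comparison: use only the first ([CLS]) token vector."""
    return list(token_embeddings[0])


# Toy input (assumed values): 4 token positions, hidden size 3.
hidden = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 4.0, 0.0],
    [2.0, 2.0, 4.0],
    [9.0, 9.0, 9.0],  # padding position, excluded by the mask
]
mask = [1, 1, 1, 0]

pooled = mean_pool(hidden, mask)
print(pooled)            # [2.0, 2.0, 2.0]
print(cls_pool(hidden))  # [1.0, 0.0, 2.0]
```

Masking matters here: without it, padding vectors would be averaged into the embedding and short snippets in a padded batch would be distorted.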

📊 Model Parameters

Parameter                      Value
vocab_size                     50,004
hidden_size                    768
num_hidden_layers              12
num_attention_heads            12
intermediate_size              3,072
max_position_embeddings        8,192 (trained with 2,048)
type_vocab_size                2
hidden_dropout_prob            0.1
attention_probs_dropout_prob   0.1
local_attention_window         128
rope_theta                     160,000
local_attention_rope_theta     10,000
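
To make the two RoPE base values more concrete, here is a small sketch using the standard rotary-embedding formula, inv_freq[i] = theta^(-2i/d). It assumes the rotation is applied per attention head with d = hidden_size / num_attention_heads, and that `rope_theta` applies to global attention while `local_attention_rope_theta` applies to local attention, as the parameter names suggest; neither assumption is confirmed by the model card. The helper `rope_inverse_frequencies` is hypothetical.

```python
import math
from typing import List


def rope_inverse_frequencies(theta: float, head_dim: int) -> List[float]:
    """Standard RoPE: inv_freq[i] = theta ** (-2i / head_dim), i = 0 .. head_dim/2 - 1."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


HIDDEN_SIZE = 768
NUM_HEADS = 12
HEAD_DIM = HIDDEN_SIZE // NUM_HEADS  # assumed per-head dimension

global_freqs = rope_inverse_frequencies(160_000.0, HEAD_DIM)  # rope_theta
local_freqs = rope_inverse_frequencies(10_000.0, HEAD_DIM)    # local_attention_rope_theta

# Wavelength, in token positions, of the slowest-rotating dimension pair.
longest_global = 2 * math.pi / global_freqs[-1]
longest_local = 2 * math.pi / local_freqs[-1]

print(f"head_dim={HEAD_DIM}")
print(f"longest wavelength, global: {longest_global:.0f} positions")
print(f"longest wavelength, local:  {longest_local:.0f} positions")
```

A larger base stretches the slowest rotations over more positions, which is the usual motivation for pairing a large theta with long-range attention and a smaller one with a short local window.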

📊 MRR Comparison by Language (Mean Pooling)

  • Experiments were run on the test split of CodeSearchNet.
  • The candidate pool size was fixed at 100, and performance was measured per language.

Language     CodeModernBERT-Owl-1.0   CodeT5+   GraphCodeBERT   CodeBERTa-small   CodeBERT
Python       0.8936                   0.8048    0.3496          0.6123            0.0927
Java         0.8479                   0.7853    0.3299          0.4738            0.0816
JavaScript   0.7711                   0.7111    0.2581          0.3593            0.0692
PHP          0.8056                   0.7893    0.2507          0.4533            0.0623
Ruby         0.7993                   0.7201    0.3186          0.4418            0.0762
Go           0.8426                   0.7577    0.4453          0.5338            0.0856

✅ CodeModernBERT-Owl-1.0 (Mean Pooling) achieves the best MRR across all evaluated languages.
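
For readers unfamiliar with the metric, the following sketch shows how Mean Reciprocal Rank can be computed over fixed-size candidate pools. It is not the evaluation script behind the table: cosine similarity as the scoring function, the tie handling, and all function names are assumptions, and the toy pools hold 3 candidates rather than the 100 used in the experiments.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def reciprocal_rank(query: Vector, candidates: Sequence[Vector],
                    correct_index: int) -> float:
    """1 / rank of the correct candidate when the pool is sorted by similarity."""
    scores = [cosine(query, c) for c in candidates]
    # Assumed tie handling: rank is 1 + number of strictly higher scores.
    rank = 1 + sum(s > scores[correct_index] for s in scores)
    return 1.0 / rank


def mean_reciprocal_rank(queries: Sequence[Vector],
                         pools: Sequence[Sequence[Vector]],
                         correct_indices: Sequence[int]) -> float:
    """Average the reciprocal rank over all queries."""
    rrs: List[float] = [
        reciprocal_rank(q, pool, idx)
        for q, pool, idx in zip(queries, pools, correct_indices)
    ]
    return sum(rrs) / len(rrs)


# Toy embeddings (assumed values): query 1 ranks its match first,
# query 2 ranks its match second.
queries = [[1.0, 0.0], [0.0, 1.0]]
pools = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]],
]
correct = [0, 0]

mrr = mean_reciprocal_rank(queries, pools, correct)
print(mrr)  # 0.75
```

An MRR of 0.8936 therefore means the correct snippet is, on average, very close to the top of its 100-candidate pool.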


📝 Conclusion

  • Highest MRR among the compared models in all six evaluated languages (Python, Java, JavaScript, PHP, Ruby, Go)
  • Rust support successfully added through dataset augmentation
  • Mean pooling is significantly more effective than CLS embedding
  • Further performance improvements may be possible with better datasets

📜 License

📄 Apache-2.0

📧 Contact

📩 For any questions, please contact: 📧 shun0212114@outlook.jp

Model size: 152M params (Safetensors, tensor type F32)