---
license: apache-2.0
datasets:
- code-search-net/code_search_net
pipeline_tag: fill-mask
tags:
- code
metrics:
- code_eval
new_version: Shuu12121/CodeHawks-ModernBERT
---

# CodeMorph-ModernBERT

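## Overview

**CodeMorph-ModernBERT** is a model pre-trained from scratch for code search and code understanding tasks. It was trained on the `code-search-net/code_search_net` dataset to strengthen its semantic understanding of code.
It supports **a maximum sequence length of 2048 tokens** (earlier Microsoft models such as CodeBERT are limited to 512 tokens) and performs especially well on Python code search.

- **Architecture**: ModernBERT-based
- **Purpose**: Code search / Code understanding / Code completion
- **Training Data**: CodeSearchNet (all languages)
- **License**: Apache 2.0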

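## Key Features

- **Long Sequence Support**  
  Processes sequences of up to 2048 tokens, so long code and complex functions fit in a single input.

- **High Code Search Performance**  
  Uses a SentencePiece-based tokenizer built for six programming languages, including Python, and scores well above the earlier models in the comparisons in this card.

- **Purpose-Trained Model**  
  Trained from scratch on the CodeSearchNet dataset to learn code-specific syntax and the relationship between code and comments.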


## Model Parameters

The model is configured with the following parameters.

  | Parameter Name                    | Value              |
  |-----------------------------------|--------------------|
  | **vocab_size**                    | 50000              | 
  | **hidden_size**                   | 768                | 
  | **num_hidden_layers**             | 12                 | 
  | **num_attention_heads**           | 12                 |
  | **intermediate_size**             | 3072               | 
  | **max_position_embeddings**       | 2048               | 
  | **type_vocab_size**               | 2                  |
  | **hidden_dropout_prob**           | 0.1                |
  | **attention_probs_dropout_prob**  | 0.1                | 
  | **local_attention_window**        | 128                | 
  | **rope_theta**                    | 160000             |        
  | **local_attention_rope_theta**    | 10000              |          

## How to Use the Model

The model can be loaded with the Hugging Face Transformers library. (Note: it only works with Transformers `4.48.0` or later.)
- [A simple usage example is available here](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)

### Load the Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
```

### Fill-Mask
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("def add_numbers(a, b): return a + [MASK]"))
```

### Obtain Code Embeddings
```python
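import torch

def get_embedding(text, model, tokenizer, device=None):
    # Fall back to CPU when no GPU is available.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # The model must live on the same device as the inputs.
    model = model.to(device).eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # The encoder is called without token_type_ids, so drop them if present.
    if "token_type_ids" in inputs:
        inputs.pop("token_type_ids")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        # model.model is the base encoder inside the masked-LM wrapper.
        outputs = model.model(**inputs)
    # Use the hidden state of the first token as the sequence embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)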
```

## Dataset

The model was trained on the `code-search-net/code_search_net` dataset. The dataset contains code snippets in multiple programming languages (Python, Java, JavaScript, and others), which makes it well suited to code search tasks.

## Evaluation Results

The model was evaluated on the Python subset of the `code_x_glue_ct_code_to_text` dataset. The main metrics are shown below.
For details of the experiment, see [this notebook](https://colab.research.google.com/gist/Shun0212/474d9092deb60bd10523c3bef427d422/codemorph-modernbert-exp.ipynb?hl=ja).

| Metric | Score |
|-------|-------|
| **MRR** (Mean Reciprocal Rank) | 0.8172 |
| **MAP** (Mean Average Precision) | 0.8172 |
| **R-Precision** | 0.7501 |
| **Recall@10** | 0.9389 |
| **Precision@10** | 0.8143 |
| **NDCG@10** | 0.8445 |
| **F1@10** | 0.8423 |

## Comparison with Other Models

The table below compares CodeMorph-ModernBERT with other major code search models.

| Model | MRR | MAP | R-Precision |
|--------|------|------|-------------|
| **CodeMorph-ModernBERT** | **0.8172** | **0.8172** | **0.7501** |
| microsoft/graphcodebert-base | 0.5482 | 0.5482 | 0.4458 |
| microsoft/codebert-base-mlm | 0.5243 | 0.5243 | 0.4378 |
| Salesforce/codet5p-220m-py | 0.7512 | 0.7512 | 0.6617 |
| Salesforce/codet5-large-ntp-py | 0.7846 | 0.7846 | 0.7067 |
| Shuu12121/CodeMorph-BERT | 0.6851 | 0.6851 | 0.5934 |
| Shuu12121/CodeMorph-BERTv2 | 0.6535 | 0.6535 | 0.5543 |


## Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv Dataset, Test)

The table below summarizes the results of several code search models on the test split of the `google/code_x_glue_tc_nl_code_search_adv` dataset. The candidate pool size is 100 in every case.
The code for this additional experiment is available [here](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorph-ModernBERT-exp-2.ipynb).

| Model                                   | MRR    | MAP    | R-Precision |
| :-------------------------------------- | :----- | :----- | :---------- |
| Shuu12121/CodeMorph-ModernBERT          | 0.6107 | 0.6107 | 0.5038      |
| Salesforce/codet5p-220m-py             | 0.5037 | 0.5037 | 0.3805      |
| Salesforce/codet5-large-ntp-py           | 0.4872 | 0.4872 | 0.3658      |
| microsoft/graphcodebert-base            | 0.3844 | 0.3844 | 0.2764      |
| microsoft/codebert-base-mlm             | 0.3766 | 0.3766 | 0.2683      |
| Shuu12121/CodeMorph-BERTv2              | 0.3142 | 0.3142 | 0.2166      |
| Shuu12121/CodeMorph-BERT                | 0.2978 | 0.2978 | 0.1992      |

On this benchmark, CodeMorph-ModernBERT achieves higher MRR, MAP, and R-Precision than the CodeBERT and CodeT5 models it is compared with.



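## Evaluation Results Across Multiple Languages

CodeMorph-ModernBERT shows high code search performance in several languages. The table below summarizes the main metrics (MRR, MAP, R-Precision) for each language.
This experiment uses a sample of 1,000 items rather than the full data. See [this notebook](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorphModernBERTvsCodeT5p.ipynb) for details.

| Language     | MRR    | MAP    | R-Precision |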
|--------------|--------|--------|-------------|
| **Python**   | 0.8098 | 0.8098 | 0.7520      |
| **Java**     | 0.6437 | 0.6437 | 0.5480      |
| **JavaScript** | 0.5928 | 0.5928 | 0.4880    |
| **PHP**      | 0.7512 | 0.7512 | 0.6710      |
| **Ruby**     | 0.7188 | 0.7188 | 0.6310      |
| **Go**       | 0.5358 | 0.5358 | 0.4320      |

Scores vary by language, but CodeMorph-ModernBERT maintains high search accuracy overall. The results are strongest for Python and PHP.

Salesforce/codet5p-220m-bimodal scores higher than CodeMorph-ModernBERT in all six languages, as shown below.

| Language       | MRR    | MAP    | R-Precision |
|----------------|--------|--------|-------------|
| **Python**     | 0.8322 | 0.8322 | 0.7660      |
| **Java**       | 0.8886 | 0.8886 | 0.8390      |
| **JavaScript** | 0.7611 | 0.7611 | 0.6710      |
| **PHP**        | 0.8985 | 0.8985 | 0.8530      |
| **Ruby**       | 0.7635 | 0.7635 | 0.6740      |
| **Go**         | 0.8127 | 0.8127 | 0.7260      |


On a different dataset, the test split of google/code_x_glue_tc_nl_code_search_adv, CodeMorph-ModernBERT scores higher, as shown below. This suggests that CodeMorph-ModernBERT may have an advantage on harder tasks and in generalization on Python.

| Model                                   | MRR    | MAP    | R-Precision |
| :-------------------------------------- | :----- | :----- | :---------- |
| Shuu12121/CodeMorph-ModernBERT          | 0.6107 | 0.6107 | 0.5038      |
| Salesforce/codet5p-220m-bimodal         | 0.5326 | 0.5326 | 0.4208      |


## License

This model is released under the `Apache-2.0` license.

## Contact

If you have any questions about this model, please send them to the following email address:
shun0212114@outlook.jp

# CodeMorph-ModernBERT-English-ver 

## Overview

**CodeMorph-ModernBERT** is a model pre-trained from scratch for code search and code understanding tasks. It was trained on the `code-search-net/code_search_net` dataset to strengthen its semantic understanding of code.  
It supports **a maximum sequence length of 2048 tokens** (earlier Microsoft models such as CodeBERT are limited to 512 tokens) and performs especially well on Python code search.  

- **Architecture**: ModernBERT-based  
- **Purpose**: Code search / Code understanding / Code completion  
- **Training Data**: CodeSearchNet (all languages)  
- **License**: Apache 2.0  

## Key Features

- **Long Sequence Support**  
  Handles sequences of up to 2048 tokens, making it suitable for long and complex functions.  

- **High Code Search Performance**  
  Uses a SentencePiece-based tokenizer built for six programming languages, including Python, and scores well above the earlier models in the comparisons in this card.  

- **Purpose-Trained Model**  
  Trained from scratch on the CodeSearchNet dataset to learn code-specific syntax and the relationship between code and comments.  

## Model Parameters

The model is designed with the following parameters:

  | Parameter Name                  | Value |
  |----------------------------------|-------|
  | **vocab_size**                   | 50000 |
  | **hidden_size**                   | 768   |
  | **num_hidden_layers**             | 12    |
  | **num_attention_heads**           | 12    |
  | **intermediate_size**             | 3072  |
  | **max_position_embeddings**       | 2048  |
  | **type_vocab_size**               | 2     |
  | **hidden_dropout_prob**           | 0.1   |
  | **attention_probs_dropout_prob**  | 0.1   |
  | **local_attention_window**        | 128   |
  | **rope_theta**                    | 160000 |
  | **local_attention_rope_theta**    | 10000 |
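
A few derived quantities follow directly from the table. The snippet below is a minimal sketch of that arithmetic only; the dictionary is a hand-copied subset of the table (not the model's actual config object), and the reading of `local_attention_window` as a window centred on each token is an assumption.

```python
# Values copied by hand from the parameter table above.
config = {
    "vocab_size": 50000,
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 2048,
    "local_attention_window": 128,
}

# Each attention head works on an equal slice of the hidden state.
head_dim = config["hidden_size"] // config["num_attention_heads"]

# Size of the token embedding matrix: one hidden_size vector per vocabulary entry.
embedding_params = config["vocab_size"] * config["hidden_size"]

# Assumption: the local window is centred on the token,
# so it covers window // 2 neighbours on each side.
neighbours_per_side = config["local_attention_window"] // 2

print(head_dim)             # 64
print(embedding_params)     # 38400000
print(neighbours_per_side)  # 64
```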

## How to Use the Model

The model can be easily loaded using the Hugging Face Transformers library.  
(*Note: Requires Transformers version `4.48.0` or later.*)  

- [Example usage is available here](https://github.com/Shun0212/CodeBERTPretrained/blob/main/UseMyCodeMorph_ModernBERT.ipynb)

### Load the Model
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
```

### Fill-Mask (Code Completion)
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
print(fill_mask("def add_numbers(a, b): return a + [MASK]"))
```

### Obtain Code Embeddings
```python
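import torch

def get_embedding(text, model, tokenizer, device=None):
    # Fall back to CPU when no GPU is available.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    # The model must live on the same device as the inputs.
    model = model.to(device).eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # The encoder is called without token_type_ids, so drop them if present.
    if "token_type_ids" in inputs:
        inputs.pop("token_type_ids")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        # model.model is the base encoder inside the masked-LM wrapper.
        outputs = model.model(**inputs)
    # Use the hidden state of the first token as the sequence embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)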
```
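
For code search, embeddings like the one above are typically compared by similarity. The following is a minimal sketch of ranking candidates by cosine similarity, using plain Python lists and toy 3-dimensional vectors in place of real 768-dimensional embeddings. The helper names `cosine_similarity` and `rank_candidates` are illustrative, and cosine similarity itself is an assumption here rather than the scoring confirmed by the linked notebooks.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for a query embedding and three code embeddings.
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially similar
]
print(rank_candidates(query, candidates))  # [1, 2, 0]
```

To feed real embeddings into this sketch, a tensor returned by `get_embedding` can be converted with `embedding[0].detach().cpu().tolist()`.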

## Dataset

This model has been trained using the `code-search-net/code_search_net` dataset.  
The dataset contains code snippets from multiple programming languages (Python, Java, JavaScript, etc.), making it well-suited for code search tasks.  

## Evaluation Results

The model was evaluated using the `code_x_glue_ct_code_to_text` dataset, specifically the Python subset.  
Key evaluation metrics are shown below.  
For further details, refer to [this link](https://colab.research.google.com/gist/Shun0212/474d9092deb60bd10523c3bef427d422/codemorph-modernbert-exp.ipynb?hl=ja).

| Metric  | Score |
|---------|-------|
| **MRR** (Mean Reciprocal Rank) | 0.8172 |
| **MAP** (Mean Average Precision) | 0.8172 |
| **R-Precision** | 0.7501 |
| **Recall@10** | 0.9389 |
| **Precision@10** | 0.8143 |
| **NDCG@10** | 0.8445 |
| **F1@10** | 0.8423 |
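
As a reminder of what the rank-based metrics measure, here is a minimal sketch of MRR and Recall@k computed from a list of ranks. It assumes each query has exactly one correct code snippet, in which case the average precision of a query reduces to its reciprocal rank. The exact definitions used in the linked notebook may differ, and the function names are illustrative.

```python
def mean_reciprocal_rank(ranks):
    """ranks[i] is the 1-based position of the correct item for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose correct item appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy example: four queries whose correct snippets were ranked 1st, 2nd, 1st and 4th.
ranks = [1, 2, 1, 4]
print(mean_reciprocal_rank(ranks))  # 0.6875
print(recall_at_k(ranks, 2))        # 0.75
```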

## Comparison with Other Models

Below is a comparison of CodeMorph-ModernBERT with other leading code search models.

| Model | MRR | MAP | R-Precision |
|--------|------|------|-------------|
| **CodeMorph-ModernBERT** | **0.8172** | **0.8172** | **0.7501** |
| microsoft/graphcodebert-base | 0.5482 | 0.5482 | 0.4458 |
| microsoft/codebert-base-mlm | 0.5243 | 0.5243 | 0.4378 |
| Salesforce/codet5p-220m-py | 0.7512 | 0.7512 | 0.6617 |
| Salesforce/codet5-large-ntp-py | 0.7846 | 0.7846 | 0.7067 |
| Shuu12121/CodeMorph-BERT | 0.6851 | 0.6851 | 0.5934 |
| Shuu12121/CodeMorph-BERTv2 | 0.6535 | 0.6535 | 0.5543 |

## Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv Dataset Test)

The following table summarizes the evaluation results of various code search models using the `google/code_x_glue_tc_nl_code_search_adv` dataset (Test).  
The candidate pool size for all evaluations was set to 100.  
For additional experiment details, see [this link](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorph-ModernBERT-exp-2.ipynb).

| Model | MRR | MAP | R-Precision |
|-------------------------------------- | :----- | :----- | :---------- |
| Shuu12121/CodeMorph-ModernBERT | 0.6107 | 0.6107 | 0.5038 |
| Salesforce/codet5p-220m-py | 0.5037 | 0.5037 | 0.3805 |
| Salesforce/codet5-large-ntp-py | 0.4872 | 0.4872 | 0.3658 |
| microsoft/graphcodebert-base | 0.3844 | 0.3844 | 0.2764 |
| microsoft/codebert-base-mlm | 0.3766 | 0.3766 | 0.2683 |
| Shuu12121/CodeMorph-BERTv2 | 0.3142 | 0.3142 | 0.2166 |
| Shuu12121/CodeMorph-BERT | 0.2978 | 0.2978 | 0.1992 |

On this benchmark, CodeMorph-ModernBERT achieves higher MRR, MAP, and R-Precision than the CodeBERT and CodeT5 models it is compared with.
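
One way to read "candidate pool size 100" is that each query is ranked against its correct snippet plus 99 distractors. The sketch below illustrates that reading only; how the pool is actually built is defined in the linked notebook, and `rank_in_pool`, `score`, and the random sampling of distractors are all hypothetical.

```python
import random

def rank_in_pool(score, query_id, corpus_ids, pool_size=100, rng=None):
    """1-based rank of the correct item among `pool_size` candidates.

    `score(query_id, candidate_id)` is a hypothetical similarity function;
    the correct candidate is assumed to share its id with the query.
    """
    rng = rng or random.Random(0)
    others = [c for c in corpus_ids if c != query_id]
    distractors = rng.sample(others, pool_size - 1)
    target = score(query_id, query_id)
    # Rank = 1 + number of distractors that score strictly higher.
    return 1 + sum(1 for c in distractors if score(query_id, c) > target)

def toy_score(query_id, candidate_id):
    # Toy scorer: candidates with closer ids count as more similar.
    return -abs(query_id - candidate_id)

print(rank_in_pool(toy_score, 42, list(range(1000))))  # 1
```

Ranks collected this way over all queries can then be averaged into MRR or similar metrics.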

## Evaluation Results Across Multiple Languages

CodeMorph-ModernBERT demonstrates high code search performance across multiple programming languages.  
The table below summarizes key evaluation metrics (MRR, MAP, R-Precision) for each language.  
(*Evaluations were conducted using a sample of 1,000 data points. See [this notebook](https://github.com/Shun0212/CodeBERTPretrained/blob/main/CodeMorphModernBERTvsCodeT5p.ipynb) for details.*)

| Language | MRR | MAP | R-Precision |
|--------------|--------|--------|-------------|
| **Python**   | 0.8098 | 0.8098 | 0.7520 |
| **Java**     | 0.6437 | 0.6437 | 0.5480 |
| **JavaScript** | 0.5928 | 0.5928 | 0.4880 |
| **PHP**      | 0.7512 | 0.7512 | 0.6710 |
| **Ruby**     | 0.7188 | 0.7188 | 0.6310 |
| **Go**       | 0.5358 | 0.5358 | 0.4320 |


Salesforce/codet5p-220m-bimodal scores higher than CodeMorph-ModernBERT in all six languages, as shown below.

| Language       | MRR    | MAP    | R-Precision |
|---------------|--------|--------|-------------|
| **Python**    | 0.8322 | 0.8322 | 0.7660      |
| **Java**      | 0.8886 | 0.8886 | 0.8390      |
| **JavaScript**| 0.7611 | 0.7611 | 0.6710      |
| **PHP**       | 0.8985 | 0.8985 | 0.8530      |
| **Ruby**      | 0.7635 | 0.7635 | 0.6740      |
| **Go**        | 0.8127 | 0.8127 | 0.7260      |
 
However, when evaluated on a different dataset, **google/code_x_glue_tc_nl_code_search_adv (Test)**, CodeMorph-ModernBERT achieved higher scores, as shown below.  
This suggests that CodeMorph-ModernBERT may be more advantageous for **more challenging tasks and generalization in Python**.

| Model                                  | MRR    | MAP    | R-Precision |
| :-------------------------------------- | :----- | :----- | :---------- |
| Shuu12121/CodeMorph-ModernBERT          | 0.6107 | 0.6107 | 0.5038      |
| Salesforce/codet5p-220m-bimodal         | 0.5326 | 0.5326 | 0.4208      |



## License

This model is released under the `Apache-2.0` license.

## Contact Information

If you have any questions about this model, please contact us at the following email address:
shun0212114@outlook.jp