|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Shuu12121/python-treesitter-filtered-datasetsV2 |
|
- Shuu12121/javascript-treesitter-filtered-datasetsV2 |
|
- Shuu12121/ruby-treesitter-filtered-datasetsV2 |
|
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset |
|
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset |
|
- Shuu12121/rust-treesitter-filtered-datasetsV2 |
|
- Shuu12121/php-treesitter-filtered-datasetsV2 |
|
- Shuu12121/typescript-treesitter-filtered-datasetsV2 |
|
pipeline_tag: fill-mask |
|
tags: |
|
- code |
|
- python |
|
- java |
|
- javascript |
|
- typescript |
|
- go |
|
- ruby |
|
- rust |
|
- php |
|
language: |
|
- en |
|
base_model: |
|
- Shuu12121/CodeModernBERT-Crow-v1-Pre |
|
--- |
|
# CodeModernBERT-Crow-v1.1🐦⬛ |
|
|
|
## Model Details |
|
|
|
* **Model type**: ModernBERT-based encoder (bi-encoder backbone)
|
* **Architecture**: |
|
* Hidden size: 768 |
|
* Layers: 12 |
|
* Attention heads: 12 |
|
* Intermediate size: 3,072 |
|
* Max position embeddings: 8,192 |
|
* Local attention window size: 128 |
|
* Global RoPE positional encoding: θ = 160,000

* Local RoPE positional encoding (local attention layers): θ = 10,000
|
* **Sequence length**: up to 2,048 tokens for code and docstring inputs during pretraining |
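
A minimal sketch of loading the checkpoint with 🤗 Transformers and checking the configuration values listed above. The repository id `Shuu12121/CodeModernBERT-Crow-v1.1` is assumed from the model name, and a recent `transformers` release with native ModernBERT support is required.

```python
from transformers import AutoConfig, AutoModelForMaskedLM, AutoTokenizer

repo_id = "Shuu12121/CodeModernBERT-Crow-v1.1"  # assumed repository id

# Load only the configuration and compare it with the values listed above.
config = AutoConfig.from_pretrained(repo_id)
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.num_attention_heads)      # 12
print(config.intermediate_size)        # 3072
print(config.max_position_embeddings)  # 8192

# Tokenizer and masked-LM head, matching the fill-mask pipeline tag.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForMaskedLM.from_pretrained(repo_id)
```

Because the checkpoint is tagged `fill-mask`, it can also be exercised directly with `pipeline("fill-mask", model=repo_id)` on code snippets that contain the tokenizer's mask token.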
|
|
|
## Pretraining |
|
|
|
* **Tokenizer**: Custom BPE tokenizer trained on code and docstring pairs.
|
* **Data**: Functions and natural language descriptions extracted from GitHub repositories. |
|
* **Masking strategy**: Two-phase pretraining. |
|
* **Phase 1: Random Masked Language Modeling (MLM)** |
|
30% of tokens in code functions are randomly masked and predicted using standard MLM. |
|
* **Phase 2: Line-level Span Masking** |
|
Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity (a minimal sketch follows the numbered steps below):
|
1. Convert input tokens back to strings. |
|
2. Detect newline tokens with regex and segment inputs by line. |
|
3. Exclude whitespace-only tokens from masking. |
|
4. Apply padding to align sequence lengths. |
|
5. Randomly mask 30% of tokens in each line segment and predict them. |
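
The line-level masking step can be sketched as follows. This is a minimal illustration, assuming a Hugging Face fast tokenizer and the repository id `Shuu12121/CodeModernBERT-Crow-v1.1`; it is not the exact pretraining collator, and batching/padding (step 4) is left to the surrounding data collator.

```python
import random
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeModernBERT-Crow-v1.1")  # assumed repo id


def line_level_mask(input_ids, mask_prob=0.30):
    """Mask 30% of the non-whitespace tokens in each line segment (steps 1-3 and 5)."""
    # Step 1: convert token ids back to strings.
    token_strs = [tokenizer.decode([tid]) for tid in input_ids]

    # Step 2: detect newline tokens with a regex and segment the input by line.
    segments, current = [], []
    for idx, text in enumerate(token_strs):
        current.append(idx)
        if re.search(r"\n", text):
            segments.append(current)
            current = []
    if current:
        segments.append(current)

    masked_ids = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 is ignored by the MLM loss
    for segment in segments:
        # Step 3: whitespace-only and special tokens are never masked.
        candidates = [i for i in segment
                      if token_strs[i].strip() and input_ids[i] not in tokenizer.all_special_ids]
        # Step 5: randomly mask 30% of the remaining tokens in this line segment.
        for i in random.sample(candidates, k=int(len(candidates) * mask_prob)):
            labels[i] = masked_ids[i]          # predict the original token here
            masked_ids[i] = tokenizer.mask_token_id
    return masked_ids, labels
```

In practice the masked ids and labels would be produced batch-wise by a custom data collator, with padding (step 4) applied so that all sequences in a batch share the same length.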
|
|
|
* **Pretraining hyperparameters**: |
|
* Batch size: 16 |
|
* Gradient accumulation steps: 16 |
|
* Effective batch size: 256 |
|
* Optimizer: AdamW |
|
* Learning rate: 5e-5 |
|
* Scheduler: Cosine |
|
* Epochs: 3 |
|
* Precision: Mixed precision (fp16) using `transformers` |
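
A minimal sketch of how these hyperparameters map onto the standard `transformers` `Trainer` API; dataset, tokenizer, and data-collator wiring are omitted, and the output directory name is hypothetical.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="codemodernbert-crow-v1.1-pretraining",  # hypothetical output path
    per_device_train_batch_size=16,   # batch size: 16
    gradient_accumulation_steps=16,   # 16 x 16 = effective batch size of 256
    learning_rate=5e-5,               # AdamW is the Trainer's default optimizer
    lr_scheduler_type="cosine",       # cosine learning-rate schedule
    num_train_epochs=3,
    fp16=True,                        # mixed-precision (fp16) training
)
```

The masked inputs from the two pretraining phases would then be passed to a `Trainer` together with the corresponding labels.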