File size: 2,193 Bytes

d7baa19

---
license: apache-2.0
datasets:
- Shuu12121/python-treesitter-filtered-datasetsV2
- Shuu12121/javascript-treesitter-filtered-datasetsV2
- Shuu12121/ruby-treesitter-filtered-datasetsV2
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/rust-treesitter-filtered-datasetsV2
- Shuu12121/php-treesitter-filtered-datasetsV2
- Shuu12121/typescript-treesitter-filtered-datasetsV2
pipeline_tag: fill-mask
tags:
- code
- python
- java
- javascript
- typescript
- go
- ruby
- rust
- php
language:
- en
base_model:
- Shuu12121/CodeModernBERT-Crow-v1-Pre
---
# CodeModernBERT-Crow-v1.1🐦‍⬛

## Model Details

* **Model type**: Bi-encoder architecture based on ModernBERT
* **Architecture**:
  * Hidden size: 768
  * Layers: 12
  * Attention heads: 12
  * Intermediate size: 3,072
  * Max position embeddings: 8,192
  * Local attention window size: 128
  * RoPE positional encoding: θ = 160,000
  * Local RoPE positional encoding: θ = 10,000
* **Sequence length**: up to 2,048 tokens for code and docstring inputs during pretraining

## Pretraining

* **Tokenizer**: Custom BPE tokenizer trained for code and docstring pairs.
* **Data**: Functions and natural language descriptions extracted from GitHub repositories.
* **Masking strategy**: Two-phase pretraining.
  * **Phase 1: Random Masked Language Modeling (MLM)**  
    30% of tokens in code functions are randomly masked and predicted using standard MLM.
  * **Phase 2: Line-level Span Masking**  
    Inspired by SpanBERT, continued pretraining on the same data with span masking at line granularity:  
      1. Convert input tokens back to strings.  
      2. Detect newline tokens with regex and segment inputs by line.  
      3. Exclude whitespace-only tokens from masking.  
      4. Apply padding to align sequence lengths.  
      5. Randomly mask 30% of tokens in each line segment and predict them.

* **Pretraining hyperparameters**:
  * Batch size: 16
  * Gradient accumulation steps: 16
  * Effective batch size: 256
  * Optimizer: AdamW
  * Learning rate: 5e-5
  * Scheduler: Cosine
  * Epochs: 3
  * Precision: Mixed precision (fp16) using `transformers`