---
license: apache-2.0
datasets:
- Shuu12121/python-treesitter-filtered-datasetsV2
- Shuu12121/javascript-treesitter-filtered-datasetsV2
- Shuu12121/ruby-treesitter-filtered-datasetsV2
- Shuu12121/go-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/java-treesitter-dedupe_doc-filtered-dataset
- Shuu12121/rust-treesitter-filtered-datasetsV2
- Shuu12121/php-treesitter-filtered-datasetsV2
- Shuu12121/typescript-treesitter-filtered-datasetsV2
pipeline_tag: fill-mask
tags:
- code
- python
- java
- javascript
- typescript
- go
- ruby
- rust
- php
language:
- en
base_model:
- Shuu12121/CodeModernBERT-Crow-v1-Pre
---
# CodeModernBERT-Crow-v1.1🐦⬛
## Model Details
* **Model type**: Bi-encoder architecture based on ModernBERT
* **Architecture**:
  * Hidden size: 768
  * Layers: 12
  * Attention heads: 12
  * Intermediate size: 3,072
  * Max position embeddings: 8,192
  * Local attention window size: 128
  * Global RoPE positional encoding: θ = 160,000
  * Local RoPE positional encoding: θ = 10,000
* **Sequence length**: up to 2,048 tokens for code and docstring inputs during pretraining
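
As a rough illustration of how these numbers fit together, the sketch below collects them in a plain dictionary and derives the per-head dimension and the RoPE inverse frequencies for the two θ values. The key names follow the ModernBERT `config.json` convention and are an assumption for this model; `head_dim` and `rope_inv_freq` are hypothetical helpers written for this example, not part of any library.

```python
# Architecture values copied from the list above. The key names mirror the
# ModernBERT config.json convention (an assumption for this checkpoint).
ARCH = {
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "max_position_embeddings": 8192,
    "local_attention": 128,          # local attention window size in tokens
    "global_rope_theta": 160_000.0,  # RoPE base for global attention
    "local_rope_theta": 10_000.0,    # RoPE base for local attention
}


def head_dim(cfg: dict) -> int:
    """Per-head dimension; the hidden size must split evenly across heads."""
    hidden, heads = cfg["hidden_size"], cfg["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden // heads


def rope_inv_freq(theta: float, dim: int) -> list[float]:
    """Standard RoPE inverse frequencies: theta ** (-2i / dim) for i in [0, dim / 2)."""
    return [theta ** (-(2 * i) / dim) for i in range(dim // 2)]


d = head_dim(ARCH)
global_freqs = rope_inv_freq(ARCH["global_rope_theta"], d)
local_freqs = rope_inv_freq(ARCH["local_rope_theta"], d)

print("head dim:", d)
print("slowest global frequency:", global_freqs[-1])
print("slowest local frequency:", local_freqs[-1])
```

The larger θ yields lower rotation frequencies, which is the usual reason to pair a large base with long-range (global) attention and a smaller base with a short local window.
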
## Pretraining
* **Tokenizer**: Custom BPE tokenizer trained for code and docstring pairs.
* **Data**: Functions and natural language descriptions extracted from GitHub repositories.
* **Masking strategy**: Two-phase pretraining.
  * **Phase 1: Random Masked Language Modeling (MLM)**
    30% of the tokens in each code function are masked at random and predicted with standard MLM.
  * **Phase 2: Line-level Span Masking**
    Inspired by SpanBERT, pretraining continues on the same data with span masking at line granularity:
    1. Convert input tokens back to strings.
    2. Detect newline tokens with a regex and segment the input by line.
    3. Exclude whitespace-only tokens from masking.
    4. Apply padding to align sequence lengths.
    5. Randomly mask 30% of the tokens in each line segment and predict them.
* **Pretraining hyperparameters**:
  * Batch size: 16
  * Gradient accumulation steps: 16
  * Effective batch size: 256
  * Optimizer: AdamW
  * Learning rate: 5e-5
  * Scheduler: Cosine
  * Epochs: 3
  * Precision: Mixed precision (fp16) using `transformers`
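
The Phase 2 line-level masking steps can be sketched in plain Python as follows. This is a minimal illustration and not the training code: it works on already-decoded token strings, skips the padding step, and rounds the 30% quota per line. The `MASK` string, the newline regex, and the helper names `split_into_lines` and `mask_by_line` are all assumptions made for the example.

```python
import random
import re

MASK = "[MASK]"              # hypothetical mask token string
NEWLINE = re.compile(r"\n")  # assumption: a token closes a line if it contains "\n"


def split_into_lines(tokens):
    """Group token indices into line segments, closing a segment at each newline token."""
    segments, current = [], []
    for i, tok in enumerate(tokens):
        current.append(i)
        if NEWLINE.search(tok):
            segments.append(current)
            current = []
    if current:  # trailing tokens without a final newline
        segments.append(current)
    return segments


def mask_by_line(tokens, ratio=0.30, seed=None):
    """Mask roughly `ratio` of the non-whitespace tokens in every line segment.

    Returns (masked_tokens, labels); labels[i] holds the original token at
    masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for segment in split_into_lines(tokens):
        # Whitespace-only tokens (including newlines) are never masked.
        candidates = [i for i in segment if tokens[i].strip()]
        k = round(ratio * len(candidates))  # assumption: per-line quota is rounded
        for i in rng.sample(candidates, k):
            labels[i] = tokens[i]
            masked[i] = MASK
    return masked, labels


tokens = ["def", " ", "add", "(", "a", ",", "b", ")", ":", "\n",
          "    ", "return", " ", "a", "+", "b", "\n"]
masked, labels = mask_by_line(tokens, seed=0)
print(masked)
```

Because the quota is applied per line, every sufficiently long line contributes masked positions, whereas purely random masking over the whole function can leave some lines untouched.
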