ModernBERTić-base
A modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 149M parameters, native 8192-token context, FlashAttention 2.
For best downstream task performance, use the large variant (395M, SOTA on SuperGLUE-SR). This base model is intended for fast inference, retrieval encoders where latency matters, and as a starting point for further pretraining or domain adaptation.
TL;DR
| | |
|---|---|
| Architecture | ModernBERT-base (22 layers, 768 hidden, 12 heads) |
| Parameters | 149M |
| Context length | 8192 tokens (RoPE base 160K) |
| Attention | Sliding window 128 + global every 3rd layer, FlashAttention 2 |
| Tokenizer | BPE, 50,304 vocab, Latin-only, cased (shared with galton-modernbertic-large) |
| Pretraining tokens | 60B BCMS tokens, 22 sources |
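For orientation, here is how the table above maps onto the upstream ModernBERT configuration class in transformers. This is a sketch, not the released config.json, which is the authoritative source and also fixes values not listed here (intermediate size, dropout, etc.):

```python
from transformers import ModernBertConfig

# Sketch only: reconstructs the TL;DR numbers as a ModernBertConfig.
config = ModernBertConfig(
    vocab_size=50_304,              # BPE vocab shared with the large variant
    hidden_size=768,
    num_hidden_layers=22,
    num_attention_heads=12,
    max_position_embeddings=8_192,  # native context length
    global_attn_every_n_layers=3,   # global attention on every 3rd layer
    local_attention=128,            # sliding-window size on the remaining layers
    global_rope_theta=160_000.0,    # RoPE base for the global layers
)
```

In practice, `AutoConfig.from_pretrained("permitt/galton-modernbertic-base")` gives you the exact released configuration.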
Honest performance note
On SuperGLUE-SR (BalkanBench v1.0), ModernBERTić-base scores a 69.73 average, below BERTić's 71.46 despite BERTić's smaller parameter count. The story is consistent with what the literature predicts:
Masked language modeling at 30% masking gives a training signal on 30% of tokens. ELECTRA-style replaced-token detection (BERTić) gives a signal on 100% of tokens. At small capacities, that supervision deficit is not absorbed by architectural improvements. At larger capacities (see galton-modernbertic-large), it is.
We are publishing the base model anyway because:
- Inference is much faster. ~2-3× the throughput of the large variant on identical hardware, useful for high-volume retrieval encoders, candidate filtering, and embedding workloads.
- It is the right starting point for further pretraining. If you are domain-adapting to legal, medical, or other specialized BCMS text, the base scale is the correct continued-pretraining target.
- It provides an honest baseline for the encoder-comparison community working on BCMS.
If you want SOTA on a downstream classification or reasoning task, use the large variant. If you want a small, fast, modern-architecture encoder for embedding work or further pretraining, this one is for you.
Results: SuperGLUE Serbian edition
Evaluation from BalkanBench v1.0. 5 random seeds per cell, mean reported on the website; standard deviations in the leaderboard UI.
Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: balkanbench.com/leaderboard.
Quickstart
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "permitt/galton-modernbertic-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",  # requires flash-attn; use "sdpa" if it is not installed
    torch_dtype=torch.bfloat16,
).to("cuda")

text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # "Podgorica"
Fine-tuning
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-base",
    num_labels=3,
    attn_implementation="flash_attention_2",
)
# standard HF Trainer flow from here
Recommended hyperparameters:
| Task type | Learning rate | Batch size | Epochs |
|---|---|---|---|
| Sequence classification | 3e-5 to 7e-5 | 16-32 | 3-5 |
| Token classification (NER, POS) | 5e-5 | 32 | 5-10 |
The base model wants larger learning rates than the large variant (the loss landscape at smaller scale is less curved). A hyperparameter grid that works on this base model does not transfer to ModernBERTić-large.
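As a concrete starting point, the sequence-classification row of the table above expressed as `TrainingArguments`. The specific values and the output path are illustrative picks from the recommended ranges, not a tuned recipe:

```python
from transformers import TrainingArguments

# Illustrative picks from the recommended ranges above
# (sequence classification: LR 3e-5 to 7e-5, batch 16-32, 3-5 epochs).
training_args = TrainingArguments(
    output_dir="modernbertic-base-seqcls",  # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    bf16=True,                              # assumes a bf16-capable GPU
    logging_steps=50,
)
# From here, the standard flow: Trainer(model=model, args=training_args,
# train_dataset=..., eval_dataset=...) followed by trainer.train().
```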
Continued pretraining
The base model is a reasonable starting point for domain adaptation. Use the same MLM objective at 30% masking, peak LR ~1e-4 (one decade below pretraining peak), warmup over the first 10% of your continued-pretraining tokens. Expect to need ~1-5B in-domain tokens depending on how distant your target domain is from web/news/encyclopedic text.
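A minimal sketch of that setup with the stock Hugging Face MLM collator. Corpus loading, packing, and the step budget are placeholders you would set from your own in-domain data:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

model_id = "permitt/galton-modernbertic-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Same objective as pretraining: masked language modeling at a 30% masking ratio.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)

# Peak LR one decade below the pretraining peak, warmup over the first 10% of steps.
args = TrainingArguments(
    output_dir="modernbertic-base-domain",  # hypothetical output path
    learning_rate=1e-4,
    warmup_ratio=0.10,
    per_device_train_batch_size=32,         # scale to your hardware
    max_steps=50_000,                       # placeholder: set from your 1-5B-token budget
    bf16=True,                              # assumes a bf16-capable GPU
)
# Wire into Trainer(model=model, args=args, data_collator=collator,
# train_dataset=your_tokenized_domain_corpus) and call trainer.train().
```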
Tokenizer
Identical to the large variant: BPE, 50,304 vocab, Latin-only, cased. Tokens per character: 0.229 on held-out BCMS text, 31% lower than mmBERT's multilingual SentencePiece. Cyrillic input should be transliterated upstream. Cased input is preferred (uncased reduces tokenizer efficiency by ~14%).
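If your input arrives in Cyrillic, transliterate it before tokenization. Serbian Cyrillic maps to Latin almost one-to-one, so a small lookup table is enough; this standalone sketch is ours, not part of the released tokenizer (libraries such as cyrtranslit do the same job):

```python
# Minimal Serbian Cyrillic -> Latin transliteration (standalone sketch).
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# Uppercase variants; digraphs become title case (Љ -> Lj, Џ -> Dž).
CYR_TO_LAT.update({c.upper(): l.capitalize() for c, l in CYR_TO_LAT.items()})

def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin, passing other characters through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)

print(to_latin("Главни град Црне Горе је Подгорица."))
# -> Glavni grad Crne Gore je Podgorica.
```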
Pretraining
Identical recipe to the large variant, scaled down to base configuration:
- Corpus: 60B tokens, 227M documents, 22 BCMS sources, tiered priority, MinHash LSH cross-source deduplication.
- Objective: Masked Language Modeling, 30% masking ratio.
- Optimizer: AdamW, peak LR 8e-4 (higher than the large variant's because the model is smaller), warmup-stable-decay.
- Batch: 4096 sequences global.
- Precision: bfloat16.
- Framework: MosaicML Composer + FlexBERT.
See the large variant card for the detailed pipeline write-up.
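To make the deduplication step concrete, here is a small near-duplicate detection sketch with MinHash LSH via the datasketch library. The shingle size and similarity threshold are illustrative, not the settings used for the actual corpus:

```python
from datasketch import MinHash, MinHashLSH

def doc_minhash(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    """MinHash signature over word 5-grams of one document (illustrative settings)."""
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - shingle + 1, 1)):
        m.update(" ".join(words[i:i + shingle]).encode("utf-8"))
    return m

docs = {
    "news-1": "Vlada je danas usvojila novi zakon o porezu na dohodak građana.",
    "web-7":  "Vlada je danas usvojila novi zakon o porezu na dohodak gradjana.",
    "wiki-3": "Podgorica je glavni grad Crne Gore.",
}

lsh = MinHashLSH(threshold=0.5, num_perm=128)  # Jaccard threshold, illustrative
kept = []
for doc_id, text in docs.items():
    sig = doc_minhash(text)
    if lsh.query(sig):       # collides with a document we already kept
        continue
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print(kept)  # typically ['news-1', 'wiki-3']; 'web-7' is flagged as a near-duplicate
```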
Intended uses and limitations
Intended uses.
- Fine-tuning starting point where inference latency matters more than peak accuracy.
- Continued pretraining for domain-adapted BCMS encoders (legal, medical, technical).
- Embedding model fine-tunes where 149M is the right size for the latency budget (see the sketch after this list).
- Token classification tasks (NER, POS) where the base scale is sufficient.
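For the embedding use case, a minimal mean-pooled sentence-embedding sketch. The pooling choice is ours (the card does not prescribe one), and a contrastively fine-tuned head will outperform raw pretrained embeddings:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "permitt/galton-modernbertic-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

sentences = [
    "Podgorica je glavni grad Crne Gore.",
    "Zagreb je glavni grad Hrvatske.",
]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)

# Mean-pool over non-padding tokens, then L2-normalize.
mask = batch.attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
embeddings = torch.nn.functional.normalize(embeddings, dim=-1)

print(embeddings.shape)  # torch.Size([2, 768])
```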
Out of scope.
- High-stakes downstream classification or reasoning tasks where you want SOTA. Use galton-modernbertic-large instead.
- Generative tasks. This is an encoder, not a generative model. For text generation in BCMS, see the national LLM initiative announced April 2026 or general-purpose multilingual LLMs.
- Languages outside BCMS. The tokenizer is Latin-only and the corpus is BCMS-only.
Limitations.
- Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
- Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Heavy code, conversational chat, or highly technical scientific text are underrepresented.
- Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.
Citation
@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-base},
  note   = {Recrewty, EU-funded grant}
}
Acknowledgments
This work was developed at Recrewty as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.
Standing on the shoulders of:
- Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
- The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
- MosaicML / Databricks for Composer and the MDS streaming format.
- HuggingFace for the model hub, datasets, and the tokenizers library.
- JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
- EuroHPC and the Leonardo consortium for compute access.
See also
- permitt/galton-modernbertic-large - 395M parameter variant, SOTA on SuperGLUE-SR
- BalkanBench leaderboard - live evaluation across BCMS encoders
- Build-in-public series on LinkedIn - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results
- Medium release post - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release)
- All links in one place - a single entry point for the LinkedIn material above
Evaluation results
Self-reported scores on SuperGLUE-SR (BalkanBench v1.0):

| Task | Score |
|---|---|
| Average (6 tasks, 5 seeds) | 69.73 |
| BoolQ | 76.02 |
| CB | 76.96 |
| COPA | 65.76 |
| RTE | 65.82 |
| MultiRC | 66.90 |
| WSC | 64.11 |