ModernBERTić: The First Modern Encoder for South Slavic Languages

Community Article · Published April 30, 2026

After the release of BalkanBench two days ago, I am more than proud to finally release ModernBERTić - the first modern-architecture encoder for the Bosnian, Croatian, Montenegrin, and Serbian (BCMS) languages. The 395M-parameter large variant takes the top spot on SuperGLUE-SR, beating the previous best (BERTić, 2021) by +1.98 average points across 6 tasks, including a +7.97 point gain on COPA, the causal reasoning task where long-context modeling pays off most clearly.

Both the base and large variants are released as open weights under Apache 2.0.

The model was trained on up to 64× A100 GPUs on the Leonardo HPC under an EU-funded grant at Recrewty. I have been documenting the build in a weekly LinkedIn series since February (linked at the bottom) if you want the full timeline.

Quickstart

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "permitt/galton-modernbertic-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",   # requires flash-attn; switch to "sdpa" if it is not installed
    torch_dtype=torch.bfloat16,
).to("cuda")

text = "Glavni grad Srbije je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # "Beograd"
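
If you just want quick predictions without the manual mask bookkeeping, the standard transformers fill-mask pipeline works too. A minimal sketch (device handling kept simple):

from transformers import pipeline

fill = pipeline("fill-mask", model="permitt/galton-modernbertic-large", device=0)
for candidate in fill("Glavni grad Srbije je [MASK].", top_k=3):
    print(candidate["token_str"], round(candidate["score"], 3))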

Why a new encoder pretrained from scratch

Several encoder models have been released for BCMS languages over the years. However, all of them were either built on older architectures (pre-transformer++ :) ) or were continued pretrainings of multilingual models such as XLM-RoBERTa. We needed a context length well beyond 512 tokens and a native tokenizer to minimize token waste. ModernBERTić, based on the ModernBERT architecture with a tokenizer trained from scratch, was our best bet.

Most people do not realize this given the hype around decoders, but check which model was downloaded most on Hugging Face in the last 30 days and guess what you will find: an encoder model, with 220M downloads.

Data

Building a pretraining dataset for a language family with ~20M speakers is a different game than for English. There is no single massive crawl you can just download and filter; you assemble it.

In total, I ingested 22 BCMS sources and ended up with around 60B tokens across 227M documents - the largest curated pretraining corpus for BCMS to date (as far as I know). The list spans the full quality spectrum: FineWiki and curated news on the high end, CLASSLA's web text and bertic-macocu in the middle, then the big web crawls (FineWeb-2, HPLT 3.0, FinePDFs) at the bottom of the priority hierarchy.

Each source comes with its own failure modes. The fun part of this stage was discovering what those failure modes actually look like for BCMS. A few examples worth sharing:

Casino spam. The highest-quality bins of HPLT (bins 9 and 10 on a 1-10 quality scale) still contained large clusters of online gambling content targeted at the Balkans. Clean grammar, proper morphology, ranked high by the upstream classifier - the kind of text that passes English-trained quality filters with flying colors. Keyword patterns in both Latin and Cyrillic caught ~166K such documents.

Content farms. Foreign domains squatting on .sr and .bs subdomains, or /sr/ URL paths, serving auto-translated articles designed to rank in Google. Read 20 of them in a row and the patterns become obvious - broken named entities, especially in Cyrillic transliterations of foreign names. Roughly 25.5% of HPLT3 Serbian documents fell into this bucket; ~811K documents were removed.

OCR garbage in Cyrillic PDFs. FinePDFs has a Serbian Cyrillic split with systematic character substitutions (о ↔ п, и ↔ н, that kind of thing). I excluded the entire srp_Cyrl split and kept Latin-only PDFs, since transliteration on broken OCR just compounds the noise.
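
None of these filters ship with the post, but their shape is simple. Here is a minimal sketch of the first two heuristics; the keywords and URL pattern are illustrative placeholders, not the actual production rules:

import re

# Illustrative gambling keywords in both scripts (placeholders, not the real list)
GAMBLING = re.compile(r"kazino|kladionica|казино|кладионица", re.IGNORECASE)

# Content-farm heuristic: foreign domains serving auto-translated /sr/ URL paths
FARM_URL = re.compile(r"https?://[^/]+\.(?:com|net|org)/sr/")

def is_spam_or_farm(text: str, url: str = "") -> bool:
    """Flag documents matching either the gambling-keyword or content-farm pattern."""
    return bool(GAMBLING.search(text)) or bool(FARM_URL.match(url))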

The pipeline took shape in stages:

  1. multi-source ingestion
  2. tiered prioritization
  3. script normalization (Cyrillic → Latin via transliteration library)
  4. BCMS-specific quality filters (Gopher-derived heuristics plus our own stop-word ratio over 65 curated function words; a sketch follows the list)
  5. cross-source MinHash LSH dedup (128 permutations, 0.8 Jaccard threshold, word 5-grams)
  6. tokenize with the ModernBERTić tokenizer
  7. pack offline for faster training
  8. write MDS shards on S3
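
For step 4, this is roughly what the stop-word-ratio filter looks like. A minimal sketch; the six function words and the threshold are placeholders standing in for the 65 curated ones:

# Placeholder subset of the 65 curated BCMS function words
FUNCTION_WORDS = {"i", "je", "da", "se", "u", "na"}

def stopword_ratio(text: str) -> float:
    """Share of whitespace tokens that are curated BCMS function words."""
    words = text.lower().split()
    return sum(w in FUNCTION_WORDS for w in words) / len(words) if words else 0.0

def passes_quality(text: str, min_ratio: float = 0.06) -> bool:
    # Threshold is an assumption; a real pipeline tunes this per source
    return stopword_ratio(text) >= min_ratio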

The dedup step is where priority ordering matters. Higher-quality sources get indexed into the MinHash LSH structure first. When a duplicate pair is found across tiers, the higher-quality copy wins. Result: HPLT 3.0 (lowest priority) lost 38-67% of its documents as duplicates of content already in FineWeb-2, CLASSLA, or xlm_bertic. Meanwhile, xlm_bertic, after filtering, had only 0.04% removal - it was already unique. The priority pipeline did exactly what it should.
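
A minimal sketch of that priority-ordered dedup, using the datasketch library (an assumption - the post does not name the implementation - with the stated parameters: 128 permutations, 0.8 Jaccard threshold, word 5-grams; docs_in_priority_order is a hypothetical iterator):

from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.8, num_perm=128)

def minhash(text: str) -> MinHash:
    """MinHash over word 5-gram shingles."""
    words = text.split()
    m = MinHash(num_perm=128)
    for i in range(max(len(words) - 4, 0)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

kept = []
# Iterate sources from highest to lowest priority: earlier documents win duplicate ties
for doc_id, text in docs_in_priority_order():
    m = minhash(text)
    if lsh.query(m):      # near-duplicate of an already-indexed, higher-priority doc
        continue
    lsh.insert(doc_id, m)
    kept.append(doc_id)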

After dedup, selective upsampling: FineWiki 2× (encyclopedic quality is scarce), Montenegrin 4× (smallest BCMS variant, boosted from 0.8% to 3% of the final mix). The whole pipeline writes MDS shards to S3 in around 7.5 hours on a 1.5TB RAM bare metal instance (some things, like the MinHash index, really did need that kind of RAM :D).

Tokenizer

The choice to train a dedicated BPE from scratch was driven by one number: fertility (tokens per word). On held-out Serbian text:

| Tokenizer | Vocab | Tokens/word |
|---|---|---|
| ModernBERTić (BPE, Latin, cased) | 50,304 | 1.59 |
| BERTić (WordPiece) | 32K | 1.66 |
| XLM-R-BERTić (SentencePiece) | 250K | 1.91 |
| ModernBERT multilingual (BPE) | 50K | 2.57 |

A multilingual vocabulary spreads its budget across 100+ languages. For BCMS, most of those tokens are dead weight. In an 8192-token context window at 2.57 fertility, you fit ~3,190 words. At 1.59, you fit ~5,150. 62% more text per sequence, same architecture.
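
Fertility is cheap to measure yourself. A minimal sketch, assuming whitespace word counts and any held-out text file (heldout.txt is a placeholder path):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("permitt/galton-modernbertic-large")

text = open("heldout.txt", encoding="utf-8").read()
n_words = len(text.split())
n_tokens = len(tok(text, add_special_tokens=False).input_ids)
print(f"fertility: {n_tokens / n_words:.2f} tokens/word")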

Four decisions, in order of impact:

Algorithm: BPE. Literature on morphologically rich languages suggests Unigram (SentencePiece) should win - it optimizes globally over the vocabulary rather than greedy pair merging. I tested both on subsets of 10-100M tokens. BPE: 1.59. Unigram: 1.63. Within 3% of each other. BPE trains on the full corpus without the memory issues that Unigram's EM loop runs into at scale, so the engineering tradeoff favored BPE. On a corpus this size, the algorithmic advantage of Unigram seems to collapse.

Vocab size: 50,304. Nearest multiple of 64 above 50K. GPU tensor cores operate on tiles of 8/16/64, so aligning the embedding matrix avoids wasted compute on every forward pass. Karpathy has a tweet on nanoGPT showing a 25% speedup from this exact tweak. Across 4500 Leonardo node-hours, those things compound.

Script: Latin only. Serbian actively uses both the Cyrillic and Latin scripts. Training the tokenizer on both splits the vocabulary - half your BPE merges go to Cyrillic subwords, half to Latin, and you end up mediocre at both. We transliterate everything to Latin upfront, before tokenization, on every source. Our production use cases see <1% Cyrillic input, so the loss is minimal.
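
The post does not name the transliteration library; one common choice for Serbian is the cyrtranslit package, used here as an assumption:

import cyrtranslit

# Serbian Cyrillic -> Latin, applied to every source before tokenization
print(cyrtranslit.to_latin("Главни град Србије је Београд.", "sr"))
# "Glavni grad Srbije je Beograd."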

Casing: cased. This one surprised me. Cased: 1.59. Uncased: 1.85. A 14% gap, because lowercasing destroys merge patterns where proper nouns and sentence-initial capitals carry real subword signal.
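
Putting the four decisions together, a minimal sketch of training such a tokenizer with the Hugging Face tokenizers library; the byte-level pre-tokenizer, the special tokens, and the file path are my assumptions, not the released recipe:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Cased byte-level BPE over Latin-only, pre-transliterated BCMS text
tok = Tokenizer(models.BPE())
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tok.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50_304,  # nearest multiple of 64 above 50K
    special_tokens=["[CLS]", "[SEP]", "[MASK]", "[PAD]", "[UNK]"],
)
tok.train(files=["bcms_latin.txt"], trainer=trainer)
tok.save("modernbertic-tokenizer.json")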

Architecture and the training setup

The recipe is a BCMS adaptation of the ModernBERT-large configuration (a config sketch follows the list):

  • 28 layers, 1024 hidden dim, 16 attention heads, 395M parameters
  • Alternating attention: 256-token sliding window with full (global) attention every second layer
  • RoPE position embeddings with base 160K, native 8192-token context
  • FlashAttention 2, unpadding, GLU MLPs, prenorm
  • 30% MLM masking (vs the standard 15%, gives ~2× signal per batch)
  • Peak LR 5e-4 with warmup-stable-decay schedule, decay phase at ~9% of total tokens
  • Global batch size 4096 sequences, kept constant across GPU counts (strong scaling)
  • Mixed precision bf16
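
As a rough translation of those bullets into transformers' ModernBertConfig - the field names are the library's, the values mirror the list above, and I have not checked them against the released config:

from transformers import ModernBertConfig, ModernBertForMaskedLM

config = ModernBertConfig(
    vocab_size=50_304,
    num_hidden_layers=28,
    hidden_size=1024,
    intermediate_size=2624,          # ModernBERT-large MLP width
    num_attention_heads=16,
    max_position_embeddings=8_192,   # native context length
    global_rope_theta=160_000.0,     # RoPE base
    global_attn_every_n_layers=2,    # full attention every second layer
    local_attention=256,             # sliding-window size on the other layers
)
model = ModernBertForMaskedLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")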

The training stack is built around MosaicML Composer + the FlexBERT codebase that ModernBERT ships with. Composer handles process spawning via torchrun, gradient sync via NCCL, learning rate scheduling in tokens (not steps), checkpointing keyed on ${SLURM_JOB_ID}, and mixed precision. MDS streaming handles deterministic shuffling and instant auto-resume across the 24-hour job limit on Leonardo.

Training efficiency

The final 64-GPU run hit ~45% MFU, computed as (6 × 395M × 59,584 tokens/s/device) / 312 TFLOP/s (the A100 bf16 tensor-core peak) using the standard FLOPs-per-token approximation. Well-tuned encoder runs on A100s usually land in the 35-50% range; 45% on a 28-layer ModernBERT with alternating sliding-window attention is exactly where a competent setup should sit.
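
The arithmetic, spelled out:

params = 395e6
tokens_per_sec_per_device = 59_584
peak_flops = 312e12  # A100 bf16 tensor-core peak, per device

# Standard 6*N*D approximation: ~6 FLOPs per parameter per token (forward + backward)
achieved = 6 * params * tokens_per_sec_per_device
print(f"MFU: {achieved / peak_flops:.1%}")  # ~45.3%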

The two changes that moved this number the most:

Offline sequence prepacking. The first runs were doing variable-length-to-fixed-length packing online during training. CPU bottleneck, GPUs starving for data, throughput stuck at ~144K tokens/sec aggregate. Moving the packing offline (a 6-hour preprocessing job consuming 200GB of RAM) and streaming pre-packed batches via MosaicML's MDS format pushed throughput up by ~13×.

Cross-document attention masking via cu_seqlens. Sequence packing is throughput-positive but correctness-negative if you do not thread the sequence_id boundaries through to FlashAttention's cu_seqlens. Without it, every packed sequence is treated as one continuous document and tokens from doc A attend to tokens from doc B. Silent corruption, no error, all previous runs affected. Fixing this bumped effective packing utilization from ~10% to 95%+. Worth its own post.
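
The fix boils down to handing FlashAttention the document boundaries. A minimal sketch of building cu_seqlens from the per-document lengths of one packed sequence and calling flash-attn's varlen kernel; shapes and lengths are illustrative, this is the idea rather than the FlexBERT code:

import torch
from flash_attn import flash_attn_varlen_func

# Lengths of the three documents packed into one 8192-token sequence
doc_lens = torch.tensor([1872, 3100, 3220], dtype=torch.int32, device="cuda")

# cu_seqlens: cumulative boundaries with a leading zero -> [0, 1872, 4972, 8192]
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.int32, device="cuda"),
                        doc_lens.cumsum(0).to(torch.int32)])

# Packed projections: (total_tokens, n_heads, head_dim)
total = int(doc_lens.sum())
q = torch.randn(total, 16, 64, dtype=torch.bfloat16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# Boundaries keep tokens of doc A from attending to tokens of doc B
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=int(doc_lens.max()), max_seqlen_k=int(doc_lens.max()),
)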

Standard caveats apply to the MFU number. The 6 × N × D approximation slightly overstates FLOPs for the alternating sliding-window pattern (you do not pay full quadratic cost on the windowed layers), so the true compute utilization is a couple of points higher. Reporting the conservative number.

A few notes on the 64-GPU run. SLURM allocates 16 nodes, 4 GPUs each. The thing to watch for is NCCL falling back to TCP instead of using InfiniBand - it does not error, it just runs an order of magnitude slower. The kind of bug that shows up as "mysteriously low throughput" on the WandB dashboard, not as a stack trace. Set NCCL_IB_DISABLE=0 and NCCL_DEBUG=INFO, then verify the IB transport is selected in the first few log lines of the run.

The full ~60B token training run on the large model took roughly 7 hours on 64 A100s.

Evaluation and results

BalkanBench v1.0 evaluates BCMS encoders on the SuperGLUE-SR suite: BoolQ, CB, COPA, RTE, MultiRC, and WSC. Every cell on the leaderboard is a mean across 5 random seeds with reported standard deviation.

| Rank | Model | Params | BoolQ | CB | COPA | RTE | MultiRC | WSC | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ModernBERTić-large | 395M | 80.70 | 78.52 | 76.84 | 73.13 | 67.90 | 63.56 | 73.44 |
| 2 | BERTić | 110M | 77.79 | 78.61 | 68.87 | 71.70 | 66.75 | 65.07 | 71.46 |
| 3 | ModernBERTić-base | 149M | 76.02 | 76.96 | 65.76 | 65.82 | 66.90 | 64.11 | 69.73 |
| 4 | mmBERT | 307M | 78.17 | 78.93 | 56.00 | 73.10 | 63.79 | 62.33 | 68.72 |
| 5 | XLM-R-BERTić | 560M | 73.98 | 83.90 | 58.60 | 73.78 | 47.43 | 64.16 | 66.97 |

ModernBERTić-large takes the top spot at 73.44 average, +1.98 over BERTić (71.46), winning 3 of the 6 tasks outright (BoolQ, COPA, MultiRC). The full leaderboard with all 9 evaluated models is live at balkanbench.com.

The single most interesting result in this table is COPA. ModernBERTić-large hits 76.84 vs BERTić's 68.87 - a +7.97 point gain. COPA is causal reasoning over premise-alternative pairs. The pairs frequently span 150+ tokens, which is exactly the regime where 8K-context modeling and global attention every second layer should dominate over 512-token full-attention BERT. They do. This is the architecture bet paying off in the cleanest possible way.

A few honest observations the leaderboard makes you confront:

The base model loses to BERTić. ModernBERTić-base at 149M lands 1.73 points below BERTić-110M. This is the ELECTRA signal-density gap I have been writing about. BERTić's discriminator gets a training signal from 100% of tokens; MLM at 30% masking gets one from 30% of them. At smaller capacity, that supervision deficit is not absorbed by the architecture; by 395M it is. The cleaner statement of the result: ModernBERTić wins decisively at scale, but matching BERTić at base scale would require techniques (denser supervision, longer training) that are not in the standard ModernBERT recipe. Workshop paper material, maybe :)

Variance is a story too. ModernBERTić-large's worst per-task standard deviation is ±3.82, on CB. Part of that stability is simply more parameters absorbing seed noise, but part is cleaner training with fewer corrupted gradients: the variance settled after the cooldown phase, the switch to a 15% masking ratio for that phase, and the later context extension.

Inference throughput. On identical hardware, 1454 samples/sec vs BERTić's 408. 3.5× faster, thanks to FlashAttention 2 and unpadding. For our production use case at Recrewty - processing thousands of CVs daily - this compounds fast.

What is coming next

You can freely try out the base model through the demo Space. We are currently wrapping up search models for Serbian, Montenegrin, Croatian, and Bosnian, plus domain-specific adaptations. ModernBERTić will power new features in our Recrewty platform:

  • Assessment recommender - based on the job description, the model recommends which psychometric assessments to run for that role.
  • Roster management - given a new hiring selection's criteria, predict the highest-fit candidates from all previous selections on the platform.
  • Proprietary embedding and reranker models offered to partners and clients. If you have BCMS use cases that this could be a great fit for, let's chat.

Can't wait to share all of this with you during May - we will organize live demo events in Serbia and Montenegro. Stay tuned :)

Acknowledgments

Built at Recrewty under an EU-funded grant. Compute provided by the Leonardo HPC consortium under EuroHPC. Standing on the shoulders of Nikola Ljubešić and the CLASSLA team for BERTić and the broader BCMS NLP infrastructure that made this work possible, the ModernBERT team for the architecture and FlexBERT codebase, MosaicML / Databricks for Composer and MDS, and the JeRTeh and ReLDI communities for datasets and evaluation resources.
