|
--- |
|
license: mit |
|
datasets: |
|
- jhu-clsp/mmbert-decay |
|
- jhu-clsp/mmbert-midtraining |
|
- jhu-clsp/mmbert-pretrain-p1-fineweb2-langs |
|
- jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining |
|
- jhu-clsp/mmbert-pretrain-p3-others |
|
pipeline_tag: fill-mask |
|
library_name: transformers |
|
--- |
|
|
|
# mmBERT: A Modern Multilingual Encoder |
|
|
|
[License: MIT](https://opensource.org/licenses/MIT)
[Paper (arXiv:2509.06888)](https://arxiv.org/abs/2509.06888)
[Model: mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base)
[Model Collection](https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4)
[GitHub Repository](https://github.com/jhu-clsp/mmBERT)
|
|
|
> TL;DR: A state-of-the-art multilingual encoder trained on 3T+ tokens across 1800+ languages, introducing novel techniques for learning low-resource languages during the decay phase. |
|
|
|
mmBERT is a modern multilingual encoder that significantly outperforms previous generation models like XLM-R on classification, embedding, and retrieval tasks. Built on the ModernBERT architecture with novel multilingual training innovations, mmBERT demonstrates that low-resource languages can be effectively learned during the decay phase of training. It is also significantly faster than any previous multilingual encoder. |
|
|
|
## Table of Contents |
|
- [Quick Start](#quick-start)
- [Model Description](#model-description)
- [Novel Training Innovations](#novel-training-innovations)
- [Model Family](#model-family)
- [Training Data](#training-data)
- [Model Architecture](#model-architecture)
- [Usage Examples](#usage-examples)
- [Fine-tuning Examples](#fine-tuning-examples)
- [Citation](#citation)
|
|
|
|
|
## Quick Start |
|
|
|
### Installation |
|
```bash |
|
pip install "torch>=1.9.0"
# mmBERT uses the ModernBERT architecture, which requires a recent transformers release
pip install "transformers>=4.48.0"
|
``` |
|
|
|
### Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base") |
|
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base") |
|
|
|
inputs = tokenizer("Hello world", return_tensors="pt") |
|
outputs = model(**inputs) |
|
``` |
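
For a quick qualitative check, the same checkpoint can also be loaded through the `fill-mask` pipeline. A minimal sketch (the mask token string is read from the tokenizer rather than hard-coded):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")
mask = unmasker.tokenizer.mask_token  # use the tokenizer's own mask token

# Top predictions for a masked word in two languages
print(unmasker(f"The capital of France is {mask}.")[:3])
print(unmasker(f"La capital de España es {mask}.")[:3])
```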
|
|
|
## Model Description |
|
|
|
mmBERT represents the first significant advancement over XLM-R for massively multilingual encoder models. Key features include: |
|
|
|
1. **Massive Language Coverage** - Trained on over 1800 languages with progressive inclusion strategy |
|
2. **Modern Architecture** - Built on ModernBERT foundation with Flash Attention 2 and unpadding techniques |
|
3. **Novel Training Recipe** - Introduces inverse mask scheduling and temperature sampling |
|
4. **Open Training Data** - Complete 3T+ token dataset publicly available |
|
5. **Decay Phase Innovation** - Demonstrates effective learning of low-resource languages in final training phase |
|
|
|
The model uses bidirectional attention with masked language modeling objectives, optimized specifically for multilingual understanding and cross-lingual transfer. |
|
|
|
## Novel Training Innovations |
|
|
|
**Progressive Language Addition**: Start with 60 high-resource languages, expand to 110 languages by adding mid-resource languages at mid-training, then include all 1833 languages during the decay phase.
|
|
|
**Inverse Mask Schedule**: Reduce mask ratio from 30% → 15% → 5% across training phases for progressively refined learning. |
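
As a rough illustration (not the released training code), this schedule amounts to lowering the `mlm_probability` of the masked-language-modeling data collator at each phase boundary:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")

# One collator per phase, each masking a smaller fraction of tokens than the last.
phase_mask_ratios = {"pretraining": 0.30, "midtraining": 0.15, "decay": 0.05}
collators = {
    phase: DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=ratio)
    for phase, ratio in phase_mask_ratios.items()
}
```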
|
|
|
**Inverse Temperature Sampling**: Anneal the multilingual sampling temperature from high-resource-biased sampling (τ=0.7) toward more uniform sampling across languages (τ=0.3).
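
For intuition, here is a minimal sketch of temperature-based language sampling; the language codes and token counts below are made-up placeholders, not the actual training mixture:

```python
import numpy as np

def language_sampling_probs(token_counts: dict, tau: float) -> dict:
    """Raise corpus proportions to the power tau and renormalize.

    tau = 1.0 keeps the raw proportions (strong high-resource bias);
    smaller tau flattens the distribution toward uniform sampling.
    """
    counts = np.array(list(token_counts.values()), dtype=np.float64)
    probs = counts / counts.sum()
    scaled = probs ** tau
    return dict(zip(token_counts, scaled / scaled.sum()))

# Hypothetical per-language token counts (billions of tokens).
corpus = {"en": 1200.0, "sw": 15.0, "fo": 0.5}

for tau in (0.7, 0.3):  # early-phase vs. decay-phase temperature
    print(tau, language_sampling_probs(corpus, tau))
```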
|
|
|
**Model Merging**: Combine English-focused, high-resource, and all-language decay variants using TIES merging. |
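
The released checkpoints were merged with TIES; the snippet below is only a rough sketch of the trim / sign-election / averaging idea over plain state dicts (the `density` value and helper function are illustrative assumptions, not the actual merging configuration). In practice, an off-the-shelf merging library is the more practical route.

```python
import torch

def ties_merge(base_state, finetuned_states, density=0.2):
    """Illustrative TIES-style merge of several fine-tuned checkpoints.

    For each parameter: keep only the top-`density` fraction of each task
    vector by magnitude (trim), elect a per-entry sign from the summed
    trimmed deltas, then average the entries that agree with that sign.
    Assumes all state dicts share keys and hold floating-point tensors.
    """
    merged = {}
    for name, base_param in base_state.items():
        trimmed = []
        for state in finetuned_states:
            delta = state[name] - base_param
            k = max(1, int(density * delta.numel()))
            threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
            trimmed.append(torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta)))
        stacked = torch.stack(trimmed)
        elected_sign = torch.sign(stacked.sum(dim=0))
        agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
        count = agree.sum(dim=0).clamp(min=1)
        merged[name] = base_param + (stacked * agree).sum(dim=0) / count
    return merged
```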
|
|
|
## Model Family |
|
|
|
| Model | Total Params | Non-embed Params | Languages | Download | |
|
|:------|:-------------|:------------------|:----------|:---------| |
|
| [mmBERT-small](https://huggingface.co/jhu-clsp/mmBERT-small) | 140M | 42M | 1800+ | [Download](https://huggingface.co/jhu-clsp/mmBERT-small) |
| [mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) | 307M | 110M | 1800+ | [Download](https://huggingface.co/jhu-clsp/mmBERT-base) |
|
|
|
## Training Data |
|
|
|
mmBERT training data is publicly available across different phases: |
|
|
|
| Phase | Dataset | Tokens | Description | |
|
|:------|:--------|:-------|:------------| |
|
| Pre-training P1 | [mmbert-pretrain-p1](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p1-fineweb2-langs) | 2.3T | 60 languages, foundational training |
| Pre-training P2 | [mmbert-pretrain-p2](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p2-fineweb2-remaining) | - | Extension data for pre-training phase |
| Pre-training P3 | [mmbert-pretrain-p3](https://huggingface.co/datasets/jhu-clsp/mmbert-pretrain-p3-others) | - | Final pre-training data |
| Mid-training | [mmbert-midtraining](https://huggingface.co/datasets/jhu-clsp/mmbert-midtraining) | 600B | 110 languages, context extension to 8K |
| Decay Phase | [mmbert-decay](https://huggingface.co/datasets/jhu-clsp/mmbert-decay) | 100B | 1833 languages, high-quality data |
|
|
|
**Data Sources**: Filtered DCLM (English), FineWeb2 (multilingual), FineWeb2-HQ (20 high-resource languages), Wikipedia (MegaWika), textbooks, code repositories (StarCoder, ProLong), academic papers (ArXiv, PeS2o), and community discussions (StackExchange).
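
The released datasets can be inspected directly from the Hub. A small sketch, assuming the default configuration exposes a `train` split (adjust the repository and split names as needed):

```python
from datasets import load_dataset

# Stream a few records from the phase-1 pre-training corpus
# without downloading the full dataset.
ds = load_dataset("jhu-clsp/mmbert-pretrain-p1-fineweb2-langs", split="train", streaming=True)
for i, example in enumerate(ds):
    print(example)
    if i == 2:
        break
```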
|
|
|
## Model Architecture |
|
|
|
| Parameter | mmBERT-small | mmBERT-base | |
|
|:----------|:-------------|:------------| |
|
| Layers | 22 | 22 | |
|
| Hidden Size | 384 | 768 | |
|
| Intermediate Size | 1152 | 1152 | |
|
| Attention Heads | 6 | 12 | |
|
| Total Parameters | 140M | 307M | |
|
| Non-embedding Parameters | 42M | 110M | |
|
| Max Sequence Length | 8192 | 8192 | |
|
| Vocabulary Size | 256,000 | 256,000 | |
|
| Tokenizer | Gemma 2 | Gemma 2 | |
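
The table can be cross-checked against the released configuration (attribute names as exposed by the Hugging Face config):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jhu-clsp/mmBERT-base")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
print(config.max_position_embeddings, config.vocab_size)
```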
|
|
|
## Usage Examples |
|
|
|
### Masked Language Modeling |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForMaskedLM |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base") |
|
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base") |
|
|
|
def predict_masked_token(text): |
|
inputs = tokenizer(text, return_tensors="pt") |
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
|
|
mask_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id) |
|
predictions = outputs.logits[mask_indices] |
|
top_tokens = torch.topk(predictions, 5, dim=-1) |
|
|
|
return [tokenizer.decode(token) for token in top_tokens.indices[0]] |
|
|
|
# Works across languages |
|
# Works across languages; use the tokenizer's mask token rather than a hard-coded string
texts = [
    f"The capital of France is {tokenizer.mask_token}.",
    f"La capital de España es {tokenizer.mask_token}.",
    f"Die Hauptstadt von Deutschland ist {tokenizer.mask_token}."
]
|
|
|
for text in texts: |
|
predictions = predict_masked_token(text) |
|
print(f"Text: {text}") |
|
print(f"Predictions: {predictions}") |
|
``` |
|
|
|
### Cross-lingual Embeddings |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel |
|
import torch |
|
from sklearn.metrics.pairwise import cosine_similarity |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base") |
|
model = AutoModel.from_pretrained("jhu-clsp/mmBERT-base") |
|
|
|
def get_embeddings(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        # Mean-pool over tokens, excluding padding positions via the attention mask.
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

    return embeddings.numpy()
|
|
|
multilingual_texts = [ |
|
"Artificial intelligence is transforming technology", |
|
"La inteligencia artificial está transformando la tecnología", |
|
"L'intelligence artificielle transforme la technologie", |
|
"人工智能正在改变技术" |
|
] |
|
|
|
embeddings = get_embeddings(multilingual_texts) |
|
similarities = cosine_similarity(embeddings) |
|
print("Cross-lingual similarity matrix:") |
|
print(similarities) |
|
``` |
|
|
|
## Fine-tuning Examples |
|
|
|
### Dense Retrieval with Sentence Transformers |
|
|
|
<details> |
|
<summary>Click to expand dense retrieval fine-tuning example</summary> |
|
|
|
```python |
|
import argparse |
|
from datasets import load_dataset |
|
from sentence_transformers import ( |
|
SentenceTransformer, |
|
SentenceTransformerTrainer, |
|
SentenceTransformerTrainingArguments, |
|
) |
|
from sentence_transformers.evaluation import TripletEvaluator |
|
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss |
|
from sentence_transformers.training_args import BatchSamplers |
|
|
|
def main(): |
|
parser = argparse.ArgumentParser() |
|
parser.add_argument("--lr", type=float, default=8e-5) |
|
parser.add_argument("--model_name", type=str, default="jhu-clsp/mmBERT-base") |
|
args = parser.parse_args() |
|
|
|
lr = args.lr |
|
model_name = args.model_name |
|
model_shortname = model_name.split("/")[-1] |
|
|
|
model = SentenceTransformer(model_name) |
|
|
|
dataset = load_dataset( |
|
"sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1", |
|
"triplet-hard", |
|
split="train", |
|
) |
|
dataset_dict = dataset.train_test_split(test_size=1_000, seed=12) |
|
train_dataset = dataset_dict["train"].select(range(1_250_000)) |
|
eval_dataset = dataset_dict["test"] |
|
|
|
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16) |
|
run_name = f"{model_shortname}-DPR-{lr}" |
|
|
|
training_args = SentenceTransformerTrainingArguments( |
|
output_dir=f"output/{model_shortname}/{run_name}", |
|
num_train_epochs=1, |
|
per_device_train_batch_size=512, |
|
per_device_eval_batch_size=512, |
|
warmup_ratio=0.05, |
|
fp16=False, |
|
bf16=True, |
|
batch_sampler=BatchSamplers.NO_DUPLICATES, |
|
learning_rate=lr, |
|
save_strategy="steps", |
|
save_steps=500, |
|
save_total_limit=2, |
|
logging_steps=500, |
|
run_name=run_name, |
|
) |
|
|
|
dev_evaluator = TripletEvaluator( |
|
anchors=eval_dataset["query"], |
|
positives=eval_dataset["positive"], |
|
negatives=eval_dataset["negative"], |
|
name="msmarco-co-condenser-dev", |
|
) |
|
dev_evaluator(model) |
|
|
|
trainer = SentenceTransformerTrainer( |
|
model=model, |
|
args=training_args, |
|
train_dataset=train_dataset, |
|
eval_dataset=eval_dataset, |
|
loss=loss, |
|
evaluator=dev_evaluator, |
|
) |
|
trainer.train() |
|
|
|
model.save_pretrained(f"output/{model_shortname}/{run_name}/final") |
|
model.push_to_hub(run_name, private=False) |
|
|
|
if __name__ == "__main__": |
|
main() |
|
``` |
|
|
|
</details> |
|
|
|
### Cross-lingual Classification |
|
|
|
<details> |
|
<summary>Click to expand multilingual classification fine-tuning example</summary> |
|
|
|
```python |
|
from transformers import ( |
|
AutoTokenizer, |
|
AutoModelForSequenceClassification, |
|
TrainingArguments, |
|
Trainer |
|
) |
|
from datasets import load_dataset |
|
import numpy as np |
|
from sklearn.metrics import accuracy_score, f1_score |
|
|
|
def compute_metrics(eval_pred): |
|
predictions, labels = eval_pred |
|
predictions = np.argmax(predictions, axis=1) |
|
return { |
|
'accuracy': accuracy_score(labels, predictions), |
|
'f1': f1_score(labels, predictions, average='weighted') |
|
} |
|
|
|
def main(): |
|
model_name = "jhu-clsp/mmBERT-base" |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForSequenceClassification.from_pretrained( |
|
model_name, |
|
num_labels=3 |
|
) |
|
|
|
dataset = load_dataset("xnli", "all_languages") |
|
|
|
def tokenize_function(examples): |
|
texts = [f"{p} {tokenizer.sep_token} {h}" |
|
for p, h in zip(examples["premise"], examples["hypothesis"])] |
|
|
|
return tokenizer( |
|
texts, |
|
truncation=True, |
|
padding=True, |
|
max_length=512 |
|
) |
|
|
|
train_dataset = dataset["train"].map(tokenize_function, batched=True) |
|
eval_dataset = dataset["validation"].map(tokenize_function, batched=True) |
|
|
|
training_args = TrainingArguments( |
|
output_dir="./mmbert-xnli", |
|
learning_rate=3e-5, |
|
per_device_train_batch_size=32, |
|
per_device_eval_batch_size=32, |
|
num_train_epochs=3, |
|
weight_decay=0.01, |
|
evaluation_strategy="epoch", |
|
save_strategy="epoch", |
|
load_best_model_at_end=True, |
|
metric_for_best_model="f1", |
|
greater_is_better=True, |
|
) |
|
|
|
trainer = Trainer( |
|
model=model, |
|
args=training_args, |
|
train_dataset=train_dataset, |
|
eval_dataset=eval_dataset, |
|
compute_metrics=compute_metrics, |
|
) |
|
|
|
trainer.train() |
|
|
|
if __name__ == "__main__": |
|
main() |
|
``` |
|
|
|
</details> |
|
|
|
### Multilingual Reranking |
|
|
|
<details> |
|
<summary>Click to expand multilingual reranking fine-tuning example</summary> |
|
|
|
```python |
|
import logging |
|
from datasets import load_dataset |
|
from sentence_transformers.cross_encoder import ( |
|
CrossEncoder, |
|
CrossEncoderModelCardData, |
|
CrossEncoderTrainer, |
|
CrossEncoderTrainingArguments, |
|
) |
|
from sentence_transformers.cross_encoder.evaluation import CrossEncoderNanoBEIREvaluator |
|
from sentence_transformers.cross_encoder.losses import BinaryCrossEntropyLoss |
|
from sentence_transformers.util import mine_hard_negatives |
|
from sentence_transformers import SentenceTransformer |
|
import torch |
|
|
|
def main(): |
|
model_name = "jhu-clsp/mmBERT-base" |
|
train_batch_size = 32 |
|
num_epochs = 2 |
|
num_hard_negatives = 7 |
|
|
|
model = CrossEncoder( |
|
model_name, |
|
model_card_data=CrossEncoderModelCardData( |
|
language="multilingual", |
|
license="mit", |
|
), |
|
) |
|
|
|
full_dataset = load_dataset("sentence-transformers/gooaq", split="train").select(range(50_000)) |
|
dataset_dict = full_dataset.train_test_split(test_size=1_000, seed=42) |
|
train_dataset = dataset_dict["train"] |
|
eval_dataset = dataset_dict["test"] |
|
|
|
embedding_model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2", device="cpu") |
|
hard_train_dataset = mine_hard_negatives( |
|
train_dataset, |
|
embedding_model, |
|
num_negatives=num_hard_negatives, |
|
margin=0, |
|
range_min=0, |
|
range_max=100, |
|
sampling_strategy="top", |
|
batch_size=2048, |
|
output_format="labeled-pair", |
|
use_faiss=True, |
|
) |
|
|
|
loss = BinaryCrossEntropyLoss(model=model, pos_weight=torch.tensor(num_hard_negatives)) |
|
|
|
nano_beir_evaluator = CrossEncoderNanoBEIREvaluator( |
|
dataset_names=["msmarco", "nfcorpus", "nq"], |
|
batch_size=train_batch_size, |
|
) |
|
|
|
args = CrossEncoderTrainingArguments( |
|
output_dir="./mmbert-reranker", |
|
num_train_epochs=num_epochs, |
|
per_device_train_batch_size=train_batch_size, |
|
per_device_eval_batch_size=train_batch_size, |
|
learning_rate=2e-5, |
|
warmup_ratio=0.1, |
|
fp16=False, |
|
bf16=True, |
|
dataloader_num_workers=4, |
|
load_best_model_at_end=True, |
|
metric_for_best_model="eval_msmarco_ndcg@10", |
|
eval_strategy="steps", |
|
eval_steps=1000, |
|
save_strategy="steps", |
|
save_steps=1000, |
|
save_total_limit=2, |
|
logging_steps=200, |
|
seed=42, |
|
) |
|
|
|
trainer = CrossEncoderTrainer( |
|
model=model, |
|
args=args, |
|
train_dataset=hard_train_dataset, |
|
loss=loss, |
|
evaluator=nano_beir_evaluator, |
|
) |
|
trainer.train() |
|
|
|
model.save_pretrained("./mmbert-reranker/final") |
|
|
|
if __name__ == "__main__": |
|
main() |
|
``` |
|
|
|
</details> |
|
|
|
|
|
|
|
## Citation |
|
|
|
If you use mmBERT in your research, please cite our work: |
|
|
|
```bibtex |
|
@misc{marone2025mmbertmodernmultilingualencoder, |
|
title={mmBERT: A Modern Multilingual Encoder with Annealed Language Learning}, |
|
author={Marc Marone and Orion Weller and William Fleshman and Eugene Yang and Dawn Lawrie and Benjamin Van Durme}, |
|
year={2025}, |
|
eprint={2509.06888}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2509.06888}, |
|
} |
|
``` |
|
""" |