NagameseBERT


A foundational BERT model for Nagamese Creole: a compact, efficient language model for a low-resource Northeast Indian language.


Overview

NagameseBERT is a 7M parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. Despite being 15× smaller than multilingual models like mBERT (110M) and XLM-RoBERTa (125M), it achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.

Key Features:

  • Compact: 6.9M parameters (15× smaller than mBERT)
  • Efficient: Pre-trained in 35 minutes on a single A40 GPU
  • Custom tokenizer: 8K BPE vocabulary optimized for Nagamese
  • Rigorous evaluation: Multi-seed testing (n=3) with reproducible results
  • Open: Model, code, and data splits publicly available

Performance

Multi-seed evaluation results (mean ± std, n=3):

Model        | Parameters | POS Accuracy  | POS F1        | NER Accuracy  | NER F1
NagameseBERT | 7M         | 88.35 ± 0.71% | 0.807 ± 0.013 | 91.74 ± 0.68% | 0.565 ± 0.054
mBERT        | 110M       | 95.14 ± 0.47% | 0.916 ± 0.008 | 96.11 ± 0.72% | 0.750 ± 0.064
XLM-RoBERTa  | 125M       | 95.64 ± 0.56% | 0.919 ± 0.008 | 96.38 ± 0.26% | 0.819 ± 0.066

Trade-off: 6-7 percentage points lower accuracy in exchange for a 15× parameter reduction, enabling deployment in resource-constrained settings.


Model Details

Architecture

  • Type: RoBERTa-style BERT (no token type embeddings)
  • Hidden size: 256
  • Layers: 6 transformer blocks
  • Attention heads: 4 per layer
  • Intermediate size: 1,024
  • Max sequence length: 64 tokens
  • Total parameters: 6,878,528
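
A configuration with these dimensions can be instantiated directly in transformers. This is a sketch, not the released training code: the max_position_embeddings and type_vocab_size values below are assumptions (RoBERTa reserves two extra position slots, and HF RoBERTa keeps a single-entry token-type table even when token types are unused), but with them the parameter count lands at the figure above.

from transformers import RobertaConfig, RobertaForMaskedLM

# Rebuild the architecture from the numbers listed above (illustrative only).
config = RobertaConfig(
    vocab_size=8_000,
    hidden_size=256,
    num_hidden_layers=6,
    num_attention_heads=4,
    intermediate_size=1_024,
    max_position_embeddings=64 + 2,  # assumed: 64 usable positions + RoBERTa padding offset
    type_vocab_size=1,               # assumed: effectively no token type embeddings
)
model = RobertaForMaskedLM(config)
print(model.num_parameters())        # ~6.88M with the assumptions above (tied embeddings)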

Tokenizer

  • Type: Byte-Pair Encoding (BPE)
  • Vocabulary size: 8,000 tokens
  • Special tokens: [PAD], [UNK], [CLS], [SEP], [MASK]
  • Normalization: NFD Unicode + accent stripping
  • Case: Preserved (for proper nouns and code-switched English)
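
A tokenizer with these settings could be built with the tokenizers library roughly as follows (an illustrative sketch, not the released training script; the corpus filename is a placeholder):

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

# NFD Unicode normalization + accent stripping; no lowercasing, so case is preserved.
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Sequence([normalizers.NFD(), normalizers.StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8_000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["nagamese_corpus.txt"], trainer=trainer)  # placeholder path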

Training Data

  • Corpus size: 42,552 Nagamese sentences
  • Average length: 11.82 tokens/sentence
  • Split: 90% train (38,296) / 10% validation (4,256)
  • Sources: Web, social media, community contributions (deduplicated)

Pre-training

  • Objective: Masked Language Modeling (15% masking)
  • Optimizer: AdamW (lr=5e-4, weight_decay=0.01)
  • Batch size: 64
  • Epochs: 50
  • Training time: ~35 minutes
  • Hardware: NVIDIA A40 (48GB)
  • Final validation loss: 2.79
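
The masked-language-modelling setup above maps onto the standard transformers Trainer roughly as follows. This is a sketch under the stated hyperparameters, not the released script; the model, tokenizer, and tokenized splits are assumed to be prepared beforehand.

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Dynamic masking at the stated 15% rate.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,            # the 8K BPE tokenizer described above
    mlm=True,
    mlm_probability=0.15,
)

training_args = TrainingArguments(
    output_dir="./nagamesebert-pretrain",
    num_train_epochs=50,
    per_device_train_batch_size=64,
    learning_rate=5e-4,             # AdamW is the Trainer default
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,                    # e.g. the RobertaForMaskedLM sketched earlier
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_split,      # tokenized 90% split (assumed prepared)
    eval_dataset=val_split,         # tokenized 10% split (assumed prepared)
)
trainer.train()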

Usage

Load Model and Tokenizer

from transformers import AutoTokenizer, AutoModel

model_name = "MWirelabs/nagamesebert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Example usage
text = "Toi moi laga sathi hobo pare?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
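
The base model returns contextual token embeddings only. One common way to derive a single sentence vector from the outputs above is attention-mask-aware mean pooling (a generic recipe, not something prescribed by the model):

import torch

# Mean-pool the token embeddings, ignoring padding positions.
with torch.no_grad():
    outputs = model(**inputs)
mask = inputs["attention_mask"].unsqueeze(-1).float()
sentence_embedding = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 256])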

Fine-tuning for Token Classification

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

# Load model with classification head
model = AutoModelForTokenClassification.from_pretrained(
    "MWirelabs/nagamesebert",
    num_labels=num_labels
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=100,
    per_device_train_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)
trainer.train()
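
The train_dataset and eval_dataset above are assumed to be tokenized with word-level labels aligned to BPE sub-tokens. A hypothetical helper for that step (it relies on the fast tokenizer's word_ids(), and is not part of the release) could look like this:

def tokenize_and_align_labels(example, tokenizer, max_length=64):
    # example = {"tokens": [... words ...], "labels": [... one tag id per word ...]}
    encoding = tokenizer(
        example["tokens"],
        is_split_into_words=True,
        truncation=True,
        max_length=max_length,
    )
    aligned = []
    previous_word_id = None
    for word_id in encoding.word_ids():
        if word_id is None:
            aligned.append(-100)                      # special tokens: ignored by the loss
        elif word_id != previous_word_id:
            aligned.append(example["labels"][word_id])
        else:
            aligned.append(-100)                      # label only the first sub-token of a word
        previous_word_id = word_id
    encoding["labels"] = aligned
    return encoding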

Evaluation

Dataset

  • Source: NagaNLP Annotated Corpus
  • Total: 214 sentences
  • Split (seed=42): 171 train / 21 dev / 22 test (80/10/10)
  • POS tags: 13 Universal Dependencies tags
  • NER tags: 4 entity types (PER, LOC, ORG, MISC) in IOB2 format
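
In IOB2 format the four entity types expand to nine NER labels (including the outside tag), which is the num_labels value a token-classification head would use for NER:

# Four entity types in IOB2 format give nine NER labels.
NER_LABELS = [
    "O",
    "B-PER", "I-PER",
    "B-LOC", "I-LOC",
    "B-ORG", "I-ORG",
    "B-MISC", "I-MISC",
]
num_labels = len(NER_LABELS)  # 9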

Experimental Setup

  • Seeds: 42, 123, 456 (n=3 for variance estimation)
  • Batch size: 32
  • Learning rate: 3e-5
  • Epochs: 100
  • Optimization: AdamW with 100 warmup steps
  • Hardware: NVIDIA A40
  • Metrics: Token-level accuracy and macro-averaged F1
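
The reported token-level accuracy and macro-averaged F1 can be computed with a compute_metrics function along these lines (an illustrative sketch using scikit-learn, not the released evaluation script):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    keep = labels != -100                       # drop special / continuation sub-tokens
    y_true, y_pred = labels[keep], predictions[keep]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }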

Data Leakage Statement: All splits were created with a fixed seed (42), and there is no sentence overlap between the train/dev/test sets.


Limitations

  • Corpus size: 42K sentences is modest; expansion to 100K+ could improve performance
  • Evaluation scale: Small test set (22 sentences) limits statistical power
  • Task scope: Only evaluated on token classification; needs broader task assessment
  • Efficiency metrics: No quantitative inference benchmarks (latency, memory) yet provided
  • Data documentation: Complete data provenance and licenses to be formalized

Citation

If you use NagameseBERT in your research, please cite:

@misc{nagamesebert2025,
  title={Bootstrapping BERT for Nagamese: A Low-Resource Creole Language},
  author={MWire Labs},
  year={2025},
  url={https://huggingface.co/MWirelabs/nagamesebert}
}

Contact

MWire Labs
Shillong, Meghalaya, India
Website: MWire Labs


License

This model is released under Creative Commons Attribution 4.0 International (CC BY 4.0).

You are free to:

  • Share — copy and redistribute the material
  • Adapt — remix, transform, and build upon the material

Under the following terms:

  • Attribution — You must give appropriate credit to MWire Labs

Acknowledgments

We thank the Nagamese-speaking community for their contributions to corpus development and validation.
