Azerbaijani Named Entity Recognition with XLM-RoBERTa Large

A fine-tuned version of xlm-roberta-large for Named Entity Recognition (NER) on Azerbaijani text. The model recognizes 12 entity types, including persons, locations, organizations, and dates.

Hugging Face: IsmatS/xlm_roberta_large_az_ner

Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "IsmatS/xlm_roberta_large_az_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']:15} {e['word']:25} ({e['score']:.2f})")
```

Output:

```
PERSON          Shahla Khuduyeva          (0.97)
ORGANISATION    Pasha Sığorta             (0.95)
```

Model Details

| Property | Value |
|----------|-------|
| Base Model | xlm-roberta-large (355M parameters) |
| Task | Named Entity Recognition (NER) |
| Language | Azerbaijani (az) |
| Dataset | LocalDoc/azerbaijani-ner-dataset |
| License | Apache 2.0 |

Supported Entity Types

| Entity | Description | Example |
|--------|-------------|---------|
| PERSON | Person names | İlham Əliyev |
| LOCATION | Geographic locations | Bakı, Azərbaycan |
| ORGANISATION | Companies, institutions | SOCAR, Bakı Dövlət Universiteti |
| DATE | Dates and periods | 2024-cü il, sentyabr |
| TIME | Time expressions | səhər saat 9:00 |
| MONEY | Monetary values | 150 manat |
| PERCENTAGE | Percentage values | 18% |
| FACILITY | Buildings, landmarks | Heydər Əliyev Mərkəzi |
| PRODUCT | Products and items | - |
| EVENT | Events | - |
| LAW | Legal documents | - |
| ART | Artworks | - |

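Internally, token-classification models predict IOB2 tags rather than bare entity names: each type gets a `B-` (beginning) and `I-` (inside) variant, plus a single `O` for non-entity tokens. A minimal sketch of the tag set implied by the 12 types above (an illustration only; the authoritative mapping is `model.config.id2label` in the released `config.json`):

```python
# The 12 entity types from the table above.
ENTITY_TYPES = [
    "PERSON", "LOCATION", "ORGANISATION", "DATE", "TIME", "MONEY",
    "PERCENTAGE", "FACILITY", "PRODUCT", "EVENT", "LAW", "ART",
]

# IOB2 scheme: "O" plus B-/I- variants of every type.
labels = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]

print(len(labels))  # 12 types x 2 prefixes + "O" = 25 labels
```

Note that 12 types under IOB2 yield exactly 25 labels, matching the 25 categories the dataset section mentions below.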
Performance

Best Checkpoint (Epoch 6)

| Metric | Value |
|--------|-------|
| Precision | 0.7831 |
| Recall | 0.7284 |
| F1 | 0.7548 |

Training History

| Epoch | Train Loss | Val Loss | Precision | Recall | F1 |
|-------|------------|----------|-----------|--------|-----|
| 1 | 0.4075 | 0.2538 | 0.7689 | 0.7214 | 0.7444 |
| 2 | 0.2556 | 0.2497 | 0.7835 | 0.7245 | 0.7528 |
| 3 | 0.2144 | 0.2488 | 0.7509 | 0.7489 | 0.7499 |
| 4 | 0.1934 | 0.2571 | 0.7686 | 0.7404 | 0.7542 |
| 5 | 0.1698 | 0.2757 | 0.7458 | 0.7537 | 0.7497 |
| 6 | 0.1526 | 0.2881 | 0.7831 | 0.7284 | 0.7548 |
| 7 | 0.1443 | 0.3034 | 0.7585 | 0.7381 | 0.7481 |

Entity-Level Metrics (Epoch 7)

| Entity | Precision | Recall | F1 | Support |
|--------|-----------|--------|-----|---------|
| ART | 0.41 | 0.19 | 0.26 | 1828 |
| DATE | 0.53 | 0.49 | 0.51 | 834 |
| EVENT | 0.67 | 0.51 | 0.58 | 63 |
| FACILITY | 0.74 | 0.68 | 0.71 | 1134 |
| LAW | 0.62 | 0.58 | 0.60 | 1066 |
| LOCATION | 0.81 | 0.79 | 0.80 | 8795 |
| MONEY | 0.59 | 0.56 | 0.58 | 555 |
| ORGANISATION | 0.70 | 0.69 | 0.70 | 554 |
| PERCENTAGE | 0.80 | 0.82 | 0.81 | 3502 |
| PERSON | 0.90 | 0.82 | 0.86 | 7007 |
| PRODUCT | 0.83 | 0.84 | 0.84 | 2624 |
| TIME | 0.60 | 0.53 | 0.57 | 1584 |

Overall: Micro Avg F1 = 0.75 | Weighted Avg F1 = 0.74
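The weighted average can be sanity-checked from the per-entity table: each F1 score is weighted by its support. Recomputing from the two-decimal values in the table gives roughly 0.748, which is consistent with the reported 0.74 once rounding of the per-entity scores is accounted for:

```python
# Per-entity (F1, support) pairs copied from the epoch-7 table above.
scores = {
    "ART": (0.26, 1828), "DATE": (0.51, 834), "EVENT": (0.58, 63),
    "FACILITY": (0.71, 1134), "LAW": (0.60, 1066), "LOCATION": (0.80, 8795),
    "MONEY": (0.58, 555), "ORGANISATION": (0.70, 554),
    "PERCENTAGE": (0.81, 3502), "PERSON": (0.86, 7007),
    "PRODUCT": (0.84, 2624), "TIME": (0.57, 1584),
}

total = sum(support for _, support in scores.values())       # 29546 entities
weighted_f1 = sum(f1 * s for f1, s in scores.values()) / total

print(f"{weighted_f1:.3f}")  # ~0.748 from the rounded table values
```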

Training Configuration

```python
from transformers import TrainingArguments

TrainingArguments(
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=12,
    weight_decay=0.005,
    fp16=True,
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)
```
  • Optimizer: AdamW
  • Early stopping: patience=5 on F1
  • Infrastructure: Google Colab A100 GPU
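The bullets above wire into the `Trainer` API roughly as follows. This is a sketch, not the released training script: the `model`, dataset, and `compute_metrics` names are placeholders, and the per-epoch evaluation/save strategies are assumptions required by `load_best_model_at_end`:

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="xlm_roberta_large_az_ner",
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=12,
    weight_decay=0.005,
    fp16=True,
    eval_strategy="epoch",          # assumed: evaluate once per epoch
    save_strategy="epoch",          # assumed: must match for best-model loading
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,                    # placeholder: AutoModelForTokenClassification
    args=args,
    train_dataset=train_ds,         # placeholder datasets
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,  # placeholder: seqeval-style P/R/F1
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```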

Usage

Installation

```shell
pip install transformers torch
```

Inference

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

model_name = "IsmatS/xlm_roberta_large_az_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = 0 if torch.cuda.is_available() else -1
ner = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple", device=device)

# Single text
text = "Bakı şəhərində Azərbaycan Respublikasının prezidenti İlham Əliyev."
results = ner(text)
for e in results:
    print(f"[{e['entity_group']}] {e['word']} (score: {e['score']:.3f})")
```

Batch Inference

```python
texts = [
    "Bakı şəhərində İlham Əliyev çıxış etdi.",
    "SOCAR şirkəti 2024-cü ildə rekord gəlir əldə etdi.",
    "Heydər Əliyev Beynəlxalq Hava Limanı yeni terminalı açıldı.",
]

results = ner(texts)
for text, entities in zip(texts, results):
    print(f"\nText: {text}")
    for e in entities:
        print(f"  [{e['entity_group']}] {e['word']}")
```

Dataset

Trained on LocalDoc/azerbaijani-ner-dataset with 25 entity categories annotated in IOB2 format.

```python
from datasets import load_dataset

dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
```
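Because XLM-RoBERTa tokenizes into subwords, the dataset's word-level IOB2 tags must be aligned to subword tokens before training: special tokens and continuation pieces receive the ignore index -100, which PyTorch's cross-entropy loss skips. A self-contained sketch of that alignment step (the `word_ids` list mimics what a fast tokenizer returns; the helper name is ours, not part of the released script):

```python
IGNORE_INDEX = -100  # ignored by PyTorch cross-entropy loss

def align_labels(word_ids, word_labels):
    """Map word-level IOB2 label ids onto subword tokens.

    word_ids: per-token word index from a fast tokenizer,
              None for special tokens like <s> and </s>.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:                  # special token
            aligned.append(IGNORE_INDEX)
        elif wid != previous:            # first subword of a word: keep the label
            aligned.append(word_labels[wid])
        else:                            # continuation subword: ignore
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned

# Two words with labels [1, 0]; the second word splits into two subword pieces,
# and the sequence is wrapped in <s> ... </s> (the None entries).
word_ids = [None, 0, 1, 1, None]
print(align_labels(word_ids, [1, 0]))  # [-100, 1, 0, -100, -100]
```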

Model Comparison

| Model | F1 | Parameters |
|-------|-----|------------|
| mBERT Azerbaijani NER | 0.677 | 180M |
| XLM-RoBERTa Base Azerbaijani NER | 0.752 | 125M |
| XLM-RoBERTa Large Azerbaijani NER (this model) | 0.755 | 355M |
| Azeri-Turkish BERT NER | 0.736 | 110M |

Files

```
xlm_roberta_large_az_ner/
├── README.md                 # This file
├── config.json               # Model configuration
├── model-001.safetensors     # Model weights
├── sentencepiece.bpe.model   # SentencePiece tokenizer
├── special_tokens_map.json   # Special token mappings
├── tokenizer.json            # Tokenizer vocabulary
├── tokenizer_config.json     # Tokenizer configuration
├── xlm_roberta_large.ipynb   # Training notebook
└── xlm_roberta_large.py      # Training script
```

Citation

```bibtex
@misc{samadov2024xlm_large_az_ner,
  author = {Ismat Samadov},
  title = {XLM-RoBERTa Large Azerbaijani NER},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/IsmatS/xlm_roberta_large_az_ner}
}
```

License

Apache 2.0 — see LICENSE for details.
