Part of the Azerbaijani NLP Suite collection: a complete NLP toolkit for Azerbaijani with 4 benchmarked NER models, a GPT language model, and live demos.
A fine-tuned version of xlm-roberta-large for Named Entity Recognition (NER) on Azerbaijani text. It recognizes 12 entity types, including persons, locations, organizations, and dates.
Hugging Face: IsmatS/xlm_roberta_large_az_ner
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "IsmatS/xlm_roberta_large_az_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']:15} {e['word']:25} ({e['score']:.2f})")
```
Output:

```
PERSON          Shahla Khuduyeva          (0.97)
ORGANISATION    Pasha Sığorta             (0.95)
```
| Property | Value |
|---|---|
| Base Model | xlm-roberta-large (355M parameters) |
| Task | Named Entity Recognition (NER) |
| Language | Azerbaijani (az) |
| Dataset | LocalDoc/azerbaijani-ner-dataset |
| License | Apache 2.0 |
Supported entity types:

| Entity | Description | Example |
|---|---|---|
| PERSON | Person names | İlham Əliyev |
| LOCATION | Geographic locations | Bakı, Azərbaycan |
| ORGANISATION | Companies, institutions | SOCAR, Bakı Dövlət Universiteti |
| DATE | Dates and periods | 2024-cü il, sentyabr |
| TIME | Time expressions | səhər saat 9:00 |
| MONEY | Monetary values | 150 manat |
| PERCENTAGE | Percentage values | 18% |
| FACILITY | Buildings, landmarks | Heydər Əliyev Mərkəzi |
| PRODUCT | Products and items | - |
| EVENT | Events | - |
| LAW | Legal documents | - |
| ART | Artworks | - |
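Since the pipeline returns a flat list of entity dicts, it is often convenient to group results by type. The helper below is a small sketch (not part of the model or the transformers API); the sample list mirrors the structure that `aggregation_strategy="simple"` produces.

```python
# Group aggregated NER pipeline output by entity type.
# `sample` is hand-written to mirror the dicts the pipeline returns.
from collections import defaultdict

def group_by_type(entities):
    """Map each entity_group to the list of its surface forms."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e["entity_group"]].append(e["word"])
    return dict(grouped)

sample = [
    {"entity_group": "PERSON", "word": "Shahla Khuduyeva", "score": 0.97},
    {"entity_group": "ORGANISATION", "word": "Pasha Sığorta", "score": 0.95},
]
print(group_by_type(sample))
# {'PERSON': ['Shahla Khuduyeva'], 'ORGANISATION': ['Pasha Sığorta']}
```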
Overall evaluation metrics:

| Metric | Value |
|---|---|
| Precision | 0.7831 |
| Recall | 0.7284 |
| F1 | 0.7548 |
Per-epoch training log; the reported metrics come from the epoch-6 checkpoint, which achieved the best validation F1:

| Epoch | Train Loss | Val Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | 0.4075 | 0.2538 | 0.7689 | 0.7214 | 0.7444 |
| 2 | 0.2556 | 0.2497 | 0.7835 | 0.7245 | 0.7528 |
| 3 | 0.2144 | 0.2488 | 0.7509 | 0.7489 | 0.7499 |
| 4 | 0.1934 | 0.2571 | 0.7686 | 0.7404 | 0.7542 |
| 5 | 0.1698 | 0.2757 | 0.7458 | 0.7537 | 0.7497 |
| 6 | 0.1526 | 0.2881 | 0.7831 | 0.7284 | 0.7548 |
| 7 | 0.1443 | 0.3034 | 0.7585 | 0.7381 | 0.7481 |
Per-entity performance:

| Entity | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ART | 0.41 | 0.19 | 0.26 | 1828 |
| DATE | 0.53 | 0.49 | 0.51 | 834 |
| EVENT | 0.67 | 0.51 | 0.58 | 63 |
| FACILITY | 0.74 | 0.68 | 0.71 | 1134 |
| LAW | 0.62 | 0.58 | 0.60 | 1066 |
| LOCATION | 0.81 | 0.79 | 0.80 | 8795 |
| MONEY | 0.59 | 0.56 | 0.58 | 555 |
| ORGANISATION | 0.70 | 0.69 | 0.70 | 554 |
| PERCENTAGE | 0.80 | 0.82 | 0.81 | 3502 |
| PERSON | 0.90 | 0.82 | 0.86 | 7007 |
| PRODUCT | 0.83 | 0.84 | 0.84 | 2624 |
| TIME | 0.60 | 0.53 | 0.57 | 1584 |
Overall: Micro Avg F1 = 0.75 | Weighted Avg F1 = 0.74
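The weighted average is each entity's F1 weighted by its support. The sketch below recomputes it from the (two-decimal) table values; with rounded inputs it lands near 0.748, consistent with the reported 0.74–0.75 range computed from unrounded scores.

```python
# Recompute the weighted-average F1 from the per-entity table above.
# weighted_f1 = sum(F1_i * support_i) / sum(support_i)
per_entity = {  # entity: (F1, support), copied from the table
    "ART": (0.26, 1828), "DATE": (0.51, 834), "EVENT": (0.58, 63),
    "FACILITY": (0.71, 1134), "LAW": (0.60, 1066), "LOCATION": (0.80, 8795),
    "MONEY": (0.58, 555), "ORGANISATION": (0.70, 554),
    "PERCENTAGE": (0.81, 3502), "PERSON": (0.86, 7007),
    "PRODUCT": (0.84, 2624), "TIME": (0.57, 1584),
}
total_support = sum(s for _, s in per_entity.values())
weighted_f1 = sum(f * s for f, s in per_entity.values()) / total_support
print(f"{weighted_f1:.3f}")  # ~0.748 with the rounded table values
```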
```python
TrainingArguments(
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=12,
    weight_decay=0.005,
    fp16=True,
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)
```
```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

model_name = "IsmatS/xlm_roberta_large_az_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = 0 if torch.cuda.is_available() else -1
ner = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple", device=device)

# Single text
text = "Bakı şəhərində Azərbaycan Respublikasının prezidenti İlham Əliyev."
results = ner(text)
for e in results:
    print(f"[{e['entity_group']}] {e['word']} (score: {e['score']:.3f})")

# Batch of texts
texts = [
    "Bakı şəhərində İlham Əliyev çıxış etdi.",
    "SOCAR şirkəti 2024-cü ildə rekord gəlir əldə etdi.",
    "Heydər Əliyev Beynəlxalq Hava Limanı yeni terminalı açıldı.",
]
results = ner(texts)
for text, entities in zip(texts, results):
    print(f"\nText: {text}")
    for e in entities:
        print(f"  [{e['entity_group']}] {e['word']}")
```
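With `aggregation_strategy="simple"`, each returned entity also carries character offsets (`start`/`end`), which makes inline highlighting easy. The helper below is a sketch, not part of the model's API; the entity list is hand-written to mirror pipeline output for the first batch sentence.

```python
# Annotate a text inline with bracketed entity tags, using the
# character offsets ("start"/"end") the NER pipeline returns.
def annotate(text, entities):
    """Insert [TYPE: ...] markers around each entity span."""
    out, last = [], 0
    for e in sorted(entities, key=lambda e: e["start"]):
        out.append(text[last:e["start"]])
        out.append(f"[{e['entity_group']}: {text[e['start']:e['end']]}]")
        last = e["end"]
    out.append(text[last:])
    return "".join(out)

text = "Bakı şəhərində İlham Əliyev çıxış etdi."
entities = [  # hand-written, mimicking pipeline output
    {"entity_group": "LOCATION", "start": 0, "end": 4},
    {"entity_group": "PERSON", "start": 15, "end": 27},
]
print(annotate(text, entities))
# [LOCATION: Bakı] şəhərində [PERSON: İlham Əliyev] çıxış etdi.
```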
Trained on LocalDoc/azerbaijani-ner-dataset, which annotates 25 entity categories in IOB2 format (this model recognizes the 12 types listed above).
```python
from datasets import load_dataset

dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
```
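When fine-tuning on IOB2 word-level tags, the tags must be aligned to subword tokens: typically only the first subword of each word keeps the label and the rest are masked with `-100` so the loss ignores them. The sketch below is tokenizer-independent; it assumes a `word_ids` list of the kind fast Hugging Face tokenizers expose via `encoding.word_ids()` (`None` for special tokens).

```python
# Minimal sketch of IOB2 label alignment for subword tokenization.
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Label the first subword of each word; mask the rest with -100."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token, e.g. <s> / </s>
            aligned.append(ignore_index)
        elif wid != prev:          # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation subword
            aligned.append(ignore_index)
        prev = wid
    return aligned

# "Bakı şəhərində" -> B-LOCATION, O; suppose word 1 splits into two subwords.
labels = ["B-LOCATION", "O"]
word_ids = [None, 0, 1, 1, None]
print(align_labels(labels, word_ids))
# [-100, 'B-LOCATION', 'O', -100, -100]
```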
Comparison with related Azerbaijani NER models:

| Model | F1 | Parameters |
|---|---|---|
| mBERT Azerbaijani NER | 0.677 | 180M |
| XLM-RoBERTa Base Azerbaijani NER | 0.752 | 125M |
| XLM-RoBERTa Large Azerbaijani NER (this model) | 0.755 | 355M |
| Azeri-Turkish BERT NER | 0.736 | 110M |
```
xlm_roberta_large_az_ner/
├── README.md                  # This file
├── config.json                # Model configuration
├── model-001.safetensors      # Model weights
├── sentencepiece.bpe.model    # SentencePiece tokenizer
├── special_tokens_map.json    # Special token mappings
├── tokenizer.json             # Tokenizer vocabulary
├── tokenizer_config.json      # Tokenizer configuration
├── xlm_roberta_large.ipynb    # Training notebook
└── xlm_roberta_large.py       # Training script
```
```bibtex
@misc{samadov2024xlm_large_az_ner,
  author    = {Ismat Samadov},
  title     = {XLM-RoBERTa Large Azerbaijani NER},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/IsmatS/xlm_roberta_large_az_ner}
}
```
License: Apache 2.0 (see LICENSE for details).

Base model: FacebookAI/xlm-roberta-large