Part of the Azerbaijani NLP Suite collection: a complete NLP toolkit for Azerbaijani with 4 benchmarked NER models, a GPT language model, and live demos.
A fine-tuned version of xlm-roberta-large for Named Entity Recognition (NER) on Azerbaijani text. It recognizes 12 entity types, including persons, locations, organizations, and dates.
Hugging Face: IsmatS/xlm_roberta_large_az_ner
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "IsmatS/xlm_roberta_large_az_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
entities = ner(text)
for e in entities:
    print(f"{e['entity_group']:15} {e['word']:25} ({e['score']:.2f})")
```
Output:

```
PERSON          Shahla Khuduyeva          (0.97)
ORGANISATION    Pasha Sığorta             (0.95)
```
| Property | Value |
|---|---|
| Base Model | xlm-roberta-large (355M parameters) |
| Task | Named Entity Recognition (NER) |
| Language | Azerbaijani (az) |
| Dataset | LocalDoc/azerbaijani-ner-dataset |
| License | Apache 2.0 |
Supported entity types:

| Entity | Description | Example |
|---|---|---|
| PERSON | Person names | İlham Əliyev |
| LOCATION | Geographic locations | Bakı, Azərbaycan |
| ORGANISATION | Companies, institutions | SOCAR, Bakı Dövlət Universiteti |
| DATE | Dates and periods | 2024-cü il, sentyabr |
| TIME | Time expressions | səhər saat 9:00 |
| MONEY | Monetary values | 150 manat |
| PERCENTAGE | Percentage values | 18% |
| FACILITY | Buildings, landmarks | Heydər Əliyev Mərkəzi |
| PRODUCT | Products and items | - |
| EVENT | Events | - |
| LAW | Legal documents | - |
| ART | Artworks | - |
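Since the pipeline returns a flat list of entity dicts, it is often convenient to group results by type. The helper below is a small sketch (not part of the model or the transformers API); the sample list mirrors the structure that `aggregation_strategy="simple"` produces.

```python
# Group aggregated NER pipeline output by entity type.
# `sample` is hand-written to mirror the dicts the pipeline returns.
from collections import defaultdict

def group_by_type(entities):
    """Map each entity_group to the list of its surface forms."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e["entity_group"]].append(e["word"])
    return dict(grouped)

sample = [
    {"entity_group": "PERSON", "word": "Shahla Khuduyeva", "score": 0.97},
    {"entity_group": "ORGANISATION", "word": "Pasha Sığorta", "score": 0.95},
]
print(group_by_type(sample))
# {'PERSON': ['Shahla Khuduyeva'], 'ORGANISATION': ['Pasha Sığorta']}
```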
Overall evaluation metrics:

| Metric | Value |
|---|---|
| Precision | 0.7831 |
| Recall | 0.7284 |
| F1 | 0.7548 |
Per-epoch training log; the reported metrics come from the epoch-6 checkpoint, which achieved the best validation F1:

| Epoch | Train Loss | Val Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 1 | 0.4075 | 0.2538 | 0.7689 | 0.7214 | 0.7444 |
| 2 | 0.2556 | 0.2497 | 0.7835 | 0.7245 | 0.7528 |
| 3 | 0.2144 | 0.2488 | 0.7509 | 0.7489 | 0.7499 |
| 4 | 0.1934 | 0.2571 | 0.7686 | 0.7404 | 0.7542 |
| 5 | 0.1698 | 0.2757 | 0.7458 | 0.7537 | 0.7497 |
| 6 | 0.1526 | 0.2881 | 0.7831 | 0.7284 | 0.7548 |
| 7 | 0.1443 | 0.3034 | 0.7585 | 0.7381 | 0.7481 |
Per-entity performance:

| Entity | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| ART | 0.41 | 0.19 | 0.26 | 1828 |
| DATE | 0.53 | 0.49 | 0.51 | 834 |
| EVENT | 0.67 | 0.51 | 0.58 | 63 |
| FACILITY | 0.74 | 0.68 | 0.71 | 1134 |
| LAW | 0.62 | 0.58 | 0.60 | 1066 |
| LOCATION | 0.81 | 0.79 | 0.80 | 8795 |
| MONEY | 0.59 | 0.56 | 0.58 | 555 |
| ORGANISATION | 0.70 | 0.69 | 0.70 | 554 |
| PERCENTAGE | 0.80 | 0.82 | 0.81 | 3502 |
| PERSON | 0.90 | 0.82 | 0.86 | 7007 |
| PRODUCT | 0.83 | 0.84 | 0.84 | 2624 |
| TIME | 0.60 | 0.53 | 0.57 | 1584 |
Overall: Micro Avg F1 = 0.75 | Weighted Avg F1 = 0.74
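The weighted average is each entity's F1 weighted by its support. The sketch below recomputes it from the (two-decimal) table values; with rounded inputs it lands near 0.748, consistent with the reported 0.74–0.75 range computed from unrounded scores.

```python
# Recompute the weighted-average F1 from the per-entity table above.
# weighted_f1 = sum(F1_i * support_i) / sum(support_i)
per_entity = {  # entity: (F1, support), copied from the table
    "ART": (0.26, 1828), "DATE": (0.51, 834), "EVENT": (0.58, 63),
    "FACILITY": (0.71, 1134), "LAW": (0.60, 1066), "LOCATION": (0.80, 8795),
    "MONEY": (0.58, 555), "ORGANISATION": (0.70, 554),
    "PERCENTAGE": (0.81, 3502), "PERSON": (0.86, 7007),
    "PRODUCT": (0.84, 2624), "TIME": (0.57, 1584),
}
total_support = sum(s for _, s in per_entity.values())
weighted_f1 = sum(f * s for f, s in per_entity.values()) / total_support
print(f"{weighted_f1:.3f}")  # ~0.748 with the rounded table values
```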
```python
TrainingArguments(
    learning_rate=2e-5,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    num_train_epochs=12,
    weight_decay=0.005,
    fp16=True,
    metric_for_best_model="f1",
    load_best_model_at_end=True,
)
```
```bash
pip install transformers torch
```
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

model_name = "IsmatS/xlm_roberta_large_az_ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

device = 0 if torch.cuda.is_available() else -1
ner = pipeline("ner", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple", device=device)

# Single text
text = "Bakı şəhərində Azərbaycan Respublikasının prezidenti İlham Əliyev."
results = ner(text)
for e in results:
    print(f"[{e['entity_group']}] {e['word']} (score: {e['score']:.3f})")

# Batch of texts
texts = [
    "Bakı şəhərində İlham Əliyev çıxış etdi.",
    "SOCAR şirkəti 2024-cü ildə rekord gəlir əldə etdi.",
    "Heydər Əliyev Beynəlxalq Hava Limanı yeni terminalı açıldı.",
]
results = ner(texts)
for text, entities in zip(texts, results):
    print(f"\nText: {text}")
    for e in entities:
        print(f"  [{e['entity_group']}] {e['word']}")
```
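With `aggregation_strategy="simple"`, each returned entity also carries character offsets (`start`/`end`), which makes inline highlighting easy. The helper below is a sketch, not part of the model's API; the entity list is hand-written to mirror pipeline output for the first batch sentence.

```python
# Annotate a text inline with bracketed entity tags, using the
# character offsets ("start"/"end") the NER pipeline returns.
def annotate(text, entities):
    """Insert [TYPE: ...] markers around each entity span."""
    out, last = [], 0
    for e in sorted(entities, key=lambda e: e["start"]):
        out.append(text[last:e["start"]])
        out.append(f"[{e['entity_group']}: {text[e['start']:e['end']]}]")
        last = e["end"]
    out.append(text[last:])
    return "".join(out)

text = "Bakı şəhərində İlham Əliyev çıxış etdi."
entities = [  # hand-written, mimicking pipeline output
    {"entity_group": "LOCATION", "start": 0, "end": 4},
    {"entity_group": "PERSON", "start": 15, "end": 27},
]
print(annotate(text, entities))
# [LOCATION: Bakı] şəhərində [PERSON: İlham Əliyev] çıxış etdi.
```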
Trained on LocalDoc/azerbaijani-ner-dataset, which annotates 25 entity categories in IOB2 format (this model recognizes the 12 types listed above).
```python
from datasets import load_dataset

dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
```
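When fine-tuning on IOB2 word-level tags, the tags must be aligned to subword tokens: typically only the first subword of each word keeps the label and the rest are masked with `-100` so the loss ignores them. The sketch below is tokenizer-independent; it assumes a `word_ids` list of the kind fast Hugging Face tokenizers expose via `encoding.word_ids()` (`None` for special tokens).

```python
# Minimal sketch of IOB2 label alignment for subword tokenization.
def align_labels(word_labels, word_ids, ignore_index=-100):
    """Label the first subword of each word; mask the rest with -100."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token, e.g. <s> / </s>
            aligned.append(ignore_index)
        elif wid != prev:          # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation subword
            aligned.append(ignore_index)
        prev = wid
    return aligned

# "Bakı şəhərində" -> B-LOCATION, O; suppose word 1 splits into two subwords.
labels = ["B-LOCATION", "O"]
word_ids = [None, 0, 1, 1, None]
print(align_labels(labels, word_ids))
# [-100, 'B-LOCATION', 'O', -100, -100]
```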
Comparison with related Azerbaijani NER models:

| Model | F1 | Parameters |
|---|---|---|
| mBERT Azerbaijani NER | 0.677 | 180M |
| XLM-RoBERTa Base Azerbaijani NER | 0.752 | 125M |
| XLM-RoBERTa Large Azerbaijani NER (this model) | 0.755 | 355M |
| Azeri-Turkish BERT NER | 0.736 | 110M |
```
xlm_roberta_large_az_ner/
├── README.md                  # This file
├── config.json                # Model configuration
├── model-001.safetensors      # Model weights
├── sentencepiece.bpe.model    # SentencePiece tokenizer
├── special_tokens_map.json    # Special token mappings
├── tokenizer.json             # Tokenizer vocabulary
├── tokenizer_config.json      # Tokenizer configuration
├── xlm_roberta_large.ipynb    # Training notebook
└── xlm_roberta_large.py       # Training script
```
```bibtex
@misc{samadov2024xlm_large_az_ner,
  author    = {Ismat Samadov},
  title     = {XLM-RoBERTa Large Azerbaijani NER},
  year      = {2024},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/IsmatS/xlm_roberta_large_az_ner}
}
```
License: Apache 2.0 (see LICENSE for details).

Base model: FacebookAI/xlm-roberta-large