Azeri Handwriting Detection Dataset
Overview
This dataset contains 12 handwritten Azerbaijani document samples with transcriptions, representing diverse real-world document types. The dataset serves as a pilot collection for developing and validating the Azeri Handwriting Recognition (HTR) system.
Dataset Statistics:
- Total Documents: 12
- Total Lines: 80 (avg: 6.7 lines/document)
- Total Characters: 2,384 (avg: 198.7 chars/document)
- Language: Azerbaijani (Latin script)
- Format: HEIC images + TXT transcriptions
- Created: December 13, 2025
Directory Structure
data/
├── images/ # 12 HEIC image files
│ ├── 01.HEIC → az_formal_letter_01.txt
│ ├── 02.HEIC → az_handwritten_note_02.txt
│ ├── 03.HEIC → az_numeric_mixed_03.txt
│ ├── 04.HEIC → az_medical_form_04.txt
│ ├── 05.HEIC → az_utility_application_05.txt
│ ├── 06.HEIC → az_bank_statement_06.txt
│ ├── 07.HEIC → az_education_text_07.txt
│ ├── 08.HEIC → az_address_list_08.txt
│ ├── 09.HEIC → az_technical_report_09.txt
│ ├── 10.HEIC → az_contract_clause_10.txt
│ ├── 11.HEIC → az_daily_diary_11.txt
│ └── 12.HEIC → az_tabular_text_12.txt
├── labels/ # 12 transcription files
│ └── az_*_##.txt
└── README.md # This file
Naming Convention: The number at the end of each label filename (e.g., _01, _12) corresponds exactly to the image number (e.g., 01.HEIC, 12.HEIC).
Document Types
The dataset contains 12 diverse document types representing real-world Azerbaijani documents:
| ID | Document Type | Lines | Chars | Description |
|---|---|---|---|---|
| 01 | Formal Letter | 9 | 196 | Official business letter to director about 2025 reports |
| 02 | Handwritten Note | 8 | 171 | Personal reminder about bank appointment |
| 03 | Numeric Mixed | 6 | 147 | Contract with numbers, dates, amounts (VAT calculation) |
| 04 | Medical Form | 8 | 215 | Patient form with diagnosis and prescriptions |
| 05 | Utility Application | 8 | 251 | Electricity complaint with meter reading |
| 06 | Bank Statement | 6 | 165 | Account transactions with debits/credits |
| 07 | Education Text | 6 | 240 | Constitutional text about education rights |
| 08 | Address List | 7 | 184 | Addresses from Baku, Ganja, Sumqayit cities |
| 09 | Technical Report | 4 | 216 | System performance analysis (CPU, disk) |
| 10 | Contract Clause | 8 | 257 | Legal contract clause text |
| 11 | Daily Diary | 6 | 184 | Personal diary entry about work project |
| 12 | Tabular Text | 4 | 158 | Employee table with names, ages, departments |
Complexity Levels
Simple (4-6 lines):
- Tabular text (12), Technical report (09), Numeric mixed (03)
- Short, structured content
Medium (6-8 lines):
- Bank statement (06), Education (07), Address list (08)
- Moderate length with mixed content
Complex (8-9 lines):
- Medical form (04), Handwritten note (02), Contract (10)
- Longer documents with varied formatting
Image Characteristics
Format: HEIC (High Efficiency Image Container)
- Codec: HEVC (H.265) - Apple's modern image format
- File Sizes: 820KB - 1.0MB per image (average: ~890KB)
- Total Size: ~10.5MB
Important: HEIC format requires conversion to PNG/JPG for PyTorch processing:
from PIL import Image
import pillow_heif
pillow_heif.register_heif_opener()
img = Image.open('01.HEIC').convert('L') # Convert to grayscale
img.save('01.png')
Label File Format
Format:
Line_number→Transcribed_text
Example (az_tabular_text_12.txt):
1→Adı | Yaşı | Şöbə
2→-------------------------
3→Rauf | 32 | Maliyyə
4→Aysel | 28 | İnsan resursları
5→Kamal | 41 | İT dəstəyi
Characteristics:
- Line numbers with leading spaces
- Arrow delimiter (
→) separates line number from text - Preserves original spacing and formatting
- Includes punctuation and special characters exactly as written
Azerbaijani Language Statistics
Character Distribution (Top 15)
Space: 262 occurrences (11% - word boundaries)
a: 154 (6.5%)
i: 137 (5.7%)
ə: 125 (5.2%) ← Azerbaijani-specific schwa
r: 98 (4.1%)
l: 91 (3.8%)
n: 90 (3.8%)
ı: 66 (2.8%) ← Azerbaijani dotless i
s: 66 (2.8%)
m: 57 (2.4%)
d: 51 (2.1%)
t: 49 (2.1%)
e: 42 (1.8%)
u: 39 (1.6%)
-: 35 (1.5%) ← Hyphenation
Azerbaijani-Specific Characters (Critical)
Lowercase:
ə: 125 (schwa - most common special character)
ı: 66 (dotless i)
ş: 27 (s with cedilla)
ü: 27 (u with diaeresis)
ğ: 13 (g with breve)
ö: 7 (o with diaeresis)
ç: 7 (c with cedilla)
Uppercase:
Ə: 4
İ: 3 (i with dot - Turkish/Azeri uppercase)
Ş: 2
Ü: 1
Ö: 1
Total Azerbaijani-specific characters: 263 (11% of all characters)
Key Insight: Azerbaijani diacritics (ə, ı, ş, ü, ğ, ö, ç) are essential and must be preserved by the HTR model.
Content Analysis
Character Breakdown
Letters (a-z, A-Z): ~1,550 (65%)
Azerbaijani chars: 263 (11%)
Spaces: 262 (11%)
Numbers: ~150 (6%)
Punctuation: ~160 (7%)
Numeric & Special Content
Numbers Present:
- Dates:
14.06.2024,01.02.2025,03.11.1987 - Amounts:
12 750.45 AZN,15 045.53 AZN,304.50 AZN - Percentages:
18%,67% - Phone numbers:
050-3456789 - Contract numbers:
№ 457/23 - Account numbers:
AZ21NABZ
Punctuation & Symbols:
- Hyphens (35) - word breaks, line wrapping
- Periods (34) - decimals, abbreviations
- Commas (19) - number separators
- Colons (15) - field labels
- Pipe symbols (|) - table formatting
Domain-Specific Vocabulary
The dataset contains rich domain terminology across multiple sectors:
Financial:
müqavilə(contract),məbləğ(amount),ƏDV(VAT)hesabat(report),hesab nömrəsi(account number)Maaş(salary),Komunal(utilities)
Medical:
pasiyent(patient),diaqnoz(diagnosis)baş ağrısı(headache),arterial hipertenziya(hypertension)dərman(medicine)
Legal/Formal:
Hörmətli(Dear/Honorable),direktor(director)Konstitusiya(Constitution),qanunvericilik(legislation)qarşılıqlı razılaşma(mutual agreement)
Technical:
sistem performans(system performance)CPU yüklənməsi(CPU load),disk oxunuş(disk read)
Personal/General:
xahiş edirəm(I request),bildiririk(we inform)ünvan(address),rayon(district)
Text Features & Challenges
Line Breaking & Hyphenation
Multiple documents show mid-word line breaks with hyphens:
hazır-lanması(prepared, split across lines)yaran-mış(arising)araşdırıl-masını(investigation)hiper-tenziya(hypertension)
Implication: HTR model needs line-level recognition, and post-processing must reconstruct hyphenated words.
Tabular Formatting
Document 12 contains table structure:
Adı | Yaşı | Şöbə
-------------------------
Rauf | 32 | Maliyyə
Aysel | 28 | İnsan resursları
Implication: Stage 1 (layout detection) is critical for preserving table structure.
Mixed Case Usage
- Proper nouns:
Bakı,Gəncə,Neftçilər prospekti - Abbreviations:
ƏDV,AZN,ATM,CPU,IT - Sentence case: Standard for regular text
Data Quality Assessment
Strengths
✅ Diverse document types - Covers real-world use cases across multiple domains
✅ Rich Azerbaijani vocabulary - Proper diacritics preserved throughout
✅ Mixed content - Text, numbers, tables, addresses
✅ Domain variety - Medical, legal, financial, technical, personal
✅ Proper formatting - Line-level transcriptions with structure preservation
✅ Clean transcriptions - Accurate character-level annotations
Limitations
⚠️ Small dataset size - Only 12 samples (insufficient for production training)
⚠️ No writer diversity info - Unknown if single/multiple writers
⚠️ HEIC format - Requires preprocessing for PyTorch
⚠️ No bounding boxes - Labels are page-level, not line-level
⚠️ No validation split - Need to define train/val/test splits
⚠️ No image metadata - Resolution, DPI, quality information missing
⚠️ Insufficient for LM training - Only 2.4K chars vs. 100K-1M recommended
Preprocessing Requirements
Before training, the following preprocessing steps are required:
1. Convert HEIC to PNG/JPG
import pillow_heif
from PIL import Image
import os
pillow_heif.register_heif_opener()
for heic_file in os.listdir('images/'):
if heic_file.endswith('.HEIC'):
img_path = os.path.join('images/', heic_file)
img = Image.open(img_path).convert('L') # Grayscale
png_path = img_path.replace('.HEIC', '.png')
img.save(png_path)
2. Line Segmentation
Extract bounding boxes for each line from page images:
- Use layout detection (YOLOv8) or manual annotation
- Create line-level image crops
- Map each line crop to its transcription
3. Create Vocabulary File
import json
# Extract all unique characters from labels
chars = set()
for label_file in label_files:
with open(label_file, 'r', encoding='utf-8') as f:
text = f.read()
# Remove line numbers and arrow delimiter
text = '→'.join(text.split('→')[1:]) if '→' in text else text
chars.update(text)
# Create vocabulary mapping
vocab = {char: idx for idx, char in enumerate(sorted(chars))}
vocab['[BLANK]'] = len(vocab) # CTC blank token
with open('vocab.json', 'w', encoding='utf-8') as f:
json.dump(vocab, f, ensure_ascii=False, indent=2)
4. Define Data Splits
Recommended split (document-wise to prevent data leakage):
train.txt: 01,02,03,04,05,06,07,08,09 (75% - 9 documents)
val.txt: 10,11 (17% - 2 documents)
test.txt: 12 (8% - 1 document)
Recommended Character Set
Based on dataset analysis, the vocabulary should include:
Latin Letters:
- Lowercase: a-z
- Uppercase: A-Z
Azerbaijani Characters:
- Lowercase: ə, ç, ğ, ı, ö, ş, ü
- Uppercase: Ə, Ç, Ğ, İ, Ö, Ş, Ü
Digits: 0-9
Punctuation & Symbols:
. , : ; - – — ( ) [ ] / |
" ' « » ? ! № % + =
Special:
- Space character
- CTC blank token
Estimated Vocabulary Size: ~100 characters
Usage Guidelines
For Model Training
Data Augmentation is Critical - With only 12 samples, heavy augmentation is mandatory:
- Rotation: ±3°
- Scaling: 0.9-1.1
- Elastic distortion
- Blur, noise, random erasing
- Synthetic overlays
Start with HTR-Lite - Use the lightweight model variant for proof-of-concept
Character-Level Tokenization - Recommended for Azerbaijani language
Preserve Diacritics - Critical for maintaining language integrity
For Dataset Expansion
Immediate Actions:
Collect More Data - Current 2.4K chars is far below recommended 100K-1M
- Photograph additional handwritten documents
- Use synthetic data generation
- Apply pseudo-labeling on unlabeled scans
Line-Level Annotation - Convert page-level to line-level:
- Extract individual line bounding boxes
- Crop and save as separate line images
- Create line-level transcriptions
Metadata Collection - Document:
- Writer information (for stratified splits)
- Image resolution and DPI
- Document quality scores
- Date of collection
Quality Control - Verify:
- Transcription accuracy
- Diacritic correctness
- Proper character encoding (UTF-8)
Integration with Architecture
Alignment with Planned System (plan.md)
| Planned Feature | Current Data Status |
|---|---|
| Character-level vocab | ✅ Azerbaijani chars present |
| Document-wise split | ⚠️ Not defined yet |
| Line-level images | ❌ Only page-level currently |
| Bounding boxes | ❌ Not annotated |
| Mixed content | ✅ Numbers, text, tables |
| Domain diversity | ✅ 12 document types |
| 100K+ tokens for LM | ❌ Only 2.4K chars |
Next Steps for Implementation
Preprocessing Pipeline:
- Convert HEIC → PNG
- Segment pages into lines
- Extract line bounding boxes
Dataset Preparation:
- Create train/val/test splits
- Generate vocabulary.json
- Build data loader with augmentation
Baseline Model:
- Train HTR-Lite on augmented data
- Evaluate on validation set
- Analyze error patterns
Data Expansion:
- Collect 100+ more documents
- Implement active learning loop
- Build Azerbaijani language model corpus
Example Label Samples
Document 01 - Formal Letter
Hörmətli cənab direktor,
Bu məktub vasitəsilə
bildiririk ki,
2025-ci il
üzrə hesabatların
hazırlanması başa çatmaq
üzrədir.
Document 04 - Medical Form
Pasiyentin adı, soyadı:
Əliyev Rəşad Kamran oğlu
Doğum tarixi: 03.11.1987
Şikayətlər: baş ağrısı, halsızlıq,
yuxusuzluq
Diaqnoz: arterial hipertenziya
Document 12 - Tabular Text
Adı | Yaşı | Şöbə
-------------------------
Rauf | 32 | Maliyyə
Aysel | 28 | İnsan resursları
Kamal | 41 | İT dəstəyi
References
- Project Plan: See
../plan.mdfor full architecture specification - Character Encoding: UTF-8
- Language: Azerbaijani (Latin script, ISO 639-1: az)
- Image Format: HEIC (requires conversion to PNG/JPG)
License & Usage
This dataset is collected for developing the Azeri Handwriting Detection system. Please ensure proper handling of any personal information that may appear in the documents.
Last Updated: December 13, 2025 Dataset Version: 1.0 (Pilot) Total Samples: 12 documents, 80 lines, 2,384 characters