Azeri Handwriting Detection Dataset

Overview

This dataset contains 12 handwritten Azerbaijani document samples with transcriptions, covering a range of real-world document types. It serves as a pilot collection for developing and validating the Azeri Handwritten Text Recognition (HTR) system.

Dataset Statistics:

Total Documents: 12
Total Lines: 80 (avg: 6.7 lines/document)
Total Characters: 2,384 (avg: 198.7 chars/document)
Language: Azerbaijani (Latin script)
Format: HEIC images + TXT transcriptions
Created: December 13, 2025

Directory Structure

data/
├── images/          # 12 HEIC image files
│   ├── 01.HEIC  →  az_formal_letter_01.txt
│   ├── 02.HEIC  →  az_handwritten_note_02.txt
│   ├── 03.HEIC  →  az_numeric_mixed_03.txt
│   ├── 04.HEIC  →  az_medical_form_04.txt
│   ├── 05.HEIC  →  az_utility_application_05.txt
│   ├── 06.HEIC  →  az_bank_statement_06.txt
│   ├── 07.HEIC  →  az_education_text_07.txt
│   ├── 08.HEIC  →  az_address_list_08.txt
│   ├── 09.HEIC  →  az_technical_report_09.txt
│   ├── 10.HEIC  →  az_contract_clause_10.txt
│   ├── 11.HEIC  →  az_daily_diary_11.txt
│   └── 12.HEIC  →  az_tabular_text_12.txt
├── labels/          # 12 transcription files
│   └── az_*_##.txt
└── README.md        # This file

Naming Convention: The number at the end of each label filename (e.g., _01, _12) corresponds exactly to the image number (e.g., 01.HEIC, 12.HEIC).
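
The naming convention above can be turned into a lookup from image name to label name. The following is a minimal sketch that works on plain file names; the function name `pair_images_with_labels` is hypothetical and not part of the dataset tooling.

```python
import re
from pathlib import PurePath


def pair_images_with_labels(image_names, label_names):
    """Pair image and label file names by their shared two-digit ID."""
    # Label names end in _NN.txt; the trailing number is the document ID.
    labels_by_id = {}
    for name in label_names:
        match = re.search(r"_(\d+)\.txt$", name)
        if match:
            labels_by_id[match.group(1)] = name

    pairs = {}
    for name in image_names:
        doc_id = PurePath(name).stem  # "01.HEIC" -> "01"
        if doc_id in labels_by_id:
            pairs[name] = labels_by_id[doc_id]
    return pairs


pairs = pair_images_with_labels(
    ["01.HEIC", "12.HEIC"],
    ["az_formal_letter_01.txt", "az_tabular_text_12.txt"],
)
```

Images without a matching label are silently left out of the result, so comparing `len(pairs)` with the number of images is a quick completeness check.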

Document Types

The dataset contains 12 diverse document types representing real-world Azerbaijani documents:

ID	Document Type	Lines	Chars	Description
01	Formal Letter	9	196	Official business letter to director about 2025 reports
02	Handwritten Note	8	171	Personal reminder about bank appointment
03	Numeric Mixed	6	147	Contract with numbers, dates, amounts (VAT calculation)
04	Medical Form	8	215	Patient form with diagnosis and prescriptions
05	Utility Application	8	251	Electricity complaint with meter reading
06	Bank Statement	6	165	Account transactions with debits/credits
07	Education Text	6	240	Constitutional text about education rights
08	Address List	7	184	Addresses from Baku, Ganja, Sumqayit cities
09	Technical Report	4	216	System performance analysis (CPU, disk)
10	Contract Clause	8	257	Legal contract clause text
11	Daily Diary	6	184	Personal diary entry about work project
12	Tabular Text	4	158	Employee table with names, ages, departments

Complexity Levels

Simple (4-6 lines):

Tabular text (12), Technical report (09), Numeric mixed (03)
Short, structured content

Medium (6-8 lines):

Bank statement (06), Education (07), Address list (08)
Moderate length with mixed content

Complex (8-9 lines):

Medical form (04), Handwritten note (02), Contract (10)
Longer documents with varied formatting

Image Characteristics

Format: HEIC (High Efficiency Image Container)

Codec: HEVC (H.265); HEIC is the default photo format on recent Apple devices
File Sizes: 820KB - 1.0MB per image (average: ~890KB)
Total Size: ~10.5MB

Important: HEIC format requires conversion to PNG/JPG for PyTorch processing:

from PIL import Image
import pillow_heif

pillow_heif.register_heif_opener()
img = Image.open('01.HEIC').convert('L')  # Convert to grayscale
img.save('01.png')

Label File Format

Format:

Line_number→Transcribed_text

Example (az_tabular_text_12.txt):

     1→Adı        | Yaşı | Şöbə
     2→-------------------------
     3→Rauf       | 32   | Maliyyə
     4→Aysel      | 28   | İnsan resursları
     5→Kamal      | 41   | İT dəstəyi

Characteristics:

Line numbers with leading spaces
Arrow delimiter (→) separates line number from text
Preserves original spacing and formatting
Includes punctuation and special characters exactly as written
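
A label file in this format can be parsed line by line by splitting on the first arrow. This is a minimal sketch; `parse_label_text` is a hypothetical helper name, and it assumes the file has already been read as UTF-8 text.

```python
def parse_label_text(raw):
    """Return (line_number, text) tuples from the contents of a label file."""
    entries = []
    for raw_line in raw.splitlines():
        # Split on the first arrow only, so the text keeps its own spacing.
        number, sep, text = raw_line.partition("→")
        if not sep or not number.strip().isdigit():
            continue  # skip blank lines or lines without a numeric prefix
        entries.append((int(number.strip()), text))
    return entries


sample = "     1→Adı        | Yaşı | Şöbə\n     2→-------------------------\n"
entries = parse_label_text(sample)
```

Only the line-number prefix is stripped; the transcribed text is returned unchanged so that original spacing and punctuation are preserved.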

Azerbaijani Language Statistics

Character Distribution (Top 15)

Space:  262 occurrences (11% - word boundaries)
a:      154 (6.5%)
i:      137 (5.7%)
ə:      125 (5.2%) ← Azerbaijani-specific schwa
r:       98 (4.1%)
l:       91 (3.8%)
n:       90 (3.8%)
ı:       66 (2.8%) ← Azerbaijani dotless i
s:       66 (2.8%)
m:       57 (2.4%)
d:       51 (2.1%)
t:       49 (2.1%)
e:       42 (1.8%)
u:       39 (1.6%)
-:       35 (1.5%) ← Hyphenation

Azerbaijani-Specific Characters (Critical)

Lowercase:

ə: 125  (schwa - most common special character)
ı:  66  (dotless i)
ş:  27  (s with cedilla)
ü:  27  (u with diaeresis)
ğ:  13  (g with breve)
ö:   7  (o with diaeresis)
ç:   7  (c with cedilla)

Uppercase:

Ə:   4
İ:   3  (capital I with dot above, the uppercase of i in Azerbaijani and Turkish)
Ş:   2
Ü:   1
Ö:   1

Total Azerbaijani-specific characters: 263 (11% of all characters)

Key Insight: The Azerbaijani-specific letters (ə, ı, ş, ü, ğ, ö, ç) and their uppercase forms are distinct letters, not optional accents, and must be preserved by the HTR model.
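
The counts in this section can be recomputed from the transcription text. The sketch below counts the Azerbaijani-specific letters in a string; the helper names are hypothetical, and the input is assumed to be transcription text with the line-number prefixes already removed.

```python
from collections import Counter

# Letters that fall outside basic a-z / A-Z in the Azerbaijani Latin alphabet.
AZ_SPECIFIC = "əçğıöşüƏÇĞİÖŞÜ"


def az_specific_counts(text):
    """Count occurrences of each Azerbaijani-specific letter in text."""
    return Counter(ch for ch in text if ch in AZ_SPECIFIC)


def az_specific_share(text):
    """Fraction of all characters in text that are Azerbaijani-specific."""
    if not text:
        return 0.0
    return sum(az_specific_counts(text).values()) / len(text)


sample = "Hörmətli cənab direktor,"
counts = az_specific_counts(sample)
share = az_specific_share(sample)
```

Running the same functions over the concatenated text of all 12 label files would give dataset-wide figures comparable to the ones listed here.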

Content Analysis

Character Breakdown

Letters (a-z, A-Z):  ~1,550 (65%)
Azerbaijani chars:      263 (11%)
Spaces:                 262 (11%)
Numbers:                ~150 (6%)
Punctuation:            ~160 (7%)

Numeric & Special Content

Numbers Present:

Dates: 14.06.2024, 01.02.2025, 03.11.1987
Amounts: 12 750.45 AZN, 15 045.53 AZN, 304.50 AZN
Percentages: 18%, 67%
Phone numbers: 050-3456789
Contract numbers: № 457/23
Account numbers: AZ21NABZ

Punctuation & Symbols:

Hyphens (35) - word breaks, line wrapping
Periods (34) - decimals, abbreviations
Commas (19) - number separators
Colons (15) - field labels
Pipe symbols (|) - table formatting

Domain-Specific Vocabulary

The dataset contains rich domain terminology across multiple sectors:

Financial:

müqavilə (contract), məbləğ (amount), ƏDV (VAT)
hesabat (report), hesab nömrəsi (account number)
Maaş (salary), Komunal (utilities)

Medical:

pasiyent (patient), diaqnoz (diagnosis)
baş ağrısı (headache), arterial hipertenziya (arterial hypertension)
dərman (medicine)

Legal/Formal:

Hörmətli (Dear/Honorable), direktor (director)
Konstitusiya (Constitution), qanunvericilik (legislation)
qarşılıqlı razılaşma (mutual agreement)

Technical:

sistem performans (system performance)
CPU yüklənməsi (CPU load), disk oxunuş (disk read)

Personal/General:

xahiş edirəm (I request), bildiririk (we inform)
ünvan (address), rayon (district)

Text Features & Challenges

Line Breaking & Hyphenation

Multiple documents show mid-word line breaks with hyphens:

hazır-lanması (preparation; the hyphen marks the line break)
yaran-mış (arising)
araşdırıl-masını (investigation)
hiper-tenziya (hypertension)

Implication: HTR model needs line-level recognition, and post-processing must reconstruct hyphenated words.
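
One possible post-processing step is sketched below: it merges recognised lines and glues a word back together when the previous line ends in a hyphen. The function name is hypothetical, and the rule is deliberately naive. A genuine hyphen that happens to fall at the end of a line (for example in a form like 2025-ci) would also be removed, so a dictionary check may be needed in practice.

```python
def join_hyphenated_lines(lines):
    """Merge lines into one string, rejoining words split by a trailing hyphen."""
    merged = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if merged.endswith("-"):
            # Previous line ended mid-word: drop the hyphen and glue the halves.
            merged = merged[:-1] + line
        elif merged:
            merged += " " + line
        else:
            merged = line
    return merged


text = join_hyphenated_lines(["Diaqnoz: arterial hiper-", "tenziya"])
```

Hyphens inside a line are left untouched; only a hyphen in the final position of a line triggers the merge.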

Tabular Formatting

Document 12 contains a table structure:

Adı        | Yaşı | Şöbə
-------------------------
Rauf       | 32   | Maliyyə
Aysel      | 28   | İnsan resursları

Implication: Stage 1 (layout detection) is critical for preserving table structure.
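
Once the lines of a table have been recognised, the pipe-delimited text can be turned back into rows and cells. This is a minimal sketch with a hypothetical helper name; it assumes that cells are separated by `|` and that a line made only of hyphens is a header separator.

```python
def parse_pipe_table(lines):
    """Split pipe-delimited lines into rows of stripped cell strings."""
    rows = []
    for line in lines:
        stripped = line.strip()
        # Skip empty lines and separator lines made only of hyphens.
        if not stripped or set(stripped) <= {"-"}:
            continue
        rows.append([cell.strip() for cell in stripped.split("|")])
    return rows


rows = parse_pipe_table([
    "Adı        | Yaşı | Şöbə",
    "-------------------------",
    "Rauf       | 32   | Maliyyə",
])
```

The padding spaces inside cells are discarded here, so this representation suits downstream use of the table contents rather than character-exact evaluation.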

Mixed Case Usage

Proper nouns: Bakı, Gəncə, Neftçilər prospekti
Abbreviations: ƏDV, AZN, ATM, CPU, IT
Sentence case: Standard for regular text
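
Case handling needs care because Azerbaijani pairs i with İ and ı with I, while Python's built-in `str.lower()` and `str.upper()` are not locale-aware. The following sketch shows one way to handle the two pairs explicitly; `az_lower` and `az_upper` are hypothetical helper names.

```python
def az_lower(text):
    """Lowercase text using Azerbaijani rules for the dotted and dotless I."""
    # Map İ -> i and I -> ı first, then let str.lower() handle the rest.
    return text.replace("İ", "i").replace("I", "ı").lower()


def az_upper(text):
    """Uppercase text using Azerbaijani rules for the dotted and dotless i."""
    # Map i -> İ and ı -> I first, then let str.upper() handle the rest.
    return text.replace("i", "İ").replace("ı", "I").upper()


lowered = az_lower("İT dəstəyi")
uppered = az_upper("ünvan")
```

These rules apply to Azerbaijani words only. Latin abbreviations such as CPU or IT would be mapped with the same rules (IT becomes ıt), so any case folding before evaluation should be applied selectively or avoided.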

Data Quality Assessment

Strengths

✅ Diverse document types - Covers real-world use cases across multiple domains

✅ Rich Azerbaijani vocabulary - Proper diacritics preserved throughout

✅ Mixed content - Text, numbers, tables, addresses

✅ Domain variety - Medical, legal, financial, technical, personal

✅ Proper formatting - Line-level transcriptions with structure preservation

✅ Clean transcriptions - Accurate character-level annotations

Limitations

⚠️ Small dataset size - Only 12 samples (insufficient for production training)

⚠️ No writer diversity info - Unknown if single/multiple writers

⚠️ HEIC format - Requires preprocessing for PyTorch

⚠️ No bounding boxes - Transcriptions are split by line, but images are page-level with no line coordinates

⚠️ No validation split - Need to define train/val/test splits

⚠️ No image metadata - Resolution, DPI, quality information missing

⚠️ Insufficient for LM training - Only 2.4K chars vs. 100K-1M recommended

Preprocessing Requirements

Before training, the following preprocessing steps are required:

1. Convert HEIC to PNG/JPG

import pillow_heif
from PIL import Image
from pathlib import Path

# Let Pillow open HEIC files
pillow_heif.register_heif_opener()

image_dir = Path('images')
for heic_path in sorted(image_dir.iterdir()):
    # Match .HEIC and .heic alike
    if heic_path.suffix.lower() == '.heic':
        img = Image.open(heic_path).convert('L')  # Grayscale
        # Write 01.png next to 01.HEIC
        img.save(heic_path.with_suffix('.png'))

2. Line Segmentation

Extract bounding boxes for each line from page images:

Use layout detection (YOLOv8) or manual annotation
Create line-level image crops
Map each line crop to its transcription
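
The last step, mapping crops to transcriptions, can be sketched as follows once line boxes exist. The box format `(x1, y1, x2, y2)` in pixel coordinates and the function name are assumptions for illustration; the dataset does not yet define a box format. The sketch also assumes a single-column layout where reading order equals top-to-bottom order.

```python
def pair_boxes_with_lines(boxes, transcriptions):
    """Pair line boxes (x1, y1, x2, y2) with transcription lines, top to bottom."""
    if len(boxes) != len(transcriptions):
        raise ValueError(
            f"{len(boxes)} boxes but {len(transcriptions)} transcription lines"
        )
    # Sort by the top edge so box order matches reading order.
    ordered = sorted(boxes, key=lambda box: box[1])
    return list(zip(ordered, transcriptions))


pairs = pair_boxes_with_lines(
    [(10, 200, 500, 240), (12, 50, 480, 90)],
    ["Hörmətli cənab direktor,", "Bu məktub vasitəsilə"],
)
```

A count mismatch raises an error instead of pairing silently, since a missed or split line would shift every following transcription onto the wrong crop.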

3. Create Vocabulary File

import json
from pathlib import Path

# Collect label files (path assumes the layout under Directory Structure)
label_files = sorted(Path('data/labels').glob('az_*.txt'))

# Extract all unique characters from the transcriptions
chars = set()
for label_file in label_files:
    with open(label_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            # Drop the "line_number→" prefix on every line, keep the text
            _, sep, text = line.partition('→')
            chars.update(text if sep else line)

# Create vocabulary mapping
vocab = {char: idx for idx, char in enumerate(sorted(chars))}
vocab['[BLANK]'] = len(vocab)  # CTC blank token (last index)

with open('vocab.json', 'w', encoding='utf-8') as f:
    json.dump(vocab, f, ensure_ascii=False, indent=2)

4. Define Data Splits

Recommended split (document-wise to prevent data leakage):

train.txt: 01,02,03,04,05,06,07,08,09  (75% - 9 documents)
val.txt:   10,11                        (17% - 2 documents)
test.txt:  12                           (8% - 1 document)
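
The split above can be written down as data and checked for overlap before the split files are created. This is a minimal sketch; the `SPLITS` dictionary mirrors the recommendation, while `check_splits` is a hypothetical helper.

```python
SPLITS = {
    "train": ["01", "02", "03", "04", "05", "06", "07", "08", "09"],
    "val": ["10", "11"],
    "test": ["12"],
}


def check_splits(splits, n_documents=12):
    """Verify that splits are disjoint and complete; return rounded percentages."""
    all_ids = [doc_id for ids in splits.values() for doc_id in ids]
    if len(all_ids) != len(set(all_ids)):
        raise ValueError("a document appears in more than one split")
    if len(all_ids) != n_documents:
        raise ValueError("splits do not cover every document exactly once")
    return {name: round(100 * len(ids) / n_documents) for name, ids in splits.items()}


shares = check_splits(SPLITS)
# Comma-separated contents for train.txt, val.txt and test.txt
file_contents = {f"{name}.txt": ",".join(ids) for name, ids in SPLITS.items()}
```

Because every ID refers to a whole document, all lines of a page end up in the same split, which is what prevents leakage between training and evaluation.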

Recommended Character Set

Based on dataset analysis, the vocabulary should include:

Latin Letters:

Lowercase: a-z
Uppercase: A-Z

Azerbaijani Characters:

Lowercase: ə, ç, ğ, ı, ö, ş, ü
Uppercase: Ə, Ç, Ğ, İ, Ö, Ş, Ü

Digits: 0-9

Punctuation & Symbols:

. , : ; - – — ( ) [ ] / |
" ' « » ? ! № % + =

Special:

Space character
CTC blank token

Estimated Vocabulary Size: ~100 characters
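
The estimate can be checked by assembling the recommended character set in code. The sketch below builds it from the groups listed above; the constant and function names are hypothetical.

```python
import string

AZ_LOWER = "əçğıöşü"
AZ_UPPER = "ƏÇĞİÖŞÜ"
PUNCTUATION = ".,:;-–—()[]/|" + "\"'«»?!№%+="


def build_charset():
    """Concatenate the recommended character groups, including the space."""
    chars = (
        string.ascii_lowercase
        + string.ascii_uppercase
        + AZ_LOWER
        + AZ_UPPER
        + string.digits
        + PUNCTUATION
        + " "
    )
    if len(chars) != len(set(chars)):
        raise ValueError("duplicate character in charset")
    return chars


charset = build_charset()
vocab_size = len(charset) + 1  # one extra index for the CTC blank token
```

With these groups the set holds 100 characters including the space, so the vocabulary has 101 entries once the CTC blank is added.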

Usage Guidelines

For Model Training

1. Data Augmentation is Critical - With only 12 samples, heavy augmentation is mandatory:
   - Rotation: ±3°
   - Scaling: 0.9-1.1
   - Elastic distortion
   - Blur, noise, random erasing
   - Synthetic overlays
2. Start with HTR-Lite - Use the lightweight model variant for proof-of-concept
3. Character-Level Tokenization - Recommended for Azerbaijani language
4. Preserve Diacritics - Critical for maintaining language integrity
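
The rotation and scaling ranges listed above can be sampled per training example as sketched below. Only the ±3° and 0.9-1.1 ranges come from this document; the probabilities for blur, noise and erasing are placeholder assumptions, and the function name is hypothetical. The sketch only draws parameters and leaves the image operations to whichever library the training pipeline uses.

```python
import random


def sample_augmentation(rng):
    """Draw one set of augmentation parameters from the ranges listed above."""
    return {
        "rotation_deg": rng.uniform(-3.0, 3.0),  # rotation: ±3°
        "scale": rng.uniform(0.9, 1.1),          # scaling: 0.9-1.1
        "apply_blur": rng.random() < 0.3,        # assumed probability
        "apply_noise": rng.random() < 0.3,       # assumed probability
        "apply_erasing": rng.random() < 0.2,     # assumed probability
    }


rng = random.Random(0)  # fixed seed for reproducible draws
params = [sample_augmentation(rng) for _ in range(1000)]
```

Passing an explicit `random.Random` instance keeps augmentation reproducible across runs, which helps when comparing models trained on so few pages.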

For Dataset Expansion

Immediate Actions:

1. Collect More Data - Current 2.4K chars is far below recommended 100K-1M
   - Photograph additional handwritten documents
   - Use synthetic data generation
   - Apply pseudo-labeling on unlabeled scans
2. Line-Level Annotation - Convert page-level to line-level:
   - Extract individual line bounding boxes
   - Crop and save as separate line images
   - Create line-level transcriptions
3. Metadata Collection - Document:
   - Writer information (for stratified splits)
   - Image resolution and DPI
   - Document quality scores
   - Date of collection
4. Quality Control - Verify:
   - Transcription accuracy
   - Diacritic correctness
   - Proper character encoding (UTF-8)
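
Part of the quality-control step can be automated. The sketch below flags text that is not in Unicode NFC form (for example s followed by a combining cedilla instead of a single ş) and characters outside an allowed set (for example the look-alike ǝ, U+01DD, typed instead of ə, U+0259). The function name and the small `allowed` set are illustrative assumptions; a real check would use the full project vocabulary.

```python
import unicodedata


def find_transcription_issues(text, allowed):
    """Return a list of human-readable problems found in a transcription."""
    issues = []
    # Decomposed diacritics change under NFC normalisation.
    if unicodedata.normalize("NFC", text) != text:
        issues.append("text is not NFC-normalised")
    unexpected = sorted({ch for ch in text if ch not in allowed and ch != "\n"})
    if unexpected:
        codes = " ".join(f"U+{ord(ch):04X}" for ch in unexpected)
        issues.append("unexpected characters: " + codes)
    return issues


allowed = set("abcdefghijklmnopqrstuvwxyzəçğıöşü ")
clean = find_transcription_issues("şəhər", allowed)
lookalike = find_transcription_issues("ş\u01ddhər", allowed)
```

Reporting code points instead of the characters themselves makes visually identical letters distinguishable in the output.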

Integration with Architecture

Alignment with Planned System (plan.md)

Planned Feature	Current Data Status
Character-level vocab	✅ Azerbaijani chars present
Document-wise split	⚠️ Not defined yet
Line-level images	❌ Only page-level currently
Bounding boxes	❌ Not annotated
Mixed content	✅ Numbers, text, tables
Domain diversity	✅ 12 document types
100K+ tokens for LM	❌ Only 2.4K chars

Next Steps for Implementation

1. Preprocessing Pipeline:
   - Convert HEIC → PNG
   - Segment pages into lines
   - Extract line bounding boxes
2. Dataset Preparation:
   - Create train/val/test splits
   - Generate vocabulary.json
   - Build data loader with augmentation
3. Baseline Model:
   - Train HTR-Lite on augmented data
   - Evaluate on validation set
   - Analyze error patterns
4. Data Expansion:
   - Collect 100+ more documents
   - Implement active learning loop
   - Build Azerbaijani language model corpus
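
For the evaluation step, character error rate (CER) is a common metric for handwriting recognition. This document does not name a metric, so its use here is an assumption. The sketch computes CER as the Levenshtein edit distance divided by the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of hyp against a non-empty reference string."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# A prediction that drops every diacritic is penalised once per affected letter.
distance = edit_distance("şöbə", "sobe")
```

Because each Azerbaijani-specific letter counts as its own character, a model that outputs plain ASCII look-alikes is penalised directly, which matches the requirement to preserve diacritics.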

Example Label Samples

Document 01 - Formal Letter

Hörmətli cənab direktor,
Bu məktub vasitəsilə
bildiririk ki,
2025-ci il
üzrə hesabatların
hazırlanması başa çatmaq
üzrədir.

Document 04 - Medical Form

Pasiyentin adı, soyadı:
Əliyev Rəşad Kamran oğlu
Doğum tarixi: 03.11.1987
Şikayətlər: baş ağrısı, halsızlıq,
yuxusuzluq
Diaqnoz: arterial hipertenziya

Document 12 - Tabular Text

Adı        | Yaşı | Şöbə
-------------------------
Rauf       | 32   | Maliyyə
Aysel      | 28   | İnsan resursları
Kamal      | 41   | İT dəstəyi

References

Project Plan: See ../plan.md for full architecture specification
Character Encoding: UTF-8
Language: Azerbaijani (Latin script, ISO 639-1: az)
Image Format: HEIC (requires conversion to PNG/JPG)

License & Usage

This dataset was collected for developing the Azeri Handwriting Detection system. Please ensure proper handling of any personal information that may appear in the documents.

Last Updated: December 13, 2025
Dataset Version: 1.0 (Pilot)
Total Samples: 12 documents, 80 lines, 2,384 characters
