πŸ–‹οΈ Qalam-Net V2: Advanced Arabic OCR


Qalam-Net V2 (Ω‚Ω„Ω…-Ω†Ψͺ) is a high-performance Arabic Optical Character Recognition (OCR) system. Built on the TrOCR (Transformer-based OCR) architecture, it achieves superior accuracy by treating OCR as a sequence-to-sequence problem, mapping visual features directly to text tokens.


πŸ—οΈ Architecture Visualization

The model utilizes a Vision-Encoder-Decoder framework, specifically optimized for the complexities of Arabic script (ligatures, cursive nature, and right-to-left orientation).

```mermaid
graph TD
    A[Input Arabic Image] --> B[ViT Encoder]
    B -->|Visual Embeddings| C[Cross-Attention]
    D[Previous Tokens] --> E[RoBERTa Decoder]
    E --> C
    C --> F[Next Token Prediction]
    F -->|Generated Text| G[Final Arabic Transcription]

    subgraph "Encoder (Vision Transformer)"
    B
    end

    subgraph "Decoder (Language Model)"
    E
    end
```

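The cross-attention step in the diagram can be illustrated with a toy scaled dot-product attention in NumPy: decoder states act as queries over the encoder's visual embeddings. The shapes and dimensions below are illustrative only, not the model's real configuration.

```python
import numpy as np

# Toy cross-attention: each decoder position computes a softmax-weighted
# mixture of the encoder's visual token embeddings.
def cross_attention(decoder_states, encoder_states):
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)      # (T_dec, T_enc)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over visual tokens
    return weights @ encoder_states                              # (T_dec, d)

rng = np.random.default_rng(0)
enc = rng.standard_normal((576, 8))   # e.g. 576 patch embeddings from the encoder
dec = rng.standard_normal((5, 8))     # 5 decoder positions
out = cross_attention(dec, enc)
print(out.shape)  # (5, 8): one context vector per decoder position
```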
πŸš€ Key Features

  • End-to-End Transformer: No reliance on traditional CNN-RNN architectures or complex preprocessing (like line segmentation).
  • Arabic Script Specialist: Fine-tuned on the mssqpi/Arabic-OCR-Dataset for robust handling of various Arabic fonts and styles.
  • State-of-the-Art Accuracy: Leverages pre-trained vision and language weights from microsoft/trocr-base-handwritten.
  • Flexible Deployment: Supports CUDA, MPS (Apple Silicon), and CPU execution.

🧠 How It Works

Qalam-Net V2 differs from traditional OCR by eliminating the need for an external language model or a separate CTC (Connectionist Temporal Classification) layer.

  1. Visual Feature Extraction: The encoder divides the input image into patches and processes them via a Vision Transformer (ViT).
  2. Contextual Decoding: The decoder (RoBERTa-based) attends to both the visual features and the previously generated tokens to predict the next token.
  3. Arabic Optimization: During fine-tuning, the tokenizer and embeddings were adapted to capture the nuances of Arabic script and its Unicode representation.
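Step 1 can be made concrete with a little patch arithmetic. The numbers below assume the trocr-base defaults of a 384Γ—384 input and 16Γ—16 patches; this is an assumption, so check the model's config for the exact values.

```python
# How many visual tokens a ViT encoder produces for a square input image.
def num_patches(image_size: int, patch_size: int) -> int:
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    per_side = image_size // patch_size   # patches along one edge
    return per_side * per_side            # total patches = tokens fed to the encoder

print(num_patches(384, 16))  # 576 visual tokens
```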

πŸ“Š Performance Metrics

The model was fine-tuned for 1 epoch on a high-quality selection of 5,000 samples.

| Metric             | Value      |
|--------------------|------------|
| Training Samples   | 5,000      |
| Optimizer          | AdamW      |
| Learning Rate      | 3e-5       |
| Convergence (Loss) | 9.5 β†’ 0.03 |

Even with a single epoch, the model reached a training loss of 0.03, indicating highly efficient transfer learning from the base TrOCR weights.
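For reference, a single AdamW parameter update with the learning rate from the table looks like the sketch below. The learning rate (3e-5) comes from the table above; the betas, epsilon, and weight decay are assumed library defaults, not values taken from the actual training run.

```python
import math

# One AdamW update for a single scalar parameter (illustrative only).
def adamw_step(p, g, m, v, t, lr=3e-5, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: applied directly to p, not mixed into the gradient.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(p)  # slightly below 1.0: the parameter moves against the gradient
```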


πŸ–₯️ Getting Started

Installation

```shell
pip install transformers datasets Pillow torch
```

Quick Inference Example

```python
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

MODEL_NAME = "Ali0044/Qalam_Net_V2"
processor = TrOCRProcessor.from_pretrained(MODEL_NAME)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_NAME)

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device)
model.eval()

def run_ocr(image):
    # Resize/normalize the image and move the pixel tensor to the model's device.
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        generated_ids = model.generate(pixel_values)
    # Decode the generated token IDs back into an Arabic string.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Build a small synthetic test image containing Arabic text.
image = Image.new("RGB", (200, 50), color="white")
draw = ImageDraw.Draw(image)
try:
    font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 20)
except IOError:
    font = ImageFont.load_default()
draw.text((10, 10), "Ψ§Ω„Ω…ΨͺΩ…ΩŠΨ²Ψ©", fill=(0, 0, 0), font=font)

print(f"Predicted Transcription: {run_ocr(image)}")
image.show()
```

πŸ›‘οΈ Ethical Considerations & Limitations

  • Language Scope: Primarily optimized for Modern Standard Arabic (MSA). Performance on historical scripts or specific dialects may vary.
  • Image Quality: Performs best on clear, well-lit text snippets. Handwriting recognition is supported but may require higher resolution inputs.
  • Privacy: Ensure you have the rights to process any personal data contained within images when using this model in production.

🀝 Contributing & License

Contributions are what make the open-source community an amazing place to learn, inspire, and create.

  • License: Distributed under the Apache 2.0 License.
  • Contact: Reach out via Github or Hugging Face at Ali0044.

Built with ❀️ by Ali Khalid