πŸ–‹οΈ Qalam-Net V2: Advanced Arabic OCR


Qalam-Net V2 (Ω‚Ω„Ω…-Ω†Ψͺ) is a high-performance Arabic Optical Character Recognition (OCR) system. Built on the TrOCR (Transformer-based OCR) architecture, it achieves superior accuracy by treating OCR as a sequence-to-sequence problem, mapping visual features directly to text tokens.


πŸ—οΈ Architecture Visualization

The model utilizes a Vision-Encoder-Decoder framework, specifically optimized for the complexities of Arabic script (ligatures, cursive nature, and right-to-left orientation).

```mermaid
graph TD
    A[Input Arabic Image] --> B[ViT Encoder]
    B -->|Visual Embeddings| C[Cross-Attention]
    D[Previous Tokens] --> E[RoBERTa Decoder]
    E --> C
    C --> F[Next Token Prediction]
    F -->|Generated Text| G[Final Arabic Transcription]

    subgraph "Encoder (Vision Transformer)"
    B
    end

    subgraph "Decoder (Language Model)"
    E
    end
```

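The cross-attention step in the diagram can be illustrated with a toy scaled dot-product attention in NumPy: decoder states act as queries over the encoder's visual embeddings. The shapes and dimensions below are illustrative only, not the model's real configuration.

```python
import numpy as np

# Toy cross-attention: each decoder position computes a softmax-weighted
# mixture of the encoder's visual token embeddings.
def cross_attention(decoder_states, encoder_states):
    d = decoder_states.shape[-1]
    scores = decoder_states @ encoder_states.T / np.sqrt(d)      # (T_dec, T_enc)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over visual tokens
    return weights @ encoder_states                              # (T_dec, d)

rng = np.random.default_rng(0)
enc = rng.standard_normal((576, 8))   # e.g. 576 patch embeddings from the encoder
dec = rng.standard_normal((5, 8))     # 5 decoder positions
out = cross_attention(dec, enc)
print(out.shape)  # (5, 8): one context vector per decoder position
```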
πŸš€ Key Features

  • End-to-End Transformer: No reliance on traditional CNN-RNN architectures or complex preprocessing (like line segmentation).
  • Arabic Script Specialist: Fine-tuned on the mssqpi/Arabic-OCR-Dataset for robust handling of various Arabic fonts and styles.
  • State-of-the-Art Accuracy: Leverages pre-trained vision and language weights from microsoft/trocr-base-handwritten.
  • Flexible Deployment: Supports CUDA, MPS (Apple Silicon), and CPU execution.

🧠 How It Works

Qalam-Net V2 differs from traditional OCR by eliminating the need for an external language model or a separate CTC (Connectionist Temporal Classification) layer.

  1. Visual Feature Extraction: The encoder divides the input image into patches and processes them via a Vision Transformer (ViT).
  2. Contextual Decoding: The decoder (RoBERTa-based) attends to both the visual features and the previously generated tokens to predict the next token.
  3. Arabic Optimization: During fine-tuning, the tokenizer and embeddings were adapted to capture the nuances of Arabic script and its Unicode representation.
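Step 1 can be made concrete with a little patch arithmetic. The numbers below assume the trocr-base defaults of a 384Γ—384 input and 16Γ—16 patches; this is an assumption, so check the model's config for the exact values.

```python
# How many visual tokens a ViT encoder produces for a square input image.
def num_patches(image_size: int, patch_size: int) -> int:
    assert image_size % patch_size == 0, "image must tile evenly into patches"
    per_side = image_size // patch_size   # patches along one edge
    return per_side * per_side            # total patches = tokens fed to the encoder

print(num_patches(384, 16))  # 576 visual tokens
```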

πŸ“Š Performance Metrics

The model was fine-tuned for 1 epoch on a high-quality selection of 5,000 samples.

| Metric             | Value      |
|--------------------|------------|
| Training Samples   | 5,000      |
| Optimizer          | AdamW      |
| Learning Rate      | 3e-5       |
| Convergence (Loss) | 9.5 β†’ 0.03 |

Even with a single epoch, the model reached a training loss of 0.03, indicating highly efficient transfer learning from the base TrOCR weights.
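For reference, a single AdamW parameter update with the learning rate from the table looks like the sketch below. The learning rate (3e-5) comes from the table above; the betas, epsilon, and weight decay are assumed library defaults, not values taken from the actual training run.

```python
import math

# One AdamW update for a single scalar parameter (illustrative only).
def adamw_step(p, g, m, v, t, lr=3e-5, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g * g          # second-moment (variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction for step t
    v_hat = v / (1 - b2 ** t)
    # Decoupled weight decay: applied directly to p, not mixed into the gradient.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * p)
    return p, m, v

p, m, v = adamw_step(p=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(p)  # slightly below 1.0: the parameter moves against the gradient
```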


πŸ–₯️ Getting Started

Installation

```shell
pip install transformers datasets Pillow torch
```

Quick Inference Example

```python
import torch
from PIL import Image, ImageDraw, ImageFont
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

MODEL_NAME = "Ali0044/Qalam_Net_V2"
processor = TrOCRProcessor.from_pretrained(MODEL_NAME)
model = VisionEncoderDecoderModel.from_pretrained(MODEL_NAME)

# Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
model.to(device)
model.eval()

def run_ocr(image):
    # Resize/normalize the image and move the pixel tensor to the model's device.
    pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
    with torch.no_grad():
        generated_ids = model.generate(pixel_values)
    # Decode the generated token IDs back into an Arabic string.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Build a small synthetic test image containing Arabic text.
image = Image.new("RGB", (200, 50), color="white")
draw = ImageDraw.Draw(image)
try:
    font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSans-Bold.ttf", 20)
except IOError:
    font = ImageFont.load_default()
draw.text((10, 10), "Ψ§Ω„Ω…ΨͺΩ…ΩŠΨ²Ψ©", fill=(0, 0, 0), font=font)

print(f"Predicted Transcription: {run_ocr(image)}")
image.show()
```

πŸ›‘οΈ Ethical Considerations & Limitations

  • Language Scope: Primarily optimized for Modern Standard Arabic (MSA). Performance on historical scripts or specific dialects may vary.
  • Image Quality: Performs best on clear, well-lit text snippets. Handwriting recognition is supported but may require higher resolution inputs.
  • Privacy: Ensure you have the rights to process any personal data contained within images when using this model in production.

🀝 Contributing & License

Contributions are what make the open-source community an amazing place to learn, inspire, and create.

  • License: Distributed under the Apache 2.0 License.
  • Contact: Reach out via Github or Hugging Face at Ali0044.

Built with ❀️ by Ali Khalid