---
title: PaddleOCR Text Recognition Fine-tuning Toolkit
emoji: 🌍
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: apache-2.0
---

PaddleOCR Text Recognition Fine-tuning Toolkit

This repository provides a comprehensive pipeline for fine-tuning PaddleOCR text recognition models on custom datasets. Based on the official PaddleOCR Text Recognition Module Tutorial, this toolkit includes dataset preparation, training, evaluation, and inference scripts.

✨ Features

  • Complete Pipeline: End-to-end fine-tuning from dataset preparation to model export
  • Multiple Models: Support for PP-OCRv5, PP-OCRv4 server and mobile variants
  • Dataset Flexibility: Handle various dataset formats (directory, CSV, JSON, ICDAR)
  • Performance Optimization: Automatic GPU memory management and batch processing
  • Comprehensive Evaluation: Model benchmarking and comparison tools
  • Easy Inference: Ready-to-use inference scripts with visualization

🔧 Requirements

System Requirements

  • Python 3.8+
  • CUDA 11.8+ (for GPU training)
  • 8GB+ RAM (16GB+ recommended)
  • 4GB+ GPU memory (8GB+ recommended)

Software Dependencies

See requirements.txt for detailed package versions.

📦 Installation

  1. Clone this repository:
git clone <repository-url>
cd paddleocr-text-recognition-finetuning
  2. Install dependencies:
pip install -r requirements.txt
  3. Install PaddleOCR:
# For GPU users
pip install paddlepaddle-gpu paddleocr

# For CPU users
pip install paddlepaddle paddleocr
  4. Verify installation:
python -c "import paddleocr; print('PaddleOCR installed successfully!')"
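
For a deeper check, PaddlePaddle ships a built-in self-test that runs a small computation and reports whether the install (including GPU support) is working:

# Deeper sanity check using PaddlePaddle's built-in self-test
import paddle

paddle.utils.run_check()  # runs a small computation and reports the result
print("Compiled with CUDA:", paddle.device.is_compiled_with_cuda())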

🚀 Quick Start

Option 1: Complete Pipeline (Recommended)

Run the entire fine-tuning pipeline with demo data:

python fine_tune_text_recognition.py \
    --model_name PP-OCRv5_server_rec \
    --work_dir ./my_training \
    --gpus 0 \
    --mode complete

Option 1b: Document Dataset (LMDB) Pipeline

If you have a document dataset in LMDB format (like ./input_dir/document):

# Quick demo to see your data
python demo_document_extraction.py

# Complete pipeline: extract + train + test
python extract_and_train.py \
    --input_dir ./input_dir/document \
    --work_dir ./document_training \
    --model_name PP-OCRv5_server_rec \
    --epochs 20 \
    --batch_size 64

Option 2: Step-by-Step Process

  1. Prepare your dataset:
python prepare_dataset.py \
    --input_type directory \
    --input_path /path/to/your/images \
    --output_dir ./dataset
  2. Fine-tune the model:
python fine_tune_text_recognition.py \
    --model_name PP-OCRv5_server_rec \
    --work_dir ./my_training \
    --skip_demo_data \
    --mode train
  3. Test your fine-tuned model:
python inference_example.py \
    --model_dir ./my_training/PP-OCRv5_server_rec_infer \
    --input /path/to/test/image.jpg \
    --save_results \
    --visualize

📊 Dataset Preparation

Supported Input Formats

1. Directory with Images and Text Files

your_dataset/
├── image1.jpg
├── image1.txt
├── image2.png
├── image2.txt
└── ...
python prepare_dataset.py \
    --input_type directory \
    --input_path ./your_dataset \
    --output_dir ./dataset

2. CSV Format

CSV file with columns: image_path, text

python prepare_dataset.py \
    --input_type csv \
    --input_path data.csv \
    --img_col image_path \
    --text_col text \
    --output_dir ./dataset
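
For reference, a matching data.csv would look like this (a header row, then one line per sample; quote any text that contains commas):

image_path,text
images/img1.jpg,Hello World
images/img2.jpg,Fine-tuning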

3. JSON Format

[
    {"image_path": "img1.jpg", "text": "Hello World"},
    {"image_path": "img2.jpg", "text": "Fine-tuning"},
    ...
]
python prepare_dataset.py \
    --input_type json \
    --input_path data.json \
    --output_dir ./dataset
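
If your annotations live elsewhere (for example in a Python structure), a short script can emit the expected JSON file. The paths and labels in this sketch are placeholders for your own data:

# Minimal sketch: write annotations in the JSON format shown above.
# The image paths and labels are placeholders for your own data.
import json

samples = [
    {"image_path": "img1.jpg", "text": "Hello World"},
    {"image_path": "img2.jpg", "text": "Fine-tuning"},
]

with open("data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=4)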

4. ICDAR Format

python prepare_dataset.py \
    --input_type icdar \
    --input_path ./images_directory \
    --annotations_file annotations.txt \
    --output_dir ./dataset

5. LMDB Format (Document Datasets)

For LMDB datasets (like the document dataset in ./input_dir/document):

# Extract LMDB data only
python extract_lmdb_data.py \
    --input_dir ./input_dir/document \
    --output_dir ./extracted_dataset

# Or use the integrated approach
python prepare_dataset.py \
    --input_type lmdb \
    --input_path ./input_dir/document \
    --output_dir ./dataset
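
To peek inside an LMDB split before extracting it, the sketch below assumes the conventional OCR key layout (num-samples, image-%09d, label-%09d, with indices starting at 1) that PaddleOCR's LMDB reader expects; verify it matches your dataset:

# Sketch: inspect the first record of an OCR LMDB dataset.
# Assumes the conventional key layout: num-samples, image-%09d, label-%09d.
import lmdb

env = lmdb.open("./input_dir/document/document_train", readonly=True, lock=False)
with env.begin() as txn:
    num_samples = int(txn.get(b"num-samples"))
    label = txn.get(b"label-000000001").decode("utf-8")
    image_bytes = txn.get(b"image-000000001")
    print(f"{num_samples} samples; first label: {label!r} ({len(image_bytes)} image bytes)")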

Expected Output Structure

dataset/
├── images/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
├── train_list.txt
└── val_list.txt

Format of train_list.txt and val_list.txt (image path and label separated by a single tab):

images/image1.jpg	Hello World
images/image2.png	Fine-tuning
...
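
Because a single malformed line (a missing tab, a wrong image path) can abort training, it is worth validating the label files after preparation. A minimal checker might look like this:

# Minimal sketch: validate a PaddleOCR-style label file (path<TAB>label per line).
import os

def validate_label_file(label_path, dataset_root="."):
    problems = []
    with open(label_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 2 or not parts[1]:
                problems.append(f"line {lineno}: expected 'path<TAB>label'")
            elif not os.path.exists(os.path.join(dataset_root, parts[0])):
                problems.append(f"line {lineno}: missing image {parts[0]}")
    return problems

for issue in validate_label_file("dataset/train_list.txt", dataset_root="dataset"):
    print(issue)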

🎯 Fine-tuning Process

Basic Fine-tuning

python fine_tune_text_recognition.py \
    --model_name PP-OCRv5_server_rec \
    --work_dir ./training_output \
    --gpus 0

Advanced Configuration

The script supports various customization options:

python fine_tune_text_recognition.py \
    --model_name PP-OCRv5_mobile_rec \
    --work_dir ./training_output \
    --gpus 0,1 \
    --mode complete \
    --skip_demo_data

Custom Training Parameters

Modify the custom_params dictionary in the script for advanced customization:

custom_params = {
    "Global": {
        "epoch_num": 50,           # Number of training epochs
        "save_epoch_step": 5,      # Save model every N epochs
        "eval_batch_step": [0, 1000]  # Evaluation frequency
    },
    "Train": {
        "loader": {
            "batch_size_per_card": 64,  # Batch size per GPU
            "num_workers": 8           # Data loading workers
        }
    }
}

Memory Optimization

For systems with limited GPU memory, use these settings:

custom_params = {
    "Train": {
        "loader": {
            "batch_size_per_card": 32,  # Reduce batch size
            "num_workers": 2            # Reduce workers
        }
    },
    "Eval": {
        "loader": {
            "batch_size_per_card": 32
        }
    }
}
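
Note that a smaller batch also yields noisier gradient estimates per step, so if training becomes unstable after reducing batch_size_per_card, lowering the learning rate proportionally is a common remedy.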

📈 Model Evaluation

Evaluate Trained Model

python fine_tune_text_recognition.py \
    --mode eval \
    --config path/to/config.yml \
    --checkpoint path/to/best_accuracy.pdparams \
    --gpus 0

Export Model for Inference

python fine_tune_text_recognition.py \
    --mode export \
    --config path/to/config.yml \
    --checkpoint path/to/best_accuracy.pdparams

🔍 Inference

Single Image Inference

python inference_example.py \
    --model_dir ./work_dir/PP-OCRv5_server_rec_infer \
    --input single_image.jpg \
    --save_results \
    --visualize
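
If you prefer calling the fine-tuned recognizer directly from Python, the PaddleOCR 3.x module API from the tutorial this toolkit is based on looks roughly like the sketch below; the model_dir and input parameter names are assumptions to verify against your installed PaddleOCR version:

# Sketch of the PaddleOCR 3.x TextRecognition module API; the model_dir and
# input parameter names are assumptions to verify against your installed version.
from paddleocr import TextRecognition

model = TextRecognition(model_dir="./work_dir/PP-OCRv5_server_rec_infer")
output = model.predict(input="single_image.jpg", batch_size=1)
for res in output:
    res.print()                      # recognized text and confidence score
    res.save_to_json(save_path="./output/")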

Batch Processing

python inference_example.py \
    --model_dir ./work_dir/PP-OCRv5_server_rec_infer \
    --input ./test_images/ \
    --batch_size 16 \
    --save_results

Performance Benchmarking

python inference_example.py \
    --model_dir ./work_dir/PP-OCRv5_server_rec_infer \
    --input ./test_images/ \
    --benchmark \
    --visualize

Compare with Original Model

python inference_example.py \
    --model_dir ./work_dir/PP-OCRv5_server_rec_infer \
    --input ./test_images/ \
    --compare_original PP-OCRv5_server_rec \
    --visualize

🔧 Advanced Usage

Multi-GPU Training

python fine_tune_text_recognition.py \
    --model_name PP-OCRv5_server_rec \
    --gpus 0,1,2,3 \
    --work_dir ./multi_gpu_training

Resume Training from Checkpoint

python fine_tune_text_recognition.py \
    --mode train \
    --config custom_config.yml \
    --resume_from ./work_dir/output/iter_1000.pdparams

Custom Character Dictionary

  1. Create your character dictionary file:
a
b
c
...
中
文
字
符
  2. Update the configuration:
custom_params = {
    "Global": {
        "character_dict_path": "path/to/your/custom_dict.txt",
        "character_type": "ch"  # or "en" for English
    }
}
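
A convenient way to build the dictionary is to derive it from your own training labels, so it contains exactly the characters the model must learn. A minimal sketch:

# Minimal sketch: build a character dictionary from a PaddleOCR label file.
chars = set()
with open("dataset/train_list.txt", encoding="utf-8") as f:
    for line in f:
        _, _, label = line.rstrip("\n").partition("\t")
        chars.update(label)

with open("custom_dict.txt", "w", encoding="utf-8") as f:
    for ch in sorted(chars):
        f.write(ch + "\n")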

Training with Different Image Sizes

custom_params = {
    "Train": {
        "dataset": {
            "transforms": [
                {"DecodeImage": {"img_mode": "BGR", "channel_first": False}},
                {"RecResizeImg": {"image_shape": [3, 64, 256]}},  # H=64, W=256
                # ... other transforms
            ]
        }
    }
}

❌ Troubleshooting

Common Issues

1. CUDA Out of Memory

Solution: Reduce the batch size (and enable gradient accumulation if your PaddleOCR version supports it):

custom_params = {
    "Train": {
        "loader": {
            "batch_size_per_card": 16  # Reduce from default 256
        }
    }
}

2. Dataset Loading Errors

Solution: Check dataset format and file paths

# Validate your dataset
python prepare_dataset.py --input_type directory --input_path ./data --output_dir ./test_dataset

3. Model Export Fails

Solution: Ensure checkpoint exists and config path is correct

# Check if checkpoint exists
ls ./work_dir/output/

4. Low Recognition Accuracy

Solutions:

  • Increase training epochs
  • Use data augmentation
  • Verify dataset quality
  • Try different learning rates (see the sketch after this list)
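
As a concrete starting point for the learning-rate suggestion above, the same custom_params mechanism used elsewhere in this README can override the optimizer section. The Optimizer.lr field names below follow the common PaddleOCR config schema; confirm them against your generated config.yml:

# Sketch: lower the learning rate for fine-tuning via custom_params.
# Field names follow the usual PaddleOCR config schema (Optimizer.lr);
# verify them against your generated config.yml.
custom_params = {
    "Optimizer": {
        "lr": {
            "learning_rate": 0.0001,  # smaller than from-scratch defaults
            "warmup_epoch": 2,        # brief warmup stabilizes early steps
        }
    }
}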

Performance Tips

  1. For faster training:

    • Use SSD storage for datasets
    • Increase num_workers in data loader
    • Use mixed precision training (if supported; see the sketch after this list)
  2. For better accuracy:

    • Increase image resolution
    • Add more training data
    • Use appropriate data augmentation
    • Fine-tune learning rate schedule
  3. For memory efficiency:

    • Reduce batch size
    • Use gradient accumulation
    • Enable CPU offloading
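
For the mixed-precision tip, PaddleOCR documents AMP switches in the Global section of the training config. The flags below follow that documentation; confirm they exist in your PaddleOCR version:

# Sketch: enable automatic mixed precision (AMP) training.
# Flags follow PaddleOCR's documented AMP options; confirm against your version.
custom_params = {
    "Global": {
        "use_amp": True,
        "scale_loss": 1024.0,
        "use_dynamic_loss_scaling": True,
    }
}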

📋 Supported Models

Model                 Accuracy   Speed     Model Size   Use Case
PP-OCRv5_server_rec   86.38%     8.46ms    81MB         High accuracy server deployment
PP-OCRv5_mobile_rec   81.29%     5.43ms    16MB         Mobile/edge devices
PP-OCRv4_server_rec   85.19%     8.75ms    173MB        Legacy server deployment
PP-OCRv4_mobile_rec   78.74%     5.26ms    10.5MB       Legacy mobile deployment

Choosing the Right Model

  • PP-OCRv5_server_rec: Best overall accuracy, suitable for server deployment
  • PP-OCRv5_mobile_rec: Good balance of accuracy and speed, perfect for mobile apps
  • PP-OCRv4_*: Use if you need compatibility with older PaddleOCR versions

📝 File Structure

.
├── fine_tune_text_recognition.py   # Main fine-tuning script
├── prepare_dataset.py              # Dataset preparation utility
├── inference_example.py            # Inference and evaluation script
├── extract_lmdb_data.py            # LMDB data extraction utility
├── extract_and_train.py            # Complete LMDB pipeline
├── demo_document_extraction.py     # Demo for document dataset
├── quick_start_example.py          # Simple getting started script
├── requirements.txt                # Python dependencies
├── README.md                       # This file
├── input_dir/                      # Your input data (LMDB format)
│   └── document/                   # Document dataset
│       ├── document_train/         # Training split (LMDB)
│       ├── document_val/           # Validation split (LMDB)
│       └── document_test/          # Test split (LMDB)
└── work_dir/                       # Training outputs (created during training)
    ├── dataset/                    # Prepared dataset
    ├── output/                     # Training checkpoints
    └── PP-OCRv5_server_rec_infer/  # Exported model

🎉 Comparison Demo

The included Chinese text recognition demo provides a side-by-side comparison of the original and fine-tuned models, making the gains from fine-tuning easy to demonstrate.

🚀 Quick Start:

# Launch the enhanced comparison demo
python3 demo.py

Access at: http://localhost:7860

🤝 Contributing

Feel free to submit issues, feature requests, and pull requests. For major changes, please open an issue first to discuss what you would like to change.

📄 License

This project is based on PaddleOCR and follows the same Apache 2.0 License.

🙏 Acknowledgments

  • PaddleOCR team for the excellent OCR framework
  • PaddlePaddle team for the deep learning platform
  • Community contributors for testing and feedback

For more detailed information about PaddleOCR, visit the official documentation.