---
title: PaddleOCR Text Recognition Fine-tuning Toolkit
emoji: 🌍
colorFrom: red
colorTo: blue
sdk: gradio
sdk_version: 5.38.0
app_file: app.py
pinned: false
license: apache-2.0
---
PaddleOCR Text Recognition Fine-tuning Toolkit
This repository provides a comprehensive pipeline for fine-tuning PaddleOCR text recognition models on custom datasets. Based on the official PaddleOCR Text Recognition Module Tutorial, this toolkit includes dataset preparation, training, evaluation, and inference scripts.
📋 Table of Contents
- Features
- Requirements
- Installation
- Quick Start
- Dataset Preparation
- Fine-tuning Process
- Model Evaluation
- Inference
- Advanced Usage
- Troubleshooting
- Supported Models
✨ Features
- Complete Pipeline: End-to-end fine-tuning from dataset preparation to model export
- Multiple Models: Support for PP-OCRv5 and PP-OCRv4, in both server and mobile variants
- Dataset Flexibility: Handle various dataset formats (directory, CSV, JSON, ICDAR, LMDB)
- Performance Optimization: Automatic GPU memory management and batch processing
- Comprehensive Evaluation: Model benchmarking and comparison tools
- Easy Inference: Ready-to-use inference scripts with visualization
🔧 Requirements
System Requirements
- Python 3.8+
- CUDA 11.8+ (for GPU training)
- 8GB+ RAM (16GB+ recommended)
- 4GB+ GPU memory (8GB+ recommended)
Software Dependencies
See requirements.txt for detailed package versions.
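At a minimum these include PaddlePaddle and PaddleOCR themselves (see Installation below); a rough sketch of the core entries, with the pinned versions in requirements.txt remaining authoritative:
paddlepaddle-gpu    # or paddlepaddle for CPU-only setups
paddleocr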
📦 Installation
- Clone this repository:
git clone <repository-url>
cd paddleocr-text-recognition-finetuning
- Install dependencies:
pip install -r requirements.txt
- Install PaddleOCR:
# For GPU users
pip install paddlepaddle-gpu paddleocr
# For CPU users
pip install paddlepaddle paddleocr
- Verify installation:
python -c "import paddleocr; print('PaddleOCR installed successfully!')"
🚀 Quick Start
Option 1: Complete Pipeline (Recommended)
Run the entire fine-tuning pipeline with demo data:
python fine_tune_text_recognition.py \
--model_name PP-OCRv5_server_rec \
--work_dir ./my_training \
--gpus 0 \
--mode complete
Option 1b: Document Dataset (LMDB) Pipeline
If you have a document dataset in LMDB format (like ./input_dir/document):
# Quick demo to see your data
python demo_document_extraction.py
# Complete pipeline: extract + train + test
python extract_and_train.py \
--input_dir ./input_dir/document \
--work_dir ./document_training \
--model_name PP-OCRv5_server_rec \
--epochs 20 \
--batch_size 64
Option 2: Step-by-Step Process
- Prepare your dataset:
python prepare_dataset.py \
--input_type directory \
--input_path /path/to/your/images \
--output_dir ./dataset
- Fine-tune the model:
python fine_tune_text_recognition.py \
--model_name PP-OCRv5_server_rec \
--work_dir ./my_training \
--skip_demo_data \
--mode train
- Test your fine-tuned model:
python inference_example.py \
--model_dir ./my_training/PP-OCRv5_server_rec_infer \
--input /path/to/test/image.jpg \
--save_results \
--visualize
📊 Dataset Preparation
Supported Input Formats
1. Directory with Images and Text Files
your_dataset/
├── image1.jpg
├── image1.txt
├── image2.png
├── image2.txt
└── ...
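Each .txt file holds the ground-truth transcription of the image with the same base name; for example, image1.txt would contain just:
Hello World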
python prepare_dataset.py \
--input_type directory \
--input_path ./your_dataset \
--output_dir ./dataset
2. CSV Format
A CSV file with the columns image_path and text (the column names can be changed via --img_col and --text_col):
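For example (paths and texts are illustrative):
image_path,text
images/img1.jpg,Hello World
images/img2.jpg,Fine-tuning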
python prepare_dataset.py \
--input_type csv \
--input_path data.csv \
--img_col image_path \
--text_col text \
--output_dir ./dataset
3. JSON Format
[
{"image_path": "img1.jpg", "text": "Hello World"},
{"image_path": "img2.jpg", "text": "Fine-tuning"},
...
]
python prepare_dataset.py \
--input_type json \
--input_path data.json \
--output_dir ./dataset
4. ICDAR Format
python prepare_dataset.py \
--input_type icdar \
--input_path ./images_directory \
--annotations_file annotations.txt \
--output_dir ./dataset
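The annotations file lists one image per line together with its transcription. The exact separator depends on the ICDAR variant and on what prepare_dataset.py expects; a common word-recognition ground-truth layout (illustrative only) looks like:
word_1.png, "Hello"
word_2.png, "World"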
5. LMDB Format (Document Datasets)
For LMDB datasets (like the document dataset in ./input_dir/document):
# Extract LMDB data only
python extract_lmdb_data.py \
--input_dir ./input_dir/document \
--output_dir ./extracted_dataset
# Or use the integrated approach
python prepare_dataset.py \
--input_type lmdb \
--input_path ./input_dir/document \
--output_dir ./dataset
Expected Output Structure
dataset/
├── images/
│ ├── image1.jpg
│ ├── image2.png
│ └── ...
├── train_list.txt
└── val_list.txt
Format of train_list.txt and val_list.txt (one image path and its label per line; PaddleOCR's recognition loader usually expects a tab between the two):
images/image1.jpg Hello World
images/image2.png Fine-tuning
...
🎯 Fine-tuning Process
Basic Fine-tuning
python fine_tune_text_recognition.py \
--model_name PP-OCRv5_server_rec \
--work_dir ./training_output \
--gpus 0
Advanced Configuration
The script supports various customization options:
python fine_tune_text_recognition.py \
--model_name PP-OCRv5_mobile_rec \
--work_dir ./training_output \
--gpus 0,1 \
--mode complete \
--skip_demo_data
Custom Training Parameters
Modify the custom_params dictionary in the script for advanced customization:
custom_params = {
"Global": {
"epoch_num": 50, # Number of training epochs
"save_epoch_step": 5, # Save model every N epochs
"eval_batch_step": [0, 1000] # Evaluation frequency
},
"Train": {
"loader": {
"batch_size_per_card": 64, # Batch size per GPU
"num_workers": 8 # Data loading workers
}
}
}
Memory Optimization
For systems with limited GPU memory, use these settings:
custom_params = {
"Train": {
"loader": {
"batch_size_per_card": 32, # Reduce batch size
"num_workers": 2 # Reduce workers
}
},
"Eval": {
"loader": {
"batch_size_per_card": 32
}
}
}
📈 Model Evaluation
Evaluate Trained Model
python fine_tune_text_recognition.py \
--mode eval \
--config path/to/config.yml \
--checkpoint path/to/best_accuracy.pdparams \
--gpus 0
Export Model for Inference
python fine_tune_text_recognition.py \
--mode export \
--config path/to/config.yml \
--checkpoint path/to/best_accuracy.pdparams
🔍 Inference
Single Image Inference
python inference_example.py \
--model_dir ./work_dir/PP-OCRv5_server_rec_infer \
--input single_image.jpg \
--save_results \
--visualize
Batch Processing
python inference_example.py \
--model_dir ./work_dir/PP-OCRv5_server_rec_infer \
--input ./test_images/ \
--batch_size 16 \
--save_results
Performance Benchmarking
python inference_example.py \
--model_dir ./work_dir/PP-OCRv5_server_rec_infer \
--input ./test_images/ \
--benchmark \
--visualize
Compare with Original Model
python inference_example.py \
--model_dir ./work_dir/PP-OCRv5_server_rec_infer \
--input ./test_images/ \
--compare_original PP-OCRv5_server_rec \
--visualize
🔧 Advanced Usage
Multi-GPU Training
python fine_tune_text_recognition.py \
--model_name PP-OCRv5_server_rec \
--gpus 0,1,2,3 \
--work_dir ./multi_gpu_training
Resume Training from Checkpoint
python fine_tune_text_recognition.py \
--mode train \
--config custom_config.yml \
--resume_from ./work_dir/output/iter_1000.pdparams
Custom Character Dictionary
- Create your character dictionary file:
a
b
c
...
中
文
字
符
- Update the configuration:
custom_params = {
"Global": {
"character_dict_path": "path/to/your/custom_dict.txt",
"character_type": "ch" # or "en" for English
}
}
Training with Different Image Sizes
custom_params = {
"Train": {
"dataset": {
"transforms": [
{"DecodeImage": {"img_mode": "BGR", "channel_first": False}},
{"RecResizeImg": {"image_shape": [3, 64, 256]}}, # H=64, W=256
# ... other transforms
]
}
}
}
❌ Troubleshooting
Common Issues
1. CUDA Out of Memory
Solution: Reduce batch size and enable gradient accumulation
custom_params = {
"Train": {
"loader": {
"batch_size_per_card": 16 # Reduce from default 256
}
}
}
2. Dataset Loading Errors
Solution: Check dataset format and file paths
# Validate your dataset
python prepare_dataset.py --input_type directory --input_path ./data --output_dir ./test_dataset
3. Model Export Fails
Solution: Ensure checkpoint exists and config path is correct
# Check if checkpoint exists
ls ./work_dir/output/
4. Low Recognition Accuracy
Solutions (a sketch of example parameter changes follows this list):
- Increase training epochs
- Use data augmentation
- Verify dataset quality
- Try different learning rates
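A minimal sketch of such adjustments, assuming custom_params can override the Optimizer section of the PaddleOCR config in the same way it overrides Global and Train (all values are illustrative):
custom_params = {
    "Global": {
        "epoch_num": 100              # train longer than the default
    },
    "Optimizer": {
        "lr": {
            "learning_rate": 0.0005   # try a smaller rate when fine-tuning
        }
    }
}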
Performance Tips
For faster training:
- Use SSD storage for datasets
- Increase num_workers in the data loader
- Use mixed precision training (if supported; see the combined sketch after these tips)
For better accuracy:
- Increase image resolution
- Add more training data
- Use appropriate data augmentation
- Fine-tune learning rate schedule
For memory efficiency:
- Reduce batch size
- Use gradient accumulation
- Enable CPU offloading
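A combined sketch of these knobs using the same custom_params mechanism as above; the AMP key follows standard PaddleOCR configs (Global.use_amp) but is an assumption here and should be checked against your PaddleOCR version:
custom_params = {
    "Global": {
        "use_amp": True                   # mixed precision, if your GPU supports it
    },
    "Train": {
        "loader": {
            "batch_size_per_card": 32,    # smaller batches save GPU memory
            "num_workers": 8              # more workers speed up data loading
        }
    }
}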
📋 Supported Models
| Model | Accuracy | Speed | Model Size | Use Case |
|---|---|---|---|---|
| PP-OCRv5_server_rec | 86.38% | 8.46ms | 81MB | High accuracy server deployment |
| PP-OCRv5_mobile_rec | 81.29% | 5.43ms | 16MB | Mobile/edge devices |
| PP-OCRv4_server_rec | 85.19% | 8.75ms | 173MB | Legacy server deployment |
| PP-OCRv4_mobile_rec | 78.74% | 5.26ms | 10.5MB | Legacy mobile deployment |
Choosing the Right Model
- PP-OCRv5_server_rec: Best overall accuracy, suitable for server deployment
- PP-OCRv5_mobile_rec: Good balance of accuracy and speed, perfect for mobile apps
- PP-OCRv4_*: Use if you need compatibility with older PaddleOCR versions
📝 File Structure
.
├── fine_tune_text_recognition.py # Main fine-tuning script
├── prepare_dataset.py # Dataset preparation utility
├── inference_example.py # Inference and evaluation script
├── extract_lmdb_data.py # LMDB data extraction utility
├── extract_and_train.py # Complete LMDB pipeline
├── demo_document_extraction.py # Demo for document dataset
├── quick_start_example.py # Simple getting started script
├── requirements.txt # Python dependencies
├── README.md # This file
├── input_dir/ # Your input data (LMDB format)
│ └── document/ # Document dataset
│ ├── document_train/ # Training split (LMDB)
│ ├── document_val/ # Validation split (LMDB)
│ └── document_test/ # Test split (LMDB)
└── work_dir/ # Training outputs (created during training)
├── dataset/ # Prepared dataset
├── output/ # Training checkpoints
└── PP-OCRv5_server_rec_infer/ # Exported model
🎉 Gradio Comparison Demo
The toolkit also includes a Gradio demo for Chinese text recognition that compares the fine-tuned model with the original model side by side, making the effect of fine-tuning easy to see.
🚀 Quick Start:
# Launch the enhanced comparison demo
python3 demo.py
Access at: http://localhost:7860
🤝 Contributing
Feel free to submit issues, feature requests, and pull requests. For major changes, please open an issue first to discuss what you would like to change.
📄 License
This project is based on PaddleOCR and follows the same Apache 2.0 License.
🙏 Acknowledgments
- PaddleOCR team for the excellent OCR framework
- PaddlePaddle team for the deep learning platform
- Community contributors for testing and feedback
For more detailed information about PaddleOCR, visit the official documentation.