---
license: mit
task: image-classification
tags:
- document-classification
- computer-vision
- onnx
- deep-learning
- document-analysis
- jpqd
- quantized
library_name: onnxruntime
datasets:
- ds4sd/document-corpus
pipeline_tag: image-classification
---

# DocumentClassifier ONNX

**Optimized ONNX implementation of DS4SD DocumentClassifier for high-performance document type classification.**

## 🎯 Overview

DocumentClassifier is a deep learning model for automatic document type classification. This ONNX version is compressed with JPQD (Joint Pruning, Quantization, and Distillation) to provide faster inference in production environments.

### Key Features

- **High Accuracy**: Reliable document type classification across multiple categories
- **Fast Inference**: ~28 ms per document on CPU (35+ FPS)
- **Production Ready**: ONNX format for cross-platform deployment
- **Memory Efficient**: Optimized model size with JPQD compression
- **Easy Integration**: Simple Python API with comprehensive examples

## Quick Start

### Installation

```bash
pip install onnxruntime opencv-python pillow numpy
```

### Basic Usage

```python
from example import DocumentClassifierONNX

# Initialize model
classifier = DocumentClassifierONNX("DocumentClassifier.onnx")

# Classify document from image file
result = classifier.classify("document.jpg")
print(f"Document type: {result['predicted_category']}")
print(f"Confidence: {result['confidence']:.3f}")

# Get top predictions
for pred in result['top_predictions']:
    print(f"{pred['category']}: {pred['confidence']:.3f}")
```
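
The result dictionary returned above can be post-processed before it is acted on. The following is only a sketch: the helper name `route_prediction` and the 0.8 threshold are assumptions for illustration and are not part of `example.py`. It returns the predicted category only when the confidence is high enough and otherwise signals that the document needs manual review.

```python
def route_prediction(result, threshold=0.8):
    """Return the predicted category, or None if confidence is below threshold.

    `result` is assumed to have the 'predicted_category' and 'confidence'
    keys shown in the Basic Usage example. The threshold is a hypothetical
    value and should be tuned on your own data.
    """
    if result["confidence"] >= threshold:
        return result["predicted_category"]
    return None


# Hypothetical result dictionaries, shaped like the output of classify()
confident = {"predicted_category": "Letter", "confidence": 0.93}
uncertain = {"predicted_category": "Memo", "confidence": 0.41}

for r in (confident, uncertain):
    category = route_prediction(r)
    print(category if category is not None else "needs manual review")
```

Keeping the threshold as a parameter makes it easy to trade automation rate against error rate per deployment.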

### Command Line Interface

```bash
# Classify a document image
python example.py --image document.jpg

# Run performance benchmark
python example.py --benchmark --iterations 100

# Demo with dummy data
python example.py
```

## Model Specifications

| Specification | Value |
|---------------|-------|
| **Input Shape** | `[1, 3, 224, 224]` |
| **Input Type** | `float32` |
| **Output Shape** | `[1, 1280, 7, 7]` |
| **Output Type** | `float32` |
| **Model Size** | ~8.2 MB |
| **Parameters** | ~2.1M |
| **Framework** | ONNX Runtime |
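
As a minimal sketch, assuming inputs are passed as NumPy arrays, a small guard can catch shape or dtype mismatches against the table above before inference. The helper name `check_input` is hypothetical and is not part of `example.py`.

```python
import numpy as np

# Input specification taken from the table above
EXPECTED_SHAPE = (1, 3, 224, 224)
EXPECTED_DTYPE = np.float32


def check_input(batch):
    """Raise ValueError if `batch` does not match the documented input spec."""
    if batch.shape != EXPECTED_SHAPE:
        raise ValueError(f"expected shape {EXPECTED_SHAPE}, got {batch.shape}")
    if batch.dtype != EXPECTED_DTYPE:
        raise ValueError(f"expected dtype float32, got {batch.dtype}")
    return batch


# A correctly shaped dummy input passes the check
dummy = np.zeros(EXPECTED_SHAPE, dtype=np.float32)
check_input(dummy)

# A float64 array is rejected; cast with .astype(np.float32) first
try:
    check_input(dummy.astype(np.float64))
except ValueError as err:
    print(f"rejected: {err}")
```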

## 🏷️ Supported Document Categories

The model can classify documents into the following categories:

- **Article** - News articles, blog posts, web content
- **Form** - Application forms, surveys, questionnaires
- **Letter** - Business letters, correspondence
- **Memo** - Internal memos, notices
- **News** - Newspaper articles, press releases
- **Presentation** - Slides, presentation materials
- **Resume** - CVs, resumes, professional profiles
- **Scientific** - Research papers, academic documents
- **Specification** - Technical specs, manuals
- **Table** - Data tables, spreadsheet content
- **Other** - Miscellaneous document types

## ⚡ Performance Benchmarks

### Inference Speed (CPU)
- **Mean**: 28.1 ms ± 0.5 ms
- **Throughput**: ~35.6 FPS
- **Hardware**: Modern CPU (single thread)
- **Batch Size**: 1
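
The throughput figure follows directly from the mean latency (1000 ms / 28.1 ms ≈ 35.6 calls per second). The sketch below shows one way such numbers could be measured with the standard library alone. The `benchmark` helper and its parameters are illustrative assumptions, not the implementation behind `example.py --benchmark`.

```python
import statistics
import time


def latency_to_fps(mean_ms):
    """Convert a mean latency in milliseconds to calls per second."""
    return 1000.0 / mean_ms


def benchmark(fn, iterations=100, warmup=10):
    """Time repeated calls to fn() and report latency statistics in ms."""
    for _ in range(warmup):  # warm-up calls are excluded from the timings
        fn()
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(timings_ms)
    std_ms = statistics.stdev(timings_ms) if len(timings_ms) > 1 else 0.0
    fps = latency_to_fps(mean_ms) if mean_ms > 0 else float("inf")
    return {"mean_ms": mean_ms, "std_ms": std_ms, "fps": fps}


# Reproduce the arithmetic behind the reported throughput
print(f"{latency_to_fps(28.1):.1f} FPS")

# Stand-in workload; replace with e.g. lambda: classifier.classify("document.jpg")
stats = benchmark(lambda: sum(range(1000)), iterations=20, warmup=2)
print(f"{stats['mean_ms']:.4f} ms ± {stats['std_ms']:.4f} ms")
```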

### Memory Usage
- **Model Loading**: ~50 MB RAM
- **Inference**: ~100 MB RAM
- **Peak Usage**: ~150 MB RAM

## 🔧 Advanced Usage

### Batch Processing

```python
from example import DocumentClassifierONNX

classifier = DocumentClassifierONNX()

# Process multiple images (PDF pages must be converted to images first)
image_paths = ["doc1.jpg", "doc2.jpg", "doc3.png"]
results = []

for path in image_paths:
    result = classifier.classify(path)
    results.append({
        'file': path,
        'category': result['predicted_category'],
        'confidence': result['confidence']
    })

# Display results
for r in results:
    print(f"{r['file']}: {r['category']} ({r['confidence']:.3f})")
```

### Custom Preprocessing

```python
import cv2
import numpy as np

from example import DocumentClassifierONNX

# Load and preprocess image manually
image = cv2.imread("document.jpg")  # returns None if the file cannot be read
if image is None:
    raise FileNotFoundError("Could not read document.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Resize to model input size (cv2.resize defaults to bilinear interpolation)
resized = cv2.resize(image, (224, 224))
normalized = resized.astype(np.float32) / 255.0

# Convert to CHW format and add batch dimension
chw = np.transpose(normalized, (2, 0, 1))
batched = np.expand_dims(chw, axis=0)

# Run inference
classifier = DocumentClassifierONNX()
logits = classifier.predict(batched)
result = classifier.decode_output(logits)
```

## 🛠️ Integration Examples

### Flask Web Service

```python
import os
import tempfile

from flask import Flask, request, jsonify
from example import DocumentClassifierONNX

app = Flask(__name__)
classifier = DocumentClassifierONNX()  # load the model once at startup

@app.route('/classify', methods=['POST'])
def classify_document():
    file = request.files.get('document')
    if file is None:
        return jsonify({'error': "missing 'document' file field"}), 400

    # Save the upload to a unique temporary file so that
    # concurrent requests do not overwrite each other
    with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
        file.save(tmp)
        temp_path = tmp.name
    try:
        result = classifier.classify(temp_path)
    finally:
        os.remove(temp_path)

    return jsonify({
        'category': result['predicted_category'],
        'confidence': float(result['confidence']),
        'top_predictions': result['top_predictions']
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```

### Batch Processing Script

```python
import glob
import json
import os

from example import DocumentClassifierONNX

def classify_directory(input_dir, output_file):
    classifier = DocumentClassifierONNX()

    # Find all image files (PDFs must be converted to images beforehand)
    extensions = ['*.jpg', '*.jpeg', '*.png']
    files = []
    for ext in extensions:
        files.extend(glob.glob(os.path.join(input_dir, ext)))

    results = []
    for file_path in files:
        try:
            result = classifier.classify(file_path)
            results.append({
                'file': os.path.basename(file_path),
                'category': result['predicted_category'],
                # Cast to a plain float so the value is JSON serializable
                'confidence': float(result['confidence'])
            })
            print(f"OK {file_path}: {result['predicted_category']}")
        except Exception as e:
            print(f"FAILED {file_path}: Error - {e}")

    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)

# Usage
classify_directory("./documents", "classification_results.json")
```
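
Once a results file like the one written above exists, a short summary per category is often useful. The sketch below assumes the list-of-dictionaries layout (`file`, `category`, `confidence`) produced by the script. The helper name `summarize_results` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_results(results):
    """Return per-category document counts and mean confidence."""
    counts = Counter(r["category"] for r in results)
    totals = defaultdict(float)
    for r in results:
        totals[r["category"]] += r["confidence"]
    return {
        category: {"count": n, "mean_confidence": totals[category] / n}
        for category, n in counts.items()
    }


# Hypothetical records in the same layout as classification_results.json;
# in practice, load them with json.load(open("classification_results.json"))
sample = [
    {"file": "a.jpg", "category": "Letter", "confidence": 0.9},
    {"file": "b.png", "category": "Letter", "confidence": 0.7},
    {"file": "c.jpg", "category": "Form", "confidence": 0.8},
]

summary = summarize_results(sample)
for category, info in sorted(summary.items()):
    print(f"{category}: {info['count']} file(s), "
          f"mean confidence {info['mean_confidence']:.2f}")
```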

## Requirements

### System Requirements
- **Python**: 3.8 or higher
- **RAM**: Minimum 2 GB available
- **CPU**: x86_64 architecture recommended
- **OS**: Windows, Linux, macOS

### Dependencies
```
onnxruntime>=1.15.0
opencv-python>=4.5.0
numpy>=1.21.0
Pillow>=8.0.0
```

## Troubleshooting

### Common Issues

**Model Loading Error**
```python
# Ensure model file exists
import os
if not os.path.exists("DocumentClassifier.onnx"):
    print("Model file not found!")
```

**Memory Issues**
```python
# For low-memory systems, process images individually
# and clear variables after use
import gc
result = classifier.classify(image)
del image  # Free memory
gc.collect()
```

**Image Format Issues**
```python
# Convert any image format Pillow can read to RGB.
# Pillow cannot open PDFs, so rasterize PDF pages to images first.
import numpy as np
from PIL import Image
img = Image.open("document.tiff").convert("RGB")
result = classifier.classify(np.array(img))
```

## Technical Details

### Architecture
- **Base Model**: Deep Convolutional Neural Network
- **Input Processing**: Standard ImageNet preprocessing
- **Feature Extraction**: CNN backbone with global pooling
- **Classification Head**: Dense layers with softmax activation
- **Optimization**: JPQD compression (joint pruning, quantization, and distillation) for size and speed

### Preprocessing Pipeline
1. **Image Loading**: PIL/OpenCV image loading
2. **Resizing**: Bilinear interpolation to 224×224
3. **Normalization**: [0, 255] → [0, 1] range
4. **Format Conversion**: HWC → CHW (channels first)
5. **Batch Addition**: Single image → batch dimension

### Output Processing
1. **Feature Extraction**: CNN backbone outputs [1, 1280, 7, 7]
2. **Global Pooling**: Spatial averaging to [1, 1280]
3. **Classification**: Map features to category probabilities
4. **Top-K Selection**: Return most likely categories
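
The output processing steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code in `example.py`: the README does not specify how pooled features are mapped to categories, so a random weight matrix stands in for the classification head, and the category order is simply the order of the list in this README.

```python
import numpy as np

# Assumed label order: the order of the category list in this README
CATEGORIES = [
    "Article", "Form", "Letter", "Memo", "News", "Presentation",
    "Resume", "Scientific", "Specification", "Table", "Other",
]


def global_average_pool(features):
    """Average over the spatial axes: [N, C, H, W] -> [N, C]."""
    return features.mean(axis=(2, 3))


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k(probs, k=3):
    """Return the k most likely (category, probability) pairs."""
    order = np.argsort(probs)[::-1][:k]
    return [(CATEGORIES[i], float(probs[i])) for i in order]


rng = np.random.default_rng(0)
# Dummy backbone output with the documented shape
features = rng.standard_normal((1, 1280, 7, 7)).astype(np.float32)
# Hypothetical classification head: random weights, for illustration only
weights = (0.01 * rng.standard_normal((1280, len(CATEGORIES)))).astype(np.float32)

pooled = global_average_pool(features)   # shape (1, 1280)
probs = softmax(pooled @ weights)[0]     # shape (11,)
print(top_k(probs, k=3))
```

Subtracting the row maximum before exponentiating keeps the softmax stable for large logits without changing its result.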

## Citation

If you use this model in your research, please cite:

```bibtex
@article{docling2024,
  title={Docling Technical Report},
  author={DS4SD Team},
  journal={arXiv preprint arXiv:2408.09869},
  year={2024}
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

## Support

- **Issues**: [GitHub Issues](https://github.com/asmud/ds4sd-DocumentClassifier-onnx/issues)
- **Documentation**: This README and inline code comments
- **Examples**: See `example.py` for comprehensive usage examples

## Changelog

### v1.0.0
- Initial ONNX model release
- JPQD optimization applied
- Complete Python API
- Command-line interface
- Comprehensive documentation
- Performance benchmarks

---

**Made with ❤️ by the DS4SD Community**