---
license: mit
task: image-classification
tags:
- document-classification
- computer-vision
- onnx
- deep-learning
- document-analysis
- jpqd
- quantized
library_name: onnxruntime
datasets:
- ds4sd/document-corpus
pipeline_tag: image-classification
---
# DocumentClassifier ONNX
**Optimized ONNX implementation of DS4SD DocumentClassifier for high-performance document type classification.**
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![ONNX](https://img.shields.io/badge/ONNX-1.15+-blue.svg)](https://onnx.ai/)
[![Python 3.8+](https://img.shields.io/badge/Python-3.8+-green.svg)](https://www.python.org/)
## 🎯 Overview
DocumentClassifier is a deep learning model designed for automatic document type classification. This ONNX version provides optimized inference for production environments with enhanced performance through JPQD (Joint Pruning, Quantization, and Distillation) optimization.
### Key Features
- **High Accuracy**: Reliable document type classification across multiple categories
- **Fast Inference**: ~28ms per document on CPU (35+ FPS)
- **Production Ready**: ONNX format for cross-platform deployment
- **Memory Efficient**: Optimized model size with JPQD compression
- **Easy Integration**: Simple Python API with comprehensive examples
## 🚀 Quick Start
### Installation
```bash
pip install onnxruntime opencv-python pillow numpy
```
### Basic Usage
```python
from example import DocumentClassifierONNX

# Initialize model
classifier = DocumentClassifierONNX("DocumentClassifier.onnx")

# Classify document from image file
result = classifier.classify("document.jpg")
print(f"Document type: {result['predicted_category']}")
print(f"Confidence: {result['confidence']:.3f}")

# Get top predictions
for pred in result['top_predictions']:
    print(f"{pred['category']}: {pred['confidence']:.3f}")
```
### Command Line Interface
```bash
# Classify a document image
python example.py --image document.jpg
# Run performance benchmark
python example.py --benchmark --iterations 100
# Demo with dummy data
python example.py
```
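For readers wiring these flags into their own script, the following is a minimal `argparse` sketch of how the three modes could be dispatched. It is not the actual code of `example.py`: the helper names `build_parser` and `select_mode`, the default of 100 iterations, and the precedence of `--benchmark` over `--image` are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown above (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Document classification demo")
    parser.add_argument("--image", help="path to a document image to classify")
    parser.add_argument("--benchmark", action="store_true",
                        help="run a timing benchmark")
    # ASSUMPTION: the default iteration count is not stated in this README.
    parser.add_argument("--iterations", type=int, default=100,
                        help="number of benchmark iterations")
    return parser


def select_mode(args: argparse.Namespace) -> str:
    """Pick a mode; the precedence order here is an assumption."""
    if args.benchmark:
        return "benchmark"
    if args.image:
        return "classify"
    return "demo"  # no flags: fall back to the dummy-data demo


args = build_parser().parse_args(["--benchmark", "--iterations", "100"])
print(select_mode(args), args.iterations)
```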
## 📊 Model Specifications
| Specification | Value |
|---------------|-------|
| **Input Shape** | `[1, 3, 224, 224]` |
| **Input Type** | `float32` |
| **Output Shape** | `[1, 1280, 7, 7]` |
| **Output Type** | `float32` |
| **Model Size** | ~8.2MB |
| **Parameters** | ~2.1M |
| **Framework** | ONNX Runtime |
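A quick check against the table above can catch layout or dtype mistakes before a tensor reaches the runtime. The sketch below uses only NumPy; `validate_input` is a hypothetical helper name and is not part of `example.py`.

```python
import numpy as np

# Expected input signature, taken from the specification table above.
EXPECTED_SHAPE = (1, 3, 224, 224)
EXPECTED_DTYPE = np.float32


def validate_input(tensor: np.ndarray) -> np.ndarray:
    """Raise ValueError if `tensor` does not match the documented input spec.

    Hypothetical helper for illustration only.
    """
    if tensor.shape != EXPECTED_SHAPE:
        raise ValueError(f"expected shape {EXPECTED_SHAPE}, got {tensor.shape}")
    if tensor.dtype != EXPECTED_DTYPE:
        raise ValueError(f"expected dtype float32, got {tensor.dtype}")
    return tensor


# A dummy tensor with the documented layout passes through unchanged.
dummy = np.zeros(EXPECTED_SHAPE, dtype=np.float32)
validate_input(dummy)
```

A common failure this catches is passing an image in HWC layout (`[1, 224, 224, 3]`) or as `uint8` straight from the image loader.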
## 🏷️ Supported Document Categories
The model can classify documents into the following categories:
- **Article** - News articles, blog posts, web content
- **Form** - Application forms, surveys, questionnaires
- **Letter** - Business letters, correspondence
- **Memo** - Internal memos, notices
- **News** - Newspaper articles, press releases
- **Presentation** - Slides, presentation materials
- **Resume** - CVs, resumes, professional profiles
- **Scientific** - Research papers, academic documents
- **Specification** - Technical specs, manuals
- **Table** - Data tables, spreadsheet content
- **Other** - Miscellaneous document types
## ⚡ Performance Benchmarks
### Inference Speed (CPU)
- **Mean**: 28.1ms ± 0.5ms
- **Throughput**: ~35.6 FPS
- **Hardware**: Modern CPU (single thread)
- **Batch Size**: 1
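
The throughput figure follows directly from the mean latency: 1000 ms / 28.1 ms ≈ 35.6 inferences per second. To reproduce such numbers on your own hardware, a timing harness along the following lines may help. It is a sketch using only the standard library; `benchmark` and `latency_to_fps` are hypothetical names, and this is not the implementation behind `python example.py --benchmark`.

```python
import statistics
import time
from typing import Callable, Dict


def latency_to_fps(mean_ms: float) -> float:
    """Convert a mean latency in milliseconds to calls per second."""
    return 1000.0 / mean_ms


def benchmark(fn: Callable[[], object], iterations: int = 100,
              warmup: int = 5) -> Dict[str, float]:
    """Time `fn` and report mean/std latency (ms) and throughput (calls/s)."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but not timed
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(samples_ms)
    std_ms = statistics.stdev(samples_ms) if len(samples_ms) > 1 else 0.0
    fps = latency_to_fps(mean_ms) if mean_ms > 0 else float("inf")
    return {"mean_ms": mean_ms, "std_ms": std_ms, "fps": fps}


# Stand-in workload; replace the lambda with a real inference call.
stats = benchmark(lambda: sum(range(10_000)), iterations=20)
print(f"{stats['mean_ms']:.3f} ms ± {stats['std_ms']:.3f} ms, {stats['fps']:.1f} calls/s")
```

To time the classifier itself, pass something like `lambda: classifier.classify("document.jpg")` as `fn`.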
### Memory Usage
- **Model Loading**: ~50MB RAM
- **Inference**: ~100MB RAM
- **Peak Usage**: ~150MB RAM
## 🔧 Advanced Usage
### Batch Processing
```python
from example import DocumentClassifierONNX

classifier = DocumentClassifierONNX()

# Process multiple images
# (OpenCV and Pillow cannot read PDFs; rasterize PDF pages to images first)
image_paths = ["doc1.jpg", "doc2.jpg", "doc3.png"]
results = []
for path in image_paths:
    result = classifier.classify(path)
    results.append({
        'file': path,
        'category': result['predicted_category'],
        'confidence': result['confidence']
    })

# Display results
for r in results:
    print(f"{r['file']}: {r['category']} ({r['confidence']:.3f})")
```
### Custom Preprocessing
```python
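import cv2
import numpy as np
from example import DocumentClassifierONNX

# Load and preprocess image manually
image = cv2.imread("document.jpg")
if image is None:  # cv2.imread returns None instead of raising on failure
    raise FileNotFoundError("Could not read document.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # OpenCV loads images as BGR

# Resize to model input size (cv2.resize defaults to bilinear interpolation)
resized = cv2.resize(image, (224, 224))
normalized = resized.astype(np.float32) / 255.0

# Convert to CHW format and add batch dimension
chw = np.transpose(normalized, (2, 0, 1))
batched = np.expand_dims(chw, axis=0)  # shape: (1, 3, 224, 224)

# Run inference
classifier = DocumentClassifierONNX()
logits = classifier.predict(batched)
result = classifier.decode_output(logits)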
```
## 🛠️ Integration Examples
### Flask Web Service
```python
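import os
import tempfile

from flask import Flask, request, jsonify
from example import DocumentClassifierONNX

app = Flask(__name__)
classifier = DocumentClassifierONNX()  # load the model once at startup


@app.route('/classify', methods=['POST'])
def classify_document():
    file = request.files.get('document')
    if file is None:
        return jsonify({'error': "missing 'document' file field"}), 400

    # Save the upload into a per-request temporary directory so that
    # concurrent requests do not overwrite each other's files
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'document.jpg')
        file.save(path)
        result = classifier.classify(path)

    return jsonify({
        'category': result['predicted_category'],
        'confidence': float(result['confidence']),
        'top_predictions': result['top_predictions']
    })


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)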
```
### Batch Processing Script
```python
import glob
import json
import os

from example import DocumentClassifierONNX


def classify_directory(input_dir, output_file):
    classifier = DocumentClassifierONNX()

    # Find all image files
    # (PDFs are excluded: OpenCV and Pillow cannot read them directly)
    extensions = ['*.jpg', '*.jpeg', '*.png']
    files = []
    for ext in extensions:
        files.extend(glob.glob(os.path.join(input_dir, ext)))

    results = []
    for file_path in files:
        try:
            result = classifier.classify(file_path)
            results.append({
                'file': os.path.basename(file_path),
                'category': result['predicted_category'],
                # Cast to a plain float so the value is JSON-serializable
                'confidence': float(result['confidence'])
            })
            print(f"✓ {file_path}: {result['predicted_category']}")
        except Exception as e:
            print(f"✗ {file_path}: Error - {e}")

    # Save results
    with open(output_file, 'w') as f:
        json.dump(results, f, indent=2)


# Usage
classify_directory("./documents", "classification_results.json")
```
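Once a results file exists, a per-category tally and a list of low-confidence files are often the first things worth looking at. The sketch below works on the same record layout that `classify_directory` writes (`file`, `category`, `confidence`). The `summarize` helper and the 0.5 review threshold are assumptions for illustration, and the sample records are made up.

```python
from collections import Counter
from typing import Dict, List


def summarize(results: List[dict], min_confidence: float = 0.5) -> Dict[str, object]:
    """Count records per category and list files below `min_confidence`.

    ASSUMPTION: 0.5 is an arbitrary review threshold, not a recommended value.
    """
    counts = Counter(r["category"] for r in results)
    low = [r["file"] for r in results if r["confidence"] < min_confidence]
    return {"counts": dict(counts), "low_confidence": low}


# Made-up sample records in the format written by classify_directory.
sample = [
    {"file": "a.jpg", "category": "Form", "confidence": 0.91},
    {"file": "b.png", "category": "Form", "confidence": 0.78},
    {"file": "c.jpg", "category": "Letter", "confidence": 0.42},
]
summary = summarize(sample)
print(summary["counts"])
print(summary["low_confidence"])
```

With a real run, load the records via `json.load` from `classification_results.json` and pass them to `summarize` in place of `sample`.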
## 📋 Requirements
### System Requirements
- **Python**: 3.8 or higher
- **RAM**: Minimum 2GB available
- **CPU**: x86_64 architecture recommended
- **OS**: Windows, Linux, macOS
### Dependencies
```
onnxruntime>=1.15.0
opencv-python>=4.5.0
numpy>=1.21.0
Pillow>=8.0.0
```
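To see which of these packages are present in the current environment, the standard library's `importlib.metadata` (available from Python 3.8) can be queried. The sketch below only reports installed versions; it does not compare them against the minimums listed above. `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence

# Distribution names as listed in the dependency block above.
REQUIRED = ("onnxruntime", "opencv-python", "numpy", "Pillow")


def installed_versions(packages: Sequence[str] = REQUIRED) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Note that OpenCV installed under a different distribution name (for example `opencv-python-headless`) will be reported as missing by this check.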
## 🔍 Troubleshooting
### Common Issues
**Model Loading Error**
```python
# Ensure model file exists
import os
if not os.path.exists("DocumentClassifier.onnx"):
    print("Model file not found!")
```
**Memory Issues**
```python
# For low-memory systems, process images individually
# and clear variables after use
import gc
result = classifier.classify(image)
del image # Free memory
gc.collect()
```
**Image Format Issues**
```python
# Convert any Pillow-readable image to RGB
# (Pillow cannot open PDFs; rasterize PDF pages to images first)
import numpy as np
from PIL import Image
img = Image.open("document.png").convert("RGB")
result = classifier.classify(np.array(img))
```
## 📖 Technical Details
### Architecture
- **Base Model**: Deep Convolutional Neural Network
- **Input Processing**: Standard ImageNet preprocessing
- **Feature Extraction**: CNN backbone with global pooling
- **Classification Head**: Dense layers with softmax activation
- **Optimization**: JPQD quantization for size and speed
### Preprocessing Pipeline
1. **Image Loading**: PIL/OpenCV image loading
2. **Resizing**: Bilinear interpolation to 224×224
3. **Normalization**: [0, 255] → [0, 1] range
4. **Format Conversion**: HWC → CHW (channels first)
5. **Batch Addition**: Single image → batch dimension
### Output Processing
1. **Feature Extraction**: CNN backbone outputs [1, 1280, 7, 7]
2. **Global Pooling**: Spatial averaging to [1, 1280]
3. **Classification**: Map features to category probabilities
4. **Top-K Selection**: Return most likely categories
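
The four steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the code in `example.py`: the feature map is random stand-in data, the classification-head weights are random placeholders (this README does not describe the real head), and the index order of `CATEGORIES` is assumed to follow the list in "Supported Document Categories".

```python
import numpy as np

# ASSUMPTION: the class index order follows the category list in this README.
CATEGORIES = ["Article", "Form", "Letter", "Memo", "News", "Presentation",
              "Resume", "Scientific", "Specification", "Table", "Other"]


def global_average_pool(features: np.ndarray) -> np.ndarray:
    """[N, C, H, W] -> [N, C] by averaging over the spatial axes."""
    return features.mean(axis=(2, 3))


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k(probs: np.ndarray, k: int = 3):
    """Return the k most likely (category, probability) pairs, best first."""
    order = np.argsort(probs)[::-1][:k]
    return [(CATEGORIES[i], float(probs[i])) for i in order]


rng = np.random.default_rng(0)
# Stand-in for the backbone output documented as [1, 1280, 7, 7].
features = rng.standard_normal((1, 1280, 7, 7)).astype(np.float32)
pooled = global_average_pool(features)  # shape (1, 1280)

# ASSUMPTION: random placeholder weights standing in for the real head.
head_weights = rng.standard_normal((1280, len(CATEGORIES))).astype(np.float32)
probs = softmax(pooled @ head_weights)  # shape (1, 11), rows sum to 1
predictions = top_k(probs[0], k=3)
print(predictions)
```

Subtracting the row maximum before exponentiating leaves the softmax result unchanged while avoiding overflow for large logits.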
## 📚 Citation
If you use this model in your research, please cite:
```bibtex
@article{docling2024,
title={Docling Technical Report},
author={DS4SD Team},
journal={arXiv preprint arXiv:2408.09869},
year={2024}
}
```
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
## 🆘 Support
- **Issues**: [GitHub Issues](https://github.com/asmud/ds4sd-DocumentClassifier-onnx/issues)
- **Documentation**: This README and inline code comments
- **Examples**: See `example.py` for comprehensive usage examples
## 📈 Changelog
### v1.0.0
- Initial ONNX model release
- JPQD optimization applied
- Complete Python API
- CLI interface
- Comprehensive documentation
- Performance benchmarks
---
**Made with ❤️ by the DS4SD Community**