	---
	license: mit
	task: image-classification
	tags:
	- document-classification
	- computer-vision
	- onnx
	- deep-learning
	- document-analysis
	- jpqd
	- quantized
	library_name: onnxruntime
	datasets:
	- ds4sd/document-corpus
	pipeline_tag: image-classification
	---

	# DocumentClassifier ONNX

	Optimized ONNX implementation of DS4SD DocumentClassifier for high-performance document type classification.

	[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
	[![ONNX](https://img.shields.io/badge/ONNX-1.15+-blue.svg)](https://onnx.ai/)
	[![Python 3.8+](https://img.shields.io/badge/Python-3.8+-green.svg)](https://www.python.org/)

	## 🎯 Overview

	DocumentClassifier is a deep learning model designed for automatic document type classification. This ONNX version provides optimized inference for production environments with enhanced performance through JPQD (Joint Pruning, Quantization, and Distillation) optimization.

	### Key Features

	- High Accuracy: Reliable document type classification across multiple categories
	- Fast Inference: ~28ms per document on CPU (35+ FPS)
	- Production Ready: ONNX format for cross-platform deployment
	- Memory Efficient: Optimized model size with JPQD compression
	- Easy Integration: Simple Python API with comprehensive examples

	## 🚀 Quick Start

	### Installation

	```bash
	pip install onnxruntime opencv-python pillow numpy
	```

	### Basic Usage

	```python
	from example import DocumentClassifierONNX
	import cv2

	# Initialize model
	classifier = DocumentClassifierONNX("DocumentClassifier.onnx")

	# Classify document from image file
	result = classifier.classify("document.jpg")
	print(f"Document type: {result['predicted_category']}")
	print(f"Confidence: {result['confidence']:.3f}")

	# Get top predictions
	for pred in result['top_predictions']:
	    print(f"{pred['category']}: {pred['confidence']:.3f}")
	```

	### Command Line Interface

	```bash
	# Classify a document image
	python example.py --image document.jpg

	# Run performance benchmark
	python example.py --benchmark --iterations 100

	# Demo with dummy data
	python example.py
	```

	## 📊 Model Specifications

	| Specification | Value |
	|---------------|-------|
	| Input Shape | `[1, 3, 224, 224]` |
	| Input Type | `float32` |
	| Output Shape | `[1, 1280, 7, 7]` |
	| Output Type | `float32` |
	| Model Size | ~8.2 MB |
	| Parameters | ~2.1M |
	| Framework | ONNX Runtime |
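
To make the table concrete, the sketch below checks an array against the documented input shape and dtype before it is passed to the model. It also shows how the shapes could be read back from the ONNX file with ONNX Runtime's `InferenceSession.get_inputs()` and `get_outputs()`. `validate_input` and `describe_model` are hypothetical helpers, not part of `example.py`, and the default model path is taken from the Basic Usage example.

```python
import numpy as np

EXPECTED_INPUT_SHAPE = (1, 3, 224, 224)  # from the specification table


def validate_input(batch):
    """Raise ValueError if `batch` does not match the documented input spec."""
    if batch.dtype != np.float32:
        raise ValueError(f"expected float32, got {batch.dtype}")
    if batch.shape != EXPECTED_INPUT_SHAPE:
        raise ValueError(f"expected shape {EXPECTED_INPUT_SHAPE}, got {batch.shape}")
    return batch


def describe_model(model_path="DocumentClassifier.onnx"):
    """Print the input/output names, shapes, and types stored in the ONNX file."""
    import onnxruntime as ort  # imported lazily so validate_input works without it

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    for tensor in session.get_inputs() + session.get_outputs():
        print(tensor.name, tensor.shape, tensor.type)


# A correctly shaped float32 batch passes silently.
validate_input(np.zeros(EXPECTED_INPUT_SHAPE, dtype=np.float32))
```

Calling `describe_model()` with the model file present should report shapes matching the table; if it does not, trust the file over this README.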

	## 🏷️ Supported Document Categories

	The model can classify documents into the following categories:

	- Article - News articles, blog posts, web content
	- Form - Application forms, surveys, questionnaires
	- Letter - Business letters, correspondence
	- Memo - Internal memos, notices
	- News - Newspaper articles, press releases
	- Presentation - Slides, presentation materials
	- Resume - CVs, resumes, professional profiles
	- Scientific - Research papers, academic documents
	- Specification - Technical specs, manuals
	- Table - Data tables, spreadsheet content
	- Other - Miscellaneous document types

	## ⚡ Performance Benchmarks

	### Inference Speed (CPU)
	- Mean: 28.1ms ± 0.5ms
	- Throughput: ~35.6 FPS
	- Hardware: Modern CPU (single thread)
	- Batch Size: 1
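
The throughput figure follows from the mean latency: 1000 ms / 28.1 ms ≈ 35.6 inferences per second. A minimal sketch of how such numbers could be measured is shown below. `benchmark` and `dummy_inference` are hypothetical names, not part of `example.py`, and the dummy workload stands in for a real `classifier.classify(...)` call.

```python
import statistics
import time


def benchmark(fn, iterations=100, warmup=5):
    """Time `fn` repeatedly; return mean/stdev latency (ms) and throughput (FPS)."""
    for _ in range(warmup):  # warm-up runs are excluded from the statistics
        fn()
    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    mean_ms = statistics.mean(latencies_ms)
    return {
        "mean_ms": mean_ms,
        "stdev_ms": statistics.stdev(latencies_ms) if iterations > 1 else 0.0,
        "fps": 1000.0 / mean_ms,
    }


def dummy_inference():
    """Placeholder workload (assumption); swap in a real classifier call."""
    sum(i * i for i in range(10_000))


stats = benchmark(dummy_inference, iterations=20)
print(f"Mean: {stats['mean_ms']:.2f}ms ± {stats['stdev_ms']:.2f}ms, ~{stats['fps']:.1f} FPS")
```

To time the real model, pass something like `lambda: classifier.classify("document.jpg")` as `fn`; results will vary with hardware and thread settings.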

	### Memory Usage
	- Model Loading: ~50MB RAM
	- Inference: ~100MB RAM
	- Peak Usage: ~150MB RAM

	## 🔧 Advanced Usage

	### Batch Processing

	```python
	import numpy as np
	from example import DocumentClassifierONNX

	classifier = DocumentClassifierONNX()

	# Process multiple images (OpenCV and Pillow cannot decode PDFs,
	# so render PDF pages to images first)
	image_paths = ["doc1.jpg", "doc2.jpeg", "doc3.png"]
	results = []

	for path in image_paths:
	    result = classifier.classify(path)
	    results.append({
	        'file': path,
	        'category': result['predicted_category'],
	        'confidence': result['confidence']
	    })

	# Display results
	for r in results:
	    print(f"{r['file']}: {r['category']} ({r['confidence']:.3f})")
	```

	### Custom Preprocessing

	```python
	import cv2
	import numpy as np
	from example import DocumentClassifierONNX

	# Load and preprocess image manually
	image = cv2.imread("document.jpg")
	if image is None:  # cv2.imread returns None instead of raising on failure
	    raise FileNotFoundError("Could not read document.jpg")
	image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

	# Resize to model input size (cv2.resize defaults to bilinear interpolation)
	resized = cv2.resize(image, (224, 224))
	normalized = resized.astype(np.float32) / 255.0

	# Convert to CHW format and add batch dimension
	chw = np.transpose(normalized, (2, 0, 1))
	batched = np.expand_dims(chw, axis=0)

	# Run inference
	classifier = DocumentClassifierONNX()
	logits = classifier.predict(batched)
	result = classifier.decode_output(logits)
	```

	## 🛠️ Integration Examples

	### Flask Web Service

	```python
	from flask import Flask, request, jsonify
	from example import DocumentClassifierONNX

	app = Flask(__name__)
	classifier = DocumentClassifierONNX()

	@app.route('/classify', methods=['POST'])
	def classify_document():
	    file = request.files['document']

	    # Save and process file
	    file.save('temp_document.jpg')
	    result = classifier.classify('temp_document.jpg')

	    return jsonify({
	        'category': result['predicted_category'],
	        'confidence': float(result['confidence']),
	        'top_predictions': result['top_predictions']
	    })

	if __name__ == '__main__':
	    app.run(host='0.0.0.0', port=5000)
	```

	### Batch Processing Script

	```python
	import glob
	import json
	import os
	from example import DocumentClassifierONNX

	def classify_directory(input_dir, output_file):
	    classifier = DocumentClassifierONNX()

	    # Find all image files (PDFs are skipped: OpenCV and Pillow cannot decode them)
	    extensions = ['.jpg', '.jpeg', '.png']
	    files = []
	    for ext in extensions:
	        # The '*' wildcard is required to match files by extension
	        files.extend(glob.glob(os.path.join(input_dir, '*' + ext)))

	    results = []
	    for file_path in files:
	        try:
	            result = classifier.classify(file_path)
	            results.append({
	                'file': os.path.basename(file_path),
	                'category': result['predicted_category'],
	                'confidence': float(result['confidence'])  # plain float for JSON
	            })
	            print(f"✓ {file_path}: {result['predicted_category']}")
	        except Exception as e:
	            print(f"✗ {file_path}: Error - {e}")

	    # Save results
	    with open(output_file, 'w') as f:
	        json.dump(results, f, indent=2)

	# Usage
	classify_directory("./documents", "classification_results.json")
	```

	## 📋 Requirements

	### System Requirements
	- Python: 3.8 or higher
	- RAM: Minimum 2GB available
	- CPU: x86_64 architecture recommended
	- OS: Windows, Linux, macOS

	### Dependencies
	```
	onnxruntime>=1.15.0
	opencv-python>=4.5.0
	numpy>=1.21.0
	Pillow>=8.0.0
	```

	## 🔍 Troubleshooting

	### Common Issues

	Model Loading Error
	```python
	# Ensure model file exists
	import os
	if not os.path.exists("DocumentClassifier.onnx"):
	    print("Model file not found!")
	```

	Memory Issues
	```python
	# For low-memory systems, process images individually
	# and clear variables after use
	import gc
	result = classifier.classify(image)
	del image # Free memory
	gc.collect()
	```

	Image Format Issues
	```python
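	# Convert any Pillow-readable image (e.g. TIFF, grayscale, RGBA) to RGB.
	# Pillow cannot read PDFs; render PDF pages to images first.
	import numpy as np
	from PIL import Image
	img = Image.open("document.tiff").convert("RGB")
	result = classifier.classify(np.array(img))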
	```

	## 📖 Technical Details

	### Architecture
	- Base Model: Deep Convolutional Neural Network
	- Input Processing: Standard ImageNet preprocessing
	- Feature Extraction: CNN backbone with global pooling
	- Classification Head: Dense layers with softmax activation
	- Optimization: JPQD (joint pruning, quantization, and distillation) for size and speed

	### Preprocessing Pipeline
	1. Image Loading: PIL/OpenCV image loading
	2. Resizing: Bilinear interpolation to 224×224
	3. Normalization: [0, 255] → [0, 1] range
	4. Format Conversion: HWC → CHW (channels first)
	5. Batch Addition: Single image → batch dimension

	### Output Processing
	1. Feature Extraction: CNN backbone outputs [1, 1280, 7, 7]
	2. Global Pooling: Spatial averaging to [1, 1280]
	3. Classification: Map features to category probabilities
	4. Top-K Selection: Return most likely categories
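
The four steps above can be sketched in NumPy. This is an illustration under stated assumptions, not the code in `example.py`: the backbone output is random stand-in data, the classification weights are random placeholders (this README does not describe the classification head in detail), and the label order in `CATEGORIES` is assumed to follow the category list in this README.

```python
import numpy as np

# Assumed label order, copied from the category list in this README.
CATEGORIES = ["Article", "Form", "Letter", "Memo", "News", "Presentation",
              "Resume", "Scientific", "Specification", "Table", "Other"]


def global_average_pool(features):
    """[N, C, H, W] -> [N, C] by averaging over the spatial axes."""
    return features.mean(axis=(2, 3))


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def top_k(probs, k=3):
    """Return the k most likely (category, probability) pairs for one sample."""
    order = np.argsort(probs)[::-1][:k]
    return [(CATEGORIES[i], float(probs[i])) for i in order]


rng = np.random.default_rng(0)
# Stand-in for the backbone output described in step 1.
features = rng.standard_normal((1, 1280, 7, 7)).astype(np.float32)

pooled = global_average_pool(features)  # shape [1, 1280]
# Hypothetical linear head with random weights, for illustration only.
weights = rng.standard_normal((1280, len(CATEGORIES))).astype(np.float32)
probs = softmax(pooled @ weights)  # shape [1, 11]

for category, p in top_k(probs[0]):
    print(f"{category}: {p:.3f}")
```

Because the weights are random, the printed categories are meaningless; the sketch only shows how the tensor shapes change from step to step.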

	## 📚 Citation

	If you use this model in your research, please cite:

	```bibtex
	@article{docling2024,
	title={Docling Technical Report},
	author={DS4SD Team},
	journal={arXiv preprint arXiv:2408.09869},
	year={2024}
	}
	```

	## 📄 License

	This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

	## 🤝 Contributing

	Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

	## 🆘 Support

	- Issues: [GitHub Issues](https://github.com/asmud/ds4sd-DocumentClassifier-onnx/issues)
	- Documentation: This README and inline code comments
	- Examples: See `example.py` for comprehensive usage examples

	## 📈 Changelog

	### v1.0.0
	- Initial ONNX model release
	- JPQD optimization applied
	- Complete Python API
	- Command-line interface
	- Comprehensive documentation
	- Performance benchmarks

	---

	Made with ❤️ by the DS4SD Community