# granite-docling ONNX Conversion Guide

## Technical Reproduction Instructions

This document provides complete instructions for reproducing the granite-docling ONNX conversion.

### Prerequisites

- Python 3.10+
- ~4GB available RAM
- ~2GB disk space for the conversion environment

### Step 1: Environment Setup

```bash
# Create an isolated environment
python3 -m venv onnx_converter
source onnx_converter/bin/activate   # Linux/Mac
# onnx_converter\Scripts\activate    # Windows

# Install dependencies
pip install torch torchvision transformers "optimum[onnxruntime]" safetensors
```

### Step 2: Download Original Model

```bash
# Download the granite-docling SafeTensors model and its configs
mkdir granite-docling-258m
cd granite-docling-258m
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/model.safetensors" -o model.safetensors
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/config.json" -o config.json
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/tokenizer.json" -o tokenizer.json
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/preprocessor_config.json" -o preprocessor_config.json
```

### Step 3: Install IBM Experimental Fork

```bash
# Clone the IBM experimental optimum-onnx fork
git clone https://github.com/gabe-l-hart/optimum-onnx.git
cd optimum-onnx
git checkout Idefics3Support

# Install the fork in editable mode, replacing any stock install
pip install -e . --force-reinstall
```

### Step 4: Convert to ONNX

```python
import os

# Force CPU: must be set before torch initializes CUDA
os.environ['CUDA_VISIBLE_DEVICES'] = ''

from pathlib import Path

import torch
from transformers import Idefics3ForConditionalGeneration
from optimum.exporters.onnx import export
from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig

# Load the model on CPU in full precision
model = Idefics3ForConditionalGeneration.from_pretrained(
    './granite-docling-258m',
    trust_remote_code=True,
    torch_dtype=torch.float32,
).to('cpu')

# Build the ONNX export config for the image-to-text task
onnx_config = Idefics3OnnxConfig(model.config, task='image-to-text')

# Export with ONNX opset 17
output_path = Path('./granite_docling.onnx')
export(model, onnx_config, output_path, opset=17)

print(f"ONNX conversion complete: {output_path}")
```

### Expected Output

```
Initializing Idefics3ModelPatcher
Entering Idefics3ModelPatcher context
Patching Idefics3 model
Using patched position embedding forward
Exiting Idefics3ModelPatcher context
ONNX conversion complete: granite_docling.onnx
```

The exported `granite_docling.onnx` should be roughly 1.2GB on disk.

### Validation

```python
import onnxruntime as ort

# Confirm the ONNX model loads
session = ort.InferenceSession('granite_docling.onnx')
print("✅ ONNX model loads successfully")

# Inspect input/output specifications
for inp in session.get_inputs():
    print(f"Input: {inp.name} - {inp.shape}")
for out in session.get_outputs():
    print(f"Output: {out.name} - {out.shape}")
```

For an end-to-end forward-pass check, see the appendix.

## Troubleshooting

### Common Issues

1. **"Custom architecture" error**: Make sure the IBM experimental fork from Step 3 is installed, not stock optimum (a quick import check is shown in the appendix)
2. **Memory errors**: Force CPU-only conversion (`CUDA_VISIBLE_DEVICES=''`)
3. **Import errors**: Verify the experimental fork was installed in editable mode (`pip install -e .`)

### Technical Notes

- **Conversion time**: 5-10 minutes on a typical CPU
- **Memory usage**: ~4GB RAM during conversion
- **Warnings**: TracerWarnings are expected when tracing a complex VLM
- **File size**: ONNX (~1.2GB) vs SafeTensors (~492MB); the export runs in float32 (the script loads with `torch_dtype=torch.float32`) and the ONNX file additionally embeds the serialized graph

## Attribution

- Original model: IBM Research granite-docling-258M
- Conversion method: IBM experimental Idefics3Support optimum-onnx fork
- Documentation: lamco-development
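
## Appendix: Optional Checks

### Verify the Experimental Fork Is Active

The "Custom architecture" and import errors above usually mean stock optimum is still on the path. A one-line check, using the same import the conversion script in Step 4 relies on:

```bash
# Succeeds only if the fork's Idefics3 export config is importable;
# stock optimum-onnx raises an ImportError here.
python -c "from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig; print('✅ Idefics3 export config found')"
```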
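
### Forward-Pass Smoke Test

Beyond confirming that the graph loads, you can push one batch through the exported model. The sketch below is an assumption-laden example, not part of the original conversion recipe: it uses the model's own preprocessor to build inputs (so shapes and dtypes match the export), guesses `<image>` plus a short instruction as the prompt format (consult the model card for the real one), and filters the feed down to whatever input names the graph actually declares.

```python
import numpy as np
import onnxruntime as ort
from PIL import Image
from transformers import AutoProcessor

# Build inputs with the model's own preprocessor rather than
# hand-crafting tensors, so shapes and dtypes match the export.
processor = AutoProcessor.from_pretrained('./granite-docling-258m')
image = Image.new('RGB', (512, 512), 'white')  # blank stand-in page
# ASSUMPTION: '<image>' placeholder + instruction; the documented
# granite-docling prompt format may differ.
inputs = processor(text='<image>Convert this page to docling.',
                   images=[image], return_tensors='np')

session = ort.InferenceSession('granite_docling.onnx')

# Feed only the inputs the graph declares; warn about any it
# expects that the processor did not produce.
graph_inputs = {i.name for i in session.get_inputs()}
feeds = {k: np.asarray(v) for k, v in inputs.items() if k in graph_inputs}
missing = graph_inputs - feeds.keys()
if missing:
    print(f"⚠️ graph inputs not covered by the processor: {missing}")

outputs = session.run(None, feeds)
for spec, value in zip(session.get_outputs(), outputs):
    print(f"{spec.name}: shape={value.shape}, dtype={value.dtype}")
```

Note that this exercises a single forward pass only; autoregressive decoding would need a generation loop around the session, which is out of scope for this guide.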