How to use transformers for PaddleOCR-VL inference?
Excellent work! It would be more convenient if PaddleOCR-VL supported transformers-backed inference.
Hello, we currently support inference with the PaddleOCR-VL-0.9B model through the transformers library, which can recognize text, formulas, tables, and chart elements. In the future, we plan to support full document parsing inference with transformers as well. Below is a simple script for running the PaddleOCR-VL-0.9B model with transformers. For now, we recommend the official deployment method for inference, which is faster and supports page-level document parsing.
If you need any further assistance, feel free to ask!
# -*- coding: utf-8 -*-
"""
This script includes four task prompts and lets you switch tasks by editing the
CHOSEN_TASK line, without any command-line parameters.
Available tasks (CHOSEN_TASK):
- 'ocr'     -> 'OCR:'
- 'table'   -> 'Table Recognition:'
- 'chart'   -> 'Chart Recognition:'
- 'formula' -> 'Formula Recognition:'
To add or modify prompts, edit the PROMPTS dictionary as needed.
"""
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
CHOSEN_TASK = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "chart": "Chart Recognition:",
    "formula": "Formula Recognition:",
}

model_path = "PaddleOCR-VL-0.9B"
image_path = "test.png"

# Load the input image, the model, and the processor.
# trust_remote_code is required because the model uses a custom architecture.
image = Image.open(image_path).convert("RGB")
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Build the chat prompt for the chosen task and preprocess text + image together.
messages = [{"role": "user", "content": PROMPTS[CHOSEN_TASK]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt")
inputs = {k: (v.to(DEVICE) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

# Greedy decoding; strip the echoed prompt from the decoded output before printing.
with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=1024, do_sample=False, use_cache=True)
resp = processor.batch_decode(generated, skip_special_tokens=True)[0]
answer = resp.split(text)[-1].strip()
print(answer)
model_path = "PaddleOCR-VL-0.9B" is it correct? I changed it to "PaddlePaddle/PaddleOCR-VL" still its not working. Error says model_type is missing from config.
model_path = "PaddleOCR-VL-0.9B"
is an example, please replace it with your local model path and try again.
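For reference, one way to get a local copy is to download the repository from the Hugging Face Hub and point model_path at the resulting directory. This is only a sketch; the repo ID is taken from the question above, and any other way of downloading the weights locally works just as well.

# Sketch: download the weights once and reuse the local directory as model_path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="PaddlePaddle/PaddleOCR-VL")  # repo ID from the question above
model_path = local_dir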
Yes, it's working. Thanks for the quick response. I have two more queries:
1. Is it possible to parse a complete page to Markdown or JSON using transformers?
2. I tried the PaddleOCRVL() pipeline, but it's not working on a CPU-only system. How can I configure it for a CPU-only system?
Thank you for your interest.
- As I mentioned in my previous reply, we do not currently support end-to-end Transformers inference, but we plan to add this support in the future. We recommend that you use the official deployment method for higher inference efficiency.
- We do not support CPU inference at this time, as it would lead to a poor user experience.
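For what it's worth, the element-level transformers script earlier in this thread already falls back to CPU through its DEVICE line. The following is not the official pipeline and not an officially supported path, just a rough sketch of a CPU-only load, using float32 instead of bfloat16 since bfloat16 can be slow or unsupported on some CPUs.

# Unofficial sketch: CPU-only load for the element-level transformers script above.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "PaddleOCR-VL-0.9B"  # replace with your local model path
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float32  # float32 for CPU
).to("cpu").eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)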
Using the official deployment, can we output the confidence score or probability of each word?
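There is no confirmation in this thread about word-level confidences from the official deployment, but with the transformers script above you can at least read out per-token probabilities from generate. This is a sketch, assuming the custom model and processor expose the standard GenerationMixin methods and a tokenizer attribute; word-level confidences would still have to be aggregated from these token-level scores.

# Sketch: per-token probabilities from the transformers script above.
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=1024,
        do_sample=False,
        use_cache=True,
        return_dict_in_generate=True,  # return a structured output
        output_scores=True,            # keep the per-step logits
    )

# Log-probabilities of the chosen tokens; exponentiate to get probabilities.
transition_scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
for tok, logprob in zip(gen_tokens, transition_scores[0]):
    print(processor.tokenizer.decode(tok), float(logprob.exp()))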
I encountered an error:
"""
from transformers.modeling_layers import GradientCheckpointingLayer
ModuleNotFoundError: No module named 'transformers.modeling_layers'
"""
I asked GPT, and it told me that my Transformers version is incorrect. May I know which version I should use?
Hello, we’re currently using Transformers version 4.55.0. You may try installing this version if needed.
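For example, after installing that version with pip install transformers==4.55.0, a quick check:

# Confirm the installed transformers version matches the maintainers' reported version.
import transformers

print(transformers.__version__)  # expect "4.55.0"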
I am really excited