How to use transformers for PaddleOCR-VL inference?

#1
by stzhao - opened

Excellent work! It would be more convenient if PaddleOCR-VL supported transformers-based inference.

PaddlePaddle org

Hello, we currently support inference with the PaddleOCR-VL-0.9B model via the transformers library, which can recognize text, formulas, tables, and chart elements. In the future, we plan to support full document parsing with transformers as well. Below is a simple script for running the PaddleOCR-VL-0.9B model with transformers. For now, we recommend the official deployment method, which is faster and supports page-level document parsing.

If you need any further assistance, feel free to ask!

# -*- coding: utf-8 -*-
"""
This script includes four task prompts (prompts) and allows switching by modifying the CHOSEN_TASK line without any command line parameters.

Available tasks (CHOSEN_TASK):

- 'ocr' -> 'OCR:'
- 'table' -> 'Table Recognition:'
- 'chart' -> 'Chart Recognition:'
- 'formula' -> 'Formula Recognition:'
To add/modify prompts, change the PROMPTS dictionary as needed.
"""

from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

CHOSEN_TASK = "ocr"  # Options: 'ocr' | 'table' | 'chart' | 'formula'
PROMPTS = {
    "ocr": "OCR:",
    "table": "Table Recognition:",
    "chart": "Chart Recognition:",
    "formula": "Formula Recognition:",
}

model_path = "PaddleOCR-VL-0.9B"
image_path = "test.png"
image = Image.open(image_path).convert("RGB")

# trust_remote_code is required because the model ships custom architecture code.
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

messages = [{"role": "user", "content": PROMPTS[CHOSEN_TASK]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(text=[text], images=[image], return_tensors="pt")
inputs = {k: (v.to(DEVICE) if isinstance(v, torch.Tensor) else v) for k, v in inputs.items()}

# Greedy decoding (do_sample=False) keeps the recognition output deterministic.
with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=1024, do_sample=False, use_cache=True)

# Decode and strip the echoed prompt so only the model's answer remains.
resp = processor.batch_decode(generated, skip_special_tokens=True)[0]
answer = resp.split(text)[-1].strip()
print(answer)

model_path = "PaddleOCR-VL-0.9B" is it correct? I changed it to "PaddlePaddle/PaddleOCR-VL" still its not working. Error says model_type is missing from config.

PaddlePaddle org

model_path = "PaddleOCR-VL-0.9B" is an example, please replace it with your local model path and try again.

Yes, it's working now. Thanks for the quick response. I have two more queries:
1. Is it possible to parse a complete page to Markdown or JSON using transformers?
2. I tried using the PaddleOCRVL() pipeline, but it's not working on a CPU-only system. How can I set it up for a CPU-only system?

PaddlePaddle org

Thank you for your interest.

  1. As mentioned in the previous reply, we do not currently support end-to-end transformers inference, but we plan to add it in the future. We recommend the official deployment method for higher inference efficiency; see the sketch below.
  2. We do not support CPU inference at this time, as it would lead to a poor user experience.
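For reference, here is a minimal sketch of the official pipeline route for page-level parsing. The predict/save method names follow the PaddleOCR 3.x pipeline conventions; please check the official docs for the exact API:

# Sketch of the official PaddleOCR-VL pipeline (requires a GPU per the note above).
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()
output = pipeline.predict("test.png")  # page-level document parsing
for res in output:
    res.save_to_json(save_path="output")      # structured JSON result
    res.save_to_markdown(save_path="output")  # Markdown rendering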

Using the official deployment, can we output a confidence score or probability for each word?
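(For the transformers script above, a sketch of how per-token probabilities could be extracted with the standard generate options; this is the generic transformers mechanism, token-level rather than word-level, and not the official pipeline's confidence output. It assumes the processor exposes its tokenizer as processor.tokenizer.)

# Sketch: per-token generation probabilities via standard transformers generate options.
outputs = model.generate(
    **inputs, max_new_tokens=1024, do_sample=False,
    return_dict_in_generate=True, output_scores=True,
)
scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
input_len = inputs["input_ids"].shape[1]
for tok, logprob in zip(outputs.sequences[0, input_len:], scores[0]):
    print(processor.tokenizer.decode(tok), float(logprob.exp()))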

I encountered an error:
"""
from transformers.modeling_layers import GradientCheckpointingLayer
ModuleNotFoundError: No module named 'transformers.modeling_layers'
"""
I asked GPT, and it told me that my transformers version is incorrect. May I know which version I should use?

PaddlePaddle org

Hello, we’re currently using Transformers version 4.55.0. You may try installing this version if needed.
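For example, pinning and verifying the version (assuming a pip-based environment):

# Install the matching version first, e.g.:
#   pip install transformers==4.55.0
import transformers
print(transformers.__version__)  # expect 4.55.0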

I am really excited
