---
license: apache-2.0
---


# syntheticbot/ocr-qwen



## Introduction

syntheticbot/ocr-qwen is a fine-tuned model for Optical Character Recognition (OCR) tasks, derived from the base model [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). This model is engineered for high accuracy in extracting text from images, including documents and scenes containing text.

#### Key Enhancements for OCR:

* **Enhanced Text Recognition Accuracy**: Superior accuracy across diverse text fonts, styles, sizes, and orientations.
* **Robustness to Document Variations**: Specifically trained to manage document complexities like varied layouts, noise, and distortions.
* **Structured Output Generation**: Enables structured output formats (JSON, CSV) for recognized text and layout in document images such as invoices and tables.
* **Text Localization**: Provides accurate localization of text regions and bounding boxes for text elements within images (see the sketch after this list).
* **Improved Handling of Text in Visuals**: Maintains proficiency in analyzing charts and graphics, with enhanced recognition of embedded text.
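
As a rough illustration of the localization capability, a prompt like the one below can be dropped into the Quickstart pipeline further down. The requested schema (`text` plus `bbox_2d` coordinates) is an assumption made for illustration, not a documented output contract of this model.

```python
# Hedged sketch: ask the model to localize text regions.
# The JSON schema requested here (text + bbox_2d) is an assumption, not a guaranteed format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/document_image.jpg"},
            {
                "type": "text",
                "text": (
                    "Detect every text region in this image and return a JSON list of "
                    "objects with 'text' and 'bbox_2d' ([x1, y1, x2, y2]) fields."
                ),
            },
        ],
    }
]
```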


#### Model Architecture Updates:

* **Dynamic Resolution and Frame Rate Training for Video Understanding**:
<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
</p>

* **Streamlined and Efficient Vision Encoder**

This repository provides the instruction-tuned, OCR-optimized 7B ocr-qwen model. For comprehensive details about the foundational model architecture, please refer to the [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) repository, as well as the [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL) pages for Qwen2.5-VL.


## Requirements
For optimal performance and access to OCR-specific features, it is recommended to install 🤗 Transformers from source, since older releases may not yet include the Qwen2.5-VL model classes:
```bash
pip install git+https://github.com/huggingface/transformers accelerate
```


## Quickstart

The following examples illustrate the use of syntheticbot/ocr-qwen with 🤗 Transformers and `qwen_vl_utils` for OCR applications.

Install the toolkit for streamlined visual input processing:

```bash
pip install qwen-vl-utils[decord]==0.0.8
```

### Using 🤗 Transformers for OCR

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the OCR-tuned model with automatic dtype selection and device placement.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/ocr-qwen",
    torch_dtype="auto",
    device_map="auto"
)

# The processor handles chat templating, image preprocessing, and tokenization.
processor = AutoProcessor.from_pretrained("syntheticbot/ocr-qwen")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/document_image.jpg",
            },
            {"type": "text", "text": "Extract the text from this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens so that only newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Text:", output_text[0])
```
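
On GPUs that support FlashAttention-2, the base model's recommendation of passing `attn_implementation="flash_attention_2"` should also apply here and can speed up generation, particularly for multi-image batches. This is only a sketch and assumes the `flash-attn` package is installed.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Optional: faster attention kernels (requires a compatible GPU and the flash-attn package).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/ocr-qwen",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```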

<details>
<summary>Example for Structured Output (JSON for Table Extraction)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import json

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/ocr-qwen",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("syntheticbot/ocr-qwen")


messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/table_image.jpg",
            },
            {"type": "text", "text": "Extract the table from this image and output it as JSON."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Table (JSON):\n", output_text[0])

try:
    json_output = json.loads(output_text[0])
    print("\nParsed JSON Output:\n", json.dumps(json_output, indent=2))
except json.JSONDecodeError:
    print("\nCould not parse output as JSON. Output is plain text.")
```
</details>

<details>
<summary>Batch inference for OCR</summary>

```python
# Reuses the `model` and `processor` loaded in the first example above.
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image1.jpg"},
            {"type": "text", "text": "Extract text from this image."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image2.jpg"},
            {"type": "text", "text": "Read the text in this document."},
        ],
    }
]
messages = [messages1, messages2]

texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Texts (Batch):\n", output_texts)
```
</details>


### 🤖 ModelScope
For users in mainland China, ModelScope is recommended; `snapshot_download` can be used for checkpoint management. When adapting the examples above for ModelScope, keep the model name `syntheticbot/ocr-qwen`.
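
A minimal sketch of that workflow, assuming the checkpoint is mirrored on ModelScope under the same name (not verified here):

```python
# Hedged sketch: download the checkpoint via ModelScope, then load it with Transformers.
from modelscope import snapshot_download
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_dir = snapshot_download("syntheticbot/ocr-qwen")  # local path to the checkpoint

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
```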


### More Usage Tips for OCR

Input images can be provided as local file paths, URLs, or base64-encoded data. For example, using a URL:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://path/to/your/document_image.jpg"
            },
            {
                "type": "text",
                "text": "Extract the text from this image URL."
            },
        ],
    }
]
```
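
For base64 input, a data URI can be passed in the `image` field. A short sketch (the file path is a placeholder):

```python
import base64

# Encode a local image (placeholder path) as a base64 data URI.
with open("path/to/your/document_image.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"data:image;base64,{encoded}"},
            {"type": "text", "text": "Extract the text from this image."},
        ],
    }
]
```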
#### Image Resolution for OCR Accuracy

Higher-resolution inputs typically improve OCR accuracy, especially for small text. The processing resolution can be bounded with the `min_pixels` and `max_pixels` arguments of the processor, or set per image with `resized_height` and `resized_width` in the message. The bounds are expressed in pixels, and each visual token corresponds to a 28×28 pixel patch, hence the `* 28 * 28` factors below.

```python
min_pixels = 512 * 28 * 28
max_pixels = 2048 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "syntheticbot/ocr-qwen",
    min_pixels=min_pixels, max_pixels=max_pixels
)
```

Control resizing dimensions directly (values are rounded to the nearest multiple of 28):

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/document_image.jpg",
                "resized_height": 600,
                "resized_width": 800,
            },
            {"type": "text", "text": "Extract the text."},
        ],
    }
]
```


## Citation from base model

```
@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
```