---
license: apache-2.0
---
# syntheticbot/ocr-qwen
## Introduction
syntheticbot/ocr-qwen is a fine-tuned model for Optical Character Recognition (OCR) tasks, derived from the base model [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). This model is engineered for high accuracy in extracting text from images, including documents and scenes containing text.
#### Key Enhancements for OCR:
* **Enhanced Text Recognition Accuracy**: Superior accuracy across diverse text fonts, styles, sizes, and orientations.
* **Robustness to Document Variations**: Specifically trained to manage document complexities like varied layouts, noise, and distortions.
* **Structured Output Generation**: Enables structured output formats (JSON, CSV) for recognized text and layout in document images such as invoices and tables.
* **Text Localization**: Provides accurate localization of text regions, returning bounding boxes for text elements within images (see the prompt sketch after the Quickstart example below).
* **Improved Handling of Text in Visuals**: Maintains proficiency in analyzing charts and graphics, with enhanced recognition of embedded text.
#### Model Architecture Updates:
* **Dynamic Resolution and Frame Rate Training for Video Understanding**:
<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
</p>
* **Streamlined and Efficient Vision Encoder**
This repository provides the instruction-tuned, OCR-optimized 7B syntheticbot/ocr-qwen model. For comprehensive details about the foundational model architecture, please refer to the [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) repository, as well as the [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL) pages for Qwen2.5-VL.
## Requirements
For optimal performance and access to OCR-specific features, it is recommended to install 🤗 Transformers from source:
```bash
pip install git+https://github.com/huggingface/transformers accelerate
```
## Quickstart
The following examples illustrate the use of syntheticbot/ocr-qwen with 🤗 Transformers and `qwen_vl_utils` for OCR applications.
```bash
pip install git+https://github.com/huggingface/transformers accelerate
```
Install the `qwen-vl-utils` toolkit for streamlined visual input processing (the `[decord]` extra speeds up video loading; for image-only OCR, a plain `pip install qwen-vl-utils` also works):
```bash
pip install qwen-vl-utils[decord]==0.0.8
```
### Using 🤗 Transformers for OCR
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the OCR-tuned model; device_map="auto" places weights on available devices.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/ocr-qwen",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("syntheticbot/ocr-qwen")

# A single-image OCR request in the chat format expected by the processor.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/document_image.jpg",
            },
            {"type": "text", "text": "Extract the text from this image."},
        ],
    }
]

# Render the chat template and collect the image/video tensors.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

# Generate, then strip the prompt tokens so only the model's answer is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Text:", output_text[0])
```
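The text-localization capability noted above can be exercised with a grounding-style prompt. Below is a minimal sketch reusing the `model` and `processor` loaded in the Quickstart; the prompt wording and the requested output schema (`bbox_2d` pixel coordinates, modeled on Qwen2.5-VL grounding examples) are assumptions, so validate the response format on your own data:

```python
# Reuses model, processor, process_vision_info, and torch from the Quickstart above.
# The requested output schema is an assumption; the fine-tuned model may respond
# in a different format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/document_image.jpg"},
            {
                "type": "text",
                "text": (
                    "Detect every text element in this image and output JSON as "
                    '[{"text": "...", "bbox_2d": [x1, y1, x2, y2]}].'
                ),
            },
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```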
<details>
<summary>Example for Structured Output (JSON for Table Extraction)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import json

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/ocr-qwen",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("syntheticbot/ocr-qwen")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/table_image.jpg",
            },
            {"type": "text", "text": "Extract the table from this image and output it as JSON."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

# Tables can be long; allow more new tokens than for plain text extraction.
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Table (JSON):\n", output_text[0])

# The model may deviate from strict JSON, so parse defensively.
try:
    json_output = json.loads(output_text[0])
    print("\nParsed JSON Output:\n", json.dumps(json_output, indent=2))
except json.JSONDecodeError:
    print("\nCould not parse output as JSON. Output is plain text.")
```
</details>
<details>
<summary>Batch inference for OCR</summary>

```python
# Reuses model, processor, process_vision_info, and torch from the examples above.
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image1.jpg"},
            {"type": "text", "text": "Extract text from this image."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image2.jpg"},
            {"type": "text", "text": "Read the text in this document."},
        ],
    }
]
messages = [messages1, messages2]

# Apply the chat template to each conversation, then batch them together.
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Texts (Batch):\n", output_texts)
```
</details>
### 🤖 ModelScope
For users in mainland China, ModelScope is recommended. Use `snapshot_download` for checkpoint management, and use the model name `syntheticbot/ocr-qwen` in ModelScope code.
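A minimal sketch of downloading the checkpoint via ModelScope, assuming the model is mirrored on ModelScope under the same `syntheticbot/ocr-qwen` name (substitute the actual ModelScope id if it differs):

```python
# Assumes the checkpoint is published on ModelScope under the same name.
from modelscope import snapshot_download
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_dir = snapshot_download("syntheticbot/ocr-qwen")

# The downloaded directory loads with the same Transformers classes as above.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
```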
### More Usage Tips for OCR
Input images can be supplied as local file paths, URLs, or base64-encoded data.
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://path/to/your/document_image.jpg",
            },
            {
                "type": "text",
                "text": "Extract the text from this image URL.",
            },
        ],
    }
]
```
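For reference, the accepted forms of the `image` field follow the conventions of the base Qwen2.5-VL model card:

```python
# The "image" field of a message content item accepts any of these forms
# (per the base Qwen2.5-VL model card):
local_path = {"type": "image", "image": "file:///path/to/your/document_image.jpg"}
http_url   = {"type": "image", "image": "http://path/to/your/document_image.jpg"}
base64_uri = {"type": "image", "image": "data:image;base64,<base64_encoded_bytes>"}
```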
#### Image Resolution for OCR Accuracy
Higher-resolution images typically improve OCR accuracy, especially for small text. Adjust the overall resolution budget with the `min_pixels` and `max_pixels` arguments to the processor, or set `resized_height` and `resized_width` on individual image entries in a message.
```python
# Each visual token covers roughly a 28x28 pixel patch, so the pixel budget
# is expressed in multiples of 28*28.
min_pixels = 512 * 28 * 28
max_pixels = 2048 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "syntheticbot/ocr-qwen",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```
Control resizing dimensions directly:
```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/document_image.jpg",
                # Requested dimensions are rounded to multiples of 28 during preprocessing.
                "resized_height": 600,
                "resized_width": 800,
            },
            {"type": "text", "text": "Extract the text."},
        ],
    }
]
```
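For faster inference and lower memory use with high-resolution document images, the base Qwen2.5-VL model card suggests enabling FlashAttention-2 where supported. A sketch following that example; it assumes the optional `flash-attn` package is installed and a compatible GPU is available:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Optional: requires `pip install flash-attn` and a supported GPU.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/ocr-qwen",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```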
## Base model citations
```bibtex
@misc{qwen2.5-VL,
  title  = {Qwen2.5-VL},
  url    = {https://qwenlm.github.io/blog/qwen2.5-vl/},
  author = {Qwen Team},
  month  = {January},
  year   = {2025}
}

@article{Qwen2VL,
  title   = {Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author  = {Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal = {arXiv preprint arXiv:2409.12191},
  year    = {2024}
}

@article{Qwen-VL,
  title   = {Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author  = {Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal = {arXiv preprint arXiv:2308.12966},
  year    = {2023}
}
```