yuhangzang committed · verified
Commit 765c016 · Parent: 81d010b

Update README.md

Files changed (1): README.md (+125, -1)

README.md CHANGED
@@ -7,4 +7,128 @@ library_name: transformers
tags:
- multimodal
- image caption
---

# CapRL-3B

📖 <a href="https://arxiv.org/abs/2509.22647">Paper</a> | 🏠 <a href="https://github.com/InternLM/CapRL">GitHub</a> | 🤗 <a href="https://huggingface.co/internlm/CapRL-3B">CapRL-3B Model</a> | 🤗 <a href="https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B">CapRL-InternVL3.5-8B Model</a> | 🤗 <a href="https://huggingface.co/datasets/internlm/CapRL-2M">CapRL-2M Dataset</a>

🤗 <a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | 🤗 <a href="https://huggingface.co/papers/2509.22647">Daily Paper</a>

Using the same recipe as CapRL-3B, we took InternVL3.5-8B as the policy model and obtained CapRL-InternVL3.5-8B through CapRL training. **Its performance significantly surpasses that of Qwen2.5-VL-72B.**

We are working on even stronger base models and an upgraded training recipe. Stay tuned!

## Introduction

We are excited to introduce CapRL-3B, a lightweight 3B image captioner that achieves perception capabilities comparable to Qwen2.5-VL-72B.

This is the first study to apply Reinforcement Learning with Verifiable Rewards to the open-ended and subjective image captioning task. Unlike traditional Supervised Fine-Tuning, which can lead models to memorize a limited set of annotated captions, our method lets the model explore and generate a broader range of creative and general descriptions.

CapRL is a new training paradigm featuring a decoupled two-stage pipeline. The first stage uses LVLMs to generate rich and accurate captions. The second stage then evaluates caption quality by having a vision-free LLM perform a QA task based solely on those captions. We also built a dedicated QA curation pipeline to ensure the quality of the questions and answers used in the second stage.
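
To make the decoupled two-stage reward concrete, here is a minimal sketch (not the paper's exact implementation) of how a caption could be scored in the second stage: a text-only judge LLM answers the curated questions while seeing only the caption, and the resulting QA accuracy serves as the verifiable reward. The `judge_answer` callable, the prompt wording, and the containment-based answer check are illustrative assumptions.

```python
from typing import Callable, Dict, List


def caption_qa_reward(
    caption: str,
    qa_pairs: List[Dict[str, str]],
    judge_answer: Callable[[str], str],
) -> float:
    """Score a caption by QA accuracy: a text-only judge LLM answers each
    curated question while seeing only the caption, never the image."""
    correct = 0
    for qa in qa_pairs:
        prompt = (
            "Answer the question using only the description below.\n"
            f"Description: {caption}\n"
            f"Question: {qa['question']}\n"
            "Reply with the answer only."
        )
        prediction = judge_answer(prompt)  # any text-only LLM call
        # Simple containment check stands in for the actual answer verification.
        if qa["answer"].strip().lower() in prediction.strip().lower():
            correct += 1
    # QA accuracy in [0, 1] acts as the verifiable reward for updating the policy.
    return correct / max(len(qa_pairs), 1)
```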

By employing the CapRL training framework, initializing from Qwen2.5-VL-3B, and training on a carefully filtered 75K QA dataset, we obtained a highly capable captioner, CapRL-3B.

<p align="center">
  <img src="./assets/teaser.png" width="750"/>
</p>
<p align="center">
  <img src="./assets/performance_update.png" width="750"/>
</p>

## Key Features
* **Remarkable visual understanding of charts, infographics, and documents**: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: The outputs of CapRL-3B are well-structured, making them clear and easy to understand.
* **Detailed descriptions of natural images**: The outputs of CapRL-3B cover the valid visual information thoroughly while containing fewer hallucinations.

## Usage
To use **CapRL-3B** for captioning, you can follow exactly the same inference approach as the [Qwen2.5-VL series](https://github.com/QwenLM/Qwen3-VL/tree/d2240f11656bfe404b9ba56db4e51cd09f522ff1); a minimal Transformers sketch is shown below.
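
Since CapRL-3B is initialized from Qwen2.5-VL-3B, it loads with the standard Qwen2.5-VL classes in Hugging Face Transformers. The sketch below assumes a recent `transformers` release with Qwen2.5-VL support plus the `qwen_vl_utils` helper package; the image path and the captioning prompt are placeholders, not the exact prompt used during training.

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_path = "internlm/CapRL-3B"  # or a local checkpoint path
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "/path/to/local/image.png"},
            {"type": "text", "text": "Describe this image in detail."},  # illustrative prompt
        ],
    }
]

# Build the chat prompt and preprocess the image exactly as for Qwen2.5-VL.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```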

We recommend using **vLLM** to speed up inference.

### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
vllm serve "/PATH/CapRL-3B" \
  --trust-remote-code \
  --tensor-parallel-size=1 \
  --pipeline-parallel-size=1 \
  --gpu_memory_utilization=0.95 \
  --served-model-name=caprl \
  --port 8000 \
  --host 0.0.0.0
```

Then you can call the chat API as shown below (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
```python
import base64

from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_qwen = f"data:image;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_qwen},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=4096,  # generation limit; adjust to your use case
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```
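
If you only need the caption text, a small helper like the one below (reusing the `client` created above; the prompt is a generic placeholder, not necessarily the prompt used during CapRL training) extracts it from the response and loops over several local images:

```python
import base64
from pathlib import Path


def caption_image(path: str, prompt: str = "Describe this image in detail.") -> str:
    """Request a caption for one local image and return only the generated text."""
    data = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    response = client.chat.completions.create(
        model="caprl",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:image;base64,{data}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        temperature=1.0,
        max_tokens=4096,
    )
    return response.choices[0].message.content


for path in ["/path/to/image_1.png", "/path/to/image_2.png"]:
    print(path, "->", caption_image(path))
```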


## Cases

<p align="center">
  <img src="./assets/comparison.png" alt="Caption comparison" width="750"/>
</p>

<p align="center">
  <img src="./assets/info_caprl.png" alt="CapRL-3B infographic captioning case" width="750"/>
</p>

<p align="center">
  <img src="./assets/info_caprl2.png" alt="CapRL-3B infographic captioning case" width="750"/>
</p>
<p align="center">
  <img src="./assets/natural_caprl.png" alt="CapRL-3B natural image captioning case" width="750"/>
</p>