---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- image caption
- captioning
datasets:
- internlm/CapRL-2M
base_model:
- OpenGVLab/InternVL3_5-8B
---



# CapRL-InternVL3.5-8B

πŸ“– <a href="https://arxiv.org/abs/2509.22647">Paper</a> | 🏠 <a href="https://github.com/InternLM/CapRL">GitHub</a> | πŸ€— <a href="https://huggingface.co/internlm/CapRL-3B">CapRL-3B Model</a> | πŸ€— <a href="https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B">CapRL-InternVL3.5-8B Model</a> | πŸ€— <a href="https://huggingface.co/datasets/internlm/CapRL-2M">CapRL-2M Dataset</a>

πŸ€— <a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | πŸ€— <a href="https://huggingface.co/papers/2509.22647">Daily Paper</a> | πŸ€— <a href="https://huggingface.co/mradermacher/CapRL-3B-GGUF">CapRL-3B-GGUF</a> | πŸ€— <a href="https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF">CapRL-3B-i1-GGUF</a>

When selecting between the available CapRL models, it's essential to consider the trade-off between performance and computational cost.
This guide will help you choose the most suitable model for your specific needs:
|Model|Parameters|Strength|
|-|-|-|
|πŸ€—[CapRL-3B](https://huggingface.co/internlm/CapRL-3B)|3B|Speed, Efficiency|
|πŸ€—[CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B)|8B|High Performance, Advanced Captioning Ability|

Now you can try out CapRL-3B with your own images 🎨! ➑️ [🌈 CapRL Space](https://huggingface.co/spaces/yuhangzang/caprl)


## πŸ“’ News
We are working on even stronger base models and upgrading our training recipe; stay tuned!
- πŸ”₯ [10/15/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 6,000 within just 20 days!
- πŸš€ [10/15/2025] We are excited to announce the release of **[CapRL-InternVL3.5-8B](https://huggingface.co/internlm/CapRL-InternVL3.5-8B)**, whose image captioning capability outperforms Qwen2.5-VL-72B!
- πŸš€ [10/15/2025] Thanks to [mradermacher](https://huggingface.co/mradermacher) for the valuable contribution! [CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) is the static quants version, and [CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) is the weighted/imatrix quants version.
- πŸš€ [10/15/2025] We release the [QA curation code](https://github.com/InternLM/CapRL).
- πŸš€ [09/25/2025] We release the **CapRL** repository, the [CapRL-3B model](https://huggingface.co/internlm/CapRL-3B), the [evaluation code](https://github.com/InternLM/CapRL), and the [CapRL-2M dataset](https://huggingface.co/datasets/internlm/CapRL-2M).


## Introduction
Following the same recipe as [CapRL-3B](https://huggingface.co/internlm/CapRL-3B), we used [InternVL3.5-8B](https://huggingface.co/OpenGVLab/InternVL3_5-8B) as the policy model and obtained **[CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B)** through CapRL training.

CapRL is the first study to apply Reinforcement Learning with Verifiable Rewards (RLVR) to the open-ended and subjective task of image captioning. Unlike traditional Supervised Fine-Tuning, which can lead to models memorizing a limited set of annotated captions, our method allows the model to explore and generate a broader range of creative and general descriptions.

CapRL is a new training paradigm featuring a decoupled two-stage pipeline. In the first stage, an LVLM generates rich and accurate captions. In the second stage, caption quality is evaluated by having a vision-free LLM answer questions about the image using only the caption, and the resulting QA accuracy serves as the verifiable reward. We also built a dedicated QA curation pipeline to ensure the quality of the questions and answers used in this second stage.
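
To make the reward concrete, here is a minimal, hedged sketch of how such a QA-accuracy reward could be computed. The server URL, the `judge-llm` model name, the prompt wording, and the QA-pair format are illustrative assumptions for this sketch, not the exact setup used in the paper; please refer to the [CapRL repository](https://github.com/InternLM/CapRL) for the official implementation.

```python
# Sketch of a QA-accuracy caption reward: a text-only "judge" LLM answers curated
# multiple-choice questions using ONLY the generated caption, and the fraction of
# correct answers is used as the verifiable reward for the policy model.
from openai import OpenAI

# Any OpenAI-compatible server hosting a text-only judge LLM (URL/model are placeholders).
judge_client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

def caption_reward(caption: str, qa_pairs: list[dict], judge_model: str = "judge-llm") -> float:
    """Return the fraction of curated QA pairs answered correctly from the caption alone."""
    correct = 0
    for qa in qa_pairs:  # each item: {"question": ..., "choices": [...], "answer": "B"}
        prompt = (
            "Answer the multiple-choice question using only the passage below.\n"
            f"Passage: {caption}\n"
            f"Question: {qa['question']}\n"
            f"Choices: {qa['choices']}\n"
            "Reply with the letter of the correct choice only."
        )
        reply = judge_client.chat.completions.create(
            model=judge_model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=4,
        ).choices[0].message.content.strip()
        correct += int(reply.startswith(qa["answer"]))
    return correct / max(len(qa_pairs), 1)
```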

By employing the CapRL training framework, initializing with the [InternVL3.5-8B](https://huggingface.co/OpenGVLab/InternVL3_5-8B) model, and using a carefully 
filtered 75K QA dataset as the training set, we obtained a highly capable captioner, CapRL-InternVL3.5-8B.

<p align="center">
  <img src="./assets/teaser.png"  width="750"/>
</p>
<p align="center">
  <img src="./assets/performance_update.png" width="750"/>
</p>

## Key Features
* **Remarkable visual understanding of charts, infographics, and documents**: CapRL-3B achieves perception accuracy and visual-information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: The outputs of CapRL-3B are well structured, making them clear and easy to understand.
* **Detailed descriptions of natural images**: The outputs of CapRL-3B cover the valid visual information while containing fewer hallucinations.

## Usage
If you want to use **CapRL-InternVL3.5-8B** for captioning, you can directly follow the same inference approach as the [InternVL3.5 series](https://huggingface.co/collections/internlm/internvl35-68ab285d4a1f0871ddcb75b2).
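
For offline inference with `transformers`, the sketch below assumes the InternVL-style `model.chat()` interface exposed via `trust_remote_code`, and uses a simplified single-tile preprocessing; the model path and prompt are placeholders. For the full dynamic-tiling preprocessing, follow the official InternVL3.5 model cards.

```python
import torch
import torchvision.transforms as T
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_path = "internlm/CapRL-InternVL3.5-8B"  # or a local checkpoint path

model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

# Minimal single-tile preprocessing (448x448 input, ImageNet statistics);
# quality improves with the dynamic-tiling pipeline from the InternVL3.5 cards.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open("/path/to/local/image.png").convert("RGB")
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = "<image>\nPlease describe the image in detail."
response = model.chat(tokenizer, pixel_values, question, dict(max_new_tokens=1024, do_sample=False))
print(response)
```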

We recommend using **vLLM** to speed up inference.


### Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

```bash
vllm serve "/PATH/CapRL-InternVL3.5-8B" \
    --trust-remote-code \
    --tensor-parallel-size=1 \
    --pipeline-parallel-size=1 \
    --gpu-memory-utilization=0.95 \
    --served-model-name=caprl \
    --port 8000 \
    --host 0.0.0.0
```

Then you can call the chat API as shown below (see the [OpenAI vision API documentation](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
```python
import base64

from openai import OpenAI

# Point the OpenAI client at vLLM's OpenAI-compatible server started above.
openai_api_key = "EMPTY"  # vLLM does not validate the API key by default
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Encode a local image as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
image_data_url = f"data:image;base64,{encoded_image_text}"

max_tokens = 2048  # upper bound on the generated response length; adjust as needed

chat_response = client.chat.completions.create(
    model="caprl",  # must match --served-model-name above
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": image_data_url},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=max_tokens,
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response)
```
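
Reusing the `client` from the snippet above, a small helper can caption a whole folder of images. The helper name, folder path, and prompt are illustrative; this is not part of the official CapRL tooling.

```python
import base64
import glob

def caption_image(path: str) -> str:
    """Send one local image to the vLLM server and return the generated caption."""
    with open(path, "rb") as f:
        data_url = "data:image;base64," + base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="caprl",
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": "Please describe the image in detail."},
            ],
        }],
        temperature=1.0,
        max_tokens=2048,
    )
    return response.choices[0].message.content

for path in sorted(glob.glob("/path/to/images/*.png")):
    print(path, "->", caption_image(path)[:80], "...")
```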



## Cases
<p align="center">
  <img src="./assets/comparison.png"  width="750"/>
</p>

<p align="center">
  <img src="./assets/info_caprl.png"  width="750"/>
</p>

<p align="center">
  <img src="./assets/info_caprl2.png"  width="750"/>
</p>
<p align="center">
  <img src="./assets/natural_caprl.png"  width="750"/>
</p>