---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- multimodal
- image caption
- captioning
datasets:
- internlm/CapRL-2M
base_model:
- OpenGVLab/InternVL3_5-8B
---
# CapRL-InternVL3.5-8B
π<a href="https://arxiv.org/abs/2509.22647">Paper</a> | π <a href="https://github.com/InternLM/CapRL">Github</a> |π€<a href="https://huggingface.co/internlm/CapRL-3B">CapRL-3B Model</a> |π€<a href="https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B">CapRL-InternVL3.5-8B Model</a> |
π€<a href="https://huggingface.co/datasets/internlm/CapRL-2M">CapRL-2M Dataset</a>
π€<a href="https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189">CapRL Collection</a> | π€<a href="https://huggingface.co/papers/2509.22647">Daily Paper</a> ο½π€<a href="https://huggingface.co/mradermacher/CapRL-3B-GGUF">CapRL-3B-GGUF</a> ο½π€<a href="https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF">CapRL-3B-i1-GGUF</a>
When choosing between the available CapRL models, consider the trade-off between performance and computational cost.
The table below should help you pick the model that best fits your needs:
|Model|Parameters|Strength|
|-|-|-|
|🤗 [CapRL-3B](https://huggingface.co/internlm/CapRL-3B)|3B|Speed, Efficiency|
|🤗 [CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B)|8B|High Performance, Advanced Captioning Ability|
Now you can try out CapRL-3B with your own images 🎨! ➡️ [🚀 CapRL Space](https://huggingface.co/spaces/yuhangzang/caprl)
## 📢 News
We are working on even stronger base models and upgrading our training recipe; stay tuned!
- 🔥 [10/15/2025] The total downloads of the CapRL-related [models and dataset](https://huggingface.co/collections/long-xing1/caprl-68d64ac32ded31596c36e189) reached 6,000 within just 20 days!
- 🚀 [10/15/2025] We are excited to announce the release of **[CapRL-InternVL3.5-8B](https://huggingface.co/internlm/CapRL-InternVL3.5-8B)**, whose image captioning capability outperforms Qwen2.5-VL-72B!
- 🚀 [10/15/2025] Thanks to [mradermacher](https://huggingface.co/mradermacher) for the valuable contribution! [CapRL-3B-GGUF](https://huggingface.co/mradermacher/CapRL-3B-GGUF) provides static quants, and [CapRL-3B-i1-GGUF](https://huggingface.co/mradermacher/CapRL-3B-i1-GGUF) provides weighted/imatrix quants.
- 🚀 [10/15/2025] We release the [QA curation code](https://github.com/InternLM/CapRL).
- 🚀 [09/25/2025] We release the **CapRL** repository, the [CapRL-3B model](https://huggingface.co/internlm/CapRL-3B), the [evaluation code](https://github.com/InternLM/CapRL), and the [dataset](https://huggingface.co/datasets/internlm/CapRL-2M).
## Introduction
Based on the same recipe as [CapRL-3B](https://huggingface.co/internlm/CapRL-3B), we used [InternVL3.5-8B](https://huggingface.co/OpenGVLab/InternVL3_5-8B) as the policy model and obtained **[CapRL-InternVL3.5-8B](https://huggingface.co/yuhangzang/CapRL-InternVL3.5-8B)** through CapRL.
CapRL is the first study to apply Reinforcement Learning with Verifiable Rewards to the
open-ended and subjective image captioning task. Unlike traditional Supervised Fine-Tuning, which
can lead to models memorizing a limited set of annotated captions, our method allows the model to
explore and generate a broader range of creative and general descriptions.
CapRL is a new training paradigm featuring a decoupled two-stage pipeline. The first
stage uses LVLMs to generate rich and accurate captions. The second stage then evaluates
caption quality by having a vision-free LLM perform the QA task based solely on the generated
caption. We also created a dedicated QA curation pipeline to ensure the quality of the questions
and answers used in the second stage.
By employing the CapRL training framework, initializing with the [InternVL3.5-8B](https://huggingface.co/OpenGVLab/InternVL3_5-8B) model, and using a carefully
filtered 75K QA dataset as the training set, we obtained a highly capable captioner, CapRL-InternVL3.5-8B.
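To make the reward concrete, the sketch below shows one way such a verifiable reward could be computed: a caption is scored by how many curated QA pairs a vision-free LLM answers correctly when given only the caption. This is an illustrative simplification, not the released training code; `answer_with_caption` is a hypothetical stand-in for the QA model.

```python
# Illustrative sketch of a CapRL-style verifiable reward (not the released code).
# `answer_with_caption` is a hypothetical callable wrapping a vision-free LLM:
# it receives the caption, a question, and the answer choices, and returns its pick.

def caprl_reward(caption: str, qa_pairs: list[dict], answer_with_caption) -> float:
    """Score a caption by the fraction of curated QA pairs a text-only LLM
    answers correctly when it sees only the caption instead of the image."""
    if not qa_pairs:
        return 0.0
    correct = 0
    for qa in qa_pairs:
        prediction = answer_with_caption(caption, qa["question"], qa["choices"])
        if prediction == qa["answer"]:
            correct += 1
    return correct / len(qa_pairs)
```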
<p align="center">
<img src="./assets/teaser.png" width="750"/>
</p>
<p align="center">
<img src="./assets/performance_update.png" width="750"/>
</p>
## Key Features
* **Remarkable visual understanding of charts, infographics, and documents**: CapRL-3B achieves perception accuracy and visual information coverage comparable to Qwen2.5-VL-72B.
* **Well-organized output**: The outputs of CapRL-3B are well structured, making them clear and easy to understand.
* **Detailed descriptions of natural images**: The outputs of CapRL-3B cover all the valid visual information while containing fewer hallucinations.
## Usage
If you want to use **CapRL-InternVL3.5-8B** for captioning, you can follow exactly the same inference approach as the [InternVL3.5 series](https://huggingface.co/collections/internlm/internvl35-68ab285d4a1f0871ddcb75b2).
We recommend using **vLLM** to speed up inference.
### Start an OpenAI API Service
Run the command below to start an OpenAI-compatible API service:
```bash
vllm serve "/PATH/CapRL-InternVL3.5-8B" \
--trust-remote-code \
--tensor-parallel-size=1 \
--pipeline-parallel-size=1 \
--gpu_memory_utilization=0.95 \
--served-model-name=caprl \
--port 8000 \
--host 0.0.0.0
```
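Once the server reports it is ready, you can optionally verify the endpoint before sending captioning requests. The check below assumes the port and served model name from the command above:

```python
from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# The /v1/models endpoint should list the name passed via --served-model-name.
print([model.id for model in client.models.list().data])  # expected: ['caprl']
```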
Then you can use the chat API as shown below (see the [OpenAI vision API documentation](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
```python
import base64
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Read a local image and encode it as a base64 data URL.
image_path = "/path/to/local/image.png"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
encoded_image_text = encoded_image.decode("utf-8")
base64_image = f"data:image/png;base64,{encoded_image_text}"

chat_response = client.chat.completions.create(
    model="caprl",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": base64_image},
                },
                {"type": "text", "text": "What is the text in the illustration?"},
            ],
        },
    ],
    temperature=1.0,
    max_tokens=2048,  # adjust as needed for long captions
    top_p=1.0,
    extra_body={
        "repetition_penalty": 1.0,
    },
)
print("Chat response:", chat_response.choices[0].message.content)
```
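To caption a folder of images in batch, the same client can be reused. The loop below is a minimal sketch; the directory path, file pattern, and prompt wording are placeholders you should adapt:

```python
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

def caption_image(path: Path) -> str:
    """Send one local image to the CapRL server and return its caption."""
    encoded = base64.b64encode(path.read_bytes()).decode("utf-8")
    response = client.chat.completions.create(
        model="caprl",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{encoded}"}},
                    {"type": "text", "text": "Please describe the image in detail."},
                ],
            },
        ],
        temperature=1.0,
        max_tokens=2048,
    )
    return response.choices[0].message.content

for image_path in sorted(Path("/path/to/images").glob("*.png")):
    print(image_path.name, "->", caption_image(image_path))
```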
## Cases
<p align="center">
<img src="./assets/comparison.png" width="750"/>
</p>
<p align="center">
<img src="./assets/info_caprl.png" width="750"/>
</p>
<p align="center">
<img src="./assets/info_caprl2.png" width="750"/>
</p>
<p align="center">
<img src="./assets/natural_caprl.png" width="750"/>
</p>