ReCAP-32B
ReCAP-32B is a vision-language model fine-tuned from Qwen/Qwen3-VL-32B-Thinking. It is designed to enable robust CAPTCHA solving within native GUI agents while preserving general GUI interaction capabilities.
This model is introduced in “CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training”.
Overview
ReCAP-32B extends a general-purpose GUI agent with CAPTCHA-solving ability by learning from structured reasoning-action trajectories.
It operates end-to-end:
- Input: raw screenshots
- Output: reasoning + executable GUI actions (click, type, drag)
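The input/output contract above can be pictured with a small data structure. This is only a sketch: the class names (`Click`, `Type`, `Drag`, `AgentStep`) and the helper `actions_in_bounds` are hypothetical and chosen for illustration. The model's actual output format is not specified here.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# NOTE: hypothetical schema for illustration only; this is NOT the
# documented output format of ReCAP-32B.

@dataclass
class Click:
    x: int
    y: int

@dataclass
class Type:
    text: str

@dataclass
class Drag:
    start: Tuple[int, int]
    end: Tuple[int, int]

Action = Union[Click, Type, Drag]

@dataclass
class AgentStep:
    reasoning: str          # free-form reasoning text emitted by the model
    actions: List[Action]   # one step may carry several actions

def actions_in_bounds(step: AgentStep, width: int, height: int) -> bool:
    """Return True if every coordinate in the step lies inside the screenshot."""
    def inside(x: int, y: int) -> bool:
        return 0 <= x < width and 0 <= y < height

    for action in step.actions:
        if isinstance(action, Click) and not inside(action.x, action.y):
            return False
        if isinstance(action, Drag) and not (
            inside(*action.start) and inside(*action.end)
        ):
            return False
    return True

# A generic GUI step: focus a search box, then type into it.
step = AgentStep(
    reasoning="The search box is near the top of the page.",
    actions=[Click(x=120, y=40), Type(text="hello")],
)
```

A consumer of the model's output could run a check like `actions_in_bounds(step, 1280, 720)` before dispatching actions, so that malformed coordinates are rejected early.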
Key Features
- Unified agent: Handles both CAPTCHA and general GUI tasks
- Reasoning-action modeling: Learns both decisions and execution
- Self-correction: Improves robustness by learning from failures
- Efficient interaction: Generates multiple actions per step
Capabilities
Supports diverse CAPTCHA types:
- Text / OCR
- Icon selection & matching
- Image grid reasoning
- Slider / drag tasks
- Multi-step interaction challenges
Core skills:
- Visual understanding
- Spatial reasoning
- Continuous control
- Multi-step planning
Performance
- ~81.0% success rate on a synthetic CAPTCHA benchmark
- Strong improvements on interaction-heavy tasks (e.g., slider, image grid)
- Maintains strong performance on general GUI benchmarks
Ethical Considerations
This model is released for research purposes only.
It is intended to study and improve the robustness of human-verification systems, not to bypass them.