GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization Paper • 2601.05242 • Published 6 days ago • 176
SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence Paper • 2512.22334 • Published 19 days ago • 34
Agentic Rubrics as Contextual Verifiers for SWE Agents Paper • 2601.04171 • Published 7 days ago • 10
NitroGen: An Open Foundation Model for Generalist Gaming Agents Paper • 2601.02427 • Published 10 days ago • 38
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models Paper • 2512.22238 • Published 22 days ago • 19
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture Paper • 2512.21675 • Published 20 days ago • 24
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios Paper • 2512.18470 • Published 25 days ago • 10
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows Paper • 2512.16969 • Published 28 days ago • 112
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments Paper • 2512.19432 • Published 23 days ago • 12
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models Paper • 2512.19526 • Published 23 days ago • 11
Reinforcement Learning for Self-Improving Agent with Skill Library Paper • 2512.17102 • Published 27 days ago • 32
Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision Paper • 2512.15489 • Published 28 days ago • 6
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows Paper • 2512.13168 • Published about 1 month ago • 49
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality Paper • 2512.10791 • Published Dec 11, 2025 • 7