CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents
Abstract
CostBench evaluates Large Language Model agents' cost-aware planning and adaptability in response to dynamic changes, revealing significant gaps in current models' performance.
Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents' ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and require agents to adapt in real time. Evaluating leading open-source and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than a 75% exact-match rate on the hardest tasks, and performance drops by a further ~40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.
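To make the task setup concrete, here is a minimal, hypothetical sketch (not CostBench's actual code or tool set) of what cost-optimal planning over a tool graph with a dynamic blocking event could look like: the agent first computes the cheapest tool sequence, then replans from its current state after a tool becomes unavailable. The tool names, costs, and graph structure below are illustrative assumptions.

```python
# Hypothetical sketch, not the CostBench implementation: a toy cost-optimal
# planner over a tool graph, with replanning after a dynamic blocking event.
import heapq

def cheapest_plan(graph, start, goal):
    """Dijkstra over states; edges are (tool_name, next_state, cost)."""
    frontier = [(0, start, [])]
    settled = {}
    while frontier:
        cost, state, plan = heapq.heappop(frontier)
        if state == goal:
            return cost, plan
        if settled.get(state, float("inf")) <= cost:
            continue
        settled[state] = cost
        for tool, nxt, c in graph.get(state, []):
            heapq.heappush(frontier, (cost + c, nxt, plan + [tool]))
    return float("inf"), None

# Toy travel-planning graph: states are locations, edges are tools with costs.
graph = {
    "home":    [("bus_to_airport", "airport", 10), ("taxi_to_airport", "airport", 30)],
    "airport": [("budget_flight", "city_B", 120), ("premium_flight", "city_B", 300)],
    "city_B":  [("metro_to_hotel", "hotel", 5)],
}

cost, plan = cheapest_plan(graph, "home", "hotel")
print("initial plan:", plan, "cost:", cost)  # bus -> budget_flight -> metro

# Dynamic blocking event (e.g., tool failure): the budget flight becomes
# unavailable mid-execution, so the agent must replan from its current state.
graph["airport"] = [e for e in graph["airport"] if e[0] != "budget_flight"]
cost, plan = cheapest_plan(graph, "airport", "hotel")
print("replanned from airport:", plan, "cost:", cost)
```

In this framing, the benchmark's exact-match metric would correspond to whether the agent's executed tool sequence matches the cheapest feasible plan, both before and after the blocking event.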
Community
This paper reveals significant deficiencies in SOTA LLMs' cost-optimal planning, stemming from high sensitivity to cost variations and insufficient exploration of candidate paths. Performance deteriorates further under dynamic disruptions such as tool failures or cost changes, highlighting limitations in adaptive capabilities.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use (2025)
- COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization (2025)
- TPS-Bench: Evaluating AI Agents' Tool Planning & Scheduling Abilities in Compounding Tasks (2025)
- AgentChangeBench: A Multi-Dimensional Evaluation Framework for Goal-Shift Robustness in Conversational AI (2025)
- In-the-Flow Agentic System Optimization for Effective Planning and Tool Use (2025)
- MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools (2025)
- VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications (2025)