arxiv:2601.06112

ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions

Published on Jan 3
AI-generated summary

ReliabilityBench evaluates agent reliability across consistency, robustness, and fault tolerance dimensions using unified metrics and fault injection methods across multiple domains and models.

Abstract

Existing benchmarks for tool-using LLM agents primarily report single-run success rates and miss reliability properties required in production. We introduce ReliabilityBench, a benchmark for evaluating agent reliability across three dimensions: (i) consistency under repeated execution using pass^k, (ii) robustness to semantically equivalent task perturbations at intensity ε, and (iii) fault tolerance under controlled tool/API failures at intensity λ. ReliabilityBench contributes a unified reliability surface R(k,ε,λ), action metamorphic relations that define correctness via end-state equivalence rather than text similarity, and a chaos-engineering-style fault injection framework (timeouts, rate limits, partial responses, schema drift). We evaluate two models (Gemini 2.0 Flash, GPT-4o) and two agent architectures (ReAct, Reflexion) across four domains (scheduling, travel, customer support, e-commerce) over 1,280 episodes. Perturbations alone reduce success from 96.9% at ε=0 to 88.1% at ε=0.2. Rate limiting is the most damaging fault in ablations. ReAct is more robust than Reflexion under combined stress, and Gemini 2.0 Flash achieves comparable reliability to GPT-4o at much lower cost. ReliabilityBench provides a systematic framework for assessing production readiness of LLM agents.
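The pass^k consistency dimension can be sketched with the standard combinatorial estimator: given c successful episodes out of n repeated runs of the same task, it estimates the probability that k independently sampled runs all succeed. The abstract does not spell out the estimator ReliabilityBench uses, so the form below is an assumption based on the common definition.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass^k (an assumption: the standard
    combinatorial form): the probability that k runs drawn without
    replacement from n repeated episodes, c of which succeeded,
    are all successes."""
    if k > n:
        raise ValueError("k must not exceed the number of runs n")
    if c < k:
        return 0.0  # fewer successes than sampled runs: some draw must fail
    return comb(c, k) / comb(n, k)

# Example: 8 successes in 10 repeated episodes of one task.
print(pass_at_k(10, 8, 1))  # k=1 recovers the plain success rate: 0.8
print(pass_at_k(10, 8, 3))  # stricter all-of-3 consistency, well below 0.8
```

Note how pass^k penalizes inconsistency: a per-run success rate of 0.8 drops to roughly 0.47 when all three of three runs must succeed, which is exactly the gap between single-run benchmarks and the repeated-execution reliability the paper targets.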
