ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions
Abstract
ReliabilityBench evaluates LLM agent reliability along three dimensions (consistency, robustness, and fault tolerance) using unified metrics and fault injection across multiple domains and models.
Existing benchmarks for tool-using LLM agents primarily report single-run success rates and miss reliability properties required in production. We introduce ReliabilityBench, a benchmark for evaluating agent reliability across three dimensions: (i) consistency under repeated execution using pass^k, (ii) robustness to semantically equivalent task perturbations at intensity ε, and (iii) fault tolerance under controlled tool/API failures at intensity λ. ReliabilityBench contributes a unified reliability surface R(k,ε,λ), action metamorphic relations that define correctness via end-state equivalence rather than text similarity, and a chaos-engineering-style fault injection framework (timeouts, rate limits, partial responses, schema drift). We evaluate two models (Gemini 2.0 Flash, GPT-4o) and two agent architectures (ReAct, Reflexion) across four domains (scheduling, travel, customer support, e-commerce) over 1,280 episodes. Perturbations alone reduce success from 96.9% at ε=0 to 88.1% at ε=0.2. Rate limiting is the most damaging fault in ablations. ReAct is more robust than Reflexion under combined stress, and Gemini 2.0 Flash achieves comparable reliability to GPT-4o at much lower cost. ReliabilityBench provides a systematic framework for assessing production readiness of LLM agents.
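As a rough illustration of two ingredients named in the abstract, the sketch below shows (a) an assumed pass^k estimator of the kind commonly used to measure consistency under repeated execution, and (b) a hypothetical fault-injection wrapper covering the timeout, rate-limit, partial-response, and schema-drift fault types. The function names, payloads, and the exact pass^k formulation are illustrative assumptions, not taken from the paper.

```python
import math
import random
from typing import Callable, Sequence


def pass_hat_k(successes: int, runs: int, k: int) -> float:
    """Estimate the probability that k independently sampled runs of the same
    task all succeed, given `successes` out of `runs` observed runs.
    (Assumed standard pass^k estimator; the paper's exact definition is not
    stated in the abstract.)"""
    if runs < k:
        raise ValueError("need at least k runs per task")
    return math.comb(successes, k) / math.comb(runs, k)


def benchmark_pass_hat_k(per_task_results: Sequence[Sequence[bool]], k: int) -> float:
    """Average pass^k over tasks; each task is a list of per-run outcomes."""
    return sum(pass_hat_k(sum(r), len(r), k) for r in per_task_results) / len(per_task_results)


def with_faults(tool: Callable[..., dict], lam: float, rng: random.Random) -> Callable[..., dict]:
    """Wrap a tool call with chaos-style faults injected at intensity `lam`
    (the λ axis). Fault types follow the abstract's list; the error payloads
    and key-renaming scheme here are purely illustrative."""
    def wrapped(*args, **kwargs):
        if rng.random() < lam:
            fault = rng.choice(["timeout", "rate_limit", "partial_response", "schema_drift"])
            if fault == "timeout":
                raise TimeoutError("injected tool timeout")
            if fault == "rate_limit":
                return {"error": "429 Too Many Requests", "retry_after_s": 30}
            out = tool(*args, **kwargs)
            if fault == "partial_response":
                # Drop roughly half of the response fields.
                return dict(list(out.items())[: max(1, len(out) // 2)])
            # schema_drift: rename keys so strict output parsers break.
            return {f"{key}_v2": value for key, value in out.items()}
        return tool(*args, **kwargs)
    return wrapped
```

Sweeping k, ε, and λ over a grid with estimators like these would trace out a reliability surface of the R(k, ε, λ) form described above.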