arxiv:2603.26728

SEAR: Schema-Based Evaluation and Routing for LLM Gateways

Published on Mar 20 · Submitted by Zecheng Zhang on Mar 31

Abstract

SEAR is a schema-based system for evaluating and routing LLM responses that uses structured signals derived from LLM reasoning to enable accurate, interpretable routing decisions across multiple providers.

AI-generated summary

Evaluating production LLM responses and routing requests across providers in LLM gateways requires fine-grained quality signals and operationally grounded decisions. To address this gap, we present SEAR, a schema-based evaluation and routing system for multi-model, multi-provider LLM gateways. SEAR defines an extensible relational schema covering both LLM evaluation signals (context, intent, response characteristics, issue attribution, and quality scores) and gateway operational metrics (latency, cost, throughput), with cross-table consistency links across around one hundred typed, SQL-queryable columns. To populate the evaluation signals reliably, SEAR proposes self-contained signal instructions, in-schema reasoning, and multi-stage generation that produces database-ready structured outputs. Because signals are derived through LLM reasoning rather than shallow classifiers, SEAR captures complex request semantics, enables human-interpretable routing explanations, and unifies evaluation and routing in a single query layer. Across thousands of production sessions, SEAR achieves strong signal accuracy on human-labeled data and supports practical routing decisions, including large cost reductions with comparable quality.

Community

Paper author Paper submitter

LLM gateways route requests across multiple models and providers, but evaluating response quality and making routing decisions still relies on shallow heuristics or black-box logic. SEAR changes that.

SEAR defines an extensible relational schema with ~100 typed, SQL-queryable columns covering evaluation signals (intent, context, response characteristics, issue attribution, quality scores) and operational metrics (latency, cost, throughput). It uses LLM-based reasoning to populate these signals, capturing complex request semantics and producing human-interpretable, structured outputs. Because all signals live in a unified SQL layer, evaluation insights can be aggregated, joined, and queried at scale, bringing a big-data approach to LLM quality analysis and routing.
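To make the "unified SQL layer" idea concrete, here is a minimal sketch of a SEAR-style schema in an in-memory SQLite database. The table and column names follow the query examples in this post, but the rows and the three-table subset are purely illustrative (the real schema has ~100 columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A tiny subset of the schema: evaluation-side context, operational
# gateway metrics, and the link table that ties them together.
cur.executescript("""
CREATE TABLE context_info (
    id INTEGER PRIMARY KEY,
    request_task_type TEXT,
    context_domain_category TEXT
);
CREATE TABLE gateway_metrics (
    id INTEGER PRIMARY KEY,
    model_id TEXT,
    latency_ms REAL,
    prompt_tokens INTEGER
);
CREATE TABLE llm_response_info (
    context_id INTEGER REFERENCES context_info(id),
    gateway_metrics_id INTEGER REFERENCES gateway_metrics(id)
);
""")

# Illustrative rows, not real production data.
cur.execute("INSERT INTO context_info VALUES (1, 'coding', 'web_dev')")
cur.execute("INSERT INTO gateway_metrics VALUES (10, 'model-a', 850.0, 1200)")
cur.execute("INSERT INTO llm_response_info VALUES (1, 10)")

# Cross-table links let one query mix evaluation signals with
# operational metrics, which is the core of the SEAR design.
row = cur.execute("""
    SELECT ctx.context_domain_category, gw.model_id, gw.latency_ms
    FROM context_info ctx
    JOIN llm_response_info llm ON llm.context_id = ctx.id
    JOIN gateway_metrics gw ON gw.id = llm.gateway_metrics_id
    WHERE ctx.request_task_type = 'coding'
""").fetchone()
print(row)  # ('web_dev', 'model-a', 850.0)
```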

Evaluated on thousands of production sessions from https://infron.ai/, SEAR achieves strong signal accuracy against human labels and supports practical routing decisions, including large cost reductions with comparable quality.

SEAR turns model evaluation and routing from a static, opaque process into a data-driven loop that continuously improves cost, quality, and reliability.

Paper author Paper submitter

With SEAR, production LLM evaluation and routing reduce to SQL queries over signals produced by reasoning LLMs rather than shallow classifiers, so every routing decision remains fully human-interpretable for the teams that make it.

Evaluation Example

Surface LLM-caused issues across models and domains for coding tasks over the last 30 days

-- Join evaluation signals to gateway metrics through llm_response_info,
-- then rank model/domain pairs by how often the LLM caused the issue.
SELECT gw.model_id,
       ctx.context_domain_category  AS domain,
       COUNT(*)                     AS n,
       AVG(CASE WHEN ia.issue_caused_by_code_task
                 IN ('llm','both')
            THEN 1 ELSE 0 END)      AS llm_issue_rate,
       AVG(CASE eval.severity_of_code_task
             WHEN 'major' THEN 2 WHEN 'minor' THEN 1
             ELSE 0 END)            AS avg_severity
FROM context_info      ctx
JOIN issue_attribution ia   ON ia.context_id   = ctx.id
JOIN evaluation        eval ON eval.context_id = ctx.id
JOIN llm_response_info llm  ON llm.context_id  = ctx.id
JOIN gateway_metrics   gw   ON gw.id = llm.gateway_metrics_id
WHERE ctx.request_task_type = 'coding'
  AND gw.created_at >= NOW() - INTERVAL '30 days'
GROUP BY 1, 2
-- Worst offenders first, so issues "surface" at the top of the result.
ORDER BY llm_issue_rate DESC, avg_severity DESC;

Routing Example

Route the cheapest model within 90% of the best quality

WITH model_perf AS (
  SELECT gw.model_id,
         AVG(CASE eval.overall_task_type_quality
               WHEN 'high' THEN 3 WHEN 'medium' THEN 2
               WHEN 'low'  THEN 1 ELSE 0 END) AS avg_quality,
         -- Costs are priced per million tokens, so scale by 1e6.
         AVG((gw.prompt_tokens  * mp.input_cost_per_million_token
            + gw.completion_tokens
              * mp.output_cost_per_million_token) / 1e6) AS avg_cost
  FROM gateway_metrics   gw
  JOIN llm_response_info llm  ON llm.gateway_metrics_id = gw.id
  JOIN evaluation        eval ON eval.context_id = llm.context_id
  JOIN model_provider    mp   ON mp.model_id = gw.model_id
                              AND mp.provider_id = gw.provider_id
  WHERE gw.is_failed = FALSE
  GROUP BY 1
  HAVING COUNT(*) >= 30
)
SELECT model_id, avg_quality, avg_cost
FROM model_perf
WHERE avg_quality >= 0.9 * (SELECT MAX(avg_quality) FROM model_perf)
ORDER BY avg_cost;
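The routing rule in this query — pick the cheapest model whose average quality is within 90% of the best — can also be applied in-process once the per-model aggregates are loaded, e.g. by a gateway that caches the `model_perf` rows. A small hedged sketch (the function name and tuple layout are illustrative, not part of SEAR):

```python
def route_cheapest_within_quality(model_perf, quality_floor=0.9):
    """Return the cheapest model whose avg quality is within
    `quality_floor` of the best observed avg quality.

    model_perf: list of (model_id, avg_quality, avg_cost) tuples,
    e.g. pre-aggregated from the model_perf CTE above.
    """
    best_quality = max(q for _, q, _ in model_perf)
    eligible = [m for m in model_perf if m[1] >= quality_floor * best_quality]
    return min(eligible, key=lambda m: m[2])[0]

# Illustrative aggregates: model-c is cheapest overall but falls
# below the quality floor (0.9 * 2.8 = 2.52), so model-b wins.
perf = [
    ("model-a", 2.8, 12.0),
    ("model-b", 2.6, 4.0),
    ("model-c", 2.0, 1.0),
]
print(route_cheapest_within_quality(perf))  # model-b
```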


