arXiv:2509.23061

Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification

Published on Sep 27, 2025 · Submitted by XinLi on Sep 30, 2025
Authors: Xin Li et al.
Abstract

AI-generated summary: DafnyCOMP evaluates large language models on generating compositional specifications in Dafny, highlighting their weaknesses in cross-functional reasoning.

We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.
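To make "compositional specification generation" concrete, below is a minimal Dafny sketch of two interacting methods with a data dependency. It is illustrative only and not taken from the benchmark; the names SumTo, Sum, and Average are hypothetical. The point is that the caller Average verifies only because the callee Sum exposes a sufficiently strong postcondition, so the generated specifications must compose across the call boundary.

// Ghost function giving the mathematical meaning of the loop below.
function SumTo(n: nat): nat
{
  if n == 0 then 0 else n + SumTo(n - 1)
}

method Sum(n: nat) returns (s: nat)
  ensures s == SumTo(n)  // the callee contract the caller will rely on
{
  s := 0;
  var i := 0;
  while i < n
    invariant 0 <= i <= n
    invariant s == SumTo(i)  // ties the running total to the spec function
  {
    i := i + 1;
    s := s + i;
  }
}

method Average(n: nat) returns (a: nat)
  requires n > 0
  ensures a == SumTo(n) / n  // provable from Sum's ensures alone, not its body
{
  var s := Sum(n);  // Dafny sees only Sum's contract here
  a := s / n;
}

Weakening Sum's postcondition (say, to s >= 0) leaves Sum itself verifiable but makes Average unprovable, which is the kind of cross-function fragility the abstract describes.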

Community

From the paper author and submitter:

Project homepage: https://DafnyComp.github.io/


Linked resources: 1 dataset citing this paper · 1 collection including this paper · no models or Spaces linking this paper.