Local Success Does Not Compose: Benchmarking Large Language Models for Compositional Formal Verification
Abstract
DafnyCOMP evaluates large language models on generating compositional specifications in Dafny, highlighting their weaknesses in cross-functional reasoning.
We introduce DafnyCOMP, a benchmark for evaluating large language models (LLMs) on compositional specification generation in Dafny. Unlike prior benchmarks that focus on single-function tasks, DafnyCOMP targets programs composed of multiple interacting functions with data dependencies, requiring reasoning across component boundaries. The benchmark consists of 300 automatically synthesized multi-function programs. We evaluate several state-of-the-art LLM families and find that, while they perform well on single-function verification, their performance drops sharply on compositional tasks. Analysis reveals systematic failures in cross-functional reasoning, including fragile specifications, misalignment between implementations and proofs, and unstable reasoning. DafnyCOMP thus provides a diagnostic tool for measuring progress toward reliable, verifiable, and compositional code generation with LLMs.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation (2025)
- Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning (2025)
- SLD-Spec: Enhancement LLM-assisted Specification Generation for Complex Loop Functions via Program Slicing and Logical Deletion (2025)
- CASP: An evaluation dataset for formal verification of C code (2025)
- Strengthening Programming Comprehension in Large Language Models through Code Generation (2025)
- STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning (2025)
- Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 1
Spaces citing this paper 0
No Space linking this paper