Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny

This repository contains models from the Re:Form framework, as presented in the paper Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny.

Re:Form introduces a novel approach to formal software verification by integrating Reinforcement Learning (RL) with Large Language Models (LLMs), focusing on formal languages such as Dafny. The goal is to reduce the dependency on extensive human-annotated priors, which is a major bottleneck for LLMs that reason in informal (natural) language. By grounding LLMs in a rigorous formal system, Re:Form enables automatic, mathematically provable verification of both the reasoning process and its outcome, addressing the scalability and reliability issues involved in generating verifiable programs.

The pipeline for Re:Form combines an automatic, scalable data curation process with carefully designed RL strategies that incorporate feedback from a formal language verifier. The paper also introduces DafnyComp, a benchmark of compositional formal programs. Even small models (e.g., 0.5B parameters) trained with supervised fine-tuning (SFT) can generate syntactically valid and verifiable Dafny code, surpassing proprietary models, and RL with regularization yields further gains in performance and generalization to out-of-domain tasks.
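
To make the verifier-feedback idea concrete, here is a minimal, hypothetical sketch of a binary reward signal built on the Dafny command-line verifier. It assumes a local `dafny` installation whose `verify` subcommand checks a `.dfy` file (Dafny 4.x CLI); the function name `dafny_verify_reward` and the 1.0/0.0 reward values are illustrative and not the exact reward design used in the paper.

import subprocess
import tempfile
from pathlib import Path

def dafny_verify_reward(program: str, timeout_s: int = 60) -> float:
    """Hypothetical binary reward: 1.0 if the Dafny verifier accepts the program, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate.dfy"
        src.write_text(program)
        try:
            # "dafny verify" is the Dafny 4.x CLI; older releases use "dafny /compile:0 <file>".
            result = subprocess.run(
                ["dafny", "verify", str(src)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except (subprocess.TimeoutExpired, FileNotFoundError):
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0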

Overall Pipeline

Usage

This model is compatible with the Hugging Face transformers library. You can load and use it for text generation tasks, particularly for generating Dafny code.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Replace with the specific model checkpoint you want to use, e.g., "Veri-Code/sft_0.5B" or "Veri-Code/ReForm-SFT-7B"
model_name = "Veri-Code/sft_0.5B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16, # Use bfloat16 for better performance if supported, otherwise torch.float16
    device_map="auto" # Automatically loads the model across available devices (e.g., multiple GPUs)
)

model.eval() # Set model to evaluation mode

# Example: Generate Dafny code for a simple function
prompt = """
method Add(x: int, y: int) returns (sum: int)
    ensures sum == x + y
{
"""

print(f"Input Prompt:
{prompt}")

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate continuation of the code
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100, # Generate up to 100 new tokens
        do_sample=True,
        temperature=0.7,    # Adjust for more/less creative outputs
        top_p=0.9           # Adjust for nucleus sampling
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated Dafny Code:
{generated_text}")

Citation

If you use this work in your research, please cite:

@misc{yan2025reformreducinghuman,
      title={Re:Form -- Reducing Human Priors in Scalable Formal Software Verification with RL in LLMs: A Preliminary Study on Dafny}, 
      author={Chuanhao Yan and Fengdi Che and Xuhan Huang and Xu Xu and Xin Li and Yizhi Li and Xingwei Qu and Jingzhe Shi and Zhuangzhuang He and Chenghua Lin and Yaodong Yang and Binhang Yuan and Hang Zhao and Yu Qiao and Bowen Zhou and Jie Fu},
      year={2025},
      eprint={2507.16331},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.16331}, 
}