daVinci-Env: Open SWE Environment Synthesis at Scale
Abstract
OpenSWE presents the largest transparent framework for software engineering agent training, featuring 45,320 executable environments and achieving state-of-the-art performance on SWE-bench Verified.
Training capable software engineering (SWE) agents demands large-scale, executable, and verifiable environments that provide dynamic feedback loops for iterative code editing, test execution, and solution refinement. However, existing open-source datasets remain limited in scale and repository diversity, while industrial solutions are opaque with unreleased infrastructure, creating a prohibitive barrier for most academic research groups. We present OpenSWE, the largest fully transparent framework for SWE agent training in Python, comprising 45,320 executable Docker environments spanning over 12.8k repositories, with all Dockerfiles, evaluation scripts, and infrastructure fully open-sourced for reproducibility. OpenSWE is built through a multi-agent synthesis pipeline deployed across a 64-node distributed cluster, automating repository exploration, Dockerfile construction, evaluation script generation, and iterative test analysis. Beyond scale, we propose a quality-centric filtering pipeline that characterizes the inherent difficulty of each environment, filtering out instances that are either unsolvable or insufficiently challenging and retaining only those that maximize learning efficiency. With $891K spent on environment construction and an additional $576K on trajectory sampling and difficulty-aware curation, the entire project represents a total investment of approximately $1.47 million, yielding about 13,000 curated trajectories from roughly 9,000 quality-guaranteed environments. Extensive experiments validate OpenSWE's effectiveness: OpenSWE-32B and OpenSWE-72B achieve 62.4% and 66.0% on SWE-bench Verified, establishing state-of-the-art results among Qwen2.5-series models. Moreover, SWE-focused training yields substantial out-of-domain improvements, including up to 12 points on mathematical reasoning and 5 points on science benchmarks, without degrading factual recall.
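The difficulty-aware curation step described above can be illustrated with a minimal sketch. This is not the paper's released implementation; the record type, field names, and thresholds below are hypothetical. The idea is simply to estimate a per-environment solve rate from sampled agent trajectories, then drop environments that are unsolvable (no trajectory ever passes) or insufficiently challenging (nearly every trajectory passes):

```python
from dataclasses import dataclass


@dataclass
class EnvStats:
    """Hypothetical per-environment record: repository name and the
    fraction of sampled agent trajectories whose patch passes the tests."""
    repo: str
    solve_rate: float


def filter_by_difficulty(envs, low=0.1, high=0.9):
    """Keep environments that are neither unsolvable (solve_rate <= low)
    nor trivially easy (solve_rate >= high). The thresholds here are
    illustrative, not the values used by the paper."""
    return [e for e in envs if low < e.solve_rate < high]


envs = [
    EnvStats("repo-a", 0.0),  # unsolvable: dropped
    EnvStats("repo-b", 0.5),  # informative: kept
    EnvStats("repo-c", 1.0),  # trivial: dropped
]
kept = filter_by_difficulty(envs)
print([e.repo for e in kept])  # prints ['repo-b']
```

Filtering on an empirical solve rate like this is one plausible reading of "retaining only those that maximize learning efficiency": environments with intermediate solve rates produce both successful and failed trajectories, which is where a reward signal is most informative.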
Community
Lowkey, the four-stage PR-based filtering pipeline is the heart of OpenSWE, turning a wild repo soup into a reliable learning signal. I buy the idea of filtering for solvability and maintenance realism, but I worry this could bias the curriculum toward easier tasks or toward tasks with brittle tests. An ablation removing the 'unsolvable' gate, or replacing it with a continuous difficulty score, would tell us how sensitive the gains are to the exact filter thresholds. Btw, the arxivlens breakdown helped me parse the method details, especially how the filter criteria map to learning-signal quality; nice summary at https://arxivlens.com/PaperView/Details/davinci-env-open-swe-environment-synthesis-at-scale-5258-f8a29a3b. How do you think this filtering strategy would fare if you relaxed it to include more borderline environments? Would the out-of-domain gains hold?
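For concreteness, the continuous alternative to the hard gate that the comment above suggests could be sketched as an entropy-style sampling weight over the empirical solve rate. This is purely illustrative (the function name and clamping constant are made up, not from the paper): borderline environments near a 0.5 solve rate get the highest weight, and weights decay smoothly toward zero for unsolvable or trivial environments instead of cutting them off at a threshold.

```python
import math


def difficulty_weight(solve_rate: float) -> float:
    """Illustrative continuous replacement for a hard unsolvable/trivial
    gate: weight each environment by the binary entropy of its empirical
    solve rate, so borderline environments contribute most to training."""
    p = min(max(solve_rate, 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


print(difficulty_weight(0.5))               # prints 1.0 (maximum weight)
print(round(difficulty_weight(0.05), 3))    # prints 0.286 (low weight)
```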
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Immersion in the GitHub Universe: Scaling Coding Agents to Mastery (2026)
- DockSmith: Scaling Reliable Coding Environments via an Agentic Docker Builder (2026)
- SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale (2026)
- SWE-Hub: A Unified Production System for Scalable, Executable Software Engineering Tasks (2026)
- SWE-Universe: Scale Real-World Verifiable Environments to Millions (2026)
- MEnvAgent: Scalable Polyglot Environment Construction for Verifiable Software Engineering (2026)
- SWE-Master: Unleashing the Potential of Software Engineering Agents via Post-Training (2026)
Models citing this paper 2
Datasets citing this paper 1