Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories
Abstract
A dataset of 331 terminal-agent environments with 3,632 reward-hacking trajectories and 2,352 legitimate baselines across four AI models is released to study adversarial exploits in system administration, ML, software engineering, and security tasks.
We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably reward-hackable. The data set includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task, rather than the evaluation harness, making them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge, showing that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). The data set is publicly available at https://github.com/few-sh/terminal-wrench.
Community
Terminal Wrench is a dataset of terminal-bench style environments that have shown evidence of being reward-hackable, paired with agent trajectories that lead to hacking and non-hacking rewards. Each entry preserves the original task definition alongside full attack trajectories showing how the verifier was passed, including cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges, and the hacks range from simple output spoofing to stack-frame introspection, stdlib backdoors, and rootkit-style binary hijacking. Notably, these hacking trajectories are task-specific rather than benchmark/evaluation-harness-specific, which makes them harder to fix. The hacks were elicited by appending a hack elicitation excerpt to the prompt. We believe this is a subset of hackable tasks across the 1,860 we analyzed. Our methodology was to first elicit and judge hacking using a variety of models and prompts, producing over 40k trials. From these trials we narrowed the hackable-task subset to 395, and then performed a more robust hacker loop on those, resulting in this dataset.
We also share results of a simple monitorability experiment. We first sanitize the hacked trajectories with a sanitization prompt to remove obvious mentions of hacking behavior from the prompt and replace common keywords, and we ask another model to rewrite the agent blocks to be less suspicious. These sanitized trajectories are stored under sanitized_trajectories/
Some benchmarks continue to make improvements, for example we are aware Terminal Bench 2 has continued to fix existing tasks as issues have been found. We pulled our tasks from the primary sources between 2026-01-01 and 2026-03-30, so some of these tasks might have changed since.
For presentation, ◆ rewarded serious exploits and ◇ rewarded non-serious hacks are grouped together as hacks; ~ attacker legitimate solves and ° no-reward attempts remain separate non-hack outcomes.
331 unique tasks
957 task/model entries
6,289 v5 hacker trajectories
3,632 hack trajectories (◆ + ◇)
1,216 attacker legitimate solves (~)
1,441 no-reward attempts (°)
2,352 baseline trajectories from successful prechecks
3 models: claude-opus-4.6, gemini-3.1-pro, gpt-5.4
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Detecting Safety Violations Across Many Agent Traces (2026)
- SkillAttack: Automated Red Teaming of Agent Skills through Attack Path Refinement (2026)
- Re-Evaluating EVMBench: Are AI Agents Ready for Smart Contract Security? (2026)
- Countdown-Code: A Testbed for Studying The Emergence and Generalization of Reward Hacking in RLVR (2026)
- How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition (2026)
- Constitutional Black-Box Monitoring for Scheming in LLM Agents (2026)
- LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking (2026)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend
Get this paper in your agent:
hf papers read 2604.17596 Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper