arxiv:2604.17596

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Published on Apr 19

· Submitted by

Ivan Bercovich on Apr 21

Fewshot Corp

Upvote

Authors:

Abstract

A dataset of 331 terminal-agent environments with 3,632 reward-hacking trajectories and 2,352 legitimate baselines across four AI models is released to study adversarial exploits in system administration, ML, software engineering, and security tasks.

AI-generated summary

We release Terminal Wrench, a subset of 331 terminal-agent benchmark environments, copied from the popular open benchmarks that are demonstrably reward-hackable. The data set includes 3,632 hack trajectories and 2,352 legitimate baseline trajectories across three frontier models (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.4). Each entry preserves the original task definition alongside full attack trajectories that show how the verifier was bypassed. It also includes cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges; the exploits range from simple output spoofing to stack-frame introspection, standard-library patching, and rootkit-style binary hijacking. Crucially, these exploits are specific to each task, rather than the evaluation harness, making them harder to patch. We also present a monitorability study in which hack trajectories are sanitized or stripped of reasoning traces and then scored by an LLM judge, showing that detection degrades meaningfully when chain-of-thought is removed (AUC drops from 0.97 to 0.92). The data set is publicly available at https://github.com/few-sh/terminal-wrench.

View arXiv page View PDF Project page GitHub 20 Add to collection

Community

ibercovich

Paper submitter 1 day ago

Terminal Wrench is a dataset of terminal-bench style environments that have shown evidence of being reward-hackable, paired with agent trajectories that lead to hacking and non-hacking rewards. Each entry preserves the original task definition alongside full attack trajectories showing how the verifier was passed, including cases where the task was not solved as intended. The tasks span system administration, machine learning, software engineering, and security challenges, and the hacks range from simple output spoofing to stack-frame introspection, stdlib backdoors, and rootkit-style binary hijacking. Notably, these hacking trajectories are task-specific rather than benchmark/evaluation-harness-specific, which makes them harder to fix. The hacks were elicited by appending a hack elicitation excerpt to the prompt. We believe this is a subset of hackable tasks across the 1,860 we analyzed. Our methodology was to first elicit and judge hacking using a variety of models and prompts, producing over 40k trials. From these trials we narrowed the hackable-task subset to 395, and then performed a more robust hacker loop on those, resulting in this dataset.

We also share results of a simple monitorability experiment. We first sanitize the hacked trajectories with a sanitization prompt to remove obvious mentions of hacking behavior from the prompt and replace common keywords, and we ask another model to rewrite the agent blocks to be less suspicious. These sanitized trajectories are stored under sanitized_trajectories//trial/agent/trajectory.json, with sanitizer_model.txt recording which model produced the rewrite. Tool calls are unchanged. We also perform more aggressive stripping, where we remove all agent commentary and only keep the tool calls. We store these under stripped_trajectories//trial/agent/trajectory.json. We then use a judge prompt to rate the original hack as well as the two variants to determine whether detectability drops.

Some benchmarks continue to make improvements, for example we are aware Terminal Bench 2 has continued to fix existing tasks as issues have been found. We pulled our tasks from the primary sources between 2026-01-01 and 2026-03-30, so some of these tasks might have changed since.

For presentation, ◆ rewarded serious exploits and ◇ rewarded non-serious hacks are grouped together as hacks; ~ attacker legitimate solves and ° no-reward attempts remain separate non-hack outcomes.

331 unique tasks
957 task/model entries
6,289 v5 hacker trajectories
3,632 hack trajectories (◆ + ◇)
1,216 attacker legitimate solves (~)
1,441 no-reward attempts (°)
2,352 baseline trajectories from successful prechecks
3 models: claude-opus-4.6, gemini-3.1-pro, gpt-5.4

librarian-bot

1 day ago

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2604.17596

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2604.17596 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2604.17596 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2604.17596 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.