Squeez: Task-Conditioned Tool-Output Pruning for Coding Agents
A pytest run returns 45 lines; the next reasoning step needs 7. The same happens with grep, git log, kubectl, pip install, file reads, stack traces, and other observations that are structurally noisy but locally informative.
In this post we'll explain how we've built Squeez for that setting: an open benchmark for task-conditioned tool-output pruning, a LoRA-tuned Qwen 3.5 2B model trained on it, and a CLI that can be dropped into existing coding-agent workflows. The task is simple: given a focused query and one raw tool observation, return the smallest verbatim evidence block worth keeping.
On the held-out benchmark, the model reaches 0.86 recall at 92% compression, outperforming zero-shot Qwen 3.5 35B A3B by 11 recall points at essentially the same compression level. The model, dataset, and code are open on Hugging Face, the dataset hub, and GitHub.
TL;DR
- We release Squeez, a benchmark, model, and CLI for pruning tool output in coding agents.
- The benchmark has 11,477 examples across 27 tool types, built from real SWE-bench-derived observations and synthetic multi-ecosystem tool outputs.
- A LoRA-tuned Qwen 3.5 2B reaches 0.86 recall at 92% compression, beating zero-shot Qwen 3.5 35B A3B by 11 recall points.
- Squeez is a preprocessing step, not a new agent framework: it keeps the smallest verbatim evidence block from `pytest`, `grep`, `git log`, `kubectl`, file reads, and similar outputs.
- Everything is released under the Apache 2.0 license.
- Try the CLI with `pip install squeez`.
- Code and CLI: github.com/KRLabsOrg/squeez
Quick example:
pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
Introduction
Coding agents such as Claude Code and Codex spend much of their time reading tool output. When an agent runs pytest, grep, git log, kubectl, or pip install, the result is often dozens or hundreds of lines long. Only a small fraction of those lines matters for the next step. The rest is headers, passing tests, repeated metadata, timestamps, unchanged context, or structurally similar but irrelevant matches. A lot of the context budget ends up going not to reasoning, but to re-reading noise.
Test output is the easiest example. One failure block matters; everything else is scaffolding around it.
Raw tool output (45 lines):
$ python -m pytest tests/ -v
======================== test session starts ===========
platform linux -- Python 3.12.1, pytest-8.1.1
collected 23 items
tests/test_auth.py::test_login_valid PASSED
tests/test_auth.py::test_login_invalid PASSED
tests/test_auth.py::test_token_refresh FAILED
tests/test_auth.py::test_logout PASSED
tests/test_users.py::test_create_user PASSED
... 17 more PASSED ...
tests/test_middleware.py::test_cors_headers PASSED
======================= FAILURES =======================
_____ test_token_refresh _______________________________
def test_token_refresh(self):
token = self.client.get_token(expired=True)
> refreshed = self.client.refresh(token)
E AuthenticationError: Token refresh window expired
E Expected: new token within 30m window
E Got: rejection after 15m (timeout changed?)
tests/test_auth.py:47: AuthenticationError
================ short test summary info ===============
FAILED tests/test_auth.py::test_token_refresh
================== 1 failed, 22 passed =================
After Squeez (7 lines):
tests/test_auth.py::test_token_refresh FAILED
def test_token_refresh(self):
token = self.client.get_token(expired=True)
> refreshed = self.client.refresh(token)
E AuthenticationError: Token refresh window expired
E Expected: new token within 30m window
E Got: rejection after 15m (timeout changed?)
That is 84% compression while preserving the only part of the observation that matters for the next debugging step:
python -m pytest tests/ -v 2>&1 | squeez "find the test failure related to authentication"
The same structure shows up in many other tools. grep returns a long list of nearby lexical matches even though only one file matters. git log shows a history where one commit matters. kubectl describe contains hundreds of lines of pod state even though the evidence is two lines saying OOMKilled and Exit Code: 137. read_file returns an entire module even though the agent only needs one code block.
Existing pruning systems work in a similar direction but usually operate on different units. LLMLingua and LongLLMLingua compress prompts at the token or prompt-block level (Jiang et al., 2023; Jiang et al., 2024). EXIT and Provence perform extractive compression over retrieved text for downstream question answering or retrieval-augmented generation (Hwang et al., 2025; Chirkova et al., 2025). Zilliz Semantic Highlight adapts this line of work to semantic highlighting over retrieved passages (model card). SWE-Pruner is the closest coding baseline, but it focuses on pruning repository code context rather than a single mixed-format tool observation (Wang et al., 2026).
Tool output is a different object. It is not well-formed prose, and it is not always source code. A single observation may mix code, logs, shell traces, stack frames, JSON payloads, and Git metadata. The relevant unit may be a failure block, a short function body, a commit entry, a package conflict, or nothing at all. That is the gap Squeez targets.
The Task and the Benchmark
We formulate the problem as task-conditioned tool-output pruning: given a focused query and one raw tool observation, return the smallest verbatim evidence block that the agent should inspect next. The model is not asked to solve the full bug from one observation. It is asked to preserve the relevant evidence and remove the rest.
Two properties of the task matter. First, the output is verbatim. We do not want paraphrased summaries of stack traces, imports, versions, exit codes, or code blocks. Tool output often contains details that should remain exact. Second, the query is task-conditioned but narrower than the full issue description. It expresses the local information need the agent has at that moment: find the failure block, the relevant code region, or the commit that likely introduced the behavior.
The overall pipeline is shown below:
The benchmark is built from two sources. The first is SWE-bench, which provides real GitHub issue-resolution tasks over real repositories. We clone repository snapshots and execute 14 tool types against them — file reads, grep, Git log and blame, test runners, linters, type checkers, package installation, curl, and others — collecting 10,713 raw observations that reflect the kind of output a coding agent encounters during issue resolution.
The second source is synthetic multi-ecosystem tool output, which extends coverage beyond SWE-bench's Python-heavy distribution. We use openai/gpt-oss-120b to generate 2,039 realistic tool outputs for representative tasks in TypeScript, Go, Rust, Java, Docker, Terraform, and Kubernetes workflows. We also construct explicit negatives by pairing mismatched queries and tool outputs, where the correct pruning decision is to return nothing.
The executed SWE-derived subset covers 14 tool types; the full released benchmark reaches 27 tool families once the synthetic multi-ecosystem portion is added.
Each released positive example is labeled with the same two-stage teacher pipeline, again using openai/gpt-oss-120b, regardless of whether it comes from SWE-bench or from the synthetic portion. First, the teacher writes a focused extraction query for one observation. Second, it selects the smallest contiguous span, or small set of spans, that answers that query. The teacher sees a numbered rendering of the output for stable span selection, but the released labels are always mapped back onto the original raw text. Positive examples whose query cannot be supported by the observation are dropped rather than retained as accidental empty outputs. Explicit negatives are created separately in the synthetic portion, where the correct target is an empty extraction.
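The numbered-rendering and span-mapping step can be sketched as follows. This is an illustrative reconstruction, not the released pipeline code; the helper names are ours.

```python
def number_lines(raw: str) -> str:
    """Render an observation with 1-based line numbers for the teacher,
    so it can select spans by index rather than by copying text."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(raw.splitlines(), 1))

def spans_to_text(raw: str, spans: list[tuple[int, int]]) -> str:
    """Map teacher-selected (start, end) line spans (inclusive, 1-based)
    back onto the original raw text, keeping it verbatim.
    An empty span list yields an empty extraction (a negative)."""
    lines = raw.splitlines()
    kept = []
    for start, end in spans:
        kept.extend(lines[start - 1 : end])
    return "\n".join(kept)
```

Selecting by line index and mapping back keeps the released labels verbatim even when the teacher's rendering differs from the raw text.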
The held-out set was also manually curated. Starting from 729 candidate test examples, we removed 111 cases (15.2%) that were near-duplicates, trivial one- or two-line outputs, overly broad spans, or incorrect annotations. The final test set contains 618 manually reviewed examples.
The released benchmark contains 11,477 examples in total: 9,205 SWE-derived examples, 1,697 synthetic positives, and 575 synthetic negatives. SWE-derived examples are split by repository and synthetic examples by tool family.
| Source | Raw inputs | Released rows |
|---|---|---|
| SWE-derived | 10,713 | 9,205 |
| Synthetic positives | 2,039 | 1,697 |
| Synthetic negatives | — | 575 |
| Total | 12,752 | 11,477 |
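The leakage-free split described above (by repository for SWE-derived rows, by tool family for synthetic rows) amounts to holding out whole groups. A minimal sketch, with the field names assumed for illustration:

```python
def split_by_group(examples: list[dict], group_key: str, test_groups: list[str]):
    """Hold out entire groups (repositories or tool families) so that no
    group appears in both train and test."""
    held_out = set(test_groups)
    train = [ex for ex in examples if ex[group_key] not in held_out]
    test = [ex for ex in examples if ex[group_key] in held_out]
    return train, test
```

Splitting by group rather than by row prevents near-duplicate observations from the same repository leaking into the test set.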
The benchmark covers 27 tool types. The largest families are shown below.
| Tool family | Rows | Avg. input | Avg. gold |
|---|---|---|---|
| read_file | 3768 | 1677 | 84 |
| grep | 1330 | 779 | 19 |
| git_log | 720 | 161 | 11 |
| python | 698 | 60 | 28 |
| test_output | 546 | 56 | 23 |
| curl | 493 | 723 | 68 |
| pip_install | 441 | 438 | 79 |
| type_check | 317 | 3418 | 39 |
| git_blame | 291 | 4210 | 139 |
| remaining tools | 2873 | 688 | 47 |
The distribution is intentionally heterogeneous. python and test_output rows are short; read_file, type_check, and git_blame can be extremely long. This matters because the useful evidence does not follow one structural pattern, and it may occur at the beginning, middle, or end of the observation. That is also why simple truncation and lexical retrieval remain weak baselines here.
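The truncation and random baselines reported in the results below are easy to state; these sketches assume a 10% line budget, matching the setup used for the heuristic baselines.

```python
import random

def first_n(raw: str, keep: float = 0.10) -> str:
    """Truncation baseline: keep the first ~10% of lines."""
    lines = raw.splitlines()
    k = max(1, int(len(lines) * keep))
    return "\n".join(lines[:k])

def random_n(raw: str, keep: float = 0.10, seed: int = 0) -> str:
    """Random baseline: keep ~10% of lines, preserving original order."""
    lines = raw.splitlines()
    k = max(1, int(len(lines) * keep))
    idx = sorted(random.Random(seed).sample(range(len(lines)), k))
    return "\n".join(lines[i] for i in idx)
```

Because the evidence can sit anywhere in the observation, any position-based budget of this kind is bound to miss it most of the time.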
Training
We use Qwen 3.5 2B as the base model (Qwen3.5 blog post). The goal is to learn a narrow supervised extraction task that can run cheaply inside an agent loop. A 2B model is large enough to benefit from supervision, but still small enough to be practical for local serving and repeated tool use.
We fine-tune with LoRA (Hu et al., 2022; Dettmers et al., 2023) using the Unsloth stack. The model receives a focused extraction query and the raw tool observation, and is trained to emit the extracted evidence wrapped in <relevant_lines> tags. The supervision target is the exact evidence block the model should keep.
Training uses a maximum sequence length of 20,000 tokens, an effective batch size of 32, a learning rate of 2e-4, 3 epochs, warmup ratio 0.05, and weight decay 0.01. After training, we merge the LoRA adapter into the base model and serve the merged checkpoint through vLLM.
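A minimal sketch of the supervision format, assuming a plain instruction-style prompt (the released model uses its own chat template, and the exact wording here is illustrative):

```python
import re

def build_prompt(query: str, observation: str) -> str:
    # Illustrative prompt shape; the real model applies its chat template.
    return (
        f"Query: {query}\n\n"
        f"Tool output:\n{observation}\n\n"
        "Return the smallest relevant block wrapped in <relevant_lines> tags."
    )

def parse_evidence(completion: str) -> str:
    """Pull the verbatim evidence out of the tagged completion.
    An empty result means the model judged the observation irrelevant."""
    m = re.search(r"<relevant_lines>(.*?)</relevant_lines>", completion, re.DOTALL)
    return m.group(1).strip() if m else ""
```

Keeping the target inside explicit tags makes both the empty (negative) case and the verbatim-extraction constraint easy to check at inference time.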
Results
We compare Squeez-2B against three zero-shot generative baselines and four heuristic baselines. The heuristic baselines keep roughly 10% of the input lines to operate at a compression level similar to the gold extractions. The main metrics are recall, F1, and compression. Recall matters most because dropping relevant evidence is usually more harmful than keeping a slightly larger block.
| Model | Recall | F1 | Compression |
|---|---|---|---|
| Squeez-2B | 0.86 | 0.80 | 0.92 |
| Qwen 3.5 35B A3B | 0.75 | 0.73 | 0.92 |
| Kimi K2 | 0.53 | 0.68 | 0.94 |
| Qwen 3.5 2B (base) | 0.53 | 0.55 | 0.82 |
| BM25 (10%) | 0.22 | 0.23 | 0.90 |
| First-N (10%) | 0.14 | 0.16 | 0.91 |
| Random (10%) | 0.10 | 0.20 | 0.91 |
| Last-N (10%) | 0.05 | 0.14 | 0.91 |
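One way to compute these per-example numbers, assuming line-level verbatim matching (our assumption about the scorer, not its released implementation):

```python
def prune_metrics(pred: str, gold: str, raw: str) -> dict:
    """Line-level precision/recall/F1 plus compression for one example.
    An empty prediction on an empty gold (a true negative) scores 1.0."""
    pred_lines = set(pred.splitlines())
    gold_lines = set(gold.splitlines())
    tp = len(pred_lines & gold_lines)
    recall = tp / len(gold_lines) if gold_lines else float(not pred_lines)
    precision = tp / len(pred_lines) if pred_lines else float(not gold_lines)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    compression = 1 - len(pred.splitlines()) / max(len(raw.splitlines()), 1)
    return {"recall": recall, "precision": precision,
            "f1": f1, "compression": compression}
```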
Squeez-2B beats zero-shot Qwen 3.5 35B A3B by 11 recall points at essentially the same compression level. For this task, narrow supervision matters more than scale. BM25 reaches only 0.22 recall, showing that lexical overlap is a weak signal for locating evidence in logs, stack traces, and other mixed-format observations. Kimi K2 compresses more aggressively, but loses too much evidence in the process.
The recall-compression trade-off is shown below. Squeez-2B occupies the upper-left region: high recall with strong compression.
Qualitative Analysis
We show a few recurring patterns that are consistent across the test set.
Precise selection in structured output. In grep and git_log, the fine-tuned model learns to return the single relevant hit. Here is a 21-line git_log where the task is to find the commit that changed the dimension order of xr.polyval output:
fc282d59 re-add timedelta support for polyval (#6599)
cad4474a Fix polyval overloads (#6593)
6fbeb131 polyval: Use Horner's algorithm + support chunked inputs (#6548) ← gold
07de257c Simplify transpose in xr.dot (#5849)
... 17 more lines ...
| Model | Prediction | Correct? |
|---|---|---|
| Squeez-2B | 6fbeb131 polyval: Use Horner's algorithm... | Yes |
| Qwen 3.5 35B A3B | 07de257c Simplify transpose in xr.dot | No (wrong commit) |
| Qwen 3.5 2B (base) | 3 polyval commits (over-selects) | Partial |
Squeez picks the exact commit. Qwen 35B picks a plausible but wrong commit about transpose — right neighborhood, wrong entry.
Failure-block extraction in logs. This 176-line service log contains two separate TLS handshake failures at different timestamps. The query asks for the health-check failure:
... 40 lines of startup logs ...
10:00:00.240 [ERROR] TLS handshake failed: certificate verify failed ← gold
10:00:00.241 [ERROR] node-fetch: request to .../status failed ← gold
10:00:00.260 [WARN] Health check #1 failed (TLS error) ← gold
... 80 lines of normal operation ...
10:00:21.165 [ERROR] TLS handshake failed: certificate verify failed ← wrong block
10:00:21.166 [ERROR] node-fetch: request to .../pay failed ← wrong block
... 50 more lines ...
| Model | Prediction | Correct? |
|---|---|---|
| Squeez-2B | Health-check TLS block (10:00:00) | Yes |
| Qwen 3.5 35B A3B | Payment TLS block (10:00:21) | No (wrong timestamp) |
| Kimi K2 | Health-check TLS block (10:00:00) | Partial (3 of 5 lines) |
Qwen 35B selects a semantically similar but wrong block from a later request. This "right pattern, wrong instance" failure is common among zero-shot models on repetitive log output.
Correct empty predictions. On negative examples where the tool output does not contain the requested evidence, Squeez correctly returns nothing. In a 316-line docker_logs output, the query asks about a numpy version conflict between torch and tensorflow — but no such conflict exists. Squeez returns empty output; Qwen 35B generates "No relevant lines found..." (not verbatim tool output); the 2B base returns unrelated database errors. On the 59 negative examples in the test set, Squeez-2B correctly returns empty 80% of the time. Kimi K2 matches this (81%), likely because its aggressive compression tends toward empty output. Qwen 35B returns empty only 7% of the time.
Compact local evidence. The kubectl example shows the intended use case in its simplest form: the full observation contains 250 lines of pod description, while the relevant evidence is a two-line block reporting OOMKilled and the exit code.
Across the held-out set, these four patterns account for most correct selections and most remaining errors. The main remaining failure mode is semantically adjacent over-selection. In a build log containing both a Dockerfile syntax error and a Python SyntaxError, Squeez usually finds the Dockerfile block but may keep the nearby Python error as well. Qwen 35B often misses the Dockerfile error entirely and keeps only the Python block.
Using Squeez
Squeez is a preprocessing step, not a new agent architecture. It does not require changes to the planner, tool API, or interaction loop. You can pipe tool output through the CLI:
pytest -q 2>&1 | squeez "find the failure block"
git log --oneline -50 | squeez "find the commit that changed CSRF handling"
cat src/auth/middleware.py | squeez "find the referer validation logic"
Or you can serve the model with vLLM for higher-throughput settings:
vllm serve KRLabsOrg/squeez-2b --dtype bfloat16 --max-model-len 16384
export SQUEEZ_SERVER_URL=http://localhost:8000/v1
pytest -q 2>&1 | squeez "find the failure block"
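If you want to call the served model directly rather than through the CLI, the request against vLLM's OpenAI-compatible endpoint looks roughly like this. The prompt shape is our assumption; the squeez CLI handles the real formatting for you.

```python
import json
from urllib import request

VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_request(query: str, observation: str, url: str = VLLM_URL) -> request.Request:
    """Build an OpenAI-compatible chat request for the served checkpoint.
    Send it with urllib.request.urlopen(req) and parse the JSON response."""
    payload = {
        "model": "KRLabsOrg/squeez-2b",
        "messages": [{
            "role": "user",
            "content": f"Query: {query}\n\nTool output:\n{observation}",
        }],
        "temperature": 0.0,
    }
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
```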
For systems such as Claude Code, a minimal CLAUDE.md instruction is enough:
When you invoke a shell command, pipe it through `squeez` and describe what you need.
Examples:
- `bun test 2>&1 | squeez "did the tests pass?"`
- `git log --oneline -50 | squeez "find the commit that broke CSRF"`
The same pattern works with Codex and other agent setups that accept system-level instructions or shell wrappers.
Conclusion
Tool-output pruning in coding agents was missing a benchmark of its own. Existing pruning systems target prompts, retrieved documents, or repository/code context; they do not directly address single mixed-format tool observations such as logs, traces, CLI output, or file reads. Squeez was built for that gap.
We release a benchmark for task-conditioned tool-output pruning, constructed from real SWE-bench-derived observations and synthetic multi-ecosystem tool outputs, together with a fine-tuned Qwen 3.5 2B model and a CLI for practical use. On this task, a small supervised model performs well under strong compression and outperforms much larger zero-shot baselines. That is the main result.
Resources
- Paper: arXiv:2604.04979
- Model: KRLabsOrg/squeez-2b (Apache 2.0)
- Dataset: KRLabsOrg/tool-output-extraction-swebench (Apache 2.0)
- Code & CLI: github.com/KRLabsOrg/squeez (Apache 2.0)
References
- Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR
- Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. NeurIPS
- Jiang, H., et al. (2023). LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP
- Jiang, H., et al. (2024). LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. ACL
- Hwang, T., et al. (2025). EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation. Findings of ACL
- Chirkova, N., et al. (2025). Provence: Efficient and Robust Context Pruning for Retrieval-Augmented Generation. ICLR
- Zilliz. (2025). Semantic Highlight Bilingual v1. Model card
- Kerboua, I., et al. (2025). FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents. arXiv
- Wang, Y., et al. (2026). SWE-Pruner: Self-Adaptive Context Pruning for Coding Agents. arXiv
- Kovacs, A., Schmitt, P., Recski, G. (2025). KR Labs at ArchEHR-QA 2025: A Verbatim Approach for Evidence-Based Question Answering. BioNLP Workshop
- Jimenez, C. E., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR
- Qwen Team. (2026). Qwen3.5: Towards Native Multimodal Agents. Blog post