very-small-prompt-compression

Interactive demo: Very Small Prompt Compression (Space)

This model is a fine-tuned version of Falconsai/text_summarization on the gravitee-io/dolly-15k-prompt-compression dataset.
It achieves the following results on the evaluation set:

  • Loss: 2.1583
  • Rouge1: 0.8190
  • Rouge2: 0.6452
  • Rougel: 0.7792
  • Rougelsum: 0.7792
  • Comp Ratio Mean: 0.7395
  • Comp Ratio P90: 0.9091
  • Pct Violations: 0.0004

Model description

The gravitee-io/very-small-prompt-compression checkpoint is a compact sequence-to-sequence model that trims short user prompts (≤64 tokens) before they are forwarded to a larger assistant. Instead of performing full summarization, the decoder focuses on deleting optional tokens, rewriting phrases more succinctly, and stripping trailing punctuation while keeping intent, modality, polarity, entities, and numeric constraints intact. The model targets sub‑100 ms latency on modern GPUs and is intended to run as a lightweight preprocessing stage in front of higher-capacity LLMs.

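The snippet below loads the checkpoint and defines a helper that splits longer inputs on sentence punctuation before compressing each segment:
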
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import re

model_id = "gravitee-io/very-small-prompt-compression"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def compress_prompt(text: str) -> str:
    # Split longer prompts on sentence punctuation so each segment stays within the ≤64-token range the model expects.
    parts = re.split(r"([.?!])", text)
    segments = ["".join(parts[i:i+2]).strip() for i in range(0, len(parts), 2)]
    outputs = []
    for segment in segments:
        if not segment:
            continue
        encoded = tokenizer(segment, return_tensors="pt", truncation=True)
        generated = model.generate(**encoded, max_new_tokens=96, num_beams=4)
        compressed = tokenizer.decode(generated[0], skip_special_tokens=True).rstrip("?.!,;:")
        outputs.append(compressed)
    return " ".join(outputs).strip()
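
A quick check might look like the following; the example prompt and printed output are illustrative, and the exact compression depends on the checkpoint and generation settings:

example = (
    "Could you please give me a short summary of the three main causes "
    "of the French Revolution? Thanks so much!"
)
print(compress_prompt(example))  # expect a shorter rewrite that preserves the request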

Intended uses & limitations

Intended uses

  • Prompt compression layer for human → LLM chat, API gateways, or routing pipelines where reducing prompt tokens improves throughput and latency.
  • Automatic rewrite step that preserves explicit instructions while trimming pleasantries and redundant phrasing on short English prompts.
  • Building blocks for prompt-budgeting systems; the model can be paired with token-price heuristics to decide whether to forward the compressed or original prompt.

Use with caution

  • For prompts longer than 64 tokens, split on punctuation (as in the example above), compress each clause independently, then restore punctuation in order.
  • Downstream systems should keep the original prompt available; if compression confidence falls below a threshold (e.g., cosine similarity or minimum token savings), fall back to the original, as in the sketch below.
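
A minimal fallback sketch, assuming a token-savings threshold and reusing the tokenizer loaded earlier; the helper name and the 10% default are illustrative, not part of the released model:

def choose_prompt(original: str, compressed: str, min_savings: float = 0.10) -> str:
    # Hypothetical guard: only forward the compressed prompt when it is
    # shorter and saves at least `min_savings` of the original tokens.
    orig_tokens = len(tokenizer(original)["input_ids"])
    comp_tokens = len(tokenizer(compressed)["input_ids"])
    if comp_tokens < orig_tokens and (orig_tokens - comp_tokens) / orig_tokens >= min_savings:
        return compressed
    return original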

Limitations

  • Training data is English-only and derived from Dolly-style instructions; performance on other languages or domain-specific jargon may degrade.
  • The model inherits stylistic biases from the gpt-5-nano generations used to bootstrap the dataset.
  • Average compression on the held-out set is ~26 % token savings (comp ratio 0.7395). Do not expect aggressive summarization or reasoning compression on much longer documents.

Training and evaluation data

  • Source corpus: databricks/databricks-dolly-15k instructions served as the base prompts.
  • Synthetic targets: Each instruction was rewritten by gpt-5-nano under a policy that emphasized constraint preservation while removing optional words, helper verbs, and trailing punctuation. Responses that failed validation were regenerated.
  • Normalization: Compressed outputs were post-processed to drop trailing punctuation marks and optional English articles (a, an, the) when those edits did not change intent.
  • Splits: The resulting dataset (gravitee-io/dolly-15k-prompt-compression) was filtered to prompts ≤64 tokens, then split 90/10 into train and evaluation subsets. Metadata includes original and compressed token counts, ROUGE scores, compression ratios, and cosine similarity diagnostics (see the loading sketch below).
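
A minimal sketch for inspecting the dataset with the Hugging Face datasets library; the available splits and column names should be confirmed against the dataset card:

from datasets import load_dataset

# Load the compression dataset and print its splits and metadata columns.
ds = load_dataset("gravitee-io/dolly-15k-prompt-compression")
print(ds)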

Training procedure

The model was fine-tuned for five epochs using the Hugging Face Seq2SeqTrainer, starting from Falconsai/text_summarization. Length-aware generation settings (beam search with num_beams=4 and a mild length penalty) and a “no-worse-than-original” post-filter kept nearly all outputs shorter than their sources. Training monitored ROUGE-1/2/L, average compression ratio, and violation rate (the percentage of generations longer than the input). The best checkpoint, selected by lowest validation loss, was pushed to the Hugging Face Hub as gravitee-io/very-small-prompt-compression.

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto Seq2SeqTrainingArguments follows the list):

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 5
  • mixed_precision_training: Native AMP
  • label_smoothing_factor: 0.1
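
A hedged sketch of how these values might map onto Seq2SeqTrainingArguments; the output_dir and the generation/evaluation flags are assumptions, not the exact training configuration:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="very-small-prompt-compression",  # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=5,
    fp16=True,                     # Native AMP mixed precision
    label_smoothing_factor=0.1,
    predict_with_generate=True,    # assumption: enables ROUGE computation at eval time
)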

Training results

| Training Loss | Epoch | Step  | Comp Ratio Mean | Comp Ratio P90 | Validation Loss | Pct Violations | Rouge1 | Rouge2 | Rougel | Rougelsum |
|:-------------:|:-----:|:-----:|:---------------:|:--------------:|:---------------:|:--------------:|:------:|:------:|:------:|:---------:|
| 2.8959        | 1.0   | 4044  | 0.7438          | 0.9167         | 2.3356          | 0.0002         | 0.7994 | 0.6082 | 0.7545 | 0.7543    |
| 2.4157        | 2.0   | 8088  | 0.7393          | 0.9119         | 2.2255          | 0.0005         | 0.8121 | 0.6299 | 0.7696 | 0.7696    |
| 2.3277        | 3.0   | 12132 | 0.7391          | 0.9091         | 2.1815          | 0.0004         | 0.8168 | 0.6386 | 0.7757 | 0.7756    |
| 2.2933        | 4.0   | 16176 | 0.7395          | 0.9091         | 2.1669          | 0.0004         | 0.8182 | 0.6416 | 0.7774 | 0.7773    |
| 2.2815        | 5.0   | 20220 | 0.7395          | 0.9091         | 2.1583          | 0.0004         | 0.8190 | 0.6452 | 0.7792 | 0.7792    |

Framework versions

  • Transformers 4.57.1
  • PyTorch 2.8.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1

Evaluation

  • Held-out compression: On the ≤64 token evaluation split the model reaches a mean compression ratio of 0.7395 (≈26 % token reduction) with only 0.04 % of generations exceeding the original length.
  • Semantic fidelity: Cosine similarity between original and compressed embeddings (text-embedding-3-small) averages above 0.90, indicating that key semantics are preserved (a similarity-check sketch follows this list).
  • Instruction alignment: ROUGE-L of 0.7792 against synthetic targets shows the model closely matches the policy-compliant outputs produced during data generation.
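
A minimal sketch of that similarity check, assuming the OpenAI Python SDK with an OPENAI_API_KEY in the environment; it illustrates the metric rather than reproducing the exact evaluation script:

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def semantic_similarity(original: str, compressed: str) -> float:
    # Embed both prompts with text-embedding-3-small and compute cosine similarity.
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[original, compressed],
    )
    a = np.array(resp.data[0].embedding)
    b = np.array(resp.data[1].embedding)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))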

License

This model is released under the Apache 2.0 License.

Acknowledgments

Citation

If you use this model in your research, please cite:

@misc{very_small_prompt_compression_2025,
  title={Very Small Prompt Compression Model},
  author={Derek Thompson - Gravitee.io},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/gravitee-io/very-small-prompt-compression}}
}

Contact

For questions, issues, or contributions, please open an issue on the model repository.


Generated by dotslashderek on 2025-10-31
