very-small-prompt-compression

Interactive demo: Very Small Prompt Compression (Space)

This model is a fine-tuned version of Falconsai/text_summarization on the gravitee-io/dolly-15k-prompt-compression dataset.
It achieves the following results on the evaluation set:

  • Loss: 2.1583
  • Rouge1: 0.8190
  • Rouge2: 0.6452
  • Rougel: 0.7792
  • Rougelsum: 0.7792
  • Comp Ratio Mean: 0.7395
  • Comp Ratio P90: 0.9091
  • Pct Violations: 0.0004

Model description

The gravitee-io/very-small-prompt-compression checkpoint is a compact sequence-to-sequence model that trims short user prompts (≤64 tokens) before they are forwarded to a larger assistant. Instead of performing full summarization, the decoder focuses on deleting optional tokens, rewriting phrases more succinctly, and stripping trailing punctuation while keeping intent, modality, polarity, entities, and numeric constraints intact. The model targets sub‑100 ms latency on modern GPUs and is intended to run as a lightweight preprocessing stage in front of higher-capacity LLMs.

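The snippet below loads the checkpoint and defines a helper that splits longer inputs on sentence punctuation before compressing each segment:
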
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import re

model_id = "gravitee-io/very-small-prompt-compression"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

def compress_prompt(text: str) -> str:
    # Split longer prompts on sentence punctuation so each segment stays within the ≤64-token range the model expects.
    parts = re.split(r"([.?!])", text)
    segments = ["".join(parts[i:i+2]).strip() for i in range(0, len(parts), 2)]
    outputs = []
    for segment in segments:
        if not segment:
            continue
        encoded = tokenizer(segment, return_tensors="pt", truncation=True)
        generated = model.generate(**encoded, max_new_tokens=96, num_beams=4)
        compressed = tokenizer.decode(generated[0], skip_special_tokens=True).rstrip("?.!,;:")
        outputs.append(compressed)
    return " ".join(outputs).strip()
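
A quick check might look like the following; the example prompt and printed output are illustrative, and the exact compression depends on the checkpoint and generation settings:

example = (
    "Could you please give me a short summary of the three main causes "
    "of the French Revolution? Thanks so much!"
)
print(compress_prompt(example))  # expect a shorter rewrite that preserves the request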

Intended uses & limitations

Intended uses

  • Prompt compression layer for human → LLM chat, API gateways, or routing pipelines where reducing prompt tokens improves throughput and latency.
  • Automatic rewrite step that preserves explicit instructions while trimming pleasantries and redundant phrasing on short English prompts.
  • Building blocks for prompt-budgeting systems; the model can be paired with token-price heuristics to decide whether to forward the compressed or original prompt.

Use with caution

  • For prompts longer than 64 tokens, split on punctuation (as in the example above), compress each clause independently, then restore punctuation in order.
  • Downstream systems should keep the original prompt available; if compression confidence falls below a threshold (e.g., cosine similarity or minimum token savings), fall back to the original, as in the sketch below.
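
A minimal fallback sketch, assuming a token-savings threshold and reusing the tokenizer loaded earlier; the helper name and the 10% default are illustrative, not part of the released model:

def choose_prompt(original: str, compressed: str, min_savings: float = 0.10) -> str:
    # Hypothetical guard: only forward the compressed prompt when it is
    # shorter and saves at least `min_savings` of the original tokens.
    orig_tokens = len(tokenizer(original)["input_ids"])
    comp_tokens = len(tokenizer(compressed)["input_ids"])
    if comp_tokens < orig_tokens and (orig_tokens - comp_tokens) / orig_tokens >= min_savings:
        return compressed
    return original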

Limitations

  • Training data is English-only and derived from Dolly-style instructions; performance on other languages or domain-specific jargon may degrade.
  • The model inherits stylistic biases from the gpt-5-nano generations used to bootstrap the dataset.
  • Average compression on the held-out set is ~26 % token savings (comp ratio 0.7395). Do not expect aggressive summarization or reasoning compression on much longer documents.

Training and evaluation data

  • Source corpus: databricks/databricks-dolly-15k instructions served as the base prompts.
  • Synthetic targets: Each instruction was rewritten by gpt-5-nano under a policy that emphasized constraint preservation while removing optional words, helper verbs, and trailing punctuation. Responses that failed validation were regenerated.
  • Normalization: Compressed outputs were post-processed to drop trailing punctuation marks and optional English articles (a, an, the) when those edits did not change intent.
  • Splits: The resulting dataset (gravitee-io/dolly-15k-prompt-compression) was filtered to prompts ≤64 tokens, then split 90/10 into train and evaluation subsets. Metadata includes original and compressed token counts, ROUGE scores, compression ratios, and cosine similarity diagnostics (see the loading sketch below).
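
A minimal sketch for inspecting the dataset with the Hugging Face datasets library; the available splits and column names should be confirmed against the dataset card:

from datasets import load_dataset

# Load the compression dataset and print its splits and metadata columns.
ds = load_dataset("gravitee-io/dolly-15k-prompt-compression")
print(ds)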

Training procedure

The model was fine-tuned for five epochs using the Hugging Face Seq2SeqTrainer, starting from Falconsai/text_summarization. Length-aware generation settings (beam search with num_beams=4 and a mild length penalty) and a “no-worse-than-original” post-filter kept nearly all outputs shorter than their sources. Training monitored ROUGE-1/2/L, average compression ratio, and violation rate (the percentage of generations longer than the input). The best checkpoint, selected by lowest validation loss, was pushed to the Hugging Face Hub as gravitee-io/very-small-prompt-compression.

Training hyperparameters

The following hyperparameters were used during training (a sketch mapping them onto Seq2SeqTrainingArguments follows the list):

  • learning_rate: 3e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_ratio: 0.1
  • num_epochs: 5
  • mixed_precision_training: Native AMP
  • label_smoothing_factor: 0.1
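
A hedged sketch of how these values might map onto Seq2SeqTrainingArguments; the output_dir and the generation/evaluation flags are assumptions, not the exact training configuration:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="very-small-prompt-compression",  # placeholder
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch_fused",
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=5,
    fp16=True,                     # Native AMP mixed precision
    label_smoothing_factor=0.1,
    predict_with_generate=True,    # assumption: enables ROUGE computation at eval time
)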

Training results

| Training Loss | Epoch | Step  | Comp Ratio Mean | Comp Ratio P90 | Validation Loss | Pct Violations | Rouge1 | Rouge2 | Rougel | Rougelsum |
|:-------------:|:-----:|:-----:|:---------------:|:--------------:|:---------------:|:--------------:|:------:|:------:|:------:|:---------:|
| 2.8959        | 1.0   | 4044  | 0.7438          | 0.9167         | 2.3356          | 0.0002         | 0.7994 | 0.6082 | 0.7545 | 0.7543    |
| 2.4157        | 2.0   | 8088  | 0.7393          | 0.9119         | 2.2255          | 0.0005         | 0.8121 | 0.6299 | 0.7696 | 0.7696    |
| 2.3277        | 3.0   | 12132 | 0.7391          | 0.9091         | 2.1815          | 0.0004         | 0.8168 | 0.6386 | 0.7757 | 0.7756    |
| 2.2933        | 4.0   | 16176 | 0.7395          | 0.9091         | 2.1669          | 0.0004         | 0.8182 | 0.6416 | 0.7774 | 0.7773    |
| 2.2815        | 5.0   | 20220 | 0.7395          | 0.9091         | 2.1583          | 0.0004         | 0.8190 | 0.6452 | 0.7792 | 0.7792    |

Framework versions

  • Transformers 4.57.1
  • PyTorch 2.8.0+cu126
  • Datasets 4.0.0
  • Tokenizers 0.22.1

Evaluation

  • Held-out compression: On the ≤64 token evaluation split the model reaches a mean compression ratio of 0.7395 (≈26 % token reduction) with only 0.04 % of generations exceeding the original length.
  • Semantic fidelity: Cosine similarity between original and compressed embeddings (text-embedding-3-small) averages above 0.90, indicating that key semantics are preserved (a similarity-check sketch follows this list).
  • Instruction alignment: ROUGE-L of 0.7792 against synthetic targets shows the model closely matches the policy-compliant outputs produced during data generation.
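
A minimal sketch of that similarity check, assuming the OpenAI Python SDK with an OPENAI_API_KEY in the environment; it illustrates the metric rather than reproducing the exact evaluation script:

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def semantic_similarity(original: str, compressed: str) -> float:
    # Embed both prompts with text-embedding-3-small and compute cosine similarity.
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=[original, compressed],
    )
    a = np.array(resp.data[0].embedding)
    b = np.array(resp.data[1].embedding)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))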

License

This model is released under the Apache 2.0 License.

Acknowledgments

Citation

If you use this model in your research, please cite:

@misc{very_small_prompt_compression_2025,
  title={Very Small Prompt Compression Model},
  author={Derek Thompson - Gravitee.io},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/gravitee-io/very-small-prompt-compression}}
}

Contact

For questions, issues, or contributions, please open an issue on the model repository.


Generated by dotslashderek on 2025-10-31
