arxiv:2510.18279

Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs

Published on Oct 21
Submitted by Yanhong Li on Oct 23

Abstract

AI-generated summary

Rendering text as images reduces token usage for decoder LLMs without compromising performance on tasks like long-context retrieval and document summarization.

Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering a long text input as a single image and providing it directly to the model. This dramatically reduces the number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.

Community

Paper author · Paper submitter

We ask a simple question: If we render long text as a single image and feed it to an off-the-shelf multimodal LLM, can we cut decoder tokens while keeping performance?

Answer: Yes! Even with untuned, general-purpose VLMs (e.g., GPT-4.1-mini, Qwen2.5-VL-72B-Instruct)—not models specialized for OCR—text-as-image consistently reduces decoder tokens by ~½ with no accuracy loss, acting as an implicit compression layer!

Results:

  • RULER S-NIAH (single needle-in-a-haystack): 97–99% accuracy with up to ≈58% fewer decoder tokens.
  • CNN/DailyMail summarization: at matched compression, text-as-image matches or beats token-pruning baselines (Selective-Context, LLMLingua-2).

Takeaway: Rendering context as an image is a drop-in, modality-shifted compression strategy. No finetuning, just fewer decoder tokens!
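To make the drop-in idea concrete, here is a minimal sketch of the pipeline, assuming Pillow for rendering and the OpenAI Python client for the multimodal call. The rendering parameters (default font, fixed width, character-count wrapping), the prompt, and the helper names are illustrative choices, not the paper's exact setup.

```python
# Minimal sketch: render a long context to one image, then ask a question over it.
# Assumes Pillow and the openai package; model name and layout choices are illustrative.
import base64
import io

from PIL import Image, ImageDraw, ImageFont
from openai import OpenAI


def render_text_to_image(text: str, width: int = 1024, font_size: int = 16) -> bytes:
    """Render a long text block onto a single white PNG and return its bytes."""
    # The default bitmap font keeps the sketch dependency-free; a real setup would
    # load a monospace TTF at `font_size` for denser, more legible packing.
    font = ImageFont.load_default()
    chars_per_line = max(1, width // (font_size // 2))  # rough wrap by character count
    lines = [text[i:i + chars_per_line] for i in range(0, len(text), chars_per_line)] or [""]
    line_height = font_size + 4
    img = Image.new("RGB", (width, line_height * len(lines) + 20), "white")
    draw = ImageDraw.Draw(img)
    for row, line in enumerate(lines):
        draw.text((10, 10 + row * line_height), line, fill="black", font=font)
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def ask_over_rendered_context(context: str, question: str, model: str = "gpt-4.1-mini") -> str:
    """Send the rendered context as an image; only the question is sent as text tokens."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    data_url = "data:image/png;base64," + base64.b64encode(render_text_to_image(context)).decode()
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical usage: the long haystack travels as pixels, the needle question as text.
# answer = ask_over_rendered_context(long_document, "What is the magic number mentioned in the text?")
```

The intended effect, per the paper's framing, is that the long context enters through the vision pathway and occupies far fewer decoder tokens than the equivalent raw text, with the short question remaining the only substantial text input.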

Check out our paper for more results and details!

