Clarifying Prior Research on Visual Compression of Textual Contexts

#18
opened by Awiny

We truly appreciate the efforts behind DeepSeek-OCR and its exploration of optical 2D mapping for long-context compression. However, the technical report currently overlooks a substantial body of prior research that has already investigated “visual processing or compression of textual contexts.”

Prior works on visualizing or rendering text for language modeling:

Language Modelling with Pixels (Rust et al., ICLR 2023)
CLIPPO: Image-and-Language Understanding from Pixels Only (Tschannen et al., Google Research, CVPR 2023)
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding (Lee et al., ICML 2023)
Improving Language Understanding from Screenshots (Gao et al., arXiv 2024)

Prior works on visual compression of long textual contexts:

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning (Wang et al., NeurIPS 2024) — proposed a vision-centric token compression method using a visual encoder to handle long text sequences more efficiently.

Vision-Centric Token Compression in Large Language Model (Xin et al., NeurIPS 2025) — further established the optical/vision-based compression principle, inspired by how humans visually skim over less important information.

These studies systematically explored using vision-based representations to compress long textual contexts well before DeepSeek-OCR.
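
For readers less familiar with this line of work, here is a minimal, self-contained sketch of the idea these papers share: render long text as an image, patchify it as a vision encoder would, and compress the patch tokens so the language model sees far fewer visual tokens than text tokens. Everything in it (the sample sentence, the glyph-cell size, the 16x16 patch size, and the 16x token-merging factor) is an illustrative assumption, not the setting of any cited paper or of DeepSeek-OCR; the printed ratio depends entirely on those assumptions.

```python
# Illustrative sketch only: render text as pixels, patchify, and merge
# patch tokens. All numbers below are assumptions for illustration.
import textwrap
from PIL import Image, ImageDraw

long_text = ("Optical or pixel-based context compression renders long text "
             "as an image and lets a vision encoder summarize it. ") * 60

# Rough proxy for the text-token count (real tokenizers give ~1-1.3 tokens/word).
n_text_tokens = len(long_text.split())

# Render the text into a single page-like image with PIL's default font.
chars_per_line = 160
char_w, char_h = 6, 12                      # assumed glyph cell in pixels
lines = textwrap.wrap(long_text, width=chars_per_line)
img = Image.new("L", (chars_per_line * char_w, len(lines) * char_h), color=255)
draw = ImageDraw.Draw(img)
for i, line in enumerate(lines):
    draw.text((0, i * char_h), line, fill=0)

# Patchify as a ViT would, then assume a 16x token-merging/compression stage
# (the cited works use different learned mechanisms and factors).
patch = 16
n_patches = (img.width // patch) * (img.height // patch)
n_visual_tokens = max(1, n_patches // 16)

print(f"text tokens (approx.):    {n_text_tokens}")
print(f"raw image patches:        {n_patches}")
print(f"compressed visual tokens: {n_visual_tokens}")
print(f"effective compression:    {n_text_tokens / n_visual_tokens:.1f}x")
```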

For academic rigor and fair attribution, it would be appropriate for DeepSeek-OCR to acknowledge and cite these foundational works, as the concept of “processing or compressing text via visual representations” has been previously introduced and validated.

(Both of our NeurIPS papers were publicly available on arXiv in June 2024 and February 2025, respectively; they are easily searchable with keywords such as “Compress Text with Visual Tokens”.)
