arxiv:2509.23736

HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

Published on Sep 28
Authors:

Abstract

AI-generated summary

HieraTok, a multi-scale Vision Transformer-based tokenizer, enhances image reconstruction and generation by using multi-scale downsampling and scale-causal attention, achieving superior performance compared to single-scale tokenizers.

In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart with a 27.2% improvement in rFID (1.47 → 1.07). When integrated into downstream generation frameworks, it achieves a 1.38× faster convergence rate and an 18.9% improvement in gFID (16.4 → 13.3), which may be attributed to its smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
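
To make the two design ideas in the abstract concrete, below is a minimal PyTorch sketch, not the authors' code: it downsamples an encoder token map into a coarse-to-fine token sequence and builds a scale-causal mask in which a token may only attend to tokens at the same or coarser scales. Grid sizes, the number of scales, pooling choice, and all function names are illustrative assumptions.

```python
# Hypothetical sketch of multi-scale tokens + scale-causal attention masking.
import torch
import torch.nn.functional as F


def multiscale_tokens(token_map: torch.Tensor, num_scales: int = 3):
    """Downsample a (B, C, H, W) token map into a coarse-to-fine token sequence."""
    # Coarsest scale first; kernel 1 keeps the full-resolution map unchanged.
    maps = [F.avg_pool2d(token_map, kernel_size=2 ** s) for s in reversed(range(num_scales))]
    tokens = [m.flatten(2).transpose(1, 2) for m in maps]   # each (B, h*w, C)
    lengths = [t.shape[1] for t in tokens]
    return torch.cat(tokens, dim=1), lengths                 # (B, N_total, C)


def scale_causal_mask(lengths):
    """Boolean mask (True = may attend): token i attends to token j only if
    j belongs to the same or a coarser (lower-resolution) scale than i."""
    scale_id = torch.cat([torch.full((n,), k) for k, n in enumerate(lengths)])
    return scale_id.unsqueeze(1) >= scale_id.unsqueeze(0)    # (N_total, N_total)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 16, 16)             # hypothetical encoder token map
    seq, lens = multiscale_tokens(feats)           # lengths e.g. [16, 64, 256]
    mask = scale_causal_mask(lens)
    attn = torch.nn.MultiheadAttention(64, num_heads=4, batch_first=True)
    out, _ = attn(seq, seq, seq, attn_mask=~mask)  # attn_mask: True = blocked position
    print(seq.shape, mask.shape, out.shape)
```

In this sketch the mask is what makes the attention "scale-causal": information flows from the coarse, globally pooled tokens toward the fine-grained ones, while fine tokens cannot influence coarser scales.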

