PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation

📢 Official Announcement

PyraTok has been officially accepted to CVPR 2026! 🎉
This repository contains the pretrained weights and model implementation for the Language-aligned Pyramidal Tokenizer.

🚀 Overview

PyraTok is a state-of-the-art video tokenizer that bridges the gap between video understanding and generation. Unlike traditional VAEs, which operate at a single visual scale, PyraTok introduces a Language-aligned Pyramidal Quantization (LaPQ) module that quantizes latents at multiple spatiotemporal resolutions against a codebook shared with language.

Key Innovations:

Pyramidal Structure: Learns semantically structured discrete latents across multiple spatiotemporal resolutions.
Language Alignment: Tightly couples visual tokens with language using a shared, large binary codebook (up to 48K tokens).
Scalability: Robustly scales from standard resolutions to 4K/8K video processing.
Unified Backbone: A single model that excels in Video QA, Zero-Shot Segmentation, and high-fidelity Text-to-Video generation.
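
To make the pyramidal and binary-codebook ideas above concrete, here is a minimal pure-Python sketch. It is not the PyraTok implementation. It assumes a sign-based binary quantizer (each latent channel becomes one bit, so the toy codebook has 2^D entries, which does not reproduce the 48K figure) and plain average pooling to build coarser levels. The function names `binary_quantize`, `avg_pool`, and `pyramid_tokens` are hypothetical, and language alignment is not modeled at all.

```python
def binary_quantize(vec):
    """Sign-quantize a latent vector to bits and pack them into a token id.

    Assumption for illustration: positive -> 1, otherwise -> 0.
    """
    bits = [1 if x > 0 else 0 for x in vec]
    token_id = 0
    for b in bits:
        token_id = (token_id << 1) | b  # most significant bit first
    return bits, token_id


def avg_pool(grid, ft, fh, fw):
    """Average-pool a [T][H][W][D] nested list by integer factors.

    T, H, W must be divisible by ft, fh, fw respectively.
    """
    T, H, W, D = len(grid), len(grid[0]), len(grid[0][0]), len(grid[0][0][0])
    out = []
    for t in range(0, T, ft):
        plane = []
        for h in range(0, H, fh):
            row = []
            for w in range(0, W, fw):
                cells = [grid[t + dt][h + dh][w + dw]
                         for dt in range(ft)
                         for dh in range(fh)
                         for dw in range(fw)]
                row.append([sum(c[d] for c in cells) / len(cells)
                            for d in range(D)])
            plane.append(row)
        out.append(plane)
    return out


def pyramid_tokens(grid, factors):
    """Return one [T'][H'][W'] grid of token ids per pyramid level."""
    levels = []
    for ft, fh, fw in factors:
        pooled = avg_pool(grid, ft, fh, fw)
        levels.append([[[binary_quantize(v)[1] for v in row]
                        for row in plane]
                       for plane in pooled])
    return levels


# Toy latent grid: T=2, H=4, W=4, D=4 channels with values in {-1, 0, 1}.
grid = [[[[((t + h + w + d) % 3) - 1 for d in range(4)]
          for w in range(4)]
         for h in range(4)]
        for t in range(2)]

# Two levels: full resolution, and 2x coarser in time, height, and width.
levels = pyramid_tokens(grid, [(1, 1, 1), (2, 2, 2)])
```

In this sketch the fine level holds 2x4x4 token ids and the coarse level 1x2x2, all drawn from the same 16-entry binary codebook, which mirrors the idea of one codebook shared across resolutions.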

@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}


Paper: PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation (2601.16210, published Jan 22)