PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation


📢 Official Announcement

PyraTok has been officially accepted to CVPR 2026! 🎉
This repository contains the pretrained weights and model implementation for the Language-aligned Pyramidal Tokenizer.


🚀 Overview

PyraTok is a state-of-the-art video tokenizer that bridges the gap between video understanding and generation. Unlike traditional VAEs that operate at a single visual scale, PyraTok introduces a Language-aligned Pyramidal Quantization (LaPQ) module.

Key Innovations:

  • Pyramidal Structure: Learns semantically structured discrete latents across multiple spatiotemporal resolutions.
  • Language Alignment: Tightly couples visual tokens with language using a shared, large binary codebook (up to 48K tokens).
  • Scalability: Robustly scales from standard resolutions to 4K/8K video processing.
  • Unified Backbone: A single model that excels in Video QA, Zero-Shot Segmentation, and high-fidelity Text-to-Video generation.
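
To make the idea of discrete latents at multiple spatiotemporal resolutions more concrete, here is a minimal NumPy sketch. It is not the PyraTok implementation or the LaPQ module: the function names (`avg_pool_3d`, `binary_quantize`, `pyramid_tokens`), the average pooling, the sign-based binarization, the 8-channel latent, and the pooling factors are all assumptions chosen for illustration, and the language-alignment step is left out entirely.

```python
import numpy as np


def avg_pool_3d(x, factor):
    """Average-pool a (T, H, W, C) latent by `factor` along T, H and W."""
    t, h, w, c = x.shape
    assert t % factor == 0 and h % factor == 0 and w % factor == 0
    x = x.reshape(t // factor, factor, h // factor, factor, w // factor, factor, c)
    return x.mean(axis=(1, 3, 5))


def binary_quantize(z):
    """Map each C-dim latent vector to a {-1, +1} code and an integer token id.

    Hypothetical stand-in for a binary codebook lookup: the sign pattern of
    the C channels is read as a C-bit integer, giving 2**C possible tokens.
    """
    bits = (z > 0).astype(np.int64)            # 1 where the channel is positive
    codes = 2.0 * bits - 1.0                   # {-1, +1} binary code
    weights = 1 << np.arange(z.shape[-1])      # bit i has weight 2**i
    indices = (bits * weights).sum(axis=-1)    # token id in [0, 2**C)
    return codes, indices


def pyramid_tokens(latent, factors=(1, 2, 4)):
    """Quantize the same latent at several (assumed) pooling factors."""
    return [binary_quantize(avg_pool_3d(latent, f))[1] for f in factors]


# Toy latent: 4 frames, 8x8 spatial grid, 8 channels (illustrative sizes only).
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8, 8))
tokens = pyramid_tokens(latent)
for factor, ids in zip((1, 2, 4), tokens):
    print(f"pool factor {factor}: token grid {ids.shape}")
```

Each pyramid level yields a coarser grid of integer token ids drawn from the same binary code space, which is the general shape of a shared codebook used across scales. The released model's actual quantizer, codebook size, and scale schedule should be taken from the official implementation.
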
@inproceedings{susladkar2026pyratok,
  title={PyraTok: Language-Aligned Pyramidal Tokenizer for Video Understanding and Generation},
  author={Susladkar, Onkar and Prakash, Tushar and Juvekar, Adheesh and Nguyen, Kiet A. and Jang, Dong-Hwan and Dhillon, Inderjit S. and Lourentzou, Ismini},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}