DeltaTok
GitHub: https://github.com/amazon-far/deltatok
DeltaTok is a video tokenizer that encodes the difference between the vision foundation model (VFM) features of consecutive frames into a single continuous "delta" token, as introduced in "A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens" (CVPR 2026 Highlight). Representing each frame transition with one token greatly reduces the number of tokens in a video sequence (e.g., a 1,024x reduction) and enables efficient generative world modeling.
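As a rough illustration of where a figure like 1,024x can come from, the sketch below counts the patch tokens a ViT produces for one 512x512 frame and compares that with a single delta token. The patch size of 16 is an assumption (it is not stated in this card), and `tokens_per_frame` is a hypothetical helper, not part of the DeltaTok code.

```python
def tokens_per_frame(resolution: int, patch_size: int) -> int:
    """Number of patch tokens a ViT produces for a square frame."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    side = resolution // patch_size  # patches along one edge
    return side * side


# Assumption: the backbone uses 16x16 patches; 512x512 is the resolution stated in this card.
patch_tokens = tokens_per_frame(resolution=512, patch_size=16)

# DeltaTok represents each frame with a single delta token.
delta_tokens = 1
reduction = patch_tokens // delta_tokens

print(f"{patch_tokens} patch tokens vs {delta_tokens} delta token: {reduction}x reduction")
```

Under that assumption, a 32x32 patch grid gives 1,024 tokens per frame, which matches the reduction quoted above.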
This repository contains the ViT-B encoder and decoder trained on Kinetics-700 at 512x512 resolution.
Running the model requires a frozen DINOv3 ViT-B backbone. Full training and evaluation code is available in the DeltaTok GitHub repository. To evaluate this checkpoint, run the following from a checkout of that repository:
python main.py validate -c configs/deltatok_vitb_dinov3_vitb_kinetics.yaml \
--model.ckpt_path=path/to/deltatok-kinetics/pytorch_model.bin
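To script the same evaluation (for example, over several checkpoint paths), the command above can be assembled as an argument list. This is a minimal sketch: `build_validate_command` is a hypothetical helper, and only the flags shown in the command above are used.

```python
import shlex
import sys


def build_validate_command(config_path: str, ckpt_path: str, python: str = sys.executable) -> list:
    """Build the argument list for the validate command shown above."""
    return [
        python,
        "main.py",
        "validate",
        "-c",
        config_path,
        f"--model.ckpt_path={ckpt_path}",
    ]


cmd = build_validate_command(
    "configs/deltatok_vitb_dinov3_vitb_kinetics.yaml",
    "path/to/deltatok-kinetics/pytorch_model.bin",  # placeholder path, as in the command above
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))

# To actually run it from the repository checkout:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a checkpoint path contains spaces.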
@inproceedings{kerssies2026deltatok,
title = {A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens},
author = {Kerssies, Tommie and Berton, Gabriele and He, Ju and Yu, Qihang and Ma, Wufei and de Geus, Daan and Dubbelman, Gijs and Chen, Liang-Chieh},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2026}
}