Website | Paper | GitHub | Cambrian-S Family
Authors: Shusheng Yang*, Jihan Yang*, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-1.5B is a spatially grounded multimodal large language model built for spatial reasoning over video. The authors report state-of-the-art performance on visual-spatial benchmarks alongside competitive performance on general video understanding tasks.
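from PIL import Image
from cambrian.constants import IMAGE_TOKEN_INDEX  # assumption: defined here, as in LLaVA-style codebases
from cambrian.model.builder import load_pretrained_model
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.conversation import conv_templates
model_path = "nyu-visionx/Cambrian-S-1.5B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-1.5b", device_map="cuda")
# Process image/video ("scene.jpg" is a placeholder; helper signatures below are assumed to follow the LLaVA-style convention)
image = Image.open("scene.jpg").convert("RGB")
image_sizes = [image.size]
image_tensor = process_images([image], image_processor, model.config)
# Build the prompt and tokenize it, mapping the <image> placeholder to the image token index
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)
# Generate and decode the answer
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())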
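The usage snippet mentions video input but does not show how frames are chosen. One common approach, assumed here and not confirmed by this model card, is to sample a fixed number of frames evenly across the clip before passing them to the image processor. The helper name `sample_frame_indices` and its parameters are hypothetical and are not part of the `cambrian` package; this is a minimal sketch of the idea only.

```python
def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick up to `num_frames` frame indices spread evenly over a clip.

    Hypothetical helper for illustration; not part of the cambrian package.
    """
    if total_frames <= 0 or num_frames <= 0:
        return []
    # Short clip: keep every frame rather than repeating any.
    if num_frames >= total_frames:
        return list(range(total_frames))
    # A single frame: take the middle of the clip.
    if num_frames == 1:
        return [total_frames // 2]
    # Evenly spaced positions that include the first and last frame.
    step = (total_frames - 1) / (num_frames - 1)
    return [round(i * step) for i in range(num_frames)]


if __name__ == "__main__":
    print(sample_frame_indices(10, 4))
```

The selected indices would then be used to read frames with whichever video reader is available, and the resulting list of images handed to the processing step in place of a single image. How many frames the model expects is not stated here, so `num_frames` is left to the caller.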
@article{yang2025cambrian,
title={Cambrian-S: Towards Spatial Supersensing in Video},
author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and Yang, Zihao and Yu, Yue and Tong, Shengbang and Zheng, Zihan and Xu, Yifan and Wang, Muhan and Lu, Danhao and Fergus, Rob and LeCun, Yann and Fei-Fei, Li and Xie, Saining},
journal={arXiv preprint arXiv:2511.04670},
year={2025}
}