Cambrian-S-1.5B

Website | Paper | GitHub | Cambrian-S Family

Authors: Shusheng Yang*, Jihan Yang*, Pinzhi Huang†, Ellis Brown†, et al.

Cambrian-S-1.5B is a spatially-grounded multimodal large language model that excels at spatial reasoning in video understanding. It achieves state-of-the-art performance on visual-spatial benchmarks while maintaining competitive performance on general video understanding tasks.

Model Details

  • Architecture: Qwen2.5-1.5B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter (a rough sketch of the adapter follows this list)
  • Parameters: 1.5B language model (~2B total including the vision encoder)
  • Vision Encoder: SigLIP2-SO400M (384×384 input resolution)
  • Training: 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
  • Training Data: VSI-590K (spatial reasoning) plus general video instruction data
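
A minimal sketch of what such a 2-layer MLP adapter looks like, for intuition only. The layer widths, activation, and patch count below are illustrative assumptions rather than values read from the released checkpoint; the point is that SigLIP2 patch features are projected into the language model's embedding space before being interleaved with text tokens.

import torch
import torch.nn as nn

VISION_DIM = 1152   # assumed SigLIP2-SO400M feature width (illustrative)
LLM_DIM = 1536      # assumed Qwen2.5-1.5B hidden size (illustrative)

# Hypothetical 2-layer MLP connector: vision patch features -> LLM embedding space
adapter = nn.Sequential(
    nn.Linear(VISION_DIM, LLM_DIM),
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

patch_features = torch.randn(1, 729, VISION_DIM)   # (batch, num_patches, vision_dim); patch count illustrative
visual_tokens = adapter(patch_features)            # (batch, num_patches, llm_dim), consumed by the LLM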

Usage

import torch
from PIL import Image

from cambrian.constants import IMAGE_TOKEN_INDEX
from cambrian.conversation import conv_templates
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.model.builder import load_pretrained_model

model_path = "nyu-visionx/Cambrian-S-1.5B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-1.5b", device_map="cuda")

# Build the prompt with the Qwen2 conversation template
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Preprocess the image and tokenize the prompt ("example.jpg" is a placeholder path);
# the <image> placeholder is replaced by IMAGE_TOKEN_INDEX
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(model.device)

# Generate and decode the answer
with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, image_sizes=[image.size], do_sample=False, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
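
The snippet above handles a single image. For video, frames are sampled and preprocessed before being passed to the model; the sketch below shows one way to do the sampling step, assuming decord for decoding and a placeholder file path. The exact multi-frame prompt format is defined by the Cambrian-S repository, so consult the GitHub repo for the full video pipeline.

import numpy as np
from decord import VideoReader, cpu   # assumed dependency for video decoding
from PIL import Image

vr = VideoReader("example_video.mp4", ctx=cpu(0))      # placeholder path
idx = np.linspace(0, len(vr) - 1, num=16, dtype=int)   # uniformly sample 16 frames
frames = [Image.fromarray(f) for f in vr.get_batch(idx).asnumpy()]

# Sampled frames go through the same preprocessing as the single-image example
frame_tensor = process_images(frames, image_processor, model.config)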

Citation

@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and Yang, Zihao and Yu, Yue and Tong, Shengbang and Zheng, Zihan and Xu, Yifan and Wang, Muhan and Lu, Danhao and Fergus, Rob and LeCun, Yann and Fei-Fei, Li and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}