|
--- |
|
base_model: |
|
- Qwen/Qwen2.5-VL-7B-Instruct |
|
license: apache-2.0 |
|
tags: |
|
- vision-language |
|
- cinematography |
|
- shotbench |
|
pipeline_tag: image-text-to-text |
|
library_name: transformers |
|
--- |
|
|
|
## Model description |
|
|
|
This model is a fine-tuned version of [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct), introduced in the paper [ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models](https://huggingface.co/papers/2506.21356). It was trained with supervised fine-tuning on the largest high-quality dataset for cinematic language understanding to date, and it currently achieves state-of-the-art performance on [ShotBench](https://vchitect.github.io/ShotBench-project/), a comprehensive benchmark for evaluating cinematography understanding in vision-language models.
|
|
|
**Project Page:** [https://vchitect.github.io/ShotBench-project/](https://vchitect.github.io/ShotBench-project/) |
|
|
|
**Code:** [https://github.com/Vchitect/ShotBench](https://github.com/Vchitect/ShotBench) |
|
|
|
### Demo |
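The snippets below assume the `qwen-vl-utils` package and `flash-attn` are installed (package names inferred from the imports; installation details are not covered in this card). If flash-attn is unavailable, the model can be loaded with PyTorch's built-in SDPA attention instead; a minimal sketch, with the rest of each demo unchanged:

```python
# Fallback load without flash-attn (a sketch, not part of the original demo).
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Vchitect/ShotVL-7B",
    device_map="balanced",
    attn_implementation="sdpa",  # instead of "flash_attention_2"
    torch_dtype=torch.bfloat16,
).eval()
```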
|
|
|
**Image** |
|
|
|
```python |
|
|
import torch |
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
|
from qwen_vl_utils import process_vision_info |
|
|
|
device = "cuda" |
|
device_map = "balanced" |
|
dtype = torch.bfloat16 |
|
image_path = "/path/to/image.jpg" |
|
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
"Vchitect/ShotVL-7B", |
|
device_map=device_map, |
|
attn_implementation="flash_attention_2", |
|
torch_dtype=dtype, |
|
).eval() |
|
processor = AutoProcessor.from_pretrained( |
|
"Vchitect/ShotVL-7B", revision="refs/pr/24", use_fast=True, torch_dtype=dtype |
|
) |
|
|
|
msgs = [ |
|
{"role": "system", "content": "You are a helpful assistant."}, |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{"type": "image", "image": image_path}, |
|
{"type": "text", "text": "What's the shot size of this shot?"}, |
|
], |
|
}, |
|
] |
|
|
|
text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True) |
|
image_inputs, video_inputs = process_vision_info(msgs) |
|
inputs = processor( |
|
text=[text], |
|
images=image_inputs, |
|
videos=video_inputs, |
|
padding=True, |
|
return_tensors="pt", |
|
).to(device) |
|
|
|
with torch.inference_mode(): |
|
out_ids = model.generate(**inputs, max_new_tokens=640) |
|
|
|
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)] |
|
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0]) |
|
``` |
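Several images can be scored in one call by batching conversations, following the standard Qwen2.5-VL batching pattern. The sketch below reuses the `model`, `processor`, and `device` created above; the image paths are placeholders.

```python
# Batched inference over several images (a sketch; the paths are placeholders).
image_paths = ["/path/to/shot_a.jpg", "/path/to/shot_b.jpg"]

batch_msgs = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": p},
                {"type": "text", "text": "What's the shot size of this shot?"},
            ],
        },
    ]
    for p in image_paths
]

# One chat-template string per conversation, then a single padded processor call.
texts = [
    processor.apply_chat_template(m, tokenize=False, add_generation_prompt=True)
    for m in batch_msgs
]
image_inputs, video_inputs = process_vision_info(batch_msgs)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(device)

with torch.inference_mode():
    out_ids = model.generate(**inputs, max_new_tokens=64)

trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True))
```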
|
|
|
**Video** |
|
|
|
```python |
|
|
import torch |
|
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor |
|
from qwen_vl_utils import process_vision_info |
|
|
|
device = "cuda" |
|
device_map = "balanced" |
|
dtype = torch.bfloat16 |
|
video_path = "/path/to/video.mp4" |
|
|
|
model = Qwen2_5_VLForConditionalGeneration.from_pretrained( |
|
"Vchitect/ShotVL-7B", |
|
device_map=device_map, |
|
attn_implementation="flash_attention_2", |
|
torch_dtype=dtype, |
|
).eval() |
|
processor = AutoProcessor.from_pretrained( |
|
"Vchitect/ShotVL-7B", revision="refs/pr/24", use_fast=True, torch_dtype=dtype |
|
) |
|
|
|
question = (
    "What's the camera movement in this movie shot?\n"
    "Options:\nA. Boom down\nB. Boom up\nC. Push in\nD. Pull out\n"
    "Please select the most likely answer from the options above.\n"
)
|
msgs = [ |
|
{"role": "system", "content": "You are a helpful assistant."}, |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{"type": "video", "video": video_path, "max_pixels": 360*640, "fps": 12.0}, |
|
{"type": "text", "text": question}, |
|
], |
|
}, |
|
] |
|
|
|
text = processor.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True) |
|
image_inputs, video_inputs = process_vision_info(msgs) |
|
inputs = processor( |
|
text=[text], |
|
images=image_inputs, |
|
videos=video_inputs, |
|
padding=True, |
|
return_tensors="pt", |
|
).to(device) |
|
|
|
with torch.inference_mode(): |
|
out_ids = model.generate(**inputs, max_new_tokens=640) |
|
|
|
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)] |
|
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0]) |
|
``` |
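In the video message, `max_pixels` and `fps` follow qwen-vl-utils conventions for capping frame resolution and sampling rate. For ShotBench-style multiple-choice prompts like the one above, the generated text can be reduced to an option letter with simple post-processing; a minimal sketch (the regex heuristic is an assumption, not the official evaluation parser):

```python
# Map the free-form answer to an option letter (heuristic sketch).
import re

answer_text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
match = re.search(r"\b([A-D])\b", answer_text)  # first standalone option letter, if any
predicted_option = match.group(1) if match else None
print(predicted_option, "|", answer_text)
```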
|
|
|
## Evaluation Results |
|
|
|
<div align="center"> |
|
<table> |
|
<caption> |
|
<small> |
|
Abbreviations: |
|
SS = <em>Shot Size</em>, |
|
SF = <em>Shot Framing</em>, |
|
CA = <em>Camera Angle</em>, |
|
LS = <em>Lens Size</em>, |
|
LT = <em>Lighting Type</em>, |
|
LC = <em>Lighting Conditions</em>, |
|
SC = <em>Shot Composition</em>, |
|
CM = <em>Camera Movement</em>. |
|
<u>Underline</u> marks previous best in each group.<br> |
|
<strong>Our <em>ShotVL</em> models establish new SOTA.</strong> |
|
</small> |
|
</caption><thead> |
|
<tr> |
|
<th>Models</th><th>SS</th><th>SF</th><th>CA</th><th>LS</th><th>LT</th> |
|
<th>LC</th><th>SC</th><th>CM</th><th>Avg</th> |
|
</tr> |
|
</thead><tbody> |
|
<tr><th colspan="10"><em>Open-Sourced VLMs</em></th></tr> |
|
<tr><td>Qwen2.5-VL-3B-Instruct</td><td>54.6</td><td>56.6</td><td>43.1</td><td>36.6</td><td>59.3</td><td>45.1</td><td>41.5</td><td>31.9</td><td>46.1</td></tr> |
|
<tr><td>Qwen2.5-VL-7B-Instruct</td><td>69.1</td><td>73.5</td><td>53.2</td><td>47.0</td><td>60.5</td><td>47.4</td><td>49.9</td><td>30.2</td><td>53.8</td></tr> |
|
<tr><td>LLaVA-NeXT-Video-7B</td><td>35.9</td><td>37.1</td><td>32.5</td><td>27.8</td><td>50.9</td><td>31.7</td><td>28.0</td><td>31.3</td><td>34.4</td></tr> |
|
<tr><td>LLaVA-Video-7B-Qwen2</td><td>56.9</td><td>65.4</td><td>45.1</td><td>36.0</td><td>63.5</td><td>45.4</td><td>37.4</td><td>35.3</td><td>48.1</td></tr> |
|
<tr><td>LLaVA-Onevision-Qwen2-7B-Ov-Chat</td><td>58.4</td><td>71.0</td><td>52.3</td><td>38.7</td><td>59.5</td><td>44.9</td><td>50.9</td><td>39.7</td><td>51.9</td></tr> |
|
<tr><td>InternVL2.5-8B</td><td>56.3</td><td>70.3</td><td>50.8</td><td>41.1</td><td>60.2</td><td>45.1</td><td>50.1</td><td>33.6</td><td>50.9</td></tr> |
|
<tr><td>InternVL3-2B</td><td>56.3</td><td>56.0</td><td>44.4</td><td>34.6</td><td>56.8</td><td>44.6</td><td>43.0</td><td>38.1</td><td>46.7</td></tr> |
|
<tr><td>InternVL3-8B</td><td>62.1</td><td>65.8</td><td>46.8</td><td>42.9</td><td>58.0</td><td>44.3</td><td>46.8</td><td>44.2</td><td>51.4</td></tr> |
|
<tr><td>InternVL3-14B</td><td>59.6</td><td>82.2</td><td>55.4</td><td>40.7</td><td>61.7</td><td>44.6</td><td>51.1</td><td>38.2</td><td>54.2</td></tr> |
|
<tr><td>Internlm-xcomposer2d5-7B</td><td>51.1</td><td>71.0</td><td>39.8</td><td>32.7</td><td>59.3</td><td>35.7</td><td>35.7</td><td>38.8</td><td>45.5</td></tr> |
|
<tr><td>Ovis2-8B</td><td>35.9</td><td>37.1</td><td>32.5</td><td>27.8</td><td>50.9</td><td>31.7</td><td>28.0</td><td>35.3</td><td>34.9</td></tr> |
|
<tr><td>VILA1.5-3B</td><td>33.4</td><td>44.9</td><td>32.1</td><td>28.6</td><td>50.6</td><td>35.7</td><td>28.4</td><td>21.5</td><td>34.4</td></tr> |
|
<tr><td>VILA1.5-8B</td><td>40.6</td><td>44.5</td><td>39.1</td><td>29.7</td><td>48.9</td><td>32.9</td><td>34.4</td><td>36.9</td><td>38.4</td></tr> |
|
<tr><td>VILA1.5-13B</td><td>36.7</td><td>54.6</td><td>40.7</td><td>34.8</td><td>52.8</td><td>35.4</td><td>34.2</td><td>31.3</td><td>40.1</td></tr> |
|
<tr><td>Instructblip-vicuna-7B</td><td>27.0</td><td>27.9</td><td>34.5</td><td>29.4</td><td>44.4</td><td>29.7</td><td>27.1</td><td>25.0</td><td>30.6</td></tr> |
|
<tr><td>Instructblip-vicuna-13B</td><td>26.8</td><td>29.2</td><td>27.9</td><td>28.0</td><td>39.0</td><td>24.0</td><td>27.1</td><td>22.0</td><td>28.0</td></tr> |
|
<tr><td>InternVL2.5-38B</td><td>67.8</td><td><u>85.4</u></td><td>55.4</td><td>41.7</td><td>61.7</td><td>48.9</td><td>52.4</td><td>44.0</td><td>57.2</td></tr> |
|
<tr><td>InternVL3-38B</td><td>68.0</td><td>84.0</td><td>51.9</td><td>43.6</td><td>64.4</td><td>46.9</td><td>54.7</td><td>44.6</td><td>57.3</td></tr> |
|
<tr><td>Qwen2.5-VL-32B-Instruct</td><td>62.3</td><td>76.6</td><td>51.0</td><td>48.3</td><td>61.7</td><td>44.0</td><td>52.2</td><td>43.8</td><td>55.0</td></tr> |
|
<tr><td>Qwen2.5-VL-72B-Instruct</td><td><u>75.1</u></td><td>82.9</td><td>56.7</td><td>46.8</td><td>59.0</td><td><u>49.4</u></td><td>54.1</td><td><u>48.9</u></td><td>59.1</td></tr> |
|
<tr><td>InternVL3-78B</td><td>69.7</td><td>80.0</td><td>54.5</td><td>44.0</td><td><u>65.5</u></td><td>47.4</td><td>51.8</td><td>44.4</td><td>57.2</td></tr> |
|
<tr><th colspan="10"><em>Proprietary VLMs</em></th></tr> |
|
<tr><td>Gemini-2.0-flash</td><td>48.9</td><td>75.5</td><td>44.6</td><td>31.9</td><td>62.2</td><td>48.9</td><td>52.4</td><td>47.4</td><td>51.5</td></tr> |
|
<tr><td>Gemini-2.5-flash-preview-04-17</td><td>57.7</td><td>82.9</td><td>51.4</td><td>43.8</td><td>65.2</td><td>45.7</td><td>45.9</td><td>43.5</td><td>54.5</td></tr> |
|
<tr><td>GPT-4o</td><td>69.3</td><td>83.1</td><td><u>58.2</u></td><td><u>48.9</u></td><td>63.2</td><td>48.0</td><td><u>55.2</u></td><td>48.3</td><td><u>59.3</u></td></tr> |
|
<tr><th colspan="10"><em>Ours</em></th></tr> |
|
<tr> |
|
<td>ShotVL-3B |
|
<a href="https://huggingface.co/Vchitect/ShotVL-3B"> |
|
<img src="https://img.shields.io/badge/Model-HF-yellow?logo=huggingface" alt="HF"> |
|
</a> |
|
</td> |
|
<td>77.9</td><td>85.6</td><td>68.8</td><td>59.3</td><td>65.7</td> |
|
<td>53.1</td><td>57.4</td><td>51.7</td><td>65.1</td> |
|
</tr> |
|
<tr> |
|
<td>ShotVL-7B |
|
<a href="https://huggingface.co/Vchitect/ShotVL-7B"> |
|
<img src="https://img.shields.io/badge/Model-HF-yellow?logo=huggingface" alt="HF"> |
|
</a> |
|
</td> |
|
<td>81.2</td><td>90.1</td><td>78.0</td><td>68.5</td><td>70.1</td> |
|
<td>64.3</td><td>45.7</td><td>62.9</td><td>70.1</td> |
|
</tr> </tbody> |
|
</table></div> |
|
|
|
## BibTeX |
|
|
|
```bibtex
@misc{liu2025shotbench,
      title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
      author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
      year={2025},
      eprint={2506.21356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21356},
}
|
``` |