---
title: Qwen2.5-VL | 📔 Storyteller v2
emoji: 📚
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 5.30.0
app_file: app.py
pinned: true
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
license: apache-2.0
datasets:
- daniel3303/StoryReasoning
models:
- daniel3303/QwenStoryteller2
- daniel3303/QwenStoryteller
pipeline_tag: image-to-text
language:
- en
- zh
---
# QwenStoryteller
This HF Space is a simple implementation of [2505.10292](https://arxiv.org/abs/2505.10292) by Daniel A. P. Oliveira and David Martins de Matos; the BibTeX citation is provided below. The Space was created as a proof of concept, and all credit for the underlying work goes to Daniel and David.
QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency, capable of generating coherent narratives from multiple images while maintaining character and object identity throughout the story.
## Model Description
- **Base Model:** Qwen2.5-VL 7B
- **Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)
- **Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)
QwenStoryteller processes sequences of images to perform:
- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references
The model was fine-tuned on the StoryReasoning dataset using LoRA with rank 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language components. Training ran for 4 epochs with a peak learning rate of 1×10⁻⁴, a batch size of 32, warmup over the first 3% of steps, the AdamW optimizer with weight decay 0.01, and bfloat16 precision.
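For reference, these reported hyperparameters map onto a `peft`/`transformers` setup roughly as follows. This is a minimal sketch, not the released training code; in particular, the `target_modules` names are an assumption based on Qwen2.5-VL's module naming, and the batch size of 32 may in practice combine per-device batch size with gradient accumulation.
```python
from peft import LoraConfig
from transformers import TrainingArguments

# Reported LoRA settings: rank 2048, alpha 4096, applied to the
# self-attention layers of the language components. The module names
# below are an assumption, not taken from the released training code.
lora_config = LoraConfig(
    r=2048,
    lora_alpha=4096,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Reported schedule: peak LR 1e-4, 3% warmup, 4 epochs,
# AdamW with weight decay 0.01, bfloat16 precision.
training_args = TrainingArguments(
    output_dir="qwen-storyteller-lora",  # hypothetical output path
    learning_rate=1e-4,
    warmup_ratio=0.03,
    num_train_epochs=4,
    per_device_train_batch_size=32,
    optim="adamw_torch",
    weight_decay=0.01,
    bf16=True,
)
```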
## System Prompt
The model was trained with the following system prompt, and we recommend using it unchanged for inference.
```
You are an AI storyteller that can analyze sequences of images and create creative narratives.
First think step-by-step to analyze characters, objects, settings, and narrative structure.
Then create a grounded story that maintains consistent character identity and object references across frames.
Use <think></think> tags to show your reasoning process before writing the final story.
```
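For inference outside this Space, a minimal sketch following the standard Qwen2.5-VL usage pattern with `transformers` and `qwen_vl_utils` is shown below; the frame paths, user prompt, and generation settings are placeholders.
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references across frames. "
    "Use <think></think> tags to show your reasoning process before writing the final story."
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# Placeholder frame paths; replace with your own image sequence.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "frame1.jpg"},
            {"type": "image", "image": "frame2.jpg"},
            {"type": "text", "text": "Generate a story based on these images."},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=2048)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
story = processor.batch_decode(trimmed, skip_special_tokens=True)[0]
print(story)
```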
## Key Features
- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning, emitted between `<think></think>` tags, to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure (see the parsing sketch after this list)
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model
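Because the story is preceded by the `<think></think>` reasoning trace requested in the system prompt, downstream code usually wants to separate the two. A minimal sketch (the helper name is ours, not part of the model's API):
```python
import re

def split_reasoning(output: str) -> tuple[str, str]:
    """Split a model response into (reasoning, story) using the
    <think></think> convention from the system prompt."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        return "", output.strip()
    return match.group(1).strip(), output[match.end():].strip()

reasoning, story = split_reasoning(
    "<think>Frame 1 introduces a hiker ...</think>Morning light swept the trail ..."
)
```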
## Citation
```bibtex
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
author={Daniel A. P. Oliveira and David Martins de Matos},
year={2025},
eprint={2505.10292},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.10292},
}
```