---
language: en
license: apache-2.0
tags:
- vision-language-model
- visual-storytelling
- chain-of-thought
- grounded-text-generation
- cross-frame-consistency
- storytelling
- image-to-text
datasets:
- daniel3303/StoryReasoning
metrics:
- precision
- recall
- bleu
- meteor
- rouge
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-to-text
model-index:
- name: QwenStoryteller
  results:
  - task:
      type: visual-storytelling
      name: Visual Storytelling
    dataset:
      name: StoryReasoning
      type: daniel3303/StoryReasoning
      split: test
    metrics:
    - name: Character Precision
      type: precision
      value: 0.83
    - name: Object Precision
      type: precision
      value: 0.46
    - name: Total Precision
      type: precision
      value: 0.57
    - name: mAP
      type: mean_average_precision
      value: 0.27
    - name: Character Recall
      type: recall
      value: 0.62
    - name: Object Recall
      type: recall
      value: 0.25
    - name: Total Recall
      type: recall
      value: 0.40
    - name: METEOR
      type: meteor
      value: 0.14
    - name: ROUGE-L
      type: rouge-l
      value: 0.16
    - name: BLEU-4
      type: bleu-4
      value: 0.054
    - name: Description Accuracy
      type: accuracy
      value: 2.76
      description: "Rating on a scale of 1-5"
    - name: Average Hallucinations
      type: error_rate
      value: 3.56
      description: "Average number of hallucinations per story"
library_name: transformers
---

# QwenStoryteller

QwenStoryteller is a fine-tuned version of Qwen2.5-VL 7B specialized for grounded visual storytelling with cross-frame consistency. It generates coherent narratives from sequences of images while maintaining character and object identity throughout the story.

## Model Description

**Base Model:** Qwen2.5-VL 7B

**Training Method:** LoRA fine-tuning (rank 2048, alpha 4096)

**Training Dataset:** [StoryReasoning](https://huggingface.co/datasets/daniel3303/StoryReasoning)

QwenStoryteller processes sequences of images to perform:
- End-to-end object detection
- Cross-frame object re-identification
- Landmark detection
- Chain-of-thought reasoning for scene understanding
- Grounded story generation with explicit visual references

The model was fine-tuned on the StoryReasoning dataset using LoRA with a rank of 2048 and an alpha scaling factor of 4096, targeting the self-attention layers of the language component. Training used a peak learning rate of 1×10⁻⁴, a batch size of 32, and a warmup over the first 3% of steps, and ran for 4 epochs with the AdamW optimizer (weight decay 0.01) in bfloat16 precision.
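For reference, the setup described above corresponds roughly to the following `peft`/`transformers` configuration sketch. The target module names, output directory, and batch layout are illustrative assumptions, not the exact training script used for QwenStoryteller.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA applied to the self-attention projections of the language model.
# The module names below are an assumption based on the Qwen2.5 architecture.
lora_config = LoraConfig(
    r=2048,
    lora_alpha=4096,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    task_type="CAUSAL_LM",
)

# Optimization settings reported above; batch size 32 is expressed here as a
# single-device batch for simplicity (the actual run may have used gradient
# accumulation across devices).
training_args = TrainingArguments(
    output_dir="qwen-storyteller-lora",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    num_train_epochs=4,
    warmup_ratio=0.03,
    weight_decay=0.01,
    bf16=True,
    optim="adamw_torch",
)
```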
## System Prompt

The model was trained with the following system prompt, and we recommend using it as-is for inference.

```
You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story.
```

## Key Features

- **Cross-Frame Consistency:** Maintains consistent character and object identity across multiple frames through visual similarity and face recognition techniques
- **Structured Reasoning:** Employs chain-of-thought reasoning to analyze scenes with explicit modeling of characters, objects, settings, and narrative structure
- **Grounded Storytelling:** Uses specialized XML tags to link narrative elements directly to visual entities
- **Reduced Hallucinations:** Achieves 12.3% fewer hallucinations compared to the non-fine-tuned base model

## Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
from PIL import Image

# Load the model
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "daniel3303/QwenStoryteller", torch_dtype="auto", device_map="auto"
)

# Load processor
processor = AutoProcessor.from_pretrained("daniel3303/QwenStoryteller")

# Load images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg"),
    Image.open("image4.jpg"),
    Image.open("image5.jpg")
]

# Create image content list
image_content = []
for img in images:
    image_content.append({
        "type": "image",
        "image": img,
    })

# Add text prompt at the end
image_content.append({"type": "text", "text": "Generate a story based on these images."})

# Create messages with system prompt
messages = [
    {
        "role": "system",
        "content": "You are an AI storyteller that can analyze sequences of images and create creative narratives. First think step-by-step to analyze characters, objects, settings, and narrative structure. Then create a grounded story that maintains consistent character identity and object references across frames. Use <think></think> tags to show your reasoning process before writing the final story."
    },
    {
        "role": "user",
        "content": image_content,
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generation of the output
generated_ids = model.generate(
    **inputs,
    max_new_tokens=4096,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
story = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(story)
```

### Using vLLM for faster inference

For significantly faster inference, you can use vLLM to serve the model. Simply install vLLM and run:

```bash
# Install vLLM
pip install vllm

# Serve the model with vLLM
vllm serve daniel3303/QwenStoryteller
```
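Once the server is running, it exposes an OpenAI-compatible API that can be queried with the `openai` Python client. The sketch below is illustrative: the server address, image paths, and sampling settings are assumptions, and depending on your vLLM version you may need to raise the per-prompt image limit (e.g. via `--limit-mm-per-prompt`) to send five images in one request.

```python
import base64
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint (default address assumed here).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SYSTEM_PROMPT = (
    "You are an AI storyteller that can analyze sequences of images and create creative narratives. "
    "First think step-by-step to analyze characters, objects, settings, and narrative structure. "
    "Then create a grounded story that maintains consistent character identity and object references "
    "across frames. Use <think></think> tags to show your reasoning process before writing the final story."
)

def to_data_url(path: str) -> str:
    # Encode a local image as a base64 data URL the chat API can consume.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

image_paths = ["image1.jpg", "image2.jpg", "image3.jpg", "image4.jpg", "image5.jpg"]
content = [{"type": "image_url", "image_url": {"url": to_data_url(p)}} for p in image_paths]
content.append({"type": "text", "text": "Generate a story based on these images."})

response = client.chat.completions.create(
    model="daniel3303/QwenStoryteller",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": content},
    ],
    max_tokens=4096,
    temperature=0.7,
    top_p=0.9,
)
print(response.choices[0].message.content)
```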
## Output Format

QwenStoryteller produces two main outputs:

1. **Chain-of-Thought Analysis (`<think></think>`):** A structured analysis containing:
   - Character tables with consistent identity references, emotions, actions, and spatial locations
   - Object tables with functions, interactions, and spatial coordinates
   - Setting tables categorizing environmental elements
   - Narrative structure tables modeling story progression

2. **Grounded Story:** A narrative with specialized XML tags linking text to visual elements:
   - `<gdi>`: Image tags for specific frames
   - `<gdo>`: Entity reference tags for character and object mentions
   - `<gda>`: Action tags for character actions
   - `<gdl>`: Location/landmark tags for background elements

## Limitations

- Re-identification relies primarily on object appearance rather than overall context, which can lead to confusion between similar-looking objects or persons
- Movie-derived training data introduces biases from cinematic composition that may not generalize to candid visual sequences
- Low grounding rates for first-person pronouns, as these primarily appear in character dialogue
- May still produce hallucinations, albeit at a reduced rate compared to the base model

## Citation

```
@misc{oliveira2025storyreasoningdatasetusingchainofthought,
      title={StoryReasoning Dataset: Using Chain-of-Thought for Scene Understanding and Grounded Story Generation},
      author={Daniel A. P. Oliveira and David Martins de Matos},
      year={2025},
      eprint={2505.10292},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.10292},
}
```

## Contact

For questions or feedback regarding this model, please contact:
- Daniel A. P. Oliveira (daniel.oliveira@inesc-id.pt)