Model Card for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow

Llama-3.2-11B-Vision-Instruct-StarFlow is a vision-language model finetuned for structured workflow generation from sketch images. It translates hand-drawn or computer-generated workflow diagrams into structured JSON workflows, including triggers, flow logic, and actions.

Model Details

Model Description

Llama-3.2-11B-Vision-Instruct-StarFlow is part of the StarFlow framework for automating workflow creation. It extends Meta's Llama-3.2-11B-Vision-Instruct with domain-specific finetuning on workflow diagrams, enabling accurate sketch-to-workflow generation.

  • Developed by: ServiceNow Research
  • Model type: Transformer-based Vision-Language Model (VLM)
  • Language(s) (NLP): English
  • License: llama3.2
  • Finetuned from model : Llama-3.2-11B-Vision-Instruct

Model Sources


Uses

Direct Use

  • Translating sketches of workflows (hand-drawn, whiteboard, or digital diagrams) into JSON structured workflows.
  • Supporting workflow automation in enterprise platforms by removing the need for manual low-code configuration.

Downstream Use

  • Integration into enterprise low-code platforms for rapid prototyping of workflows by users.
  • Used in automation migration pipelines, e.g., converting legacy workflow screenshots into JSON representations.

Out-of-Scope Use

  • General-purpose vision-language tasks (e.g., image captioning, OCR).
  • Use on domains outside workflow automation (e.g., arbitrary diagram-to-code).
  • Real-time handwriting recognition (StarFlow focuses on structured workflow translation, not raw OCR).

Bias, Risks, and Limitations

  • Limited generalization: Finetuned models perform poorly on out-of-distribution diagrams from unfamiliar platforms.
  • Sensitivity to input style: Whiteboard/handwritten sketches degrade performance compared to digital or UI-rendered workflows.
  • Component naming mismatches: Model may mispredict action definitions (e.g., “create_user” vs. “create_a_user”), leading to execution errors.
  • Evaluation gap: Current metrics don’t always reflect execution correctness of generated workflows.

Recommendations

Users should:

  • Validate outputs before deployment.
  • Be cautious with handwritten/ambiguous sketches.
  • Consider supplementing with retrieval-augmented generation (RAG) or tool grounding for robustness.

How to Get Started with the Model

from transformers import AutoProcessor, MllamaForConditionalGeneration
from PIL import Image

processor = AutoProcessor.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
model = MllamaForConditionalGeneration.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")

image = Image.open("workflow_sketch.png")
inputs = processor(images=image, text="Generate workflow JSON", return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=4096)
workflow_json = processor.decode(outputs[0], skip_special_tokens=True)

print(workflow_json)

Training Details

Training Data

The model was trained using the ServiceNow/BigDocs-Sketch2Flow dataset, which includes the following data distribution:

  • Synthetic (12,376 Graphviz-generated diagrams)
  • Manual (3,035 sketches hand-drawn by annotators)
  • Digital (2,613 diagrams drawn using software)
  • Whiteboard (484 sketches drawn on whiteboard / blackboard)
  • User Interface (373 screenshots from ServiceNow Flow Designer)

Training Procedure

Preprocessing

  • Synthetic workflows generated via heuristics (Scheduled Loop, IF/ELSE, FOREACH, etc.).
  • Annotators recreated flows in digital, manual, and whiteboard formats.

Training Hyperparameters

  • Optimizer: AdamW with β=(0.95,0.999), lr=2e-5, weight decay=1e-6.
  • Scheduler: cosine learning rate with 30 warmup steps.
  • Early stopping based on validation loss.
  • Precision: bf16 mixed-precision.
  • Sequence length: up to 32k tokens.

Speeds, Sizes, Times

  • Trained with 16× NVIDIA H100 80GB GPUs across two nodes.
  • Full Sharded Data Parallel (FSDP) training, no CPU offloading.

Evaluation

Testing Data

Same dataset distribution as training: synthetic, manual, digital, whiteboard, UI-rendered workflows.

Factors

  • Source of sample (synthetic, manual, UI, etc.)
  • Orientation (portrait vs. landscape diagrams)
  • Resolution (small <400k pixels, medium, large >1M pixels)

Metrics

All Evaluation metrics can be found in the official StarFlow repo.

  • Flow Similarity (FlowSim) – tree edit distance similarity.
  • TreeBLEU – structural recall of subtrees.
  • Trigger Match (TM) – accuracy of workflow triggers.
  • Component Match (CM) – overlap of predicted vs. gold components.

Results

  • Proprietary models (GPT-4o, Claude-3.7, Gemini 2.0) outperform open-weights without finetuning.

  • Finetuned Pixtral-12B achieves SOTA:

    • FlowSim w/ inputs: 0.919
    • TreeBLEU w/ inputs: 0.950
    • Trigger Match: 0.753
    • Component Match: 0.930

Summary

Finetuning yields large gains over base Pixtral-12B and GPT-4o, particularly in matching workflow components and triggers.

Model Examination

  • Finetuned models capture naming conventions and structured execution logic better.
  • Failure modes include missing ELSE branches or generic table names.

Technical Specifications

Model Architecture and Objective

  • Base: Llama-3.2-11B Vision Instruct, a multimodal LLM with 11 B parameters, optimized for image reasoning and instruction-following tasks.
  • Objective: Image-to-JSON structured workflow generation.

Compute Infrastructure

  • Hardware: 16× NVIDIA H100 80GB (2 nodes)
  • Software: FSDP, bf16 mixed precision, PyTorch/Transformers

Citation

BibTeX:

@article{bechard2025starflow,
  title={StarFlow: Generating Structured Workflow Outputs from Sketch Images},
  author={B{\'e}chard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
  journal={arXiv preprint arXiv:2503.21889},
  year={2025}
}

APA: Béchard, P., Wang, C., Abaskohi, A., Rodriguez, J., Pal, C., Vazquez, D., Gella, S., Rajeswar, S., & Taslakian, P. (2025). StarFlow: Generating Structured Workflow Outputs from Sketch Images. arXiv preprint arXiv:2503.21889.


Glossary

  • FlowSim: Metric based on tree edit distance for workflows.
  • TreeBLEU: BLEU-like score using tree structures.
  • Trigger Match: Correctness of predicted workflow trigger.
  • Component Match: Correctness of predicted components (order-agnostic).

More Information


The StarFlow Team

  • Patrice Béchard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian

Model Card Contact

Downloads last month
26
Safetensors
Model size
10.7B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow

Finetuned
(140)
this model

Dataset used to train ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow