|
--- |
|
language: |
|
- en |
|
library_name: transformers |
|
pipeline_tag: image-text-to-text |
|
license: llama3.2 |
|
datasets: |
|
- ServiceNow/BigDocs-Sketch2Flow |
|
base_model: |
|
- meta-llama/Llama-3.2-11B-Vision-Instruct |
|
--- |
|
# Model Card for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow |
|
|
|
Llama-3.2-11B-Vision-Instruct-StarFlow is a vision-language model finetuned for **structured workflow generation from sketch images**. It translates hand-drawn or computer-generated workflow diagrams into structured JSON workflows, including triggers, flow logic, and actions. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
Llama-3.2-11B-Vision-Instruct-StarFlow is part of the **StarFlow** framework for automating workflow creation. It extends Meta's Llama-3.2-11B-Vision-Instruct with domain-specific finetuning on workflow diagrams, enabling accurate sketch-to-workflow generation. |
|
|
|
* **Developed by:** ServiceNow Research |
|
* **Model type:** Transformer-based Vision-Language Model (VLM) |
|
* **Language(s) (NLP):** English |
|
* **License:** [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt) |
|
* **Finetuned from model :** [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) |
|
|
|
### Model Sources |
|
|
|
* **Repository:** [ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow](https://huggingface.co/ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow) |
|
* **Paper:** [StarFlow: Generating Structured Workflow Outputs From Sketch Images](https://arxiv.org/abs/2503.21889); |
|
|
|
--- |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
* Translating **sketches of workflows** (hand-drawn, whiteboard, or digital diagrams) into **JSON structured workflows**. |
|
* Supporting **workflow automation** in enterprise platforms by removing the need for manual low-code configuration. |
|
|
|
### Downstream Use |
|
|
|
* Integration into **enterprise low-code platforms** for rapid prototyping of workflows by users. |
|
* Used in **automation migration pipelines**, e.g., converting legacy workflow screenshots into JSON representations. |
|
|
|
### Out-of-Scope Use |
|
|
|
* General-purpose vision-language tasks (e.g., image captioning, OCR). |
|
* Use on domains outside workflow automation (e.g., arbitrary diagram-to-code). |
|
* Real-time handwriting recognition (StarFlow focuses on structured workflow translation, not raw OCR). |
|
|
|
--- |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
* **Limited generalization**: Finetuned models perform poorly on out-of-distribution diagrams from unfamiliar platforms. |
|
* **Sensitivity to input style**: Whiteboard/handwritten sketches degrade performance compared to digital or UI-rendered workflows. |
|
* **Component naming mismatches**: Model may mispredict action definitions (e.g., “create\_user” vs. “create\_a\_user”), leading to execution errors. |
|
* **Evaluation gap**: Current metrics don’t always reflect execution correctness of generated workflows. |
|
|
|
### Recommendations |
|
|
|
Users should: |
|
|
|
* Validate outputs before deployment. |
|
* Be cautious with **handwritten/ambiguous sketches**. |
|
* Consider supplementing with **retrieval-augmented generation (RAG)** or **tool grounding** for robustness. |
|
|
|
--- |
|
|
|
## How to Get Started with the Model |
|
|
|
```python |
|
from transformers import AutoProcessor, MllamaForConditionalGeneration |
|
from PIL import Image |
|
|
|
processor = AutoProcessor.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow") |
|
model = MllamaForConditionalGeneration.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow") |
|
|
|
image = Image.open("workflow_sketch.png") |
|
inputs = processor(images=image, text="Generate workflow JSON", return_tensors="pt") |
|
|
|
outputs = model.generate(**inputs, max_new_tokens=4096) |
|
workflow_json = processor.decode(outputs[0], skip_special_tokens=True) |
|
|
|
print(workflow_json) |
|
``` |
|
|
|
--- |
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The model was trained using the [ServiceNow/BigDocs-Sketch2Flow](https://huggingface.co/datasets/ServiceNow/BigDocs-Sketch2Flow) dataset, which includes the following data distribution: |
|
|
|
* **Synthetic** (12,376 Graphviz-generated diagrams) |
|
* **Manual** (3,035 sketches hand-drawn by annotators) |
|
* **Digital** (2,613 diagrams drawn using software) |
|
* **Whiteboard** (484 sketches drawn on whiteboard / blackboard) |
|
* **User Interface** (373 screenshots from ServiceNow Flow Designer) |
|
|
|
### Training Procedure |
|
|
|
#### Preprocessing |
|
|
|
* Synthetic workflows generated via **heuristics** (Scheduled Loop, IF/ELSE, FOREACH, etc.). |
|
* Annotators recreated flows in digital, manual, and whiteboard formats. |
|
|
|
#### Training Hyperparameters |
|
|
|
* Optimizer: **AdamW** with β=(0.95,0.999), lr=2e-5, weight decay=1e-6. |
|
* Scheduler: **cosine learning rate** with 30 warmup steps. |
|
* Early stopping based on validation loss. |
|
* Precision: **bf16 mixed-precision**. |
|
* Sequence length: up to **32k tokens**. |
|
|
|
#### Speeds, Sizes, Times |
|
|
|
* Trained with **16× NVIDIA H100 80GB GPUs** across two nodes. |
|
* Full Sharded Data Parallel (FSDP) training, no CPU offloading. |
|
|
|
--- |
|
|
|
## Evaluation |
|
|
|
### Testing Data |
|
|
|
Same dataset distribution as training: synthetic, manual, digital, whiteboard, UI-rendered workflows. |
|
|
|
### Factors |
|
|
|
* **Source of sample** (synthetic, manual, UI, etc.) |
|
* **Orientation** (portrait vs. landscape diagrams) |
|
* **Resolution** (small <400k pixels, medium, large >1M pixels) |
|
|
|
### Metrics |
|
|
|
All Evaluation metrics can be found in the official [StarFlow repo](https://github.com/ServiceNow/StarFlow). |
|
|
|
* **Flow Similarity (FlowSim)** – tree edit distance similarity. |
|
* **TreeBLEU** – structural recall of subtrees. |
|
* **Trigger Match (TM)** – accuracy of workflow triggers. |
|
* **Component Match (CM)** – overlap of predicted vs. gold components. |
|
|
|
### Results |
|
|
|
* Proprietary models (GPT-4o, Claude-3.7, Gemini 2.0) outperform open-weights **without finetuning**. |
|
* **Finetuned Pixtral-12B achieves SOTA**: |
|
|
|
* FlowSim w/ inputs: **0.919** |
|
* TreeBLEU w/ inputs: **0.950** |
|
* Trigger Match: **0.753** |
|
* Component Match: **0.930** |
|
|
|
#### Summary |
|
|
|
Finetuning yields **large gains over base Pixtral-12B and GPT-4o**, particularly in matching workflow components and triggers. |
|
|
|
## Model Examination |
|
|
|
* Finetuned models capture **naming conventions** and structured execution logic better. |
|
* Failure modes include **missing ELSE branches** or **generic table names**. |
|
|
|
--- |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
* Base: **Llama-3.2-11B Vision Instruct**, a multimodal LLM with 11 B parameters, optimized for image reasoning and instruction-following tasks. |
|
* Objective: **Image-to-JSON structured workflow generation**. |
|
|
|
### Compute Infrastructure |
|
|
|
* **Hardware:** 16× NVIDIA H100 80GB (2 nodes) |
|
* **Software:** FSDP, bf16 mixed precision, PyTorch/Transformers |
|
|
|
--- |
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex |
|
@article{bechard2025starflow, |
|
title={StarFlow: Generating Structured Workflow Outputs from Sketch Images}, |
|
author={B{\'e}chard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz}, |
|
journal={arXiv preprint arXiv:2503.21889}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
**APA:** |
|
Béchard, P., Wang, C., Abaskohi, A., Rodriguez, J., Pal, C., Vazquez, D., Gella, S., Rajeswar, S., & Taslakian, P. (2025). **StarFlow: Generating Structured Workflow Outputs from Sketch Images**. *arXiv preprint arXiv:2503.21889*. |
|
|
|
--- |
|
|
|
## Glossary |
|
|
|
* **FlowSim**: Metric based on tree edit distance for workflows. |
|
* **TreeBLEU**: BLEU-like score using tree structures. |
|
* **Trigger Match**: Correctness of predicted workflow trigger. |
|
* **Component Match**: Correctness of predicted components (order-agnostic). |
|
|
|
--- |
|
|
|
## More Information |
|
|
|
* [ServiceNow Flow Designer](https://www.servicenow.com/products/platform-flow-designer.html) |
|
* [StarFlow Blog](https://www.servicenow.com/blogs/2025/starflow-ai-turns-sketches-into-workflows) |
|
|
|
--- |
|
|
|
## The StarFlow Team |
|
|
|
* Patrice Béchard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian |
|
|
|
--- |
|
|
|
## Model Card Contact |
|
|
|
* Patrice Bechard - [patrice.bechard@servicenow.com](mailto:patrice.bechard@servicenow.com) |
|
* ServiceNow Research – [research.servicenow.com](https://research.servicenow.com) |
|
|