Update README.md

d47e7d1 verified 5 days ago

8.16 kB

	---
	language:
	- en
	library_name: transformers
	pipeline_tag: image-text-to-text
	license: llama3.2
	datasets:
	- ServiceNow/BigDocs-Sketch2Flow
	base_model:
	- meta-llama/Llama-3.2-11B-Vision-Instruct
	---
	# Model Card for ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow

	Llama-3.2-11B-Vision-Instruct-StarFlow is a vision-language model finetuned for structured workflow generation from sketch images. It translates hand-drawn or computer-generated workflow diagrams into structured JSON workflows, including triggers, flow logic, and actions.

	## Model Details

	### Model Description

	Llama-3.2-11B-Vision-Instruct-StarFlow is part of the StarFlow framework for automating workflow creation. It extends Meta's Llama-3.2-11B-Vision-Instruct with domain-specific finetuning on workflow diagrams, enabling accurate sketch-to-workflow generation.

	* Developed by: ServiceNow Research
	* Model type: Transformer-based Vision-Language Model (VLM)
	* Language(s) (NLP): English
	* License: [llama3.2](https://huggingface.co/meta-llama/Llama-3.2-1B/blob/main/LICENSE.txt)
	* Finetuned from model : [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)

	### Model Sources

	* Repository: [ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow](https://huggingface.co/ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow)
	* Paper: [StarFlow: Generating Structured Workflow Outputs From Sketch Images](https://arxiv.org/abs/2503.21889);

	---

	## Uses

	### Direct Use

	* Translating sketches of workflows (hand-drawn, whiteboard, or digital diagrams) into JSON structured workflows.
	* Supporting workflow automation in enterprise platforms by removing the need for manual low-code configuration.

	### Downstream Use

	* Integration into enterprise low-code platforms for rapid prototyping of workflows by users.
	* Used in automation migration pipelines, e.g., converting legacy workflow screenshots into JSON representations.

	### Out-of-Scope Use

	* General-purpose vision-language tasks (e.g., image captioning, OCR).
	* Use on domains outside workflow automation (e.g., arbitrary diagram-to-code).
	* Real-time handwriting recognition (StarFlow focuses on structured workflow translation, not raw OCR).

	---

	## Bias, Risks, and Limitations

	* Limited generalization: Finetuned models perform poorly on out-of-distribution diagrams from unfamiliar platforms.
	* Sensitivity to input style: Whiteboard/handwritten sketches degrade performance compared to digital or UI-rendered workflows.
	* Component naming mismatches: Model may mispredict action definitions (e.g., “create\_user” vs. “create\_a\_user”), leading to execution errors.
	* Evaluation gap: Current metrics don’t always reflect execution correctness of generated workflows.

	### Recommendations

	Users should:

	* Validate outputs before deployment.
	* Be cautious with handwritten/ambiguous sketches.
	* Consider supplementing with retrieval-augmented generation (RAG) or tool grounding for robustness.

	---

	## How to Get Started with the Model

	```python
	from transformers import AutoProcessor, MllamaForConditionalGeneration
	from PIL import Image

	processor = AutoProcessor.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")
	model = MllamaForConditionalGeneration.from_pretrained("ServiceNow/Llama-3.2-11B-Vision-Instruct-StarFlow")

	image = Image.open("workflow_sketch.png")
	inputs = processor(images=image, text="Generate workflow JSON", return_tensors="pt")

	outputs = model.generate(**inputs, max_new_tokens=4096)
	workflow_json = processor.decode(outputs[0], skip_special_tokens=True)

	print(workflow_json)
	```

	---

	## Training Details

	### Training Data

	The model was trained using the [ServiceNow/BigDocs-Sketch2Flow](https://huggingface.co/datasets/ServiceNow/BigDocs-Sketch2Flow) dataset, which includes the following data distribution:

	* Synthetic (12,376 Graphviz-generated diagrams)
	* Manual (3,035 sketches hand-drawn by annotators)
	* Digital (2,613 diagrams drawn using software)
	* Whiteboard (484 sketches drawn on whiteboard / blackboard)
	* User Interface (373 screenshots from ServiceNow Flow Designer)

	### Training Procedure

	#### Preprocessing

	* Synthetic workflows generated via heuristics (Scheduled Loop, IF/ELSE, FOREACH, etc.).
	* Annotators recreated flows in digital, manual, and whiteboard formats.

	#### Training Hyperparameters

	* Optimizer: AdamW with β=(0.95,0.999), lr=2e-5, weight decay=1e-6.
	* Scheduler: cosine learning rate with 30 warmup steps.
	* Early stopping based on validation loss.
	* Precision: bf16 mixed-precision.
	* Sequence length: up to 32k tokens.

	#### Speeds, Sizes, Times

	* Trained with 16× NVIDIA H100 80GB GPUs across two nodes.
	* Full Sharded Data Parallel (FSDP) training, no CPU offloading.

	---

	## Evaluation

	### Testing Data

	Same dataset distribution as training: synthetic, manual, digital, whiteboard, UI-rendered workflows.

	### Factors

	* Source of sample (synthetic, manual, UI, etc.)
	* Orientation (portrait vs. landscape diagrams)
	* Resolution (small <400k pixels, medium, large >1M pixels)

	### Metrics

	All Evaluation metrics can be found in the official [StarFlow repo](https://github.com/ServiceNow/StarFlow).

	* Flow Similarity (FlowSim) – tree edit distance similarity.
	* TreeBLEU – structural recall of subtrees.
	* Trigger Match (TM) – accuracy of workflow triggers.
	* Component Match (CM) – overlap of predicted vs. gold components.

	### Results

	* Proprietary models (GPT-4o, Claude-3.7, Gemini 2.0) outperform open-weights without finetuning.
	* Finetuned Pixtral-12B achieves SOTA:

	* FlowSim w/ inputs: 0.919
	* TreeBLEU w/ inputs: 0.950
	* Trigger Match: 0.753
	* Component Match: 0.930

	#### Summary

	Finetuning yields large gains over base Pixtral-12B and GPT-4o, particularly in matching workflow components and triggers.

	## Model Examination

	* Finetuned models capture naming conventions and structured execution logic better.
	* Failure modes include missing ELSE branches or generic table names.

	---

	## Technical Specifications

	### Model Architecture and Objective

	* Base: Llama-3.2-11B Vision Instruct, a multimodal LLM with 11 B parameters, optimized for image reasoning and instruction-following tasks.
	* Objective: Image-to-JSON structured workflow generation.

	### Compute Infrastructure

	* Hardware: 16× NVIDIA H100 80GB (2 nodes)
	* Software: FSDP, bf16 mixed precision, PyTorch/Transformers

	---

	## Citation

	BibTeX:

	```bibtex
	@article{bechard2025starflow,
	title={StarFlow: Generating Structured Workflow Outputs from Sketch Images},
	author={B{\'e}chard, Patrice and Wang, Chao and Abaskohi, Amirhossein and Rodriguez, Juan and Pal, Christopher and Vazquez, David and Gella, Spandana and Rajeswar, Sai and Taslakian, Perouz},
	journal={arXiv preprint arXiv:2503.21889},
	year={2025}
	}
	```

	APA:
	Béchard, P., Wang, C., Abaskohi, A., Rodriguez, J., Pal, C., Vazquez, D., Gella, S., Rajeswar, S., & Taslakian, P. (2025). StarFlow: Generating Structured Workflow Outputs from Sketch Images. arXiv preprint arXiv:2503.21889.

	---

	## Glossary

	* FlowSim: Metric based on tree edit distance for workflows.
	* TreeBLEU: BLEU-like score using tree structures.
	* Trigger Match: Correctness of predicted workflow trigger.
	* Component Match: Correctness of predicted components (order-agnostic).

	---

	## More Information

	* [ServiceNow Flow Designer](https://www.servicenow.com/products/platform-flow-designer.html)
	* [StarFlow Blog](https://www.servicenow.com/blogs/2025/starflow-ai-turns-sketches-into-workflows)

	---

	## The StarFlow Team

	* Patrice Béchard, Chao Wang, Amirhossein Abaskohi, Juan Rodriguez, Christopher Pal, David Vazquez, Spandana Gella, Sai Rajeswar, Perouz Taslakian

	---

	## Model Card Contact

	* Patrice Bechard - [patrice.bechard@servicenow.com](mailto:patrice.bechard@servicenow.com)
	* ServiceNow Research – [research.servicenow.com](https://research.servicenow.com)