---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
---
# Step3: Cost-Effective Multimodal Intelligence

📰 [Step3 Model Blog](https://stepfun.ai/research/step3)     |     📄 [Paper](https://arxiv.org/abs/2507.19427)

## Introduction

Step3 is our cutting-edge multimodal reasoning model—built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

Step3 model card:

| Config | Value |
| --- | --- |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 5 |
| Hidden Dimension | 7168 |
| Attention Mechanism | MFA |
| Low-rank Query Dimension | 2048 |
| Number of Query Heads | 64 |
| Head Dimension | 256 |
| Number of Experts | 48 |
| Selected Experts per Token | 3 |
| Number of Shared Experts | 1 |
| Max Context Length | 65536 |
| Tokenizer | Deepseek V3 |
| Total Parameters (LLM) | 316B |
| Activated Params per Token | 38B |
| Total Parameters (VLM) | 321B |
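
If you want to check these values programmatically, you can load the model's configuration with `transformers`. This is only a minimal sketch: Step3 uses a custom configuration class, and the attribute names probed below (`num_hidden_layers`, `hidden_size`, `max_position_embeddings`) are common conventions rather than guaranteed field names, so print the full config to see what is actually exposed.

```python
from transformers import AutoConfig

# Load Step3's custom (remote-code) configuration.
config = AutoConfig.from_pretrained("stepfun-ai/step3", trust_remote_code=True)

# Print the full config, then probe a few commonly used attribute names.
# These names are assumptions and may differ in Step3's config class.
print(config)
for name in ("num_hidden_layers", "hidden_size", "max_position_embeddings"):
    if hasattr(config, name):
        print(f"{name} = {getattr(config, name)}")
```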

## Evaluation Results

| Category | Model | Total Params. | MMMU | MathVision | ZeroBench (sub) | DYNAMATH | SimpleVQA | HallusionBench | AIME25 | HMMT25 | CNMO24 | GPQA-Diamond | LiveCodeBench (24.8-25.5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-Source VLM | Step3 | 321B | 74.2 | 64.8 | 23.0 | 50.1 | 62.2 | 64.2 | 82.9 | 70.0 | 83.7 | 73.0 | 67.1 |
| Open-Source VLM | ERNIE4.5-thinking | 300B/424B | 70.0 | 47.6 | 22.5 | 46.9 | 59.8 | 60.0 | 35.1 | 40.5* | 75.5 | 76.8 | 38.8 |
| Open-Source VLM | GLM-4.1V-thinking | 9B | 68.0 | 49.4 | 22.8 | 41.9 | 48.1 | 60.8 | 13.3 | 6.7 | 25.0 | 47.4 | 24.2 |
| Open-Source VLM | MiMo-VL | 7B | 66.7 | 60.4 | 18.6 | 45.9 | 48.5 | 59.6 | 60.0 | 34.6 | 69.9 | 55.5 | 50.1 |
| Open-Source VLM | QvQ-72B-Preview | 72B | 70.3 | 35.9 | 15.9 | 30.7 | 40.3 | 50.8 | 22.7 | 49.5 | 47.3 | 10.9 | 24.1 |
| Open-Source VLM | LLaMA-Maverick | 400B | 73.4 | 47.2 | 22.8 | 47.1 | 45.4 | 57.1 | 19.2 | 8.91 | 41.6 | 69.8 | 33.9 |
| Open-Source LLM | MiniMax-M1-80k | 456B | - | - | - | - | - | - | 76.9 | - | - | 70.0 | 65.0 |
| Open-Source LLM | Qwen3-235B-A22B-Thinking | 235B | - | - | - | - | - | - | 81.5 | 62.5 | - | 71.1 | 65.9 |
| Open-Source LLM | DeepSeek R1-0528 | 671B | - | - | - | - | - | - | 87.5 | 79.4 | 86.9 | 81.0 | 73.3 |
| Open-Source LLM | Qwen3-235B-A22B-Thinking-2507 | 235B | - | - | - | - | - | - | 92.3 | 83.9 | - | 81.1 | - |
| Proprietary VLM | O3 | - | 82.9 | 72.8 | 25.2 | 58.1 | 59.8 | 60.1 | 88.9 | 70.1 | 86.7 | 83.3 | 75.8 |
| Proprietary VLM | Claude4 Sonnet (thinking) | - | 76.9 | 64.6 | 26.1 | 48.1 | 43.7 | 57.0 | 70.5 | - | - | 75.4 | 55.9 |
| Proprietary VLM | Claude4 Opus (thinking) | - | 79.8 | 66.1 | 25.2 | 49.3 | 47.2 | 59.9 | 75.5 | - | - | 79.6 | 56.6 |
| Proprietary VLM | Gemini 2.5 Flash (thinking)† | - | 73.2 | 57.3 | 20.1 | 57.1 | 61.1 | 65.2 | 72.0 | - | - | 82.8 | 61.9 |
| Proprietary VLM | Gemini 2.5 Pro | - | 81.7 | 73.3 | 30.8 | 56.3 | 66.8 | 66.8 | 88.0 | - | - | 86.4 | 71.8 |
| Proprietary VLM | Grok 4 | - | 80.9 | 70.3 | 22.5 | 40.7 | 55.9 | 64.8 | 98.8 | 93.9 | 85.5 | 87.5 | 79.3 |

Note: Some of the evaluation results were reproduced using the same settings.
†: Evaluation results of Gemini 2.5 Flash (thinking) may understate its real performance, especially on MathVision, due to insufficient instruction-following ability.

## Deployment

Step3's API is available at https://platform.stepfun.com/, where we offer an OpenAI-compatible API. A minimal request sketch is shown below.
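
As a hedged illustration of calling the hosted, OpenAI-compatible endpoint with the official `openai` Python client: the base URL (`https://api.stepfun.com/v1`) and model identifier (`step-3`) used here are assumptions, so check the platform documentation for the exact values used by your account.

```python
from openai import OpenAI

# Assumptions: base URL and model name follow StepFun platform conventions;
# adjust them according to the platform documentation.
client = OpenAI(
    api_key="YOUR_STEPFUN_API_KEY",
    base_url="https://api.stepfun.com/v1",
)

response = client.chat.completions.create(
    model="step-3",
    messages=[{"role": "user", "content": "Briefly introduce the Step3 model."}],
)
print(response.choices[0].message.content)
```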

### Inference with Hugging Face Transformers

The following shows how to run inference with the `transformers` library. We recommend `python=3.10`, `torch>=2.1.0`, and `transformers=4.54.0` as the development environment. Currently only bf16 inference is supported, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint keys so the vision tower, downsamplers, and projector
# load into the corresponding submodules of the combined VLM.
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}

model_path = "stepfun-ai/step3"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    key_mapping=key_mapping,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(decoded)
```
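
If you prefer to see tokens as they are produced rather than waiting for the full generation, `transformers`' `TextStreamer` can be attached to `generate`. This is a minimal sketch that reuses `model`, `processor`, and `inputs` from the example above; it assumes the processor exposes its underlying tokenizer as `processor.tokenizer`.

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated;
# skip_prompt suppresses echoing the input prompt.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=1024, do_sample=False, streamer=streamer)
```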

### Inference with vLLM and SGLang

Our model checkpoints are stored in bf16 and block-fp8 formats; you can find them on Hugging Face.

Currently, it is recommended to run Step3 on the following inference engines:

- vLLM
- SGLang

Deployment and request examples for vLLM and SGLang can be found in the Model Deployment Guide; a minimal local-serving sketch is shown below.
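
As an illustrative sketch only (the authoritative instructions are in the Model Deployment Guide): once a vLLM OpenAI-compatible server is running for the checkpoint, e.g. via `vllm serve stepfun-ai/step3-fp8 --trust-remote-code` with parallelism flags chosen for your hardware, a multimodal request can be sent with the `openai` client. The port, served model name, and image URL below are assumptions.

```python
from openai import OpenAI

# Point the client at a locally running vLLM (or SGLang) OpenAI-compatible server.
# The port and served model name are assumptions; match them to your launch command.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="stepfun-ai/step3-fp8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}},
                {"type": "text", "text": "What's in this picture?"},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```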

## Contact Us

If you have any questions, please reach out to us at contact@stepfun.com.

## License

Both the code repository and the model weights are released under the Apache License (Version 2.0).

## Citation

```bibtex
@misc{step3system,
      title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding},
      author={StepFun Team},
      year={2025},
      eprint={2507.19427},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19427},
}

@misc{step3blog,
      title={Step3: Cost-Effective Multimodal Intelligence},
      author={StepFun Team},
      url={https://stepfun.ai/research/step3},
}
```