---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
---
# Step3: Cost-Effective Multimodal Intelligence

📰 [Step3 Model Blog](https://stepfun.ai/research/step3)     |     📄 [Paper](https://arxiv.org/abs/2507.19427)

## Introduction

Step3 is our cutting-edge multimodal reasoning model—built on a Mixture-of-Experts architecture with 321B total parameters and 38B active. It is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision–language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators.

Step3 model card:

| Config | Value |
| --- | --- |
| Number of Layers (Dense layer included) | 61 |
| Number of Dense Layers | 5 |
| Hidden Dimension | 7168 |
| Attention Mechanism | MFA |
| Low-rank Query Dimension | 2048 |
| Number of Query Heads | 64 |
| Head Dimension | 256 |
| Number of Experts | 48 |
| Selected Experts per Token | 3 |
| Number of Shared Experts | 1 |
| Max Context Length | 65536 |
| Tokenizer | Deepseek V3 |
| Total Parameters (LLM) | 316B |
| Activated Params per Token | 38B |
| Total Parameters (VLM) | 321B |
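
If you want to check these values programmatically, you can load the model's configuration with `transformers`. This is only a minimal sketch: Step3 uses a custom configuration class, and the attribute names probed below (`num_hidden_layers`, `hidden_size`, `max_position_embeddings`) are common conventions rather than guaranteed field names, so print the full config to see what is actually exposed.

```python
from transformers import AutoConfig

# Load Step3's custom (remote-code) configuration.
config = AutoConfig.from_pretrained("stepfun-ai/step3", trust_remote_code=True)

# Print the full config, then probe a few commonly used attribute names.
# These names are assumptions and may differ in Step3's config class.
print(config)
for name in ("num_hidden_layers", "hidden_size", "max_position_embeddings"):
    if hasattr(config, name):
        print(f"{name} = {getattr(config, name)}")
```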

## Evaluation Results

| Category | Model | Total Params. | MMMU | MathVision | ZeroBench (sub) | DYNAMATH | SimpleVQA | HallusionBench | AIME25 | HMMT25 | CNMO24 | GPQA-Diamond | LiveCodeBench (24.8-25.5) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Open-Source VLM | Step3 | 321B | 74.2 | 64.8 | 23.0 | 50.1 | 62.2 | 64.2 | 82.9 | 70.0 | 83.7 | 73.0 | 67.1 |
| Open-Source VLM | ERNIE4.5-thinking | 300B/424B | 70.0 | 47.6 | 22.5 | 46.9 | 59.8 | 60.0 | 35.1 | 40.5* | 75.5 | 76.8 | 38.8 |
| Open-Source VLM | GLM-4.1V-thinking | 9B | 68.0 | 49.4 | 22.8 | 41.9 | 48.1 | 60.8 | 13.3 | 6.7 | 25.0 | 47.4 | 24.2 |
| Open-Source VLM | MiMo-VL | 7B | 66.7 | 60.4 | 18.6 | 45.9 | 48.5 | 59.6 | 60.0 | 34.6 | 69.9 | 55.5 | 50.1 |
| Open-Source VLM | QvQ-72B-Preview | 72B | 70.3 | 35.9 | 15.9 | 30.7 | 40.3 | 50.8 | 22.7 | 49.5 | 47.3 | 10.9 | 24.1 |
| Open-Source VLM | LLaMA-Maverick | 400B | 73.4 | 47.2 | 22.8 | 47.1 | 45.4 | 57.1 | 19.2 | 8.91 | 41.6 | 69.8 | 33.9 |
| Open-Source LLM | MiniMax-M1-80k | 456B | - | - | - | - | - | - | 76.9 | - | - | 70.0 | 65.0 |
| Open-Source LLM | Qwen3-235B-A22B-Thinking | 235B | - | - | - | - | - | - | 81.5 | 62.5 | - | 71.1 | 65.9 |
| Open-Source LLM | DeepSeek R1-0528 | 671B | - | - | - | - | - | - | 87.5 | 79.4 | 86.9 | 81.0 | 73.3 |
| Open-Source LLM | Qwen3-235B-A22B-Thinking-2507 | 235B | - | - | - | - | - | - | 92.3 | 83.9 | - | 81.1 | - |
| Proprietary VLM | O3 | - | 82.9 | 72.8 | 25.2 | 58.1 | 59.8 | 60.1 | 88.9 | 70.1 | 86.7 | 83.3 | 75.8 |
| Proprietary VLM | Claude4 Sonnet (thinking) | - | 76.9 | 64.6 | 26.1 | 48.1 | 43.7 | 57.0 | 70.5 | - | - | 75.4 | 55.9 |
| Proprietary VLM | Claude4 Opus (thinking) | - | 79.8 | 66.1 | 25.2 | 49.3 | 47.2 | 59.9 | 75.5 | - | - | 79.6 | 56.6 |
| Proprietary VLM | Gemini 2.5 Flash (thinking)† | - | 73.2 | 57.3 | 20.1 | 57.1 | 61.1 | 65.2 | 72.0 | - | - | 82.8 | 61.9 |
| Proprietary VLM | Gemini 2.5 Pro | - | 81.7 | 73.3 | 30.8 | 56.3 | 66.8 | 66.8 | 88.0 | - | - | 86.4 | 71.8 |
| Proprietary VLM | Grok 4 | - | 80.9 | 70.3 | 22.5 | 40.7 | 55.9 | 64.8 | 98.8 | 93.9 | 85.5 | 87.5 | 79.3 |

Note: Some of the evaluation results were reproduced using the same settings.
†: Evaluation results of Gemini 2.5 Flash (thinking) may understate its real performance, especially on MathVision, due to insufficient instruction-following ability.

## Deployment

Step3's API is available at https://platform.stepfun.com/, where we offer an OpenAI-compatible API. A minimal request sketch is shown below.
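
As a hedged illustration of calling the hosted, OpenAI-compatible endpoint with the official `openai` Python client: the base URL (`https://api.stepfun.com/v1`) and model identifier (`step-3`) used here are assumptions, so check the platform documentation for the exact values used by your account.

```python
from openai import OpenAI

# Assumptions: base URL and model name follow StepFun platform conventions;
# adjust them according to the platform documentation.
client = OpenAI(
    api_key="YOUR_STEPFUN_API_KEY",
    base_url="https://api.stepfun.com/v1",
)

response = client.chat.completions.create(
    model="step-3",
    messages=[{"role": "user", "content": "Briefly introduce the Step3 model."}],
)
print(response.choices[0].message.content)
```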

### Inference with Hugging Face Transformers

The following shows how to run inference with the `transformers` library. We recommend `python=3.10`, `torch>=2.1.0`, and `transformers=4.54.0` as the development environment. Currently only bf16 inference is supported, and multi-patch image preprocessing is enabled by default; this behavior is aligned with vLLM and SGLang.

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Remap checkpoint keys so the vision tower, downsamplers, and projector
# load into the corresponding submodules of the combined VLM.
key_mapping = {
    "^vision_model": "model.vision_model",
    r"^model(?!\.(language_model|vision_model))": "model.language_model",
    "vit_downsampler": "model.vit_downsampler",
    "vit_downsampler2": "model.vit_downsampler2",
    "vit_large_projector": "model.vit_large_projector",
}

model_path = "stepfun-ai/step3"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype="auto",
    trust_remote_code=True,
    key_mapping=key_mapping,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "What's in this picture?"}
        ]
    },
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

generate_ids = model.generate(**inputs, max_new_tokens=32768, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
decoded = processor.decode(generate_ids[0, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

print(decoded)
```
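
If you prefer to see tokens as they are produced rather than waiting for the full generation, `transformers`' `TextStreamer` can be attached to `generate`. This is a minimal sketch that reuses `model`, `processor`, and `inputs` from the example above; it assumes the processor exposes its underlying tokenizer as `processor.tokenizer`.

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated;
# skip_prompt suppresses echoing the input prompt.
streamer = TextStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, max_new_tokens=1024, do_sample=False, streamer=streamer)
```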

### Inference with vLLM and SGLang

Our model checkpoints are stored in bf16 and block-fp8 formats; you can find them on Hugging Face.

Currently, it is recommended to run Step3 on the following inference engines:

- vLLM
- SGLang

Deployment and request examples for vLLM and SGLang can be found in the Model Deployment Guide; a minimal local-serving sketch is shown below.
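
As an illustrative sketch only (the authoritative instructions are in the Model Deployment Guide): once a vLLM OpenAI-compatible server is running for the checkpoint, e.g. via `vllm serve stepfun-ai/step3-fp8 --trust-remote-code` with parallelism flags chosen for your hardware, a multimodal request can be sent with the `openai` client. The port, served model name, and image URL below are assumptions.

```python
from openai import OpenAI

# Point the client at a locally running vLLM (or SGLang) OpenAI-compatible server.
# The port and served model name are assumptions; match them to your launch command.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="stepfun-ai/step3-fp8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"}},
                {"type": "text", "text": "What's in this picture?"},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```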

## Contact Us

If you have any questions, please reach out to us at contact@stepfun.com.

## License

Both the code repository and the model weights are released under the Apache License (Version 2.0).

## Citation

```bibtex
@misc{step3system,
      title={Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding},
      author={StepFun Team},
      year={2025},
      eprint={2507.19427},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2507.19427},
}

@misc{step3blog,
      title={Step3: Cost-Effective Multimodal Intelligence},
      author={StepFun Team},
      url={https://stepfun.ai/research/step3},
}
```