how to run it?

#1
by SlavikF - opened

I tried to use the vLLM nightly and I'm getting this error:

(APIServer pid=1) Value error, The checkpoint you are trying to load has model type qwen3_omni_moe but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

My docker compose:

services:
  vllm:
    # https://hub.docker.com/r/vllm/vllm-openai/tags
    image: vllm/vllm-openai:nightly-0efd540dbc5405ada2f57f09d2a376aecad576dc  # Sep 28
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']    # RTX 4090D 48GB
    ports:
      - "80:80"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9"
    volumes:
      - /home/slavik/.cache/huggingface:/root/.cache/huggingface
    ipc: host
    # https://docs.vllm.ai/en/latest/cli/serve.html
    command:
      - "--model"
      - "cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit"
      - "--max-model-len"
      - "32768"
      - "--served-model-name"
      - "local-qwen3omni30b-q4"
      - "--gpu-memory-utilization"
      - "0.97"
      - "--max-num-seqs"
      - "1"
Owner

Thank you for your interest in the model.

In the original bf16 model card, it is required that both transformers and vllm are built from source, with vllm coming from their forked repo. I'm not sure if they provide vllm docker images, but they recommend the following:

git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git
cd vllm
pip install -r requirements/build.txt
pip install -r requirements/cuda.txt
export VLLM_PRECOMPILED_WHEEL_LOCATION=https://wheels.vllm.ai/a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f/vllm-0.9.2-cp38-abi3-manylinux1_x86_64.whl
VLLM_USE_PRECOMPILED=1 pip install -e . -v --no-build-isolation
# If you meet an "Undefined symbol" error while using VLLM_USE_PRECOMPILED=1, please use "pip install -e . -v" to build from source.
# Install the Transformers
pip install git+https://github.com/huggingface/transformers
pip install accelerate
pip install qwen-omni-utils -U
pip install -U flash-attn --no-build-isolation
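
After that install, a quick sanity check that the freshly built stack is the one being picked up might look like this (just a sketch, not from the official instructions):

# Confirm the locally built vllm and the main-branch transformers are the ones on the path.
python -c "import vllm, transformers; print(vllm.__version__, transformers.__version__)"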

I tried to use the vLLM nightly and I'm getting this error:

(APIServer pid=1) Value error, The checkpoint you are trying to load has model type qwen3_omni_moe but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

Just wanted to confirm that I get the same error. I have added the three pip installs to my custom image:

FROM vllm/vllm-openai:nightly-0efd540dbc5405ada2f57f09d2a376aecad576dc

# This was done for Qwen3-Omni
RUN uv pip install --system git+https://github.com/huggingface/transformers.git@main 
RUN uv pip install --system accelerate
RUN uv pip install --system qwen-omni-utils
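
For reference, building that image and pointing the compose file at it would look roughly like this (the tag name is arbitrary):

# Build the custom image from the Dockerfile above, then reference the tag in docker-compose.yml.
docker build -t vllm-openai-qwen3omni:nightly .
# in docker-compose.yml:  image: vllm-openai-qwen3omni:nightly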

I got this working with a single RTX 5090, but had to compile from source to get support for the Blackwell architecture (sm_120), since this wasn't covered by the pre-compiled wheel available at the time.

I used the PyTorch docker image nvcr.io/nvidia/pytorch:25.03-py3 (PyTorch 2.8, CUDA 12.8) as the build environment, with some tweaks just to get it working, not optimized or anything.
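
The exact invocation doesn't matter much, but starting that container for the build was something like this (mounts and flags are just examples):

# Example only: start the NGC PyTorch container with GPU access and the current directory mounted.
docker run --gpus all -it --rm --ipc=host \
  -v "$PWD":/workspace -w /workspace \
  nvcr.io/nvidia/pytorch:25.03-py3 bash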

Cloned the omni fork of vllm:
git clone -b qwen3_omni https://github.com/wangxiongts/vllm.git vllm-omni
cd vllm-omni

I installed uv to create a virtual environment, since using the global setup was hitting some package version constraints:
uv venv --python 3.12 --seed
source .venv/bin/activate

Environment variables:
export MAX_JOBS="8"
export TORCH_CUDA_ARCH_LIST="12.0"  # this is for the RTX 5090
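
If you're on a different GPU, the right TORCH_CUDA_ARCH_LIST value comes from its compute capability, which you can print like this:

# Prints e.g. (12, 0) for an RTX 5090 or (8, 9) for an RTX 4090.
python -c "import torch; print(torch.cuda.get_device_capability(0))"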

Additional packages:
apt-get update
apt-get install kmod

I modified requirements/build.txt and requirements/cuda.txt to remove torch, torchvision, torchaudio, and xformers, then installed packages using uv:
uv pip install -r requirements/build.txt
uv pip install -r requirements/cuda.txt

Next, installed torchvision and torchaudio (torch should already be installed at this point):
uv pip install torchvision torchaudio

Then compiled vllm (adjust MAX_JOBS for your system; if you don't set a limit, the build can max out memory and crash):
uv pip install -e . -v --no-build-isolation  # took about 30 min for me to build

There were some errors and warnings, but for me nothing ended up being fatal.

Use uv to install the remaining packages:
uv pip install git+https://github.com/huggingface/transformers
uv pip install accelerate
uv pip install qwen-omni-utils -U
uv pip install -U flash-attn --no-build-isolation
uv pip install flashinfer-python
uv pip uninstall pynvml  # pynvml is deprecated and gets pulled in by flashinfer; removing it avoids a warning when running vllm

At this point you should be able to run vllm serve ... with Blackwell and Omni support without any issues.
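
As a rough sketch (flags are illustrative, not prescriptive), the final launch from inside the venv might be:

# Illustrative launch from inside the venv; adjust flags to your setup.
source .venv/bin/activate
vllm serve cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit \
  --max-model-len 32768 \
  --max-num-seqs 1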

Thank you for your help. It works well now.

Works out of the box with v0.11.1:

services:
  vllm:
    image: vllm/vllm-openai:v0.11.1
    container_name: qwen3-omni30b-4090D
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              device_ids: ['0']
    ports:
      - "8000:8000"
    environment:
      TORCH_CUDA_ARCH_LIST: "8.9"
    volumes:
      - /home/slavik/.cache:/root/.cache
    ipc: host
    command:
      - "--model"
      - "cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit"
      - "--max-model-len"
      - "32768"
      - "--served-model-name"
      - "local-qwen3-omni30b-q4"
      - "--gpu-memory-utilization"
      - "0.97"
      - "--max-num-seqs"
      - "1"

On my RTX 4090D with 48GB VRAM, it's using almost all of the VRAM and I'm getting:

# Avg prompt throughput: 1026.6 tokens/s
# Avg generation throughput: 64.2 tokens/s
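
For anyone else checking the endpoint, a minimal request against the OpenAI-compatible API (using the served model name and port from the compose file above) is:

# Minimal chat completion against the server started by the compose file above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-qwen3-omni30b-q4", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'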
