Qwen3-VL-30B-A3B-Thinking-AWQ-8bit fails to load on vLLM ROCm build (“Only support 4-bit quantize of AWQ”)

#1 opened by Chris0105

Hi, thanks for releasing Qwen3-VL! I’m running inference on an AMD Instinct MI60 cluster (ROCm 5.7, vLLM 0.11.0+gfx906). The 4-bit AWQ checkpoints work great, but every time I try the 8-bit AWQ variant the server aborts during warm-up with:

RuntimeError: Only support 4-bit quantize of AWQ
  File ".../vllm/_custom_ops.py", line 367, in gptq_gemm
  File ".../vllm/model_executor/layers/quantization/kernels/mixed_precision/exllama.py", line 131, in apply_weights

Repro:

vllm serve /model/Qwen3-VL-30B-A3B-Thinking-AWQ-8bit \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --dtype float16

The stack trace shows vLLM falling back to ExllamaLinearKernel, which calls the ROCm GPTQ GEMM kernel. That kernel currently hard-checks bit == 4 whenever the AWQ path (bias_one=False) is selected. Because the 8-bit checkpoint sets num_bits: 8, initialization always fails.
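In case it helps, this is how I confirmed what the checkpoint declares before serving (assuming jq is installed; reading config.json by hand works just as well):

jq '.quantization_config' /model/Qwen3-VL-30B-A3B-Thinking-AWQ-8bit/config.json

which prints the 8-bit settings that trip the check.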

Is there an 8-bit-friendly quantization format you recommend for ROCm, or is the model expected to run only on the CUDA build today? If there's a workaround (e.g., a different quant config or updated kernel support), any pointers would be super helpful. Thanks!

Owner

Thank you for using the model. Could you try adding the flag --quantization compressed-tensors?
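For example, your repro command with only that flag added:

vllm serve /model/Qwen3-VL-30B-A3B-Thinking-AWQ-8bit \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --dtype float16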

Hi, thanks for your reply. Even after adding the --quantization compressed-tensors flag, I still get the same error: "Only support 4-bit quantize of AWQ."

It looks like the 8-bit AWQ checkpoint currently doesn't run on ROCm. As an alternative, I think a GPTQ-Int8 model should work fine on my platform.
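For reference, this is roughly what I have in mind; the model path is a placeholder for whatever GPTQ-Int8 checkpoint becomes available, and the rest mirrors my original command:

vllm serve /model/Qwen3-VL-30B-A3B-Thinking-GPTQ-Int8 \
  --quantization gptq \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --dtype float16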
