Qwen3-VL-30B-A3B-Thinking-AWQ-8bit fails to load on vLLM ROCm build (“Only support 4-bit quantize of AWQ”)
Hi, thanks for releasing Qwen3-VL! I’m running inference on an AMD Instinct MI60 cluster (ROCm 5.7, vLLM 0.11.0+gfx906). The 4-bit AWQ checkpoints work great, but every time I try the 8-bit AWQ variant the server aborts during warm-up with:
```
RuntimeError: Only support 4-bit quantize of AWQ
  File ".../vllm/_custom_ops.py", line 367, in gptq_gemm
  File ".../vllm/model_executor/layers/quantization/kernels/mixed_precision/exllama.py", line 131, in apply_weights
```
Repro:

```bash
vllm serve /model/Qwen3-VL-30B-A3B-Thinking-AWQ-8bit \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --dtype float16
```
The stack trace shows vLLM falling back to `ExllamaLinearKernel`, which calls the ROCm GPTQ GEMM kernel. That kernel currently hard-checks `bit == 4` whenever the AWQ path (`bias_one=False`) is selected. Because the 8-bit checkpoint sets `num_bits: 8`, initialization always fails.
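For completeness, the bit width vLLM rejects comes straight from the checkpoint's `config.json`. Here is a minimal sketch to dump it (the path matches the repro above; the key layout differs between exporters, e.g. a flat `bits` vs. a nested `num_bits`, so it just prints the whole `quantization_config` block):

```python
import json
from pathlib import Path

# Local checkpoint path from the repro above; adjust if yours differs.
model_dir = Path("/model/Qwen3-VL-30B-A3B-Thinking-AWQ-8bit")

config = json.loads((model_dir / "config.json").read_text())

# Key layout varies by exporter (AutoAWQ-style "bits" vs.
# compressed-tensors-style "num_bits" nested under config_groups),
# so print the whole block rather than a single key.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```

For this checkpoint it shows `num_bits: 8`, which is exactly what the ROCm guard rejects.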
Is there an 8-bit-friendly quantization format you recommend for ROCm, or is the model expected to run only on the CUDA build today? If there's a workaround (e.g., a different quant config or updated kernel support), any pointers would be super helpful. Thanks!
Thank you for using the model. Could you try adding the flag `--quantization compressed-tensors`?
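For example, keeping your repro command unchanged and only appending the flag:

```bash
vllm serve /model/Qwen3-VL-30B-A3B-Thinking-AWQ-8bit \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --dtype float16 \
  --quantization compressed-tensors
```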
Hi, thanks for your reply. Even after adding the `--quantization compressed-tensors` flag, I still get the same error: "Only support 4-bit quantize of AWQ".
So it seems the 8-bit model currently doesn't run correctly on ROCm. As an alternative, I think the GPTQ-Int8 model should work fine on my platform.