vLLM compatibility?
(Worker_PP2 pid=175) ERROR ... File ".../quantization/bitsandbytes.py", line 602, in _create_weights_8bit
(Worker_PP2 pid=175) ERROR ... raise NotImplementedError
(Worker_PP2 pid=175) ERROR ... NotImplementedError
I'm currently building vLLM from main.
docker run \
--name vllm-qwen3-next-80b-a3b-fp8 \
--gpus all \
--ipc=host \
--network shared-network \
-p 8080:8080 \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e CUDA_VISIBLE_DEVICES=0,1,2,3 \
-e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
-v /root/.cache/huggingface:/root/.cache/huggingface \
vllm-local:latest \
--model DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 4 \
--quantization bitsandbytes \
--max-model-len 4096 \
--port 8080 \
--host 0.0.0.0
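Reading the traceback, the failure is in vLLM's bitsandbytes path (_create_weights_8bit), and this is an FP8 checkpoint, so maybe forcing --quantization bitsandbytes is the problem. A sketch of the same launch without that flag, trimmed to the essentials and assuming vLLM auto-detects the FP8 scheme from the checkpoint's quantization_config:

docker run \
--name vllm-qwen3-next-80b-a3b-fp8 \
--gpus all \
--ipc=host \
-p 8080:8080 \
-v /root/.cache/huggingface:/root/.cache/huggingface \
vllm-local:latest \
--model DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8 \
--tensor-parallel-size 1 \
--pipeline-parallel-size 4 \
--max-model-len 4096 \
--port 8080 \
--host 0.0.0.0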
I'm uploading (the upload is running now) an FP8-Dynamic quant that I made with llmcompressor; it should work with vLLM.
@aidendle94
I'd appreciate it if you could check it out:
DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic
(2x5090 is not enough for this :D )
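Something like this should be enough to test it (just a sketch, assuming the standard vllm serve entrypoint; adjust the parallelism to your GPUs):

vllm serve DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic \
--tensor-parallel-size 1 \
--pipeline-parallel-size 4 \
--max-model-len 4096 \
--host 0.0.0.0 \
--port 8080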
Sweet, I'll give it a try! Appreciate it greatly. I'm on 4x 3090 so it should work. I can install 2 more to get 6 if needed lol
Did you ever get MTP to work? Everything works except MTP on 4x 3090.
Update: nvm, got it to work on vLLM with exactly 0.93 mem util.
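For anyone else on tight VRAM, these were the relevant knobs (a sketch; the MTP speculative-config method name is taken from vLLM's Qwen3-Next recipe and may vary by version):

vllm serve DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic \
--pipeline-parallel-size 4 \
--gpu-memory-utilization 0.93 \
--speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 2}'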
Ohh great, so the FP8-Dynamic works.
Yep, works great. MTP ended up crashing, but I suspect that might be a hardware limitation; I OOM. Thanks!
@aidendle94 Hi, could you try --tensor-parallel-size 4? I got RuntimeError: size_n = 3088 is not divisible by tile_n_size = 64.
I got the same error on 2x A100 (using model: DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic):
RuntimeError: size_n = 6176 is not divisible by tile_n_size = 64
TP doesn't seem to work. I suspect it's a limitation of Ampere's compatibility with FP8. You'll have to rely on PP.
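A sketch for the 2x A100 case with PP instead of TP (same flags as vLLM's serve CLI):

vllm serve DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic \
--tensor-parallel-size 1 \
--pipeline-parallel-size 2 \
--max-model-len 4096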