vLLM compatibility?

#1
by aidendle94 - opened

(Worker_PP2 pid=175) ERROR ... File ".../quantization/bitsandbytes.py", line 602, in _create_weights_8bit
(Worker_PP2 pid=175) ERROR ... raise NotImplementedError
(Worker_PP2 pid=175) ERROR ... NotImplementedError

I'm currently building from main. Full docker run command:

docker run \
  --name vllm-qwen3-next-80b-a3b-fp8 \
  --gpus all \
  --ipc=host \
  --network shared-network \
  -p 8080:8080 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  vllm-local:latest \
  --model DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --quantization bitsandbytes \
  --max-model-len 4096 \
  --port 8080 \
  --host 0.0.0.0
DevQuasar org

I'm uploading (the upload is running now) an FP8-Dynamic quant that I made with llmcompressor; it should work with vLLM.

DevQuasar org
β€’
edited Sep 13

@aidendle94
I'd appreciate it if you could check it out:
DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic
(2x5090 is not enough for this :D )

Sweet, I'll give it a try! Appreciate it greatly. I'm on 4x 3090, so it should work. I can install 2 more to get 6 if needed lol

Did you ever get MTP (multi-token prediction) to work? Everything works except MTP on 4x 3090.

Update: never mind, got it to work on vLLM with exactly 0.93 mem util.

DevQuasar org

Ohh great, so the FP8-Dynamic works!

Yep, works great! MTP ended up crashing, but I suspect that might be a hardware limitation; I OOM. Thanks!

@aidendle94 Hi, could you try --tensor-parallel-size 4? I got RuntimeError: size_n = 3088 is not divisible by tile_n_size = 64.

I got the same error on 2x A100 (using model DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic):

RuntimeError: size_n = 6176 is not divisible by tile_n_size = 64

TP doesn't seem to work. I suspect it's a limitation of Ampere's FP8 support: without native FP8, vLLM falls back to the FP8 Marlin kernel, and TP shards the weight's output dimension (12352/2 = 6176, 12352/4 = 3088), neither of which is divisible by the kernel's 64-wide tile. You'll have to rely on PP.
