vLLM compatibility?

#1
by aidendle94 - opened

(Worker_PP2 pid=175) ERROR ... File ".../quantization/bitsandbytes.py", line 602, in _create_weights_8bit
(Worker_PP2 pid=175) ERROR ... raise NotImplementedError
(Worker_PP2 pid=175) ERROR ... NotImplementedError

I'm currently building from main. Full docker run command:

docker run \
  --name vllm-qwen3-next-80b-a3b-fp8 \
  --gpus all \
  --ipc=host \
  --network shared-network \
  -p 8080:8080 \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
  -v /root/.cache/huggingface:/root/.cache/huggingface \
  vllm-local:latest \
  --model DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8 \
  --tensor-parallel-size 1 \
  --pipeline-parallel-size 4 \
  --quantization bitsandbytes \
  --max-model-len 4096 \
  --port 8080 \
  --host 0.0.0.0
DevQuasar org

I'm uploading (the upload is running now) an FP8-Dynamic quant that I made with llmcompressor; it should work with vLLM.

DevQuasar org
β€’
edited Sep 13

@aidendle94
I'd appreciate it if you could check it out:
DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic
(2x5090 is not enough for this :D )

Sweet, I'll give it a try! Appreciate it greatly. I'm on 4x 3090, so it should work. I can install 2 more to get 6 if needed lol

Did you ever get MTP (multi-token prediction) to work? Everything works except MTP on 4x 3090.

Update: never mind, got it to work on vLLM with exactly 0.93 mem util.

DevQuasar org

Ohh great, so the FP8-Dynamic works!

Yep, works great! MTP ended up crashing, but I suspect that might be a hardware limitation; I OOM. Thanks!

@aidendle94 Hi, could you try --tensor-parallel-size 4? I got RuntimeError: size_n = 3088 is not divisible by tile_n_size = 64.

I got the same error on 2x A100 (using model DevQuasar/Qwen.Qwen3-Next-80B-A3B-Instruct-FP8-Dynamic):

RuntimeError: size_n = 6176 is not divisible by tile_n_size = 64

TP doesn't seem to work. I suspect it's a limitation of Ampere's FP8 support: without native FP8, vLLM falls back to the FP8 Marlin kernel, and TP shards the weight's output dimension (12352/2 = 6176, 12352/4 = 3088), neither of which is divisible by the kernel's 64-wide tile. You'll have to rely on PP.
