sorta works on vLLM now, with this image:
public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0eecb3166365a29db117c2aff6ca441b484b514d
works
but the chat template omits the opening <think> tag
works if you prefix the generation with <think>\n
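One way to do the prefixing (just a sketch, assuming an OpenAI-compatible vLLM server on port 2243 as in the docker command further down; the model name is a placeholder for whichever checkpoint you actually serve) is to render the chat template yourself, append <think>\n, and hit the raw /v1/completions endpoint:

```python
# Sketch of the <think>\n prefix workaround. Assumptions: vLLM's OpenAI-compatible
# server is reachable on localhost:2243 and MODEL matches whatever you served.
import requests
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # placeholder, use your served checkpoint

tok = AutoTokenizer.from_pretrained(MODEL)
messages = [{"role": "user", "content": "Why is the sky blue?"}]

# Render the chat template manually, then force the opening think tag that the
# template omits.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"

resp = requests.post(
    "http://localhost:2243/v1/completions",
    json={"model": MODEL, "prompt": prompt, "max_tokens": 512, "temperature": 0.7},
)
print(resp.json()["choices"][0]["text"])
```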
sglang with spec decoding would probably be more interesting... I'm getting about 90 t/s on 2 A6000s with vLLM
I am happy that it works on your machine, and thank you for your input :)
Would you mind sharing the steps to reproduce your environment? I managed to make it work on my 6000 Pro, but it doesn't produce more than 30 t/s when I should be able to squeeze out a lot more.
My env contains vLLM nightly, flashinfer, causal-conv1d and flash-linear-attention.
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=YOURTOKEN" -p 2243:8000 --ipc=host --env "TRANSFORMERS_OFFLINE=1" --env "VLLM_CONFIGURE_LOGGING=1" -it public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0eecb3166365a29db117c2aff6ca441b484b514d
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --port 8000 --tensor-parallel-size 2
mind you my port is mapped to 2243
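A quick smoke test against that container could look like this (assumes the 2243 -> 8000 mapping above; adjust base_url if you map the port differently):

```python
# Smoke test against the container above (assumes the 2243 -> 8000 port mapping
# from the docker command; change base_url if yours differs).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2243/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```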
Many thanks, good sir, I'll try to reproduce that tomorrow.
so the "important note" on the model card is out of date? Is it working with prefixing "\n" ?
It'd be weird given it's an instruct model, not a thinking model...
Is that what's causing the issue with the slow response latency?
Ok, I kinda managed to reproduce the docker env.
First and foremost, it is important to disable chunked prefill (that was the reason it was so slow for me; the flag is --no-enable-chunked-prefill, as in the speculative command further down).
My generation speed at batch size 1 on an RTX Pro 6000 Max-Q, by context length:
0-64: 110.22 tokens/sec
64-128: 109.31 tokens/sec
128-256: 103.62 tokens/sec
256-512: 108.20 tokens/sec
512-1024: 109.25 tokens/sec
1024-2048: 91.77 tokens/sec
At batch size 32 I can push about 1800 t/s of total generation throughput.
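For reference, a quick-and-dirty way to get batch-1 numbers like the above (not a proper benchmark; it assumes the server launched below on port 5000 with --served-model-name gpt-4, and it lumps prefill in with decode, so it slightly understates pure generation speed):

```python
# Rough batch-1 tokens/sec check. Assumes the vLLM server launched below
# (--port 5000, --served-model-name gpt-4); adjust to your own setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-4",  # whatever was passed to --served-model-name
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint."}],
    max_tokens=512,
    temperature=0.7,
)
elapsed = time.perf_counter() - start

# Prefill time is included in `elapsed`, so this slightly understates decode speed.
gen_tokens = resp.usage.completion_tokens
print(f"{gen_tokens} tokens in {elapsed:.2f}s -> {gen_tokens / elapsed:.2f} tok/s")
```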
Creating the env:
conda create --name qnext python=3.12
conda activate qnext
uv pip install torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128 --index-url https://download.pytorch.org/whl/cu128
uv pip install accelerate bitsandbytes cmake ninja
uv pip install -U https://github.com/huggingface/transformers/archive/refs/heads/main.zip
uv pip install datasets peft
uv pip install flash-attn --no-build-isolation # This might be required for vLLM
uv pip install -U --pre --no-deps flashinfer-python
uv pip install nvidia-nvshmem-cu12
uv pip install einops ninja datasets transformers numpy
uv pip install causal-conv1d
uv pip uninstall fla-core flash-linear-attention
uv pip install -U git+https://github.com/fla-org/flash-linear-attention --no-deps
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
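A quick import check before launching can catch a broken wheel early (throwaway sketch; the module names are simply what these packages expose, e.g. flash-linear-attention imports as fla):

```python
# Quick sanity check of the env above. Module names as the packages expose them:
# flashinfer-python -> flashinfer, causal-conv1d -> causal_conv1d,
# flash-linear-attention -> fla.
import torch
import vllm
import flashinfer
import causal_conv1d
import fla

print("torch", torch.__version__, "| cuda", torch.version.cuda,
      "| gpu:", torch.cuda.is_available())
print("vllm", vllm.__version__)
```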
Launching the model:
export NVCC_THREADS=16
export MAX_JOBS=16
export OMP_NUM_THREADS=16
export MODEL_NAME="Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve "$MODEL_NAME" \
--served-model-name gpt-4 \
--port 5000 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--trust_remote_code
With this, the very first request has a huge latency (most likely JIT compilation), but the next ones are much faster.
So far I couldn't get multi-token prediction to work though :/
Note that on the same device, the generation speeds are roughly equivalent to those of Mistral Small.
In theory, with further optimization it should be possible to reach ~180 t/s at batch size 1, and much more with MTP enabled; I hope vLLM support improves in the upcoming weeks.
In the meantime, thanks cpatonn for the model, much appreciated. May I ask if you'd mind sharing your quantization script? When I tried to perform the quant myself, llmcompressor wasn't willing to work with it, and gptqmodel would eat up my 128 GB RAM + 80 GB swap to the point of crashing Python.
Okay, I managed to enable MTP, but the draft acceptance rate is very low (about 2/1000). Maybe there is some issue with the EAGLE head quantization?
EAGLE on vLLM? What are your sglang params, how do you start it?
Actually I'm using it on vLLM directly (so far, for some reason, the flashinfer backend does not start on my device, so I have to use the flash-attn backend).
The command to run with MTP / EAGLE is:
export NVCC_THREADS=16
export MAX_JOBS=16
export OMP_NUM_THREADS=16
export MODEL_NAME="Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve "$MODEL_NAME" \
--served-model-name gpt-4 \
--port 5000 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--trust_remote_code \
--tokenizer-mode auto \
--no-enable-chunked-prefill \
--speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 7}'
works for num_speculative_tokens = 3 or 7
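If anyone wants to check the acceptance rate themselves, one way (rough sketch) is to scrape the Prometheus /metrics endpoint the server exposes and look for the spec-decode counters; the exact metric names vary between vLLM versions, hence the substring match:

```python
# Eyeball the draft acceptance rate by grepping vLLM's Prometheus /metrics endpoint
# for spec-decode counters (exact metric names differ between vLLM versions).
import requests

metrics = requests.get("http://localhost:5000/metrics").text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```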
I tried the fp8 version of the model as well; with it, the acceptance rate is about 50%, but it runs much slower than without spec decode, and I don't know why yet.
I've now moved on to running the vLLM MoE kernel tuning script; I'll see if it improves the base speed on both the AWQ and fp8 variants.
Agreed, something is weird with the spec heads. fp8 is possible to run for me, but Ampere has no optimizations for that, sadly.