sorta works on vLLM now, with this image:
public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0eecb3166365a29db117c2aff6ca441b484b514d
works
but the chat template omits the opening <think> tag
works if you prefix the generation with <think>\n
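One way to do the prefixing (just a sketch, assuming an OpenAI-compatible vLLM server on port 2243 as in the docker command further down; the model name is a placeholder for whichever checkpoint you actually serve) is to render the chat template yourself, append <think>\n, and hit the raw /v1/completions endpoint:

```python
# Sketch of the <think>\n prefix workaround. Assumptions: vLLM's OpenAI-compatible
# server is reachable on localhost:2243 and MODEL matches whatever you served.
import requests
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # placeholder, use your served checkpoint

tok = AutoTokenizer.from_pretrained(MODEL)
messages = [{"role": "user", "content": "Why is the sky blue?"}]

# Render the chat template manually, then force the opening think tag that the
# template omits.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"

resp = requests.post(
    "http://localhost:2243/v1/completions",
    json={"model": MODEL, "prompt": prompt, "max_tokens": 512, "temperature": 0.7},
)
print(resp.json()["choices"][0]["text"])
```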
sglang with spec decoding would probably be more interesting... I'm getting about 90 t/s on 2 A6000s with vLLM
I am happy that it works on your machine, and thank you for your input :)
Would you mind sharing the steps to reproduce your environment? I managed to make it work on my 6000 Pro, but it doesn't produce more than 30 t/s when I should be able to squeeze out a lot more.
My env contains vLLM nightly, flashinfer, causal-conv1d and flash-linear-attention.
docker run --runtime nvidia --gpus all -v ~/.cache/huggingface:/root/.cache/huggingface --env "HUGGING_FACE_HUB_TOKEN=YOURTOKEN" -p 2243:8000 --ipc=host --env "TRANSFORMERS_OFFLINE=1" --env "VLLM_CONFIGURE_LOGGING=1" -it public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:0eecb3166365a29db117c2aff6ca441b484b514d
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --port 8000 --tensor-parallel-size 2
mind you my port is mapped to 2243
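A quick smoke test against that container could look like this (assumes the 2243 -> 8000 mapping above; adjust base_url if you map the port differently):

```python
# Smoke test against the container above (assumes the 2243 -> 8000 port mapping
# from the docker command; change base_url if yours differs).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2243/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```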
Many thanks, good sir, I'll try to reproduce that tomorrow.
so the "important note" on the model card is out of date? Is it working with prefixing "\n" ?
It'd be weird given it's an instruct model, not a thinking model...
Is that what's causing the issue with the slow response latency?
Ok, I kinda managed to reproduce the docker env.
First and foremost, it is important to disable chunked prefill (that was the reason it was so slow for me; the flag is --no-enable-chunked-prefill, as in the speculative command further down).
My generation speed at batch size 1 on an RTX Pro 6000 Max-Q, by context length:
0-64: 110.22 tokens/sec
64-128: 109.31 tokens/sec
128-256: 103.62 tokens/sec
256-512: 108.20 tokens/sec
512-1024: 109.25 tokens/sec
1024-2048: 91.77 tokens/sec
At batch size 32 I can push about 1800 t/s of total generation throughput.
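For reference, a quick-and-dirty way to get batch-1 numbers like the above (not a proper benchmark; it assumes the server launched below on port 5000 with --served-model-name gpt-4, and it lumps prefill in with decode, so it slightly understates pure generation speed):

```python
# Rough batch-1 tokens/sec check. Assumes the vLLM server launched below
# (--port 5000, --served-model-name gpt-4); adjust to your own setup.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="EMPTY")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-4",  # whatever was passed to --served-model-name
    messages=[{"role": "user", "content": "Write a short story about a robot learning to paint."}],
    max_tokens=512,
    temperature=0.7,
)
elapsed = time.perf_counter() - start

# Prefill time is included in `elapsed`, so this slightly understates decode speed.
gen_tokens = resp.usage.completion_tokens
print(f"{gen_tokens} tokens in {elapsed:.2f}s -> {gen_tokens / elapsed:.2f} tok/s")
```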
Creating the env:
conda create --name qnext python=3.12
conda activate qnext
uv pip install torch==2.8.0+cu128 torchvision==0.23.0+cu128 torchaudio==2.8.0+cu128 --index-url https://download.pytorch.org/whl/cu128
uv pip install accelerate bitsandbytes cmake ninja
uv pip install -U https://github.com/huggingface/transformers/archive/refs/heads/main.zip
uv pip install datasets peft
uv pip install flash-attn --no-build-isolation # This might be required for vLLM
uv pip install -U --pre --no-deps flashinfer-python
uv pip install nvidia-nvshmem-cu12
uv pip install einops ninja datasets transformers numpy
uv pip install causal-conv1d
uv pip uninstall fla-core flash-linear-attention
uv pip install -U git+https://github.com/fla-org/flash-linear-attention --no-deps
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
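A quick import check before launching can catch a broken wheel early (throwaway sketch; the module names are simply what these packages expose, e.g. flash-linear-attention imports as fla):

```python
# Quick sanity check of the env above. Module names as the packages expose them:
# flashinfer-python -> flashinfer, causal-conv1d -> causal_conv1d,
# flash-linear-attention -> fla.
import torch
import vllm
import flashinfer
import causal_conv1d
import fla

print("torch", torch.__version__, "| cuda", torch.version.cuda,
      "| gpu:", torch.cuda.is_available())
print("vllm", vllm.__version__)
```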
Launching the model:
export NVCC_THREADS=16
export MAX_JOBS=16
export OMP_NUM_THREADS=16
export MODEL_NAME="Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve "$MODEL_NAME" \
--served-model-name gpt-4 \
--port 5000 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--trust_remote_code
With this, the very first request has a huge latency (most likely JIT compilation), but the next ones are much faster.
So far I couldn't get multi-token prediction to work though :/
Note that on the same device, the generation speeds are roughly equivalent to those of Mistral Small.
In theory, with further optimization it should be possible to reach ~180 t/s at batch size 1, and much more with MTP enabled; I hope vLLM support improves in the upcoming weeks.
In the meantime, thanks cpatonn for the model, much appreciated. May I ask if you'd mind sharing your quantization script? When I tried to perform the quant myself, llmcompressor wasn't willing to work with it, and gptqmodel would eat up my 128 GB RAM + 80 GB swap to the point of crashing Python.
Okay, I managed to enable MTP, but the draft acceptance rate is very low (about 2/1000). Maybe there is some issue with the EAGLE head quantization?
EAGLE on vLLM? What are your sglang params, how do you start it?
Actually I'm using it on vLLM directly (so far, for some reason, the flashinfer backend does not start on my device, so I have to use the flash-attn backend).
The command to run with MTP / EAGLE is:
export NVCC_THREADS=16
export MAX_JOBS=16
export OMP_NUM_THREADS=16
export MODEL_NAME="Qwen3-Next-80B-A3B-Instruct-AWQ-4bit"
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve "$MODEL_NAME" \
--served-model-name gpt-4 \
--port 5000 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--trust_remote_code \
--tokenizer-mode auto \
--no-enable-chunked-prefill \
--speculative-config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 7}'
works for num_speculative_tokens = 3 or 7
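If anyone wants to check the acceptance rate themselves, one way (rough sketch) is to scrape the Prometheus /metrics endpoint the server exposes and look for the spec-decode counters; the exact metric names vary between vLLM versions, hence the substring match:

```python
# Eyeball the draft acceptance rate by grepping vLLM's Prometheus /metrics endpoint
# for spec-decode counters (exact metric names differ between vLLM versions).
import requests

metrics = requests.get("http://localhost:5000/metrics").text
for line in metrics.splitlines():
    if "spec_decode" in line and not line.startswith("#"):
        print(line)
```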
I tried the fp8 version of the model as well; with it, the acceptance rate is about 50%, but it runs much slower than without spec decode, and I don't know why yet.
I've now moved on to running the vLLM MoE kernel tuning script; I'll see if it improves the base speed on both the AWQ and fp8 variants.
Agreed, something is weird with the spec heads. fp8 is possible to run for me, but Ampere has no optimizations for that, sadly.