MTP Accepted throughput always at 0.00 tokens/s

#5
by bpozdena - opened

Thanks @cpatonn for your work. Your quant has been my only way to run this model on 4x3090.

The only major issue I have is that I have not been able to make MTP work. Without MTP I get around 75 tokens/s, or about 50 tokens/s with full context, which is pretty good.

Still, I wanted to try speculative decoding, but for some reason it is much slower with MTP than without it. The number of accepted tokens is always 0 (Accepted throughput: 0.00 tokens/s), even when my prompt just asks the model to spellcheck a single paragraph of text with temperature=0.0, which is usually an ideal scenario for speculative decoding.

Any ideas on how to make it work?

vLLM command:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
  --tensor-parallel-size 4 \
  --max-model-len 256K \
  --gpu-memory-utilization 0.8 \
  --host 0.0.0.0 --port 8001 \
  --dtype float16 \
  --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
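For reference, my test request is roughly the following (the paragraph to spellcheck is just a placeholder here; the endpoint and model name match the serve command above):

from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit",
    messages=[{"role": "user", "content": "Spellcheck the following paragraph: <placeholder text>"}],
    temperature=0.0,  # greedy decoding, normally a best case for speculative decoding
)
print(response.choices[0].message.content)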

vLLM stats:

INFO 09-15 21:00:41 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 47.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
INFO 09-15 21:00:41 [metrics.py:96] SpecDecoding metrics: Mean acceptance length: 1.00, Accepted throughput: 0.00 tokens/s, Drafted throughput: 94.39 tokens/s, Accepted: 0 tokens, Drafted: 944 tokens, Per-position acceptance rate: 0.000, 0.000, Avg Draft acceptance rate: 0.0%
INFO 09-15 21:00:51 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 1.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
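The logged counters are at least internally consistent with zero acceptance. A quick sanity check (assuming the acceptance rate is simply accepted/drafted and that a mean acceptance length of 1.00 means only the target model's own token is emitted per verification step):

accepted_tokens = 0          # "Accepted: 0 tokens" from the log
drafted_tokens = 944         # "Drafted: 944 tokens" from the log
num_speculative_tokens = 2   # from the --speculative-config above

verification_steps = drafted_tokens / num_speculative_tokens
print(f"draft acceptance rate: {accepted_tokens / drafted_tokens:.1%}")           # 0.0%
print(f"mean acceptance length: {1 + accepted_tokens / verification_steps:.2f}")  # 1.00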

I have the same build, 4x RTX 3090, and I have problems too: the model's answers are hallucinating. I get very emotional answers with different themes mixed together, even though I use a standard assistant prompt. Do you have the same problem? I will try speculative decoding soon too.

FYI, I have both issues as well: speculative decoding with 0 accepted tokens every time, and the model producing very verbose and odd output for almost any input with the recommended generation parameters. 2x A6000 at TP=2. The odd output happens regardless of MTP.

Owner

Thank you for using my quant. My apologies for the late reply. I expect there will be an update to both vLLM and this model in the next few days to improve model accuracy and fix the MTP problems.

The quantized weights do not contain the MTP weights. If you look at the original model.safetensors.index.json file, you can see that there are both base-model tensors and MTP tensors belonging to the MTP module. That part is missing from this repo, and vLLM does not throw any exception about it.
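For example, you can check the tensor names in the index with something like this (a quick sketch; it assumes the MTP tensors are the entries whose names contain "mtp"):

import json
from huggingface_hub import hf_hub_download

# Fetch the safetensors index of the original (unquantized) model.
index_path = hf_hub_download("Qwen/Qwen3-Next-80B-A3B-Instruct", "model.safetensors.index.json")
with open(index_path) as f:
    weight_map = json.load(f)["weight_map"]

# Separate base-model tensor names from MTP tensor names.
mtp_keys = [name for name in weight_map if "mtp" in name]
print(f"total tensors: {len(weight_map)}, mtp tensors: {len(mtp_keys)}")
# Running the same check against the quantized repo's index shows no "mtp"
# entries, i.e. the MTP weights are not part of this quant.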
