Experiencing excessive response latency.
#4 opened by JunHowie
Hi, I’m running into some problems. Here is how I launch the server:

```bash
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.8
```
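The server itself comes up and answers a models query. For reference, a minimal reachability check (a sketch assuming the default port 8000; adjust if you pass `--port`):

```python
# Minimal reachability check for the OpenAI-compatible server.
# Assumes the default port 8000; adjust if --port was passed to vllm serve.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should list the served model
```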
I’m also getting some strange warnings, similar to this issue:
https://github.com/vllm-project/vllm/issues/24865
```
(APIServer pid=61613) INFO: 127.0.0.1:33498 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=61613) INFO 09-15 17:36:15 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=61613) INFO: 127.0.0.1:33502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=62261) /root/data/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (19) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=62261) return fn(*contiguous_args, **contiguous_kwargs)
```
And the response latency is excessive:
[prefill] 14625.2 ms
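This is how I measure it: a minimal sketch that times the first streamed token (time to first token roughly corresponds to prefill), assuming the server is on the default localhost:8000:

```python
# Time to first token against the OpenAI-compatible chat endpoint.
# Assumes the server listens on the default http://localhost:8000.
import time

import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
    "stream": True,  # streaming lets us separate prefill from decode
}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # the first SSE chunk arrives once prefill completes
            ttft_ms = (time.perf_counter() - start) * 1000
            print(f"time to first token: {ttft_ms:.1f} ms")
            break
```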