Experiencing excessive response latency.
#4 opened by JunHowie
Hi, I’m running into some problems. Here is how I launch the server:

```bash
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
vllm serve cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.8
```
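The server itself comes up and answers a models query. For reference, a minimal reachability check (a sketch assuming the default port 8000; adjust if you pass `--port`):

```python
# Minimal reachability check for the OpenAI-compatible server.
# Assumes the default port 8000; adjust if --port was passed to vllm serve.
import requests

resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should list the served model
```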
I’m also getting some strange warnings, similar to this issue:
https://github.com/vllm-project/vllm/issues/24865
```
(APIServer pid=61613) INFO: 127.0.0.1:33498 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=61613) INFO 09-15 17:36:15 [chat_utils.py:538] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=61613) INFO: 127.0.0.1:33502 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=62261) /root/data/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/fla/ops/utils.py:105: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (19) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=62261) return fn(*contiguous_args, **contiguous_kwargs)
```
And the response latency is excessive:
[prefill] 14625.2 ms
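This is how I measure it: a minimal sketch that times the first streamed token (time to first token roughly corresponds to prefill), assuming the server is on the default localhost:8000:

```python
# Time to first token against the OpenAI-compatible chat endpoint.
# Assumes the server listens on the default http://localhost:8000.
import time

import requests

url = "http://localhost:8000/v1/chat/completions"
payload = {
    "model": "cpatonn/Qwen3-Next-80B-A3B-Instruct-AWQ-4bit",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
    "stream": True,  # streaming lets us separate prefill from decode
}

start = time.perf_counter()
with requests.post(url, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line:  # the first SSE chunk arrives once prefill completes
            ttft_ms = (time.perf_counter() - start) * 1000
            print(f"time to first token: {ttft_ms:.1f} ms")
            break
```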