ValueError: Detected some but not all shards of model.layers.0.linear_attn.in_proj are quantized. All shards of fused layers to have the same precision.

#1 opened by kq

Running on 8x RTX 3090 throws this error:
(vllm) deaf@rtxserver:~$ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 && vllm serve /home/deaf/Qwen3-Next-80B-A3B-Thinking-FP8 --port 12304 --gpu-memory-utilization 0.78 --dtype float16 --tensor-parallel-size 8 --max-model-len 131072 ...

(APIServer pid=120598) INFO 09-23 08:45:23 [__init__.py:1815] Using max model len 131072
(APIServer pid=120598) INFO 09-23 08:45:23 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=120598) INFO 09-23 08:45:23 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
...
(EngineCore_DP0 pid=120808) INFO 09-23 08:45:36 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/home/deaf/Qwen3-Next-80B-A3B-Thinking-FP8', ..., quantization=fp8, ...
...
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] WorkerProc failed to start.
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] Traceback (most recent call last):
...
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 256, in init
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] self.in_proj = MergedColumnParallelLinear(
...
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 256, in init
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] self.quant_method = quant_config.get_quant_method(self,
...
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/fp8.py", line 171, in get_quant_method
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] if is_layer_skipped(prefix=prefix,
...
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] File "/home/deaf/miniconda3/envs/vllm/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/utils/quant_utils.py", line 282, in is_layer_skipped
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] raise ValueError(
(Worker_TP0 pid=120886) ERROR 09-23 08:45:53 [multiproc_executor.py:585] ValueError: Detected some but not all shards of model.layers.0.linear_attn.in_proj are quantized. All shards of fused layers to have the same precision.

... (Worker_TP1 through Worker_TP7 all report exactly the same error)
...
(EngineCore_DP0 pid=120808) ERROR 09-23 08:45:56 [core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=120808) ERROR 09-23 08:45:56 [core.py:718] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
...
(APIServer pid=120598) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
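
For what it's worth, the traceback ends in `is_layer_skipped` (vllm/model_executor/layers/quantization/utils/quant_utils.py), which suggests the FP8 checkpoint marks some but not all shards of the fused `linear_attn.in_proj` layer as unquantized, so vLLM cannot assign a single quant method to the merged layer. Below is a minimal sketch of the kind of check implied by the traceback; the shard names, `FUSED_SHARDS` mapping, and `ignored_layers` argument are illustrative assumptions, not verbatim vLLM internals.

```python
# Sketch of the fused-shard precision check implied by the traceback.
# FUSED_SHARDS and the shard names below are hypothetical, for illustration only.

# For a fused (merged) layer such as in_proj, the checkpoint's
# quantization_config may list individual shard projections; every shard
# must agree on whether it is quantized or skipped.
FUSED_SHARDS = {
    "in_proj": ["in_proj_qkvz", "in_proj_ba"],  # hypothetical shard names
}

def is_layer_skipped(prefix: str, ignored_layers: list[str]) -> bool:
    proj_name = prefix.split(".")[-1]
    if proj_name in FUSED_SHARDS:
        shard_prefixes = [
            prefix.replace(proj_name, shard) for shard in FUSED_SHARDS[proj_name]
        ]
        shard_skipped = [p in ignored_layers for p in shard_prefixes]
        if any(shard_skipped) and not all(shard_skipped):
            # Mixed status across shards of one fused layer is exactly
            # the condition reported in the log above.
            raise ValueError(
                f"Detected some but not all shards of {prefix} are quantized. "
                "All shards of fused layers to have the same precision."
            )
        return all(shard_skipped)
    return prefix in ignored_layers
```

If this reading is right, the inconsistency lives in the checkpoint's quantization_config ignore list rather than in the serve flags, which would also explain why all eight tensor-parallel workers fail identically.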
