AttributeError: 'FusedMoE' object has no attribute 'moe' with latest vllm

#1
by DrRos - opened

With the latest nightly build of vLLM (0.10.2rc3.dev44+g98229db24) I'm getting this error:

vllm serve /mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/ --host 0.0.0.0 --port 30000 --reasoning-parser deepseek_r1 --dtype auto --tensor-parallel-size 2 --served-model-name Qwen3-30B --max-model-len 4096 --gpu-memory-utilization 0.95
[W913 08:01:43.609356969 OperatorEntry.cpp:218] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: AutocastCPU
  previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
       new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 09-13 08:01:45 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=516906) INFO 09-13 08:01:47 [api_server.py:1896] vLLM API server version 0.10.2rc3.dev44+g98229db24
(APIServer pid=516906) INFO 09-13 08:01:47 [utils.py:328] non-default args: {'model_tag': '/mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/', 'host': '0.0.0.0', 'port': 30000, 'model': '/mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/', 'max_model_len': 4096, 'served_model_name': ['Qwen3-30B'], 'reasoning_parser': 'deepseek_r1', 'tensor_parallel_size': 2, 'gpu_memory_utilization': 0.95}
(APIServer pid=516906) INFO 09-13 08:01:55 [__init__.py:750] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=516906) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=516906) INFO 09-13 08:01:55 [__init__.py:1831] Using max model len 4096
(APIServer pid=516906) WARNING 09-13 08:01:55 [_logger.py:72] auto-round quantization is not fully optimized yet. The speed can be slower than non-quantized models.
(APIServer pid=516906) INFO 09-13 08:01:55 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=516906) INFO 09-13 08:01:55 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=516906) INFO 09-13 08:01:55 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=516906) INFO 09-13 08:01:56 [config.py:390] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=516906) INFO 09-13 08:01:56 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
[W913 08:02:01.866612326 OperatorEntry.cpp:218] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: AutocastCPU
  previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
       new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 09-13 08:02:03 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=517012) INFO 09-13 08:02:05 [core.py:655] Waiting for init message from front-end.
(EngineCore_DP0 pid=517012) INFO 09-13 08:02:05 [core.py:76] Initializing a V1 LLM engine (v0.10.2rc3.dev44+g98229db24) with config: model='/mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/', speculative_config=None, tokenizer='/mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=auto-round, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='deepseek_r1'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen3-30B, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=517012) WARNING 09-13 08:02:05 [_logger.py:72] Reducing Torch parallelism from 20 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=517012) INFO 09-13 08:02:05 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 16777216, 10, 'psm_f70f5eb3'), local_subscribe_addr='ipc:///tmp/7c4e62e1-28b8-483d-ad32-9a004499c204', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W913 08:02:09.257761578 OperatorEntry.cpp:218] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: AutocastCPU
  previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
       new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
[W913 08:02:09.257761983 OperatorEntry.cpp:218] Warning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: aten::_addmm_activation(Tensor self, Tensor mat1, Tensor mat2, *, Scalar beta=1, Scalar alpha=1, bool use_gelu=False) -> Tensor
    registered at /pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
  dispatch key: AutocastCPU
  previous kernel: registered at /pytorch/aten/src/ATen/autocast_mode.cpp:327
       new kernel: registered at /opt/workspace/ipex-cpu-dev/csrc/cpu/autocast/autocast_mode.cpp:112 (function operator())
INFO 09-13 08:02:11 [__init__.py:216] Automatically detected platform cuda.
INFO 09-13 08:02:11 [__init__.py:216] Automatically detected platform cuda.
W0913 08:02:14.003000 517060 intel_extension_for_pytorch/utils/_logger.py:72] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0913 08:02:14.003000 517060 intel_extension_for_pytorch/utils/_logger.py:72] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
W0913 08:02:14.079000 517059 intel_extension_for_pytorch/utils/_logger.py:72] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation. 
W0913 08:02:14.079000 517059 intel_extension_for_pytorch/utils/_logger.py:72] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
INFO 09-13 08:02:14 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_0a820e50'), local_subscribe_addr='ipc:///tmp/b53bed7b-cf9f-48c9-b508-aef3cff17aaa', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 09-13 08:02:14 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_ec1fea1c'), local_subscribe_addr='ipc:///tmp/31edd46f-aad4-4eec-a6ab-fb8442d8b7ef', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W913 08:02:15.844199395 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W913 08:02:15.854394665 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 09-13 08:02:15 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-13 08:02:15 [__init__.py:1439] Found nccl from library libnccl.so.2
INFO 09-13 08:02:15 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-13 08:02:15 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 09-13 08:02:15 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 09-13 08:02:15 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
WARNING 09-13 08:02:15 [_logger.py:72] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 09-13 08:02:15 [_logger.py:72] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 09-13 08:02:15 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_fe5349b0'), local_subscribe_addr='ipc:///tmp/cc31fe3f-d43e-4164-92e6-ed951fb58045', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 09-13 08:02:15 [parallel_state.py:1165] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 09-13 08:02:15 [parallel_state.py:1165] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 09-13 08:02:15 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
INFO 09-13 08:02:15 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(Worker_TP1 pid=517060) INFO 09-13 08:02:15 [gpu_model_runner.py:2340] Starting to load model /mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/...
(Worker_TP0 pid=517059) INFO 09-13 08:02:15 [gpu_model_runner.py:2340] Starting to load model /mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/...
(Worker_TP1 pid=517060) INFO 09-13 08:02:16 [gpu_model_runner.py:2372] Loading model from scratch...
(Worker_TP0 pid=517059) INFO 09-13 08:02:16 [gpu_model_runner.py:2372] Loading model from scratch...
(Worker_TP0 pid=517059) INFO 09-13 08:02:16 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=517059) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP1 pid=517060) INFO 09-13 08:02:16 [gptq_marlin.py:269] Using BitBLASLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=517059) INFO 09-13 08:02:16 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP1 pid=517060) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker_TP1 pid=517060) INFO 09-13 08:02:16 [gptq_marlin.py:269] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600] WorkerProc failed to start.
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600] Traceback (most recent call last):
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 574, in worker_main
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     worker = WorkerProc(*args, **kwargs)
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 440, in __init__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     self.worker.load_model()
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2373, in load_model
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     self.model = model_loader.load_model(
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/base_loader.py", line 45, in load_model
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     model = initialize_model(vllm_config=vllm_config,
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 64, in initialize_model
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     return model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1079, in __init__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     self.model = Qwen3NextModel(vllm_config=vllm_config,
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 199, in __init__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     old_init(self, vllm_config=vllm_config, prefix=prefix, **kwargs)
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 915, in __init__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     self.start_layer, self.end_layer, self.layers = make_layers(
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]                                                     ^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 643, in make_layers
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 904, in get_layer
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     return Qwen3NextDecoderLayer(
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]            ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 782, in __init__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     self.mlp = Qwen3NextSparseMoeBlock(
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]                ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 115, in __init__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     self.experts = FusedMoE(num_experts=self.n_routed_experts,
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/layer.py", line 909, in __init__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     else quant_config.get_quant_method(self, prefix))
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 386, in get_quant_method
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     return self.apply_gptq_quant_layer(layer, prefix)
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/auto_round.py", line 330, in apply_gptq_quant_layer
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     return GPTQMarlinMoEMethod(quant_args_marlin, layer.moe)
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]                                                   ^^^^^^^^^
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600]     raise AttributeError(
(Worker_TP0 pid=517059) ERROR 09-13 08:02:17 [multiproc_executor.py:600] AttributeError: 'FusedMoE' object has no attribute 'moe'
(Worker_TP0 pid=517059) INFO 09-13 08:02:17 [multiproc_executor.py:561] Parent process exited, terminating worker
(Worker_TP1 pid=517060) INFO 09-13 08:02:17 [multiproc_executor.py:561] Parent process exited, terminating worker
[rank0]:[W913 08:02:18.847539388 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719] EngineCore failed to start.
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719] Traceback (most recent call last):
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 710, in run_engine_core
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 509, in __init__
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]     self._init_executor()
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 107, in _init_executor
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 512, in wait_for_ready
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719]     raise e from None
(EngineCore_DP0 pid=517012) ERROR 09-13 08:02:19 [core.py:719] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=517012) Process EngineCore_DP0:
(EngineCore_DP0 pid=517012) Traceback (most recent call last):
(EngineCore_DP0 pid=517012)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=517012)     self.run()
(EngineCore_DP0 pid=517012)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=517012)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=517012)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 723, in run_engine_core
(EngineCore_DP0 pid=517012)     raise e
(EngineCore_DP0 pid=517012)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 710, in run_engine_core
(EngineCore_DP0 pid=517012)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=517012)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=517012)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 509, in __init__
(EngineCore_DP0 pid=517012)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=517012)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 82, in __init__
(EngineCore_DP0 pid=517012)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=517012)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=517012)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=517012)     self._init_executor()
(EngineCore_DP0 pid=517012)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 107, in _init_executor
(EngineCore_DP0 pid=517012)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=517012)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=517012)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 512, in wait_for_ready
(EngineCore_DP0 pid=517012)     raise e from None
(EngineCore_DP0 pid=517012) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(APIServer pid=516906) Traceback (most recent call last):
(APIServer pid=516906)   File "/home/drros/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=516906)     sys.exit(main())
(APIServer pid=516906)              ^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 54, in main
(APIServer pid=516906)     args.dispatch_function(args)
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 50, in cmd
(APIServer pid=516906)     uvloop.run(run_server(args))
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
(APIServer pid=516906)     return __asyncio.run(
(APIServer pid=516906)            ^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=516906)     return runner.run(main)
(APIServer pid=516906)            ^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=516906)     return self._loop.run_until_complete(task)
(APIServer pid=516906)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=516906)     return await main
(APIServer pid=516906)            ^^^^^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1941, in run_server
(APIServer pid=516906)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1961, in run_server_worker
(APIServer pid=516906)     async with build_async_engine_client(
(APIServer pid=516906)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=516906)     return await anext(self.gen)
(APIServer pid=516906)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 179, in build_async_engine_client
(APIServer pid=516906)     async with build_async_engine_client_from_engine_args(
(APIServer pid=516906)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=516906)     return await anext(self.gen)
(APIServer pid=516906)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 221, in build_async_engine_client_from_engine_args
(APIServer pid=516906)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=516906)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 1595, in inner
(APIServer pid=516906)     return fn(*args, **kwargs)
(APIServer pid=516906)            ^^^^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 209, in from_vllm_config
(APIServer pid=516906)     return cls(
(APIServer pid=516906)            ^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 136, in __init__
(APIServer pid=516906)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=516906)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 102, in make_async_mp_client
(APIServer pid=516906)     return AsyncMPClient(*client_args)
(APIServer pid=516906)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 769, in __init__
(APIServer pid=516906)     super().__init__(
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 448, in __init__
(APIServer pid=516906)     with launch_core_engines(vllm_config, executor_class,
(APIServer pid=516906)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=516906)     next(self.gen)
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 729, in launch_core_engines
(APIServer pid=516906)     wait_for_engine_startup(
(APIServer pid=516906)   File "/home/drros/vllm/.venv/lib/python3.12/site-packages/vllm/v1/engine/utils.py", line 782, in wait_for_engine_startup
(APIServer pid=516906)     raise RuntimeError("Engine core initialization failed. "
(APIServer pid=516906) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
/usr/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

That may be caused by https://github.com/vllm-project/vllm/pull/24217. If you're using an Ampere or later GPU (>= sm80), try editing site-packages/vllm/model_executor/layers/quantization/auto_round.py in your Python environment and setting use_marlin = False between lines 328 and 329.
I guess vLLM does not yet support the GPTQ Marlin MoE method for AutoRound, but the GPTQ linear method works well on sm75 and lower.
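For anyone unsure which file to change: here is a minimal, hedged Python sketch that just locates the installed auto_round.py and prints the code around the MoE branch from the traceback, so you can see where a local use_marlin = False override would go. It relies only on the module path shown in the stack trace; the exact line numbers (328-330) may drift between nightly builds.

# Minimal sketch: find the installed auto_round.py and inspect the region
# around the line reported in the traceback (auto_round.py:330).
# Line numbers may differ in your build; adjust the slice as needed.
import vllm.model_executor.layers.quantization.auto_round as auto_round

path = auto_round.__file__
print("Edit this file:", path)

with open(path) as f:
    lines = f.readlines()

# Print roughly lines 321-335 (1-based) to locate the GPTQMarlinMoEMethod branch;
# the suggested workaround is a use_marlin = False override just above it.
print("".join(lines[320:335]))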

This commit also breaks other AutoRound quants for me (Qwen3-Coder-30B-A3B-Instruct-int4-AutoRound, for example); for now I just revert it locally and everything works fine.
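After applying either workaround (the use_marlin = False edit or a local revert of the PR), a quick way to confirm the model loads is the offline Python API rather than the full server. This is only a sanity-check sketch; the model path, tensor-parallel size, and context length mirror the vllm serve command above and should be adjusted to your setup.

# Sanity check after patching: load the model offline and run one short generation.
# Arguments mirror the `vllm serve` invocation above; tweak for your environment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/mnt/nfs-share/LLM/Qwen3-Next-80B-A3B-Instruct-int4-mixed-AutoRound/",
    tensor_parallel_size=2,
    max_model_len=4096,
    gpu_memory_utilization=0.95,
)

out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)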
