Error on 4 x L40S
#4 opened by traphix
I am using the latest vLLM on 4 x L40S GPUs.
Launch command:
python3 -m vllm.entrypoints.openai.api_server \
--served-model-name qwen3-next-80b-a3b-instruct \
--model /data/model-cache/Qwen3-Next-80B-A3B-Instruct-FP8 \
--tensor-parallel-size 4 \
--enable-expert-parallel
Error logs (the failure surfaces at "raise e from None"):
(EngineCore_DP0 pid=279) EngineCore failed to start.
(EngineCore_DP0 pid=279) Traceback (most recent call last):
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=279)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=279)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=279)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=279)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=279)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=279)     self._init_executor()
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=279)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=279)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=279)     raise e from None
(EngineCore_DP0 pid=279) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
(EngineCore_DP0 pid=279) Process EngineCore_DP0:
(EngineCore_DP0 pid=279) Traceback (most recent call last):
(EngineCore_DP0 pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=279)     self.run()
(EngineCore_DP0 pid=279)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=279)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 712, in run_engine_core
(EngineCore_DP0 pid=279)     raise e
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 699, in run_engine_core
(EngineCore_DP0 pid=279)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=279)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 498, in __init__
(EngineCore_DP0 pid=279)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 83, in __init__
(EngineCore_DP0 pid=279)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=279)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 54, in __init__
(EngineCore_DP0 pid=279)     self._init_executor()
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 106, in _init_executor
(EngineCore_DP0 pid=279)     self.workers = WorkerProc.wait_for_ready(unready_workers)
(EngineCore_DP0 pid=279)                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=279)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 509, in wait_for_ready
(EngineCore_DP0 pid=279)     raise e from None
(EngineCore_DP0 pid=279) Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
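Both tracebacks above come from the `EngineCore_DP0` process and end in the generic "See stack trace for root cause" message, so the worker's own error has to be found elsewhere in the full console output. One way to locate it, sketched here under the assumption that every process tags its lines with a `(name pid=N)` prefix like the `(EngineCore_DP0 pid=279)` one shown above, is to group the log by that tag and look at the last line each process printed. `group_by_process` and `last_lines` are hypothetical helpers, not part of vLLM, and the exact tag used for worker processes may differ between vLLM versions.

```python
import re
import sys
from collections import defaultdict

# Assumed prefix format, taken from lines such as
# "(EngineCore_DP0 pid=279) EngineCore failed to start."
PREFIX = re.compile(r"^\((?P<proc>[^)]*?) pid=(?P<pid>\d+)\) ?(?P<msg>.*)$")


def group_by_process(lines):
    """Map each "(name pid=N)" tag to the messages logged under it."""
    groups = defaultdict(list)
    for line in lines:
        m = PREFIX.match(line)
        if m:
            groups[f"{m['proc']} pid={m['pid']}"].append(m["msg"])
        else:
            groups["<untagged>"].append(line)
    return dict(groups)


def last_lines(groups):
    """Return the final non-empty message per tag (often the exception line)."""
    return {
        tag: next((msg for msg in reversed(msgs) if msg.strip()), "")
        for tag, msgs in groups.items()
    }


if __name__ == "__main__":
    # Usage (hypothetical file name): python group_log.py vllm_console.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
            groups = group_by_process(f.read().splitlines())
        for tag, last in last_lines(groups).items():
            print(f"{tag}: {last}")
```

A tag whose last line is a concrete exception (rather than the generic `WorkerProc initialization failed` message) is the most likely place to find the root cause.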
I checked the source code of /vllm/v1/executor/multiproc_executor.py (line 509, in wait_for_ready) and found that the exception is raised from an EOFError handler:
except EOFError:
    e.__suppress_context__ = True
    raise e from None
What does this mean? Is a tensor file damaged?
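The handler quoted above is reached when reading a worker's status message hits end-of-file. Assuming, as the `EOFError` suggests, that the engine reads each worker's readiness over a `multiprocessing` pipe, a stdlib-only sketch shows what that means: `Connection.recv()` raises `EOFError` when nothing is left to receive and the sending end has been closed, which is what happens if the worker process exits before reporting that it is ready. On its own, the error therefore says that a worker died during startup, not why. `worker_that_crashes` and `wait_for_ready` below are illustrative stand-ins, not vLLM's code.

```python
import multiprocessing as mp


def worker_that_crashes(conn):
    # Simulates a worker that dies during initialization (for example an
    # unhandled error while loading weights) before sending "ready".
    conn.close()
    raise SystemExit(1)


def wait_for_ready(parent_conn):
    """Return the worker's ready message, or report that the pipe closed."""
    try:
        return parent_conn.recv()
    except EOFError:
        # recv() raises EOFError when nothing is left to read and every
        # sending end of the pipe has been closed, i.e. the worker is gone.
        return "worker exited before sending a ready message"


if __name__ == "__main__":
    # duplex=False: the first connection only receives, the second only sends.
    parent_conn, child_conn = mp.Pipe(duplex=False)
    proc = mp.Process(target=worker_that_crashes, args=(child_conn,))
    proc.start()
    child_conn.close()  # the parent must drop its copy, or recv() never sees EOF
    print(wait_for_ready(parent_conn))
    proc.join()
    print("worker exit code:", proc.exitcode)
```

In a real failure, the reason the worker exited is printed by the worker process itself, so its traceback earlier in the console output is what explains the crash.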
Hi, the latest stable version of vLLM does not yet include the fix for this issue. You’ll need to use the nightly build for now. I encountered a similar problem previously, and switching to the nightly version resolved it.
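To confirm whether the environment is actually running a stable release or a nightly build before and after switching, one option is to read the installed package metadata. A minimal sketch with `importlib.metadata`; it assumes nightly wheels carry a PEP 440 development segment (`.devN`) in their version string, which is an assumption about how they are versioned rather than something stated in this thread. `installed_version` and `is_dev_build` are hypothetical helper names.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist="vllm"):
    """Return the installed version string of `dist`, or None if it is absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None


def is_dev_build(ver):
    # Assumption: nightly builds use a PEP 440 ".devN" segment; releases do not.
    return ".dev" in ver


if __name__ == "__main__":
    v = installed_version()
    if v is None:
        print("vllm is not installed in this environment")
    elif is_dev_build(v):
        print(v, "(dev/nightly build)")
    else:
        print(v, "(no .dev segment, likely a release build)")
```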