Description
System Info
version: verl-8.3.rc1-910b-ubuntu22.04-py3.11-latest
npu-smi 25.2.0 Version: 25.2.0
Commit id b53f0f1
When I reset veRL to f31df34 (before #4280), everything works normally again.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
https://github.com/volcengine/verl/blob/main/tests/special_npu/run_qwen2_5_05b_grpo.sh
Then I got:
(vLLMHttpServer pid=2790382) [rank0]:[E126 03:28:29.988683190 compiler_depend.ts:444] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:88 NPU function error: call aclnnInplaceCopy failed, error code is 107000
(vLLMHttpServer pid=2790382) [ERROR] 2026-01-26-03:28:29 (PID:2804108, Device:0, RankID:-1) ERR00100 PTA call acl api failed
(vLLMHttpServer pid=2790382) [Error]: Parameter verification failed.
(vLLMHttpServer pid=2790382) Check whether the input parameters of the interface are correct.
(vLLMHttpServer pid=2790382) [PID: 2804108] 2026-01-26-03:28:29.625.369 Invalid_Argument(EE1001): The argument is invalid.Reason: Memory async failed, src loc type=0, dst loc type=1, kind=3 is invalid!
(vLLMHttpServer pid=2790382) Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
(vLLMHttpServer pid=2790382) TraceBack (most recent call last):
(vLLMHttpServer pid=2790382) Memory async check kind and loc failed, retCode=0x7110001, copyKind=3, srcLoc=0, dstLoc=1[FUNC:MemcpyAsyncCheckLocation][FILE:api_error.cc][LINE:1938]
(vLLMHttpServer pid=2790382) MemcpyAsync check src or dst location failed, stream_id=45, checkKind=1, copyKind=3[FUNC:MemcpyAsync][FILE:api_error.cc][LINE:1586]
(vLLMHttpServer pid=2790382) The argument is invalid.Reason: rtMemcpyAsync execute failed, reason=[invalid value]
(vLLMHttpServer pid=2790382) Call rtMemcpyAsync failed when do CopyNpuToNpuOp, ret code: 107000
(vLLMHttpServer pid=2790382) launch failed for CopyToNpu, errno:361001.
(vLLMHttpServer pid=2790382)
(vLLMHttpServer pid=2790382) Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:88 (most recent call first):
(vLLMHttpServer pid=2790382) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xd4 (0xffff9cf53ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
(vLLMHttpServer pid=2790382) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xe4 (0xffff9cef3e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
(vLLMHttpServer pid=2790382) frame #2: + 0x85b240 (0xffff8699b240 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #3: + 0x26e6c10 (0xffff88826c10 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #4: + 0x961a94 (0xffff86aa1a94 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #5: + 0x9644c0 (0xffff86aa44c0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #6: + 0x96072c (0xffff86aa072c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #7: + 0xd29cc (0xffffaa3729cc in /lib/aarch64-linux-gnu/libstdc++.so.6)
(vLLMHttpServer pid=2790382) frame #8: + 0x80398 (0xffffac770398 in /lib/aarch64-linux-gnu/libc.so.6)
(vLLMHttpServer pid=2790382) frame #9: + 0xe9e9c (0xffffac7d9e9c in /lib/aarch64-linux-gnu/libc.so.6)
(vLLMHttpServer pid=2790382)
(vLLMHttpServer pid=2790382) [rank1]:[E126 03:28:29.003641590 compiler_depend.ts:444] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:88 NPU function error: call aclnnInplaceCopy failed, error code is 107000
(vLLMHttpServer pid=2790382) [ERROR] 2026-01-26-03:28:29 (PID:2804111, Device:1, RankID:-1) ERR00100 PTA call acl api failed
(vLLMHttpServer pid=2790382) [Error]: Parameter verification failed.
(vLLMHttpServer pid=2790382) Check whether the input parameters of the interface are correct.
(vLLMHttpServer pid=2790382) [PID: 2804111] 2026-01-26-03:28:29.627.219 Invalid_Argument(EE1001): The argument is invalid.Reason: Memory async failed, src loc type=0, dst loc type=1, kind=3 is invalid!
(vLLMHttpServer pid=2790382) Solution: 1.Check the input parameter range of the function. 2.Check the function invocation relationship.
(vLLMHttpServer pid=2790382) TraceBack (most recent call last):
(vLLMHttpServer pid=2790382) Memory async check kind and loc failed, retCode=0x7110001, copyKind=3, srcLoc=0, dstLoc=1[FUNC:MemcpyAsyncCheckLocation][FILE:api_error.cc][LINE:1938]
(vLLMHttpServer pid=2790382) MemcpyAsync check src or dst location failed, stream_id=45, checkKind=1, copyKind=3[FUNC:MemcpyAsync][FILE:api_error.cc][LINE:1586]
(vLLMHttpServer pid=2790382) The argument is invalid.Reason: rtMemcpyAsync execute failed, reason=[invalid value]
(vLLMHttpServer pid=2790382) Call rtMemcpyAsync failed when do CopyNpuToNpuOp, ret code: 107000
(vLLMHttpServer pid=2790382) launch failed for CopyToNpu, errno:361001.
(vLLMHttpServer pid=2790382)
(vLLMHttpServer pid=2790382) Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:88 (most recent call first):
(vLLMHttpServer pid=2790382) frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0xd4 (0xffff86c93ea4 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
(vLLMHttpServer pid=2790382) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0xe4 (0xffff86c33e44 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch/lib/libc10.so)
(vLLMHttpServer pid=2790382) frame #2: + 0x85b240 (0xffff706db240 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #3: + 0x26e6c10 (0xffff72566c10 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #4: + 0x961a94 (0xffff707e1a94 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #5: + 0x9644c0 (0xffff707e44c0 in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #6: + 0x96072c (0xffff707e072c in /usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
(vLLMHttpServer pid=2790382) frame #7: + 0xd29cc (0xffff940b29cc in /lib/aarch64-linux-gnu/libstdc++.so.6)
(vLLMHttpServer pid=2790382) frame #8: + 0x80398 (0xffff964b0398 in /lib/aarch64-linux-gnu/libc.so.6)
(vLLMHttpServer pid=2790382) frame #9: + 0xe9e9c (0xffff96519e9c in /lib/aarch64-linux-gnu/libc.so.6)
(vLLMHttpServer pid=2790382)
(vLLMHttpServer pid=2790435)
(vLLMHttpServer pid=2790435)
(vLLMHttpServer pid=2790435)
(vLLMHttpServer pid=2790435)
(vLLMHttpServer pid=2790495)
(vLLMHttpServer pid=2790495)
(vLLMHttpServer pid=2790495)
(vLLMHttpServer pid=2790495)
(vLLMHttpServer pid=2790540)
(vLLMHttpServer pid=2790540)
(vLLMHttpServer pid=2790540)
(vLLMHttpServer pid=2790540)
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] WorkerProc hit an exception.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] Traceback (most recent call last):
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] File "/vllm/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] output = func(*args, **kwargs)
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] File "/home/h00848544/verl-main/verl/workers/rollout/vllm_rollout/utils.py", line 188, in update_weights_from_ipc
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] tensor = buffer[offset : offset + size].view(dtype=dtype).view(shape).clone()
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnInplaceCopy.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] [ERROR] 2026-01-26-03:28:29 (PID:2805207, Device:1, RankID:-1) ERR00100 PTA call acl api failed.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671]
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] Traceback (most recent call last):
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] File "/vllm/vllm/v1/executor/multiproc_executor.py", line 666, in worker_busy_loop
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] output = func(*args, **kwargs)
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] File "/home/h00848544/verl-main/verl/workers/rollout/vllm_rollout/utils.py", line 188, in update_weights_from_ipc
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] tensor = buffer[offset : offset + size].view(dtype=dtype).view(shape).clone()
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnInplaceCopy.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671] [ERROR] 2026-01-26-03:28:29 (PID:2805207, Device:1, RankID:-1) ERR00100 PTA call acl api failed.
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671]
(vLLMHttpServer pid=2790495) (Worker_TP1 pid=2805207) ERROR 01-26 03:28:29 [multiproc_executor.py:671]
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] get_torch_device().synchronize()
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/npu/utils.py", line 72, in synchronize
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] return torch_npu._C._npu_synchronize()
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] get_torch_device().synchronize()
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] File "/usr/local/python3.11.13/lib/python3.11/site-packages/torch_npu/npu/utils.py", line 72, in synchronize
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] return torch_npu._C._npu_synchronize()
(vLLMHttpServer pid=2790435) (Worker_TP1 pid=2804705) ERROR 01-26 03:28:30 [multiproc_executor.py:671] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] Invocation of collective_rpc method failed
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] File "/vllm/vllm/v1/engine/core.py", line 777, in _handle_client_request
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] result = method(*self._convert_msgspec_args(method, args))
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] File "/vllm/vllm/v1/engine/core.py", line 416, in collective_rpc
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] return self.model_executor.collective_rpc(method, timeout, args,
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] File "/vllm/vllm/v1/executor/multiproc_executor.py", line 264, in collective_rpc
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] result = get_response(w, dequeue_timeout,
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] File "/vllm/vllm/v1/executor/multiproc_executor.py", line 248, in get_response
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] raise RuntimeError(
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] RuntimeError: Worker failed with error 'The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnInplaceCopy.
(vLLMHttpServer pid=2790382) (EngineCore_DP0 pid=2798285) ERROR 01-26 03:28:30 [core.py:780] ', please check the stack trace above for the root cause
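For readers unfamiliar with the failing line, the traceback above points at verl/workers/rollout/vllm_rollout/utils.py, line 188, in update_weights_from_ipc. Below is a minimal CPU-only sketch of what that expression does: it slices a byte range out of a flat buffer, reinterprets it as the target dtype, reshapes it, and clones it. The helper name unpack_tensor, the buffer size, and the offset are my own assumptions for illustration, not verl code. It does not reproduce the NPU failure, and since the log notes the stack trace may be inaccurate in asynchronous mode, it only shows the operation the traceback names.

```python
import torch


def unpack_tensor(buffer, offset, size, dtype, shape):
    """Hypothetical helper mirroring the expression at utils.py line 188."""
    # Slice `size` bytes, reinterpret them as `dtype`, reshape,
    # then clone so the result no longer aliases the staging buffer.
    return buffer[offset : offset + size].view(dtype=dtype).view(shape).clone()


# Assumed setup: a flat uint8 staging buffer holding one packed float32 tensor.
src = torch.arange(6, dtype=torch.float32).reshape(2, 3)
raw = src.reshape(-1).view(torch.uint8)  # 6 elements * 4 bytes = 24 raw bytes

buffer = torch.zeros(64, dtype=torch.uint8)
offset = 8  # byte offset; must be a multiple of the element size (4 here)
buffer[offset : offset + raw.numel()] = raw

out = unpack_tensor(buffer, offset, raw.numel(), torch.float32, (2, 3))

# Overwriting the buffer afterwards does not affect `out`, because of clone().
buffer.zero_()
```

On CPU this runs without error. In the failing run the same expression appears to be where the device copy (aclnnInplaceCopy) is issued, so the location of the memory backing the buffer is probably the part worth checking.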
Expected behavior
.