Skip to content

Add automatic Yuanrong startup to interface.py#60

Merged
0oshowero0 merged 4 commits intoAscend:mainfrom
dpj135:yr_init
Apr 1, 2026
Merged

Add automatic Yuanrong startup to interface.py#60
0oshowero0 merged 4 commits intoAscend:mainfrom
dpj135:yr_init

Conversation

@dpj135
Copy link
Copy Markdown
Contributor

@dpj135 dpj135 commented Mar 26, 2026

Description

  • Add Yuanrong startup to init()
  • Add Yuanrong shutdown to close()

@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

1 similar comment
@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

dpj135 added 3 commits March 30, 2026 11:12
Signed-off-by: dpj135 <958208521@qq.com>
Signed-off-by: dpj135 <958208521@qq.com>
Signed-off-by: dpj135 <958208521@qq.com>
@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@dpj135 dpj135 marked this pull request as ready for review March 30, 2026 03:22
@0oshowero0 0oshowero0 requested a review from Copilot March 30, 2026 03:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds automatic lifecycle management for the Yuanrong backend by starting required local services during init() and stopping/cleaning them up in close(), aligning Yuanrong with the existing “auto_init” behavior used for MooncakeStore.

Changes:

  • Add Yuanrong auto_init support to start local etcd and a datasystem worker (dscli start) during initialization.
  • Add Yuanrong shutdown logic to terminate etcd, remove its temporary data dir, and stop the worker via dscli stop.
  • Extend default config.yaml with Yuanrong auto_init, etcd_address, and client_name.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
transfer_queue/interface.py Starts/stops Yuanrong dependencies (etcd + dscli) as part of TransferQueue init/close lifecycle.
transfer_queue/config.yaml Adds Yuanrong auto-init configuration knobs and adjusts default Yuanrong config fields.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

# If enable npu transport
enable_yr_npu_transport: false
# Client name (optional, default is YuanrongStorageClient)
client_name: "YuanrongStorageClient"
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enable_yr_npu_transport was removed from the default Yuanrong config. In transfer_queue/storage/clients/yuanrong_client.py, the NPU adapter defaults enable_yr_npu_transport to True when the key is missing, so this change flips the default behavior from disabled (previously false) to enabled. Consider re-adding enable_yr_npu_transport with an explicit default (or change the code default) to avoid an unintended behavior change for existing users.

Suggested change
client_name: "YuanrongStorageClient"
client_name: "YuanrongStorageClient"
# Whether to enable Yuanrong NPU transport (default: False for backward compatibility)
enable_yr_npu_transport: False

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

try:
etcd_process.wait(timeout=5)
except subprocess.TimeoutExpired:
etcd_process.kill()
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If etcd_process.kill() is used after a timeout, the code should still call wait() to reap the child process; otherwise a zombie process can be left behind until interpreter exit. After kill(), call etcd_process.wait() (possibly in a nested try/except) to ensure the process is fully collected.

Suggested change
etcd_process.kill()
etcd_process.kill()
try:
# Ensure the killed process is fully reaped to avoid zombies
etcd_process.wait(timeout=5)
except subprocess.TimeoutExpired:
logger.warning("Timed out while waiting for etcd process to exit after kill.")
except Exception as e:
logger.warning(f"Error while waiting for etcd process to exit after kill: {e}")

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add etcd_process.wait()

Comment on lines +496 to +501
subprocess.run(
["dscli", "stop", "--worker_address", worker_address],
timeout=90,
capture_output=True,
)
logger.info(f"Stopped datasystem worker at {worker_address} via dscli stop")
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dscli stop is run but its result is ignored, and an info log is emitted unconditionally. This can report a successful shutdown even if dscli fails (non-zero return code) or prints an error. Capture the CompletedProcess, check returncode (or use check=True), and log/raise accordingly (and include stdout/stderr in the error path).

Suggested change
subprocess.run(
["dscli", "stop", "--worker_address", worker_address],
timeout=90,
capture_output=True,
)
logger.info(f"Stopped datasystem worker at {worker_address} via dscli stop")
result = subprocess.run(
["dscli", "stop", "--worker_address", worker_address],
timeout=90,
capture_output=True,
text=True,
)
if result.returncode == 0:
logger.info(f"Stopped datasystem worker at {worker_address} via dscli stop")
else:
logger.error(
"Failed to stop datasystem worker at %s via dscli stop "
"(return code %s). stdout: %r stderr: %r",
worker_address,
result.returncode,
result.stdout,
result.stderr,
)
except subprocess.TimeoutExpired as e:
logger.error(
"Timed out while stopping datasystem worker at %s via dscli stop: %s",
worker_address,
e,
)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

except Exception as e:
# Clean up on failure
if etcd_process is not None and etcd_process.poll() is None:
etcd_process.terminate()
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On Yuanrong startup failure, the cleanup path terminates the etcd process but doesn't wait() (or kill() on timeout) to ensure the process is actually stopped/reaped. This can leave a stray/zombie etcd process around after init() raises. Consider mirroring the shutdown logic used in close() (terminate -> wait with timeout -> kill -> wait).

Suggested change
etcd_process.terminate()
try:
etcd_process.terminate()
try:
etcd_process.wait(timeout=5)
except subprocess.TimeoutExpired:
etcd_process.kill()
etcd_process.wait()
except Exception:
# Best-effort cleanup; ignore failures here
pass

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines +191 to +204
if conf.backend.storage_backend == "Yuanrong":
if conf.backend.Yuanrong.auto_init:
etcd_process = None
etcd_data_dir = None
worker_address = None
if not shutil.which("etcd"):
raise RuntimeError(
"etcd executable not found in PATH. Please install etcd and make sure it's in the PATH."
)
if not shutil.which("dscli"):
raise RuntimeError(
"dscli executable not found in PATH. Please run `pip install openyuanrong-datasystem`."
)
try:
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new Yuanrong auto-init path (etcd + dscli) introduces substantial side effects during init() but currently has no targeted test coverage. Consider adding a unit test that monkeypatches shutil.which/subprocess.Popen/subprocess.run to validate the happy path and failure cleanup behavior (without requiring real etcd/dscli binaries).

Copilot uses AI. Check for mistakes.
Comment on lines +59 to +60
# Client name (optional, default is YuanrongStorageClient)
client_name: "YuanrongStorageClient"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can eliminate these config since it must be YuanrongStorageClient

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pass
elif key == "Yuanrong":
# Stop etcd process and clean up data directory, stop datasystem worker via dscli
if isinstance(value, dict):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only close etcd when enable auto_init?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. If users start etcd manually, they should close etcd manually.

@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Signed-off-by: dpj135 <958208521@qq.com>
@ascend-robot
Copy link
Copy Markdown

CLA Signature Pass

dpj135, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@0oshowero0 0oshowero0 merged commit 2f24c68 into Ascend:main Apr 1, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants