[feat] Support metastore mode for Yuanrong backend init by KaisennHu · Pull Request #74 · Ascend/TransferQueue

KaisennHu · 2026-04-07T02:14:51Z

Description

Simplify Yuanrong backend initialization to exclusively support metastore mode, removing the external etcd dependency. This refactor uses Ray's native cluster discovery and placement groups to manage distributed Yuanrong datasystem workers.

Changes

transfer_queue/config.yaml: Remove etcd and metastore mode configuration
- Removed: etcd_address, host, port, metastore_mode, metastore_address
- Kept: auto_init, worker_port, metastore_port
- Added: worker_args for additional dscli start arguments
- Host IPs are now auto-detected from ray.nodes() via NodeManagerAddress
transfer_queue: Add new file transfer_queue/utils/yuanrong_utils.py
- YuanrongWorkerActor (Ray actor class):
  - Determines its node via IP intersection with provided node_ips
  - Starts metastore service on head node (rank 0)
  - Provides start() and stop() methods for lifecycle management
- initialize_yuanrong_backend(): Complete initialization logic
  - Gets Ray cluster information via ray.nodes()
  - Creates placement group with STRICT_SPREAD strategy (0.1 CPU per bundle)
  - Creates YuanrongWorkerActor instances on each bundle
  - Starts the head worker first, then starts the remaining workers in parallel
  - Returns dict with worker_actors, metastore_address, placement_group
  - Handles exceptions with proper cleanup
- cleanup_yuanrong_resources(): Complete cleanup logic
  - Stops all workers concurrently, collecting exceptions
  - Kills actors and removes placement group
- start_datasystem_worker() / stop_datasystem_worker(): dscli wrapper functions
- get_local_ip_addresses(): IP discovery for node self-determination
transfer_queue/interface.py: Simplified Yuanrong backend integration
- Replace ~100 lines of inline initialization with a single function call to initialize_yuanrong_backend(conf)
- Simplify close() to a single call to cleanup_yuanrong_resources(value)
- Remove unused imports: shutil, get_local_ip_addresses, etcd-related functions
tests/: Update test configurations to use worker_port instead of host/port
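
The node-discovery and bundle-planning steps listed above can be sketched without a live cluster. This is a minimal illustration under stated assumptions, not the code from this PR: `plan_bundles` and `pick_own_ip` are hypothetical helper names, and the sample list only mimics the `NodeManagerAddress` and `Alive` fields of `ray.nodes()` output.

```python
from typing import Any


def plan_bundles(nodes: list[dict[str, Any]], cpu_per_bundle: float = 0.1):
    """Return (node_ips, bundles) for the alive nodes of a ray.nodes()-style list."""
    node_ips = [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]
    # One small bundle per node, so one worker actor can be scheduled on each.
    bundles = [{"CPU": cpu_per_bundle} for _ in node_ips]
    return node_ips, bundles


def pick_own_ip(local_ips: set[str], node_ips: list[str]) -> str:
    """Identify the current node by intersecting its local IPs with the cluster IPs."""
    matches = set(local_ips) & set(node_ips)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one matching IP, got {sorted(matches)}")
    return matches.pop()


# Sample data shaped like ray.nodes() output (only the fields used here).
sample_nodes = [
    {"NodeManagerAddress": "10.0.0.1", "Alive": True},
    {"NodeManagerAddress": "10.0.0.2", "Alive": True},
    {"NodeManagerAddress": "10.0.0.3", "Alive": False},  # dead node is skipped
]

node_ips, bundles = plan_bundles(sample_nodes)
own_ip = pick_own_ip({"127.0.0.1", "10.0.0.2"}, node_ips)
print(node_ips, bundles, own_ip)
```

On a real cluster the bundles would presumably be passed to `ray.util.placement_group(bundles, strategy="STRICT_SPREAD")`, which requires each bundle to land on a distinct node.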

Related issues

Fixes #50

ascend-robot · 2026-04-07T02:15:01Z

CLA Signature Pass

KaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Copilot

Pull request overview

Adds a new “metastore mode” startup path for the Yuanrong backend so transfer_queue.init() can bring up Yuanrong datasystem without managing an external etcd, targeting multi-node deployments where the metastore runs inside the datasystem worker.

Changes:

- Added metastore connectivity polling and refactored Yuanrong worker startup into helper functions.
- Refactored Yuanrong auto_init to support both metastore mode and the existing etcd-based mode.
- Updated Yuanrong cleanup logic in close() and extended default config with metastore_mode / metastore_address.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `transfer_queue/interface.py` | Implements metastore readiness waiting, metastore-mode worker startup (head/worker), refactors etcd/dscli startup & cleanup paths. |
| `transfer_queue/config.yaml` | Adds metastore-related config keys and removes the documented `host` field in the Yuanrong defaults. |

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

tianyi-ge · 2026-04-08T07:33:00Z

transfer_queue/utils/yuanrong_utils.py

+    bundles = []
+    for node in ordered_nodes:
+        node_ip = node["NodeManagerAddress"]
+        bundles.append({"CPU": 0.1, f"node:{node_ip}": 0.001})


Does this require users to start Ray with --resources='{"node:xxx": 1}'? verl users are unlikely to set this config.
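
Ray normally registers a `node:<ip>` resource on every node by itself, so an explicit `--resources` flag should not be needed, but this is worth confirming on the target cluster. A minimal check, assuming the `Resources` field of `ray.nodes()` entries; the helper name `nodes_missing_ip_resource` and the sample data are made up for illustration:

```python
def nodes_missing_ip_resource(nodes):
    """Return the IPs of alive nodes that do not advertise a node:<ip> resource."""
    missing = []
    for node in nodes:
        if not node.get("Alive"):
            continue
        ip = node["NodeManagerAddress"]
        if f"node:{ip}" not in node.get("Resources", {}):
            missing.append(ip)
    return missing


# Sample data shaped like ray.nodes() output; the second node is deliberately
# missing its node:<ip> entry to show what a failing check looks like.
sample = [
    {"NodeManagerAddress": "10.0.0.1", "Alive": True,
     "Resources": {"CPU": 8.0, "node:10.0.0.1": 1.0}},
    {"NodeManagerAddress": "10.0.0.2", "Alive": True,
     "Resources": {"CPU": 8.0}},
]

missing = nodes_missing_ip_resource(sample)
print(missing)
```

Running the same helper on `ray.nodes()` from a live cluster would show whether a bundle such as `{"CPU": 0.1, f"node:{node_ip}": 0.001}` can be satisfied on every node.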

tianyi-ge · 2026-04-08T11:39:44Z

tests/test_yuanrong_client_zero_copy.py

    @pytest.fixture
    def storage_client(self, mock_kv_client):
-        return GeneralKVClientAdapter({"host": "127.0.0.1", "port": 31501})
+        return GeneralKVClientAdapter({"host": "127.0.0.1", "worker_port": 31501})


Many test cases still contain the removed field "host". You may want to remove it to make the tests cleaner.

tianyi-ge · 2026-04-08T11:41:41Z

transfer_queue/config.yaml

  # For Yuanrong:
  Yuanrong:
-    # Whether to let TQ automatically start etcd and datasystem services
+    # Whether to let TQ automatically init yuanrong.


You may add datasystem worker args here, e.g.

worker_args: "--shared_memory_mb 16384 --enable_huge_tlb true"

It's simpler than adding another JSON file.
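
A string-valued `worker_args` like the one suggested above has to be tokenized before it is appended to the command. The hunk in transfer_queue/utils/yuanrong_utils.py uses `str.split()`; the sketch below swaps in `shlex.split()`, an alternative that also handles quoted values. The base command and the `--log_dir` flag are placeholders, not confirmed dscli options, and nothing is executed here.

```python
import shlex

worker_args = "--shared_memory_mb 16384 --enable_huge_tlb true"

# Placeholder base command; the real dscli invocation may carry more arguments.
base_cmd = ["dscli", "start"]
cmd = base_cmd + shlex.split(worker_args)
print(cmd)

# Unlike str.split(), shlex.split() keeps a quoted value with spaces together.
# "--log_dir" is a hypothetical flag used only to show the difference.
quoted = shlex.split('--log_dir "/tmp/my logs"')
naive = '--log_dir "/tmp/my logs"'.split()
print(quoted)
print(naive)
```

For arguments without quoting, such as the example string in the comment, both approaches produce the same token list.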

tianyi-ge · 2026-04-09T03:33:03Z

transfer_queue/utils/yuanrong_utils.py

+        cmd.extend(worker_args.split())
+
+    node_type = "head node" if is_head else "worker node"
+    logger.info(f"Starting Yuanrong datasystem ({node_type}, metastore mode) at {worker_address}")


I would suggest printing worker_args here, so users can confirm their config was picked up correctly.

dpj135 · 2026-04-09T03:39:54Z

Could we add Yuanrong to ci::e2e_test? @KaisennHu @tianyi-ge

0oshowero0 · 2026-04-09T03:44:55Z

transfer_queue/utils/yuanrong_utils.py

+            stop_exceptions = []
+            # Stop worker nodes (all except head node 0) first
+            if len(worker_actors) > 1:
+                stop_refs = [actor.stop.remote() for actor in worker_actors[1:]]


Why do we need to stop the head worker separately from the other workers? Is it to make sure the metastore on the head worker does not raise errors or warnings due to heartbeat loss from the workers?
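
The ordering in the hunk above (non-head workers first, head last, exceptions collected rather than raised) can be sketched as follows. This is a stand-in that uses threads and a hypothetical `FakeWorker` class instead of Ray actors, so it runs without a cluster; `stop_all` is an illustrative name, not a function from this PR.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeWorker:
    """Hypothetical stand-in for a worker actor; records successful stops."""

    def __init__(self, name, log, fail=False):
        self.name, self.log, self.fail = name, log, fail

    def stop(self):
        if self.fail:
            raise RuntimeError(f"{self.name} failed to stop")
        self.log.append(self.name)


def _safe_stop(worker):
    """Stop one worker and return the exception instead of raising it."""
    try:
        worker.stop()
    except Exception as exc:
        return exc
    return None


def stop_all(workers):
    """Stop workers[1:] concurrently, then workers[0]; return collected errors."""
    errors = []
    if len(workers) > 1:
        with ThreadPoolExecutor(max_workers=len(workers) - 1) as pool:
            errors.extend(e for e in pool.map(_safe_stop, workers[1:]) if e)
    if workers:
        err = _safe_stop(workers[0])  # head goes last, after the pool has drained
        if err:
            errors.append(err)
    return errors


log = []
workers = [FakeWorker("head", log), FakeWorker("w1", log), FakeWorker("w2", log, fail=True)]
errors = stop_all(workers)
print(log, errors)
```

Because the head is stopped only after the thread pool has finished, the metastore stays up while the other workers shut down, and one failing worker does not prevent the rest of the cleanup.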

0oshowero0 · 2026-04-09T03:49:50Z

transfer_queue/utils/yuanrong_utils.py

+import ray
+from omegaconf import DictConfig
+
+from transfer_queue.storage.clients.yuanrong_client import get_local_ip_addresses


We can move get_local_ip_addresses into this file

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

ascend-robot added the ascend-cla/yes label · Apr 7, 2026

KaisennHu force-pushed the feat/yr-init-metastore branch from 3bcdf8b to 852c27e · April 7, 2026 03:35

0oshowero0 requested a review from Copilot · April 7, 2026 07:39

Copilot started reviewing on behalf of 0oshowero0 · April 7, 2026 07:40

0oshowero0 reviewed · Apr 7, 2026 (comment on transfer_queue/interface.py, now outdated)

Copilot AI reviewed · Apr 7, 2026

KaisennHu force-pushed the feat/yr-init-metastore branch from 852c27e to 33bed8a · April 7, 2026 08:13

KaisennHu force-pushed the feat/yr-init-metastore branch from 33bed8a to 89332c6 · April 7, 2026 11:03

KaisennHu force-pushed the feat/yr-init-metastore branch from 89332c6 to bbae4c1 · April 7, 2026 16:53

KaisennHu force-pushed the feat/yr-init-metastore branch from bbae4c1 to be6fe9f · April 7, 2026 16:56

KaisennHu force-pushed the feat/yr-init-metastore branch from be6fe9f to c6b54ea · April 8, 2026 01:04

KaisennHu force-pushed the feat/yr-init-metastore branch from c6b54ea to fa6db11 · April 8, 2026 01:45

KaisennHu force-pushed the feat/yr-init-metastore branch from fa6db11 to 655a553 · April 8, 2026 02:09

KaisennHu force-pushed the feat/yr-init-metastore branch from 655a553 to 747ecb0 · April 8, 2026 03:14

0oshowero0 requested a review from Copilot · April 8, 2026 03:44

Copilot started reviewing on behalf of 0oshowero0 · April 8, 2026 03:44

Copilot AI reviewed · Apr 8, 2026

KaisennHu force-pushed the feat/yr-init-metastore branch from 747ecb0 to 7073b4a · April 8, 2026 06:31

tianyi-ge reviewed · Apr 8, 2026

KaisennHu force-pushed the feat/yr-init-metastore branch from 7073b4a to b1980f8 · April 8, 2026 09:00

KaisennHu force-pushed the feat/yr-init-metastore branch from b1980f8 to e31a6ba · April 8, 2026 09:19

tianyi-ge reviewed · Apr 8, 2026

KaisennHu force-pushed the feat/yr-init-metastore branch from e31a6ba to 51c689d · April 9, 2026 02:48

KaisennHu force-pushed the feat/yr-init-metastore branch from 51c689d to 2f9bf95 · April 9, 2026 02:53

KaisennHu force-pushed the feat/yr-init-metastore branch from 2f9bf95 to 8a1c985 · April 9, 2026 03:01

KaisennHu force-pushed the feat/yr-init-metastore branch from 8a1c985 to 5679d7d · April 9, 2026 03:20

tianyi-ge reviewed · Apr 9, 2026

0oshowero0 reviewed · Apr 9, 2026

KaisennHu force-pushed the feat/yr-init-metastore branch from 5679d7d to c0155af · April 9, 2026 06:16

KaisennHu force-pushed the feat/yr-init-metastore branch from c0155af to 54c6b1f · April 9, 2026 07:13

KaisennHu force-pushed the feat/yr-init-metastore branch from 54c6b1f to 583171a · April 10, 2026 01:32

[feat] Support auto init YR backend based on metastore

c608bc6

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>

KaisennHu force-pushed the feat/yr-init-metastore branch from 583171a to c608bc6 · April 10, 2026 08:43
