[feat] Support metastore mode for Yuanrong backend init #74

Open
KaisennHu wants to merge 1 commit into Ascend:main from KaisennHu:feat/yr-init-metastore

Conversation

Collaborator

@KaisennHu KaisennHu commented Apr 7, 2026

Description

Simplify Yuanrong backend initialization to exclusively support metastore mode, removing the external etcd dependency. This refactor uses Ray's native cluster discovery and placement groups to manage distributed Yuanrong datasystem workers.

Changes

  • transfer_queue/config.yaml: Remove etcd and metastore mode configuration
    - Removed: etcd_address, host, port, metastore_mode, metastore_address
    - Kept: auto_init, worker_port, metastore_port
    - Added: worker_args for additional dscli start arguments
    - Host IPs are now auto-detected from ray.nodes() via NodeManagerAddress
  • transfer_queue: Add new file transfer_queue/utils/yuanrong_utils.py
    • YuanrongWorkerActor (Ray actor class):
      • Determines its node via IP intersection with provided node_ips
      • Starts metastore service on head node (rank 0)
      • Provides start() and stop() methods for lifecycle management
    • initialize_yuanrong_backend(): Complete initialization logic
      • Gets Ray cluster information via ray.nodes()
      • Creates placement group with STRICT_SPREAD strategy (0.1 CPU per bundle)
      • Creates YuanrongWorkerActor instances on each bundle
      • Starts the head worker first, then starts the remaining workers in parallel
      • Returns dict with worker_actors, metastore_address, placement_group
      • Handles exceptions with proper cleanup
    • cleanup_yuanrong_resources(): Complete cleanup logic
      • Stops all workers concurrently, collecting exceptions
      • Kills actors and removes placement group
    • start_datasystem_worker() / stop_datasystem_worker(): dscli wrapper functions
    • get_local_ip_addresses(): IP discovery for node self-determination
  • transfer_queue/interface.py: Simplified Yuanrong backend integration
    - Replace ~100 lines of inline initialization with a single function call to initialize_yuanrong_backend(conf)
    - Simplify close() to a single call to cleanup_yuanrong_resources(value)
    - Remove unused imports: shutil, get_local_ip_addresses, etcd-related functions
  • tests/: Update test configurations to use worker_port instead of host/port
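
The node discovery and bundle construction listed above can be sketched roughly as follows. This is an illustration only, not the code in yuanrong_utils.py: `plan_bundles` and its `head_ip` parameter are hypothetical, and the head-first ordering rule is an assumption. The helper works on plain dicts shaped like the entries returned by `ray.nodes()`, so it runs without a cluster.

```python
from typing import Any


def plan_bundles(
    nodes: list[dict[str, Any]], head_ip: str
) -> tuple[list[str], list[dict[str, float]]]:
    """Return (ordered node IPs, placement-group bundles), head node first.

    `nodes` has the shape of ray.nodes() output; only the "Alive" and
    "NodeManagerAddress" keys are read. `head_ip` is a hypothetical input
    naming the node that should run the metastore (rank 0).
    """
    alive_ips = [n["NodeManagerAddress"] for n in nodes if n.get("Alive")]
    if head_ip not in alive_ips:
        raise ValueError(f"head node {head_ip} is not alive in the cluster")
    # De-duplicate while preserving order, then move the head to the front.
    unique_ips = list(dict.fromkeys(alive_ips))
    ordered = [head_ip] + [ip for ip in unique_ips if ip != head_ip]
    # One small bundle per node; STRICT_SPREAD places each bundle on a distinct node.
    bundles = [{"CPU": 0.1} for _ in ordered]
    return ordered, bundles


sample_nodes = [
    {"NodeManagerAddress": "10.0.0.2", "Alive": True},
    {"NodeManagerAddress": "10.0.0.1", "Alive": True},
    {"NodeManagerAddress": "10.0.0.3", "Alive": False},
]
node_ips, bundles = plan_bundles(sample_nodes, head_ip="10.0.0.1")

# With a live cluster, the bundles would be handed to Ray roughly like this:
#   pg = ray.util.placement_group(bundles, strategy="STRICT_SPREAD")
#   ray.get(pg.ready())
```

Because the bundles request only a small CPU share, the sketch does not depend on any custom per-node resources being configured when Ray is started.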

Related issues

Fixes #50

@ascend-robot

CLA Signature Pass

KaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 3bcdf8b to 852c27e Compare April 7, 2026 03:35

Contributor

Copilot AI left a comment

Pull request overview

Adds a new “metastore mode” startup path for the Yuanrong backend so transfer_queue.init() can bring up Yuanrong datasystem without managing an external etcd, targeting multi-node deployments where the metastore runs inside the datasystem worker.

Changes:

  • Added metastore connectivity polling and refactored Yuanrong worker startup into helper functions.
  • Refactored Yuanrong auto_init to support both metastore mode and the existing etcd-based mode.
  • Updated Yuanrong cleanup logic in close() and extended default config with metastore_mode / metastore_address.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| transfer_queue/interface.py | Implements metastore readiness waiting, metastore-mode worker startup (head/worker), refactors etcd/dscli startup & cleanup paths. |
| transfer_queue/config.yaml | Adds metastore-related config keys and removes the documented host field in the Yuanrong defaults. |

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 852c27e to 33bed8a Compare April 7, 2026 08:13

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 33bed8a to 89332c6 Compare April 7, 2026 11:03

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 89332c6 to bbae4c1 Compare April 7, 2026 16:53

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from bbae4c1 to be6fe9f Compare April 7, 2026 16:56

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from be6fe9f to c6b54ea Compare April 8, 2026 01:04

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from c6b54ea to fa6db11 Compare April 8, 2026 01:45

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from fa6db11 to 655a553 Compare April 8, 2026 02:09

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 655a553 to 747ecb0 Compare April 8, 2026 03:14

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 747ecb0 to 7073b4a Compare April 8, 2026 06:31

bundles = []
for node in ordered_nodes:
    node_ip = node["NodeManagerAddress"]
    bundles.append({"CPU": 0.1, f"node:{node_ip}": 0.001})
Contributor

Does this require users to start Ray with --resources='{"node:xxx": 1}'? verl users are unlikely to set this option.

Collaborator Author

fixed

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 7073b4a to b1980f8 Compare April 8, 2026 09:00

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from b1980f8 to e31a6ba Compare April 8, 2026 09:19

  @pytest.fixture
  def storage_client(self, mock_kv_client):
-     return GeneralKVClientAdapter({"host": "127.0.0.1", "port": 31501})
+     return GeneralKVClientAdapter({"host": "127.0.0.1", "worker_port": 31501})
Contributor

Many test cases still set the removed field "host". You may want to remove it to make the tests cleaner.

  # For Yuanrong:
  Yuanrong:
-   # Whether to let TQ automatically start etcd and datasystem services
+   # Whether to let TQ automatically init yuanrong.
Contributor

You could add the datasystem worker args here, e.g.

worker_args: "--shared_memory_mb 16384 --enable_huge_tlb true"

It's simpler than adding another JSON file.
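
As a rough illustration of how such a string could be turned into command-line arguments: the snippet below uses `shlex.split` rather than the plain `str.split()` seen elsewhere in this PR, which is a swapped-in choice, not what the PR does. `extend_with_worker_args` is a hypothetical helper, and `["dscli", "start"]` only stands in for the real base command, whose full argument list is not shown in this thread.

```python
import shlex


def extend_with_worker_args(cmd: list[str], worker_args: str) -> list[str]:
    """Return a new command list with the extra arguments appended.

    shlex.split honours shell-style quoting, so a quoted value containing
    spaces stays a single argument (plain str.split would break it apart).
    """
    return cmd + shlex.split(worker_args or "")


# Placeholder base command (assumption); the real dscli invocation has more arguments.
base_cmd = ["dscli", "start"]
cmd = extend_with_worker_args(
    base_cmd, "--shared_memory_mb 16384 --enable_huge_tlb true"
)
```

An empty or missing worker_args leaves the base command unchanged, so the option stays fully optional.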

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from e31a6ba to 51c689d Compare April 9, 2026 02:48

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 51c689d to 2f9bf95 Compare April 9, 2026 02:53

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 2f9bf95 to 8a1c985 Compare April 9, 2026 03:01

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 8a1c985 to 5679d7d Compare April 9, 2026 03:20

cmd.extend(worker_args.split())

node_type = "head node" if is_head else "worker node"
logger.info(f"Starting Yuanrong datasystem ({node_type}, metastore mode) at {worker_address}")
Contributor

I would suggest logging worker_args here so users can confirm that their config was picked up correctly.
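
One possible shape for that log line is sketched below. `start_message` and the logger name are hypothetical, and the sample address and arguments are made up for illustration.

```python
import logging

logger = logging.getLogger("yuanrong_example")


def start_message(worker_address: str, is_head: bool, worker_args: str) -> str:
    """Build the startup log line, including any extra worker arguments."""
    node_type = "head node" if is_head else "worker node"
    # Show an explicit placeholder so an empty setting is visible in the logs.
    extra = worker_args.strip() or "<none>"
    return (
        f"Starting Yuanrong datasystem ({node_type}, metastore mode) "
        f"at {worker_address} with worker_args: {extra}"
    )


message = start_message("10.0.0.1:31501", True, "--shared_memory_mb 16384")
logger.info(message)
```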

Contributor

dpj135 commented Apr 9, 2026

Could we add Yuanrong to ci::e2e_test? @KaisennHu @tianyi-ge

stop_exceptions = []
# Stop worker nodes (all except head node 0) first
if len(worker_actors) > 1:
    stop_refs = [actor.stop.remote() for actor in worker_actors[1:]]
Collaborator

Why do we need to stop the head worker separately from the other workers? Is it to make sure the metastore on the head worker does not raise errors or warnings because of lost worker heartbeats?

Collaborator Author

Exactly
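
The ordering discussed in this thread (non-head workers first, head last, with exceptions collected rather than raised) can be sketched without Ray as follows. In the PR the workers are Ray actors stopped via `actor.stop.remote()`; here plain callables and a thread pool stand in for them, and `stop_all` and `make_stopper` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def stop_all(stoppers: list[Callable[[], None]]) -> list[Exception]:
    """Stop non-head workers concurrently, then the head (index 0) last.

    Exceptions are collected rather than raised so one failure does not
    prevent the remaining workers from being stopped.
    """
    errors: list[Exception] = []

    def safe_stop(stop: Callable[[], None]) -> None:
        try:
            stop()
        except Exception as exc:  # keep going and report afterwards
            errors.append(exc)

    if not stoppers:
        return errors
    head, others = stoppers[0], stoppers[1:]
    if others:
        with ThreadPoolExecutor(max_workers=len(others)) as pool:
            # list() drains the iterator so every stop has finished here.
            list(pool.map(safe_stop, others))
    # The head hosts the metastore, so it goes down only after its workers.
    safe_stop(head)
    return errors


order: list[str] = []


def make_stopper(name: str, fail: bool = False) -> Callable[[], None]:
    def stop() -> None:
        order.append(name)
        if fail:
            raise RuntimeError(f"{name} failed to stop")
    return stop


errors = stop_all([make_stopper("head"), make_stopper("w1", fail=True), make_stopper("w2")])
```

Even though "w1" fails in this example, "w2" and the head are still stopped and the failure is reported in `errors`.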

import ray
from omegaconf import DictConfig

from transfer_queue.storage.clients.yuanrong_client import get_local_ip_addresses
Collaborator

We can move get_local_ip_addresses into this file
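
For context, node self-determination by IP intersection might look roughly like the sketch below. The real get_local_ip_addresses is not shown in this conversation; `local_ip_addresses` and `find_node_rank` are hypothetical stand-ins, and resolving the hostname is a simplification that can miss some network interfaces.

```python
import socket


def local_ip_addresses() -> set[str]:
    """Best-effort IPv4 addresses of this host (always includes loopback)."""
    ips = {"127.0.0.1"}
    try:
        infos = socket.getaddrinfo(socket.gethostname(), None, family=socket.AF_INET)
        ips.update(info[4][0] for info in infos)
    except OSError:
        pass  # hostname not resolvable; fall back to loopback only
    return ips


def find_node_rank(node_ips: list[str], local_ips: set[str]) -> int:
    """Return the index of the single entry in node_ips that belongs to this host."""
    matches = [rank for rank, ip in enumerate(node_ips) if ip in local_ips]
    if len(matches) != 1:
        raise RuntimeError(
            f"expected exactly one local IP among {node_ips}, found {len(matches)}"
        )
    return matches[0]


# Example with made-up addresses: this host owns 10.0.0.2, so it is rank 1.
rank = find_node_rank(["10.0.0.1", "10.0.0.2"], {"127.0.0.1", "10.0.0.2"})
```

Rank 0 would correspond to the head node that starts the metastore service.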

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 5679d7d to c0155af Compare April 9, 2026 06:16

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from c0155af to 54c6b1f Compare April 9, 2026 07:13

@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 54c6b1f to 583171a Compare April 10, 2026 01:32

Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
@KaisennHu KaisennHu force-pushed the feat/yr-init-metastore branch from 583171a to c608bc6 Compare April 10, 2026 08:43

Development

Successfully merging this pull request may close these issues.

[Feat] Support automatic startup of Yuanrong for transfer_queue.init()

6 participants