[feat] Support metastore mode for Yuanrong backend init#74
[feat] Support metastore mode for Yuanrong backend init#74KaisennHu wants to merge 1 commit intoAscend:mainfrom
Conversation
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
3bcdf8b to
852c27e
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
There was a problem hiding this comment.
Pull request overview
Adds a new “metastore mode” startup path for the Yuanrong backend so transfer_queue.init() can bring up Yuanrong datasystem without managing an external etcd, targeting multi-node deployments where the metastore runs inside the datasystem worker.
Changes:
- Added metastore connectivity polling and refactored Yuanrong worker startup into helper functions.
- Refactored Yuanrong
auto_initto support both metastore mode and the existing etcd-based mode. - Updated Yuanrong cleanup logic in
close()and extended default config withmetastore_mode/metastore_address.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
transfer_queue/interface.py |
Implements metastore readiness waiting, metastore-mode worker startup (head/worker), refactors etcd/dscli startup & cleanup paths. |
transfer_queue/config.yaml |
Adds metastore-related config keys and removes the documented host field in the Yuanrong defaults. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
852c27e to
33bed8a
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
33bed8a to
89332c6
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
89332c6 to
bbae4c1
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
bbae4c1 to
be6fe9f
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
be6fe9f to
c6b54ea
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
c6b54ea to
fa6db11
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
fa6db11 to
655a553
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
655a553 to
747ecb0
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
747ecb0 to
7073b4a
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
| bundles = [] | ||
| for node in ordered_nodes: | ||
| node_ip = node["NodeManagerAddress"] | ||
| bundles.append({"CPU": 0.1, f"node:{node_ip}": 0.001}) |
There was a problem hiding this comment.
does it require users to start ray with --resources='{"node:xxx": 1}'? Verl users do not likely set this config
7073b4a to
b1980f8
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
b1980f8 to
e31a6ba
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
| @pytest.fixture | ||
| def storage_client(self, mock_kv_client): | ||
| return GeneralKVClientAdapter({"host": "127.0.0.1", "port": 31501}) | ||
| return GeneralKVClientAdapter({"host": "127.0.0.1", "worker_port": 31501}) |
There was a problem hiding this comment.
many test cases still have the removed field "host". you may remove them to make tests cleaner
transfer_queue/config.yaml
Outdated
| # For Yuanrong: | ||
| Yuanrong: | ||
| # Whether to let TQ automatically start etcd and datasystem services | ||
| # Whether to let TQ automatically init yuanrong. |
There was a problem hiding this comment.
you may add datasystem worker args here, e.g.
worker_args: "--shared_memory_mb 16384 --enable_huge_tlb true"it's simpler than add another json file
e31a6ba to
51c689d
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
51c689d to
2f9bf95
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
2f9bf95 to
8a1c985
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
8a1c985 to
5679d7d
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
| cmd.extend(worker_args.split()) | ||
|
|
||
| node_type = "head node" if is_head else "worker node" | ||
| logger.info(f"Starting Yuanrong datasystem ({node_type}, metastore mode) at {worker_address}") |
There was a problem hiding this comment.
I would suggest printing worker_args here, so the user know their config is correct
|
Could we add |
| stop_exceptions = [] | ||
| # Stop worker nodes (all except head node 0) first | ||
| if len(worker_actors) > 1: | ||
| stop_refs = [actor.stop.remote() for actor in worker_actors[1:]] |
There was a problem hiding this comment.
Why we need to split the stop for head worker and other workers? To make sure the metastore on the head worker will not raise any error or warnings due to the heartbeat loss of worker?
| import ray | ||
| from omegaconf import DictConfig | ||
|
|
||
| from transfer_queue.storage.clients.yuanrong_client import get_local_ip_addresses |
There was a problem hiding this comment.
We can move get_local_ip_addresses into this file
5679d7d to
c0155af
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
c0155af to
54c6b1f
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
54c6b1f to
583171a
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: Haichuan Hu <kaisennhu@gmail.com>
583171a to
c608bc6
Compare
CLA Signature PassKaisennHu, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Description
Simplify Yuanrong backend initialization to exclusively support metastore mode, removing the external etcd dependency. This refactor uses Ray's native cluster discovery and placement groups to manage distributed Yuanrong datasystem workers.
Changes
- Removed: etcd_address, host, port, metastore_mode, metastore_address
- Kept: auto_init, worker_port, metastore_port
- Added: worker_args for additional dscli start arguments
- Host IPs are now auto-detected from ray.nodes() via NodeManagerAddress
- Replace ~100 lines of inline initialization with single function call to initialize_yuanrong_backend(conf)
- Simplify close() to single call to cleanup_yuanrong_resources(value)
- Remove unused imports: shutil, get_local_ip_addresses, etcd-related functions
Related issues
Fixes #50