
Remove tms while preserving offloading#17

Open
Risc-lt wants to merge 4 commits intojd/rdma-integrationfrom
feat/offload

Conversation

@Risc-lt (Collaborator) commented Jan 25, 2026

This PR removes torch memory saver (TMS) while preserving the offloading function. An overview of the offloading mechanism:

# Initialization
x = torch.empty(1000, 1000, device="cuda")

# Release memory
x.storage().resize_(0)
torch.cuda.empty_cache()

# Realloc when onloading
# Since we don't need the old values, we can skip the H2D copy and just reallocate
x.storage().resize_(x.numel())
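The mechanism above can be wrapped into a helper pair, sketched below. The names `offload_tensor`/`onload_tensor` are hypothetical, and the snippet falls back to CPU when no GPU is present, since `resize_` behaves the same on CPU storage; `untyped_storage()` is used here, which sizes in bytes rather than elements.

```python
import torch

def offload_tensor(x: torch.Tensor) -> None:
    # Free the backing storage; the tensor keeps its shape metadata.
    x.untyped_storage().resize_(0)
    if x.is_cuda:
        # Return the freed block from the caching allocator to the driver.
        torch.cuda.empty_cache()

def onload_tensor(x: torch.Tensor) -> None:
    # Reallocate storage for the original shape; contents are undefined,
    # which is fine here because we deliberately skip the H2D copy.
    x.untyped_storage().resize_(x.numel() * x.element_size())

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.empty(1000, 1000, device=device)
offload_tensor(x)
assert x.untyped_storage().size() == 0
onload_tensor(x)
assert x.untyped_storage().size() == x.numel() * x.element_size()
```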

After testing on a 1-to-1 config, the effect is:

[RDMA] Before offloading model replica: {'gpu': '0', 'total_GB': 139.8, 'free_GB': 57.41, 'used_GB': 82.39}

[RDMA] After offloading model replica: {'gpu': '0', 'total_GB': 139.8, 'free_GB': 66.13, 'used_GB': 73.67}
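Stats in this format can be produced by a small helper in the style of the `print_memory` call that appears later in this PR (a sketch only; the actual helper in the codebase may differ, and `torch.cuda.mem_get_info` is the underlying query assumed here):

```python
def format_mem(free_b: int, total_b: int, gpu: int = 0) -> dict:
    """Format raw byte counts into the log dict shown above."""
    gb = 1024 ** 3
    return {
        "gpu": str(gpu),
        "total_GB": round(total_b / gb, 2),
        "free_GB": round(free_b / gb, 2),
        "used_GB": round((total_b - free_b) / gb, 2),
    }

def print_memory(tag: str, gpu: int = 0) -> None:
    import torch  # local import keeps the pure helper above torch-free
    free_b, total_b = torch.cuda.mem_get_info(gpu)
    print(f"{tag}: {format_mem(free_b, total_b, gpu)}")
```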

@Risc-lt (Collaborator, Author) commented Jan 25, 2026

Current profiling shows that registering incurs 900-1000 ms of overhead and unregistering takes ~300 ms. We need to find a way to pipeline the process. cc @JD-ETH @JensenFire
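One way to hide most of the ~1 s registration cost is to split the weights into chunks and overlap registering chunk i+1 with transferring chunk i. This is a sketch, not the actual TransferEngine API: `register_chunk` and `transfer_chunk` are stand-ins for whatever the engine exposes.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(chunks, register_chunk, transfer_chunk):
    """Overlap registration of the next chunk with transfer of the current one."""
    order = []
    with ThreadPoolExecutor(max_workers=1) as reg:
        handle = reg.submit(register_chunk, chunks[0])
        for i, chunk in enumerate(chunks):
            registered = handle.result()      # wait for this chunk's registration
            if i + 1 < len(chunks):           # kick off the next registration early
                handle = reg.submit(register_chunk, chunks[i + 1])
            transfer_chunk(registered)        # runs while the next chunk registers
            order.append(chunk)
    return order
```

With N chunks this pays the full registration latency only once, for the first chunk; the rest is hidden behind the transfers, provided a transfer takes at least as long as a registration.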

(screenshot of profiling results attached)

@JD-ETH (Owner) left a comment

let's discuss a bit first

engine: TransferEngine
weight_memory_registry: dict
remote_weight_infos: list[RemoteWeightInfo]
_model_on_cpu: bool = False
JD-ETH (Owner):

either make it private and access it via a property, or just make it a public member
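The property variant of this suggestion looks like the following (the class name `ReplicaState` is hypothetical; only the `_model_on_cpu` field comes from the diff above):

```python
class ReplicaState:
    def __init__(self) -> None:
        self._model_on_cpu = False  # private; exposed read-only below

    @property
    def model_on_cpu(self) -> bool:
        return self._model_on_cpu

state = ReplicaState()
assert state.model_on_cpu is False
```

The property keeps the flag read-only from outside, so only the offload/onload paths inside the class can flip it.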

Risc-lt (Collaborator, Author):

Sorry for this typo, I'll correct it.

logging.error(f"RDMA transfer failed with error code {ret} for session {task.session_id}")
logger.info(f"[RDMA] Submitted transfer task for session {task.session_id}, batch_id={batch_id}")
# Record batch_id with engine and source_ptrs for later sync and unregister
with self._lock:
JD-ETH (Owner):

don't we have _active_tasks already?

JD-ETH (Owner):

we should rely on _queue.join() for the eventual finish check
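The `_queue.join()` pattern relies on the worker calling `task_done()` for every dequeued item; a minimal stdlib sketch (names are illustrative, the `done.append` is a stand-in for the RDMA transfer):

```python
import queue
import threading

q: "queue.Queue[int]" = queue.Queue()
done = []

def worker() -> None:
    while True:
        item = q.get()
        done.append(item)   # stand-in for processing the transfer task
        q.task_done()       # without this, join() below blocks forever

threading.Thread(target=worker, daemon=True).start()
for i in range(3):
    q.put(i)
q.join()                    # returns once every put() item is task_done()
assert done == [0, 1, 2]
```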


print_memory("[RDMA] After Local Engine Replicas and engine Creation")

def _unregister_replica_memory(self, model_replica, transfer_engine) -> None:
JD-ETH (Owner):

is there a way to guarantee the mapping is exact? Calling memory_snapshot can be expensive, no?

I think it's best if we store the engine_param -> [memory, offset] mapping at registration time.
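Storing the mapping at registration time could look like the sketch below. The `engine.register`/`engine.unregister` calls are assumptions standing in for whatever the TransferEngine actually exposes, and `WeightMemoryRegistry` is a hypothetical name:

```python
class WeightMemoryRegistry:
    """Record (address, length) per param at registration so unregister is exact."""

    def __init__(self, engine) -> None:
        self.engine = engine
        self._regions: dict[str, tuple[int, int]] = {}

    def register(self, name: str, addr: int, length: int) -> None:
        self.engine.register(addr, length)        # assumed engine API
        self._regions[name] = (addr, length)

    def unregister_all(self) -> None:
        # Replay the recorded regions; no memory_snapshot() walk needed.
        for addr, length in self._regions.values():
            self.engine.unregister(addr, length)  # assumed engine API
        self._regions.clear()
```

Because unregister replays exactly what was registered, the mapping is guaranteed consistent and the expensive snapshot call is avoided entirely.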

Risc-lt (Collaborator, Author) commented Jan 25, 2026:

Exactly. The best fix here is to have register_memory_region_v2 (sglang side) return the registered memory region addresses; then we can pass those addresses directly to TE.unregister. The current implementation mocks register_memory_region_v2 on the training side.

@JD-ETH (Owner) commented Jan 25, 2026

for now let's run pre-commit and fix the private member issue; we can merge first and treat correctness as the next step

@Risc-lt (Collaborator, Author) commented Jan 25, 2026

Thanks for the review! I'll run pre-commit and resolve the issues raised in the comments.
