
Remove tms while preserving offloading#17

Open
Risc-lt wants to merge 4 commits intojd/rdma-integrationfrom
feat/offload

Conversation

@Risc-lt (Collaborator) commented Jan 25, 2026

This PR removes torch memory saver (TMS) while preserving the offloading function. An overview of the offloading mechanism:

# Initialization
x = torch.empty(1000, 1000, device="cuda")

# Release memory
x.storage().resize_(0)
torch.cuda.empty_cache()

# Realloc when onloading
# Since we don't need the old values, we can skip the H2D copy and just reallocate
x.storage().resize_(x.numel())
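The mechanism above can be wrapped into a helper pair, sketched below. The names `offload_tensor`/`onload_tensor` are hypothetical, and the snippet falls back to CPU when no GPU is present, since `resize_` behaves the same on CPU storage; `untyped_storage()` is used here, which sizes in bytes rather than elements.

```python
import torch

def offload_tensor(x: torch.Tensor) -> None:
    # Free the backing storage; the tensor keeps its shape metadata.
    x.untyped_storage().resize_(0)
    if x.is_cuda:
        # Return the freed block from the caching allocator to the driver.
        torch.cuda.empty_cache()

def onload_tensor(x: torch.Tensor) -> None:
    # Reallocate storage for the original shape; contents are undefined,
    # which is fine here because we deliberately skip the H2D copy.
    x.untyped_storage().resize_(x.numel() * x.element_size())

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.empty(1000, 1000, device=device)
offload_tensor(x)
assert x.untyped_storage().size() == 0
onload_tensor(x)
assert x.untyped_storage().size() == x.numel() * x.element_size()
```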

After testing on a 1-to-1 config, the effect is:

[RDMA] Before offloading model replica: {'gpu': '0', 'total_GB': 139.8, 'free_GB': 57.41, 'used_GB': 82.39}

[RDMA] After offloading model replica: {'gpu': '0', 'total_GB': 139.8, 'free_GB': 66.13, 'used_GB': 73.67}
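Stats in this format can be produced by a small helper in the style of the `print_memory` call that appears later in this PR (a sketch only; the actual helper in the codebase may differ, and `torch.cuda.mem_get_info` is the underlying query assumed here):

```python
def format_mem(free_b: int, total_b: int, gpu: int = 0) -> dict:
    """Format raw byte counts into the log dict shown above."""
    gb = 1024 ** 3
    return {
        "gpu": str(gpu),
        "total_GB": round(total_b / gb, 2),
        "free_GB": round(free_b / gb, 2),
        "used_GB": round((total_b - free_b) / gb, 2),
    }

def print_memory(tag: str, gpu: int = 0) -> None:
    import torch  # local import keeps the pure helper above torch-free
    free_b, total_b = torch.cuda.mem_get_info(gpu)
    print(f"{tag}: {format_mem(free_b, total_b, gpu)}")
```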

@Risc-lt (Collaborator, Author) commented Jan 25, 2026

Current profiling shows that registering incurs 900-1000 ms of overhead and unregistering takes ~300 ms. We need to find a way to pipeline the process. cc @JD-ETH @JensenFire
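One way to hide most of the ~1 s registration cost is to split the weights into chunks and overlap registering chunk i+1 with transferring chunk i. This is a sketch, not the actual TransferEngine API: `register_chunk` and `transfer_chunk` are stand-ins for whatever the engine exposes.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(chunks, register_chunk, transfer_chunk):
    """Overlap registration of the next chunk with transfer of the current one."""
    order = []
    with ThreadPoolExecutor(max_workers=1) as reg:
        handle = reg.submit(register_chunk, chunks[0])
        for i, chunk in enumerate(chunks):
            registered = handle.result()      # wait for this chunk's registration
            if i + 1 < len(chunks):           # kick off the next registration early
                handle = reg.submit(register_chunk, chunks[i + 1])
            transfer_chunk(registered)        # runs while the next chunk registers
            order.append(chunk)
    return order
```

With N chunks this pays the full registration latency only once, for the first chunk; the rest is hidden behind the transfers, provided a transfer takes at least as long as a registration.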

(screenshot of profiling results attached)

@JD-ETH (Owner) left a comment

let's discuss a bit first

engine: TransferEngine
weight_memory_registry: dict
remote_weight_infos: list[RemoteWeightInfo]
_model_on_cpu: bool = False
JD-ETH (Owner):

either make it private and access it via a property, or just make it a public member
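The property variant of this suggestion looks like the following (the class name `ReplicaState` is hypothetical; only the `_model_on_cpu` field comes from the diff above):

```python
class ReplicaState:
    def __init__(self) -> None:
        self._model_on_cpu = False  # private; exposed read-only below

    @property
    def model_on_cpu(self) -> bool:
        return self._model_on_cpu

state = ReplicaState()
assert state.model_on_cpu is False
```

The property keeps the flag read-only from outside, so only the offload/onload paths inside the class can flip it.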

Risc-lt (Collaborator, Author):

Sorry for this typo, I'll correct it.

logging.error(f"RDMA transfer failed with error code {ret} for session {task.session_id}")
logger.info(f"[RDMA] Submitted transfer task for session {task.session_id}, batch_id={batch_id}")
# Record batch_id with engine and source_ptrs for later sync and unregister
with self._lock:
JD-ETH (Owner):

don't we have _active_tasks already?

JD-ETH (Owner):

we should rely on _queue.join() for the eventual finish check
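The `_queue.join()` pattern relies on the worker calling `task_done()` for every dequeued item; a minimal stdlib sketch (names are illustrative, the `done.append` is a stand-in for the RDMA transfer):

```python
import queue
import threading

q: "queue.Queue[int]" = queue.Queue()
done = []

def worker() -> None:
    while True:
        item = q.get()
        done.append(item)   # stand-in for processing the transfer task
        q.task_done()       # without this, join() below blocks forever

threading.Thread(target=worker, daemon=True).start()
for i in range(3):
    q.put(i)
q.join()                    # returns once every put() item is task_done()
assert done == [0, 1, 2]
```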


print_memory("[RDMA] After Local Engine Replicas and engine Creation")

def _unregister_replica_memory(self, model_replica, transfer_engine) -> None:
JD-ETH (Owner):

is there a way to guarantee the mapping is exact? Calling memory_snapshot can be expensive, no?

I think it's best if we store the engine_param -> [memory, offset] mapping at registration time.
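Storing the mapping at registration time could look like the sketch below. The `engine.register`/`engine.unregister` calls are assumptions standing in for whatever the TransferEngine actually exposes, and `WeightMemoryRegistry` is a hypothetical name:

```python
class WeightMemoryRegistry:
    """Record (address, length) per param at registration so unregister is exact."""

    def __init__(self, engine) -> None:
        self.engine = engine
        self._regions: dict[str, tuple[int, int]] = {}

    def register(self, name: str, addr: int, length: int) -> None:
        self.engine.register(addr, length)        # assumed engine API
        self._regions[name] = (addr, length)

    def unregister_all(self) -> None:
        # Replay the recorded regions; no memory_snapshot() walk needed.
        for addr, length in self._regions.values():
            self.engine.unregister(addr, length)  # assumed engine API
        self._regions.clear()
```

Because unregister replays exactly what was registered, the mapping is guaranteed consistent and the expensive snapshot call is avoided entirely.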

Risc-lt (Collaborator, Author) commented Jan 25, 2026:

Exactly. The best fix here is to have register_memory_region_v2 (sglang side) return the registered memory region addresses; then we can pass those addresses directly to TE.unregister. The current implementation mocks register_memory_region_v2 on the training side.

@JD-ETH (Owner) commented Jan 25, 2026

for now let's run pre-commit and fix the private member issue; we can merge first and treat correctness as the next step

@Risc-lt (Collaborator, Author) commented Jan 25, 2026

Thanks for the review! I'll run pre-commit and resolve the issues raised in the comments.
