Remove tms while preserving offloading #17
Risc-lt wants to merge 4 commits into `jd/rdma-integration`
Conversation
Current profiling shows that registration adds ~900–1000 ms of overhead and unregistration ~300 ms. We need to find some way to pipeline the process. cc @JD-ETH @JensenFire
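One way to hide the registration cost would be to overlap the registration of the next region with the transfer of the current one. A rough sketch of that idea (`register_fn` and `transfer_fn` are placeholders for the actual TransferEngine calls, not real APIs):

```python
import concurrent.futures

def register_pipelined(regions, register_fn, transfer_fn):
    """Overlap registration of region i+1 with the transfer of region i,
    hiding most of the ~1 s registration cost behind the transfers."""
    if not regions:
        return
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off registration of the first region.
        future = pool.submit(register_fn, regions[0])
        for i, region in enumerate(regions):
            handle = future.result()  # wait for this region's registration
            if i + 1 < len(regions):
                # Start registering the next region in the background.
                future = pool.submit(register_fn, regions[i + 1])
            transfer_fn(region, handle)  # overlaps the next registration
```

This only helps when there is more than one region to register; the first registration still sits on the critical path.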
```python
engine: TransferEngine
weight_memory_registry: dict
remote_weight_infos: list[RemoteWeightInfo]
_model_on_cpu: bool = False
```
Either make it private and access it via a property, or just make it a public member.
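A minimal sketch of the property variant (the class name is hypothetical; only `_model_on_cpu` comes from the diff):

```python
class WeightManager:  # hypothetical host class for the fields in the diff
    def __init__(self) -> None:
        self._model_on_cpu: bool = False  # private backing field

    @property
    def model_on_cpu(self) -> bool:
        # Read-only from outside; offload/restore flip the flag internally.
        return self._model_on_cpu
```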
Sorry for the typo, I'll correct it.
```python
logging.error(f"RDMA transfer failed with error code {ret} for session {task.session_id}")
logger.info(f"[RDMA] Submitted transfer task for session {task.session_id}, batch_id={batch_id}")
# Record batch_id with engine and source_ptrs for later sync and unregister
with self._lock:
```
Don't we have `_active_tasks` already?
We should rely on `_queue.join()` for the eventual finish check.
```python
print_memory("[RDMA] After Local Engine Replicas and engine Creation")
```
```python
def _unregister_replica_memory(self, model_replica, transfer_engine) -> None:
```
Is there a way to guarantee the mapping is exact? Calling `memory_snapshot` can be expensive, no?
I think it's best if we store the `engine_param -> [memory, offset]` mapping at registration time.
Exactly. The best way here is to return the registered memory region's address from `register_memory_region_v2` (on the sglang side); then we can pass those addresses directly to `TE.unregister`. The current implementation mocks `register_memory_region_v2` on the training side.
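A sketch of that idea: a wrapper that records the address returned at registration time and replays it at unregister time, so no `memory_snapshot` call is needed. The wrapper name, the registry shape, and the assumption that `register_memory_region_v2` returns the base address are all hypothetical here:

```python
class TransferEngineWrapper:
    """Record registered addresses so unregistration can reuse them."""

    def __init__(self, engine):
        self.engine = engine
        self.weight_memory_registry = {}  # param name -> registered base addr

    def register_param(self, name, ptr, size):
        # Assumes register_memory_region_v2 returns the registered address.
        addr = self.engine.register_memory_region_v2(ptr, size)
        self.weight_memory_registry[name] = addr
        return addr

    def unregister_all(self):
        # Replay the recorded addresses; no memory_snapshot needed.
        for addr in self.weight_memory_registry.values():
            self.engine.unregister(addr)
        self.weight_memory_registry.clear()
```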
For now let's run pre-commit and fix the private-member issue; we can merge first and work on correctness as the first follow-up step.
Thanks for the review! I'll run pre-commit and resolve the issues raised in the comments.

This PR removes torch memory saver while preserving the offloading function. The overview of the offloading mechanism is:

[Overview diagram of the offloading mechanism]
After testing on the 1-to-1 config, the effect is:

```
[RDMA] Before offloading model replica: {'gpu': '0', 'total_GB': 139.8, 'free_GB': 57.41, 'used_GB': 82.39}
[RDMA] After offloading model replica:  {'gpu': '0', 'total_GB': 139.8, 'free_GB': 66.13, 'used_GB': 73.67}
```

i.e. offloading frees ~8.7 GB on GPU 0.