[New Record] Overlap H2D transfers with GPU compute in PrefetchLoader #231
chloechiaw wants to merge 10 commits into KellerJordan:master from
Conversation
Nice! How's the timing looking on this? This seems like one of those small-but-real bits of overhead that might be hard to register on the clock. Would be interesting to see a trace file if you're feeling generous. 😊 (e.g., https://blog.underfit.ai/profiling-101-nanogpt)
@chrisjmccormick It's looking like 92069 ms, or about 1.534 minutes. Let me see if I can shave off 0.2 more. Running this on 8x H100. Just added it to the log too. And yes, I can make a trace file; thanks for sending the article!
@chrisjmccormick Added the trace. It looks like cudaMemcpyAsync is happening in the right areas: it's called during batch N and prefetches batch N+1 before it's needed. Attached the Perfetto trace; the copy happens at the end of the profiled step, after the backward pass, during the NCCL allGather calls, etc.
Any feedback appreciated!!
I don't have a good read on the timing improvement from this. If it's around 40 ms, I don't think it's worth merging right now, given the complexity of PrefetchLoader. If we want to hit 60 seconds, having this merged in now would slow that progression more than it would help, due to the added complexity of following the code. Will keep it in mind, though, and may revisit if improvements dry up.



This is a small PR that improves the current data loader by prefetching data from the CPU, addressing @varunneal's comment in PR #216. Heavy micro-optimization, but optimization nonetheless. @varunneal's comment suggested adding a hook in the optimizer step, but that led to a bunch of NCCL errors with threading and was also kind of messy. The first 6 training runs averaged ~1.493 minutes of training time.
Instead, I made PrefetchLoader use a dedicated CUDA copy stream to run H2D transfers in parallel with GPU compute, rather than on the default stream, where they block (again, heavy micro-optimization). Each send() returns the current batch's GPU tensors and immediately starts the next batch's H2D copy on the copy stream, so by the next call the transfer is already done. record_stream is used to prevent PyTorch's caching allocator from reusing the transferred tensors' memory before compute finishes with them.
Included a perfetto profile below in comments.
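For anyone following along, here is a minimal sketch of the copy-stream prefetch pattern described above. The class and method names are illustrative, not the PR's actual code, and it falls back to plain copies when CUDA is unavailable:

```python
import torch

class PrefetchLoader:
    """Wraps an iterable of CPU tensors; issues the H2D copy for batch N+1
    on a side stream while the GPU computes on batch N (sketch only)."""

    def __init__(self, loader, device="cuda"):
        self.loader = loader
        self.device = torch.device(device if torch.cuda.is_available() else "cpu")
        # Dedicated copy stream so transfers don't serialize on the default stream.
        self.copy_stream = torch.cuda.Stream() if self.device.type == "cuda" else None

    def _to_device(self, batch):
        # non_blocking=True only actually overlaps when the source is pinned.
        return batch.to(self.device, non_blocking=True)

    def __iter__(self):
        it = iter(self.loader)
        try:
            cpu = next(it)
        except StopIteration:
            return
        # Prime the pipeline: start the first H2D copy.
        if self.copy_stream is not None:
            with torch.cuda.stream(self.copy_stream):
                next_batch = self._to_device(cpu.pin_memory())
        else:
            next_batch = self._to_device(cpu)
        for cpu in it:
            batch = next_batch
            if self.copy_stream is not None:
                # Make the compute (current) stream wait for the pending copy,
                # and tell the caching allocator this tensor is in use on it.
                torch.cuda.current_stream().wait_stream(self.copy_stream)
                batch.record_stream(torch.cuda.current_stream())
                # Kick off the next batch's copy while compute proceeds.
                with torch.cuda.stream(self.copy_stream):
                    next_batch = self._to_device(cpu.pin_memory())
            else:
                next_batch = self._to_device(cpu)
            yield batch
        yield next_batch
```

The key design point is the `wait_stream` + `record_stream` pair: `wait_stream` orders compute after the copy, and `record_stream` keeps the allocator from handing the copied tensor's memory back out before work queued on the compute stream has finished with it.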