
[New Record] Overlap H2D transfers with GPU compute in PrefetchLoader #231

Status: Open

chloechiaw wants to merge 10 commits into KellerJordan:master from chloechiaw:h2d-prefetch

Conversation

@chloechiaw commented Feb 12, 2026:

This is a small PR that improves the current data loader by prefetching data from the CPU, addressing @varunneal's comment in PR #216. Heavy micro-optimization, but optimization nonetheless. @varunneal's comment suggested adding a hook in the optimizer step, but that approach led to a bunch of NCCL errors with threading and was also kind of messy. The first 6 training runs averaged ~1.493 minutes of training time.

Instead, I made PrefetchLoader use a dedicated CUDA copy stream so H2D transfers run in parallel with GPU compute, instead of on the default stream, where they block (again, heavy micro-optimization). Each send() returns the current batch's GPU tensors and immediately starts the next batch's H2D copy on the copy stream, so by the next call the transfer is already done. record_stream() is used to prevent PyTorch's caching allocator from reusing the transferred tensors' memory before compute finishes with them.
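For readers unfamiliar with the pattern, here is a minimal sketch of the copy-stream technique described above (not the PR's actual code; the class shape, a single-tensor batch, and the iterator interface are assumptions for illustration):

```python
# Hypothetical sketch: overlap H2D copies with compute via a side CUDA stream.
import torch

class PrefetchLoader:
    """Wraps an iterator of CPU batches (assumed here to be single tensors);
    copies the *next* batch host->device on a side stream while the GPU
    computes on the current batch."""

    def __init__(self, batches, device="cuda"):
        self.batches = iter(batches)
        self.device = torch.device(device)
        # Dedicated stream so the copy doesn't serialize behind compute
        # on the default stream.
        self.copy_stream = torch.cuda.Stream(device=self.device)
        self.next_batch = None
        self._prefetch()  # start the first H2D copy immediately

    def _prefetch(self):
        try:
            cpu_batch = next(self.batches)
        except StopIteration:
            self.next_batch = None
            return
        # Pinned host memory is required for the copy to be truly async.
        with torch.cuda.stream(self.copy_stream):
            self.next_batch = cpu_batch.pin_memory().to(
                self.device, non_blocking=True
            )

    def send(self):
        batch = self.next_batch
        if batch is None:
            raise StopIteration
        # Make the compute (current) stream wait until the copy has finished.
        torch.cuda.current_stream().wait_stream(self.copy_stream)
        # Tell the caching allocator this tensor is in use on the compute
        # stream, so its memory isn't recycled before compute is done.
        batch.record_stream(torch.cuda.current_stream())
        self._prefetch()  # kick off the next batch's H2D right away
        return batch
```

By the time training code calls send() for batch N+1, its copy was issued during batch N's compute and is typically already complete, so the call returns without stalling.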

A Perfetto profile is included in the comments below.

@chloechiaw chloechiaw marked this pull request as draft February 12, 2026 02:14
@chrisjmccormick (Contributor) commented:

Nice! How is the timing looking on this? This seems like one of those small-but-real bits of overhead that might be difficult to register on the clock.

Would be interesting to see a trace file if you're feeling generous. 😊 (e.g., https://blog.underfit.ai/profiling-101-nanogpt)
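For reference, a trace like that can be captured with torch.profiler and opened in Perfetto. A minimal sketch (CPU-only so it runs anywhere; on a GPU box you would add ProfilerActivity.CUDA, and the workload and file name here are placeholders):

```python
# Minimal sketch: export a Chrome/Perfetto-compatible trace with torch.profiler.
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(256, 256)  # stand-in workload
with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(3):
        x = x @ x

# Writes a JSON trace; open it at ui.perfetto.dev or chrome://tracing.
prof.export_chrome_trace("trace.json")
```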

@chloechiaw (Author) commented:

@chrisjmccormick It's looking like 92,069 ms, or ~1.534 minutes. Let me see if I can shave off another ~0.2 minutes. Running this on 8x H100. Just added it to the log too. And yes, I can make a trace file; thanks for sending the article!

@chloechiaw (Author) commented:

@chrisjmccormick added the trace

It looks like cudaMemcpyAsync is happening in the right places: it's called during batch N, prefetching before batch N+1 begins. In the attached Perfetto trace, it lands at the end of the profiled step, after the backward pass, during the NCCL allGather calls, etc.

[screenshot: Perfetto trace]

any feedback appreciated!!

@chloechiaw chloechiaw changed the title [New Record] Add prefetch class and set right GPU device [WIP] Add prefetch class and set right GPU device Feb 12, 2026
@chrisjmccormick (Contributor) commented:

Nice, thanks for sharing the trace! I believe the gain this achieves is at the end of the optimizer step / beginning of the next forward pass. You've eliminated this (from a past baseline):

[image: trace from a past baseline]

And then from your trace file:

[image: trace from this PR]

These traces are from the beginning of training, too, so the benefit is probably even more significant in the later stages where the batch sizes are bigger.

Awesome!

@chloechiaw chloechiaw marked this pull request as ready for review February 22, 2026 18:59
@chloechiaw chloechiaw changed the title [WIP] Add prefetch class and set right GPU device [New Record] Overlap H2D transfers with GPU compute in PrefetchLoader Feb 22, 2026
@ClassicLarry (Collaborator) commented:

I don't have a good read on the timing improvement from this. If it's around 40 ms, I don't think it's worth merging in right now, given the complexity of PrefetchLoader.

I'm thinking that if we want to hit 60 seconds, merging this in right now is going to slow down that progression more than it will help, due to the added complexity of following the code. Will keep it in mind, though, and may revisit if improvements dry up.
