[New Record] Overlap H2D transfers with GPU compute in PrefetchLoader by chloechiaw · Pull Request #231 · KellerJordan/modded-nanogpt

chloechiaw · 2026-02-12T02:14:06Z

This is a small PR that improves the current data loader by prefetching data on the CPU, addressing @varunneal's comment in PR #216 . Heavy micro-optimization, but optimization nonetheless. @varunneal comment suggests to add a hook in the optimizer step, but it led to a bunch of NCCL errors with threading and was also kind of messy. The first 6 training runs averaged ~1.493 minutes for training

Instead, I made PrefetchLoader use a dedicated CUDA copy stream to run H2D transfers in parallel with GPU compute, instead of running them on the default stream where they block (again heavy micro-optimization). Each send() returns the current batch's GPU tensors and immediately starts the next batch's H2D on the copy stream, so by the next call the transfer is already done. record_stream is used to prevent PyTorch's memory allocator from reusing the transferred tensors before compute finishes with them.

Included a perfetto profile below in comments.

chrisjmccormick · 2026-02-12T04:18:34Z

Nice! How's it going timing this? This seems like one of those small-but-real bits of overhead that might be difficult to register on the clock.

Would be interesting to see a trace file if you're feeling generous. 😊 (e.g., https://blog.underfit.ai/profiling-101-nanogpt)

chloechiaw · 2026-02-12T20:19:33Z

@chrisjmccormick It's looking like 92069ms or 1.5344833 minutes. Let me see if I can -0.2 more. Running this on 8x H100. Just added in the log too. And yes I can make a trace file thanks for sending the article!

chloechiaw · 2026-02-12T21:12:32Z

@chrisjmccormick added the trace

it looks like the cudaMemCpyAsync is happening in the right areas? it's being called at N batch and prefetching before N+1 batch occurs. Attached perfetto, it happens at the end of the profiling step after the backward pass during the NCCL allGather calls, etc.

any feedback appreciated!!

chrisjmccormick · 2026-02-13T18:25:47Z

Nice, thanks for sharing the trace! I believe the gain this achieves is over at the end of the optimizer / beginning of the next forward pass. You've eliminated this (from a past baseline):

And then from your trace file:

These traces are from the beginning of training, too, so the benefit is probably even more significant in the later stages where the batch sizes are bigger.

Awesome!

ClassicLarry · 2026-02-26T07:44:53Z

I don't have a good read on the timing improvement from this. If its around 40ms, I don't think its worth merging in right now, given the complexity of PrefetchLoader.

Thinking that if we want to hit 60 seconds, having this merged in right now is going to slow down that progression more than it will help due to the added complexity of following the code. Will keep in mind though and may revisit if improvements dry up.

Add prefetch class and set right GPU device

d3de45a

chloechiaw marked this pull request as draft February 12, 2026 02:14

“Chloe added 2 commits February 12, 2026 19:52

switching to one background thread

0bbe5c6

add log

58c1638

add trace

9dc1411

chloechiaw changed the title ~~[New Record] Add prefetch class and set right GPU device~~ [WIP] Add prefetch class and set right GPU device Feb 12, 2026

“Chloe added 5 commits February 16, 2026 20:46

commit before merging

02e5f2b

added w merge updates

e6ae8ca

revert to better loss

1f75b64

final changes

e70cdac

Add times

5d65eb3

chloechiaw marked this pull request as ready for review February 22, 2026 18:59

chloechiaw changed the title ~~[WIP] Add prefetch class and set right GPU device~~ [New Record] Overlap H2D transfers with GPU compute in PrefetchLoader Feb 22, 2026

Delete train_gpt_1.py

34df44b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[New Record] Overlap H2D transfers with GPU compute in PrefetchLoader#231

[New Record] Overlap H2D transfers with GPU compute in PrefetchLoader#231
chloechiaw wants to merge 10 commits intoKellerJordan:masterfrom
chloechiaw:h2d-prefetch

chloechiaw commented Feb 12, 2026 •

edited

Loading

Uh oh!

chrisjmccormick commented Feb 12, 2026

Uh oh!

chloechiaw commented Feb 12, 2026

Uh oh!

chloechiaw commented Feb 12, 2026

Uh oh!

chrisjmccormick commented Feb 13, 2026

Uh oh!

ClassicLarry commented Feb 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

chloechiaw commented Feb 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chrisjmccormick commented Feb 12, 2026

Uh oh!

chloechiaw commented Feb 12, 2026

Uh oh!

chloechiaw commented Feb 12, 2026

Uh oh!

chrisjmccormick commented Feb 13, 2026

Uh oh!

ClassicLarry commented Feb 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chloechiaw commented Feb 12, 2026 •

edited

Loading