
Warn about checkpoint disk space only on the first checkpoint#1608

Open
mnoukhov wants to merge 3 commits into main from fix/reduce-disk-space-logging

Conversation

@mnoukhov
Contributor

@mnoukhov mnoukhov commented Apr 13, 2026

Summary

  • warn about low disk space only on the first checkpoint attempt of a GRPO run
  • keep stateless and scope the suppression to the checkpoint loop
  • keep unit coverage for the disk-space warning helper itself
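The summary bullets describe the behavior but not the code. What follows is a minimal Python sketch, not this PR's diff. It assumes a stateless helper built on `shutil.disk_usage`. The names `warn_if_low_disk_space` and `run_checkpoint_loop` and the `min_free_bytes` parameter are hypothetical. The helper remembers nothing between calls. The "only on the first checkpoint" rule lives entirely in the loop.

```python
import logging
import shutil
import tempfile

logger = logging.getLogger(__name__)


def warn_if_low_disk_space(path: str, min_free_bytes: int) -> bool:
    """Hypothetical stateless helper: log a warning and return True if the free
    space on the filesystem containing `path` is below `min_free_bytes`.
    It keeps no record of earlier calls, so it is easy to unit test."""
    free = shutil.disk_usage(path).free
    if free < min_free_bytes:
        logger.warning(
            "Low disk space at %s: %d bytes free (threshold %d)",
            path, free, min_free_bytes,
        )
        return True
    return False


def run_checkpoint_loop(output_dir: str, num_checkpoints: int, min_free_bytes: int) -> int:
    """Hypothetical checkpoint loop: the disk check runs only before the first
    checkpoint, so suppression is scoped to this loop rather than stored in
    global state. Returns how many low-space warnings were emitted."""
    warnings_emitted = 0
    for step in range(num_checkpoints):
        if step == 0 and warn_if_low_disk_space(output_dir, min_free_bytes):
            warnings_emitted += 1
        # Placeholder: a real training loop would save the checkpoint for `step` here.
    return warnings_emitted


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    with tempfile.TemporaryDirectory() as d:
        # An absurdly high threshold forces the warning, yet it fires only once.
        print(run_checkpoint_loop(d, num_checkpoints=5, min_free_bytes=10**30))  # prints 1
```

Keeping the helper free of state means a test can call it directly with any threshold, without resetting any hidden "already warned" flag between cases.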

GPU_TESTS=bypass

gemini-code-assist[bot]

This comment was marked as outdated.

@mnoukhov mnoukhov changed the title "Reduce repeated low disk space send-alert warnings" to "Warn about checkpoint disk space only on the first checkpoint" Apr 13, 2026
@mnoukhov mnoukhov enabled auto-merge April 13, 2026 23:25
@finbarrtimbers
Collaborator

My original goal for this was to catch the case where the disk fills up on a long run.

@mnoukhov
Contributor Author

I've been getting 100+ messages a day. I can turn it off in my own runs, but it gets annoying, and there's no way to turn it off or stop it once a run is going (especially if you're not the one using up the memory).
