Skip to content

fix: A fix in megatron YARN module for memory leak#1163

Merged
terrykong merged 3 commits intoNVIDIA-NeMo:mainfrom
guyueh1:deepseek_fix_memory_leak
Sep 23, 2025
Merged

fix: A fix in megatron YARN module for memory leak#1163
terrykong merged 3 commits intoNVIDIA-NeMo:mainfrom
guyueh1:deepseek_fix_memory_leak

Conversation

@guyueh1
Copy link
Copy Markdown
Contributor

@guyueh1 guyueh1 commented Sep 18, 2025

What does this PR do ?

A bugfix for not being able to fully offload for deepseek

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Summary by CodeRabbit

  • Chores
    • Updated a pinned third‑party submodule reference to a different upstream branch/revision; repository metadata refreshed.
  • No User‑Facing Changes
    • No changes to features, behavior, public APIs, or exported interfaces.

Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@guyueh1 guyueh1 requested review from a team as code owners September 18, 2025 21:32
@guyueh1 guyueh1 self-assigned this Sep 18, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 18, 2025

Walkthrough

Updated the Megatron-LM submodule: changed the configured branch in .gitmodules and advanced the submodule commit pointer. No source code, public APIs, or other submodules were modified.

Changes

Cohort / File(s) Summary of Changes
Submodule config
\.gitmodules
Updated Megatron-LM submodule branch from yuya/nemo-rl-use-2 to guyueh/yuya/nemo-rl-use-2; remote URL, path, and shallow setting unchanged.
Submodule pointer
3rdparty/Megatron-LM-workspace/Megatron-LM
Advanced submodule commit pointer from 383d1144c3b3f77096c63b7308402a0ea6ba47dd to af73aa2cebf94a0bee5ea6dda2614ad989faffae (metadata-only update).

Sequence Diagram(s)

(omitted — changes are metadata-only and do not alter runtime control flow)

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

A rabbit nudges submodule trails,
A branch renamed, a tiny sail.
Commit hops forward, quiet and neat,
No code disturbed, just pointer feet.
Small changes, tidy paths—hooray! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The PR title states a memory-leak fix in the Megatron YARN module, but the provided diff is limited to updating a Megatron-LM submodule pointer/branch and metadata-only changes with no code or API edits that resolve a memory leak; thus the title is misleading and does not accurately summarize the primary change. Ask the author to either update the PR title and description to reflect that this is a submodule/metadata update (e.g., "chore: update Megatron-LM submodule") or include the missing bugfix commits, tests, and validation details that actually implement the YARN memory-leak fix before the PR is reviewed or merged.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 82658f8 and 88edb0d.

📒 Files selected for processing (1)
  • .gitmodules (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .gitmodules
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Comment @coderabbitai help to get the list of available commands and usage tips.

@guyueh1 guyueh1 changed the title bugfix: A fix in megatron YARN module for memory leak fix: A fix in megatron YARN module for memory leak Sep 18, 2025
@github-actions
Copy link
Copy Markdown

✅ Submodule Fast-Forward Check Results

Check based on commit: 82658f8 (PR #1163 from deepseek_fix_memory_leak)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c6be532 and 82658f8.

📒 Files selected for processing (2)
  • .gitmodules (1 hunks)
  • 3rdparty/Megatron-LM-workspace/Megatron-LM (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Coverage (e2e)
  • GitHub Check: Coverage (doc-test)
  • GitHub Check: Coverage (unit-test)
  • GitHub Check: Post automodel integration comment / Comment on PR

Comment thread .gitmodules Outdated
Comment thread 3rdparty/Megatron-LM-workspace/Megatron-LM
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
@github-actions
Copy link
Copy Markdown

✅ Submodule Fast-Forward Check Results

Check based on commit: 88edb0d (PR #1163 from deepseek_fix_memory_leak)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@guyueh1
Copy link
Copy Markdown
Contributor Author

guyueh1 commented Sep 22, 2025

@terrykong can you do a final review on this?

@terrykong terrykong added the CI:L1 Run doctests, unit tests, and functional tests label Sep 22, 2025
@terrykong terrykong enabled auto-merge (squash) September 22, 2025 18:18
@guyueh1 guyueh1 linked an issue Sep 22, 2025 that may be closed by this pull request
@github-actions
Copy link
Copy Markdown

✅ Submodule Fast-Forward Check Results

Check based on commit: 4b0b546 (PR #1163 from deepseek_fix_memory_leak)

✅ Submodules that are properly updated:

Megatron-LM: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@terrykong terrykong merged commit 66099f5 into NVIDIA-NeMo:main Sep 23, 2025
25 checks passed
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
Signed-off-by: Guyue Huang <guyueh@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refit regression for MoE models using Megatron-Bridge

4 participants