Skip to content

Increase timeout#1148

Merged
Oseltamivir merged 1 commit intomainfrom
mi355x_retrigger
Apr 25, 2026
Merged

Increase timeout#1148
Oseltamivir merged 1 commit intomainfrom
mi355x_retrigger

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

@Oseltamivir Oseltamivir commented Apr 25, 2026

Increase timeout for mi355x dsv4 runs

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@Oseltamivir Oseltamivir merged commit 694f206 into main Apr 25, 2026
5 checks passed
@Oseltamivir Oseltamivir deleted the mi355x_retrigger branch April 25, 2026 03:13
Comment thread perf-changelog.yaml
Comment on lines +1798 to +1805
- config-keys:
- dsv4-fp8-mi355x-sglang
description:
- "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh"
- "DSv4-Pro on MI355X exceeded the 3h cap (STEP CANCELLED DUE TO TIME LIMIT) due to ~30min MoE JIT compile plus slow torch-fallback kernels (SGLANG_HACK_FLASHMLA_BACKEND=torch et al.) from sgl-project/sglang#23608"
- "300 minutes matches the GH Actions outer timeout-minutes cap in benchmark-tmpl.yml"
- "Retriggering dsv4-fp8-mi355x-sglang"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1148
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 This new entry duplicates the block added by PR #1147 (immediately above at lines 1790-1796) — all four description bullets are verbatim copies. Since PR #1148's diff only modifies perf-changelog.yaml and does not touch runners/launch_mi355x-amds.sh, the bullet "Bump MI355X SLURM allocation from --time=180 to --time=300" misrepresents what this PR actually does. Consider replacing the four bullets with a single Retriggering dsv4-fp8-mi355x-sglang bullet (matching the convention at line 1727 for PR #1120), and/or backfilling pr-link: #1147 on the existing block instead.

Extended reasoning...

What the bug is. The new entry appended at lines 1798-1805 reuses, word-for-word, the four description bullets from the entry already in the file at lines 1790-1796. Both blocks have the same config-keys: [dsv4-fp8-mi355x-sglang] and the same description. The only differences are that the new block adds pr-link: #1148 and the existing block has no pr-link at all.

Why this is misleading. The first bullet — "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh" — describes a concrete code change to a runner script. But git log shows that change was already made by commit d40be91 (PR #1147, titled "Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang"), and the PR metadata for #1148 reports changed-files count=1 with only perf-changelog.yaml touched. So as written, the new entry claims credit for a script change this PR does not contain.

Step-by-step proof.

  1. git log --oneline includes d40be91 Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang (#1147) — confirming PR Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147 already landed the script bump.
  2. The PR diff for Increase timeout #1148 contains exactly one hunk, all in perf-changelog.yaml; nothing in runners/ is modified.
  3. Reading lines 1790-1796 vs 1798-1805 of the post-merge file shows the four description: bullets are byte-identical between the two blocks.
  4. Therefore after merging, the changelog will contain two consecutive entries for the same config-key describing the same change — one (from Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147) without a pr-link, and one (from Increase timeout #1148) with pr-link: #1148 that falsely attributes the SLURM bump to Increase timeout #1148.

Why existing convention doesn't catch this. The file has a precedent for retrigger-only entries: line 1727's PR #1120 entry uses a single bullet "Trigger H100 multinode evals after NVSHEMM fixes" rather than copying the underlying-fix entry's bullets. Following that convention here would have avoided the duplication entirely.

Impact. No runtime impact — perf-changelog.yaml is documentation. But the changelog is the canonical record of what each PR changed, and a duplicate block plus a mis-attributed bullet clutters that record and will confuse anyone trying to reconstruct when the SLURM time-limit was actually bumped.

Suggested fix. Either (a) replace the four-bullet description in this PR with a single Retriggering dsv4-fp8-mi355x-sglang bullet, mirroring the #1120 convention; or (b) drop the new block entirely from this PR and instead add pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1147 to the existing #1147 block at lines 1790-1796 (which is currently the only pr-link-less entry near the bottom of the file).

Oseltamivir added a commit that referenced this pull request Apr 25, 2026
The merge of main into this branch (c0aec93) accidentally overwrote
the two dsv4-fp8-mi355x-sglang retry entries (PR #1148 retry-pair tail
and PR #1159 retry-pair) with duplicated copies of our own
dsv4-fp4-gb200-dynamo-sglang entry. The process_changelog.py gate
rejects deletions, so the workflow blocked.

Restore the two mi355x entries verbatim from origin/main and keep a
single copy of our dsv4 entry, appended after the restored mi355x
block. perf-changelog.yaml diff vs origin/main is now additions-only.
@claude claude Bot mentioned this pull request Apr 26, 2026
Oseltamivir added a commit that referenced this pull request Apr 26, 2026
dsv4-fp4-mi355x-atom (ROCm/ATOM#650 PR1, single-sequence at TP=8 with
torch-fallback hc_pre because aiter mhc_pre crashes on this image)
runs at ~5 min per request in steady state. With 1k1k at 12 prompts
plus 8k1k at the same shape, the full sweep can exceed the 300-min
cap that #1148 set for the SGLang-DSv4 path.

Bump both the SLURM allocation in runners/launch_mi355x-amds.sh and
the GitHub Actions timeout-minutes in benchmark-tmpl.yml together —
either expiring first kills the job, so they need to stay aligned.

Note: this is a global bump that affects every MI355X benchmark and
every job that uses the shared workflow template, not just the dsv4
ATOM one. Drop back to 300 once the slow paths are gone (PR4
CUDAGraph + a working aiter MHC).
@claude claude Bot mentioned this pull request Apr 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

1 participant