Increase timeout by Oseltamivir · Pull Request #1148 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-04-25T03:13:39Z

Increase timeout for mi355x dsv4 runs

github-actions · 2026-04-25T03:13:48Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-04-25T03:18:30Z

+- config-keys:
+    - dsv4-fp8-mi355x-sglang
+  description:
+    - "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh"
+    - "DSv4-Pro on MI355X exceeded the 3h cap (STEP CANCELLED DUE TO TIME LIMIT) due to ~30min MoE JIT compile plus slow torch-fallback kernels (SGLANG_HACK_FLASHMLA_BACKEND=torch et al.) from sgl-project/sglang#23608"
+    - "300 minutes matches the GH Actions outer timeout-minutes cap in benchmark-tmpl.yml"
+    - "Retriggering dsv4-fp8-mi355x-sglang"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1148


🟡 This new entry duplicates the block added by PR #1147 (immediately above at lines 1790-1796) — all four description bullets are verbatim copies. Since PR #1148's diff only modifies perf-changelog.yaml and does not touch runners/launch_mi355x-amds.sh, the bullet "Bump MI355X SLURM allocation from --time=180 to --time=300" misrepresents what this PR actually does. Consider replacing the four bullets with a single Retriggering dsv4-fp8-mi355x-sglang bullet (matching the convention at line 1727 for PR #1120), and/or backfilling pr-link: #1147 on the existing block instead.

Extended reasoning...

What the bug is. The new entry appended at lines 1798-1805 reuses, word-for-word, the four description bullets from the entry already in the file at lines 1790-1796. Both blocks have the same config-keys: [dsv4-fp8-mi355x-sglang] and the same description. The only differences are that the new block adds pr-link: #1148 and the existing block has no pr-link at all.

Why this is misleading. The first bullet — "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh" — describes a concrete code change to a runner script. But git log shows that change was already made by commit d40be91 (PR #1147, titled "Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang"), and the PR metadata for #1148 reports changed-files count=1 with only perf-changelog.yaml touched. So as written, the new entry claims credit for a script change this PR does not contain.

Step-by-step proof.

git log --oneline includes d40be91 Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang (#1147) — confirming PR Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147 already landed the script bump.

The PR diff for Increase timeout #1148 contains exactly one hunk, all in perf-changelog.yaml; nothing in runners/ is modified.

Reading lines 1790-1796 vs 1798-1805 of the post-merge file shows the four description: bullets are byte-identical between the two blocks.

Therefore after merging, the changelog will contain two consecutive entries for the same config-key describing the same change — one (from Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147) without a pr-link, and one (from Increase timeout #1148) with pr-link: #1148 that falsely attributes the SLURM bump to Increase timeout #1148.

Why existing convention doesn't catch this. The file has a precedent for retrigger-only entries: line 1727's PR #1120 entry uses a single bullet "Trigger H100 multinode evals after NVSHEMM fixes" rather than copying the underlying-fix entry's bullets. Following that convention here would have avoided the duplication entirely.

Impact. No runtime impact — perf-changelog.yaml is documentation. But the changelog is the canonical record of what each PR changed, and a duplicate block plus a mis-attributed bullet clutters that record and will confuse anyone trying to reconstruct when the SLURM time-limit was actually bumped.

Suggested fix. Either (a) replace the four-bullet description in this PR with a single Retriggering dsv4-fp8-mi355x-sglang bullet, mirroring the #1120 convention; or (b) drop the new block entirely from this PR and instead add pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1147 to the existing #1147 block at lines 1790-1796 (which is currently the only pr-link-less entry near the bottom of the file).

The merge of main into this branch (c0aec93) accidentally overwrote the two dsv4-fp8-mi355x-sglang retry entries (PR #1148 retry-pair tail and PR #1159 retry-pair) with duplicated copies of our own dsv4-fp4-gb200-dynamo-sglang entry. The process_changelog.py gate rejects deletions, so the workflow blocked. Restore the two mi355x entries verbatim from origin/main and keep a single copy of our dsv4 entry, appended after the restored mi355x block. perf-changelog.yaml diff vs origin/main is now additions-only.

dsv4-fp4-mi355x-atom (ROCm/ATOM#650 PR1, single-sequence at TP=8 with torch-fallback hc_pre because aiter mhc_pre crashes on this image) runs at ~5 min per request in steady state. With 1k1k at 12 prompts plus 8k1k at the same shape, the full sweep can exceed the 300-min cap that #1148 set for the SGLang-DSv4 path. Bump both the SLURM allocation in runners/launch_mi355x-amds.sh and the GitHub Actions timeout-minutes in benchmark-tmpl.yml together — either expiring first kills the job, so they need to stay aligned. Note: this is a global bump that affects every MI355X benchmark and every job that uses the shared workflow template, not just the dsv4 ATOM one. Drop back to 300 once the slow paths are gone (PR4 CUDAGraph + a working aiter MHC).

init

8eab576

Oseltamivir requested a review from a team April 25, 2026 03:13

github-project-automation Bot added this to InferenceMAX Board Apr 25, 2026

Oseltamivir merged commit 694f206 into main Apr 25, 2026
5 checks passed

Oseltamivir deleted the mi355x_retrigger branch April 25, 2026 03:13

github-project-automation Bot moved this to Done in InferenceMAX Board Apr 25, 2026

claude Bot reviewed Apr 25, 2026

View reviewed changes

This was referenced Apr 25, 2026

vLLM GB300 Day 0 DSV4 FP4 disagg #1150

Open

retrigger mi355x dsv4 #1159

Merged

mi355x dsv4 retry #1160

Merged

claude Bot mentioned this pull request Apr 26, 2026

DSV4 B300 DPA rerun #1162

Closed

claude Bot mentioned this pull request Apr 26, 2026

retry sglang b300 #1171

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Increase timeout#1148

Increase timeout#1148
Oseltamivir merged 1 commit intomainfrom
mi355x_retrigger

Oseltamivir commented Apr 25, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Apr 25, 2026

Uh oh!

Uh oh!

claude Bot Apr 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Oseltamivir commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions Bot commented Apr 25, 2026

Uh oh!

Uh oh!

claude Bot Apr 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Oseltamivir commented Apr 25, 2026 •

edited

Loading