Increase timeout#1148
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
| - config-keys: | ||
| - dsv4-fp8-mi355x-sglang | ||
| description: | ||
| - "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh" | ||
| - "DSv4-Pro on MI355X exceeded the 3h cap (STEP CANCELLED DUE TO TIME LIMIT) due to ~30min MoE JIT compile plus slow torch-fallback kernels (SGLANG_HACK_FLASHMLA_BACKEND=torch et al.) from sgl-project/sglang#23608" | ||
| - "300 minutes matches the GH Actions outer timeout-minutes cap in benchmark-tmpl.yml" | ||
| - "Retriggering dsv4-fp8-mi355x-sglang" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1148 |
There was a problem hiding this comment.
🟡 This new entry duplicates the block added by PR #1147 (immediately above at lines 1790-1796) — all four description bullets are verbatim copies. Since PR #1148's diff only modifies perf-changelog.yaml and does not touch runners/launch_mi355x-amds.sh, the bullet "Bump MI355X SLURM allocation from --time=180 to --time=300" misrepresents what this PR actually does. Consider replacing the four bullets with a single Retriggering dsv4-fp8-mi355x-sglang bullet (matching the convention at line 1727 for PR #1120), and/or backfilling pr-link: #1147 on the existing block instead.
Extended reasoning...
What the bug is. The new entry appended at lines 1798-1805 reuses, word-for-word, the four description bullets from the entry already in the file at lines 1790-1796. Both blocks have the same config-keys: [dsv4-fp8-mi355x-sglang] and the same description. The only differences are that the new block adds pr-link: #1148 and the existing block has no pr-link at all.
Why this is misleading. The first bullet — "Bump MI355X SLURM allocation from --time=180 to --time=300 in runners/launch_mi355x-amds.sh" — describes a concrete code change to a runner script. But git log shows that change was already made by commit d40be91 (PR #1147, titled "Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang"), and the PR metadata for #1148 reports changed-files count=1 with only perf-changelog.yaml touched. So as written, the new entry claims credit for a script change this PR does not contain.
Step-by-step proof.
git log --onelineincludesd40be91 Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang (#1147)— confirming PR Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147 already landed the script bump.- The PR diff for Increase timeout #1148 contains exactly one hunk, all in
perf-changelog.yaml; nothing inrunners/is modified. - Reading lines 1790-1796 vs 1798-1805 of the post-merge file shows the four
description:bullets are byte-identical between the two blocks. - Therefore after merging, the changelog will contain two consecutive entries for the same
config-keydescribing the same change — one (from Bump MI355X SLURM time-limit to 300m; retrigger dsv4-fp8-mi355x-sglang #1147) without apr-link, and one (from Increase timeout #1148) withpr-link: #1148that falsely attributes the SLURM bump to Increase timeout #1148.
Why existing convention doesn't catch this. The file has a precedent for retrigger-only entries: line 1727's PR #1120 entry uses a single bullet "Trigger H100 multinode evals after NVSHEMM fixes" rather than copying the underlying-fix entry's bullets. Following that convention here would have avoided the duplication entirely.
Impact. No runtime impact — perf-changelog.yaml is documentation. But the changelog is the canonical record of what each PR changed, and a duplicate block plus a mis-attributed bullet clutters that record and will confuse anyone trying to reconstruct when the SLURM time-limit was actually bumped.
Suggested fix. Either (a) replace the four-bullet description in this PR with a single Retriggering dsv4-fp8-mi355x-sglang bullet, mirroring the #1120 convention; or (b) drop the new block entirely from this PR and instead add pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1147 to the existing #1147 block at lines 1790-1796 (which is currently the only pr-link-less entry near the bottom of the file).
The merge of main into this branch (c0aec93) accidentally overwrote the two dsv4-fp8-mi355x-sglang retry entries (PR #1148 retry-pair tail and PR #1159 retry-pair) with duplicated copies of our own dsv4-fp4-gb200-dynamo-sglang entry. The process_changelog.py gate rejects deletions, so the workflow blocked. Restore the two mi355x entries verbatim from origin/main and keep a single copy of our dsv4 entry, appended after the restored mi355x block. perf-changelog.yaml diff vs origin/main is now additions-only.
dsv4-fp4-mi355x-atom (ROCm/ATOM#650 PR1, single-sequence at TP=8 with torch-fallback hc_pre because aiter mhc_pre crashes on this image) runs at ~5 min per request in steady state. With 1k1k at 12 prompts plus 8k1k at the same shape, the full sweep can exceed the 300-min cap that #1148 set for the SGLang-DSv4 path. Bump both the SLURM allocation in runners/launch_mi355x-amds.sh and the GitHub Actions timeout-minutes in benchmark-tmpl.yml together — either expiring first kills the job, so they need to stay aligned. Note: this is a global bump that affects every MI355X benchmark and every job that uses the shared workflow template, not just the dsv4 ATOM one. Drop back to 300 once the slow paths are gone (PR4 CUDAGraph + a working aiter MHC).
Increase timeout for mi355x dsv4 runs