ci: add LGTM Bot for external contributor merge visibility#6313
dagil-nvidia wants to merge 8 commits into main
Conversation
…nostics Add automated merge readiness tracking via the `lgtm` label and a short merge checklist comment that updates on CI completion and review changes. External contributors get a Required Reviewers comment showing which CODEOWNERS teams need to review their PR. The `/diagnose` slash command provides LLM-powered CI failure analysis using NVIDIA inference, giving root cause and fix suggestions. A reusable workflow lets existing CI pipelines auto-classify errors on failure. Architecture: 2 workflows + 1 reusable, 2 Python scripts, 1 shared LLM client package. All GitHub API calls use `gh` CLI. 1,307 total lines. Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Walkthrough: Adds LGTM Bot documentation, GitHub Actions workflows and scripts for CI diagnostics and error classification, and a new error_classification Python package with an LLM client and configuration utilities.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
Actionable comments posted: 10
🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around line 118-122: The loop currently sends empty logs from get_job_log to
client.classify_error; update the loop inside the jobs iteration to skip or mark
jobs with no log data by checking the return of get_job_log (e.g., if not log:
continue or set classification to a sentinel like "no_log") before calling
client.classify_error, so only non-empty logs are passed to classify_error and
results.append receives either a real classification or a clear "no_log"
classification for that job.
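The guard described above can be sketched as a small pure function. `get_job_log` and `classify_error` are injected here for testability; in the real script they are a module function and an LLM client method, so the exact shapes below are assumptions:

```python
def classify_jobs(jobs, get_job_log, classify_error):
    """Classify each failed job, skipping the LLM for jobs with no log data.

    Jobs whose log comes back empty get a "no_log" sentinel classification
    instead of an LLM call on empty input.
    """
    results = []
    for job in jobs:
        log = get_job_log(job["id"])
        if not log:
            # No log data: record a sentinel instead of calling the LLM.
            results.append({"job": job["name"], "classification": "no_log"})
            continue
        results.append({"job": job["name"], "classification": classify_error(log)})
    return results
```
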
- Around line 178-206: The subprocess.run calls inside the try block that post
or patch the GH comment (the branches that run when existing_id is truthy or
falsy) do not check return codes or report failures; update those calls to
either use subprocess.run(..., check=True) or capture the CompletedProcess and
inspect returncode, logging stderr/stdout on failure and raising or exiting
non‑zero so the script does not falsely report success; keep the os.unlink(tmp)
in the finally block to ensure cleanup, and add process/log context mentioning
the PR number and existing_id when logging the error.
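A hedged sketch of the checked-subprocess pattern the bullet asks for. The context keyword arguments (`pr_number`, `comment_id`) are illustrative, not the script's actual signature:

```python
import subprocess
import sys

def run_checked(cmd, *, pr_number=None, comment_id=None):
    """Run a command and fail loudly instead of silently swallowing errors.

    Returns stdout on success; logs stderr and raises RuntimeError with
    context (PR number, existing comment id) on failure.
    """
    r = subprocess.run(cmd, capture_output=True, text=True)
    if r.returncode != 0:
        ctx = f"pr={pr_number} comment={comment_id}"
        print(f"command failed ({ctx}): {r.stderr.strip()}", file=sys.stderr)
        raise RuntimeError(f"{cmd[0]} exited {r.returncode} ({ctx})")
    return r.stdout
```

`subprocess.run(..., check=True)` gets the same effect with less code; the explicit version above lets the error message carry PR context before raising.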
In @.github/scripts/lgtm_bot_diagnose.py:
- Around line 77-99: get_check_status currently only requests the first page of
check-runs (per_page=100) so runs beyond page 1 are dropped; update
get_check_status to paginate through the GitHub API (e.g., loop incrementing a
page parameter and merging data.get("check_runs", []) until no more results or
headers indicate end) and aggregate passed/failed/pending/failures across all
pages; apply the same pagination fix or add a TODO for the other affected
functions mentioned (get_review_summary and post_or_update_comment) so they also
fetch all pages instead of only the first.
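One way to sketch the pagination loop. `fetch_page(page)` stands in for a `gh api .../check-runs?page=N&per_page=100` call returning that page's parsed JSON; the short-page stopping heuristic is an assumption:

```python
def fetch_all_check_runs(fetch_page, per_page=100):
    """Aggregate check runs across all pages instead of only the first."""
    runs = []
    page = 1
    while True:
        data = fetch_page(page) or {}
        batch = data.get("check_runs", [])
        if not batch:
            break
        runs.extend(batch)
        if len(batch) < per_page:  # short page: no more results
            break
        page += 1
    return runs
```
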
- Around line 180-185: The generated markdown builds a link using f['url'] which
can be empty (html_url default ""), producing a broken link; update the
failed-checks loop in the function that builds `lines` (the block iterating over
checks["failures"]) to check the URL truthiness (e.g., f.get('url') or f['url']
!= "") and append "- check-name" as plain text when the URL is empty, otherwise
append the markdown link "- [check-name](url)"; ensure you reference the same
keys f['name'] and f['url'] so the output falls back to plain text for empty
URLs.
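The fallback can be isolated into a tiny renderer (the function name is hypothetical; the `name`/`url` keys match the ones the bullet references):

```python
def failed_check_line(f):
    """Render a failed check as a markdown bullet, falling back to plain
    text when the check has no URL (html_url defaults to "")."""
    name = f.get("name", "unknown")
    url = f.get("url") or ""
    return f"- [{name}]({url})" if url else f"- {name}"
```
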
- Around line 245-273: The subprocess.run calls that POST/PATCH GitHub comments
(the two subprocess.run invocations guarded by existing_id and the tmp variable)
silently ignore failures; change them to capture output and enforce failure
detection (e.g., use subprocess.run(..., check=True, capture_output=True) or
assign the result and check result.returncode), log or surface stderr/exception
details, and propagate the error (raise) so the caller (main) doesn't treat a
failed comment as success; ensure tmp is still unlinked in the finally block
after handling/logging the error.
In @.github/workflows/lgtm-bot.yml:
- Line 109: The self-exclusion filter currently using c.name !== 'LGTM Bot' is
incorrect and is a no-op; update the filter logic (the allChecks.filter call
that assigns ciChecks) to exclude the actual check run names produced by the
workflow (e.g., 'evaluate-lgtm' and 'request-reviews') instead of 'LGTM Bot' so
the bot does not evaluate its own runs — locate the allChecks.filter(...)
expression and replace or extend the predicate to skip c.name matching
'evaluate-lgtm' and 'request-reviews' (or use a whitelist/blacklist rule that
reliably identifies bot-generated check names).
- Around line 109-117: The current logic sets allPassed = ciChecks.every(c =>
c.conclusion === 'success'), which treats 'skipped' and 'neutral' as failures
and prevents the LGTM label; update the allPassed check (referencing ciChecks
and allPassed) to treat conclusions 'success', 'skipped', and 'neutral' as
passing (e.g., check c.conclusion is one of those values), and keep anyFailed
and anyPending semantics (anyFailed should still only consider explicit
'failure' conclusions) so conditional/matrix jobs that are skipped/neutral don't
block lgtm.
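The real predicate lives in JavaScript inside `lgtm-bot.yml`; this Python mirror just makes the suggested conclusion semantics concrete (skipped/neutral pass, only explicit failures block, unconcluded runs are pending):

```python
PASSING = {"success", "skipped", "neutral"}

def evaluate_checks(ci_checks):
    """Return (all_passed, any_failed, any_pending) for a list of check runs.

    Each check is a dict with a "conclusion" key that is None while pending.
    """
    all_passed = all(c["conclusion"] in PASSING for c in ci_checks)
    any_failed = any(c["conclusion"] == "failure" for c in ci_checks)
    any_pending = any(c["conclusion"] is None for c in ci_checks)
    return all_passed, any_failed, any_pending
```
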
- Around line 282-303: The reopen handler currently always calls
github.rest.issues.createComment with the constructed body (using marker '<!--
lgtm-bot-reviewers -->', rows from teamSlugs and variable body), which creates
duplicate "Required Reviewers" comments; change the logic to first list existing
issue comments (github.rest.issues.listComments), search for a comment whose
body includes the marker string '<!-- lgtm-bot-reviewers -->', and if found call
github.rest.issues.updateComment with that comment's id to replace the body,
otherwise call github.rest.issues.createComment as before; keep the same marker,
body construction, and owner/repo/issue_number values (and use prNumber) so the
find-and-update pattern mirrors the checklist implementation.
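The find-and-update pattern reads roughly like this; `create`/`update` stand in for `github.rest.issues.createComment`/`updateComment`, injected so the logic is testable:

```python
MARKER = "<!-- lgtm-bot-reviewers -->"

def upsert_comment(comments, body, create, update):
    """Update the existing marker comment in place, creating only if absent."""
    for c in comments:
        if MARKER in (c.get("body") or ""):
            update(c["id"], body)
            return "updated"
    create(body)
    return "created"
```
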
- Line 298: Summary: The PR comment uses a relative link
'[CODEOWNERS](../blob/main/CODEOWNERS)' which won't resolve in PR comments;
replace it with an absolute GitHub URL. Locate the string literal '*Based on
[CODEOWNERS](../blob/main/CODEOWNERS) and changed files.*' in
.github/workflows/lgtm-bot.yml and change the relative markdown link to an
absolute one that uses the repository context (for example using
https://github.com/${{ github.repository }}/blob/main/CODEOWNERS) so the link
resolves correctly in PR comments.
In `@error_classification/requirements.txt`:
- Line 1: The requirements.txt is missing the standard SPDX copyright header;
add the same SPDX header lines used across the repo (the SPDX-FileCopyrightText
and SPDX-License-Identifier lines) to the very top of requirements.txt, above
the existing "requests>=2.28.0" entry, ensuring the header text and formatting
exactly match other files in the project.
🧹 Nitpick comments (9)
.github/scripts/lgtm_bot_diagnose.py (1)

46-60: `gh_api` silently swallows errors — consider logging stderr for debuggability.

The sibling `gh_api` in `classify_workflow_errors.py` (line 55) logs `r.stderr[:200]` on failure, but this version returns `None` silently. In a CI diagnostic tool, this makes it harder to troubleshoot when API calls fail.

Suggested improvement:

```diff
 r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
 if r.returncode != 0:
+    print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
     return None
```

.github/LGTM_BOT.md (1)

63-71: Add a language identifier to the fenced code block.

The linter flags this block (MD040). Since it's a plain-text diagram, use ```` ```text ````.

.github/workflows/lgtm-bot-diagnose.yml (1)

24-27: `contains()` is a loose match — any comment containing `/diagnose` will trigger.

A comment like "I already ran /diagnose earlier" would re-trigger the workflow. Consider using `startsWith` or a stricter pattern if this becomes noisy. Low risk since the operation is read-only and idempotent.

error_classification/config.py (1)

24-48: LGTM — clean config loader with sensible defaults.

Minor nit: lines 37 and 42-46 mix `os.getenv` and `os.environ.get`, which are functionally identical but stylistically inconsistent. Picking one would be slightly cleaner.

error_classification/llm_client.py (3)

71-77: Sleeping inside the lock serializes all waiters unnecessarily.

`time.sleep()` inside `self._lock` blocks other threads from even checking the interval until the sleeping thread wakes and releases. A common pattern is to compute the needed delay inside the lock, update `_last_call` optimistically, then sleep outside the lock. For a CI bot with a handful of sequential LLM calls this is unlikely to matter in practice, but it's worth noting if the client is ever used with concurrent classification.

♻️ Optional: sleep outside the lock

```diff
 def wait(self) -> None:
+    delay = 0.0
     with self._lock:
         now = time.time()
         elapsed = now - self._last_call
         if elapsed < self._min_interval:
-            time.sleep(self._min_interval - elapsed)
-        self._last_call = time.time()
+            delay = self._min_interval - elapsed
+        self._last_call = now + delay
+    if delay > 0:
+        time.sleep(delay)
```

111-148: Retry loop conflates retriable HTTP errors with non-retriable parse failures.

Currently, both branches (lines 119-126 for HTTP 429/5xx and lines 137-148 for exceptions) sleep and retry. If the LLM returns HTTP 200 with unparseable output, the code retries the identical request 3 times with backoff — unlikely to yield different JSON. Consider distinguishing transient errors (network/429/5xx) from deterministic parse failures to avoid wasted retries and latency.

Also, the rate limiter `wait()` on line 95 is called only once before the retry loop, so retries bypass rate limiting. This is fine when retries follow a long backoff, but on a fast non-429 failure caught by the except branch the next attempt fires immediately (attempt 0 → catch → sleep 4s is fine, but logically it should re-consult the limiter).

Minor: if all 3 attempts hit the `continue` path (lines 119-126), the exception handler is never entered, yet the method still correctly returns `None` at line 150 — that's good.

132-133: Markdown-fence stripping can misfire on edge cases.

`content.split("\n", 1)[1]` raises `IndexError` when the content is exactly ```` ``` ```` with no newline (caught by the except handler, triggering a needless retry). More subtly, ```` rsplit("```", 1) ```` will clip valid JSON if the model output itself contains a triple-backtick string literal inside a value.

A slightly more robust approach:

♻️ Suggested improvement

```diff
-if content.startswith("```"):
-    content = content.split("\n", 1)[1].rsplit("```", 1)[0].strip()
+if content.startswith("```"):
+    lines = content.split("\n")
+    # Drop first (```json) and last (```) fence lines
+    if lines[-1].strip() == "```":
+        content = "\n".join(lines[1:-1]).strip()
+    else:
+        content = "\n".join(lines[1:]).strip()
```

.github/scripts/classify_workflow_errors.py (2)

164-172: Comment listing fetches only the first page (100 comments) — duplicate comment possible on busy PRs.

`gh_api` makes a single request with `per_page=100`. If a PR has more than 100 comments, the classifier marker may not be found, resulting in a duplicate comment being created. This is a low-probability edge case for most PRs.

106-111: `sys.path` mutation is fragile; consider making `error_classification` installable.

`sys.path.insert(0, ...)` at line 109 works but couples the script to a specific directory layout. If the `error_classification` package has a `requirements.txt` (per the summary), adding a minimal `pyproject.toml`/`setup.py` and installing it with `pip install ./error_classification` in the workflow would be cleaner and would avoid import path gymnastics.
🤖 LGTM Bot: Merge Checklist
- Add `issues: write` permission to evaluate-lgtm and request-reviews jobs
- Fix self-exclusion filter to use job names instead of workflow name
- Accept skipped/neutral check conclusions as passing
- Skip LLM classification when log is empty
- Add SPDX header to requirements.txt
- Log subprocess failures in comment posting
- Handle empty URLs in markdown links
- Find-and-update for reviewers comment to avoid duplicates on reopen
- Fix CODEOWNERS link to use absolute GitHub URL

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
/diagnose
…list format

- Add error_classification/** to ignore filter in .github/filters.yaml to fix changed-files coverage check failure
- Add workflow_dispatch trigger to lgtm-bot-diagnose.yml for manual testing (issue_comment only works from default branch)
- Rewrite merge checklist with CI counts, CODEOWNERS review status, and unresolved conversation detection via GraphQL

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around line 143-145: The confidence formatting uses f"{conf:.0%}" which treats
conf as a 0–1 fraction and will produce incorrect results for integer percent
values (e.g., 85 → "8500%"); update the pct calculation around the conf variable
in classify_workflow_errors.py to normalize and safely format values: parse
string confidences to a number, if conf > 1 and conf <= 100 divide by 100, clamp
values below 0 to 0 and above 1 to 1, then format as a percent (or fall back to
the original string if parsing fails). Implement this normalization logic where
pct is computed (the conf → pct assignment) or extract into a small helper like
format_confidence(conf) and use that.
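A possible shape for the suggested `format_confidence` helper; the exact normalization thresholds below are assumptions, not the script's actual values:

```python
def format_confidence(conf):
    """Normalize a confidence value (0-1 fraction, 0-100 percent, or string)
    to a percent string; fall back to the raw string if unparseable."""
    try:
        val = float(conf)
    except (TypeError, ValueError):
        return str(conf)
    if 1 < val <= 100:  # integer-percent style, e.g. 85 -> 0.85
        val /= 100
    val = min(max(val, 0.0), 1.0)  # clamp to [0, 1]
    return f"{val:.0%}"
```
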
- Around line 168-176: The comment search in _post_or_update_comment currently
calls gh_api("/repos/{repo}/issues/{pr_number}/comments?per_page=100") without
pagination, so it can miss the classifier marker and create duplicates; update
this logic to paginate through all comment pages (e.g., loop pages or use the
repo API's paginate helper) when fetching comments from gh_api, scanning each
page for COMMENT_MARKER and only setting existing_id if the marker is found,
then proceed to update or create as before.
In @.github/scripts/lgtm_bot_diagnose.py:
- Around line 130-151: The run_classifier function currently only catches
subprocess.TimeoutExpired and json.JSONDecodeError, so if subprocess.run raises
an OSError (e.g., FileNotFoundError when "python3" or the script is missing) the
program will crash; update run_classifier to also catch OSError (or
FileNotFoundError) around the subprocess.run call, log the exception to stderr
similarly to the existing except block (e.g., print f"Classifier failed:
{exc}"), and return an empty list on that error so main() can continue;
reference the run_classifier function and the subprocess.run invocation when
making this change.
In @.github/workflows/lgtm-bot.yml:
- Around line 193-196: The ternary assigning prefix is redundant because both
branches produce the same string; replace the ternary expression with a simple
assignment (e.g., const prefix = ' · ';), or remove prefix entirely and append
the separator directly when updating revLine in the block that handles
requestedTeams, referencing the variables requestedTeams, approved and revLine
to locate the code.
🧹 Nitpick comments (5)
.github/LGTM_BOT.md (1)

63-71: Add a language identifier to the fenced code block.

The architecture diagram block lacks a language specifier. Use ```` ```text ```` to satisfy markdown linters.

Proposed fix:

````diff
-```
+```text
 .github/workflows/lgtm-bot.yml  → label + checklist + CODEOWNERS
````

error_classification/llm_client.py (1)

63-78: Rate limiter holds the lock during sleep — acceptable here but worth a note.

`time.sleep()` inside the lock (line 76) blocks all other threads from even checking the limiter until the sleep completes. This is fine for this low-concurrency CI tooling context, but if `_RateLimiter` is ever reused in a higher-concurrency setting, consider computing the sleep duration under the lock and sleeping after releasing it.

.github/scripts/lgtm_bot_diagnose.py (1)

46-60: `gh_api` silently swallows errors — consider logging to stderr like the sibling script.

When `returncode != 0`, this function returns `None` with no logging. The equivalent function in `classify_workflow_errors.py` (line 55) logs `r.stderr[:200]` to stderr, which helps with debugging in CI. Consider adding similar logging here for consistency and observability.

Proposed fix:

```diff
 r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
 if r.returncode != 0:
+    print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
     return None
```

.github/workflows/lgtm-bot.yml (2)

305-330: Hardcoded CODEOWNERS mapping will silently drift from the actual `CODEOWNERS` file.

The team-to-path rules here are a manual mirror of the repo's `CODEOWNERS`. If `CODEOWNERS` is updated (new paths, renamed teams), this block won't reflect that until someone remembers to edit it. Consider adding an inline comment pointing maintainers to keep these in sync, or parsing the `CODEOWNERS` file at runtime.

70-71: Pin `actions/github-script@v7` to a commit SHA for supply-chain consistency.

Both usages (lines 70 and 291) rely on a mutable version tag. Other workflows in this repository (`lgtm-bot-diagnose.yml`, `classify-workflow-errors-reusable.yml`) pin `actions/checkout` and `actions/setup-python` by commit SHA to prevent tag hijacking. Apply the same pinning practice to github-script for consistency.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around line 47-58: The gh_api function currently calls json.loads(r.stdout)
without guarding against non-JSON output; wrap the json.loads call in a
try/except catching json.JSONDecodeError (and/or ValueError for broader
compatibility), and on decode failure print a short diagnostic (e.g., include
r.stdout[:200] or r.stderr[:200]) to sys.stderr similar to the existing error
path and return None, preserving the original "return parsed JSON or None"
contract; update the gh_api function and use the existing r variable for
diagnostics.
- Around line 259-265: The success message is printed unconditionally after
calling _post_or_update_comment in main(); change _post_or_update_comment to
either raise an exception on failure or return a boolean status, then update
main() to capture that result and only print the "Posted classification" message
when the call succeeded (use _post_or_update_comment(...) -> bool or allow
exceptions and wrap the call so you only call print(...) on success). Reference
_post_or_update_comment, _build_comment, body and pr_number when making the
conditional check or exception handling.
In @.github/scripts/lgtm_bot_diagnose.py:
- Around line 219-224: The AI analysis section can be omitted when an API key
exists and there are failed checks but the classifier produced no diagnoses;
update the logic that builds the "lines" list to handle an empty diagnoses
result by adding a fallback branch after the existing conditionals (e.g., after
the has_api_key / checks["failed"] branches) that checks if diagnoses is empty
and appends a clear message like "AI analysis attempted but returned no results
— please retry or check classifier logs." Refer to the variables names
has_api_key, checks["failed"], diagnoses and the lines list when implementing
this fallback so users always see feedback when classification was attempted.
In @.github/workflows/lgtm-bot.yml:
- Around line 346-347: The code computing isExternal uses pr.head.repo.full_name
and can throw if pr.head.repo is null; update the check to safely handle deleted
forks by guarding pr.head.repo (or using optional chaining) before accessing
full_name and treat missing head.repo as external only when appropriate (e.g.,
consider fallback to pr.head.repo?.full_name ?? '' or explicitly check
pr.head.repo !== null) so that isExternal is computed without raising a
TypeError.
🧹 Nitpick comments (2)
.github/workflows/lgtm-bot.yml (1)

305-330: Hardcoded CODEOWNERS mapping will silently drift from the actual `CODEOWNERS` file.

This duplicates (and can diverge from) the real `CODEOWNERS` file. Consider parsing `CODEOWNERS` at runtime or at least adding a comment referencing the source of truth so future maintainers know to update both places.

.github/scripts/lgtm_bot_diagnose.py (1)

46-60: `gh_api` swallows non-zero exit silently and doesn't guard against malformed JSON.

Two concerns:

- Unlike the sibling `gh_api` in `classify_workflow_errors.py`, this version doesn't log `r.stderr` on failure, making debugging harder.
- `json.loads(r.stdout)` on line 60 can raise `json.JSONDecodeError` if `gh` returns non-empty but malformed output, which would propagate as an unhandled exception.

Proposed fix:

```diff
 def gh_api(
     path: str,
     method: str = "GET",
     fields: dict[str, str] | None = None,
 ) -> Any:
     """Call GitHub API via ``gh`` CLI."""
     cmd = ["gh", "api", path]
     if method != "GET":
         cmd.extend(["--method", method])
     for k, v in (fields or {}).items():
         cmd.extend(["-f", f"{k}={v}"])
     r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
     if r.returncode != 0:
+        print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
         return None
-    return json.loads(r.stdout) if r.stdout.strip() else {}
+    try:
+        return json.loads(r.stdout) if r.stdout.strip() else {}
+    except json.JSONDecodeError:
+        print(f"gh api {path}: invalid JSON response", file=sys.stderr)
+        return None
```
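Putting both guards together, a testable sketch: the split into a separate parse helper is my addition (the script keeps this inline in `gh_api`), made so the "parsed JSON or None" contract can be exercised without the `gh` binary:

```python
import json
import subprocess
import sys

def parse_gh_output(returncode, stdout, stderr, path="<path>"):
    """Turn a finished `gh api` invocation into parsed JSON or None,
    logging stderr and malformed JSON instead of failing silently."""
    if returncode != 0:
        print(f"gh api {path}: {stderr[:200]}", file=sys.stderr)
        return None
    try:
        return json.loads(stdout) if stdout.strip() else {}
    except json.JSONDecodeError:
        print(f"gh api {path}: invalid JSON response", file=sys.stderr)
        return None

def gh_api(path, timeout=30):
    """Thin wrapper; kept separate from parsing so the contract is testable."""
    r = subprocess.run(["gh", "api", path], capture_output=True, text=True,
                       timeout=timeout)
    return parse_gh_output(r.returncode, r.stdout, r.stderr, path)
```
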
- Add SKIP_CLASSIFICATION list to skip LLM on lint, copyright, DCO, etc. where logs already explain exactly what to fix
- Fix confidence rendering: normalize 0-100 to 0-1 range
- Wrap gh_api JSON parsing in try/except for malformed responses
- Return success/failure from _post_or_update_comment
- Add OSError to classifier subprocess except tuple
- Add feedback when classifier returns empty results
- Fix dead ternary in review status line
- Guard against null pr.head.repo for deleted forks

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…diagnose

Consolidate classifier into diagnose script (-85 net lines):
- Merge classify_workflow_errors.py into lgtm_bot_diagnose.py
- Delete unused classify-workflow-errors-reusable.yml

Add GitLab CI integration:
- Fetch failed job details and logs via GitLab API (PRIVATE-TOKEN)
- Auto-classify timeout jobs without LLM
- Add contextual note explaining GitLab CI for external contributors

Group failures by merge impact:
- Blocking: 5 required status checks with sub-job diagnoses nested under their parent rollup check (WORKFLOW_TO_REQUIRED mapping)
- Non-blocking: GitLab and other informational failures

Improve output format:
- Compact one-liners for passing/pending, collapsible details for failures
- Reply-link posted to /diagnose comment with URL to diagnostic comment
- Include commit statuses (GitLab, CodeRabbit) in merge checklist counts
- Title case consistency across all bot output

Total: 1,381 lines across all bot files (down from 1,467).

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
issue_comment and workflow_dispatch triggers only activate when the workflow file exists on the default branch. Since this is a new file, workflow_dispatch is dead weight until merge. Remove it to simplify the workflow and avoid misleading test instructions. Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
dillon-cullinan
left a comment
I'm very much against this code. It is attempting to put things already visible in the Github UI into a single comment.
Along with that, in its current state it has multiple problems; not considering edge cases, redundant code, and even hallucinated code.
```python
# ---------------------------------------------------------------------------
def gh_api(
```
This function is defined twice across two files. Why?
```python
    fields: dict[str, str] | None = None,
) -> Any:
    """Call GitHub API via ``gh`` CLI. Returns parsed JSON or *None*."""
    cmd = ["gh", "api", path]
```
This isn't using the `gh` CLI the way I meant. We should be using subcommands like `gh pr checks ####` to get check statuses.
```python
    return None


def add_reaction(repo: str, comment_id: str, reaction: str) -> None:
```
Why is this a new function? All it does is call `gh_api`. Why not just use `gh_api` instead of defining this?
```python
own_jobs = {"evaluate-lgtm", "request-reviews", "diagnose"}

# 1. Check runs (GitHub Actions)
data = gh_api(f"/repos/{repo}/commits/{sha}/check-runs?per_page=100")
```
Why is it paginated? What if we have more than 100 jobs?
```python
)

# 2. Commit statuses (GitLab, CodeRabbit, etc.)
status_data = gh_api(f"/repos/{repo}/commits/{sha}/status")
```
What is the status API offering that check-runs is not?
```javascript
// Simple CODEOWNERS matching
const teams = new Set();
for (const f of files) {
  const p = f.filename;
  if (p.endsWith('.rs') || p.includes('Cargo.toml')) {
    teams.add('dynamo-rust-codeowners');
  }
  if (p.endsWith('.py')) {
    teams.add('python-codeowners');
    teams.add('Devops');
  }
  if (p.startsWith('.github/')) teams.add('Devops');
  if (p.startsWith('deploy/')) {
    teams.add('dynamo-deploy-codeowners');
  }
  if (
    p.startsWith('container/') ||
    p.startsWith('examples/') ||
    p.startsWith('recipes/')
  ) {
    teams.add('Devops');
    teams.add('dynamo-rust-codeowners');
    teams.add('python-codeowners');
    teams.add('dynamo-deploy-codeowners');
  }
}
```
```javascript
  '|------|--------|',
  rows,
  '',
  '*Based on [CODEOWNERS](https://github.com/ai-dynamo/dynamo/blob/main/CODEOWNERS) and changed files.*',
```
This is not true. This comment being built is based off hallucinated codeowners in this workflow. See this comment: https://github.com/ai-dynamo/dynamo/pull/6313/changes#r2819268809
```python
@dataclass
class Config:
```
What is the point of this class? Are we publishing a library or re-using this in some meaningful way?
```python
import sys
import time
from dataclasses import dataclass, field
from threading import Lock
```
Why are we importing threading?
```python
}

for attempt in range(3):
    try:
```
The sleep call should be within this for loop, not outside it.
|
/diagnose
For anyone who doesn't have context: @dagil-nvidia, @sozercan, and I were discussing the confusion around what it means for a PR to be approved. The GitHub UI is confusing since anyone can approve a PR, but we can merge only when it's approved by a codeowner of the specific part of the project that the PR touches. The GitHub UI also shows approval from anyone with write access, not approval from the owner of that area of the codebase. Kubernetes OSS repos have a system where you can type an `/lgtm` comment. An example of this is in a PR for the Cluster API project:

Problem
Dynamo has 180+ external contributors and 465+ external PRs submitted. When an external contributor opens a PR, key information is invisible to them:
Solution
- `lgtm` label
- `/diagnose` command

/diagnose Output

The `/diagnose` command posts a structured diagnostic grouped by merge impact.

Key design decisions:

- The reply link is posted on the `/diagnose` comment so the contributor gets notified with a direct URL to the diagnostic message, which will be refreshed to prevent comment bloat.

Merge Checklist
Architecture
```mermaid
flowchart TD
    subgraph triggers [Triggers]
        WR[workflow_run completed]
        PRR[pull_request_review]
        PRT[pull_request_target]
        IC["issue_comment /diagnose"]
    end
    subgraph workflows [Workflows]
        LB[lgtm-bot.yml]
        LD[lgtm-bot-diagnose.yml]
    end
    subgraph scripts [Scripts]
        DPY[lgtm_bot_diagnose.py]
    end
    subgraph pkg [error_classification/]
        LLC[llm_client.py]
        CFG[config.py]
    end
    subgraph external [External APIs]
        GH[GitHub API via gh CLI]
        GL[GitLab API via requests]
        NV[NVIDIA Inference API]
    end
    WR --> LB
    PRR --> LB
    PRT --> LB
    IC --> LD
    LB -->|"label + checklist + reviewers"| GH
    LD -->|run script| DPY
    DPY -->|check status| GH
    DPY -->|fetch logs| GL
    DPY --> LLC
    LLC --> CFG
    LLC -->|LLM call| NV
    DPY -->|post comment| GH
```

Files Changed
- .github/workflows/lgtm-bot.yml
- .github/workflows/lgtm-bot-diagnose.yml — /diagnose trigger
- .github/scripts/lgtm_bot_diagnose.py
- error_classification/llm_client.py
- error_classification/config.py
- error_classification/__init__.py
- error_classification/requirements.txt — requests dependency
- .github/LGTM_BOT.md
- .github/filters.yaml
- CONTRIBUTING.md — /diagnose docs

How to Test
Merge checklist + LGTM label (works now):
Triggers via `pull_request_review` and runs from the PR branch. The checklist comment is already posting on this PR.

Required Reviewers comment (works on fork PRs):
Open a PR from a fork — the bot requests CODEOWNERS reviews and posts a visibility comment.
`/diagnose` command (works after merge):

The `/diagnose` workflow uses the `issue_comment` trigger. GitHub only activates `issue_comment`-triggered workflows when the workflow file exists on the default branch (`main`). Since this PR introduces the workflow as a new file, the `/diagnose` command will activate once this PR is merged. This is a standard GitHub Actions constraint — not a bug. The example output above shows exactly what contributors will see. GitLab API integration was verified locally against real pipelines.

GitLab log fetching requires
`GITLAB_TOKEN` from the `GITLAB` environment (same token used by `trigger_ci.yml` for mirroring). Degrades gracefully without it.

Security

- `pull_request_target` only for label removal and reviewer requests — no PR code executed
- `/diagnose` gated to `author_association != NONE`
- `lgtm-bot.yml` uses `github-script@v7` tag

Acknowledgments
Thanks to @natevm for PR #6258 (workflow error classification) which inspired the LLM diagnosis feature, and to @dillon-cullinan for the thorough review of #6066 that shaped this rewrite.