
ci: add LGTM Bot for external contributor merge visibility #6313

Draft
dagil-nvidia wants to merge 8 commits into `main` from `dagil-nvidia/ci-lgtm-bot-v2`

Conversation


@dagil-nvidia dagil-nvidia commented Feb 15, 2026

Problem

Dynamo has 180+ external contributors and 465+ external PRs submitted. When an external contributor opens a PR, key information is invisible to them:

  1. They can't see who needs to review. CODEOWNERS teams are hidden from fork PRs — contributors have no idea which teams they're waiting on.
  2. They can't tell why CI failed. CI spans GitHub Actions and internal GitLab GPU tests. GitHub shows raw logs; GitLab is behind a login wall. Contributors can't distinguish infra flakes from real code errors.
  3. They don't know when they're ready to merge. No single "all gates pass" signal exists.

Not addressed in this PR: Automatic individual reviewer assignment. CODEOWNERS team requests are included, but round-robin assignment within a team is a potential follow-up.

Solution

| Feature | What It Does | Why GitHub Doesn't |
| --- | --- | --- |
| `lgtm` label | Applied when ALL gates pass; removed on new pushes. Could be used for future auto-merge. | No native "all gates pass" signal. |
| `/diagnose` | LLM-powered CI failure analysis grouped by blocking vs non-blocking. Fetches internal GitLab CI logs. Self-explanatory checks (lint, DCO, copyright) are never sent to the LLM. | GitHub shows raw logs; GitLab is inaccessible. |
| Required Reviewers | Posts which CODEOWNERS teams need to review. | The required-reviewers sidebar is not visible to external contributors. |
| Merge Checklist | Auto-updating 3-line status (CI, reviews, mergeable). | Status is spread across tabs. |
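The "all gates pass" decision behind the `lgtm` label can be sketched as follows. This is an illustrative Python sketch, not the workflow's actual github-script JavaScript; the excluded check names and the passing treatment of `skipped`/`neutral` conclusions follow the review feedback on this PR.

```python
# Illustrative sketch of the lgtm gate. The real logic lives in
# .github/workflows/lgtm-bot.yml; evaluate-lgtm and request-reviews are the
# bot's own check runs, excluded so the bot never evaluates itself.
PASSING = {"success", "skipped", "neutral"}  # skipped/neutral don't block lgtm
BOT_CHECKS = {"evaluate-lgtm", "request-reviews"}

def evaluate_lgtm(check_runs, approvals, approvals_required=1):
    """True when every non-bot CI check passed and reviews are satisfied."""
    ci = [c for c in check_runs if c["name"] not in BOT_CHECKS]
    any_pending = any(c.get("conclusion") is None for c in ci)
    all_passed = all(c.get("conclusion") in PASSING for c in ci)
    return all_passed and not any_pending and approvals >= approvals_required
```

On a new push the label would be removed and this gate re-evaluated once CI completes.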

/diagnose Output

The /diagnose command posts a structured diagnostic grouped by merge impact:

## CI Diagnostics

**1 Blocking**, 3 non-blocking · 28 passed, 1 pending · 2 approvals

### Blocking

- ✅ copyright-checks
- ✅ DCO
- ❌ dynamo-status-check
  <details><summary>Build and Test - dynamo: Infrastructure Error (92%)</summary>
  etcd image not found. Infrastructure issue -- not caused by your PR. Wait for re-run.
  </details>
- ✅ backend-status-check
- ⏳ pre-merge-status-check

### Non-Blocking

*GitLab CI runs E2E integration tests that require GPU compute
on internal infrastructure. Logs are not publicly accessible — fetched here via API.*

<details><summary>sglang-x86-2GPU: Code Error (85%)</summary>
test_sglang_deployment[disaggregated-2] returned 404. Check SGLang config.
</details>

<details><summary>vllm-arm-1GPU: Timeout (95%)</summary>
Job exceeded time limit. Infrastructure issue -- no action needed.
</details>

Key design decisions:

  • Blocking lists all 5 required status checks; sub-job diagnoses nest under their parent rollup check
  • Non-blocking shows only failures; GitLab jobs include a note that logs were fetched from internal CI
  • Self-explanatory checks (e.g., copyright-checks, DCO) and timeouts that need no further explanation are never sent to the LLM for classification
  • A reply link is posted under the /diagnose comment so the contributor is notified with a direct URL; the diagnostic message itself is refreshed in place to prevent comment bloat
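The "never sent to the LLM" rule can be expressed as a small pre-filter. A hypothetical sketch; the skip list and function name are illustrative, not the actual `lgtm_bot_diagnose.py` code (which, per the review feedback, also skips empty logs):

```python
# Hypothetical pre-filter deciding which failed checks go to the LLM.
# SELF_EXPLANATORY and needs_llm are illustrative names, not the PR's API.
SELF_EXPLANATORY = {"copyright-checks", "DCO", "lint", "pre-commit"}

def needs_llm(check_name: str, log: str) -> bool:
    """Classify only non-trivial failures that have real log content."""
    if check_name in SELF_EXPLANATORY:
        return False  # the check name alone tells the contributor what to fix
    if not log.strip():
        return False  # nothing to classify; skip the API call entirely
    return True
```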

Merge Checklist

## 🤖 LGTM Bot
### Merge Checklist
- [ ] **CI Checks:** 2 failing, 10 passed, 3 pending, 8 skipped
- [ ] **Review Status:** Awaiting Approval · CODEOWNERS: @ai-dynamo/Devops, @ai-dynamo/python-codeowners
- [ ] **Mergeable:** Merge Conflicts, 3 unresolved conversations
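Rendering a checklist like the one above is mostly string assembly; a minimal sketch (function name and parameters are illustrative, since the real markdown is built inside lgtm-bot.yml):

```python
def render_checklist(ci_ok: bool, ci_summary: str,
                     reviews_ok: bool, review_summary: str,
                     mergeable: bool, merge_summary: str) -> str:
    """Assemble the 3-line merge checklist as GitHub task-list markdown."""
    def item(done: bool, label: str, detail: str) -> str:
        return f"- [{'x' if done else ' '}] **{label}:** {detail}"
    return "\n".join([
        "## 🤖 LGTM Bot",
        "### Merge Checklist",
        item(ci_ok, "CI Checks", ci_summary),
        item(reviews_ok, "Review Status", review_summary),
        item(mergeable, "Mergeable", merge_summary),
    ])
```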

Architecture

flowchart TD
    subgraph triggers [Triggers]
        WR[workflow_run completed]
        PRR[pull_request_review]
        PRT[pull_request_target]
        IC["issue_comment /diagnose"]
    end

    subgraph workflows [Workflows]
        LB[lgtm-bot.yml]
        LD[lgtm-bot-diagnose.yml]
    end

    subgraph scripts [Scripts]
        DPY[lgtm_bot_diagnose.py]
    end

    subgraph pkg [error_classification/]
        LLC[llm_client.py]
        CFG[config.py]
    end

    subgraph external [External APIs]
        GH[GitHub API via gh CLI]
        GL[GitLab API via requests]
        NV[NVIDIA Inference API]
    end

    WR --> LB
    PRR --> LB
    PRT --> LB
    IC --> LD

    LB -->|"label + checklist + reviewers"| GH
    LD -->|run script| DPY
    DPY -->|check status| GH
    DPY -->|fetch logs| GL
    DPY --> LLC
    LLC --> CFG
    LLC -->|LLM call| NV
    DPY -->|post comment| GH

Files Changed

| File | Lines | Purpose |
| --- | --- | --- |
| `.github/workflows/lgtm-bot.yml` | 391 | Label + checklist + reviewer request |
| `.github/workflows/lgtm-bot-diagnose.yml` | 50 | `/diagnose` trigger |
| `.github/scripts/lgtm_bot_diagnose.py` | 658 | Consolidated diagnostics: GitHub + GitLab + LLM + blocking grouping |
| `error_classification/llm_client.py` | 150 | OpenAI-compatible LLM client |
| `error_classification/config.py` | 48 | Env config |
| `error_classification/__init__.py` | 14 | Package init |
| `error_classification/requirements.txt` | 16 | `requests` dependency |
| `.github/LGTM_BOT.md` | 72 | Maintainer docs |
| `.github/filters.yaml` | +1 | CI filter coverage |
| `CONTRIBUTING.md` | +25 | CI trigger guide + `/diagnose` docs |
| **Total** | ~1,425 | |

How to Test

Merge checklist + LGTM label (works now):
Triggers via pull_request_review and runs from the PR branch. The checklist comment is already posting on this PR.

Required Reviewers comment (works on fork PRs):
Open a PR from a fork — the bot requests CODEOWNERS reviews and posts a visibility comment.

/diagnose command (works after merge):
The /diagnose workflow uses the issue_comment trigger. GitHub only activates issue_comment-triggered workflows when the workflow file exists on the default branch (main). Since this PR introduces the workflow as a new file, the /diagnose command will activate once this PR is merged. This is a standard GitHub Actions constraint — not a bug. The example output above shows exactly what contributors will see. GitLab API integration was verified locally against real pipelines.

GitLab log fetching requires GITLAB_TOKEN from the GITLAB environment (same token used by trigger_ci.yml for mirroring). Degrades gracefully without it.
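The graceful degradation amounts to a guard at the top of the fetch path. A sketch under stated assumptions: the function name and base URL are illustrative, and only `GITLAB_TOKEN` is taken from the text above.

```python
import os
from typing import Optional

def fetch_gitlab_job_trace(job_id: int,
                           base_url: str = "https://gitlab.example.com/api/v4") -> Optional[str]:
    """Return the job log, or None when no GITLAB_TOKEN is configured."""
    token = os.environ.get("GITLAB_TOKEN")
    if not token:
        # Graceful degradation: the diagnostic still posts, but GitLab jobs
        # report that logs were unavailable instead of an LLM classification.
        return None
    # Real fetch (sketch, using requests):
    #   requests.get(f"{base_url}/jobs/{job_id}/trace",
    #                headers={"PRIVATE-TOKEN": token}, timeout=30).text
    return f"<trace for job {job_id}>"  # placeholder for the fetched log text
```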

Security

  • pull_request_target only for label removal and reviewer requests — no PR code executed
  • /diagnose gated to author_association != NONE
  • LLM API key and GitLab token optional — degrades gracefully
  • Actions SHA-pinned in diagnose workflow (checkout, setup-python); lgtm-bot.yml uses github-script@v7 tag
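The `/diagnose` gate combines a comment-body match with the `author_association` check. A minimal sketch; note it uses prefix matching, which is stricter than the workflow's `contains()` (a looseness the review flagged):

```python
def should_run_diagnose(comment_body: str, author_association: str) -> bool:
    """Trigger only on comments that start with /diagnose, posted by users
    with some repository association (author_association != NONE)."""
    return (comment_body.strip().startswith("/diagnose")
            and author_association != "NONE")
```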

Acknowledgments

Thanks to @natevm for PR #6258 (workflow error classification) which inspired the LLM diagnosis feature, and to @dillon-cullinan for the thorough review of #6066 that shaped this rewrite.

…nostics

Add automated merge readiness tracking via the `lgtm` label and a short
merge checklist comment that updates on CI completion and review changes.
External contributors get a Required Reviewers comment showing which
CODEOWNERS teams need to review their PR.

The `/diagnose` slash command provides LLM-powered CI failure analysis
using NVIDIA inference, giving root cause and fix suggestions. A reusable
workflow lets existing CI pipelines auto-classify errors on failure.

Architecture: 2 workflows + 1 reusable, 2 Python scripts, 1 shared
LLM client package. All GitHub API calls use `gh` CLI. 1,307 total lines.

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia dagil-nvidia requested a review from a team February 15, 2026 02:38
@dagil-nvidia dagil-nvidia requested a review from a team as a code owner February 15, 2026 02:38
@github-actions github-actions bot added ci Issues/PRs that reference CI build/test documentation Improvements or additions to documentation actions labels Feb 15, 2026

coderabbitai bot commented Feb 15, 2026

Walkthrough

Adds LGTM Bot documentation, GitHub Actions workflows and scripts for CI diagnostics and error classification, and a new error_classification Python package with an LLM client and configuration utilities.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation**: `.github/LGTM_BOT.md`, `CONTRIBUTING.md` | New LGTM Bot documentation and CONTRIBUTING updates describing CI/merge process, LGTM behavior, diagnostics, and pre-commit guidance. |
| **Workflows**: `.github/workflows/lgtm-bot.yml`, `.github/workflows/lgtm-bot-diagnose.yml`, `.github/workflows/classify-workflow-errors-reusable.yml` | Added three workflows: LGTM bot for merge readiness and CODEOWNER routing, a diagnose workflow triggered by comments/dispatch, and a reusable classify-workflow-errors workflow. |
| **Automation scripts**: `.github/scripts/classify_workflow_errors.py`, `.github/scripts/lgtm_bot_diagnose.py` | New Python scripts to collect CI failures, fetch logs, run LLM-based classification/diagnosis, and post or update annotated Markdown comments on PRs. |
| **Error classification package**: `error_classification/__init__.py`, `error_classification/config.py`, `error_classification/llm_client.py`, `error_classification/requirements.txt` | New package with Apache-2.0 header, a `Config` dataclass (`from_env`), an `LLMClient` with rate limiting/retries and a `classify_error` API, plus a requirements file. |
| **Repo filters**: `.github/filters.yaml` | Added `error_classification/**` to ignore patterns. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I nibble on logs in the pale CI light,

I hop through workflows, making failures bright,
I whisper diagnoses in tidy Markdown rhyme,
Labeling PRs so they pass in good time —
Hooray for bots, and carrots for every green check! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

- **Docstring Coverage** (⚠️ Warning): Docstring coverage is 71.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
- **Merge Conflict Detection** (⚠️ Warning): ❌ Merge conflicts detected (25 files):

⚔️ .github/filters.yaml (content)
⚔️ CONTRIBUTING.md (content)
⚔️ Cargo.lock (content)
⚔️ Cargo.toml (content)
⚔️ benchmarks/router/run_engines.sh (content)
⚔️ components/src/dynamo/mocker/args.py (content)
⚔️ deploy/helm/charts/platform/components/operator/templates/prometheus.yaml (content)
⚔️ deploy/operator/docs/footer.md (content)
⚔️ deploy/operator/internal/consts/consts.go (content)
⚔️ deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go (content)
⚔️ deploy/operator/internal/dynamo/component_worker.go (content)
⚔️ deploy/operator/internal/dynamo/graph_test.go (content)
⚔️ docs/pages/components/router/router-guide.md (content)
⚔️ docs/pages/kubernetes/api-reference.md (content)
⚔️ docs/pages/kubernetes/observability/metrics.md (content)
⚔️ lib/kv-router/Cargo.toml (content)
⚔️ lib/kv-router/benches/kv_indexer_bench.rs (content)
⚔️ lib/kv-router/benches/radix_tree_microbench.rs (content)
⚔️ lib/kv-router/src/bench_utils.rs (content)
⚔️ lib/llm/Cargo.toml (content)
⚔️ lib/llm/benches/kv_router_bench.rs (content)
⚔️ lib/llm/src/mocker.rs (content)
⚔️ lib/llm/src/preprocessor.rs (content)
⚔️ lib/llm/src/protocols/openai/nvext.rs (content)
⚔️ lib/mocker/src/protocols.rs (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
- **Title check** (✅ Passed): The PR title accurately summarizes the main change: introducing an LGTM Bot for external contributor merge visibility.
- **Description check** (✅ Passed): The PR description is comprehensive and well-structured, covering problem statement, solution overview with feature table, detailed /diagnose output example, merge checklist format, system architecture with flowchart, files changed table, testing instructions, and security considerations.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around line 118-122: The loop currently sends empty logs from get_job_log to
client.classify_error; update the loop inside the jobs iteration to skip or mark
jobs with no log data by checking the return of get_job_log (e.g., if not log:
continue or set classification to a sentinel like "no_log") before calling
client.classify_error, so only non-empty logs are passed to classify_error and
results.append receives either a real classification or a clear "no_log"
classification for that job.
- Around line 178-206: The subprocess.run calls inside the try block that post
or patch the GH comment (the branches that run when existing_id is truthy or
falsy) do not check return codes or report failures; update those calls to
either use subprocess.run(..., check=True) or capture the CompletedProcess and
inspect returncode, logging stderr/stdout on failure and raising or exiting
non‑zero so the script does not falsely report success; keep the os.unlink(tmp)
in the finally block to ensure cleanup, and add process/log context mentioning
the PR number and existing_id when logging the error.

In @.github/scripts/lgtm_bot_diagnose.py:
- Around line 77-99: get_check_status currently only requests the first page of
check-runs (per_page=100) so runs beyond page 1 are dropped; update
get_check_status to paginate through the GitHub API (e.g., loop incrementing a
page parameter and merging data.get("check_runs", []) until no more results or
headers indicate end) and aggregate passed/failed/pending/failures across all
pages; apply the same pagination fix or add a TODO for the other affected
functions mentioned (get_review_summary and post_or_update_comment) so they also
fetch all pages instead of only the first.
- Around line 180-185: The generated markdown builds a link using f['url'] which
can be empty (html_url default ""), producing a broken link; update the
failed-checks loop in the function that builds `lines` (the block iterating over
checks["failures"]) to check the URL truthiness (e.g., f.get('url') or f['url']
!= "") and append "- check-name" as plain text when the URL is empty, otherwise
append the markdown link "- [check-name](url)"; ensure you reference the same
keys f['name'] and f['url'] so the output falls back to plain text for empty
URLs.
- Around line 245-273: The subprocess.run calls that POST/PATCH GitHub comments
(the two subprocess.run invocations guarded by existing_id and the tmp variable)
silently ignore failures; change them to capture output and enforce failure
detection (e.g., use subprocess.run(..., check=True, capture_output=True) or
assign the result and check result.returncode), log or surface stderr/exception
details, and propagate the error (raise) so the caller (main) doesn't treat a
failed comment as success; ensure tmp is still unlinked in the finally block
after handling/logging the error.

In @.github/workflows/lgtm-bot.yml:
- Line 109: The self-exclusion filter currently using c.name !== 'LGTM Bot' is
incorrect and is a no-op; update the filter logic (the allChecks.filter call
that assigns ciChecks) to exclude the actual check run names produced by the
workflow (e.g., 'evaluate-lgtm' and 'request-reviews') instead of 'LGTM Bot' so
the bot does not evaluate its own runs — locate the allChecks.filter(...)
expression and replace or extend the predicate to skip c.name matching
'evaluate-lgtm' and 'request-reviews' (or use a whitelist/blacklist rule that
reliably identifies bot-generated check names).
- Around line 109-117: The current logic sets allPassed = ciChecks.every(c =>
c.conclusion === 'success'), which treats 'skipped' and 'neutral' as failures
and prevents the LGTM label; update the allPassed check (referencing ciChecks
and allPassed) to treat conclusions 'success', 'skipped', and 'neutral' as
passing (e.g., check c.conclusion is one of those values), and keep anyFailed
and anyPending semantics (anyFailed should still only consider explicit
'failure' conclusions) so conditional/matrix jobs that are skipped/neutral don't
block lgtm.
- Around line 282-303: The reopen handler currently always calls
github.rest.issues.createComment with the constructed body (using marker '<!--
lgtm-bot-reviewers -->', rows from teamSlugs and variable body), which creates
duplicate "Required Reviewers" comments; change the logic to first list existing
issue comments (github.rest.issues.listComments), search for a comment whose
body includes the marker string '<!-- lgtm-bot-reviewers -->', and if found call
github.rest.issues.updateComment with that comment's id to replace the body,
otherwise call github.rest.issues.createComment as before; keep the same marker,
body construction, and owner/repo/issue_number values (and use prNumber) so the
find-and-update pattern mirrors the checklist implementation.
- Line 298: Summary: The PR comment uses a relative link
'[CODEOWNERS](../blob/main/CODEOWNERS)' which won't resolve in PR comments;
replace it with an absolute GitHub URL. Locate the string literal '*Based on
[CODEOWNERS](../blob/main/CODEOWNERS) and changed files.*' in
.github/workflows/lgtm-bot.yml and change the relative markdown link to an
absolute one that uses the repository context (for example using
https://github.com/${{ github.repository }}/blob/main/CODEOWNERS) so the link
resolves correctly in PR comments.

In `@error_classification/requirements.txt`:
- Line 1: The requirements.txt is missing the standard SPDX copyright header;
add the same SPDX header lines used across the repo (the SPDX-FileCopyrightText
and SPDX-License-Identifier lines) to the very top of requirements.txt, above
the existing "requests>=2.28.0" entry, ensuring the header text and formatting
exactly match other files in the project.
🧹 Nitpick comments (9)
.github/scripts/lgtm_bot_diagnose.py (1)

46-60: gh_api silently swallows errors — consider logging stderr for debuggability.

The sibling gh_api in classify_workflow_errors.py (line 55) logs r.stderr[:200] on failure, but this version returns None silently. In a CI diagnostic tool, this makes it harder to troubleshoot when API calls fail.

Suggested improvement
     r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
     if r.returncode != 0:
+        print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
         return None
.github/LGTM_BOT.md (1)

63-71: Add a language identifier to the fenced code block.

The linter flags this block (MD040). Since it's a plain-text diagram, use ```text.

.github/workflows/lgtm-bot-diagnose.yml (1)

24-27: contains() is a loose match — any comment containing /diagnose will trigger.

A comment like "I already ran /diagnose earlier" would re-trigger the workflow. Consider using startsWith or a stricter pattern if this becomes noisy. Low risk since the operation is read-only and idempotent.

error_classification/config.py (1)

24-48: LGTM — clean config loader with sensible defaults.

Minor nit: lines 37 and 42-46 mix os.getenv and os.environ.get, which are functionally identical but stylistically inconsistent. Picking one would be slightly cleaner.

error_classification/llm_client.py (3)

71-77: Sleeping inside the lock serializes all waiters unnecessarily.

time.sleep() inside self._lock blocks other threads from even checking the interval until the sleeping thread wakes and releases. A common pattern is to compute the needed delay inside the lock, update _last_call optimistically, then sleep outside the lock.

For a CI bot with a handful of sequential LLM calls this is unlikely to matter in practice, but it's worth noting if the client is ever used with concurrent classification.

♻️ Optional: sleep outside the lock
     def wait(self) -> None:
+        delay = 0.0
         with self._lock:
             now = time.time()
             elapsed = now - self._last_call
             if elapsed < self._min_interval:
-                time.sleep(self._min_interval - elapsed)
-            self._last_call = time.time()
+                delay = self._min_interval - elapsed
+            self._last_call = now + delay
+        if delay > 0:
+            time.sleep(delay)

111-148: Retry loop conflates retriable HTTP errors with non-retriable parse failures.

Currently, both branches (lines 119-126 for HTTP 429/5xx and lines 137-148 for exceptions) sleep and retry. If the LLM returns HTTP 200 with unparseable output, the code retries the identical request 3 times with backoff — unlikely to yield different JSON. Consider distinguishing transient errors (network/429/5xx) from deterministic parse failures to avoid wasted retries and latency.

Also, the rate limiter wait() on line 95 is called only once before the retry loop, so retries bypass rate limiting. This is fine when retries follow a long backoff, but on a fast non-429 failure caught by the except branch the next attempt fires immediately (attempt 0 → catch → sleep 4s is fine, but logically it should re-consult the limiter).

Minor: if all 3 attempts hit the continue path (lines 119-126), the exception handler is never entered yet the method still correctly returns None at line 150 — that's good.


132-133: Markdown-fence stripping can misfire on edge cases.

content.split("\n", 1)[1] raises IndexError when the content is exactly ``` with no newline (caught by the except handler, triggering a needless retry). More subtly, rsplit("```", 1) will clip valid JSON if the model output itself contains a triple-backtick string literal inside a value.

A slightly more robust approach:

♻️ Suggested improvement
-                if content.startswith("```"):
-                    content = content.split("\n", 1)[1].rsplit("```", 1)[0].strip()
+                if content.startswith("```"):
+                    lines = content.split("\n")
+                    # Drop first (```json) and last (```) fence lines
+                    if lines[-1].strip() == "```":
+                        content = "\n".join(lines[1:-1]).strip()
+                    else:
+                        content = "\n".join(lines[1:]).strip()
.github/scripts/classify_workflow_errors.py (2)

164-172: Comment listing fetches only the first page (100 comments) — duplicate comment possible on busy PRs.

gh_api makes a single request with per_page=100. If a PR has more than 100 comments, the classifier marker may not be found, resulting in a duplicate comment being created. This is a low-probability edge case for most PRs.


106-111: sys.path mutation is fragile; consider making error_classification installable.

sys.path.insert(0, ...) at line 109 works but couples the script to a specific directory layout. If the error_classification package has a requirements.txt (per the summary), adding a minimal pyproject.toml/setup.py and installing it with pip install ./error_classification in the workflow would be cleaner and would avoid import path gymnastics.


github-actions bot commented Feb 15, 2026

🤖 LGTM Bot

Merge Checklist

  • CI Checks: 2 failing, 46 passed, 9 skipped
  • Review Status: Awaiting Approval
  • Mergeable: Blocked (Pending Review), 40 unresolved conversations, Changes Requested

Type /diagnose for AI-powered CI failure analysis.
Updates automatically on CI completion and review changes.

- Add issues: write permission to evaluate-lgtm and request-reviews jobs
- Fix self-exclusion filter to use job names instead of workflow name
- Accept skipped/neutral check conclusions as passing
- Skip LLM classification when log is empty
- Add SPDX header to requirements.txt
- Log subprocess failures in comment posting
- Handle empty URLs in markdown links
- Find-and-update for reviewers comment to avoid duplicates on reopen
- Fix CODEOWNERS link to use absolute GitHub URL

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia
Collaborator Author

/diagnose

…list format

- Add error_classification/** to ignore filter in .github/filters.yaml
  to fix changed-files coverage check failure
- Add workflow_dispatch trigger to lgtm-bot-diagnose.yml for manual
  testing (issue_comment only works from default branch)
- Rewrite merge checklist with CI counts, CODEOWNERS review status,
  and unresolved conversation detection via GraphQL

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia
Collaborator Author

@coderabbitai full review


coderabbitai bot commented Feb 15, 2026

✅ Actions performed

Full review triggered.

@ai-dynamo ai-dynamo deleted a comment from coderabbitai bot Feb 15, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around line 143-145: The confidence formatting uses f"{conf:.0%}" which treats
conf as a 0–1 fraction and will produce incorrect results for integer percent
values (e.g., 85 → "8500%"); update the pct calculation around the conf variable
in classify_workflow_errors.py to normalize and safely format values: parse
string confidences to a number, if conf > 1 and conf <= 100 divide by 100, clamp
values below 0 to 0 and above 1 to 1, then format as a percent (or fall back to
the original string if parsing fails). Implement this normalization logic where
pct is computed (the conf → pct assignment) or extract into a small helper like
format_confidence(conf) and use that.
- Around line 168-176: The comment search in _post_or_update_comment currently
calls gh_api("/repos/{repo}/issues/{pr_number}/comments?per_page=100") without
pagination, so it can miss the classifier marker and create duplicates; update
this logic to paginate through all comment pages (e.g., loop pages or use the
repo API's paginate helper) when fetching comments from gh_api, scanning each
page for COMMENT_MARKER and only setting existing_id if the marker is found,
then proceed to update or create as before.

In @.github/scripts/lgtm_bot_diagnose.py:
- Around line 130-151: The run_classifier function currently only catches
subprocess.TimeoutExpired and json.JSONDecodeError, so if subprocess.run raises
an OSError (e.g., FileNotFoundError when "python3" or the script is missing) the
program will crash; update run_classifier to also catch OSError (or
FileNotFoundError) around the subprocess.run call, log the exception to stderr
similarly to the existing except block (e.g., print f"Classifier failed:
{exc}"), and return an empty list on that error so main() can continue;
reference the run_classifier function and the subprocess.run invocation when
making this change.

In @.github/workflows/lgtm-bot.yml:
- Around line 193-196: The ternary assigning prefix is redundant because both
branches produce the same string; replace the ternary expression with a simple
assignment (e.g., const prefix = ' · ';), or remove prefix entirely and append
the separator directly when updating revLine in the block that handles
requestedTeams, referencing the variables requestedTeams, approved and revLine
to locate the code.
🧹 Nitpick comments (5)
.github/LGTM_BOT.md (1)

63-71: Add a language identifier to the fenced code block.

The architecture diagram block lacks a language specifier. Use ```text to satisfy markdown linters.

Proposed fix
-```
+```text
 .github/workflows/lgtm-bot.yml            → label + checklist + CODEOWNERS
error_classification/llm_client.py (1)

63-78: Rate limiter holds the lock during sleep — acceptable here but worth a note.

time.sleep() inside the lock (line 76) blocks all other threads from even checking the limiter until the sleep completes. This is fine for this low-concurrency CI tooling context, but if _RateLimiter is ever reused in a higher-concurrency setting, consider computing the sleep duration under the lock and sleeping after releasing it.

.github/scripts/lgtm_bot_diagnose.py (1)

46-60: gh_api silently swallows errors — consider logging to stderr like the sibling script.

When returncode != 0, this function returns None with no logging. The equivalent function in classify_workflow_errors.py (line 55) logs r.stderr[:200] to stderr, which helps with debugging in CI. Consider adding similar logging here for consistency and observability.

Proposed fix
     r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
     if r.returncode != 0:
+        print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
         return None
.github/workflows/lgtm-bot.yml (2)

305-330: Hardcoded CODEOWNERS mapping will silently drift from the actual CODEOWNERS file.

The team-to-path rules here are a manual mirror of the repo's CODEOWNERS. If CODEOWNERS is updated (new paths, renamed teams), this block won't reflect that until someone remembers to edit it. Consider adding an inline comment pointing maintainers to keep these in sync, or parsing the CODEOWNERS file at runtime.


70-71: Pin actions/github-script@v7 to a commit SHA for supply-chain consistency.

Both usages (lines 70 and 291) rely on a mutable version tag. Other workflows in this repository (lgtm-bot-diagnose.yml, classify-workflow-errors-reusable.yml) pin actions/checkout and actions/setup-python by commit SHA to prevent tag hijacking. Apply the same pinning practice to github-script for consistency.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around lines 47-58: `gh_api` currently calls `json.loads(r.stdout)` without guarding against non-JSON output. Wrap the call in a try/except catching `json.JSONDecodeError` (and/or `ValueError` for broader compatibility); on decode failure, print a short diagnostic (e.g., `r.stdout[:200]` or `r.stderr[:200]`) to `sys.stderr`, mirroring the existing error path, and return `None`, preserving the original "return parsed JSON or None" contract.
- Around lines 259-265: The success message in `main()` is printed unconditionally after calling `_post_or_update_comment`. Change `_post_or_update_comment` to either raise an exception on failure or return a boolean status, then have `main()` capture that result and print the "Posted classification" message only when the call succeeded.

In @.github/scripts/lgtm_bot_diagnose.py:
- Around lines 219-224: The AI analysis section can be omitted entirely when an API key exists and there are failed checks but the classifier produced no diagnoses. After the existing `has_api_key` / `checks["failed"]` branches, add a fallback that checks whether `diagnoses` is empty and appends a clear message to `lines` (e.g., "AI analysis attempted but returned no results — please retry or check classifier logs") so users always see feedback when classification was attempted.
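A sketch of that fallback branch using the variable names from the note above (`build_analysis_lines` is a hypothetical wrapper; the real script builds `lines` inline):

```python
def build_analysis_lines(has_api_key: bool, checks: dict, diagnoses: list) -> list[str]:
    """Build the AI-analysis section, always giving the reader some feedback."""
    lines: list[str] = []
    if not checks["failed"]:
        return lines  # nothing to analyze
    if not has_api_key:
        lines.append("AI analysis skipped: no API key configured.")
    elif diagnoses:
        lines.extend(f"- {d}" for d in diagnoses)
    else:
        # The classifier ran but produced nothing; don't leave the section empty.
        lines.append(
            "AI analysis attempted but returned no results; "
            "please retry or check classifier logs."
        )
    return lines
```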

In @.github/workflows/lgtm-bot.yml:
- Around lines 346-347: The code computing `isExternal` uses `pr.head.repo.full_name` and can throw if `pr.head.repo` is null (deleted fork). Guard the access with optional chaining (e.g., `pr.head.repo?.full_name ?? ''`) or an explicit `pr.head.repo !== null` check so `isExternal` is computed without raising a TypeError, and decide deliberately whether a missing head repo should count as external.
🧹 Nitpick comments (2)
.github/workflows/lgtm-bot.yml (1)

305-330: Hardcoded CODEOWNERS mapping will silently drift from the actual CODEOWNERS file.

This duplicates (and can diverge from) the real CODEOWNERS file. Consider parsing CODEOWNERS at runtime or at least adding a comment referencing the source of truth so future maintainers know to update both places.

.github/scripts/lgtm_bot_diagnose.py (1)

46-60: gh_api swallows non-zero exit silently and doesn't guard against malformed JSON.

Two concerns:

  1. Unlike the sibling gh_api in classify_workflow_errors.py, this version doesn't log r.stderr on failure, making debugging harder.
  2. json.loads(r.stdout) on line 60 can raise json.JSONDecodeError if gh returns non-empty but malformed output, which would propagate as an unhandled exception.
Proposed fix:

```diff
 def gh_api(
     path: str,
     method: str = "GET",
     fields: dict[str, str] | None = None,
 ) -> Any:
     """Call GitHub API via ``gh`` CLI."""
     cmd = ["gh", "api", path]
     if method != "GET":
         cmd.extend(["--method", method])
     for k, v in (fields or {}).items():
         cmd.extend(["-f", f"{k}={v}"])
     r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
     if r.returncode != 0:
+        print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
         return None
-    return json.loads(r.stdout) if r.stdout.strip() else {}
+    try:
+        return json.loads(r.stdout) if r.stdout.strip() else {}
+    except json.JSONDecodeError:
+        print(f"gh api {path}: invalid JSON response", file=sys.stderr)
+        return None
```

- Add SKIP_CLASSIFICATION list to skip LLM on lint, copyright, DCO,
  etc. where logs already explain exactly what to fix
- Fix confidence rendering: normalize 0-100 to 0-1 range
- Wrap gh_api JSON parsing in try/except for malformed responses
- Return success/failure from _post_or_update_comment
- Add OSError to classifier subprocess except tuple
- Add feedback when classifier returns empty results
- Fix dead ternary in review status line
- Guard against null pr.head.repo for deleted forks

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…diagnose

Consolidate classifier into diagnose script (-85 net lines):
- Merge classify_workflow_errors.py into lgtm_bot_diagnose.py
- Delete unused classify-workflow-errors-reusable.yml

Add GitLab CI integration:
- Fetch failed job details and logs via GitLab API (PRIVATE-TOKEN)
- Auto-classify timeout jobs without LLM
- Add contextual note explaining GitLab CI for external contributors

Group failures by merge impact:
- Blocking: 5 required status checks with sub-job diagnoses nested
  under their parent rollup check (WORKFLOW_TO_REQUIRED mapping)
- Non-blocking: GitLab and other informational failures

Improve output format:
- Compact one-liners for passing/pending, collapsible details for failures
- Reply-link posted to /diagnose comment with URL to diagnostic comment
- Include commit statuses (GitLab, CodeRabbit) in merge checklist counts
- Title case consistency across all bot output

Total: 1,381 lines across all bot files (down from 1,467).
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia dagil-nvidia changed the title ci: add LGTM Bot with merge readiness tracking and AI-powered CI diagnostics ci: add LGTM Bot for external contributor merge visibility Feb 16, 2026
issue_comment and workflow_dispatch triggers only activate when the
workflow file exists on the default branch.  Since this is a new file,
workflow_dispatch is dead weight until merge.  Remove it to simplify
the workflow and avoid misleading test instructions.

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
@dillon-cullinan dillon-cullinan left a comment

I'm very much against this code. It is attempting to put things already visible in the GitHub UI into a single comment.

Along with that, in its current state it has multiple problems: unconsidered edge cases, redundant code, and even hallucinated code.

# ---------------------------------------------------------------------------


def gh_api(

This function is defined twice across two files. Why?

fields: dict[str, str] | None = None,
) -> Any:
"""Call GitHub API via ``gh`` CLI. Returns parsed JSON or *None*."""
cmd = ["gh", "api", path]

This isn't using the gh CLI the way I meant. We should be using subcommands like `gh pr checks ####` to get check statuses.

return None


def add_reaction(repo: str, comment_id: str, reaction: str) -> None:

Why is this a new function? All it does is call gh_api. Why not just use gh_api instead of defining this?

own_jobs = {"evaluate-lgtm", "request-reviews", "diagnose"}

# 1. Check runs (GitHub Actions)
data = gh_api(f"/repos/{repo}/commits/{sha}/check-runs?per_page=100")

Why is it paginated? What if we have more than 100 jobs?
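One way to close that gap is to keep requesting pages until a short page comes back (the `gh api` command also has a `--paginate` flag that follows Link headers). A sketch with a hypothetical `fetch_page` callable standing in for the check-runs request:

```python
from typing import Callable


def fetch_all_check_runs(fetch_page: Callable[[int], list], per_page: int = 100) -> list:
    """Collect check runs across pages; stop when a page comes back short."""
    runs: list = []
    page = 1
    while True:
        batch = fetch_page(page)
        runs.extend(batch)
        if len(batch) < per_page:
            break  # last page reached
        page += 1
    return runs
```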

)

# 2. Commit statuses (GitLab, CodeRabbit, etc.)
status_data = gh_api(f"/repos/{repo}/commits/{sha}/status")

What is the status API offering that check-runs is not?

Comment on lines +315 to +340
// Simple CODEOWNERS matching
const teams = new Set();
for (const f of files) {
const p = f.filename;
if (p.endsWith('.rs') || p.includes('Cargo.toml')) {
teams.add('dynamo-rust-codeowners');
}
if (p.endsWith('.py')) {
teams.add('python-codeowners');
teams.add('Devops');
}
if (p.startsWith('.github/')) teams.add('Devops');
if (p.startsWith('deploy/')) {
teams.add('dynamo-deploy-codeowners');
}
if (
p.startsWith('container/') ||
p.startsWith('examples/') ||
p.startsWith('recipes/')
) {
teams.add('Devops');
teams.add('dynamo-rust-codeowners');
teams.add('python-codeowners');
teams.add('dynamo-deploy-codeowners');
}
}

What is this?

'|------|--------|',
rows,
'',
'*Based on [CODEOWNERS](https://github.com/ai-dynamo/dynamo/blob/main/CODEOWNERS) and changed files.*',

This is not true. The comment being built is based on hallucinated codeowners in this workflow. See this comment: https://github.com/ai-dynamo/dynamo/pull/6313/changes#r2819268809



@dataclass
class Config:

What is the point of this class? Are we publishing a library or re-using this in some meaningful way?

import sys
import time
from dataclasses import dataclass, field
from threading import Lock

Why are we importing threading?

}

for attempt in range(3):
try:

The sleep call should be within this for loop, not outside it.
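The corrected shape keeps the sleep inside the retry loop so attempts are actually spaced out; a minimal sketch (the function name and delay values are illustrative, not the script's actual code):

```python
import time


def retry(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with linear backoff. Assumes attempts >= 1."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # bot-level catch-all
            last_exc = exc
            if attempt < attempts - 1:
                # Sleeping here, inside the loop, spaces out the retries;
                # a sleep after the loop would run once, after all failures.
                time.sleep(base_delay * (attempt + 1))
    raise last_exc
```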

@dillon-cullinan

/diagnose

@Jont828

Jont828 commented Feb 19, 2026

For anyone who doesn't have context, @dagil-nvidia, @sozercan, and I were discussing the confusion around what it means for a PR to be approved. The GitHub UI is confusing since anyone can approve a PR, but we can merge only when it's approved by a codeowner of the part of the project that the PR touches. The GitHub UI also shows approval from anyone with write access, not approval from the owner of that area of the codebase.

Kubernetes OSS repos have a system where you can type an /approve command, and it adds the approve label if and only if you are among the approvers for that part of the codebase. Our suggestion is to add a label indicating whether a PR is approved by someone who can merge to that area of the project, and use that label as the source of truth. That way, there's no confusion about whether an "approved" PR is actually approved.

An example of this is in a PR for the Cluster API project:
*(screenshot: a Cluster API PR showing the approved label)*

@dagil-nvidia dagil-nvidia marked this pull request as draft March 12, 2026 21:14

Labels

actions ci Issues/PRs that reference CI build/test documentation Improvements or additions to documentation size/XXL


3 participants