
ci: add LGTM Bot for external contributor merge visibility #6313

Draft
dagil-nvidia wants to merge 8 commits into `main` from `dagil-nvidia/ci-lgtm-bot-v2`

Conversation


@dagil-nvidia dagil-nvidia commented Feb 15, 2026

Problem

Dynamo has 180+ external contributors and 465+ external PRs submitted. When an external contributor opens a PR, key information is invisible to them:

  1. They can't see who needs to review. CODEOWNERS teams are hidden from fork PRs — contributors have no idea which teams they're waiting on.
  2. They can't tell why CI failed. CI spans GitHub Actions and internal GitLab GPU tests. GitHub shows raw logs; GitLab is behind a login wall. Contributors can't distinguish infra flakes from real code errors.
  3. They don't know when they're ready to merge. No single "all gates pass" signal exists.

Not addressed in this PR: Automatic individual reviewer assignment. CODEOWNERS team requests are included, but round-robin assignment within a team is a potential follow-up.

Solution

| Feature | What It Does | Why GitHub Doesn't |
| --- | --- | --- |
| `lgtm` label | Applied when ALL gates pass; removed on new pushes. Could be used for future auto-merge. | No native "all gates pass" signal. |
| `/diagnose` | LLM-powered CI failure analysis grouped by blocking vs non-blocking. Fetches internal GitLab CI logs. Self-explanatory checks (lint, DCO, copyright) are never sent to the LLM. | GitHub shows raw logs; GitLab is inaccessible. |
| Required Reviewers | Posts which CODEOWNERS teams need to review. | The required-reviewers sidebar is not visible to external contributors. |
| Merge Checklist | Auto-updating 3-line status (CI, reviews, mergeable). | Status is spread across tabs. |
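The "all gates pass" decision behind the `lgtm` label can be sketched as follows. This is an illustrative Python sketch, not the workflow's actual github-script JavaScript; the excluded check names and the passing treatment of `skipped`/`neutral` conclusions follow the review feedback on this PR.

```python
# Illustrative sketch of the lgtm gate. The real logic lives in
# .github/workflows/lgtm-bot.yml; evaluate-lgtm and request-reviews are the
# bot's own check runs, excluded so the bot never evaluates itself.
PASSING = {"success", "skipped", "neutral"}  # skipped/neutral don't block lgtm
BOT_CHECKS = {"evaluate-lgtm", "request-reviews"}

def evaluate_lgtm(check_runs, approvals, approvals_required=1):
    """True when every non-bot CI check passed and reviews are satisfied."""
    ci = [c for c in check_runs if c["name"] not in BOT_CHECKS]
    any_pending = any(c.get("conclusion") is None for c in ci)
    all_passed = all(c.get("conclusion") in PASSING for c in ci)
    return all_passed and not any_pending and approvals >= approvals_required
```

On a new push the label would be removed and this gate re-evaluated once CI completes.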

/diagnose Output

The /diagnose command posts a structured diagnostic grouped by merge impact:

## CI Diagnostics

**1 Blocking**, 3 non-blocking · 28 passed, 1 pending · 2 approvals

### Blocking

- ✅ copyright-checks
- ✅ DCO
- ❌ dynamo-status-check
  <details><summary>Build and Test - dynamo: Infrastructure Error (92%)</summary>
  etcd image not found. Infrastructure issue -- not caused by your PR. Wait for re-run.
  </details>
- ✅ backend-status-check
- ⏳ pre-merge-status-check

### Non-Blocking

*GitLab CI runs E2E integration tests that require GPU compute
on internal infrastructure. Logs are not publicly accessible — fetched here via API.*

<details><summary>sglang-x86-2GPU: Code Error (85%)</summary>
test_sglang_deployment[disaggregated-2] returned 404. Check SGLang config.
</details>

<details><summary>vllm-arm-1GPU: Timeout (95%)</summary>
Job exceeded time limit. Infrastructure issue -- no action needed.
</details>

Key design decisions:

  • Blocking lists all 5 required status checks; sub-job diagnoses nest under their parent rollup check
  • Non-blocking shows only failures; GitLab jobs include a note that logs were fetched from internal CI
  • Self-explanatory checks (e.g., copyright-checks, DCO) and timeouts that need no further explanation are never sent to the LLM for classification
  • A reply link is posted under the /diagnose comment so the contributor is notified with a direct URL; the diagnostic message itself is refreshed in place to prevent comment bloat
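The "never sent to the LLM" rule can be expressed as a small pre-filter. A hypothetical sketch; the skip list and function name are illustrative, not the actual `lgtm_bot_diagnose.py` code (which, per the review feedback, also skips empty logs):

```python
# Hypothetical pre-filter deciding which failed checks go to the LLM.
# SELF_EXPLANATORY and needs_llm are illustrative names, not the PR's API.
SELF_EXPLANATORY = {"copyright-checks", "DCO", "lint", "pre-commit"}

def needs_llm(check_name: str, log: str) -> bool:
    """Classify only non-trivial failures that have real log content."""
    if check_name in SELF_EXPLANATORY:
        return False  # the check name alone tells the contributor what to fix
    if not log.strip():
        return False  # nothing to classify; skip the API call entirely
    return True
```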

Merge Checklist

## 🤖 LGTM Bot
### Merge Checklist
- [ ] **CI Checks:** 2 failing, 10 passed, 3 pending, 8 skipped
- [ ] **Review Status:** Awaiting Approval · CODEOWNERS: @ai-dynamo/Devops, @ai-dynamo/python-codeowners
- [ ] **Mergeable:** Merge Conflicts, 3 unresolved conversations
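Rendering a checklist like the one above is mostly string assembly; a minimal sketch (function name and parameters are illustrative, since the real markdown is built inside lgtm-bot.yml):

```python
def render_checklist(ci_ok: bool, ci_summary: str,
                     reviews_ok: bool, review_summary: str,
                     mergeable: bool, merge_summary: str) -> str:
    """Assemble the 3-line merge checklist as GitHub task-list markdown."""
    def item(done: bool, label: str, detail: str) -> str:
        return f"- [{'x' if done else ' '}] **{label}:** {detail}"
    return "\n".join([
        "## 🤖 LGTM Bot",
        "### Merge Checklist",
        item(ci_ok, "CI Checks", ci_summary),
        item(reviews_ok, "Review Status", review_summary),
        item(mergeable, "Mergeable", merge_summary),
    ])
```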

Architecture

flowchart TD
    subgraph triggers [Triggers]
        WR[workflow_run completed]
        PRR[pull_request_review]
        PRT[pull_request_target]
        IC["issue_comment /diagnose"]
    end

    subgraph workflows [Workflows]
        LB[lgtm-bot.yml]
        LD[lgtm-bot-diagnose.yml]
    end

    subgraph scripts [Scripts]
        DPY[lgtm_bot_diagnose.py]
    end

    subgraph pkg [error_classification/]
        LLC[llm_client.py]
        CFG[config.py]
    end

    subgraph external [External APIs]
        GH[GitHub API via gh CLI]
        GL[GitLab API via requests]
        NV[NVIDIA Inference API]
    end

    WR --> LB
    PRR --> LB
    PRT --> LB
    IC --> LD

    LB -->|"label + checklist + reviewers"| GH
    LD -->|run script| DPY
    DPY -->|check status| GH
    DPY -->|fetch logs| GL
    DPY --> LLC
    LLC --> CFG
    LLC -->|LLM call| NV
    DPY -->|post comment| GH

Files Changed

| File | Lines | Purpose |
| --- | --- | --- |
| `.github/workflows/lgtm-bot.yml` | 391 | Label + checklist + reviewer request |
| `.github/workflows/lgtm-bot-diagnose.yml` | 50 | `/diagnose` trigger |
| `.github/scripts/lgtm_bot_diagnose.py` | 658 | Consolidated diagnostics: GitHub + GitLab + LLM + blocking grouping |
| `error_classification/llm_client.py` | 150 | OpenAI-compatible LLM client |
| `error_classification/config.py` | 48 | Env config |
| `error_classification/__init__.py` | 14 | Package init |
| `error_classification/requirements.txt` | 16 | `requests` dependency |
| `.github/LGTM_BOT.md` | 72 | Maintainer docs |
| `.github/filters.yaml` | +1 | CI filter coverage |
| `CONTRIBUTING.md` | +25 | CI trigger guide + `/diagnose` docs |
| **Total** | ~1,425 | |

How to Test

Merge checklist + LGTM label (works now):
Triggers via pull_request_review and runs from the PR branch. The checklist comment is already posting on this PR.

Required Reviewers comment (works on fork PRs):
Open a PR from a fork — the bot requests CODEOWNERS reviews and posts a visibility comment.

/diagnose command (works after merge):
The /diagnose workflow uses the issue_comment trigger. GitHub only activates issue_comment-triggered workflows when the workflow file exists on the default branch (main). Since this PR introduces the workflow as a new file, the /diagnose command will activate once this PR is merged. This is a standard GitHub Actions constraint — not a bug. The example output above shows exactly what contributors will see. GitLab API integration was verified locally against real pipelines.

GitLab log fetching requires GITLAB_TOKEN from the GITLAB environment (same token used by trigger_ci.yml for mirroring). Degrades gracefully without it.
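The graceful degradation amounts to a guard at the top of the fetch path. A sketch under stated assumptions: the function name and base URL are illustrative, and only `GITLAB_TOKEN` is taken from the text above.

```python
import os
from typing import Optional

def fetch_gitlab_job_trace(job_id: int,
                           base_url: str = "https://gitlab.example.com/api/v4") -> Optional[str]:
    """Return the job log, or None when no GITLAB_TOKEN is configured."""
    token = os.environ.get("GITLAB_TOKEN")
    if not token:
        # Graceful degradation: the diagnostic still posts, but GitLab jobs
        # report that logs were unavailable instead of an LLM classification.
        return None
    # Real fetch (sketch, using requests):
    #   requests.get(f"{base_url}/jobs/{job_id}/trace",
    #                headers={"PRIVATE-TOKEN": token}, timeout=30).text
    return f"<trace for job {job_id}>"  # placeholder for the fetched log text
```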

Security

  • pull_request_target only for label removal and reviewer requests — no PR code executed
  • /diagnose gated to author_association != NONE
  • LLM API key and GitLab token optional — degrades gracefully
  • Actions SHA-pinned in diagnose workflow (checkout, setup-python); lgtm-bot.yml uses github-script@v7 tag
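The `/diagnose` gate combines a comment-body match with the `author_association` check. A minimal sketch; note it uses prefix matching, which is stricter than the workflow's `contains()` (a looseness the review flagged):

```python
def should_run_diagnose(comment_body: str, author_association: str) -> bool:
    """Trigger only on comments that start with /diagnose, posted by users
    with some repository association (author_association != NONE)."""
    return (comment_body.strip().startswith("/diagnose")
            and author_association != "NONE")
```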

Acknowledgments

Thanks to @natevm for PR #6258 (workflow error classification) which inspired the LLM diagnosis feature, and to @dillon-cullinan for the thorough review of #6066 that shaped this rewrite.

…nostics

Add automated merge readiness tracking via the `lgtm` label and a short
merge checklist comment that updates on CI completion and review changes.
External contributors get a Required Reviewers comment showing which
CODEOWNERS teams need to review their PR.

The `/diagnose` slash command provides LLM-powered CI failure analysis
using NVIDIA inference, giving root cause and fix suggestions. A reusable
workflow lets existing CI pipelines auto-classify errors on failure.

Architecture: 2 workflows + 1 reusable, 2 Python scripts, 1 shared
LLM client package. All GitHub API calls use `gh` CLI. 1,307 total lines.

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia dagil-nvidia requested a review from a team February 15, 2026 02:38
@dagil-nvidia dagil-nvidia requested a review from a team as a code owner February 15, 2026 02:38
@github-actions github-actions bot added ci Issues/PRs that reference CI build/test documentation Improvements or additions to documentation actions labels Feb 15, 2026

coderabbitai bot commented Feb 15, 2026

Walkthrough

Adds LGTM Bot documentation, GitHub Actions workflows and scripts for CI diagnostics and error classification, and a new error_classification Python package with an LLM client and configuration utilities.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Documentation**: `.github/LGTM_BOT.md`, `CONTRIBUTING.md` | New LGTM Bot documentation and CONTRIBUTING updates describing CI/merge process, LGTM behavior, diagnostics, and pre-commit guidance. |
| **Workflows**: `.github/workflows/lgtm-bot.yml`, `.github/workflows/lgtm-bot-diagnose.yml`, `.github/workflows/classify-workflow-errors-reusable.yml` | Added three workflows: LGTM bot for merge readiness and CODEOWNER routing, a diagnose workflow triggered by comments/dispatch, and a reusable classify-workflow-errors workflow. |
| **Automation scripts**: `.github/scripts/classify_workflow_errors.py`, `.github/scripts/lgtm_bot_diagnose.py` | New Python scripts to collect CI failures, fetch logs, run LLM-based classification/diagnosis, and post or update annotated Markdown comments on PRs. |
| **Error classification package**: `error_classification/__init__.py`, `error_classification/config.py`, `error_classification/llm_client.py`, `error_classification/requirements.txt` | New package with Apache-2.0 header, a `Config` dataclass (`from_env`), an `LLMClient` with rate limiting/retries and a `classify_error` API, plus a requirements file. |
| **Repo filters**: `.github/filters.yaml` | Added `error_classification/**` to ignore patterns. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I nibble on logs in the pale CI light,

I hop through workflows, making failures bright,
I whisper diagnoses in tidy Markdown rhyme,
Labeling PRs so they pass in good time —
Hooray for bots, and carrots for every green check! 🥕✨

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

- **Docstring Coverage** (⚠️ Warning): Docstring coverage is 71.43%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
- **Merge Conflict Detection** (⚠️ Warning): ❌ Merge conflicts detected (25 files):

⚔️ .github/filters.yaml (content)
⚔️ CONTRIBUTING.md (content)
⚔️ Cargo.lock (content)
⚔️ Cargo.toml (content)
⚔️ benchmarks/router/run_engines.sh (content)
⚔️ components/src/dynamo/mocker/args.py (content)
⚔️ deploy/helm/charts/platform/components/operator/templates/prometheus.yaml (content)
⚔️ deploy/operator/docs/footer.md (content)
⚔️ deploy/operator/internal/consts/consts.go (content)
⚔️ deploy/operator/internal/controller/dynamocomponentdeployment_controller_test.go (content)
⚔️ deploy/operator/internal/dynamo/component_worker.go (content)
⚔️ deploy/operator/internal/dynamo/graph_test.go (content)
⚔️ docs/pages/components/router/router-guide.md (content)
⚔️ docs/pages/kubernetes/api-reference.md (content)
⚔️ docs/pages/kubernetes/observability/metrics.md (content)
⚔️ lib/kv-router/Cargo.toml (content)
⚔️ lib/kv-router/benches/kv_indexer_bench.rs (content)
⚔️ lib/kv-router/benches/radix_tree_microbench.rs (content)
⚔️ lib/kv-router/src/bench_utils.rs (content)
⚔️ lib/llm/Cargo.toml (content)
⚔️ lib/llm/benches/kv_router_bench.rs (content)
⚔️ lib/llm/src/mocker.rs (content)
⚔️ lib/llm/src/preprocessor.rs (content)
⚔️ lib/llm/src/protocols/openai/nvext.rs (content)
⚔️ lib/mocker/src/protocols.rs (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (2 passed)
- **Title check** (✅ Passed): The PR title accurately summarizes the main change: introducing an LGTM Bot for external contributor merge visibility.
- **Description check** (✅ Passed): The PR description is comprehensive and well-structured, covering problem statement, solution overview with feature table, detailed /diagnose output example, merge checklist format, system architecture with flowchart, files changed table, testing instructions, and security considerations.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around line 118-122: The loop currently sends empty logs from get_job_log to
client.classify_error; update the loop inside the jobs iteration to skip or mark
jobs with no log data by checking the return of get_job_log (e.g., if not log:
continue or set classification to a sentinel like "no_log") before calling
client.classify_error, so only non-empty logs are passed to classify_error and
results.append receives either a real classification or a clear "no_log"
classification for that job.
- Around line 178-206: The subprocess.run calls inside the try block that post
or patch the GH comment (the branches that run when existing_id is truthy or
falsy) do not check return codes or report failures; update those calls to
either use subprocess.run(..., check=True) or capture the CompletedProcess and
inspect returncode, logging stderr/stdout on failure and raising or exiting
non‑zero so the script does not falsely report success; keep the os.unlink(tmp)
in the finally block to ensure cleanup, and add process/log context mentioning
the PR number and existing_id when logging the error.

In @.github/scripts/lgtm_bot_diagnose.py:
- Around line 77-99: get_check_status currently only requests the first page of
check-runs (per_page=100) so runs beyond page 1 are dropped; update
get_check_status to paginate through the GitHub API (e.g., loop incrementing a
page parameter and merging data.get("check_runs", []) until no more results or
headers indicate end) and aggregate passed/failed/pending/failures across all
pages; apply the same pagination fix or add a TODO for the other affected
functions mentioned (get_review_summary and post_or_update_comment) so they also
fetch all pages instead of only the first.
- Around line 180-185: The generated markdown builds a link using f['url'] which
can be empty (html_url default ""), producing a broken link; update the
failed-checks loop in the function that builds `lines` (the block iterating over
checks["failures"]) to check the URL truthiness (e.g., f.get('url') or f['url']
!= "") and append "- check-name" as plain text when the URL is empty, otherwise
append the markdown link "- [check-name](url)"; ensure you reference the same
keys f['name'] and f['url'] so the output falls back to plain text for empty
URLs.
- Around line 245-273: The subprocess.run calls that POST/PATCH GitHub comments
(the two subprocess.run invocations guarded by existing_id and the tmp variable)
silently ignore failures; change them to capture output and enforce failure
detection (e.g., use subprocess.run(..., check=True, capture_output=True) or
assign the result and check result.returncode), log or surface stderr/exception
details, and propagate the error (raise) so the caller (main) doesn't treat a
failed comment as success; ensure tmp is still unlinked in the finally block
after handling/logging the error.

In @.github/workflows/lgtm-bot.yml:
- Line 109: The self-exclusion filter currently using c.name !== 'LGTM Bot' is
incorrect and is a no-op; update the filter logic (the allChecks.filter call
that assigns ciChecks) to exclude the actual check run names produced by the
workflow (e.g., 'evaluate-lgtm' and 'request-reviews') instead of 'LGTM Bot' so
the bot does not evaluate its own runs — locate the allChecks.filter(...)
expression and replace or extend the predicate to skip c.name matching
'evaluate-lgtm' and 'request-reviews' (or use a whitelist/blacklist rule that
reliably identifies bot-generated check names).
- Around line 109-117: The current logic sets allPassed = ciChecks.every(c =>
c.conclusion === 'success'), which treats 'skipped' and 'neutral' as failures
and prevents the LGTM label; update the allPassed check (referencing ciChecks
and allPassed) to treat conclusions 'success', 'skipped', and 'neutral' as
passing (e.g., check c.conclusion is one of those values), and keep anyFailed
and anyPending semantics (anyFailed should still only consider explicit
'failure' conclusions) so conditional/matrix jobs that are skipped/neutral don't
block lgtm.
- Around line 282-303: The reopen handler currently always calls
github.rest.issues.createComment with the constructed body (using marker '<!--
lgtm-bot-reviewers -->', rows from teamSlugs and variable body), which creates
duplicate "Required Reviewers" comments; change the logic to first list existing
issue comments (github.rest.issues.listComments), search for a comment whose
body includes the marker string '<!-- lgtm-bot-reviewers -->', and if found call
github.rest.issues.updateComment with that comment's id to replace the body,
otherwise call github.rest.issues.createComment as before; keep the same marker,
body construction, and owner/repo/issue_number values (and use prNumber) so the
find-and-update pattern mirrors the checklist implementation.
- Line 298: Summary: The PR comment uses a relative link
'[CODEOWNERS](../blob/main/CODEOWNERS)' which won't resolve in PR comments;
replace it with an absolute GitHub URL. Locate the string literal '*Based on
[CODEOWNERS](../blob/main/CODEOWNERS) and changed files.*' in
.github/workflows/lgtm-bot.yml and change the relative markdown link to an
absolute one that uses the repository context (for example using
https://github.com/${{ github.repository }}/blob/main/CODEOWNERS) so the link
resolves correctly in PR comments.

In `@error_classification/requirements.txt`:
- Line 1: The requirements.txt is missing the standard SPDX copyright header;
add the same SPDX header lines used across the repo (the SPDX-FileCopyrightText
and SPDX-License-Identifier lines) to the very top of requirements.txt, above
the existing "requests>=2.28.0" entry, ensuring the header text and formatting
exactly match other files in the project.
🧹 Nitpick comments (9)
.github/scripts/lgtm_bot_diagnose.py (1)

46-60: gh_api silently swallows errors — consider logging stderr for debuggability.

The sibling gh_api in classify_workflow_errors.py (line 55) logs r.stderr[:200] on failure, but this version returns None silently. In a CI diagnostic tool, this makes it harder to troubleshoot when API calls fail.

Suggested improvement
     r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
     if r.returncode != 0:
+        print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
         return None
.github/LGTM_BOT.md (1)

63-71: Add a language identifier to the fenced code block.

The linter flags this block (MD040). Since it's a plain-text diagram, use ```text.

.github/workflows/lgtm-bot-diagnose.yml (1)

24-27: contains() is a loose match — any comment containing /diagnose will trigger.

A comment like "I already ran /diagnose earlier" would re-trigger the workflow. Consider using startsWith or a stricter pattern if this becomes noisy. Low risk since the operation is read-only and idempotent.

error_classification/config.py (1)

24-48: LGTM — clean config loader with sensible defaults.

Minor nit: lines 37 and 42-46 mix os.getenv and os.environ.get, which are functionally identical but stylistically inconsistent. Picking one would be slightly cleaner.

error_classification/llm_client.py (3)

71-77: Sleeping inside the lock serializes all waiters unnecessarily.

time.sleep() inside self._lock blocks other threads from even checking the interval until the sleeping thread wakes and releases. A common pattern is to compute the needed delay inside the lock, update _last_call optimistically, then sleep outside the lock.

For a CI bot with a handful of sequential LLM calls this is unlikely to matter in practice, but it's worth noting if the client is ever used with concurrent classification.

♻️ Optional: sleep outside the lock
     def wait(self) -> None:
+        delay = 0.0
         with self._lock:
             now = time.time()
             elapsed = now - self._last_call
             if elapsed < self._min_interval:
-                time.sleep(self._min_interval - elapsed)
-            self._last_call = time.time()
+                delay = self._min_interval - elapsed
+            self._last_call = now + delay
+        if delay > 0:
+            time.sleep(delay)

111-148: Retry loop conflates retriable HTTP errors with non-retriable parse failures.

Currently, both branches (lines 119-126 for HTTP 429/5xx and lines 137-148 for exceptions) sleep and retry. If the LLM returns HTTP 200 with unparseable output, the code retries the identical request 3 times with backoff — unlikely to yield different JSON. Consider distinguishing transient errors (network/429/5xx) from deterministic parse failures to avoid wasted retries and latency.

Also, the rate limiter wait() on line 95 is called only once before the retry loop, so retries bypass rate limiting. This is fine when retries follow a long backoff, but on a fast non-429 failure caught by the except branch the next attempt fires immediately (attempt 0 → catch → sleep 4s is fine, but logically it should re-consult the limiter).

Minor: if all 3 attempts hit the continue path (lines 119-126), the exception handler is never entered yet the method still correctly returns None at line 150 — that's good.


132-133: Markdown-fence stripping can misfire on edge cases.

content.split("\n", 1)[1] raises IndexError when the content is exactly ``` with no newline (caught by the except handler, triggering a needless retry). More subtly, rsplit("```", 1) will clip valid JSON if the model output itself contains a triple-backtick string literal inside a value.

A slightly more robust approach:

♻️ Suggested improvement
-                if content.startswith("```"):
-                    content = content.split("\n", 1)[1].rsplit("```", 1)[0].strip()
+                if content.startswith("```"):
+                    lines = content.split("\n")
+                    # Drop first (```json) and last (```) fence lines
+                    if lines[-1].strip() == "```":
+                        content = "\n".join(lines[1:-1]).strip()
+                    else:
+                        content = "\n".join(lines[1:]).strip()
.github/scripts/classify_workflow_errors.py (2)

164-172: Comment listing fetches only the first page (100 comments) — duplicate comment possible on busy PRs.

gh_api makes a single request with per_page=100. If a PR has more than 100 comments, the classifier marker may not be found, resulting in a duplicate comment being created. This is a low-probability edge case for most PRs.


106-111: sys.path mutation is fragile; consider making error_classification installable.

sys.path.insert(0, ...) at line 109 works but couples the script to a specific directory layout. If the error_classification package has a requirements.txt (per the summary), adding a minimal pyproject.toml/setup.py and installing it with pip install ./error_classification in the workflow would be cleaner and would avoid import path gymnastics.


github-actions bot commented Feb 15, 2026

🤖 LGTM Bot

Merge Checklist

  • CI Checks: 2 failing, 46 passed, 9 skipped
  • Review Status: Awaiting Approval
  • Mergeable: Blocked (Pending Review), 40 unresolved conversations, Changes Requested

Type /diagnose for AI-powered CI failure analysis.
Updates automatically on CI completion and review changes.

- Add issues: write permission to evaluate-lgtm and request-reviews jobs
- Fix self-exclusion filter to use job names instead of workflow name
- Accept skipped/neutral check conclusions as passing
- Skip LLM classification when log is empty
- Add SPDX header to requirements.txt
- Log subprocess failures in comment posting
- Handle empty URLs in markdown links
- Find-and-update for reviewers comment to avoid duplicates on reopen
- Fix CODEOWNERS link to use absolute GitHub URL

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia
Collaborator Author

/diagnose

…list format

- Add error_classification/** to ignore filter in .github/filters.yaml
  to fix changed-files coverage check failure
- Add workflow_dispatch trigger to lgtm-bot-diagnose.yml for manual
  testing (issue_comment only works from default branch)
- Rewrite merge checklist with CI counts, CODEOWNERS review status,
  and unresolved conversation detection via GraphQL

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia
Collaborator Author

@coderabbitai full review


coderabbitai bot commented Feb 15, 2026

✅ Actions performed

Full review triggered.

@ai-dynamo ai-dynamo deleted a comment from coderabbitai bot Feb 15, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around line 143-145: The confidence formatting uses f"{conf:.0%}" which treats
conf as a 0–1 fraction and will produce incorrect results for integer percent
values (e.g., 85 → "8500%"); update the pct calculation around the conf variable
in classify_workflow_errors.py to normalize and safely format values: parse
string confidences to a number, if conf > 1 and conf <= 100 divide by 100, clamp
values below 0 to 0 and above 1 to 1, then format as a percent (or fall back to
the original string if parsing fails). Implement this normalization logic where
pct is computed (the conf → pct assignment) or extract into a small helper like
format_confidence(conf) and use that.
- Around line 168-176: The comment search in _post_or_update_comment currently
calls gh_api("/repos/{repo}/issues/{pr_number}/comments?per_page=100") without
pagination, so it can miss the classifier marker and create duplicates; update
this logic to paginate through all comment pages (e.g., loop pages or use the
repo API's paginate helper) when fetching comments from gh_api, scanning each
page for COMMENT_MARKER and only setting existing_id if the marker is found,
then proceed to update or create as before.

In @.github/scripts/lgtm_bot_diagnose.py:
- Around line 130-151: The run_classifier function currently only catches
subprocess.TimeoutExpired and json.JSONDecodeError, so if subprocess.run raises
an OSError (e.g., FileNotFoundError when "python3" or the script is missing) the
program will crash; update run_classifier to also catch OSError (or
FileNotFoundError) around the subprocess.run call, log the exception to stderr
similarly to the existing except block (e.g., print f"Classifier failed:
{exc}"), and return an empty list on that error so main() can continue;
reference the run_classifier function and the subprocess.run invocation when
making this change.

In @.github/workflows/lgtm-bot.yml:
- Around line 193-196: The ternary assigning prefix is redundant because both
branches produce the same string; replace the ternary expression with a simple
assignment (e.g., const prefix = ' · ';), or remove prefix entirely and append
the separator directly when updating revLine in the block that handles
requestedTeams, referencing the variables requestedTeams, approved and revLine
to locate the code.
🧹 Nitpick comments (5)
.github/LGTM_BOT.md (1)

63-71: Add a language identifier to the fenced code block.

The architecture diagram block lacks a language specifier. Use ```text to satisfy markdown linters.

Proposed fix
-```
+```text
 .github/workflows/lgtm-bot.yml            → label + checklist + CODEOWNERS
error_classification/llm_client.py (1)

63-78: Rate limiter holds the lock during sleep — acceptable here but worth a note.

time.sleep() inside the lock (line 76) blocks all other threads from even checking the limiter until the sleep completes. This is fine for this low-concurrency CI tooling context, but if _RateLimiter is ever reused in a higher-concurrency setting, consider computing the sleep duration under the lock and sleeping after releasing it.

.github/scripts/lgtm_bot_diagnose.py (1)

46-60: gh_api silently swallows errors — consider logging to stderr like the sibling script.

When returncode != 0, this function returns None with no logging. The equivalent function in classify_workflow_errors.py (line 55) logs r.stderr[:200] to stderr, which helps with debugging in CI. Consider adding similar logging here for consistency and observability.

Proposed fix
     r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
     if r.returncode != 0:
+        print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
         return None
.github/workflows/lgtm-bot.yml (2)

305-330: Hardcoded CODEOWNERS mapping will silently drift from the actual CODEOWNERS file.

The team-to-path rules here are a manual mirror of the repo's CODEOWNERS. If CODEOWNERS is updated (new paths, renamed teams), this block won't reflect that until someone remembers to edit it. Consider adding an inline comment pointing maintainers to keep these in sync, or parsing the CODEOWNERS file at runtime.


70-71: Pin actions/github-script@v7 to a commit SHA for supply-chain consistency.

Both usages (lines 70 and 291) rely on a mutable version tag. Other workflows in this repository (lgtm-bot-diagnose.yml, classify-workflow-errors-reusable.yml) pin actions/checkout and actions/setup-python by commit SHA to prevent tag hijacking. Apply the same pinning practice to github-script for consistency.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @.github/scripts/classify_workflow_errors.py:
- Around lines 47-58: `gh_api` currently calls `json.loads(r.stdout)` without guarding against non-JSON output. Wrap the call in a try/except catching `json.JSONDecodeError` (and/or `ValueError` for broader compatibility); on decode failure, print a short diagnostic (e.g., `r.stdout[:200]` or `r.stderr[:200]`) to `sys.stderr`, mirroring the existing error path, and return `None`, preserving the original "return parsed JSON or None" contract.
- Around lines 259-265: The success message in `main()` is printed unconditionally after calling `_post_or_update_comment`. Change `_post_or_update_comment` to either raise an exception on failure or return a boolean status, then have `main()` capture that result and print the "Posted classification" message only when the call succeeded.

In @.github/scripts/lgtm_bot_diagnose.py:
- Around lines 219-224: The AI analysis section can be omitted entirely when an API key exists and there are failed checks but the classifier produced no diagnoses. After the existing `has_api_key` / `checks["failed"]` branches, add a fallback that checks whether `diagnoses` is empty and appends a clear message to `lines` (e.g., "AI analysis attempted but returned no results — please retry or check classifier logs") so users always see feedback when classification was attempted.
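A sketch of that fallback branch using the variable names from the note above (`build_analysis_lines` is a hypothetical wrapper; the real script builds `lines` inline):

```python
def build_analysis_lines(has_api_key: bool, checks: dict, diagnoses: list) -> list[str]:
    """Build the AI-analysis section, always giving the reader some feedback."""
    lines: list[str] = []
    if not checks["failed"]:
        return lines  # nothing to analyze
    if not has_api_key:
        lines.append("AI analysis skipped: no API key configured.")
    elif diagnoses:
        lines.extend(f"- {d}" for d in diagnoses)
    else:
        # The classifier ran but produced nothing; don't leave the section empty.
        lines.append(
            "AI analysis attempted but returned no results; "
            "please retry or check classifier logs."
        )
    return lines
```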

In @.github/workflows/lgtm-bot.yml:
- Around lines 346-347: The code computing `isExternal` uses `pr.head.repo.full_name` and can throw if `pr.head.repo` is null (deleted fork). Guard the access with optional chaining (e.g., `pr.head.repo?.full_name ?? ''`) or an explicit `pr.head.repo !== null` check so `isExternal` is computed without raising a TypeError, and decide deliberately whether a missing head repo should count as external.
🧹 Nitpick comments (2)
.github/workflows/lgtm-bot.yml (1)

305-330: Hardcoded CODEOWNERS mapping will silently drift from the actual CODEOWNERS file.

This duplicates (and can diverge from) the real CODEOWNERS file. Consider parsing CODEOWNERS at runtime or at least adding a comment referencing the source of truth so future maintainers know to update both places.

.github/scripts/lgtm_bot_diagnose.py (1)

46-60: gh_api swallows non-zero exit silently and doesn't guard against malformed JSON.

Two concerns:

  1. Unlike the sibling gh_api in classify_workflow_errors.py, this version doesn't log r.stderr on failure, making debugging harder.
  2. json.loads(r.stdout) on line 60 can raise json.JSONDecodeError if gh returns non-empty but malformed output, which would propagate as an unhandled exception.
Proposed fix:

```diff
 def gh_api(
     path: str,
     method: str = "GET",
     fields: dict[str, str] | None = None,
 ) -> Any:
     """Call GitHub API via ``gh`` CLI."""
     cmd = ["gh", "api", path]
     if method != "GET":
         cmd.extend(["--method", method])
     for k, v in (fields or {}).items():
         cmd.extend(["-f", f"{k}={v}"])
     r = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
     if r.returncode != 0:
+        print(f"gh api {path}: {r.stderr[:200]}", file=sys.stderr)
         return None
-    return json.loads(r.stdout) if r.stdout.strip() else {}
+    try:
+        return json.loads(r.stdout) if r.stdout.strip() else {}
+    except json.JSONDecodeError:
+        print(f"gh api {path}: invalid JSON response", file=sys.stderr)
+        return None
```

- Add SKIP_CLASSIFICATION list to skip LLM on lint, copyright, DCO,
  etc. where logs already explain exactly what to fix
- Fix confidence rendering: normalize 0-100 to 0-1 range
- Wrap gh_api JSON parsing in try/except for malformed responses
- Return success/failure from _post_or_update_comment
- Add OSError to classifier subprocess except tuple
- Add feedback when classifier returns empty results
- Fix dead ternary in review status line
- Guard against null pr.head.repo for deleted forks

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…diagnose

Consolidate classifier into diagnose script (-85 net lines):
- Merge classify_workflow_errors.py into lgtm_bot_diagnose.py
- Delete unused classify-workflow-errors-reusable.yml

Add GitLab CI integration:
- Fetch failed job details and logs via GitLab API (PRIVATE-TOKEN)
- Auto-classify timeout jobs without LLM
- Add contextual note explaining GitLab CI for external contributors

Group failures by merge impact:
- Blocking: 5 required status checks with sub-job diagnoses nested
  under their parent rollup check (WORKFLOW_TO_REQUIRED mapping)
- Non-blocking: GitLab and other informational failures

Improve output format:
- Compact one-liners for passing/pending, collapsible details for failures
- Reply-link posted to /diagnose comment with URL to diagnostic comment
- Include commit statuses (GitLab, CodeRabbit) in merge checklist counts
- Title case consistency across all bot output

Total: 1,381 lines across all bot files (down from 1,467).
Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
@dagil-nvidia dagil-nvidia changed the title ci: add LGTM Bot with merge readiness tracking and AI-powered CI diagnostics ci: add LGTM Bot for external contributor merge visibility Feb 16, 2026
issue_comment and workflow_dispatch triggers only activate when the
workflow file exists on the default branch.  Since this is a new file,
workflow_dispatch is dead weight until merge.  Remove it to simplify
the workflow and avoid misleading test instructions.

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: dagil-nvidia <dagil@nvidia.com>
@dillon-cullinan dillon-cullinan left a comment

I'm very much against this code. It is attempting to put things already visible in the GitHub UI into a single comment.

Along with that, in its current state it has multiple problems: unconsidered edge cases, redundant code, and even hallucinated code.

# ---------------------------------------------------------------------------


def gh_api(

This function is defined twice across two files. Why?

fields: dict[str, str] | None = None,
) -> Any:
"""Call GitHub API via ``gh`` CLI. Returns parsed JSON or *None*."""
cmd = ["gh", "api", path]

This isn't using the gh CLI the way I meant. We should be using subcommands like `gh pr checks ####` to get check statuses.

return None


def add_reaction(repo: str, comment_id: str, reaction: str) -> None:

Why is this a new function? All it does is call gh_api. Why not just use gh_api instead of defining this?

own_jobs = {"evaluate-lgtm", "request-reviews", "diagnose"}

# 1. Check runs (GitHub Actions)
data = gh_api(f"/repos/{repo}/commits/{sha}/check-runs?per_page=100")

Why is it paginated? What if we have more than 100 jobs?
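One way to close that gap is to keep requesting pages until a short page comes back (the `gh api` command also has a `--paginate` flag that follows Link headers). A sketch with a hypothetical `fetch_page` callable standing in for the check-runs request:

```python
from typing import Callable


def fetch_all_check_runs(fetch_page: Callable[[int], list], per_page: int = 100) -> list:
    """Collect check runs across pages; stop when a page comes back short."""
    runs: list = []
    page = 1
    while True:
        batch = fetch_page(page)
        runs.extend(batch)
        if len(batch) < per_page:
            break  # last page reached
        page += 1
    return runs
```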

)

# 2. Commit statuses (GitLab, CodeRabbit, etc.)
status_data = gh_api(f"/repos/{repo}/commits/{sha}/status")

What is the status API offering that check-runs is not?

Comment on lines +315 to +340
// Simple CODEOWNERS matching
const teams = new Set();
for (const f of files) {
const p = f.filename;
if (p.endsWith('.rs') || p.includes('Cargo.toml')) {
teams.add('dynamo-rust-codeowners');
}
if (p.endsWith('.py')) {
teams.add('python-codeowners');
teams.add('Devops');
}
if (p.startsWith('.github/')) teams.add('Devops');
if (p.startsWith('deploy/')) {
teams.add('dynamo-deploy-codeowners');
}
if (
p.startsWith('container/') ||
p.startsWith('examples/') ||
p.startsWith('recipes/')
) {
teams.add('Devops');
teams.add('dynamo-rust-codeowners');
teams.add('python-codeowners');
teams.add('dynamo-deploy-codeowners');
}
}

What is this?

'|------|--------|',
rows,
'',
'*Based on [CODEOWNERS](https://github.com/ai-dynamo/dynamo/blob/main/CODEOWNERS) and changed files.*',

This is not true. The comment being built is based on hallucinated codeowners in this workflow. See this comment: https://github.com/ai-dynamo/dynamo/pull/6313/changes#r2819268809



@dataclass
class Config:

What is the point of this class? Are we publishing a library or re-using this in some meaningful way?

import sys
import time
from dataclasses import dataclass, field
from threading import Lock

Why are we importing threading?

}

for attempt in range(3):
try:

The sleep call should be within this for loop, not outside it.
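The corrected shape keeps the sleep inside the retry loop so attempts are actually spaced out; a minimal sketch (the function name and delay values are illustrative, not the script's actual code):

```python
import time


def retry(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn, retrying with linear backoff. Assumes attempts >= 1."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # bot-level catch-all
            last_exc = exc
            if attempt < attempts - 1:
                # Sleeping here, inside the loop, spaces out the retries;
                # a sleep after the loop would run once, after all failures.
                time.sleep(base_delay * (attempt + 1))
    raise last_exc
```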

@dillon-cullinan

/diagnose

@Jont828

Jont828 commented Feb 19, 2026

For anyone who doesn't have context, @dagil-nvidia, @sozercan, and I were discussing the confusion around what it means for a PR to be approved. The GitHub UI is confusing since anyone can approve a PR, but we can merge only when it's approved by a codeowner of the part of the project that the PR touches. The GitHub UI also shows approval from anyone with write access, not approval from the owner of that area of the codebase.

Kubernetes OSS repos have a system where you can type an /approve command, and it adds the approve label if and only if you are among the approvers for that part of the codebase. Our suggestion is to add a label indicating whether a PR is approved by someone who can merge to that area of the project, and use that label as the source of truth. That way, there's no confusion about whether an "approved" PR is actually approved.

An example of this is in a PR for the Cluster API project:
*(screenshot: a Cluster API PR showing the approved label)*

@dagil-nvidia dagil-nvidia marked this pull request as draft March 12, 2026 21:14

Labels

actions ci Issues/PRs that reference CI build/test documentation Improvements or additions to documentation size/XXL


3 participants