
chore: Enable use of local carbide-api instead of mock-core #197

Open
pbreton wants to merge 1 commit into main from chore/local-carbide

Conversation

@pbreton
Contributor

@pbreton pbreton commented Mar 5, 2026

  • Update local stack to connect to local carbide-core
  • Add a test script for validating the connection to carbide-api

This depends on PR #449.

Copilot AI review requested due to automatic review settings March 5, 2026 02:20
@copy-pr-bot

copy-pr-bot Bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented Mar 5, 2026

Test Results

8 700 tests  ±0   8 700 ✅ ±0   8m 9s ⏱️ - 1m 35s
  143 suites ±0       0 💤 ±0 
   14 files   ±0       0 ❌ ±0 

Results for commit be29eb2. ± Comparison against base commit 75dc45c.

♻️ This comment has been updated with latest results.


Copilot AI left a comment


Pull request overview

Enables the local kind stack to use a locally running carbide-api (from bare-metal-manager-core) instead of the in-cluster mock-core, and adds a smoke-test script to validate the end-to-end API → core integration path.

Changes:

  • Add scripts/test-local-core.sh to exercise auth, tenant/site discovery, and resource creation through the local core path.
  • Update Makefile kind-reset to optionally skip building/deploying mock-core and patch site-agent to point at the local core with TLS and injected certs.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Files:

  • scripts/test-local-core.sh — New local integration smoke-test script for API → core connectivity and basic workflows.
  • Makefile — Adds a LOCAL_CORE mode to connect site-agent to a locally running core and manage certs/config patches during kind-reset.


Comment on lines +64 to +72
local path="$1"
local out http_code
out=$(curl -s -w "\n%{http_code}" "$BASE_URL$path" \
-H "Authorization: Bearer $TOKEN" \
-H "Accept: application/json")
http_code=$(echo "$out" | tail -n1)
body=$(echo "$out" | sed '$d')
if [[ "$http_code" -lt 200 || "$http_code" -ge 300 ]]; then
die "GET $path → HTTP $http_code: $body"

Copilot AI Mar 5, 2026


In api_get, body is assigned without being declared local, which leaks it into the global scope and makes the helper have hidden side effects. Declare local body (and include it in the local ... list) so each call is self-contained and can't accidentally overwrite a global body variable used elsewhere.
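The scoping behavior behind this comment is easy to demonstrate in isolation (a standalone sketch, not the script's actual helper):

```shell
#!/usr/bin/env bash
# Without `local`, an assignment inside a function lands in the caller's
# scope and silently clobbers any variable of the same name.
leaky_get()  { body="from-helper"; }
scoped_get() { local body; body="from-helper"; }

body="caller-value"
leaky_get
leaked="$body"        # "from-helper" — the global was overwritten

body="caller-value"
scoped_get
kept="$body"          # still "caller-value" — the helper stayed self-contained

echo "$leaked $kept"  # prints: from-helper caller-value
```

Adding `body` to the existing `local out http_code` line in `api_get` is enough to get the scoped behavior.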

UPDATE_TAG=$(kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
psql -U "$PG_USER" -d "$PG_DB" -c \
"UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';" \
2>&1 | grep -E '^UPDATE')

Copilot AI Mar 5, 2026


This command substitution relies on grep matching ^UPDATE. With set -euo pipefail, a no-match grep (exit 1) can terminate the script immediately, so the later explicit checks/die message may never run. Make the pipeline non-fatal (e.g., append || true / disable errexit just for this capture) and then validate UPDATE_TAG/exit status explicitly.

Suggested change
2>&1 | grep -E '^UPDATE')
2>&1 | grep -E '^UPDATE' || true)
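The failure mode is reproducible outside the script; a minimal sketch using simulated psql output rather than the real kubectl/psql command:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Simulated psql output with no leading UPDATE tag (e.g. an error message).
out='ERROR: relation "site" does not exist'

# grep exits 1 when nothing matches; under pipefail that failure propagates
# through the pipeline, and set -e would abort the script at this assignment
# before any later check could run. '|| true' makes the capture non-fatal.
tag=$(printf '%s\n' "$out" | grep -E '^UPDATE' || true)

if [[ -z "$tag" ]]; then
  echo "no UPDATE tag captured — explicit error handling runs"
fi
```

Without the `|| true`, the script would exit at the assignment and the intended `die` diagnostics would never print.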

Comment on lines +105 to +111
# Run a SQL statement inside the in-cluster postgres pod.
db_exec() {
local sql="$1"
kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
psql -U "$PG_USER" -d "$PG_DB" -tAc "$sql"
}


Copilot AI Mar 5, 2026


db_exec() is defined but not used anywhere in this script, which adds noise and can mislead readers about how DB access is performed. Remove it, or refactor the later direct kubectl exec ... psql call sites to use this helper consistently.

Suggested change
# Run a SQL statement inside the in-cluster postgres pod.
db_exec() {
local sql="$1"
kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
psql -U "$PG_USER" -d "$PG_DB" -tAc "$sql"
}

Comment thread Makefile
Comment on lines +468 to +477
@if [ "$(LOCAL_CORE)" = "true" ]; then \
echo "Creating core-grpc-client-site-agent-certs secret from $(LOCAL_CORE_CERTS_DIR)..." ; \
for f in ca.crt client.crt client.key; do \
[ -f "$(LOCAL_CORE_CERTS_DIR)/$$f" ] || { echo "ERROR: $$f not found in $(LOCAL_CORE_CERTS_DIR). Run gen-certs.sh in bare-metal-manager-core first."; exit 1; } ; \
done ; \
kubectl -n carbide-rest create secret generic core-grpc-client-site-agent-certs \
--from-file=ca.crt="$(LOCAL_CORE_CERTS_DIR)/ca.crt" \
--from-file=tls.crt="$(LOCAL_CORE_CERTS_DIR)/client.crt" \
--from-file=tls.key="$(LOCAL_CORE_CERTS_DIR)/client.key" \
--dry-run=client -o yaml | kubectl apply -f - ; \

Copilot AI Mar 5, 2026


When LOCAL_CORE=true, this creates/updates the core-grpc-client-site-agent-certs Secret, but deploy/kustomize/base/site-agent/certificate.yaml also defines a cert-manager Certificate that continuously reconciles the same secret name. That controller will likely overwrite the manually provided local-core certs, causing flapping/connection failures. Consider deleting/omitting the Certificate resource when LOCAL_CORE=true (e.g., kubectl delete certificate core-grpc-client-site-agent-certs) or using a distinct secret name for local-core certs and patching the site-agent deployment to mount that secret instead.

@pbreton pbreton changed the title from "chore: enable to use local carbide-api instead of mock-core" to "chore: Enable use of local carbide-api instead of mock-core" Mar 6, 2026
@pbreton pbreton force-pushed the chore/local-carbide branch from c8a1846 to 240cbc1 on March 6, 2026 03:27
@github-actions

github-actions Bot commented Mar 6, 2026

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-06 03:28:18 UTC | Commit: 240cbc1

@github-actions

github-actions Bot commented Mar 6, 2026

🛡️ Vulnerability Scan

🚨 Found 65 vulnerability(ies)
📊 vs main: 65 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 65
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-19 21:01:44 UTC | Commit: 411c2dc

Copilot AI review requested due to automatic review settings March 6, 2026 19:31
@pbreton pbreton force-pushed the chore/local-carbide branch from 240cbc1 to d12e81a on March 6, 2026 19:31

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.



Comment on lines +89 to +97
# Resource already exists; the conflict response carries the existing id
# under .data.id — reuse it so the script is idempotent across runs.
existing_id=$(echo "$body" | jq -r '.data.id // empty')
if [[ -n "$existing_id" ]]; then
info "Already exists (HTTP 409) — reusing id: $existing_id"
echo "{\"id\":\"$existing_id\"}"
return
fi
die "POST $path → HTTP 409 (conflict) but no id in response: $body"

Copilot AI Mar 6, 2026


The script claims idempotency on HTTP 409 by extracting an existing resource id from .data.id, but many API conflict responses use validation errors (e.g., { "data": { "name": "..." } }) and do not include an id. In those cases existing_id will be empty and the script exits, so re-running the smoke test fails. Consider handling 409 by performing a follow-up lookup (e.g., GET list filtered by name/site and extracting .id) rather than assuming .data.id is present.

Suggested change
# Resource already exists; the conflict response carries the existing id
# under .data.id — reuse it so the script is idempotent across runs.
existing_id=$(echo "$body" | jq -r '.data.id // empty')
if [[ -n "$existing_id" ]]; then
info "Already exists (HTTP 409) — reusing id: $existing_id"
echo "{\"id\":\"$existing_id\"}"
return
fi
die "POST $path → HTTP 409 (conflict) but no id in response: $body"
# Resource already exists; the conflict response is expected to carry
# the existing id under .data.id — reuse it so the script is idempotent
# across runs.
existing_id=$(echo "$body" | jq -r '.data.id // empty')
if [[ -n "$existing_id" ]]; then
info "Already exists (HTTP 409) — reusing id from conflict response: $existing_id"
echo "{\"id\":\"$existing_id\"}"
return
fi
# Some APIs return validation-style conflict responses without an id.
# In that case, attempt a follow-up lookup (e.g., by name) to recover
# the existing resource id and keep the script idempotent.
local name lookup_out lookup_code lookup_body lookup_id
name=$(echo "$payload" | jq -r '.name // .data.name // empty')
if [[ -n "$name" ]]; then
info "HTTP 409 without id; attempting lookup by name: \"$name\""
lookup_out=$(curl -s -w "\n%{http_code}" "$BASE_URL$path?name=$name" \
-H "Authorization: Bearer $TOKEN" \
-H "Accept: application/json")
lookup_code=$(echo "$lookup_out" | tail -n1)
lookup_body=$(echo "$lookup_out" | sed '$d')
if [[ "$lookup_code" -ge 200 && "$lookup_code" -lt 300 ]]; then
# Try a few common response shapes to find an id.
lookup_id=$(echo "$lookup_body" | jq -r '.data.id // .data[0].id // .items[0].id // .id // empty')
if [[ -n "$lookup_id" ]]; then
info "Already exists (HTTP 409) — reusing id from lookup: $lookup_id"
echo "{\"id\":\"$lookup_id\"}"
return
fi
fi
fi
die "POST $path → HTTP 409 (conflict) but no id in response or lookup: $body"

Comment on lines +35 to +126
# Override defaults via environment variables:
# API_URL, KEYCLOAK_URL, ORG, SITE_NAME, PG_NAMESPACE, PG_STATEFULSET, PG_USER, PG_DB
#

set -euo pipefail

API_URL="${API_URL:-http://localhost:8388}"
KEYCLOAK_URL="${KEYCLOAK_URL:-http://localhost:8082}"
ORG="${ORG:-test-org}"
SITE_NAME="${SITE_NAME:-local-dev-site}"

# Postgres access via kubectl exec (no local psql or port-forward needed)
PG_NAMESPACE="${PG_NAMESPACE:-postgres}"
PG_STATEFULSET="${PG_STATEFULSET:-statefulset/postgres}"
PG_USER="${PG_USER:-forge}"
PG_DB="${PG_DB:-forge}"

BASE_URL="$API_URL/v2/org/$ORG/carbide"

# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

step() { echo; echo "──── $* ────"; } >&2
ok() { echo " ✓ $*"; } >&2
info() { echo " $*"; } >&2
die() { echo; echo "ERROR: $*"; exit 1; } >&2

api_get() {
local path="$1"
local out http_code
out=$(curl -s -w "\n%{http_code}" "$BASE_URL$path" \
-H "Authorization: Bearer $TOKEN" \
-H "Accept: application/json")
http_code=$(echo "$out" | tail -n1)
body=$(echo "$out" | sed '$d')
if [[ "$http_code" -lt 200 || "$http_code" -ge 300 ]]; then
die "GET $path → HTTP $http_code: $body"
fi
echo "$body"
}

api_post() {
local path="$1"
local payload="$2"
local out http_code body existing_id
out=$(curl -s -w "\n%{http_code}" -X POST "$BASE_URL$path" \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-H "Accept: application/json" \
-d "$payload")
http_code=$(echo "$out" | tail -n1)
body=$(echo "$out" | sed '$d')
if [[ "$http_code" == "409" ]]; then
# Resource already exists; the conflict response carries the existing id
# under .data.id — reuse it so the script is idempotent across runs.
existing_id=$(echo "$body" | jq -r '.data.id // empty')
if [[ -n "$existing_id" ]]; then
info "Already exists (HTTP 409) — reusing id: $existing_id"
echo "{\"id\":\"$existing_id\"}"
return
fi
die "POST $path → HTTP 409 (conflict) but no id in response: $body"
fi
if [[ "$http_code" -lt 200 || "$http_code" -ge 300 ]]; then
die "POST $path → HTTP $http_code: $body"
fi
echo "$body"
}

# Run a SQL statement inside the in-cluster postgres pod.
db_exec() {
local sql="$1"
kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
psql -U "$PG_USER" -d "$PG_DB" -tAc "$sql"
}

# ---------------------------------------------------------------------------
# 1. Authenticate
# ---------------------------------------------------------------------------

step "Authenticating via Keycloak"

TOKEN=$(curl -sf -X POST \
"$KEYCLOAK_URL/realms/carbide-dev/protocol/openid-connect/token" \
-H "Content-Type: application/x-www-form-urlencoded" \
-d "client_id=carbide-api" \
-d "client_secret=carbide-local-secret" \
-d "grant_type=password" \
-d "username=admin@example.com" \
-d "password=adminpassword" \
| jq -r '.access_token')

Copilot AI Mar 6, 2026


The header says defaults can be overridden via environment variables, but the Keycloak username/password/client_secret are hard-coded below. If different local dev credentials are needed, users have to edit the script. Consider adding env overrides (e.g., KEYCLOAK_USER, KEYCLOAK_PASSWORD, KEYCLOAK_CLIENT_SECRET) and listing them in the "Override defaults" comment block.
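One possible shape for such overrides, following the `${VAR:-default}` pattern the script already uses for API_URL and friends. The variable names (KEYCLOAK_USER, KEYCLOAK_PASSWORD, KEYCLOAK_CLIENT_SECRET) are suggestions, not variables the script defines today; the defaults mirror the literals currently hard-coded in the curl call:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical env overrides — names are suggestions from the review comment.
KEYCLOAK_USER="${KEYCLOAK_USER:-admin@example.com}"
KEYCLOAK_PASSWORD="${KEYCLOAK_PASSWORD:-adminpassword}"
KEYCLOAK_CLIENT_SECRET="${KEYCLOAK_CLIENT_SECRET:-carbide-local-secret}"

# The token request would then reference the variables instead of literals:
#   -d "client_secret=$KEYCLOAK_CLIENT_SECRET" \
#   -d "username=$KEYCLOAK_USER" \
#   -d "password=$KEYCLOAK_PASSWORD"
echo "authenticating as $KEYCLOAK_USER"
```

These names would also need to be added to the "Override defaults via environment variables" comment block in the script header.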

@pbreton pbreton force-pushed the chore/local-carbide branch from d12e81a to 18e8f6c on March 9, 2026 18:54
@thossain-nv
Contributor

Should we update README to reflect how to use local core?

@pbreton pbreton force-pushed the chore/local-carbide branch from 18e8f6c to 6ed4b25 on March 17, 2026 20:56
@coderabbitai
Contributor

coderabbitai Bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Add optional local Carbide Core support to Makefile and deploy flows (kustomize & helm), enabling site-agent to connect to an external/local core with cert handling and ConfigMap overrides; add an end-to-end smoke test script that authenticates via Keycloak and exercises tenant, site, IP block, allocation, and VPC API flows.

Changes

  • Makefile & deploy targets — Makefile (targets: docker-build-local, kind-load, kind-redeploy, kind-reset-infra, kind-reset-kustomize, kind-reset-helm, helm-deploy-site-agent): Add LOCAL_CORE, LOCAL_CORE_HOST, LOCAL_CORE_PORT, and LOCAL_CORE_CERTS_DIR; conditionally skip building/loading/restarting carbide-rest-mock-core when LOCAL_CORE=true; update rollout/restart logic and completion logging to indicate local vs mock core.
  • Kustomize / Helm site-agent flow — kind-reset-kustomize, kind-reset-infra, helm-deploy-site-agent, kind-reset-helm: When LOCAL_CORE=true, create/update the core-grpc-client-site-agent-certs secret from local cert files (failing on missing files), resolve LOCAL_CORE_HOST to an IPv4 address via the kind control-plane, patch carbide-rest-site-agent-config with CARBIDE_ADDRESS=<resolved-ip>:<PORT>, CARBIDE_SEC_OPT=1, and SKIP_GRPC_SERVER_AUTH=true, disable Helm chart certificate bootstrapping, restart the site-agent statefulset, and wait for rollout.
  • Local core smoke test script — scripts/test-local-core.sh: Add a new strict-mode bash E2E smoke test: obtain a Keycloak token, resolve tenant and site (forcing the site to Registered via an in-cluster Postgres UPDATE if needed), use an idempotent POST helper (treating 409 as reuse), create an IP block, allocation, and VPC, and print a summary of returned IDs and VPC status.

Sequence Diagram(s)

sequenceDiagram
    participant Tester as Tester (scripts/test-local-core.sh)
    participant Keycloak as Keycloak
    participant API as Carbide API
    participant DB as PostgreSQL
    participant K8s as Kubernetes

    Tester->>Keycloak: POST /auth (credentials) -> token
    Keycloak-->>Tester: bearer token

    Tester->>API: GET /tenant/current (Authorization)
    API->>DB: SELECT tenant
    DB-->>API: tenant data
    API-->>Tester: tenant details

    Tester->>API: GET /site?name=SITE_NAME
    API->>DB: SELECT site
    DB-->>API: site id & status
    API-->>Tester: site info

    alt site.status != "Registered"
        Tester->>K8s: kubectl exec postgres -> UPDATE site SET status='Registered'
        K8s->>DB: execute SQL
        DB-->>K8s: rows affected
        Tester->>API: GET /site/:id (re-check)
        API->>DB: SELECT updated site
        DB-->>API: updated site
        API-->>Tester: confirmed Registered
    end

    Tester->>API: POST /ipblock (body)
    API->>DB: INSERT ipblock
    DB-->>API: ipblock id
    API-->>Tester: ipblock created

    Tester->>API: POST /allocation (body referencing ipblock, site, tenant)
    API->>DB: INSERT allocation
    DB-->>API: allocation id
    API-->>Tester: allocation created

    Tester->>API: POST /vpc (body)
    API->>DB: INSERT vpc
    DB-->>API: vpc id & status
    API-->>Tester: vpc created

    Tester-->>Tester: print summary of tenant/site/ipblock/allocation/vpc ids & vpc status
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped to Keycloak, fetched a token bright,
nibbled certs and patched configs late at night,
poked Postgres till the site said "Registered" clear,
spawned IPs and VPCs — the pathway's now near,
carrots for commits, a joyous debug bite! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 28.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — The title accurately describes the main change: enabling use of local carbide-api instead of mock-core, which is the primary objective across the Makefile and script modifications.
  • Description check ✅ — The description is directly related to the changeset, outlining the two main changes: updating the local stack to connect to local carbide-core and adding a test script for validation.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
Makefile (1)

510-534: Extract the LOCAL_CORE site-agent bootstrap into one shared target.

These two recipe blocks are effectively identical, including the hard-coded carbide-rest-local-control-plane lookup. Duplicating the cert checks, host resolution, ConfigMap patch, and rollout logic makes the Helm and Kustomize flows easy to drift apart. A single phony target/define block, using $(KIND_CLUSTER_NAME) for the container name, would keep this maintainable.

Also applies to: 611-634

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 510 - 534, Extract the duplicated LOCAL_CORE bootstrap
logic into a single reusable Makefile target (or a define block) and invoke it
from both places instead of duplicating the recipe; move the cert existence
checks (checking $(LOCAL_CORE_CERTS_DIR)/ca.crt, client.crt, client.key), the
kubectl secret creation for core-grpc-client-site-agent-certs, the CORE_HOST
resolution (currently using the hard-coded container name
carbide-rest-local-control-plane) and the ConfigMap patch/rollout steps into
that shared target and parameterize the container name using
$(KIND_CLUSTER_NAME) (or accept it as a target argument), then replace the
original blocks (the one shown using LOCAL_CORE and the similar block at lines
~611-634) with calls to the new phony target so both Helm and Kustomize flows
use the exact same logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Makefile`:
- Around line 284-286: The kind-apply target fails because KUSTOMIZE_OVERLAY
points to a non-existent composite overlay (deploy/kustomize/overlays/local)
after refactor; create that composite overlay or change kind-apply to apply the
per-component overlays instead. Update the Makefile so kind-apply (and
kind-deploy) either: 1) references a new deploy/kustomize/overlays/local
kustomization that uses resources: site-manager, db, workflow, api, mock-core,
site-agent, or 2) iterates over the same per-component overlays used by
kind-reset-kustomize (site-manager, db, workflow, api, mock-core, site-agent)
and runs kubectl apply -k for each. Also ensure the mock-core inclusion is gated
by LOCAL_CORE exactly like the build/load and kind-reset-kustomize flows (use
the LOCAL_CORE conditional around including or applying mock-core) so the image
build/load and apply steps remain consistent.

In `@scripts/test-local-core.sh`:
- Around line 88-97: The current api_post() handling that treats HTTP 409 as
success makes the smoke test idempotent but allows reruns to skip exercising the
create path; fix by either (A) making resource names unique per run (e.g.,
append a timestamp or UUID when building names for the three POST calls) so POST
always attempts a real create, or (B) change api_post() so a 409 is reported as
"skipped/not-validated" instead of success: detect 409 in the function that
currently extracts existing_id and return a distinct non-success status (or
print a skip message and return a non-zero or special code) so the caller can
treat it as skipped and not declare the smoke test passed; update the logic that
currently prints "Smoke test passed" to fail the run or mark it skipped when
api_post() reports a 409. Ensure you update both the /vpc POST path and the
other POST calls referenced in the same script block (the other POST ranges
noted).
- Around line 173-180: The pipeline that assigns UPDATE_TAG is brittle under set
-euo pipefail because grep returning non-zero will abort; instead capture the
full kubectl exec output into a temporary variable (e.g. UPDATE_OUT) by running
the psql command and redirecting 2>&1, then run the grep on that captured output
(or run grep with a non-fatal fallback) to populate UPDATE_TAG, and finally use
UPDATED_ROWS extraction and existing die checks; update references to UPDATE_TAG
and UPDATED_ROWS accordingly so the script doesn't exit before your explicit
error handling.

---

Nitpick comments:
In `@Makefile`:
- Around line 510-534: Extract the duplicated LOCAL_CORE bootstrap logic into a
single reusable Makefile target (or a define block) and invoke it from both
places instead of duplicating the recipe; move the cert existence checks
(checking $(LOCAL_CORE_CERTS_DIR)/ca.crt, client.crt, client.key), the kubectl
secret creation for core-grpc-client-site-agent-certs, the CORE_HOST resolution
(currently using the hard-coded container name carbide-rest-local-control-plane)
and the ConfigMap patch/rollout steps into that shared target and parameterize
the container name using $(KIND_CLUSTER_NAME) (or accept it as a target
argument), then replace the original blocks (the one shown using LOCAL_CORE and
the similar block at lines ~611-634) with calls to the new phony target so both
Helm and Kustomize flows use the exact same logic.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7bc2f5a3-439c-4199-be1a-7091a86bdfe1

📥 Commits

Reviewing files that changed from the base of the PR and between bc56e32 and 6ed4b25.

📒 Files selected for processing (2)
  • Makefile
  • scripts/test-local-core.sh

Comment thread scripts/test-local-core.sh
Comment on lines +88 to +97
if [[ "$http_code" == "409" ]]; then
# Resource already exists; the conflict response carries the existing id
# under .data.id — reuse it so the script is idempotent across runs.
existing_id=$(echo "$body" | jq -r '.data.id // empty')
if [[ -n "$existing_id" ]]; then
info "Already exists (HTTP 409) — reusing id: $existing_id"
echo "{\"id\":\"$existing_id\"}"
return
fi
die "POST $path → HTTP 409 (conflict) but no id in response: $body"
Contributor


⚠️ Potential issue | 🟠 Major

409 reuse makes reruns stop testing the create path this smoke test is meant to validate.

api_post() treats a conflict as success, and the three POSTs all use fixed names/prefixes. On a rerun, the /vpc call can 409, skip the actual create, and the script still prints "Smoke test passed" even though the carbide-core path was never exercised again. Please either generate per-run resource names or surface 409 as a skipped/not-validated result instead of a pass.

Also applies to: 197-251

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-local-core.sh` around lines 88 - 97, The current api_post()
handling that treats HTTP 409 as success makes the smoke test idempotent but
allows reruns to skip exercising the create path; fix by either (A) making
resource names unique per run (e.g., append a timestamp or UUID when building
names for the three POST calls) so POST always attempts a real create, or (B)
change api_post() so a 409 is reported as "skipped/not-validated" instead of
success: detect 409 in the function that currently extracts existing_id and
return a distinct non-success status (or print a skip message and return a
non-zero or special code) so the caller can treat it as skipped and not declare
the smoke test passed; update the logic that currently prints "Smoke test
passed" to fail the run or mark it skipped when api_post() reports a 409. Ensure
you update both the /vpc POST path and the other POST calls referenced in the
same script block (the other POST ranges noted).

Comment on lines +173 to +180
UPDATE_TAG=$(kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
psql -U "$PG_USER" -d "$PG_DB" -c \
"UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';" \
2>&1 | grep -E '^UPDATE')

[[ -z "$UPDATE_TAG" ]] && die "DB update produced no UPDATE tag — check site ID and postgres connectivity"
UPDATED_ROWS=$(echo "$UPDATE_TAG" | awk '{print $2}')
[[ "$UPDATED_ROWS" -eq 0 ]] && die "DB update matched 0 rows — site id '$SITE_ID' not found in table"
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the script and look for the set directive and the problematic code
head -20 scripts/test-local-core.sh

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 853


🏁 Script executed:

# Also look at the specific lines mentioned in the review
sed -n '170,185p' scripts/test-local-core.sh

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 888


🏁 Script executed:

# Check if set -euo pipefail is present in the script
grep -n "set -" scripts/test-local-core.sh

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 96


🏁 Script executed:

# Test grep behavior when pattern not found
bash -c "echo 'some output' | grep -E '^UPDATE'; echo \"Exit code: \$?\""

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 88


🏁 Script executed:

# Test the full scenario with pipefail
bash -c "set -euo pipefail; UPDATE_TAG=\$(echo 'some output' | grep -E '^UPDATE'); echo \"UPDATE_TAG=\$UPDATE_TAG\""

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 58


🏁 Script executed:

# Test what happens with a failing grep in pipefail mode
bash -c "set -euo pipefail; UPDATE_TAG=\$(echo 'some output' | grep -E '^UPDATE' || echo 'not found'); echo \"UPDATE_TAG=\$UPDATE_TAG\""

Repository: NVIDIA/ncx-infra-controller-rest

Length of output: 96


The DB-update error handling is unreachable under set -euo pipefail.

The script has set -euo pipefail enabled at line 39. When grep -E '^UPDATE' finds no match, it returns exit code 1. With pipefail active, this causes the entire pipeline assignment to fail, and the script exits immediately before reaching the error check on line 178. Capture the kubectl exec output first and make the grep non-fatal before checking UPDATE_TAG.

Suggested fix
-    UPDATE_TAG=$(kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
-        psql -U "$PG_USER" -d "$PG_DB" -c \
-        "UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';" \
-        2>&1 | grep -E '^UPDATE')
-
-    [[ -z "$UPDATE_TAG" ]] && die "DB update produced no UPDATE tag — check site ID and postgres connectivity"
+    UPDATE_OUTPUT=$(kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
+        psql -U "$PG_USER" -d "$PG_DB" -c \
+        "UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';" \
+        2>&1) || die "DB update failed — check site ID and postgres connectivity"
+    UPDATE_TAG=$(printf '%s\n' "$UPDATE_OUTPUT" | grep -E '^UPDATE' || true)
+
+    [[ -z "$UPDATE_TAG" ]] && die "DB update produced no UPDATE tag — check site ID and postgres connectivity"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-local-core.sh` around lines 173 - 180, The pipeline that assigns
UPDATE_TAG is brittle under set -euo pipefail because grep returning non-zero
will abort; instead capture the full kubectl exec output into a temporary
variable (e.g. UPDATE_OUT) by running the psql command and redirecting 2>&1,
then run the grep on that captured output (or run grep with a non-fatal
fallback) to populate UPDATE_TAG, and finally use UPDATED_ROWS extraction and
existing die checks; update references to UPDATE_TAG and UPDATED_ROWS
accordingly so the script doesn't exit before your explicit error handling.

@pbreton pbreton force-pushed the chore/local-carbide branch from 6ed4b25 to 411c2dc on March 19, 2026 21:00
@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
Makefile (1)

510-534: Extract duplicated LOCAL_CORE setup logic into a reusable target.

The LOCAL_CORE secret creation, IPv4 resolution, ConfigMap patching, and restart logic is duplicated verbatim (~25 lines) between kind-reset-kustomize and helm-deploy-site-agent. This increases maintenance burden and risk of divergence.

♻️ Suggested approach: Extract to a shared target

Create a new target that both can invoke:

# Configure site-agent to connect to local carbide-api (called when LOCAL_CORE=true)
_setup-local-core-site-agent:
	@echo "Creating core-grpc-client-site-agent-certs secret from $(LOCAL_CORE_CERTS_DIR)..."
	@for f in ca.crt client.crt client.key; do \
		[ -f "$(LOCAL_CORE_CERTS_DIR)/$$f" ] || { echo "ERROR: $$f not found in $(LOCAL_CORE_CERTS_DIR). Run gen-certs.sh in bare-metal-manager-core first."; exit 1; } ; \
	done
	kubectl -n carbide-rest create secret generic core-grpc-client-site-agent-certs \
		--from-file=ca.crt="$(LOCAL_CORE_CERTS_DIR)/ca.crt" \
		--from-file=tls.crt="$(LOCAL_CORE_CERTS_DIR)/client.crt" \
		--from-file=tls.key="$(LOCAL_CORE_CERTS_DIR)/client.key" \
		--dry-run=client -o yaml | kubectl apply -f -
	@CORE_HOST="$(LOCAL_CORE_HOST)" ; \
	IPV4=$$(docker exec $$(docker ps -qf name=carbide-rest-local-control-plane) \
		getent ahosts "$$CORE_HOST" 2>/dev/null \
		| awk '/STREAM/{print $$1}' | grep -v ':' | head -1) ; \
	if [ -n "$$IPV4" ]; then \
		echo "Resolved $$CORE_HOST → $$IPV4 (using IPv4 to avoid Go IPv6-first dialer issue)" ; \
		CORE_HOST="$$IPV4" ; \
	fi ; \
	echo "Patching site-agent ConfigMap: CARBIDE_ADDRESS=$$CORE_HOST:$(LOCAL_CORE_PORT), CARBIDE_SEC_OPT=1 (ServerTLS)..." ; \
	kubectl -n carbide-rest patch configmap carbide-rest-site-agent-config --type merge \
		-p "{\"data\":{\"CARBIDE_ADDRESS\":\"$$CORE_HOST:$(LOCAL_CORE_PORT)\",\"CARBIDE_SEC_OPT\":\"1\"}}"
	@echo "Restarting site-agent to pick up new CARBIDE_ADDRESS and certs..."
	kubectl -n carbide-rest rollout restart statefulset/carbide-rest-site-agent
	kubectl -n carbide-rest rollout status statefulset/carbide-rest-site-agent --timeout=120s

Then in both targets:

	@if [ "$(LOCAL_CORE)" = "true" ]; then \
		$(MAKE) _setup-local-core-site-agent ; \
	fi

Also applies to: 611-634

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 510 - 534, Duplicate LOCAL_CORE setup logic (secret
creation, IPv4 resolution, ConfigMap patch, and restart) should be extracted
into a single reusable make target (e.g., _setup-local-core-site-agent) and
invoked from both kind-reset-kustomize and helm-deploy-site-agent when
LOCAL_CORE=true; create the new target containing the existing sequence (keeping
symbols like LOCAL_CORE_CERTS_DIR, LOCAL_CORE_HOST, LOCAL_CORE_PORT, and kubectl
commands intact), then replace the duplicated inline if-blocks in
kind-reset-kustomize and helm-deploy-site-agent with a conditional that calls
$(MAKE) _setup-local-core-site-agent so both targets share the same
implementation.
scripts/test-local-core.sh (1)

106-110: Remove unused db_exec helper function.

The db_exec function is defined but never called—the DB update at lines 173-176 uses kubectl exec directly with different flags (-c instead of -tAc). Either use this helper or remove it.

♻️ Option 1: Remove unused function
-# Run a SQL statement inside the in-cluster postgres pod.
-db_exec() {
-    local sql="$1"
-    kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
-        psql -U "$PG_USER" -d "$PG_DB" -tAc "$sql"
-}
-
♻️ Option 2: Use the helper (with adjusted flags)
 # Run a SQL statement inside the in-cluster postgres pod.
 db_exec() {
     local sql="$1"
     kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
-        psql -U "$PG_USER" -d "$PG_DB" -tAc "$sql"
+        psql -U "$PG_USER" -d "$PG_DB" -c "$sql" 2>&1
 }

Then use it at line 173:

-    UPDATE_TAG=$(kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
-        psql -U "$PG_USER" -d "$PG_DB" -c \
-        "UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';" \
-        2>&1 | grep -E '^UPDATE')
+    UPDATE_OUTPUT=$(db_exec "UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';") \
+        || die "DB update failed — check site ID and postgres connectivity"
+    UPDATE_TAG=$(printf '%s\n' "$UPDATE_OUTPUT" | grep -E '^UPDATE' || true)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-local-core.sh` around lines 106 - 110, The db_exec helper
(function db_exec) is defined but unused—remove the entire db_exec function
definition to avoid dead code; alternatively, if you prefer to keep it, update
its kubectl flags to accept/include the container flag (e.g., add -c
"$PG_CONTAINER" or a parameter for container) and replace the direct kubectl
exec invocation currently used for the DB update with a call to db_exec so the
helper is actually used.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d5da212-9fc6-46a7-9bdc-3022e96f8784

📥 Commits

Reviewing files that changed from the base of the PR and between 6ed4b25 and 411c2dc.

📒 Files selected for processing (2)
  • Makefile
  • scripts/test-local-core.sh

@pbreton pbreton force-pushed the chore/local-carbide branch from 411c2dc to e25cd32 on March 25, 2026 01:15
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
Makefile (1)

316-317: ⚠️ Potential issue | 🔴 Critical

deploy/kustomize/overlays/local does not exist and must be created or the kind-apply target will fail.

The Makefile defines KUSTOMIZE_OVERLAY := deploy/kustomize/overlays/local (line 258), but this directory is missing. The repository uses per-component overlays (api, cert-manager, db, mock-core, site-agent, site-manager, workflow) instead of a single local overlay. Either create a deploy/kustomize/overlays/local/kustomization.yaml that references all components, or update the kind-apply target to deploy each overlay individually using the pattern shown in the deployment documentation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 316 - 317, The kind-apply Makefile target references
KUSTOMIZE_OVERLAY (set to deploy/kustomize/overlays/local) which doesn't exist;
either create deploy/kustomize/overlays/local/kustomization.yaml that aggregates
the per-component overlays (api, cert-manager, db, mock-core, site-agent,
site-manager, workflow) or change the kind-apply target to iterate/apply each
overlay individually (e.g., kubectl apply -k
deploy/kustomize/overlays/<component>) so the Makefile uses the existing
per-component overlays instead of the missing local overlay.
🧹 Nitpick comments (2)
Makefile (1)

611-634: Consider extracting duplicated LOCAL_CORE setup logic into a shared script.

The secret creation, IPv4 resolution, ConfigMap patching, and site-agent restart logic at lines 611-634 is nearly identical to lines 510-534. Extracting this into a shared script (e.g., scripts/setup-local-core.sh) would reduce maintenance burden and ensure consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 611 - 634, Duplicate LOCAL_CORE setup logic (secret
creation, IPv4 resolution, ConfigMap patch, and site-agent restart) should be
extracted into a reusable script; create a script (e.g.,
scripts/setup-local-core.sh) that accepts LOCAL_CORE_CERTS_DIR, LOCAL_CORE_HOST,
LOCAL_CORE_PORT and performs the checks for ca.crt/client.crt/client.key,
kubectl secret creation for core-grpc-client-site-agent-certs, the IPv4
resolution logic that sets CORE_HOST/IPV4, kubectl patch of configmap
carbide-rest-site-agent-config (CARBIDE_ADDRESS/CARBIDE_SEC_OPT), and rollout
restart of statefulset/carbide-rest-site-agent, then replace both Makefile
blocks (the sections guarded by if [ "$(LOCAL_CORE)" = "true" ]) with calls to
that script forwarding the same variables.
scripts/test-local-core.sh (1)

17-37: Consider adding README documentation for LOCAL_CORE usage.

Per the PR comment from thossain-nv, it would be helpful to document how to use LOCAL_CORE=true in the README, including:

  • Prerequisites (running carbide-api locally, generating certs)
  • Example command: make kind-reset LOCAL_CORE=true
  • How to run this smoke test: ./scripts/test-local-core.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-local-core.sh` around lines 17 - 37, Add a short README section
documenting LOCAL_CORE usage: describe prerequisites (running carbide-api
locally, generating TLS certs), list relevant environment overrides (API_URL,
KEYCLOAK_URL, ORG, SITE_NAME, PG_NAMESPACE, PG_STATEFULSET, PG_USER, PG_DB),
show the example command to reset kind with LOCAL_CORE enabled (make kind-reset
LOCAL_CORE=true) and show how to run the integration smoke test
(./scripts/test-local-core.sh); place this under the repo README or CONTRIBUTING
and reference the scripts/test-local-core.sh script and the LOCAL_CORE env var
so users can follow the steps end-to-end.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a276b671-e0a4-4af5-bf3f-3e7150217b99

📥 Commits

Reviewing files that changed from the base of the PR and between 411c2dc and e25cd32.

📒 Files selected for processing (2)
  • Makefile
  • scripts/test-local-core.sh

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
Makefile (2)

499-503: Redundant mock-core apply.

This block duplicates the mock-core setup already performed in kind-reset-infra (lines 451-455). Since kind-reset-kustomize depends on kind-reset-infra, this apply is redundant (though harmless due to idempotency).

♻️ Suggested removal
-	@if [ "$(LOCAL_CORE)" != "true" ]; then \
-		echo "Setting up Carbide Mock Core..." ; \
-		kubectl apply -k deploy/kustomize/overlays/mock-core ; \
-		kubectl -n carbide-rest rollout status deployment/carbide-rest-mock-core --timeout=240s ; \
-	fi
-
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 499 - 503, Remove the redundant mock-core setup in the
conditional that checks LOCAL_CORE by deleting the kubectl apply and rollout
status commands (the "kubectl apply -k deploy/kustomize/overlays/mock-core" and
"kubectl -n carbide-rest rollout status deployment/carbide-rest-mock-core
--timeout=240s" lines) inside the `@if` [ "$(LOCAL_CORE)" != "true" ]; then ... fi
block in the Makefile; this setup is already performed by the kind-reset-infra
target, so keep the LOCAL_CORE guard but remove the duplicated apply/rollout
commands to avoid redundancy.

512-536: Extract duplicated local-core setup into a reusable target or script.

This shell block is duplicated nearly verbatim in helm-deploy-site-agent (lines 615-638). Consider extracting it into a dedicated make target (e.g., setup-local-core-certs) or a helper script to improve maintainability.

Additionally, line 523 hardcodes the control-plane node name. Use the KIND_CLUSTER_NAME variable for consistency:

♻️ Use variable for cluster name
 		CORE_HOST="$(LOCAL_CORE_HOST)" ; \
-		IPV4=$$(docker exec $$(docker ps -qf name=carbide-rest-local-control-plane) \
+		IPV4=$$(docker exec $$(docker ps -qf name=$(KIND_CLUSTER_NAME)-control-plane) \
 			getent ahosts "$$CORE_HOST" 2>/dev/null \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 512 - 536, The duplicated LOCAL_CORE setup block
(guarded by LOCAL_CORE and using LOCAL_CORE_CERTS_DIR, LOCAL_CORE_HOST,
LOCAL_CORE_PORT) should be extracted into a reusable Makefile target or helper
script (e.g., setup-local-core-certs) and both places (current block and
helm-deploy-site-agent) should call that target to remove duplication; implement
the new target to perform the file checks, kubectl secret creation, IPv4
resolution, ConfigMap patching, and rollout restart/status, and replace the
hardcoded docker filter name "carbide-rest-local-control-plane" with a reference
using KIND_CLUSTER_NAME (e.g., "$$(docker ps -qf
name=$${KIND_CLUSTER_NAME}-control-plane)") so the cluster name variable is used
consistently.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e004a818-fb00-44e2-838c-a9eee45cc9b9

📥 Commits

Reviewing files that changed from the base of the PR and between e25cd32 and b8163c1.

📒 Files selected for processing (1)
  • Makefile

@pbreton pbreton force-pushed the chore/local-carbide branch from b8163c1 to baa3c8c on March 25, 2026 17:29
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (3)
scripts/test-local-core.sh (2)

172-180: ⚠️ Potential issue | 🟡 Minor

Make the grep non-fatal before the explicit DB-update checks.

With set -euo pipefail on Line 39, a missing ^UPDATE match exits at the assignment on Line 173, so the die checks on Lines 178-180 never run. Capture the full kubectl exec output first, then parse it with grep ... || true.

Suggested fix
-    UPDATE_TAG=$(kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
-        psql -U "$PG_USER" -d "$PG_DB" -c \
-        "UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';" \
-        2>&1 | grep -E '^UPDATE')
+    UPDATE_OUTPUT=$(kubectl exec -n "$PG_NAMESPACE" "$PG_STATEFULSET" -- \
+        psql -U "$PG_USER" -d "$PG_DB" -c \
+        "UPDATE site SET status = 'Registered', updated = NOW() WHERE id = '$SITE_ID';" \
+        2>&1) || die "DB update failed — check site ID and postgres connectivity"
+    UPDATE_TAG=$(printf '%s\n' "$UPDATE_OUTPUT" | grep -E '^UPDATE' || true)
#!/bin/bash
set -euo pipefail

sed -n '39p;172,180p' scripts/test-local-core.sh
echo
bash -o errexit -o nounset -o pipefail -c \
  'UPDATE_TAG=$(printf "no update tag here\n" | grep -E "^UPDATE"); echo "unreachable"'

Expected: the second command never prints unreachable, confirming the current assignment exits before the explicit error handling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-local-core.sh` around lines 172 - 180, The assignment that pipes
kubectl exec through grep is currently fatal under set -euo pipefail; instead
capture the full kubectl exec output into a variable (e.g., UPDATE_OUTPUT by
running the existing kubectl exec ... 2>&1 and storing its output), then parse
that output with grep into UPDATE_TAG using grep -E '^UPDATE' || true so the
grep failure is non-fatal; keep the existing die checks that test [[ -z
"$UPDATE_TAG" ]] and parse UPDATED_ROWS from UPDATE_TAG as before. Ensure you
reference the same kubectl exec invocation, UPDATE_TAG and UPDATED_ROWS symbols
when making the change.

88-97: ⚠️ Potential issue | 🟠 Major

Don't count HTTP 409 as a validated create-path run.

Line 88 turns conflicts into success, and the POST payloads below reuse fixed resource values. On reruns this can still end in "Smoke test passed" even though the create path into carbide-core was never exercised again. Either make 409 a skipped/not-validated outcome, or generate per-run resource values that cannot collide.

Also applies to: 197-206, 219-233, 246-251, 265-273

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/test-local-core.sh` around lines 88 - 97, The current HTTP 409
handling in the if block that checks http_code and sets existing_id (the branch
that calls info and echoes the reused id) treats conflicts as successful
validated creates; change this so 409 is not counted as a validated create:
either (A) mark the operation as skipped/failure by replacing the success
echo/info path with a skip log and a non-success return (or call die) so the
smoke test does not consider the create exercised, or (B) make the POST payloads
unique per run by appending a per-run random/timestamp/UUID suffix to the
resource identifiers used where payloads are generated so collisions cannot
occur; apply the same change to the other conflict-handling instances (the
similar blocks at the other ranges referenced) so 409 no longer masquerades as a
validated create.
Makefile (1)

284-286: ⚠️ Potential issue | 🟠 Major

Keep kind-deploy consistent with LOCAL_CORE=true.

These lines skip building the mock-core image, but kind-deploy still ends in kind-apply, which unconditionally applies $(KUSTOMIZE_OVERLAY) on Line 317. If that overlay still includes mock-core—or is still the missing composite overlay—make kind-deploy LOCAL_CORE=true no longer matches the reset flows and can fail or deploy stale resources.

#!/bin/bash
set -euo pipefail

sed -n '279,317p' Makefile
echo

if [ -f deploy/kustomize/overlays/local/kustomization.yaml ]; then
  echo "--- deploy/kustomize/overlays/local/kustomization.yaml ---"
  sed -n '1,200p' deploy/kustomize/overlays/local/kustomization.yaml
elif [ -f deploy/kustomize/overlays/local/kustomization.yml ]; then
  echo "--- deploy/kustomize/overlays/local/kustomization.yml ---"
  sed -n '1,200p' deploy/kustomize/overlays/local/kustomization.yml
else
  echo "deploy/kustomize/overlays/local is missing"
fi

Expected: either the overlay is missing, or its resources show whether mock-core is still applied by kind-apply.

Also applies to: 300-317

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 284 - 286, The kind-deploy flow is inconsistent when
LOCAL_CORE=true: the Docker build for carbide-rest-mock-core is skipped (see the
conditional around the docker build) but kind-apply still unconditionally
applies $(KUSTOMIZE_OVERLAY), which may include the mock-core resources; update
the Makefile so kind-deploy/kind-apply respects LOCAL_CORE by conditionally
excluding or substituting the overlay that contains mock-core (or by adding a
conditional around the apply step), e.g., gate the application of
$(KUSTOMIZE_OVERLAY) or switch to a different overlay when LOCAL_CORE=true so
that mock-core resources are not applied when the image isn't built. Ensure
changes reference the existing variables/targets: LOCAL_CORE, KUSTOMIZE_OVERLAY,
and the kind-apply / kind-deploy targets.
🧹 Nitpick comments (1)
Makefile (1)

512-536: Extract the LOCAL_CORE site-agent bootstrap into one reusable recipe.

This secret creation, host resolution, ConfigMap patching, and rollout logic is duplicated in both deployment paths. Any later fix here now has to land twice, which makes drift likely.

Also applies to: 615-638

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 512 - 536, Extract the duplicated LOCAL_CORE bootstrap
block into a single reusable Makefile recipe (e.g., a phony target like
local-core-site-agent-bootstrap) and replace both inline copies with invocations
of that recipe; move the logic that checks LOCAL_CORE, validates files in
$(LOCAL_CORE_CERTS_DIR), creates the core-grpc-client-site-agent-certs secret,
resolves CORE_HOST/IPV4, patches the carbide-rest-site-agent-config ConfigMap
and performs kubectl rollout restart/status into the new recipe, parameterize it
via the existing variables (LOCAL_CORE, LOCAL_CORE_CERTS_DIR, LOCAL_CORE_HOST,
LOCAL_CORE_PORT) and ensure the original callers (the blocks at the current
location and the duplicate at lines 615-638) simply call the new target to avoid
code drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@Makefile`:
- Around line 607-612: In the Makefile target helm-deploy-site-agent, remove the
trailing "|| true" from the helm upgrade --install carbide-rest-site-agent ...
--timeout 1m command so that template rendering, validation, and hook errors are
not suppressed; optionally consider adding --wait and increasing the --timeout
value if you want Helm to wait for resources to become ready, but do not swallow
failures by appending "|| true".

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a154a0b2-eb5d-4cf5-9010-0bd9556cd8f6

📥 Commits

Reviewing files that changed from the base of the PR and between b8163c1 and baa3c8c.

📒 Files selected for processing (2)
  • Makefile
  • scripts/test-local-core.sh

Comment thread Makefile
Comment on lines +523 to +529
IPV4=$$(docker exec $$(docker ps -qf name=carbide-rest-local-control-plane) \
	getent ahosts "$$CORE_HOST" 2>/dev/null \
	| awk '/STREAM/{print $$1}' | grep -v ':' | head -1) ; \
if [ -n "$$IPV4" ]; then \
	echo "Resolved $$CORE_HOST → $$IPV4 (using IPv4 to avoid Go IPv6-first dialer issue)" ; \
	CORE_HOST="$$IPV4" ; \
fi ; \

⚠️ Potential issue | 🟡 Minor

Derive the control-plane container from KIND_CLUSTER_NAME.

Both LOCAL_CORE flows hardcode carbide-rest-local-control-plane instead of using the existing $(KIND_CLUSTER_NAME) variable. make KIND_CLUSTER_NAME=... ... LOCAL_CORE=true will resolve the wrong node and silently skip the IPv4 fallback.

Suggested fix
-		IPV4=$$(docker exec $$(docker ps -qf name=carbide-rest-local-control-plane) \
+		IPV4=$$(docker exec $$(docker ps -qf name=$(KIND_CLUSTER_NAME)-control-plane) \
 			getent ahosts "$$CORE_HOST" 2>/dev/null \
 			| awk '/STREAM/{print $$1}' | grep -v ':' | head -1) ; \

Also applies to: 626-632

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 523 - 529, The Makefile logic that computes IPV4 is
hardcoded to the container name "carbide-rest-local-control-plane" instead of
deriving it from the existing KIND_CLUSTER_NAME, so running with make
KIND_CLUSTER_NAME=... LOCAL_CORE=true can target the wrong node; update the
docker exec invocation and any other places that reference
"carbide-rest-local-control-plane" (also at the later block around the same
logic) to compute the container name from $(KIND_CLUSTER_NAME) (e.g., use the
same templated container name construction you use elsewhere) so CORE_HOST
resolution uses the correct container, and ensure the variable CORE_HOST
assignment and echo message still reflect the substituted value.
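The suggested derivation is a one-liner in shell; sketched here with the default cluster name implied by the hardcoded container name (an assumption, not verified against the Makefile):

```shell
# Build the kind node container name from the cluster name rather than
# hardcoding it; "carbide-rest-local" is assumed to be the default here.
KIND_CLUSTER_NAME="${KIND_CLUSTER_NAME:-carbide-rest-local}"
CONTROL_PLANE_CONTAINER="${KIND_CLUSTER_NAME}-control-plane"
echo "resolving CORE_HOST inside ${CONTROL_PLANE_CONTAINER}"
```

With this in place, `make KIND_CLUSTER_NAME=foo LOCAL_CORE=true ...` targets `foo-control-plane` instead of silently falling back to the wrong node.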

Comment thread Makefile
Comment on lines 607 to +612
```diff
 helm-deploy-site-agent:
 	@echo "Installing site-agent chart (will CrashLoop until bootstrapped)..."
 	helm upgrade --install carbide-rest-site-agent $(SITE_AGENT_CHART)/ \
-		--namespace carbide-rest $(HELM_SET) --timeout 1m || true
+		--namespace carbide-rest $(HELM_SET) \
+		$(if $(filter true,$(LOCAL_CORE)),--set certificate.enabled=false) \
+		--timeout 1m || true
```

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

For the Helm CLI, does `helm upgrade --install` fail when a workload enters CrashLoopBackOff if `--wait` is not set, and does `--timeout` affect upgrade/install behavior without `--wait` or `--atomic`?

💡 Result:

No, helm upgrade --install does not fail when a workload enters CrashLoopBackOff if --wait is not set. Yes, --timeout affects upgrade/install behavior without --wait or --atomic for individual Kubernetes operations like hooks.


🌐 Web query:

helm upgrade --install failure modes errors without --wait template rendering validation

💡 Result:

`helm upgrade --install` fails without `--wait` primarily due to post-install hook failures, especially when hooks depend on webhooks or resources that aren't immediately ready. Template rendering and validation errors cause failures before hooks run. Key failure modes:

1. Post-install hook failures: without `--wait`, Helm doesn't wait for main resources (Deployments, etc.) to be ready before running post-install hooks. Hooks such as Jobs or webhooks fail if dependencies (e.g., webhook services) aren't available, causing `UPGRADE FAILED: failed post-install`. Adding `--wait` makes Helm wait for resources to be ready before running hooks.
2. Template rendering errors: invalid YAML, nil pointer errors (missing values), type mismatches, undefined functions. Use `helm template --debug` and `helm lint --strict` to validate before upgrading.
3. Kubernetes validation errors: rendered manifests fail schema validation, immutable field changes, unknown fields. Validate with `helm template | kubeval --strict`.
4. Timeout errors without `--wait`: the default 5m timeout is hit if resources are slow to become ready; the command fails even if it would eventually succeed.
5. Resource conflicts: stuck failed releases from prior attempts block upgrades.

Without `--wait`, the upgrade proceeds immediately after posting resources but is marked failed if hooks fail afterward. Official docs recommend `--wait` for production to ensure readiness. Debug with `helm upgrade --dry-run --debug` and `helm template --debug`, and check release status with `helm status`. Prevention: validate charts (`lint`, `template`, `kubeval`), use `--wait`/`--timeout=15m`, and `--atomic` for rollback.


Remove || true to surface template and validation errors.

helm upgrade --install without --wait returns after submitting resources (so CrashLoop won't cause helm to fail), but it still fails on template rendering errors, Kubernetes validation failures, and post-install hook issues. The || true suppresses these legitimate errors and allows the bootstrap to continue with a broken release.

Suggested fix
 	helm upgrade --install carbide-rest-site-agent $(SITE_AGENT_CHART)/ \
 		--namespace carbide-rest $(HELM_SET) \
 		$(if $(filter true,$(LOCAL_CORE)),--set certificate.enabled=false) \
-		--timeout 1m || true
+		--timeout 1m
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 607 - 612, In the Makefile target
helm-deploy-site-agent, remove the trailing "|| true" from the helm upgrade
--install carbide-rest-site-agent ... --timeout 1m command so that template
rendering, validation, and hook errors are not suppressed; optionally consider
adding --wait and increasing the --timeout value if you want Helm to wait for
resources to become ready, but do not swallow failures by appending "|| true".

@pbreton pbreton force-pushed the chore/local-carbide branch 2 times, most recently from 7917324 to 330f61b on April 3, 2026 16:57
@pbreton pbreton requested a review from a team as a code owner on April 3, 2026 16:57
@pbreton pbreton force-pushed the chore/local-carbide branch 2 times, most recently from ed3a541 to ba8360e on April 9, 2026 16:30
@pbreton pbreton force-pushed the chore/local-carbide branch 3 times, most recently from 9fe2bc0 to d75f362 on April 15, 2026 23:57
@pbreton pbreton force-pushed the chore/local-carbide branch 3 times, most recently from a3de0d6 to f355215 on April 21, 2026 22:31
…test script for validation

Signed-off-by: Patrice Breton <pbreton@nvidia.com>
@pbreton pbreton force-pushed the chore/local-carbide branch from f355215 to be29eb2 on April 21, 2026 22:57
3 participants