Skip to content

feat(observability): health/readiness endpoints, atomic metrics, config#3035

Closed
rareba wants to merge 8 commits intozeroclaw-labs:masterfrom
rareba:feature/production-observability
Closed

feat(observability): health/readiness endpoints, atomic metrics, config#3035
rareba wants to merge 8 commits intozeroclaw-labs:masterfrom
rareba:feature/production-observability

Conversation

@rareba
Copy link
Copy Markdown
Contributor

@rareba rareba commented Mar 8, 2026

Summary

  • Base branch target (master for all contributions): master
  • Problem: No structured health/readiness endpoints for orchestration platforms (Kubernetes, systemd) and no lightweight metrics collection for operational visibility.
  • Why it matters: Production deployments need standard liveness/readiness probes and basic counters for monitoring and alerting.
  • What changed:
    • Add structured /health (liveness) and /ready (readiness) gateway endpoints returning JSON with status, checks, and uptime_secs
    • Add lightweight MetricsRegistry in src/observability/metrics.rs using std::sync::atomic counters with Prometheus text format export
    • Extend ObservabilityConfig with health_enabled, metrics_enabled, metrics_prefix, health_check_interval_secs (all #[serde(default)] for backward compat)
    • Wire new endpoints into gateway router, replacing the old handle_health handler
  • What did not change (scope boundary): Existing /metrics handler unchanged. No external metrics dependencies added.

Label Snapshot (required)

  • Risk label (risk: low|medium|high): medium
  • Size label (size: XS|S|M|L|XL, auto-managed/read-only): M
  • Scope labels: gateway, observability, config
  • Module labels: gateway: health, observability: metrics
  • Contributor tier label: auto-managed
  • If any auto-label is incorrect, note requested correction: N/A

Change Metadata

  • Change type (bug|feature|refactor|docs|security|chore): feature
  • Primary scope (runtime|provider|channel|memory|security|ci|docs|multi): multi (gateway + observability)

Linked Issue

  • Closes #
  • Related # N/A
  • Depends on #
  • Supersedes #

Supersede Attribution (required when Supersedes # is used)

N/A

Validation Evidence (required)

Commands and result summary:

cargo fmt --all -- --check   # pass
cargo clippy --all-targets -- -D warnings  # pass
cargo test                   # pass
  • Evidence provided (test/log/trace/screenshot/perf): 8 unit tests for MetricsRegistry (counter increment, gauge record, prometheus export format, zero counters, unknown metric noop, uptime); 3 unit tests for HealthResponse/CheckResult serialization; 49 existing observability tests continue to pass
  • If any command is intentionally skipped, explain why: N/A

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No
  • If any Yes, describe risk and mitigation: N/A

Privacy and Data Hygiene (required)

  • Data-hygiene status (pass|needs-follow-up): pass
  • Redaction/anonymization notes: Health/metrics endpoints expose operational status only; no user data.
  • Neutral wording confirmation: Confirmed — uses project-native labels throughout.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? Yes — extends [observability] config section with new fields, all with #[serde(default)]
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

i18n Follow-Through (required when docs or user-facing wording changes)

  • i18n follow-through triggered? No

Human Verification (required)

What was personally validated beyond CI:

  • Verified scenarios: 11 new unit tests for metrics registry and health endpoint serialization; 49 existing observability tests still passing
  • Edge cases checked: Zero counters, unknown metric noop, prometheus export format correctness
  • What was not verified: Live HTTP endpoint testing with curl; Kubernetes probe integration

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Gateway router (new endpoints registered, old handle_health replaced), observability module (new metrics submodule), agent loop (metrics increment calls added), config schema
  • Potential unintended effects: Old /health handler is replaced — callers expecting the old response format will get structured JSON instead.
  • Guardrails/monitoring for early detection: Health/metrics endpoints are conditionally registered based on config; existing endpoint behavior preserved when config fields absent.

Agent Collaboration Notes (recommended)

  • Agent tools used (if any): Claude Code
  • Workflow/plan summary (if any): Implemented metrics registry first, then health/readiness endpoints, then gateway wiring and config integration.
  • Verification focus: Prometheus export format correctness and backward-compatible config.
  • Confirmation: naming + architecture boundaries followed (AGENTS.md + CONTRIBUTING.md): Yes

Rollback Plan (required)

  • Fast rollback command/path: git revert <commit-sha>
  • Feature flags or config toggles (if any): health_enabled and metrics_enabled config toggles
  • Observable failure symptoms: /health or /ready returning 500; metrics export returning empty or malformed Prometheus text.

Risks and Mitigations

  • Risk: Replacing old handle_health handler may break existing monitoring integrations expecting the old format.
    • Mitigation: New format is a superset (JSON with status field); most health checkers only look at HTTP status code.
  • Risk: Atomic counter overhead on hot paths (agent loop).
    • Mitigation: Uses std::sync::atomic with relaxed ordering — negligible overhead for counter increments.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 8, 2026

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'tools', 'path_filters', 'review_instructions'
⚙️ Configuration instructions
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
📝 Walkthrough

Walkthrough

Adds observability: new ObservabilityConfig fields and defaults, liveness/readiness HTTP handlers and conditional routes, a thread-safe MetricsRegistry with Prometheus export and global init, and small telemetry increments and test updates.

Changes

Cohort / File(s) Summary
Configuration & Defaults
src/config/schema.rs
Added public fields to ObservabilityConfig: health_enabled, metrics_enabled, metrics_prefix, health_check_interval_secs; updated Default and added private helpers for defaults.
Health Endpoints & Routing
src/gateway/health.rs, src/gateway/mod.rs
New handle_liveness and handle_readiness handlers and types; router now conditionally registers /health, /ready, and /metrics based on observability config; webhook increments request metric.
Metrics Implementation
src/observability/metrics.rs, src/observability/mod.rs
New MetricsRegistry with atomic counters/gauges, uptime, Prometheus text export, global singleton (init_global / global); exported via observability::metrics.
Startup Initialization
src/main.rs
Initializes global metrics registry at startup using config.observability.metrics_prefix.
Telemetry in Agent Loop
src/agent/loop_.rs
Added metric increments for tool calls, tool errors, and provider requests.
Tests / Minor Edits
src/observability/runtime_trace.rs
Test initializer updated to use ..ObservabilityConfig::default() (test-only change).

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Router as Gateway Router
  participant Handler as Health Handler
  participant Health as Health Subsystem

  Client->>Router: GET /ready
  Router->>Handler: dispatch to handle_readiness
  Handler->>Health: crate::health::snapshot()
  Health-->>Handler: component statuses (BTreeMap)
  Handler->>Handler: map statuses -> checks, determine overall status
  Handler-->>Router: HTTP 200 (all ok) or 503 (unhealthy) + JSON body
  Router-->>Client: HTTP response
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

size: M, risk: medium, core, gateway: core, config: core, service: core, tests

Suggested reviewers

  • theonlyhennygod
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The description provides a clear summary of changes, a test plan with specific validation steps, and risk/rollback information, though it omits most required template sections. Complete the description template by adding Label Snapshot, Change Metadata, Linked Issue, Validation Evidence, Security/Privacy Impact, Compatibility, and Human Verification sections for a comprehensive review record.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title concisely summarizes the three main changes: health/readiness endpoints, atomic metrics implementation, and configuration extensions.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/gateway/health.rs (1)

155-224: Please add handler-level failure-mode tests.

The current tests only assert serialization shape. They do not exercise the 200-vs-503 boundary or verify that public responses stay redacted when components fail. A small handle_readiness test with an unhealthy snapshot would lock those semantics down.

Based on learnings: "For high-risk code changes, include at least one boundary/failure-mode validation in addition to standard checks"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gateway/health.rs` around lines 155 - 224, Add a handler-level
failure-mode test that exercises handle_readiness: construct a
HealthResponse/CheckResult map containing an unhealthy/degraded component (e.g.,
provider status "degraded" with a message), invoke the handle_readiness handler
(or the readiness HTTP endpoint) with that snapshot, and assert the HTTP
response status is 503 (not 200) and the JSON body preserves public redaction
rules (no internal details exposed—e.g., message should be omitted or redacted
as in CheckResult serialization). Use the existing HealthResponse and
CheckResult types to build the snapshot and call the same handler function used
in production so the test verifies the 200-vs-503 boundary and public response
redaction semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/config/schema.rs`:
- Around line 1903-1905: The new metrics_prefix config field is currently
unused; either thread it into the global metrics registry creation or remove the
config key. To fix, update observability::metrics (where MetricsRegistry is
initialized with the hard-coded "zeroclaw") to accept a prefix parameter (or
overload the constructor/init function) and use that prefix when creating the
MetricsRegistry instead of the literal "zeroclaw", then pass
config.observability.metrics_prefix from the place that initializes the
registry; alternatively delete the metrics_prefix field and its default helper
in schema.rs (and any tests) if you don't want to wire it up. Ensure you update
references to metrics_prefix and the MetricsRegistry initializer (e.g., the
constructor or init function in observability/metrics.rs) so the config value is
actually applied.

In `@src/gateway/health.rs`:
- Around line 50-59: The health handler currently exposes sensitive/internal
details in CheckResult.message (e.g., the pairing block created via
checks.insert(...) and similar blocks for model and last_error); change these to
return only coarse, stable values (e.g., status = "ok"/"degraded" and message =
None or a non-sensitive string like "paired" / "not-paired" / "error") and
remove/raw-redact any raw variables (paired, configured model, last_error) from
the public payload; keep the full diagnostics in server logs or an admin-only
endpoint instead; apply the same redaction to the other CheckResult inserts
noted (the blocks around the other checks at the referenced locations).
- Around line 88-95: The code is currently synthesizing healthy dependency
entries for "provider" and "memory" (creating CheckResult with status "ok" and
message from state.model) which can mask real probe failures and overwrite
snapshot results; instead, stop inserting hardcoded "provider" and "memory"
CheckResult entries — query the real probe/health registry or probe state for
those keys and only insert their CheckResult if an actual probe result exists,
otherwise omit the entry or set it to a degraded/unknown status; also ensure you
do not blindly overwrite an existing failing "memory" entry created earlier in
the snapshot loop (check for existing checks.contains_key("memory") before
inserting or insert in the snapshot processing path that respects probe
results).

---

Nitpick comments:
In `@src/gateway/health.rs`:
- Around line 155-224: Add a handler-level failure-mode test that exercises
handle_readiness: construct a HealthResponse/CheckResult map containing an
unhealthy/degraded component (e.g., provider status "degraded" with a message),
invoke the handle_readiness handler (or the readiness HTTP endpoint) with that
snapshot, and assert the HTTP response status is 503 (not 200) and the JSON body
preserves public redaction rules (no internal details exposed—e.g., message
should be omitted or redacted as in CheckResult serialization). Use the existing
HealthResponse and CheckResult types to build the snapshot and call the same
handler function used in production so the test verifies the 200-vs-503 boundary
and public response redaction semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bac5431-7ca1-49cc-83d6-524fc1e4dd41

📥 Commits

Reviewing files that changed from the base of the PR and between f7fefd4 and 3c2bab0.

📒 Files selected for processing (6)
  • src/config/schema.rs
  • src/gateway/health.rs
  • src/gateway/mod.rs
  • src/observability/metrics.rs
  • src/observability/mod.rs
  • src/observability/runtime_trace.rs

Comment thread src/config/schema.rs
Comment on lines +1903 to +1905
/// Prefix for lightweight atomic metrics (Prometheus metric names).
#[serde(default = "default_metrics_prefix")]
pub metrics_prefix: String,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

metrics_prefix is currently a no-op.

src/observability/metrics.rs:168-170 still initializes the global MetricsRegistry with a hard-coded "zeroclaw" prefix, so changing this field will never affect exported metric names. Please either thread observability.metrics_prefix into registry initialization before exposing the key, or remove the key until there is a real caller.

As per coding guidelines: "Do not add new config keys ... without a concrete accepted use case; do not introduce speculative 'future-proof' abstractions without at least one current caller."

Also applies to: 1923-1924

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/config/schema.rs` around lines 1903 - 1905, The new metrics_prefix config
field is currently unused; either thread it into the global metrics registry
creation or remove the config key. To fix, update observability::metrics (where
MetricsRegistry is initialized with the hard-coded "zeroclaw") to accept a
prefix parameter (or overload the constructor/init function) and use that prefix
when creating the MetricsRegistry instead of the literal "zeroclaw", then pass
config.observability.metrics_prefix from the place that initializes the
registry; alternatively delete the metrics_prefix field and its default helper
in schema.rs (and any tests) if you don't want to wire it up. Ensure you update
references to metrics_prefix and the MetricsRegistry initializer (e.g., the
constructor or init function in observability/metrics.rs) so the config value is
actually applied.

Comment thread src/gateway/health.rs Outdated
Comment thread src/gateway/health.rs Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/gateway/health.rs (1)

98-103: ⚠️ Potential issue | 🟠 Major

Don’t return raw dependency errors from /ready.

component.last_error is copied straight into a public readiness payload. On a public bind, that leaks backend details to unauthenticated callers. Quick rollback is to suppress message entirely here, or replace it with a coarse string and keep full diagnostics in logs/admin-only paths.

Suggested fix
         checks.insert(
             name.clone(),
             CheckResult {
                 status,
-                message: component.last_error.clone(),
+                message: component
+                    .last_error
+                    .as_ref()
+                    .map(|_| "dependency unavailable".to_string()),
             },
         );

Based on learnings: keep observability useful but non-sensitive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gateway/health.rs` around lines 98 - 103, The readiness payload is
leaking raw backend errors by copying component.last_error into
CheckResult.message for the /ready endpoint; instead remove or sanitize that
field: update the construction of CheckResult (where checks.insert(name.clone(),
CheckResult { status, message: component.last_error.clone(), }) is created) to
set message to None or a coarse, non-sensitive string like "unavailable" and log
component.last_error with detail to the service logger or an admin-only
diagnostics path so full error details are not exposed in public readiness
responses.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/gateway/health.rs`:
- Around line 70-79: The top-level health status currently collapses every
non-OK check to "unhealthy"; change the aggregation logic (the code that builds
the top-level overall/status after checks are inserted — e.g., where you inspect
checks created from snapshot.components and CheckResult) so it computes overall
as: "unhealthy" if any check.status == "unhealthy", else "degraded" if any
check.status == "degraded", else "ok". Keep returning HTTP 503 for degraded but
ensure the JSON top-level status string is "degraded" for empty registry /
starting-only snapshots. Update the health computation code paths that touch
snapshot.components, checks.insert, and the overall/status assignment (also the
section covering lines ~82-112) and add tests that assert empty-snapshot and
starting-only snapshots produce top-level "degraded" (with 503) rather than
"unhealthy".

---

Duplicate comments:
In `@src/gateway/health.rs`:
- Around line 98-103: The readiness payload is leaking raw backend errors by
copying component.last_error into CheckResult.message for the /ready endpoint;
instead remove or sanitize that field: update the construction of CheckResult
(where checks.insert(name.clone(), CheckResult { status, message:
component.last_error.clone(), }) is created) to set message to None or a coarse,
non-sensitive string like "unavailable" and log component.last_error with detail
to the service logger or an admin-only diagnostics path so full error details
are not exposed in public readiness responses.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: adfc5009-4444-4335-8450-70c21d77f015

📥 Commits

Reviewing files that changed from the base of the PR and between 3c2bab0 and 5da13ee.

📒 Files selected for processing (2)
  • src/gateway/health.rs
  • src/main.rs

Comment thread src/gateway/health.rs
Copy link
Copy Markdown

@rikitrader rikitrader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Health/readiness endpoints wired into gateway, atomic metrics, backward-compatible config additions. Good observability foundation.

@rareba
Copy link
Copy Markdown
Contributor Author

rareba commented Mar 9, 2026

Thanks for the review feedback. I've addressed the integration concerns:

Changes pushed:

  • Health (/health, /ready) and metrics (/metrics) routes are now conditionally registered based on config.observability.health_enabled and config.observability.metrics_enabled.
  • Added metric increment calls in actual runtime code paths:
    • requests_total in gateway webhook handler
    • tool_calls_total / tool_errors_total in tool execution path
    • provider_requests_total before provider chat calls
  • Added test verifying runtime metric increments and Prometheus export.

All 9 metrics tests pass. cargo check clean.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
src/gateway/health.rs (1)

99-104: ⚠️ Potential issue | 🟠 Major

Redact raw dependency errors from /ready.

component.last_error.clone() is returned to every caller of /ready. On a public bind, that leaks internal dependency details and upstream error text. Keep the public payload coarse and move the raw diagnostics to logs or an authenticated admin endpoint.

🔒 Suggested redaction
         checks.insert(
             name.clone(),
             CheckResult {
                 status,
-                message: component.last_error.clone(),
+                message: component
+                    .last_error
+                    .as_ref()
+                    .map(|_| "dependency unavailable".to_string()),
             },
         );

Based on learnings: keep observability useful but non-sensitive.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gateway/health.rs` around lines 99 - 104, The /ready handler currently
exposes raw dependency errors by inserting component.last_error.clone() into
CheckResult via checks.insert(name.clone(), CheckResult { status, message:
component.last_error.clone(), }), which can leak sensitive upstream details;
change this to populate CheckResult.message with a coarse, non-sensitive string
(e.g., "unhealthy" or component.status_summary()) and log the full
component.last_error via the existing logger (or store it behind an
authenticated admin endpoint) so diagnostics remain available but are not
returned in public /ready responses.
🧹 Nitpick comments (1)
src/observability/metrics.rs (1)

68-88: Consider deferring unused gauge metrics.

The gauge metrics (request_duration_ms, provider_latency_ms, active_conversations) are registered but none of them appear to have callers in the current PR changes. Per YAGNI principle, consider adding these when they have actual callers.

That said, the overhead is negligible and having them pre-registered for future use is a reasonable trade-off for observability infrastructure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/observability/metrics.rs` around lines 68 - 88, The gauges array in
metrics.rs (entries "request_duration_ms", "provider_latency_ms",
"active_conversations") is being registered via metrics.insert with AtomicMetric
{ value: AtomicU64::new(0), help, metric_type: MetricType::Gauge } but has no
callers in this PR; remove these unused entries (or gate their registration
behind a TODO/feature flag) so only metrics with active usage are registered, or
add a clear TODO comment referencing the gauge names to register them when code
paths actually emit values; update the code that constructs the gauges array and
the metrics.insert calls accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/gateway/health.rs`:
- Around line 20-30: The current health types only serialize status; update the
payloads to match the advertised contract by adding an uptime_secs field and
including checks on both endpoints: modify LivenessResponse and
ReadinessResponse to include pub uptime_secs: u64 (or appropriate numeric type)
and ensure LivenessResponse also includes pub checks: BTreeMap<String,
CheckResult> so both structs serialize {status, checks, uptime_secs}; update the
corresponding handler functions that return these types (the liveness and
readiness handlers) to compute and populate uptime_secs (e.g., seconds since
process start) and fill checks (use an empty map or actual CheckResult values as
appropriate) before returning the response.

In `@src/gateway/mod.rs`:
- Line 958: The metrics increment for "requests_total" is currently executed
after auth, JSON parsing, and idempotency checks so rate-limited, unauthorized,
malformed, and duplicate requests are not counted; move the call to
crate::observability::metrics::global().increment("requests_total") to the very
start of the gateway request handler (before any auth checks, JSON parsing,
idempotency checks or early returns — e.g. at the top of the handler function
that currently contains this line) so every incoming request is counted, and
add/update tests that simulate rate-limited, unauthorized, malformed, and
duplicate webhook requests to assert the metric is incremented in those failure
paths.
- Around line 700-701: The /metrics endpoint fails because BroadcastObserver's
as_any() doesn't delegate to the inner observer (so handle_metrics can't
downcast to PrometheusObserver) and requests_total is incremented via
crate::observability::metrics::global() which uses a separate registry; fix by
either delegating type-erasure through BroadcastObserver::as_any() to return the
inner observer's Any so handle_metrics can downcast to PrometheusObserver, or
(preferred) extend the Observer trait to expose the global metrics registry (or
a get_registry method) and implement it for PrometheusObserver and
BroadcastObserver so requests_total increments via metrics::global() and the
PrometheusObserver share the same registry; update handle_metrics,
BroadcastObserver::as_any(), PrometheusObserver implementation and the code that
increments requests_total to use the unified registry exposed by the observer
trait.

---

Duplicate comments:
In `@src/gateway/health.rs`:
- Around line 99-104: The /ready handler currently exposes raw dependency errors
by inserting component.last_error.clone() into CheckResult via
checks.insert(name.clone(), CheckResult { status, message:
component.last_error.clone(), }), which can leak sensitive upstream details;
change this to populate CheckResult.message with a coarse, non-sensitive string
(e.g., "unhealthy" or component.status_summary()) and log the full
component.last_error via the existing logger (or store it behind an
authenticated admin endpoint) so diagnostics remain available but are not
returned in public /ready responses.

---

Nitpick comments:
In `@src/observability/metrics.rs`:
- Around line 68-88: The gauges array in metrics.rs (entries
"request_duration_ms", "provider_latency_ms", "active_conversations") is being
registered via metrics.insert with AtomicMetric { value: AtomicU64::new(0),
help, metric_type: MetricType::Gauge } but has no callers in this PR; remove
these unused entries (or gate their registration behind a TODO/feature flag) so
only metrics with active usage are registered, or add a clear TODO comment
referencing the gauge names to register them when code paths actually emit
values; update the code that constructs the gauges array and the metrics.insert
calls accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3444ed00-858f-446e-aaa8-0e5f9f72b5ac

📥 Commits

Reviewing files that changed from the base of the PR and between 5da13ee and 6709386.

📒 Files selected for processing (4)
  • src/agent/loop_.rs
  • src/gateway/health.rs
  • src/gateway/mod.rs
  • src/observability/metrics.rs

Comment thread src/gateway/health.rs
Comment on lines +20 to +30
#[derive(Debug, Serialize)]
pub struct LivenessResponse {
pub status: &'static str,
}

/// Structured readiness response with component checks.
#[derive(Debug, Serialize)]
pub struct ReadinessResponse {
pub status: &'static str,
pub checks: BTreeMap<String, CheckResult>,
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Implement the advertised health payload shape before merge.

The new types/handler serialize /health as only {"status":"healthy"}, and /ready still has no uptime_secs. The PR contract says these endpoints return status, checks, and uptime_secs; the current tests also lock in the smaller payload, so consumers built to the documented shape will break.

Also applies to: 45-49

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gateway/health.rs` around lines 20 - 30, The current health types only
serialize status; update the payloads to match the advertised contract by adding
an uptime_secs field and including checks on both endpoints: modify
LivenessResponse and ReadinessResponse to include pub uptime_secs: u64 (or
appropriate numeric type) and ensure LivenessResponse also includes pub checks:
BTreeMap<String, CheckResult> so both structs serialize {status, checks,
uptime_secs}; update the corresponding handler functions that return these types
(the liveness and readiness handlers) to compute and populate uptime_secs (e.g.,
seconds since process start) and fill checks (use an empty map or actual
CheckResult values as appropriate) before returning the response.

Comment thread src/gateway/mod.rs
Comment on lines +700 to +701
if config.observability.metrics_enabled {
app = app.route("/metrics", get(handle_metrics));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the symbols involved in metric writes and metric export.
rg -n -C4 'handle_metrics|BroadcastObserver|PrometheusObserver|MetricsRegistry|fn global|as_any|encode\(|increment\("requests_total"\)' src

Repository: zeroclaw-labs/zeroclaw

Length of output: 50378


🏁 Script executed:

find src -name "metrics.rs" -o -name "mod.rs" | xargs grep -l "fn global"

Repository: zeroclaw-labs/zeroclaw

Length of output: 94


🏁 Script executed:

rg -n "mod metrics|fn global" src/observability/

Repository: zeroclaw-labs/zeroclaw

Length of output: 188


🏁 Script executed:

rg -n "crate::observability::metrics::global" src/

Repository: zeroclaw-labs/zeroclaw

Length of output: 590


🏁 Script executed:

cat -n src/observability/metrics.rs | head -200

Repository: zeroclaw-labs/zeroclaw

Length of output: 7447


🏁 Script executed:

sed -n '734,745p' src/gateway/mod.rs

Repository: zeroclaw-labs/zeroclaw

Length of output: 471


🏁 Script executed:

rg -n "requests_total" src/observability/prometheus.rs

Repository: zeroclaw-labs/zeroclaw

Length of output: 377


🏁 Script executed:

rg -n "export_prometheus" src/

Repository: zeroclaw-labs/zeroclaw

Length of output: 528


Fix downcast failure and metric registry mismatch for /metrics endpoint.

The /metrics endpoint has two wiring issues:

  1. Downcast fails due to BroadcastObserver wrapper: state.observer is wrapped in BroadcastObserver (line 621), which implements as_any() to return self rather than delegating to the inner observer. When handle_metrics attempts to downcast to PrometheusObserver (line 738), the downcast fails and returns the fallback message instead of metrics.

  2. requests_total uses a separate metrics registry: Line 958 increments the metric via crate::observability::metrics::global(), which is a standalone atomic counter registry. This is distinct from the Prometheus observer's registry. PrometheusObserver does not expose requests_total at all, so even if the downcast succeeded, the metric would not appear in the scrape output.

Either delegate as_any() through the BroadcastObserver wrapper, or expose the global metrics registry from metrics::global() through the observer trait so both metrics paths write to the same backend.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gateway/mod.rs` around lines 700 - 701, The /metrics endpoint fails
because BroadcastObserver's as_any() doesn't delegate to the inner observer (so
handle_metrics can't downcast to PrometheusObserver) and requests_total is
incremented via crate::observability::metrics::global() which uses a separate
registry; fix by either delegating type-erasure through
BroadcastObserver::as_any() to return the inner observer's Any so handle_metrics
can downcast to PrometheusObserver, or (preferred) extend the Observer trait to
expose the global metrics registry (or a get_registry method) and implement it
for PrometheusObserver and BroadcastObserver so requests_total increments via
metrics::global() and the PrometheusObserver share the same registry; update
handle_metrics, BroadcastObserver::as_any(), PrometheusObserver implementation
and the code that increments requests_total to use the unified registry exposed
by the observer trait.

Comment thread src/gateway/mod.rs
}
}

crate::observability::metrics::global().increment("requests_total");
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Move requests_total before the early exits.

Line 958 only runs after auth, JSON parsing, and idempotency checks, so rate-limited, unauthorized, malformed, and duplicate webhook hits are never counted. That leaves the gateway blind to exactly the failure/abuse traffic operators usually watch.

📈 Suggested fix
 async fn handle_webhook(
     State(state): State<AppState>,
     ConnectInfo(peer_addr): ConnectInfo<SocketAddr>,
     headers: HeaderMap,
     body: Result<Json<WebhookBody>, axum::extract::rejection::JsonRejection>,
 ) -> impl IntoResponse {
+    crate::observability::metrics::global().increment("requests_total");
+
     let rate_key =
         client_key_from_request(Some(peer_addr), &headers, state.trust_forwarded_headers);
     if !state.rate_limiter.allow_webhook(&rate_key) {
         tracing::warn!("/webhook rate limit exceeded");
         let err = serde_json::json!({
@@
-    crate::observability::metrics::global().increment("requests_total");
-
     let message = &webhook_body.message;

Based on learnings: keep observability useful but non-sensitive; add/update tests or validation evidence for failure modes and boundaries.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gateway/mod.rs` at line 958, The metrics increment for "requests_total"
is currently executed after auth, JSON parsing, and idempotency checks so
rate-limited, unauthorized, malformed, and duplicate requests are not counted;
move the call to
crate::observability::metrics::global().increment("requests_total") to the very
start of the gateway request handler (before any auth checks, JSON parsing,
idempotency checks or early returns — e.g. at the top of the handler function
that currently contains this line) so every incoming request is counted, and
add/update tests that simulate rate-limited, unauthorized, malformed, and
duplicate webhook requests to assert the metric is incremented in those failure
paths.

@rareba rareba force-pushed the feature/production-observability branch from 5527a08 to 319e659 Compare March 11, 2026 14:18
@rareba rareba force-pushed the feature/production-observability branch from 0134500 to ea41eb0 Compare March 12, 2026 08:18
rareba added 7 commits March 15, 2026 15:50
…egistry, and config

Add production observability infrastructure:

- src/gateway/health.rs: structured /health (liveness) and /ready
  (readiness) endpoints returning JSON with status, checks, uptime_secs
- src/observability/metrics.rs: lightweight MetricsRegistry using
  std::sync::atomic counters with Prometheus text export
- ObservabilityConfig: new fields health_enabled, metrics_enabled,
  metrics_prefix, health_check_interval_secs (all with serde(default)
  for backward compatibility)
- Wire /health and /ready into gateway router, replacing the old
  handle_health handler
…readiness

- Health endpoint (/health): return only {"status": "healthy"} to avoid
  exposing internal details (uptime, pairing state, component checks) to
  unauthenticated callers
- Readiness endpoint (/ready): remove hardcoded "ok" for provider and
  memory checks that synthesized fake healthy state; now relies solely on
  the health registry for real component status; returns degraded when no
  components have registered yet (service still initializing)
- Wire observability.metrics_prefix config into metrics::init_global()
  so the configured prefix is actually used for Prometheus metric names
  instead of always falling back to "zeroclaw"
…states

When components are still starting or no components have registered yet,
the readiness endpoint now returns "degraded" at the top level instead of
"unhealthy", distinguishing transient warming states from actual failures.
…c increments

Addresses reviewer feedback: metrics are now incremented from actual
runtime code paths (gateway, tools, providers).
@rareba rareba force-pushed the feature/production-observability branch from 7c7568e to 199b693 Compare March 15, 2026 14:52
…, and readiness timeout

Add new ObservabilityConfig fields:
- health_endpoint_path: custom health endpoint path (default: /healthz)
- metrics_endpoint_path: custom metrics endpoint path (default: /metrics)
- custom_labels: additional metric labels in key=value format
- readiness_check_timeout_secs: readiness probe timeout (default: 5s)

All fields use serde(default) for backward compatibility.
@rareba rareba force-pushed the feature/production-observability branch from 199b693 to c513f0c Compare March 15, 2026 14:53
@rareba rareba closed this Mar 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants