feat(observability): health/readiness endpoints, atomic metrics, config by rareba · Pull Request #3035 · zeroclaw-labs/zeroclaw

rareba · 2026-03-08T17:04:18Z

Summary

Base branch target (master for all contributions): master
Problem: No structured health/readiness endpoints for orchestration platforms (Kubernetes, systemd) and no lightweight metrics collection for operational visibility.
Why it matters: Production deployments need standard liveness/readiness probes and basic counters for monitoring and alerting.
What changed:
- Add structured /health (liveness) and /ready (readiness) gateway endpoints returning JSON with status, checks, and uptime_secs
- Add lightweight MetricsRegistry in src/observability/metrics.rs using std::sync::atomic counters with Prometheus text format export
- Extend ObservabilityConfig with health_enabled, metrics_enabled, metrics_prefix, health_check_interval_secs (all #[serde(default)] for backward compat)
- Wire new endpoints into gateway router, replacing the old handle_health handler
What did not change (scope boundary): Existing /metrics handler unchanged. No external metrics dependencies added.

Label Snapshot (required)

Risk label (risk: low|medium|high): medium
Size label (size: XS|S|M|L|XL, auto-managed/read-only): M
Scope labels: gateway, observability, config
Module labels: gateway: health, observability: metrics
Contributor tier label: auto-managed
If any auto-label is incorrect, note requested correction: N/A

Change Metadata

Change type (bug|feature|refactor|docs|security|chore): feature
Primary scope (runtime|provider|channel|memory|security|ci|docs|multi): multi (gateway + observability)

Linked Issue

Closes #
Related # N/A
Depends on #
Supersedes #

Supersede Attribution (required when `Supersedes #` is used)

N/A

Validation Evidence (required)

Commands and result summary:

cargo fmt --all -- --check   # pass
cargo clippy --all-targets -- -D warnings  # pass
cargo test                   # pass

Evidence provided (test/log/trace/screenshot/perf): 8 unit tests for MetricsRegistry (counter increment, gauge record, prometheus export format, zero counters, unknown metric noop, uptime); 3 unit tests for HealthResponse/CheckResult serialization; 49 existing observability tests continue to pass
If any command is intentionally skipped, explain why: N/A

Security Impact (required)

New permissions/capabilities? No
New external network calls? No
Secrets/tokens handling changed? No
File system access scope changed? No
If any Yes, describe risk and mitigation: N/A

Privacy and Data Hygiene (required)

Data-hygiene status (pass|needs-follow-up): pass
Redaction/anonymization notes: Health/metrics endpoints expose operational status only; no user data.
Neutral wording confirmation: Confirmed — uses project-native labels throughout.

Compatibility / Migration

Backward compatible? Yes
Config/env changes? Yes — extends [observability] config section with new fields, all with #[serde(default)]
Migration needed? No
If yes, exact upgrade steps: N/A

i18n Follow-Through (required when docs or user-facing wording changes)

i18n follow-through triggered? No

Human Verification (required)

What was personally validated beyond CI:

Verified scenarios: 11 new unit tests for metrics registry and health endpoint serialization; 49 existing observability tests still passing
Edge cases checked: Zero counters, unknown metric noop, prometheus export format correctness
What was not verified: Live HTTP endpoint testing with curl; Kubernetes probe integration

Side Effects / Blast Radius (required)

Affected subsystems/workflows: Gateway router (new endpoints registered, old handle_health replaced), observability module (new metrics submodule), agent loop (metrics increment calls added), config schema
Potential unintended effects: Old /health handler is replaced — callers expecting the old response format will get structured JSON instead.
Guardrails/monitoring for early detection: Health/metrics endpoints are conditionally registered based on config; existing endpoint behavior preserved when config fields absent.

Agent Collaboration Notes (recommended)

Agent tools used (if any): Claude Code
Workflow/plan summary (if any): Implemented metrics registry first, then health/readiness endpoints, then gateway wiring and config integration.
Verification focus: Prometheus export format correctness and backward-compatible config.
Confirmation: naming + architecture boundaries followed (AGENTS.md + CONTRIBUTING.md): Yes

Rollback Plan (required)

Fast rollback command/path: git revert <commit-sha>
Feature flags or config toggles (if any): health_enabled and metrics_enabled config toggles
Observable failure symptoms: /health or /ready returning 500; metrics export returning empty or malformed Prometheus text.

Risks and Mitigations

Risk: Replacing old handle_health handler may break existing monitoring integrations expecting the old format.
- Mitigation: New format is a superset (JSON with status field); most health checkers only look at HTTP status code.
Risk: Atomic counter overhead on hot paths (agent loop).
- Mitigation: Uses std::sync::atomic with relaxed ordering — negligible overhead for counter increments.

coderabbitai · 2026-03-08T17:04:32Z

Note

`.coderabbit.yaml` has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)

Validation error: Unrecognized key(s) in object: 'tools', 'path_filters', 'review_instructions'

⚙️ Configuration instructions

Please see the configuration documentation for more information.
You can also validate your configuration using the online YAML validator.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

📝 Walkthrough

Walkthrough

Adds observability: new ObservabilityConfig fields and defaults, liveness/readiness HTTP handlers and conditional routes, a thread-safe MetricsRegistry with Prometheus export and global init, and small telemetry increments and test updates.

Changes

Cohort / File(s)	Summary
Configuration & Defaults `src/config/schema.rs`	Added public fields to `ObservabilityConfig`: `health_enabled`, `metrics_enabled`, `metrics_prefix`, `health_check_interval_secs`; updated `Default` and added private helpers for defaults.
Health Endpoints & Routing `src/gateway/health.rs`, `src/gateway/mod.rs`	New `handle_liveness` and `handle_readiness` handlers and types; router now conditionally registers `/health`, `/ready`, and `/metrics` based on observability config; webhook increments request metric.
Metrics Implementation `src/observability/metrics.rs`, `src/observability/mod.rs`	New `MetricsRegistry` with atomic counters/gauges, uptime, Prometheus text export, global singleton (`init_global` / `global`); exported via `observability::metrics`.
Startup Initialization `src/main.rs`	Initializes global metrics registry at startup using `config.observability.metrics_prefix`.
Telemetry in Agent Loop `src/agent/loop_.rs`	Added metric increments for tool calls, tool errors, and provider requests.
Tests / Minor Edits `src/observability/runtime_trace.rs`	Test initializer updated to use `..ObservabilityConfig::default()` (test-only change).

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Router as Gateway Router
  participant Handler as Health Handler
  participant Health as Health Subsystem

  Client->>Router: GET /ready
  Router->>Handler: dispatch to handle_readiness
  Handler->>Health: crate::health::snapshot()
  Health-->>Handler: component statuses (BTreeMap)
  Handler->>Handler: map statuses -> checks, determine overall status
  Handler-->>Router: HTTP 200 (all ok) or 503 (unhealthy) + JSON body
  Router-->>Client: HTTP response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

feat: OpenTelemetry tracing and metrics observer #160: Modifies the same ObservabilityConfig struct (adds observability fields and defaults).
feat(agent): add result-aware loop detection for tool-call loop #2153: Touches src/agent/loop_.rs and agent loop logic; related to added telemetry in tool/provider loops.
feat(gateway): add per-endpoint rate limiting and webhook idempotency #188: Changes tool execution flow in the agent loop, which overlaps with telemetry additions.

Suggested labels

size: M, risk: medium, core, gateway: core, config: core, service: core, tests

Suggested reviewers

theonlyhennygod

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Description check	❓ Inconclusive	The description provides a clear summary of changes, a test plan with specific validation steps, and risk/rollback information, though it omits most required template sections.	Complete the description template by adding Label Snapshot, Change Metadata, Linked Issue, Validation Evidence, Security/Privacy Impact, Compatibility, and Human Verification sections for a comprehensive review record.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title concisely summarizes the three main changes: health/readiness endpoints, atomic metrics implementation, and configuration extensions.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

src/gateway/health.rs (1)
155-224: Please add handler-level failure-mode tests.

The current tests only assert serialization shape. They do not exercise the 200-vs-503 boundary or verify that public responses stay redacted when components fail. A small handle_readiness test with an unhealthy snapshot would lock those semantics down.

Based on learnings: "For high-risk code changes, include at least one boundary/failure-mode validation in addition to standard checks"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/gateway/health.rs` around lines 155 - 224, Add a handler-level
failure-mode test that exercises handle_readiness: construct a
HealthResponse/CheckResult map containing an unhealthy/degraded component (e.g.,
provider status "degraded" with a message), invoke the handle_readiness handler
(or the readiness HTTP endpoint) with that snapshot, and assert the HTTP
response status is 503 (not 200) and the JSON body preserves public redaction
rules (no internal details exposed—e.g., message should be omitted or redacted
as in CheckResult serialization). Use the existing HealthResponse and
CheckResult types to build the snapshot and call the same handler function used
in production so the test verifies the 200-vs-503 boundary and public response
redaction semantics.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/config/schema.rs`:
- Around line 1903-1905: The new metrics_prefix config field is currently
unused; either thread it into the global metrics registry creation or remove the
config key. To fix, update observability::metrics (where MetricsRegistry is
initialized with the hard-coded "zeroclaw") to accept a prefix parameter (or
overload the constructor/init function) and use that prefix when creating the
MetricsRegistry instead of the literal "zeroclaw", then pass
config.observability.metrics_prefix from the place that initializes the
registry; alternatively delete the metrics_prefix field and its default helper
in schema.rs (and any tests) if you don't want to wire it up. Ensure you update
references to metrics_prefix and the MetricsRegistry initializer (e.g., the
constructor or init function in observability/metrics.rs) so the config value is
actually applied.

In `@src/gateway/health.rs`:
- Around line 50-59: The health handler currently exposes sensitive/internal
details in CheckResult.message (e.g., the pairing block created via
checks.insert(...) and similar blocks for model and last_error); change these to
return only coarse, stable values (e.g., status = "ok"/"degraded" and message =
None or a non-sensitive string like "paired" / "not-paired" / "error") and
remove/raw-redact any raw variables (paired, configured model, last_error) from
the public payload; keep the full diagnostics in server logs or an admin-only
endpoint instead; apply the same redaction to the other CheckResult inserts
noted (the blocks around the other checks at the referenced locations).
- Around line 88-95: The code is currently synthesizing healthy dependency
entries for "provider" and "memory" (creating CheckResult with status "ok" and
message from state.model) which can mask real probe failures and overwrite
snapshot results; instead, stop inserting hardcoded "provider" and "memory"
CheckResult entries — query the real probe/health registry or probe state for
those keys and only insert their CheckResult if an actual probe result exists,
otherwise omit the entry or set it to a degraded/unknown status; also ensure you
do not blindly overwrite an existing failing "memory" entry created earlier in
the snapshot loop (check for existing checks.contains_key("memory") before
inserting or insert in the snapshot processing path that respects probe
results).

---

Nitpick comments:
In `@src/gateway/health.rs`:
- Around line 155-224: Add a handler-level failure-mode test that exercises
handle_readiness: construct a HealthResponse/CheckResult map containing an
unhealthy/degraded component (e.g., provider status "degraded" with a message),
invoke the handle_readiness handler (or the readiness HTTP endpoint) with that
snapshot, and assert the HTTP response status is 503 (not 200) and the JSON body
preserves public redaction rules (no internal details exposed—e.g., message
should be omitted or redacted as in CheckResult serialization). Use the existing
HealthResponse and CheckResult types to build the snapshot and call the same
handler function used in production so the test verifies the 200-vs-503 boundary
and public response redaction semantics.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8bac5431-7ca1-49cc-83d6-524fc1e4dd41

📥 Commits

Reviewing files that changed from the base of the PR and between f7fefd4 and 3c2bab0.

📒 Files selected for processing (6)

src/config/schema.rs
src/gateway/health.rs
src/gateway/mod.rs
src/observability/metrics.rs
src/observability/mod.rs
src/observability/runtime_trace.rs

coderabbitai · 2026-03-08T17:15:55Z

+    /// Prefix for lightweight atomic metrics (Prometheus metric names).
+    #[serde(default = "default_metrics_prefix")]
+    pub metrics_prefix: String,


⚠️ Potential issue | 🟠 Major

metrics_prefix is currently a no-op.

src/observability/metrics.rs:168-170 still initializes the global MetricsRegistry with a hard-coded "zeroclaw" prefix, so changing this field will never affect exported metric names. Please either thread observability.metrics_prefix into registry initialization before exposing the key, or remove the key until there is a real caller.

As per coding guidelines: "Do not add new config keys ... without a concrete accepted use case; do not introduce speculative 'future-proof' abstractions without at least one current caller."

Also applies to: 1923-1924

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/config/schema.rs` around lines 1903 - 1905, The new metrics_prefix config field is currently unused; either thread it into the global metrics registry creation or remove the config key. To fix, update observability::metrics (where MetricsRegistry is initialized with the hard-coded "zeroclaw") to accept a prefix parameter (or overload the constructor/init function) and use that prefix when creating the MetricsRegistry instead of the literal "zeroclaw", then pass config.observability.metrics_prefix from the place that initializes the registry; alternatively delete the metrics_prefix field and its default helper in schema.rs (and any tests) if you don't want to wire it up. Ensure you update references to metrics_prefix and the MetricsRegistry initializer (e.g., the constructor or init function in observability/metrics.rs) so the config value is actually applied.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/gateway/health.rs (1)

98-103: ⚠️ Potential issue | 🟠 Major

Don’t return raw dependency errors from /ready.

component.last_error is copied straight into a public readiness payload. On a public bind, that leaks backend details to unauthenticated callers. Quick rollback is to suppress message entirely here, or replace it with a coarse string and keep full diagnostics in logs/admin-only paths.

Suggested fix

         checks.insert(
             name.clone(),
             CheckResult {
                 status,
-                message: component.last_error.clone(),
+                message: component
+                    .last_error
+                    .as_ref()
+                    .map(|_| "dependency unavailable".to_string()),
             },
         );

Based on learnings: keep observability useful but non-sensitive.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/gateway/health.rs` around lines 98 - 103, The readiness payload is
leaking raw backend errors by copying component.last_error into
CheckResult.message for the /ready endpoint; instead remove or sanitize that
field: update the construction of CheckResult (where checks.insert(name.clone(),
CheckResult { status, message: component.last_error.clone(), }) is created) to
set message to None or a coarse, non-sensitive string like "unavailable" and log
component.last_error with detail to the service logger or an admin-only
diagnostics path so full error details are not exposed in public readiness
responses.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/gateway/health.rs`:
- Around line 70-79: The top-level health status currently collapses every
non-OK check to "unhealthy"; change the aggregation logic (the code that builds
the top-level overall/status after checks are inserted — e.g., where you inspect
checks created from snapshot.components and CheckResult) so it computes overall
as: "unhealthy" if any check.status == "unhealthy", else "degraded" if any
check.status == "degraded", else "ok". Keep returning HTTP 503 for degraded but
ensure the JSON top-level status string is "degraded" for empty registry /
starting-only snapshots. Update the health computation code paths that touch
snapshot.components, checks.insert, and the overall/status assignment (also the
section covering lines ~82-112) and add tests that assert empty-snapshot and
starting-only snapshots produce top-level "degraded" (with 503) rather than
"unhealthy".

---

Duplicate comments:
In `@src/gateway/health.rs`:
- Around line 98-103: The readiness payload is leaking raw backend errors by
copying component.last_error into CheckResult.message for the /ready endpoint;
instead remove or sanitize that field: update the construction of CheckResult
(where checks.insert(name.clone(), CheckResult { status, message:
component.last_error.clone(), }) is created) to set message to None or a coarse,
non-sensitive string like "unavailable" and log component.last_error with detail
to the service logger or an admin-only diagnostics path so full error details
are not exposed in public readiness responses.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: adfc5009-4444-4335-8450-70c21d77f015

📥 Commits

Reviewing files that changed from the base of the PR and between 3c2bab0 and 5da13ee.

📒 Files selected for processing (2)

src/gateway/health.rs
src/main.rs

rikitrader

LGTM. Health/readiness endpoints wired into gateway, atomic metrics, backward-compatible config additions. Good observability foundation.

rareba · 2026-03-09T07:37:01Z

Thanks for the review feedback. I've addressed the integration concerns:

Changes pushed:

Health (/health, /ready) and metrics (/metrics) routes are now conditionally registered based on config.observability.health_enabled and config.observability.metrics_enabled.
Added metric increment calls in actual runtime code paths:
- requests_total in gateway webhook handler
- tool_calls_total / tool_errors_total in tool execution path
- provider_requests_total before provider chat calls
Added test verifying runtime metric increments and Prometheus export.

All 9 metrics tests pass. cargo check clean.

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

src/gateway/health.rs (1)

99-104: ⚠️ Potential issue | 🟠 Major

Redact raw dependency errors from /ready.

component.last_error.clone() is returned to every caller of /ready. On a public bind, that leaks internal dependency details and upstream error text. Keep the public payload coarse and move the raw diagnostics to logs or an authenticated admin endpoint.

🔒 Suggested redaction

         checks.insert(
             name.clone(),
             CheckResult {
                 status,
-                message: component.last_error.clone(),
+                message: component
+                    .last_error
+                    .as_ref()
+                    .map(|_| "dependency unavailable".to_string()),
             },
         );

Based on learnings: keep observability useful but non-sensitive.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/gateway/health.rs` around lines 99 - 104, The /ready handler currently
exposes raw dependency errors by inserting component.last_error.clone() into
CheckResult via checks.insert(name.clone(), CheckResult { status, message:
component.last_error.clone(), }), which can leak sensitive upstream details;
change this to populate CheckResult.message with a coarse, non-sensitive string
(e.g., "unhealthy" or component.status_summary()) and log the full
component.last_error via the existing logger (or store it behind an
authenticated admin endpoint) so diagnostics remain available but are not
returned in public /ready responses.

🧹 Nitpick comments (1)

src/observability/metrics.rs (1)
68-88: Consider deferring unused gauge metrics.

The gauge metrics (request_duration_ms, provider_latency_ms, active_conversations) are registered but none of them appear to have callers in the current PR changes. Per YAGNI principle, consider adding these when they have actual callers.

That said, the overhead is negligible and having them pre-registered for future use is a reasonable trade-off for observability infrastructure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/observability/metrics.rs` around lines 68 - 88, The gauges array in
metrics.rs (entries "request_duration_ms", "provider_latency_ms",
"active_conversations") is being registered via metrics.insert with AtomicMetric
{ value: AtomicU64::new(0), help, metric_type: MetricType::Gauge } but has no
callers in this PR; remove these unused entries (or gate their registration
behind a TODO/feature flag) so only metrics with active usage are registered, or
add a clear TODO comment referencing the gauge names to register them when code
paths actually emit values; update the code that constructs the gauges array and
the metrics.insert calls accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/gateway/health.rs`:
- Around line 20-30: The current health types only serialize status; update the
payloads to match the advertised contract by adding an uptime_secs field and
including checks on both endpoints: modify LivenessResponse and
ReadinessResponse to include pub uptime_secs: u64 (or appropriate numeric type)
and ensure LivenessResponse also includes pub checks: BTreeMap<String,
CheckResult> so both structs serialize {status, checks, uptime_secs}; update the
corresponding handler functions that return these types (the liveness and
readiness handlers) to compute and populate uptime_secs (e.g., seconds since
process start) and fill checks (use an empty map or actual CheckResult values as
appropriate) before returning the response.

In `@src/gateway/mod.rs`:
- Line 958: The metrics increment for "requests_total" is currently executed
after auth, JSON parsing, and idempotency checks so rate-limited, unauthorized,
malformed, and duplicate requests are not counted; move the call to
crate::observability::metrics::global().increment("requests_total") to the very
start of the gateway request handler (before any auth checks, JSON parsing,
idempotency checks or early returns — e.g. at the top of the handler function
that currently contains this line) so every incoming request is counted, and
add/update tests that simulate rate-limited, unauthorized, malformed, and
duplicate webhook requests to assert the metric is incremented in those failure
paths.
- Around line 700-701: The /metrics endpoint fails because BroadcastObserver's
as_any() doesn't delegate to the inner observer (so handle_metrics can't
downcast to PrometheusObserver) and requests_total is incremented via
crate::observability::metrics::global() which uses a separate registry; fix by
either delegating type-erasure through BroadcastObserver::as_any() to return the
inner observer's Any so handle_metrics can downcast to PrometheusObserver, or
(preferred) extend the Observer trait to expose the global metrics registry (or
a get_registry method) and implement it for PrometheusObserver and
BroadcastObserver so requests_total increments via metrics::global() and the
PrometheusObserver share the same registry; update handle_metrics,
BroadcastObserver::as_any(), PrometheusObserver implementation and the code that
increments requests_total to use the unified registry exposed by the observer
trait.

---

Duplicate comments:
In `@src/gateway/health.rs`:
- Around line 99-104: The /ready handler currently exposes raw dependency errors
by inserting component.last_error.clone() into CheckResult via
checks.insert(name.clone(), CheckResult { status, message:
component.last_error.clone(), }), which can leak sensitive upstream details;
change this to populate CheckResult.message with a coarse, non-sensitive string
(e.g., "unhealthy" or component.status_summary()) and log the full
component.last_error via the existing logger (or store it behind an
authenticated admin endpoint) so diagnostics remain available but are not
returned in public /ready responses.

---

Nitpick comments:
In `@src/observability/metrics.rs`:
- Around line 68-88: The gauges array in metrics.rs (entries
"request_duration_ms", "provider_latency_ms", "active_conversations") is being
registered via metrics.insert with AtomicMetric { value: AtomicU64::new(0),
help, metric_type: MetricType::Gauge } but has no callers in this PR; remove
these unused entries (or gate their registration behind a TODO/feature flag) so
only metrics with active usage are registered, or add a clear TODO comment
referencing the gauge names to register them when code paths actually emit
values; update the code that constructs the gauges array and the metrics.insert
calls accordingly.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3444ed00-858f-446e-aaa8-0e5f9f72b5ac

📥 Commits

Reviewing files that changed from the base of the PR and between 5da13ee and 6709386.

📒 Files selected for processing (4)

src/agent/loop_.rs
src/gateway/health.rs
src/gateway/mod.rs
src/observability/metrics.rs

coderabbitai · 2026-03-09T07:44:02Z

+#[derive(Debug, Serialize)]
+pub struct LivenessResponse {
+    pub status: &'static str,
+}
+
+/// Structured readiness response with component checks.
+#[derive(Debug, Serialize)]
+pub struct ReadinessResponse {
+    pub status: &'static str,
+    pub checks: BTreeMap<String, CheckResult>,
+}


⚠️ Potential issue | 🟠 Major

Implement the advertised health payload shape before merge.

The new types/handler serialize /health as only {"status":"healthy"}, and /ready still has no uptime_secs. The PR contract says these endpoints return status, checks, and uptime_secs; the current tests also lock in the smaller payload, so consumers built to the documented shape will break.

Also applies to: 45-49

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/gateway/health.rs` around lines 20 - 30, The current health types only serialize status; update the payloads to match the advertised contract by adding an uptime_secs field and including checks on both endpoints: modify LivenessResponse and ReadinessResponse to include pub uptime_secs: u64 (or appropriate numeric type) and ensure LivenessResponse also includes pub checks: BTreeMap<String, CheckResult> so both structs serialize {status, checks, uptime_secs}; update the corresponding handler functions that return these types (the liveness and readiness handlers) to compute and populate uptime_secs (e.g., seconds since process start) and fill checks (use an empty map or actual CheckResult values as appropriate) before returning the response.

coderabbitai · 2026-03-09T07:44:02Z

+    if config.observability.metrics_enabled {
+        app = app.route("/metrics", get(handle_metrics));


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect the symbols involved in metric writes and metric export. rg -n -C4 'handle_metrics|BroadcastObserver|PrometheusObserver|MetricsRegistry|fn global|as_any|encode$|increment\("requests_total"$' src

Repository: zeroclaw-labs/zeroclaw

Length of output: 50378

🏁 Script executed:

find src -name "metrics.rs" -o -name "mod.rs" | xargs grep -l "fn global"

Repository: zeroclaw-labs/zeroclaw

Length of output: 94

🏁 Script executed:

rg -n "mod metrics|fn global" src/observability/

Repository: zeroclaw-labs/zeroclaw

Length of output: 188

🏁 Script executed:

rg -n "crate::observability::metrics::global" src/

Repository: zeroclaw-labs/zeroclaw

Length of output: 590

🏁 Script executed:

cat -n src/observability/metrics.rs | head -200

Repository: zeroclaw-labs/zeroclaw

Length of output: 7447

🏁 Script executed:

sed -n '734,745p' src/gateway/mod.rs

Repository: zeroclaw-labs/zeroclaw

Length of output: 471

🏁 Script executed:

rg -n "requests_total" src/observability/prometheus.rs

Repository: zeroclaw-labs/zeroclaw

Length of output: 377

🏁 Script executed:

rg -n "export_prometheus" src/

Repository: zeroclaw-labs/zeroclaw

Length of output: 528

Fix downcast failure and metric registry mismatch for /metrics endpoint.

The /metrics endpoint has two wiring issues:

Downcast fails due to BroadcastObserver wrapper: state.observer is wrapped in BroadcastObserver (line 621), which implements as_any() to return self rather than delegating to the inner observer. When handle_metrics attempts to downcast to PrometheusObserver (line 738), the downcast fails and returns the fallback message instead of metrics.

requests_total uses a separate metrics registry: Line 958 increments the metric via crate::observability::metrics::global(), which is a standalone atomic counter registry. This is distinct from the Prometheus observer's registry. PrometheusObserver does not expose requests_total at all, so even if the downcast succeeded, the metric would not appear in the scrape output.

Either delegate as_any() through the BroadcastObserver wrapper, or expose the global metrics registry from metrics::global() through the observer trait so both metrics paths write to the same backend.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/gateway/mod.rs` around lines 700 - 701, The /metrics endpoint fails because BroadcastObserver's as_any() doesn't delegate to the inner observer (so handle_metrics can't downcast to PrometheusObserver) and requests_total is incremented via crate::observability::metrics::global() which uses a separate registry; fix by either delegating type-erasure through BroadcastObserver::as_any() to return the inner observer's Any so handle_metrics can downcast to PrometheusObserver, or (preferred) extend the Observer trait to expose the global metrics registry (or a get_registry method) and implement it for PrometheusObserver and BroadcastObserver so requests_total increments via metrics::global() and the PrometheusObserver share the same registry; update handle_metrics, BroadcastObserver::as_any(), PrometheusObserver implementation and the code that increments requests_total to use the unified registry exposed by the observer trait.

coderabbitai · 2026-03-09T07:44:02Z

        }
    }

+    crate::observability::metrics::global().increment("requests_total");


⚠️ Potential issue | 🟠 Major

Move requests_total before the early exits.

Line 958 only runs after auth, JSON parsing, and idempotency checks, so rate-limited, unauthorized, malformed, and duplicate webhook hits are never counted. That leaves the gateway blind to exactly the failure/abuse traffic operators usually watch.

📈 Suggested fix

async fn handle_webhook( State(state): State<AppState>, ConnectInfo(peer_addr): ConnectInfo<SocketAddr>, headers: HeaderMap, body: Result<Json<WebhookBody>, axum::extract::rejection::JsonRejection>, ) -> impl IntoResponse { + crate::observability::metrics::global().increment("requests_total"); + let rate_key = client_key_from_request(Some(peer_addr), &headers, state.trust_forwarded_headers); if !state.rate_limiter.allow_webhook(&rate_key) { tracing::warn!("/webhook rate limit exceeded"); let err = serde_json::json!({ @@ - crate::observability::metrics::global().increment("requests_total"); - let message = &webhook_body.message;

Based on learnings: keep observability useful but non-sensitive; add/update tests or validation evidence for failure modes and boundaries.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/gateway/mod.rs` at line 958, The metrics increment for "requests_total" is currently executed after auth, JSON parsing, and idempotency checks so rate-limited, unauthorized, malformed, and duplicate requests are not counted; move the call to crate::observability::metrics::global().increment("requests_total") to the very start of the gateway request handler (before any auth checks, JSON parsing, idempotency checks or early returns — e.g. at the top of the handler function that currently contains this line) so every incoming request is counted, and add/update tests that simulate rate-limited, unauthorized, malformed, and duplicate webhook requests to assert the metric is incremented in those failure paths.

…egistry, and config Add production observability infrastructure: - src/gateway/health.rs: structured /health (liveness) and /ready (readiness) endpoints returning JSON with status, checks, uptime_secs - src/observability/metrics.rs: lightweight MetricsRegistry using std::sync::atomic counters with Prometheus text export - ObservabilityConfig: new fields health_enabled, metrics_enabled, metrics_prefix, health_check_interval_secs (all with serde(default) for backward compatibility) - Wire /health and /ready into gateway router, replacing the old handle_health handler

…readiness - Health endpoint (/health): return only {"status": "healthy"} to avoid exposing internal details (uptime, pairing state, component checks) to unauthenticated callers - Readiness endpoint (/ready): remove hardcoded "ok" for provider and memory checks that synthesized fake healthy state; now relies solely on the health registry for real component status; returns degraded when no components have registered yet (service still initializing) - Wire observability.metrics_prefix config into metrics::init_global() so the configured prefix is actually used for Prometheus metric names instead of always falling back to "zeroclaw"

…states When components are still starting or no components have registered yet, the readiness endpoint now returns "degraded" at the top level instead of "unhealthy", distinguishing transient warming states from actual failures.

…c increments Addresses reviewer feedback: metrics are now incremented from actual runtime code paths (gateway, tools, providers).

…, and readiness timeout Add new ObservabilityConfig fields: - health_endpoint_path: custom health endpoint path (default: /healthz) - metrics_endpoint_path: custom metrics endpoint path (default: /metrics) - custom_labels: additional metric labels in key=value format - readiness_check_timeout_secs: readiness probe timeout (default: 5s) All fields use serde(default) for backward compatibility.

rareba requested review from JordanTheJet and theonlyhennygod as code owners March 8, 2026 17:04

coderabbitai bot reviewed Mar 8, 2026

View reviewed changes

Comment thread src/gateway/health.rs

This was referenced Mar 9, 2026

🦞 OpenClaw 生态日报 2026-03-09 duanyytop/agents-radar#107

Open

🦞 Bản tin hàng ngày hệ sinh thái OpenClaw 2026-03-09 compasify/agents-radar#15

Open

🦞 OpenClaw 生态日报 2026-03-09 rollysys/agents-radar#58

Open

rikitrader approved these changes Mar 9, 2026

View reviewed changes

coderabbitai bot reviewed Mar 9, 2026

View reviewed changes

rareba force-pushed the feature/production-observability branch from 5527a08 to 319e659 Compare March 11, 2026 14:18

rareba requested a review from SimianAstronaut7 as a code owner March 11, 2026 17:50

rareba force-pushed the feature/production-observability branch from 0134500 to ea41eb0 Compare March 12, 2026 08:18

rareba added 7 commits March 15, 2026 15:50

fix(observability): gate endpoints behind config flags and wire metri…

73ee653

…c increments Addresses reviewer feedback: metrics are now incremented from actual runtime code paths (gateway, tools, providers).

style: fix cargo fmt + clippy violations from TTS merge

32769a9

ci: trigger quality gate

68a1df8

style: fix cargo fmt + clippy violations

2444717

rareba force-pushed the feature/production-observability branch from 7c7568e to 199b693 Compare March 15, 2026 14:52

rareba force-pushed the feature/production-observability branch from 199b693 to c513f0c Compare March 15, 2026 14:53

rareba closed this Mar 15, 2026

		if config.observability.metrics_enabled {
		app = app.route("/metrics", get(handle_metrics));

Conversation

rareba commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Label Snapshot (required)

Change Metadata

Linked Issue

Supersede Attribution (required when Supersedes # is used)

Validation Evidence (required)

Security Impact (required)

Privacy and Data Hygiene (required)

Compatibility / Migration

i18n Follow-Through (required when docs or user-facing wording changes)

Human Verification (required)

Side Effects / Blast Radius (required)

Agent Collaboration Notes (recommended)

Rollback Plan (required)

Risks and Mitigations

Uh oh!

coderabbitai bot commented Mar 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

.coderabbit.yaml has unrecognized properties

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 inconclusive)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 8, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rikitrader left a comment

Choose a reason for hiding this comment

Uh oh!

rareba commented Mar 9, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rareba commented Mar 8, 2026 •

edited

Loading

Supersede Attribution (required when `Supersedes #` is used)

coderabbitai bot commented Mar 8, 2026 •

edited

Loading

`.coderabbit.yaml` has unrecognized properties