feat(observability): health/readiness endpoints, atomic metrics, config #3035
rareba wants to merge 8 commits into zeroclaw-labs:master
Conversation

Note
| Cohort / File(s) | Summary |
|---|---|
| Configuration & Defaults — `src/config/schema.rs` | Added public fields to `ObservabilityConfig`: `health_enabled`, `metrics_enabled`, `metrics_prefix`, `health_check_interval_secs`; updated `Default` and added private helpers for defaults. |
| Health Endpoints & Routing — `src/gateway/health.rs`, `src/gateway/mod.rs` | New `handle_liveness` and `handle_readiness` handlers and types; router now conditionally registers `/health`, `/ready`, and `/metrics` based on observability config; webhook increments request metric. |
| Metrics Implementation — `src/observability/metrics.rs`, `src/observability/mod.rs` | New `MetricsRegistry` with atomic counters/gauges, uptime, Prometheus text export, global singleton (`init_global` / `global`); exported via `observability::metrics`. |
| Startup Initialization — `src/main.rs` | Initializes global metrics registry at startup using `config.observability.metrics_prefix`. |
| Telemetry in Agent Loop — `src/agent/loop_.rs` | Added metric increments for tool calls, tool errors, and provider requests. |
| Tests / Minor Edits — `src/observability/runtime_trace.rs` | Test initializer updated to use `..ObservabilityConfig::default()` (test-only change). |
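The `MetricsRegistry` described above can be pictured as a thin wrapper over `std::sync::atomic` counters with a text exporter. A minimal sketch, assuming the names from the summary (`MetricsRegistry`, `increment`, Prometheus text export) — the actual field layout and metric set in `src/observability/metrics.rs` may differ:

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};

pub struct MetricsRegistry {
    prefix: String,
    counters: BTreeMap<&'static str, AtomicU64>,
}

impl MetricsRegistry {
    pub fn new(prefix: &str) -> Self {
        let mut counters = BTreeMap::new();
        // Illustrative metric names; the PR registers its own set.
        for name in ["requests_total", "tool_calls_total", "tool_errors_total"] {
            counters.insert(name, AtomicU64::new(0));
        }
        Self { prefix: prefix.to_string(), counters }
    }

    /// Lock-free increment; relaxed ordering is enough for a monotonic counter.
    pub fn increment(&self, name: &str) {
        if let Some(c) = self.counters.get(name) {
            c.fetch_add(1, Ordering::Relaxed);
        }
    }

    /// Prometheus text exposition format: a `# TYPE` line plus one sample each.
    pub fn export_prometheus(&self) -> String {
        let mut out = String::new();
        for (name, value) in &self.counters {
            let full = format!("{}_{}", self.prefix, name);
            out.push_str(&format!("# TYPE {full} counter\n"));
            out.push_str(&format!("{full} {}\n", value.load(Ordering::Relaxed)));
        }
        out
    }
}

fn main() {
    let reg = MetricsRegistry::new("zeroclaw");
    reg.increment("requests_total");
    reg.increment("requests_total");
    print!("{}", reg.export_prometheus());
}
```

Shared references (`&self`) suffice for writers because the counters use interior mutability, which is what lets the registry live in a global singleton.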
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Client as Client
    participant Router as Gateway Router
    participant Handler as Health Handler
    participant Health as Health Subsystem
    Client->>Router: GET /ready
    Router->>Handler: dispatch to handle_readiness
    Handler->>Health: crate::health::snapshot()
    Health-->>Handler: component statuses (BTreeMap)
    Handler->>Handler: map statuses -> checks, determine overall status
    Handler-->>Router: HTTP 200 (all ok) or 503 (unhealthy) + JSON body
    Router-->>Client: HTTP response
```
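The "map statuses -> checks, determine overall status" step in the diagram can be sketched as a pure function. The status strings and the 503-for-degraded policy are assumptions drawn from the review discussion in this thread, not confirmed implementation details:

```rust
/// Fold per-component statuses into an overall status plus an HTTP code.
fn overall(checks: &[(&str, &str)]) -> (&'static str, u16) {
    let has = |s: &str| checks.iter().any(|(_, status)| *status == s);
    if has("unhealthy") {
        ("unhealthy", 503)
    } else if has("degraded") || checks.is_empty() {
        // Empty registry = still initializing: degraded, not failed outright.
        ("degraded", 503)
    } else {
        ("ok", 200)
    }
}

fn main() {
    assert_eq!(overall(&[("provider", "ok"), ("memory", "ok")]), ("ok", 200));
    assert_eq!(overall(&[("provider", "degraded")]), ("degraded", 503));
    assert_eq!(overall(&[("memory", "unhealthy"), ("provider", "degraded")]), ("unhealthy", 503));
    assert_eq!(overall(&[]), ("degraded", 503)); // nothing registered yet
    println!("aggregation checks passed");
}
```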
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
- feat: OpenTelemetry tracing and metrics observer #160 — modifies the same `ObservabilityConfig` struct (adds observability fields and defaults).
- feat(agent): add result-aware loop detection for tool-call loop #2153 — touches `src/agent/loop_.rs` and agent loop logic; related to the telemetry added in tool/provider loops.
- feat(gateway): add per-endpoint rate limiting and webhook idempotency #188 — changes tool execution flow in the agent loop, which overlaps with the telemetry additions.
Suggested labels
size: M, risk: medium, core, gateway: core, config: core, service: core, tests
Suggested reviewers
- theonlyhennygod
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ❓ Inconclusive | The description provides a clear summary of changes, a test plan with specific validation steps, and risk/rollback information, though it omits most required template sections. | Complete the description template by adding Label Snapshot, Change Metadata, Linked Issue, Validation Evidence, Security/Privacy Impact, Compatibility, and Human Verification sections for a comprehensive review record. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title concisely summarizes the three main changes: health/readiness endpoints, atomic metrics implementation, and configuration extensions. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/gateway/health.rs (1)
155-224: Please add handler-level failure-mode tests.

The current tests only assert serialization shape. They do not exercise the 200-vs-503 boundary or verify that public responses stay redacted when components fail. A small `handle_readiness` test with an unhealthy snapshot would lock those semantics down.

Based on learnings: "For high-risk code changes, include at least one boundary/failure-mode validation in addition to standard checks."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/gateway/health.rs` around lines 155 - 224, Add a handler-level failure-mode test that exercises handle_readiness: construct a HealthResponse/CheckResult map containing an unhealthy/degraded component (e.g., provider status "degraded" with a message), invoke the handle_readiness handler (or the readiness HTTP endpoint) with that snapshot, and assert the HTTP response status is 503 (not 200) and the JSON body preserves public redaction rules (no internal details exposed—e.g., message should be omitted or redacted as in CheckResult serialization). Use the existing HealthResponse and CheckResult types to build the snapshot and call the same handler function used in production so the test verifies the 200-vs-503 boundary and public response redaction semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/config/schema.rs`:
- Around line 1903-1905: The new metrics_prefix config field is currently
unused; either thread it into the global metrics registry creation or remove the
config key. To fix, update observability::metrics (where MetricsRegistry is
initialized with the hard-coded "zeroclaw") to accept a prefix parameter (or
overload the constructor/init function) and use that prefix when creating the
MetricsRegistry instead of the literal "zeroclaw", then pass
config.observability.metrics_prefix from the place that initializes the
registry; alternatively delete the metrics_prefix field and its default helper
in schema.rs (and any tests) if you don't want to wire it up. Ensure you update
references to metrics_prefix and the MetricsRegistry initializer (e.g., the
constructor or init function in observability/metrics.rs) so the config value is
actually applied.
In `@src/gateway/health.rs`:
- Around line 50-59: The health handler currently exposes sensitive/internal
details in CheckResult.message (e.g., the pairing block created via
checks.insert(...) and similar blocks for model and last_error); change these to
return only coarse, stable values (e.g., status = "ok"/"degraded" and message =
None or a non-sensitive string like "paired" / "not-paired" / "error") and
remove/raw-redact any raw variables (paired, configured model, last_error) from
the public payload; keep the full diagnostics in server logs or an admin-only
endpoint instead; apply the same redaction to the other CheckResult inserts
noted (the blocks around the other checks at the referenced locations).
- Around line 88-95: The code is currently synthesizing healthy dependency
entries for "provider" and "memory" (creating CheckResult with status "ok" and
message from state.model) which can mask real probe failures and overwrite
snapshot results; instead, stop inserting hardcoded "provider" and "memory"
CheckResult entries — query the real probe/health registry or probe state for
those keys and only insert their CheckResult if an actual probe result exists,
otherwise omit the entry or set it to a degraded/unknown status; also ensure you
do not blindly overwrite an existing failing "memory" entry created earlier in
the snapshot loop (check for existing checks.contains_key("memory") before
inserting or insert in the snapshot processing path that respects probe
results).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8bac5431-7ca1-49cc-83d6-524fc1e4dd41
📒 Files selected for processing (6)
src/config/schema.rs, src/gateway/health.rs, src/gateway/mod.rs, src/observability/metrics.rs, src/observability/mod.rs, src/observability/runtime_trace.rs
```rust
/// Prefix for lightweight atomic metrics (Prometheus metric names).
#[serde(default = "default_metrics_prefix")]
pub metrics_prefix: String,
```
metrics_prefix is currently a no-op.
src/observability/metrics.rs:168-170 still initializes the global MetricsRegistry with a hard-coded "zeroclaw" prefix, so changing this field will never affect exported metric names. Please either thread observability.metrics_prefix into registry initialization before exposing the key, or remove the key until there is a real caller.
As per coding guidelines: "Do not add new config keys ... without a concrete accepted use case; do not introduce speculative 'future-proof' abstractions without at least one current caller."
Also applies to: 1923-1924
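Threading the configured prefix into registry initialization, as the comment asks, might look like the following sketch using `std::sync::OnceLock`. The `init_global`/`global` names come from the PR summary; the struct body and the fallback behavior are illustrative assumptions:

```rust
use std::sync::OnceLock;

/// Illustrative stand-in for the real registry; only the prefix matters here.
pub struct MetricsRegistry {
    pub prefix: String,
}

static GLOBAL: OnceLock<MetricsRegistry> = OnceLock::new();

/// Called once at startup with config.observability.metrics_prefix.
pub fn init_global(prefix: &str) {
    // `set` is a no-op if another caller won the race; first writer wins.
    let _ = GLOBAL.set(MetricsRegistry { prefix: prefix.to_string() });
}

pub fn global() -> &'static MetricsRegistry {
    // Fall back to a default prefix only if startup never called init_global.
    GLOBAL.get_or_init(|| MetricsRegistry { prefix: "zeroclaw".to_string() })
}

fn main() {
    init_global("myapp"); // would be config.observability.metrics_prefix
    assert_eq!(global().prefix, "myapp");
    println!("prefix wired: {}", global().prefix);
}
```

The key property is that the hard-coded literal survives only as a last-resort default inside `global()`, not as the value every deployment gets.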
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/gateway/health.rs (1)
98-103: ⚠️ Potential issue | 🟠 Major

Don't return raw dependency errors from `/ready`.

`component.last_error` is copied straight into a public readiness payload. On a public bind, that leaks backend details to unauthenticated callers. Quick rollback is to suppress `message` entirely here, or replace it with a coarse string and keep full diagnostics in logs/admin-only paths.

Suggested fix:

```diff
 checks.insert(
     name.clone(),
     CheckResult {
         status,
-        message: component.last_error.clone(),
+        message: component
+            .last_error
+            .as_ref()
+            .map(|_| "dependency unavailable".to_string()),
     },
 );
```

Based on learnings: keep observability useful but non-sensitive.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/gateway/health.rs` around lines 98 - 103, The readiness payload is leaking raw backend errors by copying component.last_error into CheckResult.message for the /ready endpoint; instead remove or sanitize that field: update the construction of CheckResult (where checks.insert(name.clone(), CheckResult { status, message: component.last_error.clone(), }) is created) to set message to None or a coarse, non-sensitive string like "unavailable" and log component.last_error with detail to the service logger or an admin-only diagnostics path so full error details are not exposed in public readiness responses.
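The redaction the prompt describes reduces to a small mapping: preserve the `Some`/`None` shape so healthy components stay message-less, but replace the raw text with a coarse constant. A hypothetical helper (`public_message` is not a name from the PR):

```rust
/// Map a raw internal error to a coarse public message; log the detail instead.
fn public_message(last_error: &Option<String>) -> Option<String> {
    last_error.as_ref().map(|raw| {
        // In the real handler the raw text would go to the logger, not the
        // client, e.g. tracing::warn!(error = %raw, "dependency check failed").
        let _ = raw;
        "dependency unavailable".to_string()
    })
}

fn main() {
    let leaky = Some("postgres: connection refused at 10.0.0.5:5432".to_string());
    assert_eq!(public_message(&leaky), Some("dependency unavailable".to_string()));
    assert_eq!(public_message(&None), None); // healthy components stay message-less
    println!("redaction checks passed");
}
```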
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/gateway/health.rs`:
- Around line 70-79: The top-level health status currently collapses every
non-OK check to "unhealthy"; change the aggregation logic (the code that builds
the top-level overall/status after checks are inserted — e.g., where you inspect
checks created from snapshot.components and CheckResult) so it computes overall
as: "unhealthy" if any check.status == "unhealthy", else "degraded" if any
check.status == "degraded", else "ok". Keep returning HTTP 503 for degraded but
ensure the JSON top-level status string is "degraded" for empty registry /
starting-only snapshots. Update the health computation code paths that touch
snapshot.components, checks.insert, and the overall/status assignment (also the
section covering lines ~82-112) and add tests that assert empty-snapshot and
starting-only snapshots produce top-level "degraded" (with 503) rather than
"unhealthy".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: adfc5009-4444-4335-8450-70c21d77f015
📒 Files selected for processing (2)
src/gateway/health.rs, src/main.rs
rikitrader
left a comment
LGTM. Health/readiness endpoints wired into gateway, atomic metrics, backward-compatible config additions. Good observability foundation.
Thanks for the review feedback. I've addressed the integration concerns; changes pushed.

All 9 metrics tests pass.
Actionable comments posted: 3
♻️ Duplicate comments (1)
src/gateway/health.rs (1)
99-104: ⚠️ Potential issue | 🟠 Major

Redact raw dependency errors from `/ready`.

`component.last_error.clone()` is returned to every caller of `/ready`. On a public bind, that leaks internal dependency details and upstream error text. Keep the public payload coarse and move the raw diagnostics to logs or an authenticated admin endpoint.

🔒 Suggested redaction:

```diff
 checks.insert(
     name.clone(),
     CheckResult {
         status,
-        message: component.last_error.clone(),
+        message: component
+            .last_error
+            .as_ref()
+            .map(|_| "dependency unavailable".to_string()),
     },
 );
```

Based on learnings: keep observability useful but non-sensitive.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/gateway/health.rs` around lines 99 - 104, The /ready handler currently exposes raw dependency errors by inserting component.last_error.clone() into CheckResult via checks.insert(name.clone(), CheckResult { status, message: component.last_error.clone(), }), which can leak sensitive upstream details; change this to populate CheckResult.message with a coarse, non-sensitive string (e.g., "unhealthy" or component.status_summary()) and log the full component.last_error via the existing logger (or store it behind an authenticated admin endpoint) so diagnostics remain available but are not returned in public /ready responses.
🧹 Nitpick comments (1)
src/observability/metrics.rs (1)
68-88: Consider deferring unused gauge metrics.

The gauge metrics (`request_duration_ms`, `provider_latency_ms`, `active_conversations`) are registered, but none of them appear to have callers in the current PR changes. Per the YAGNI principle, consider adding these when they have actual callers.

That said, the overhead is negligible, and having them pre-registered for future use is a reasonable trade-off for observability infrastructure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/observability/metrics.rs` around lines 68 - 88, The gauges array in metrics.rs (entries "request_duration_ms", "provider_latency_ms", "active_conversations") is being registered via metrics.insert with AtomicMetric { value: AtomicU64::new(0), help, metric_type: MetricType::Gauge } but has no callers in this PR; remove these unused entries (or gate their registration behind a TODO/feature flag) so only metrics with active usage are registered, or add a clear TODO comment referencing the gauge names to register them when code paths actually emit values; update the code that constructs the gauges array and the metrics.insert calls accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3444ed00-858f-446e-aaa8-0e5f9f72b5ac
📒 Files selected for processing (4)
src/agent/loop_.rs, src/gateway/health.rs, src/gateway/mod.rs, src/observability/metrics.rs
```rust
#[derive(Debug, Serialize)]
pub struct LivenessResponse {
    pub status: &'static str,
}

/// Structured readiness response with component checks.
#[derive(Debug, Serialize)]
pub struct ReadinessResponse {
    pub status: &'static str,
    pub checks: BTreeMap<String, CheckResult>,
}
```
Implement the advertised health payload shape before merge.

The new types/handler serialize `/health` as only `{"status":"healthy"}`, and `/ready` still has no `uptime_secs`. The PR contract says these endpoints return `status`, `checks`, and `uptime_secs`; the current tests also lock in the smaller payload, so consumers built to the documented shape will break.
Also applies to: 45-49
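The advertised `{status, checks, uptime_secs}` contract can be illustrated with a dependency-free sketch. The real code derives `Serialize`; the hand-rolled `to_json` here exists only to keep the example self-contained, and the per-check shape is an assumption:

```rust
use std::collections::BTreeMap;

struct ReadinessResponse {
    status: &'static str,
    checks: BTreeMap<String, &'static str>,
    uptime_secs: u64,
}

impl ReadinessResponse {
    /// Hand-rolled JSON so the sketch needs no serde dependency.
    fn to_json(&self) -> String {
        let checks: Vec<String> = self
            .checks
            .iter()
            .map(|(name, status)| format!("\"{name}\":{{\"status\":\"{status}\"}}"))
            .collect();
        format!(
            "{{\"status\":\"{}\",\"checks\":{{{}}},\"uptime_secs\":{}}}",
            self.status,
            checks.join(","),
            self.uptime_secs
        )
    }
}

fn main() {
    let mut checks = BTreeMap::new();
    checks.insert("provider".to_string(), "ok");
    let resp = ReadinessResponse { status: "ok", checks, uptime_secs: 42 };
    // {"status":"ok","checks":{"provider":{"status":"ok"}},"uptime_secs":42}
    println!("{}", resp.to_json());
}
```

Adding `uptime_secs` to both response types is what closes the gap the reviewer points out between the documented and the serialized shape.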
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gateway/health.rs` around lines 20 - 30, The current health types only
serialize status; update the payloads to match the advertised contract by adding
an uptime_secs field and including checks on both endpoints: modify
LivenessResponse and ReadinessResponse to include pub uptime_secs: u64 (or
appropriate numeric type) and ensure LivenessResponse also includes pub checks:
BTreeMap<String, CheckResult> so both structs serialize {status, checks,
uptime_secs}; update the corresponding handler functions that return these types
(the liveness and readiness handlers) to compute and populate uptime_secs (e.g.,
seconds since process start) and fill checks (use an empty map or actual
CheckResult values as appropriate) before returning the response.
```rust
if config.observability.metrics_enabled {
    app = app.route("/metrics", get(handle_metrics));
```
🧩 Analysis chain

🏁 Scripts executed against repository zeroclaw-labs/zeroclaw:

```shell
#!/bin/bash
set -euo pipefail
# Inspect the symbols involved in metric writes and metric export.
rg -n -C4 'handle_metrics|BroadcastObserver|PrometheusObserver|MetricsRegistry|fn global|as_any|encode\(|increment\("requests_total"\)' src
find src -name "metrics.rs" -o -name "mod.rs" | xargs grep -l "fn global"
rg -n "mod metrics|fn global" src/observability/
rg -n "crate::observability::metrics::global" src/
cat -n src/observability/metrics.rs | head -200
sed -n '734,745p' src/gateway/mod.rs
rg -n "requests_total" src/observability/prometheus.rs
rg -n "export_prometheus" src/
```
Fix downcast failure and metric registry mismatch for the /metrics endpoint.

The /metrics endpoint has two wiring issues:

1. Downcast fails due to the BroadcastObserver wrapper: `state.observer` is wrapped in `BroadcastObserver` (line 621), which implements `as_any()` to return `self` rather than delegating to the inner observer. When `handle_metrics` attempts to downcast to `PrometheusObserver` (line 738), the downcast fails and returns the fallback message instead of metrics.
2. `requests_total` uses a separate metrics registry: line 958 increments the metric via `crate::observability::metrics::global()`, which is a standalone atomic counter registry distinct from the Prometheus observer's registry. `PrometheusObserver` does not expose `requests_total` at all, so even if the downcast succeeded, the metric would not appear in the scrape output.

Either delegate `as_any()` through the `BroadcastObserver` wrapper, or expose the global metrics registry from `metrics::global()` through the observer trait so both metrics paths write to the same backend.
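The delegation option can be shown in isolation. This sketch collapses `BroadcastObserver` to a single inner observer (the real one presumably fans out to several), so it demonstrates only the `as_any()` forwarding; the trait and struct bodies are assumptions:

```rust
use std::any::Any;

trait Observer {
    fn as_any(&self) -> &dyn Any;
}

struct PrometheusObserver;
impl Observer for PrometheusObserver {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Simplified to one inner observer; the real broadcaster holds many.
struct BroadcastObserver {
    inner: Box<dyn Observer>,
}
impl Observer for BroadcastObserver {
    // Delegate instead of returning `self`, so a downcast through the wrapper
    // reaches the concrete PrometheusObserver.
    fn as_any(&self) -> &dyn Any {
        self.inner.as_any()
    }
}

fn main() {
    let wrapped = BroadcastObserver { inner: Box::new(PrometheusObserver) };
    assert!(wrapped.as_any().downcast_ref::<PrometheusObserver>().is_some());
    println!("downcast through wrapper succeeded");
}
```

With many inner observers, delegation gets ambiguous (which inner `Any` to return?), which is one reason the review prefers the shared-registry approach instead.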
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gateway/mod.rs` around lines 700 - 701, The /metrics endpoint fails
because BroadcastObserver's as_any() doesn't delegate to the inner observer (so
handle_metrics can't downcast to PrometheusObserver) and requests_total is
incremented via crate::observability::metrics::global() which uses a separate
registry; fix by either delegating type-erasure through
BroadcastObserver::as_any() to return the inner observer's Any so handle_metrics
can downcast to PrometheusObserver, or (preferred) extend the Observer trait to
expose the global metrics registry (or a get_registry method) and implement it
for PrometheusObserver and BroadcastObserver so requests_total increments via
metrics::global() and the PrometheusObserver share the same registry; update
handle_metrics, BroadcastObserver::as_any(), PrometheusObserver implementation
and the code that increments requests_total to use the unified registry exposed
by the observer trait.
```rust
crate::observability::metrics::global().increment("requests_total");
```
Move requests_total before the early exits.

Line 958 only runs after auth, JSON parsing, and idempotency checks, so rate-limited, unauthorized, malformed, and duplicate webhook hits are never counted. That leaves the gateway blind to exactly the failure/abuse traffic operators usually watch.

📈 Suggested fix:

```diff
 async fn handle_webhook(
     State(state): State<AppState>,
     ConnectInfo(peer_addr): ConnectInfo<SocketAddr>,
     headers: HeaderMap,
     body: Result<Json<WebhookBody>, axum::extract::rejection::JsonRejection>,
 ) -> impl IntoResponse {
+    crate::observability::metrics::global().increment("requests_total");
+
     let rate_key =
         client_key_from_request(Some(peer_addr), &headers, state.trust_forwarded_headers);
     if !state.rate_limiter.allow_webhook(&rate_key) {
         tracing::warn!("/webhook rate limit exceeded");
         let err = serde_json::json!({
@@
-    crate::observability::metrics::global().increment("requests_total");
-
     let message = &webhook_body.message;
```

Based on learnings: keep observability useful but non-sensitive; add/update tests or validation evidence for failure modes and boundaries.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/gateway/mod.rs` at line 958, The metrics increment for "requests_total"
is currently executed after auth, JSON parsing, and idempotency checks so
rate-limited, unauthorized, malformed, and duplicate requests are not counted;
move the call to
crate::observability::metrics::global().increment("requests_total") to the very
start of the gateway request handler (before any auth checks, JSON parsing,
idempotency checks or early returns — e.g. at the top of the handler function
that currently contains this line) so every incoming request is counted, and
add/update tests that simulate rate-limited, unauthorized, malformed, and
duplicate webhook requests to assert the metric is incremented in those failure
paths.
Force-pushed 5527a08 to 319e659

Force-pushed 0134500 to ea41eb0
…egistry, and config

Add production observability infrastructure:
- src/gateway/health.rs: structured /health (liveness) and /ready (readiness) endpoints returning JSON with status, checks, uptime_secs
- src/observability/metrics.rs: lightweight MetricsRegistry using std::sync::atomic counters with Prometheus text export
- ObservabilityConfig: new fields health_enabled, metrics_enabled, metrics_prefix, health_check_interval_secs (all with serde(default) for backward compatibility)
- Wire /health and /ready into gateway router, replacing the old handle_health handler
…readiness

- Health endpoint (/health): return only {"status": "healthy"} to avoid exposing internal details (uptime, pairing state, component checks) to unauthenticated callers
- Readiness endpoint (/ready): remove hardcoded "ok" for provider and memory checks that synthesized fake healthy state; now relies solely on the health registry for real component status; returns degraded when no components have registered yet (service still initializing)
- Wire observability.metrics_prefix config into metrics::init_global() so the configured prefix is actually used for Prometheus metric names instead of always falling back to "zeroclaw"

…states

When components are still starting or no components have registered yet, the readiness endpoint now returns "degraded" at the top level instead of "unhealthy", distinguishing transient warming states from actual failures.
…c increments

Addresses reviewer feedback: metrics are now incremented from actual runtime code paths (gateway, tools, providers).
Force-pushed 7c7568e to 199b693
…, and readiness timeout

Add new ObservabilityConfig fields:
- health_endpoint_path: custom health endpoint path (default: /healthz)
- metrics_endpoint_path: custom metrics endpoint path (default: /metrics)
- custom_labels: additional metric labels in key=value format
- readiness_check_timeout_secs: readiness probe timeout (default: 5s)

All fields use serde(default) for backward compatibility.
Force-pushed 199b693 to c513f0c
Summary

- Target branch (must be master for all contributions): master
- /health (liveness) and /ready (readiness) gateway endpoints returning JSON with status, checks, and uptime_secs
- MetricsRegistry in src/observability/metrics.rs using std::sync::atomic counters with Prometheus text format export
- ObservabilityConfig with health_enabled, metrics_enabled, metrics_prefix, health_check_interval_secs (all #[serde(default)] for backward compat)
- Replaces the old handle_health handler; the /metrics handler is unchanged. No external metrics dependencies added.

Label Snapshot (required)

- Risk (risk: low|medium|high): medium
- Size (size: XS|S|M|L|XL, auto-managed/read-only): M
- Area labels: gateway: health, observability: metrics

Change Metadata

- Type (bug|feature|refactor|docs|security|chore): feature
- Scope (runtime|provider|channel|memory|security|ci|docs|multi): multi (gateway + observability)

Linked Issue

Supersede Attribution (required when Supersedes # is used)

N/A

Validation Evidence (required)

Commands and result summary:

Security Impact (required)

- Yes, describe risk and mitigation: N/A

Privacy and Data Hygiene (required)

- Status (pass|needs-follow-up): pass

Compatibility / Migration

- [observability] config section with new fields, all with #[serde(default)]

i18n Follow-Through (required when docs or user-facing wording changes)

Human Verification (required)

What was personally validated beyond CI:

Side Effects / Blast Radius (required)

- The /health handler is replaced — callers expecting the old response format will get structured JSON instead.

Agent Collaboration Notes (recommended)

- (AGENTS.md + CONTRIBUTING.md): Yes

Rollback Plan (required)

- git revert <commit-sha>
- health_enabled and metrics_enabled config toggles
- Watch for: /health or /ready returning 500; metrics export returning empty or malformed Prometheus text.

Risks and Mitigations

- Replacing the handle_health handler may break existing monitoring integrations expecting the old format (status field); most health checkers only look at HTTP status code.
- std::sync::atomic with relaxed ordering — negligible overhead for counter increments.
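The relaxed-ordering note in the last risk bullet is easy to demonstrate: monotonic counters need atomicity but no ordering with respect to other memory, so `Ordering::Relaxed` still loses no updates. A self-contained check with assumed thread/iteration counts, not code from the PR:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Increment a shared counter from several threads using only relaxed ordering.
fn parallel_count(threads: usize, per_thread: u64) -> u64 {
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let c = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Atomicity guarantees no lost updates even without fences.
                    c.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}

fn main() {
    // 4 threads x 1000 increments always totals exactly 4000.
    assert_eq!(parallel_count(4, 1000), 4000);
    println!("relaxed counter total: {}", parallel_count(4, 1000));
}
```

Relaxed ordering would be insufficient if other data were published alongside the counter (that would need Release/Acquire), but a scrape-time `load` of an independent total does not care about cross-thread ordering.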