
[Serve][4/n] Add replica lifecycle metrics#59235

Merged
abrarsheikh merged 11 commits into master from 59218-abrar-replica_health
Dec 18, 2025

Conversation

@abrarsheikh
Contributor

fixes #59218

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh added the go (add ONLY when ready to merge, run all tests) label Dec 7, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds several new metrics for replica lifecycle events, including startup, shutdown, reconfigure, and health checks. The changes look good and the new tests are comprehensive. I have a few suggestions to improve maintainability and test robustness.

Comment on lines +1223 to +1239
```python
def check_metrics_count():
    metrics = get_metric_dictionaries(
        "ray_serve_replica_initialization_latency_ms_count"
    )
    assert len(metrics) == 2, f"Expected 2 metrics, got {len(metrics)}"
    # All metrics should have same deployment and application
    for metric in metrics:
        assert metric["deployment"] == "MyDeployment"
        assert metric["application"] == "app"
    # Each replica should have a unique replica tag
    replica_ids = {metric["replica"] for metric in metrics}
    assert (
        len(replica_ids) == 2
    ), f"Expected 2 unique replica IDs, got {replica_ids}"
    return True

wait_for_condition(check_metrics_count, timeout=20)
```
Contributor

medium

This test verifies that two metrics are recorded for initialization_latency, but it doesn't do the same for startup_latency, even though both should have metrics for each of the two replicas. The test for startup latency is therefore incomplete. You can modify check_metrics_count to verify both metrics.

Suggested change

```python
def check_metrics_count():
    for metric_name in [
        "ray_serve_replica_initialization_latency_ms_count",
        "ray_serve_replica_startup_latency_ms_count",
    ]:
        metrics = get_metric_dictionaries(metric_name)
        assert (
            len(metrics) == 2
        ), f"Expected 2 metrics for {metric_name}, got {len(metrics)}"
        # All metrics should have same deployment and application
        for metric in metrics:
            assert metric["deployment"] == "MyDeployment"
            assert metric["application"] == "app"
        # Each replica should have a unique replica tag
        replica_ids = {metric["replica"] for metric in metrics}
        assert (
            len(replica_ids) == 2
        ), f"Expected 2 unique replica IDs for {metric_name}, got {replica_ids}"
    return True

wait_for_condition(check_metrics_count, timeout=20)
```

@abrarsheikh abrarsheikh marked this pull request as ready for review December 16, 2025 04:35
@abrarsheikh abrarsheikh requested review from a team as code owners December 16, 2025 04:35
@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Dec 16, 2025
Comment on lines +897 to +899
```python
# Reset the last health check status for this check cycle.
self._last_health_check_latency_ms = None
self._last_health_check_failed = False
```
Contributor

Shouldn't we keep these values until we get new ones, instead of resetting them here? That way we'd hold the actual values a little longer, and they could still be used in the replica.

Contributor Author

It is intentional; I'll update the comment to explain.

Signed-off-by: abrar <abrar@anyscale.com>
@cursor cursor bot left a comment

Bug: Health check metrics repeatedly double-counted

_last_health_check_latency_ms and _last_health_check_failed are never reset after a health-check cycle completes. check_and_update_replicas() then re-observes the same latency every control-loop tick and repeatedly increments serve_health_check_failures_total while the flag remains true, inflating health-check latency/failure metrics and making them inaccurate.

python/ray/serve/_private/deployment_state.py#L896-L940

```python
"""
if self._health_check_ref is None:
    # There is no outstanding health check.
    response = ReplicaHealthCheckResponse.NONE
elif check_obj_ref_ready_nowait(self._health_check_ref):
    # Object ref is ready, ray.get it to check for exceptions.
    try:
        ray.get(self._health_check_ref)
        # Calculate health check latency.
        self._last_health_check_latency_ms = (
            time.time() - self._last_health_check_time
        ) * 1000
        self._last_health_check_failed = False
        # Health check succeeded without exception.
        response = ReplicaHealthCheckResponse.SUCCEEDED
    except RayActorError:
        # Health check failed due to actor crashing.
        response = ReplicaHealthCheckResponse.ACTOR_CRASHED
        self._last_health_check_failed = True
    except RayError as e:
        # Health check failed due to application-level exception.
        logger.warning(f"Health check for {self._replica_id} failed: {e}")
        response = ReplicaHealthCheckResponse.APP_FAILURE
        self._last_health_check_failed = True
elif time.time() - self._last_health_check_time > self.health_check_timeout_s:
    # Health check hasn't returned and the timeout is up, consider it failed.
    logger.warning(
        "Didn't receive health check response for replica "
        f"{self._replica_id} after "
        f"{self.health_check_timeout_s}s, marking it unhealthy."
    )
    response = ReplicaHealthCheckResponse.APP_FAILURE
    # Calculate latency for timeout case.
    self._last_health_check_latency_ms = (
        time.time() - self._last_health_check_time
    ) * 1000
    self._last_health_check_failed = True
else:
    # Health check hasn't returned and the timeout isn't up yet.
    response = ReplicaHealthCheckResponse.NONE

if response is not ReplicaHealthCheckResponse.NONE:
    self._health_check_ref = None
return response
```

python/ray/serve/_private/deployment_state.py#L3077-L3095

```python
for replica in self._replicas.pop(
    states=[ReplicaState.RUNNING, ReplicaState.PENDING_MIGRATION]
):
    is_healthy = replica.check_health()
    # Record health check latency and failure metrics.
    metric_tags = {
        "deployment": self.deployment_name,
        "replica": replica.replica_id.unique_id,
        "application": self.app_name,
    }
    if replica.last_health_check_latency_ms is not None:
        self.health_check_latency_histogram.observe(
            replica.last_health_check_latency_ms, tags=metric_tags
        )
    if replica.last_health_check_failed:
        self.health_check_failures_counter.inc(tags=metric_tags)
```

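The double counting the bot describes can be avoided with a consume-and-reset pattern: the controller clears the stored latency/failure state as soon as it has observed it, so a single completed check is counted exactly once. A minimal, self-contained sketch (the `ReplicaHealthState` class and its method names are illustrative, not the actual Serve internals):

```python
# Hypothetical, simplified sketch of a consume-and-reset pattern for
# per-cycle health-check metrics; names are illustrative and not the
# actual ray.serve deployment_state internals.

class ReplicaHealthState:
    def __init__(self):
        self._last_health_check_latency_ms = None
        self._last_health_check_failed = False

    def record_check(self, latency_ms, failed):
        # Called when a health-check cycle actually completes.
        self._last_health_check_latency_ms = latency_ms
        self._last_health_check_failed = failed

    def consume_last_check(self):
        # Return the last result once, then clear it so a later
        # control-loop tick cannot re-observe (double count) it.
        result = (
            self._last_health_check_latency_ms,
            self._last_health_check_failed,
        )
        self._last_health_check_latency_ms = None
        self._last_health_check_failed = False
        return result


state = ReplicaHealthState()
state.record_check(12.5, failed=True)

observed = []
for _ in range(3):  # three control-loop ticks, one completed check
    latency_ms, failed = state.consume_last_check()
    if latency_ms is not None:
        observed.append((latency_ms, failed))

print(observed)  # the single completed check is observed exactly once
```

With the plain "reset at the start of the next cycle" approach, any tick between two cycles still sees the stale values; consuming them at read time closes that window.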


Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@akyang-anyscale
Contributor

akyang-anyscale commented Dec 17, 2025

i'm curious how the startup/initialization/reconfigure/shutdown metrics look like in grafana. is it just a single sample?

@abrarsheikh
Contributor Author

> i'm curious how the startup/initialization/reconfigure/shutdown metrics look like in grafana. is it just a single sample?

Haven't made a decision one way or the other, but either:

  1. Time series, which would look sparse.
  2. A single gauge value showing the average across all replicas of the last recorded value.

wdyt?

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@akyang-anyscale
Contributor

> > i'm curious how the startup/initialization/reconfigure/shutdown metrics look like in grafana. is it just a single sample?
>
> Haven't made a decision one way or the other, but either:
>
> 1. Time series, which would look sparse.
> 2. A single gauge value showing the average across all replicas of the last recorded value.
>
> wdyt?

I think it makes sense to record it as is. It might not be useful to visualize as a time series, as you said, because of the sparsity. Option 2 makes sense: show a single value like the avg, p90, and max of the last recorded value.
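For option 2, the aggregation itself is straightforward once the last recorded per-replica values have been scraped. A hedged sketch in plain Python (the input shape and the helper name `summarize_latencies` are assumptions for illustration, not a real Serve or Grafana API):

```python
# Hypothetical sketch: collapse the last recorded per-replica latency
# values into the single avg/p90/max panel values discussed above.
# The input mirrors Prometheus-style label dicts; the helper name is
# illustrative, not part of ray.serve.

def summarize_latencies(samples):
    # samples: list of {"replica": ..., "value": latency_ms} dicts,
    # one entry per replica (the last recorded value for each).
    values = sorted(s["value"] for s in samples)
    n = len(values)
    # Nearest-rank p90 index, clamped to the last element.
    p90_index = min(n - 1, int(0.9 * n))
    return {
        "avg": sum(values) / n,
        "p90": values[p90_index],
        "max": values[-1],
    }


samples = [
    {"replica": "r1", "value": 100.0},
    {"replica": "r2", "value": 200.0},
    {"replica": "r3", "value": 600.0},
]
print(summarize_latencies(samples))
```

In a Grafana stat panel the equivalent would be an instant query aggregating over the replica label, so the sparsity of the underlying series doesn't matter.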

@akyang-anyscale
Contributor

also if you want the PR to be linked to the GH issue, but not close it when the PR is merged, you can remove the "fixes" keyword in the PR description.

@abrarsheikh
Contributor Author

> also if you want the PR to be linked to the GH issue, but not close it when the PR is merged, you can remove the "fixes" keyword in the PR description.

Thank you :) I kept struggling to figure this out.

@abrarsheikh abrarsheikh enabled auto-merge (squash) December 18, 2025 00:52
@abrarsheikh abrarsheikh merged commit d5f5b06 into master Dec 18, 2025
7 checks passed
@abrarsheikh abrarsheikh deleted the 59218-abrar-replica_health branch December 18, 2025 01:38
zzchun pushed a commit to zzchun/ray that referenced this pull request Dec 18, 2025
fixes ray-project#59218

---------

Signed-off-by: abrar <abrar@anyscale.com>
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
fixes ray-project#59218

---------

Signed-off-by: abrar <abrar@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
fixes ray-project#59218

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>

Labels

- go: add ONLY when ready to merge, run all tests
- observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
- serve: Ray Serve Related Issue

Development

Successfully merging this pull request may close these issues.

[Serve] add debugging metrics to ray serve

3 participants