
[Serve][4/n] Add replica lifecycle metrics#59235

Merged
abrarsheikh merged 11 commits into master from 59218-abrar-replica_health
Dec 18, 2025

Conversation

@abrarsheikh
Contributor

fixes #59218

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh added the go (add ONLY when ready to merge, run all tests) label Dec 7, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds several new metrics for replica lifecycle events, including startup, shutdown, reconfigure, and health checks. The changes look good and the new tests are comprehensive. I have a few suggestions to improve maintainability and test robustness.

Comment on lines +1223 to +1239
```python
def check_metrics_count():
    metrics = get_metric_dictionaries(
        "ray_serve_replica_initialization_latency_ms_count"
    )
    assert len(metrics) == 2, f"Expected 2 metrics, got {len(metrics)}"
    # All metrics should have same deployment and application
    for metric in metrics:
        assert metric["deployment"] == "MyDeployment"
        assert metric["application"] == "app"
    # Each replica should have a unique replica tag
    replica_ids = {metric["replica"] for metric in metrics}
    assert (
        len(replica_ids) == 2
    ), f"Expected 2 unique replica IDs, got {replica_ids}"
    return True

wait_for_condition(check_metrics_count, timeout=20)
```
Contributor

medium

This test verifies that two metrics are recorded for initialization_latency, but it doesn't do the same for startup_latency, even though both should have metrics for each of the two replicas. The test for startup latency is therefore incomplete. You can modify check_metrics_count to verify both metrics.

Suggested change

```python
def check_metrics_count():
    for metric_name in [
        "ray_serve_replica_initialization_latency_ms_count",
        "ray_serve_replica_startup_latency_ms_count",
    ]:
        metrics = get_metric_dictionaries(metric_name)
        assert (
            len(metrics) == 2
        ), f"Expected 2 metrics for {metric_name}, got {len(metrics)}"
        # All metrics should have same deployment and application
        for metric in metrics:
            assert metric["deployment"] == "MyDeployment"
            assert metric["application"] == "app"
        # Each replica should have a unique replica tag
        replica_ids = {metric["replica"] for metric in metrics}
        assert (
            len(replica_ids) == 2
        ), f"Expected 2 unique replica IDs for {metric_name}, got {replica_ids}"
    return True

wait_for_condition(check_metrics_count, timeout=20)
```

@abrarsheikh abrarsheikh marked this pull request as ready for review December 16, 2025 04:35
@abrarsheikh abrarsheikh requested review from a team as code owners December 16, 2025 04:35
@ray-gardener ray-gardener bot added the serve (Ray Serve Related Issue) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) labels Dec 16, 2025
Comment on lines +897 to +899
```python
# Reset the last health check status for this check cycle.
self._last_health_check_latency_ms = None
self._last_health_check_failed = False
```
Contributor

Shouldn't we keep these values until we get new ones, instead of resetting them here? That way we'd hold the actual values a little longer, and they could still be used in the replica.

Contributor Author

It is intentional; I'll update the comment to explain.

Signed-off-by: abrar <abrar@anyscale.com>
@cursor cursor bot left a comment

Bug: Health check metrics repeatedly double-counted

_last_health_check_latency_ms and _last_health_check_failed are never reset after a health-check cycle completes. check_and_update_replicas() then re-observes the same latency every control-loop tick and repeatedly increments serve_health_check_failures_total while the flag remains true, inflating health-check latency/failure metrics and making them inaccurate.

python/ray/serve/_private/deployment_state.py#L896-L940

```python
"""
if self._health_check_ref is None:
    # There is no outstanding health check.
    response = ReplicaHealthCheckResponse.NONE
elif check_obj_ref_ready_nowait(self._health_check_ref):
    # Object ref is ready, ray.get it to check for exceptions.
    try:
        ray.get(self._health_check_ref)
        # Calculate health check latency.
        self._last_health_check_latency_ms = (
            time.time() - self._last_health_check_time
        ) * 1000
        self._last_health_check_failed = False
        # Health check succeeded without exception.
        response = ReplicaHealthCheckResponse.SUCCEEDED
    except RayActorError:
        # Health check failed due to actor crashing.
        response = ReplicaHealthCheckResponse.ACTOR_CRASHED
        self._last_health_check_failed = True
    except RayError as e:
        # Health check failed due to application-level exception.
        logger.warning(f"Health check for {self._replica_id} failed: {e}")
        response = ReplicaHealthCheckResponse.APP_FAILURE
        self._last_health_check_failed = True
elif time.time() - self._last_health_check_time > self.health_check_timeout_s:
    # Health check hasn't returned and the timeout is up, consider it failed.
    logger.warning(
        "Didn't receive health check response for replica "
        f"{self._replica_id} after "
        f"{self.health_check_timeout_s}s, marking it unhealthy."
    )
    response = ReplicaHealthCheckResponse.APP_FAILURE
    # Calculate latency for timeout case.
    self._last_health_check_latency_ms = (
        time.time() - self._last_health_check_time
    ) * 1000
    self._last_health_check_failed = True
else:
    # Health check hasn't returned and the timeout isn't up yet.
    response = ReplicaHealthCheckResponse.NONE

if response is not ReplicaHealthCheckResponse.NONE:
    self._health_check_ref = None
return response
```

python/ray/serve/_private/deployment_state.py#L3077-L3095

```python
for replica in self._replicas.pop(
    states=[ReplicaState.RUNNING, ReplicaState.PENDING_MIGRATION]
):
    is_healthy = replica.check_health()
    # Record health check latency and failure metrics.
    metric_tags = {
        "deployment": self.deployment_name,
        "replica": replica.replica_id.unique_id,
        "application": self.app_name,
    }
    if replica.last_health_check_latency_ms is not None:
        self.health_check_latency_histogram.observe(
            replica.last_health_check_latency_ms, tags=metric_tags
        )
    if replica.last_health_check_failed:
        self.health_check_failures_counter.inc(tags=metric_tags)
```

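The double counting the bot describes can be avoided with a consume-and-reset pattern: the controller clears the stored latency/failure state as soon as it has observed it, so a single completed check is counted exactly once. A minimal, self-contained sketch (the `ReplicaHealthState` class and its method names are illustrative, not the actual Serve internals):

```python
# Hypothetical, simplified sketch of a consume-and-reset pattern for
# per-cycle health-check metrics; names are illustrative and not the
# actual ray.serve deployment_state internals.

class ReplicaHealthState:
    def __init__(self):
        self._last_health_check_latency_ms = None
        self._last_health_check_failed = False

    def record_check(self, latency_ms, failed):
        # Called when a health-check cycle actually completes.
        self._last_health_check_latency_ms = latency_ms
        self._last_health_check_failed = failed

    def consume_last_check(self):
        # Return the last result once, then clear it so a later
        # control-loop tick cannot re-observe (double count) it.
        result = (
            self._last_health_check_latency_ms,
            self._last_health_check_failed,
        )
        self._last_health_check_latency_ms = None
        self._last_health_check_failed = False
        return result


state = ReplicaHealthState()
state.record_check(12.5, failed=True)

observed = []
for _ in range(3):  # three control-loop ticks, one completed check
    latency_ms, failed = state.consume_last_check()
    if latency_ms is not None:
        observed.append((latency_ms, failed))

print(observed)  # the single completed check is observed exactly once
```

With the plain "reset at the start of the next cycle" approach, any tick between two cycles still sees the stale values; consuming them at read time closes that window.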


Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@akyang-anyscale
Contributor

akyang-anyscale commented Dec 17, 2025

i'm curious how the startup/initialization/reconfigure/shutdown metrics look like in grafana. is it just a single sample?

@abrarsheikh
Contributor Author

> i'm curious how the startup/initialization/reconfigure/shutdown metrics look like in grafana. is it just a single sample?

Haven't made a decision one way or the other, but either:

  1. Time series, which would look sparse.
  2. A single gauge value showing the average across all replicas of the last recorded value.

wdyt?

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
@akyang-anyscale
Contributor

> > i'm curious how the startup/initialization/reconfigure/shutdown metrics look like in grafana. is it just a single sample?
>
> Haven't made a decision one way or the other, but either:
>
> 1. Time series, which would look sparse.
> 2. A single gauge value showing the average across all replicas of the last recorded value.
>
> wdyt?

I think it makes sense to record it as is. It might not be useful to visualize as a time series, as you said, because of the sparsity. Option 2 makes sense: show a single value like the avg, p90, and max of the last recorded value.
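For option 2, the aggregation itself is straightforward once the last recorded per-replica values have been scraped. A hedged sketch in plain Python (the input shape and the helper name `summarize_latencies` are assumptions for illustration, not a real Serve or Grafana API):

```python
# Hypothetical sketch: collapse the last recorded per-replica latency
# values into the single avg/p90/max panel values discussed above.
# The input mirrors Prometheus-style label dicts; the helper name is
# illustrative, not part of ray.serve.

def summarize_latencies(samples):
    # samples: list of {"replica": ..., "value": latency_ms} dicts,
    # one entry per replica (the last recorded value for each).
    values = sorted(s["value"] for s in samples)
    n = len(values)
    # Nearest-rank p90 index, clamped to the last element.
    p90_index = min(n - 1, int(0.9 * n))
    return {
        "avg": sum(values) / n,
        "p90": values[p90_index],
        "max": values[-1],
    }


samples = [
    {"replica": "r1", "value": 100.0},
    {"replica": "r2", "value": 200.0},
    {"replica": "r3", "value": 600.0},
]
print(summarize_latencies(samples))
```

In a Grafana stat panel the equivalent would be an instant query aggregating over the replica label, so the sparsity of the underlying series doesn't matter.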

@akyang-anyscale
Contributor

also if you want the PR to be linked to the GH issue, but not close it when the PR is merged, you can remove the "fixes" keyword in the PR description.

@abrarsheikh
Contributor Author

> also if you want the PR to be linked to the GH issue, but not close it when the PR is merged, you can remove the "fixes" keyword in the PR description.

Thank you :) I kept struggling to figure this out.

@abrarsheikh abrarsheikh enabled auto-merge (squash) December 18, 2025 00:52
@abrarsheikh abrarsheikh merged commit d5f5b06 into master Dec 18, 2025
7 checks passed
@abrarsheikh abrarsheikh deleted the 59218-abrar-replica_health branch December 18, 2025 01:38
zzchun pushed a commit to zzchun/ray that referenced this pull request Dec 18, 2025
fixes ray-project#59218

---------

Signed-off-by: abrar <abrar@anyscale.com>
Yicheng-Lu-llll pushed a commit to Yicheng-Lu-llll/ray that referenced this pull request Dec 22, 2025
fixes ray-project#59218

---------

Signed-off-by: abrar <abrar@anyscale.com>
peterxcli pushed a commit to peterxcli/ray that referenced this pull request Feb 25, 2026
fixes ray-project#59218

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: peterxcli <peterxcli@gmail.com>

Labels

- go: add ONLY when ready to merge, run all tests
- observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling
- serve: Ray Serve Related Issue

Development

Successfully merging this pull request may close these issues.

[Serve] add debugging metrics to ray serve

3 participants