[Serve][4/n] Add replica lifecycle metrics #59235
Conversation
Signed-off-by: abrar <abrar@anyscale.com>
Code Review
This pull request adds several new metrics for replica lifecycle events, including startup, shutdown, reconfigure, and health checks. The changes look good and the new tests are comprehensive. I have a few suggestions to improve maintainability and test robustness.
```python
def check_metrics_count():
    metrics = get_metric_dictionaries(
        "ray_serve_replica_initialization_latency_ms_count"
    )
    assert len(metrics) == 2, f"Expected 2 metrics, got {len(metrics)}"
    # All metrics should have the same deployment and application.
    for metric in metrics:
        assert metric["deployment"] == "MyDeployment"
        assert metric["application"] == "app"
    # Each replica should have a unique replica tag.
    replica_ids = {metric["replica"] for metric in metrics}
    assert (
        len(replica_ids) == 2
    ), f"Expected 2 unique replica IDs, got {replica_ids}"
    return True


wait_for_condition(check_metrics_count, timeout=20)
```
This test verifies that two metrics are recorded for initialization_latency, but it doesn't do the same for startup_latency, even though both should record a metric for each of the two replicas. The startup-latency coverage is therefore incomplete. You can modify check_metrics_count to verify both metrics.
Suggested change:

```python
def check_metrics_count():
    for metric_name in [
        "ray_serve_replica_initialization_latency_ms_count",
        "ray_serve_replica_startup_latency_ms_count",
    ]:
        metrics = get_metric_dictionaries(metric_name)
        assert (
            len(metrics) == 2
        ), f"Expected 2 metrics for {metric_name}, got {len(metrics)}"
        # All metrics should have the same deployment and application.
        for metric in metrics:
            assert metric["deployment"] == "MyDeployment"
            assert metric["application"] == "app"
        # Each replica should have a unique replica tag.
        replica_ids = {metric["replica"] for metric in metrics}
        assert (
            len(replica_ids) == 2
        ), f"Expected 2 unique replica IDs for {metric_name}, got {replica_ids}"
    return True


wait_for_condition(check_metrics_count, timeout=20)
```
```python
# Reset the last health check status for this check cycle.
self._last_health_check_latency_ms = None
self._last_health_check_failed = False
```
Shouldn't we reset these values only when we get new values? That way, we will hold the actual values for a little longer, and they can still be used in the replica.
It is intentional; update the comment.
Bug: Health check metrics repeatedly double-counted
_last_health_check_latency_ms and _last_health_check_failed are never reset after a health-check cycle completes. check_and_update_replicas() then re-observes the same latency every control-loop tick and repeatedly increments serve_health_check_failures_total while the flag remains true, inflating health-check latency/failure metrics and making them inaccurate.
See `python/ray/serve/_private/deployment_state.py`, lines 896 to 940 (at ecb1223) and lines 3077 to 3095 (at ecb1223).
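A minimal sketch of the consume-once pattern the fix implies (all names are hypothetical, not the actual `deployment_state.py` code): the control loop reads the last health-check result and immediately clears it, so each result is observed into metrics at most once per cycle.

```python
from typing import List, Optional, Tuple


class ReplicaHealthState:
    """Hypothetical holder for the most recent health-check result."""

    def __init__(self) -> None:
        self._last_health_check_latency_ms: Optional[float] = None
        self._last_health_check_failed: bool = False

    def record_health_check(self, latency_ms: float, failed: bool) -> None:
        # Called once when a health-check cycle completes.
        self._last_health_check_latency_ms = latency_ms
        self._last_health_check_failed = failed

    def consume_last_result(self) -> Tuple[Optional[float], bool]:
        """Return the last result and reset it, so a stale result is
        never re-observed on a later control-loop tick."""
        result = (self._last_health_check_latency_ms, self._last_health_check_failed)
        self._last_health_check_latency_ms = None
        self._last_health_check_failed = False
        return result


def control_loop_tick(replica: ReplicaHealthState, observed: List[tuple]) -> None:
    # Hypothetical control-loop tick: only emit metrics when a new
    # result exists. Without the reset, the same latency would be
    # re-observed and the failure counter re-incremented every tick.
    latency_ms, failed = replica.consume_last_result()
    if latency_ms is not None:
        observed.append(("health_check_latency_ms", latency_ms))
        if failed:
            observed.append(("health_check_failures_total", 1))
```

Resetting at consume time (rather than never resetting) is what prevents the double counting described above.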
Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: abrar <abrar@anyscale.com>
I'm curious what the startup/initialization/reconfigure/shutdown metrics look like in Grafana. Is it just a single sample?
Haven't made a decision one way or the other, but either
wdyt?
I think it makes sense to record it as is. As you said, it might not be useful to visualize as a time series because of the sparsity. Option 2 makes sense: showing a single value like the avg, p90, and max of this counter from the last recorded value.
Also, if you want the PR to be linked to the GH issue but not close it when the PR is merged, you can remove the "fixes" keyword from the PR description.
Thank you :) I kept struggling to figure this out.
fixes ray-project#59218 --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: peterxcli <peterxcli@gmail.com>
fixes #59218