Tiny add back fixes of incorrect metrics after worker removal#15624
Tiny add back fixes of incorrect metrics after worker removal#15624slin1237 merged 5 commits intosgl-project:mainfrom
Conversation
Summary of ChangesHello @fzyzcjy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves an issue where metrics associated with workers were not correctly cleared or reset upon a worker's removal or unhealthiness. By introducing a dedicated cleanup mechanism, it ensures that monitoring data accurately reflects the current state of active workers, preventing the persistence of stale or incorrect metric values and improving the reliability of observability. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
/tag-and-rerun-ci |
There was a problem hiding this comment.
Code Review
This pull request introduces a fix to clean up worker-specific metrics when a worker is removed from the registry. This is achieved by adding a remove_worker_metrics function in the Metrics module and calling it from the WorkerRegistry::remove method. The changes look good and address the problem described. I have two main points of feedback:
- The new test for
remove_worker_metricsis missing assertions, which makes it ineffective. I've suggested an implementation that properly verifies the metrics are reset. - The
smg_worker_healthmetric is being reset, but it's not described and doesn't seem to be set anywhere else. It should be described, and if it's meant to be used, it should be set when health status changes.
Please address these points to improve the correctness and maintainability of the code.
I am having trouble creating individual review comments. Click here to see my feedback.
sgl-model-gateway/src/observability/metrics.rs (1290-1300)
The new test test_remove_worker_metrics is missing assertions. A test without assertions doesn't verify the correctness of the code and can't prevent regressions. Please add assertions to ensure that the metrics are correctly reset after remove_worker_metrics is called. You can use a PrometheusBuilder to capture and inspect the metrics.
#[test]
fn test_remove_worker_metrics() {
// To test metrics, we need to install a recorder and capture its output.
// This may panic if tests run in parallel. Consider using `try_install` or running tests serially.
let handle = metrics_exporter_prometheus::PrometheusBuilder::new()
.install()
.unwrap();
let worker_url = "http://worker1:8000";
Metrics::set_worker_requests_active(worker_url, 10);
Metrics::set_worker_cb_state(worker_url, 1);
Metrics::set_worker_cb_consecutive_failures(worker_url, 5);
Metrics::set_worker_cb_consecutive_successes(worker_url, 3);
gauge!("smg_worker_health", "worker" => worker_url.to_string()).set(1.0);
Metrics::remove_worker_metrics(worker_url);
let output = handle.render();
assert!(output.contains("smg_worker_requests_active{worker=\"http://worker1:8000\"} 0"));
assert!(output.contains("smg_worker_cb_state{worker=\"http://worker1:8000\"} -1"));
assert!(output.contains("smg_worker_cb_consecutive_failures{worker=\"http://worker1:8000\"} 0"));
assert!(output.contains("smg_worker_cb_consecutive_successes{worker=\"http://worker1:8000\"} 0"));
assert!(output.contains("smg_worker_health{worker=\"http://worker1:8000\"} -1"));
}sgl-model-gateway/src/observability/metrics.rs (1048)
The metric smg_worker_health is being reset here, but it's not described with describe_gauge! in the init_metrics function. For consistency and discoverability, all metrics should be described. Please add a description for smg_worker_health (e.g., to indicate worker health status where 1 is healthy and 0 is unhealthy).
Additionally, I couldn't find where this metric is being set with a positive value. If this metric is intended to track worker health, it should be set when the worker's health status changes (e.g., in set_healthy or check_health_async). If it's an old metric that's no longer used, it could be removed from this cleanup logic.
|
|
||
| worker.set_healthy(false); | ||
| Metrics::remove_worker_metrics(worker.url()); | ||
|
|
There was a problem hiding this comment.
should this be in workflow instead of worker registry?
Motivation
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist