Skip to content

Tiny add back fixes of incorrect metrics after worker removal#15624

Merged
slin1237 merged 5 commits intosgl-project:mainfrom
fzyzcjy:feat/add_back_rm_worker
Dec 24, 2025
Merged

Tiny add back fixes of incorrect metrics after worker removal#15624
slin1237 merged 5 commits intosgl-project:mainfrom
fzyzcjy:feat/add_back_rm_worker

Conversation

@fzyzcjy
Copy link
Collaborator

@fzyzcjy fzyzcjy commented Dec 22, 2025

Motivation

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @fzyzcjy, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves an issue where metrics associated with workers were not correctly cleared or reset upon a worker's removal or unhealthiness. By introducing a dedicated cleanup mechanism, it ensures that monitoring data accurately reflects the current state of active workers, preventing the persistence of stale or incorrect metric values and improving the reliability of observability.

Highlights

  • Metric Cleanup Function: A new public function, remove_worker_metrics, has been introduced in the Metrics module to explicitly reset worker-specific gauge metrics when a worker is no longer active.
  • Worker Registry Integration: The WorkerRegistry now invokes the remove_worker_metrics function whenever a worker is removed or marked as unhealthy, ensuring that associated metrics are promptly cleared.
  • Temporary Metric Invalidation: For specific metrics like smg_worker_cb_state and smg_worker_health, values are temporarily set to -1.0 to indicate invalidation, addressing a limitation in the metrics-rs library where 0.0 might have a valid meaning.
  • Unit Test Coverage: A new unit test, test_remove_worker_metrics, has been added to verify the correct functionality of the metric cleanup process.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@fzyzcjy
Copy link
Collaborator Author

fzyzcjy commented Dec 22, 2025

/tag-and-rerun-ci

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a fix to clean up worker-specific metrics when a worker is removed from the registry. This is achieved by adding a remove_worker_metrics function in the Metrics module and calling it from the WorkerRegistry::remove method. The changes look good and address the problem described. I have two main points of feedback:

  1. The new test for remove_worker_metrics is missing assertions, which makes it ineffective. I've suggested an implementation that properly verifies the metrics are reset.
  2. The smg_worker_health metric is being reset, but it's not described and doesn't seem to be set anywhere else. It should be described, and if it's meant to be used, it should be set when health status changes.

Please address these points to improve the correctness and maintainability of the code.

I am having trouble creating individual review comments. Click here to see my feedback.

sgl-model-gateway/src/observability/metrics.rs (1290-1300)

critical

The new test test_remove_worker_metrics is missing assertions. A test without assertions doesn't verify the correctness of the code and can't prevent regressions. Please add assertions to ensure that the metrics are correctly reset after remove_worker_metrics is called. You can use a PrometheusBuilder to capture and inspect the metrics.

    #[test]
    fn test_remove_worker_metrics() {
        // To test metrics, we need to install a recorder and capture its output.
        // This may panic if tests run in parallel. Consider using `try_install` or running tests serially.
        let handle = metrics_exporter_prometheus::PrometheusBuilder::new()
            .install()
            .unwrap();

        let worker_url = "http://worker1:8000";

        Metrics::set_worker_requests_active(worker_url, 10);
        Metrics::set_worker_cb_state(worker_url, 1);
        Metrics::set_worker_cb_consecutive_failures(worker_url, 5);
        Metrics::set_worker_cb_consecutive_successes(worker_url, 3);
        gauge!("smg_worker_health", "worker" => worker_url.to_string()).set(1.0);

        Metrics::remove_worker_metrics(worker_url);

        let output = handle.render();
        assert!(output.contains("smg_worker_requests_active{worker=\"http://worker1:8000\"} 0"));
        assert!(output.contains("smg_worker_cb_state{worker=\"http://worker1:8000\"} -1"));
        assert!(output.contains("smg_worker_cb_consecutive_failures{worker=\"http://worker1:8000\"} 0"));
        assert!(output.contains("smg_worker_cb_consecutive_successes{worker=\"http://worker1:8000\"} 0"));
        assert!(output.contains("smg_worker_health{worker=\"http://worker1:8000\"} -1"));
    }

sgl-model-gateway/src/observability/metrics.rs (1048)

medium

The metric smg_worker_health is being reset here, but it's not described with describe_gauge! in the init_metrics function. For consistency and discoverability, all metrics should be described. Please add a description for smg_worker_health (e.g., to indicate worker health status where 1 is healthy and 0 is unhealthy).

Additionally, I couldn't find where this metric is being set with a positive value. If this metric is intended to track worker health, it should be set when the worker's health status changes (e.g., in set_healthy or check_health_async). If it's an old metric that's no longer used, it could be removed from this cleanup logic.


worker.set_healthy(false);
Metrics::remove_worker_metrics(worker.url());

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be in workflow instead of worker registry?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(synced offline)

@slin1237 slin1237 merged commit 4d64f15 into sgl-project:main Dec 24, 2025
70 checks passed
jiaming1130 pushed a commit to zhuyijie88/sglang that referenced this pull request Dec 25, 2025
GuoYechang pushed a commit to GuoYechang/sglang that referenced this pull request Jan 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments