[metrics] Add more useful metrics #15809

Open

Baidu-AIAK wants to merge 35 commits into sgl-project:main from Baidu-AIAK:pr-monitor

Conversation

@Baidu-AIAK
Contributor

Motivation

These are practical monitoring metrics used in our production environment:

  • sglang:prefill_chunk_util

    • The utilization ratio of chunked prefill, calculated as the number of tokens involved in the chunked prefill computation divided by server_args.chunked_prefill_size.
  • sglang:request_waiting_time

    • For Centralized architecture: The time a request spends waiting in the waiting_queue.
    • For Prefill instances: The time from when a request enters the bootstrap queue until the start of the forward pass.
    • For Decode instances: The time from when a request enters the transferring queue until the start of the forward pass.
  • sglang:request_zmq_time_seconds

    • The latency between the tokenizer process sending a request and the scheduler process receiving it.
  • sglang:request_first_token_forward_time

    • For Centralized architecture: The time elapsed from entering the waiting_queue to the completion of run_batch.
    • For Prefill instances: The time elapsed from entering the bootstrap queue to the completion of run_batch.
  • sglang:request_time_to_tokenizer_ready_seconds

    • The processing time consumed by the tokenizer.
  • sglang:all_gather_latency_us

    • Measures the time taken to aggregate local token counts from all ranks and complete a cross-DP all_gather operation before each run_batch in the Decode phase (during DP attention/MLP preparation). This indicates the overhead of Data Parallel (DP) communication and synchronization during decoding.
  • sglang:decode_bs_util

    • Batch size utilization during the Decode phase (the proportion of active decode requests per DP worker).
  • sglang:dp_balance

    • Data Parallel (DP) load-balance metric, calculated from the Gini coefficient of per-DP-rank token counts (1.0 indicates perfect balance); see the sketch after this list.
  • sglang:generation_time

    • The total time required to generate a single token.
  • sglang:idle_batch_ratio

    • The proportion of idle workers among all DP workers.
  • sglang:run_batch_time

    • The execution time for a batch (duration of the forward pass).
  • sglang:iter_token_process_time

    • The processing overhead between tokens, calculated as generation_time - run_batch_time.
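
A minimal sketch of how the ratio-style metrics above could be computed (helper names are hypothetical and the exact formulas in this PR may differ; in particular, dp_balance is assumed here to be 1 - Gini so that 1.0 means perfect balance):

```python
# Hypothetical sketch, not the exact code in this PR.
from typing import Sequence


def prefill_chunk_util(num_chunk_tokens: int, chunked_prefill_size: int) -> float:
    """Tokens computed in this chunked-prefill step / configured chunk size."""
    return num_chunk_tokens / chunked_prefill_size


def dp_balance(dp_token_counts: Sequence[int]) -> float:
    """Load balance across DP ranks, derived from the Gini coefficient.

    Returns 1.0 when every rank holds the same number of tokens.
    """
    n = len(dp_token_counts)
    total = sum(dp_token_counts)
    if n == 0 or total == 0:
        return 1.0
    # Gini coefficient: mean absolute difference normalized by twice the mean.
    abs_diff_sum = sum(abs(a - b) for a in dp_token_counts for b in dp_token_counts)
    gini = abs_diff_sum / (2 * n * total)
    return 1.0 - gini
```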

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

wangchao101 added 9 commits December 24, 2025 11:58

@ishandhanani
Collaborator

This is super helpful. Thank you.

Do you mind sharing some before and after benchmarks with these metrics emitted?

@ishandhanani
Collaborator

Can you rebase?

@Baidu-AIAK
Contributor Author

> This is super helpful. Thank you.
>
> Do you mind sharing some before and after benchmarks with these metrics emitted?

It needs some work. Please wait a moment.

@Baidu-AIAK
Contributor Author

> Can you rebase?

Conflicts were resolved.

Baidu-AIAK and others added 6 commits January 5, 2026 14:02
@ishandhanani
Collaborator

ishandhanani commented Jan 11, 2026

Sounds good. Let me know when this is re-ready to review

Cc @sufeng-buaa as well

@sufeng-buaa
Contributor

sufeng-buaa commented Jan 11, 2026

The metrics and collection method for request stage latency are somewhat redundant with the existing implementation in #8710. If we merge this patch, we should revert #8710, consolidate all request stage latency metrics within TimeStats and SchedulerMetricsMixin, and abandon integrating the request stage latency metric into the Tracing package in my PR #13152. What do you think? @ShangmingCai @ishandhanani @acelyc111 @Baidu-AIAK @scottjlee @zhanghaotong

async def openai_v1_completions(request: CompletionRequest, raw_request: Request):
    """OpenAI-compatible text completion endpoint."""
    # Timestamp when the HTTP request is received and handed off to the tokenizer
    tokenizer_rev_request_time = time.time()
Contributor

Somewhat redundant with #13432

Contributor

> Somewhat redundant with #13432

Yes, I think this variable is essentially no different from received_time, which can also be used for embedding requests handled within serving_base.

Contributor Author

> Somewhat redundant with #13432
>
> Yes, I think this variable is essentially no different from received_time, which can also be used for embedding requests handled within serving_base.

Yes, it's redundant. I will reuse received_time.

Contributor

@merrymercy left a comment

Now we have many observability features. There seems to be some redundancy among them. All the related code is scattered around the whole codebase without documentation. The existing related code:

  • TimeStats (python/sglang/srt/metrics/collector.py)
  • per request tracing (python/sglang/srt/tracing/trace.py)
  • Scheduler metrics for prefill and decode batch (python/sglang/srt/managers/scheduler_metrics_mixin.py)

Questions:

  • Can we move all observability code into a single place?
  • How can we avoid ad-hoc change to insert metrics collection code all around the code base? We should have a unified interface to collect the latency breakdown of a request.
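
One possible shape for such a unified interface (purely illustrative; the class and method names below are hypothetical, not existing SGLang APIs):

```python
# Purely illustrative sketch of a unified latency-breakdown interface.
# RequestTimeline and its methods are hypothetical, not existing SGLang code.
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RequestTimeline:
    """Collects named timestamps for one request in a single place."""

    stamps: Dict[str, float] = field(default_factory=dict)

    def mark(self, stage: str) -> None:
        # perf_counter is monotonic, so stage deltas are robust to wall-clock changes.
        self.stamps[stage] = time.perf_counter()

    def elapsed(self, start_stage: str, end_stage: str) -> float:
        return self.stamps[end_stage] - self.stamps[start_stage]


# Callers mark stages instead of threading timestamps through function arguments.
timeline = RequestTimeline()
timeline.mark("tokenizer_recv")
timeline.mark("enter_waiting_queue")
timeline.mark("forward_start")
print(timeline.elapsed("tokenizer_recv", "forward_start"))
```

A real implementation would also need timestamps that remain comparable across the tokenizer and scheduler processes, which this single-process sketch glosses over.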


while True:
    # Receive requests
    self.iter_start_time = time.time()
Contributor

  1. Do not delete the comment.
  2. Use perf_counter?
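
A minimal sketch of the suggested change (illustrative; time.perf_counter() is monotonic, so it is unaffected by wall-clock adjustments, unlike time.time()):

```python
# Illustrative sketch: measure the iteration with a monotonic clock.
import time

iter_start = time.perf_counter()
# ... receive requests and run the batch ...
iter_elapsed = time.perf_counter() - iter_start
```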

)

def log_prefill_dp_balance_stats(self: Scheduler, batch: ScheduleBatch) -> None:
    """Log DP-level load-balancing metrics for the prefill stage."""
Contributor

@fzyzcjy take a review?

run_batch_time = self.iter_forward_finish_time - self.iter_forward_start_time
iter_token_process_time = generation_time - run_batch_time

# Add metrics related to DP balancing
Contributor

split this big function into multiple smaller ones

Contributor Author

> split this big function into multiple smaller ones

I have already split log_decode_stats into smaller log_decode_* helpers.

@zhanghaotong
Contributor

> The metrics and collection method for request stage latency are somewhat redundant with the existing implementation in #8710. If we merge this patch, we should revert #8710, consolidate all request stage latency metrics within TimeStats and SchedulerMetricsMixin, and abandon integrating the request stage latency metric into the Tracing package in my PR #13152. What do you think? @ShangmingCai @ishandhanani @acelyc111 @Baidu-AIAK @scottjlee @zhanghaotong

We definitely need a unified location to place these metrics, but it's also necessary to store some metrics about this request as attributes within the tracing. What do you think?
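
If the tracing path were OpenTelemetry-based (an assumption; SGLang's trace.py may expose a different API), attaching these per-request metrics as span attributes could look roughly like this:

```python
# Rough OpenTelemetry-style sketch; SGLang's own tracing API may differ,
# and the attribute names/values here are placeholders.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("request_lifecycle") as span:
    # Attach per-request latency metrics as span attributes.
    span.set_attribute("sglang.request_waiting_time_s", 0.012)
    span.set_attribute("sglang.request_zmq_time_s", 0.003)
```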

@sufeng-buaa
Contributor

> Questions:
>
>   • Can we move all observability code into a single place?
>   • How can we avoid ad-hoc change to insert metrics collection code all around the code base? We should have a unified interface to collect the latency breakdown of a request.

@merrymercy I'm currently trying to integrate these. I might submit a draft PR in about 2-3 weeks. Another question: are the request stage metrics and tracing data exported in real-time, or all at once when the request finishes? I suspect the latter might cause the scheduler to block due to potentially large data volumes.

    labelnames=labels.keys(),
    multiprocess_mode="mostrecent",
)
self.run_batch_time = Gauge(
Contributor

@sufeng-buaa Jan 12, 2026

Do generation_time, run_batch_time, and iter_token_process_time have any practical observational value in your experience? In event_loop_overlap, the batch in run_batch and the batch in process_batch_result are not the same, so I don't think adding this observation is necessary. If you simply want to measure the time spent in process_inputs, run_batch, and process_batch_result, it would suffice to record timestamps at the beginning and end of each function. Even then, however, I still find the value limited, since their latency is directly correlated with batch size.

Contributor

@sufeng-buaa Hello. Yes, firstly, the granularity of the timing data is indeed quite fine, and it has helped us pinpoint some difficult performance issues in our production environment. Secondly, in overlap mode these values are not entirely accurate, and we haven't found a good way to collect them precisely. In our actual usage we have compared these metrics against nsys profile data, and they generally match the end-to-end times in the profile (as long as the batch size does not change frequently and significantly). This still helps us quickly identify performance problems and has real reference value. For now we can measure this part with this method, and we welcome suggestions for better implementations.

    and self.enable_metrics
):
    self.metrics_collector.observe_request_time_to_tokenizer_ready(
        tokenizer_ready_time - tokenizer_rev_request_time
Contributor

Instead of passing it through layer by layer, could we use obj.received_time?

wangchao101 and others added 6 commits January 12, 2026 11:04