[metrics] Add more useful metrics #15809

Open

Baidu-AIAK wants to merge 35 commits into sgl-project:main from Baidu-AIAK:pr-monitor

Conversation

@Baidu-AIAK
Contributor

Motivation

These are practical monitoring metrics used in our production environment:

  • sglang:prefill_chunk_util

    • The utilization ratio of chunked prefill, calculated as the number of tokens involved in the chunked prefill computation divided by server_args.chunked_prefill_size.
  • sglang:request_waiting_time

    • For Centralized architecture: The time a request spends waiting in the waiting_queue.
    • For Prefill instances: The time from when a request enters the bootstrap queue until the start of the forward pass.
    • For Decode instances: The time from when a request enters the transferring queue until the start of the forward pass.
  • sglang:request_zmq_time_seconds

    • The latency between the tokenizer process sending a request and the scheduler process receiving it.
  • sglang:request_first_token_forward_time

    • For Centralized architecture: The time elapsed from entering the waiting_queue to the completion of run_batch.
    • For Prefill instances: The time elapsed from entering the bootstrap queue to the completion of run_batch.
  • sglang:request_time_to_tokenizer_ready_seconds

    • The processing time consumed by the tokenizer.
  • sglang:all_gather_latency_us

    • Measures the time taken to aggregate local token counts from all ranks and complete a cross-DP all_gather operation before each run_batch in the Decode phase (during DP attention/MLP preparation). This indicates the overhead of Data Parallel (DP) communication and synchronization during decoding.
  • sglang:decode_bs_util

    • Batch size utilization during the Decode phase (the proportion of active decode requests per DP worker).
  • sglang:dp_balance

    • Data Parallel (DP) load-balance metric, calculated from the Gini coefficient of per-DP-rank token counts (1.0 indicates perfect balance); see the sketch after this list.
  • sglang:generation_time

    • The total time required to generate a single token.
  • sglang:idle_batch_ratio

    • The proportion of idle workers among all DP workers.
  • sglang:run_batch_time

    • The execution time for a batch (duration of the forward pass).
  • sglang:iter_token_process_time

    • The processing overhead between tokens, calculated as generation_time - run_batch_time.
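
A minimal sketch of how the ratio-style metrics above could be computed (helper names are hypothetical and the exact formulas in this PR may differ; in particular, dp_balance is assumed here to be 1 - Gini so that 1.0 means perfect balance):

```python
# Hypothetical sketch, not the exact code in this PR.
from typing import Sequence


def prefill_chunk_util(num_chunk_tokens: int, chunked_prefill_size: int) -> float:
    """Tokens computed in this chunked-prefill step / configured chunk size."""
    return num_chunk_tokens / chunked_prefill_size


def dp_balance(dp_token_counts: Sequence[int]) -> float:
    """Load balance across DP ranks, derived from the Gini coefficient.

    Returns 1.0 when every rank holds the same number of tokens.
    """
    n = len(dp_token_counts)
    total = sum(dp_token_counts)
    if n == 0 or total == 0:
        return 1.0
    # Gini coefficient: mean absolute difference normalized by twice the mean.
    abs_diff_sum = sum(abs(a - b) for a in dp_token_counts for b in dp_token_counts)
    gini = abs_diff_sum / (2 * n * total)
    return 1.0 - gini
```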

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

wangchao101 added 9 commits December 24, 2025 11:58

@ishandhanani
Collaborator

This is super helpful. Thank you.

Do you mind sharing some before and after benchmarks with these metrics emitted?

@ishandhanani
Collaborator

Can you rebase?

@Baidu-AIAK
Contributor Author

> This is super helpful. Thank you.
>
> Do you mind sharing some before and after benchmarks with these metrics emitted?

It needs some work. Please wait a moment.

@Baidu-AIAK
Contributor Author

> Can you rebase?

Conflicts were resolved.

Baidu-AIAK and others added 6 commits January 5, 2026 14:02
@ishandhanani
Collaborator

ishandhanani commented Jan 11, 2026

Sounds good. Let me know when this is re-ready to review

Cc @sufeng-buaa as well

@sufeng-buaa
Contributor

sufeng-buaa commented Jan 11, 2026

The metrics and collection method for request stage latency are somewhat redundant with the existing implementation in #8710. If we merge this patch, we should revert #8710, consolidate all request stage latency metrics within TimeStats and SchedulerMetricsMixin, and abandon integrating the request stage latency metric into the Tracing package in my PR #13152. What do you think? @ShangmingCai @ishandhanani @acelyc111 @Baidu-AIAK @scottjlee @zhanghaotong

async def openai_v1_completions(request: CompletionRequest, raw_request: Request):
    """OpenAI-compatible text completion endpoint."""
    # Timestamp when the HTTP request is received and handed off to the tokenizer
    tokenizer_rev_request_time = time.time()
Contributor

Somewhat redundant with #13432

Contributor

> Somewhat redundant with #13432

Yes, I think this variable is essentially no different from received_time, which can also be used for embedding requests handled within serving_base.

Contributor Author

> Somewhat redundant with #13432
>
> Yes, I think this variable is essentially no different from received_time, which can also be used for embedding requests handled within serving_base.

Yes, it's redundant. I will reuse received_time.

Contributor

@merrymercy left a comment

Now we have many observability features. There seems to be some redundancy among them. All the related code is scattered around the whole codebase without documentation. The existing related code:

  • TimeStats (python/sglang/srt/metrics/collector.py)
  • per request tracing (python/sglang/srt/tracing/trace.py)
  • Scheduler metrics for prefill and decode batch (python/sglang/srt/managers/scheduler_metrics_mixin.py)

Questions:

  • Can we move all observability code into a single place?
  • How can we avoid ad-hoc change to insert metrics collection code all around the code base? We should have a unified interface to collect the latency breakdown of a request.
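
One possible shape for such a unified interface (purely illustrative; the class and method names below are hypothetical, not existing SGLang APIs):

```python
# Purely illustrative sketch of a unified latency-breakdown interface.
# RequestTimeline and its methods are hypothetical, not existing SGLang code.
import time
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RequestTimeline:
    """Collects named timestamps for one request in a single place."""

    stamps: Dict[str, float] = field(default_factory=dict)

    def mark(self, stage: str) -> None:
        # perf_counter is monotonic, so stage deltas are robust to wall-clock changes.
        self.stamps[stage] = time.perf_counter()

    def elapsed(self, start_stage: str, end_stage: str) -> float:
        return self.stamps[end_stage] - self.stamps[start_stage]


# Callers mark stages instead of threading timestamps through function arguments.
timeline = RequestTimeline()
timeline.mark("tokenizer_recv")
timeline.mark("enter_waiting_queue")
timeline.mark("forward_start")
print(timeline.elapsed("tokenizer_recv", "forward_start"))
```

A real implementation would also need timestamps that remain comparable across the tokenizer and scheduler processes, which this single-process sketch glosses over.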


while True:
    # Receive requests
    self.iter_start_time = time.time()
Contributor

  1. Do not delete the comment.
  2. Use perf_counter?
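
A minimal sketch of the suggested change (illustrative; time.perf_counter() is monotonic, so it is unaffected by wall-clock adjustments, unlike time.time()):

```python
# Illustrative sketch: measure the iteration with a monotonic clock.
import time

iter_start = time.perf_counter()
# ... receive requests and run the batch ...
iter_elapsed = time.perf_counter() - iter_start
```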

)

def log_prefill_dp_balance_stats(self: Scheduler, batch: ScheduleBatch) -> None:
    """Log DP-level load-balancing metrics for the prefill stage."""
Contributor

@fzyzcjy take a review?

run_batch_time = self.iter_forward_finish_time - self.iter_forward_start_time
iter_token_process_time = generation_time - run_batch_time

# Add metrics related to DP balancing
Contributor

split this big function into multiple smaller ones

Contributor Author

> split this big function into multiple smaller ones

I have already split log_decode_stats into smaller log_decode_* helpers.

@zhanghaotong
Contributor

> The metrics and collection method for request stage latency are somewhat redundant with the existing implementation in #8710. If we merge this patch, we should revert #8710, consolidate all request stage latency metrics within TimeStats and SchedulerMetricsMixin, and abandon integrating the request stage latency metric into the Tracing package in my PR #13152. What do you think? @ShangmingCai @ishandhanani @acelyc111 @Baidu-AIAK @scottjlee @zhanghaotong

We definitely need a unified location to place these metrics, but it's also necessary to store some metrics about this request as attributes within the tracing. What do you think?
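
If the tracing path were OpenTelemetry-based (an assumption; SGLang's trace.py may expose a different API), attaching these per-request metrics as span attributes could look roughly like this:

```python
# Rough OpenTelemetry-style sketch; SGLang's own tracing API may differ,
# and the attribute names/values here are placeholders.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("request_lifecycle") as span:
    # Attach per-request latency metrics as span attributes.
    span.set_attribute("sglang.request_waiting_time_s", 0.012)
    span.set_attribute("sglang.request_zmq_time_s", 0.003)
```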

@sufeng-buaa
Contributor

> Questions:
>
>   • Can we move all observability code into a single place?
>   • How can we avoid ad-hoc change to insert metrics collection code all around the code base? We should have a unified interface to collect the latency breakdown of a request.

@merrymercy I'm currently trying to integrate these. I might submit a draft PR in about 2-3 weeks. Another question: are the request stage metrics and tracing data exported in real-time, or all at once when the request finishes? I suspect the latter might cause the scheduler to block due to potentially large data volumes.

    labelnames=labels.keys(),
    multiprocess_mode="mostrecent",
)
self.run_batch_time = Gauge(
Contributor

@sufeng-buaa Jan 12, 2026

Do generation_time, run_batch_time, and iter_token_process_time have any practical observational value in your experience? In event_loop_overlap, the batch in run_batch and the batch in process_batch_result are not the same, so I don't think adding this observation is necessary. If you simply want to measure the time spent in process_inputs, run_batch, and process_batch_result, it would suffice to record timestamps at the beginning and end of each function. Even then, however, I still find the value limited, since their latency is directly correlated with batch size.

Contributor

@sufeng-buaa Hello. Yes, firstly, the granularity of the timing data is indeed quite fine, and it has helped us pinpoint some difficult performance issues in our production environment. Secondly, in overlap mode these values are not entirely accurate, and we haven't found a good way to collect them precisely. In our actual usage we have compared these metrics against nsys profile data, and they generally match the end-to-end times in the profile (as long as the batch size does not change frequently and significantly). This still helps us quickly identify performance problems and has real reference value. For now we can measure this part with this method, and we welcome suggestions for better implementations.

    and self.enable_metrics
):
    self.metrics_collector.observe_request_time_to_tokenizer_ready(
        tokenizer_ready_time - tokenizer_rev_request_time
Contributor

Instead of passing it through layer by layer, could we use obj.received_time?

wangchao101 and others added 6 commits January 12, 2026 11:04