[metrics] Add more useful metrics #15809
Baidu-AIAK wants to merge 35 commits into sgl-project:main
Conversation
Change-Id: I57363c3af1f9f4a7adb2367a767464172c8728af
Change-Id: I6269e86d6b36399431af3cbd8065030fc4f12723
Change-Id: Ib4e7b00cbc8c03f75ce5da324fcfc418ee3640c4
Change-Id: Id66a5e1f587faaa022586f8fb00f87b37b1c4f37
Change-Id: I5cf74b383519f05459bbc1c1bc1e70f05be70baa
Change-Id: I008c5d0c56572bfd76639b6a357fbc26671043fa
Change-Id: Iee8876df87a463bbe2055f05f745e8a8f4dbfe9e
Change-Id: I7f81c21649c1195c73a0bf8139e7808566ce65a5
Change-Id: I62589ba5b9afe2136afcb74bd5740b8e2542c827
This is super helpful. Thank you. Do you mind sharing some before and after benchmarks with these metrics emitted?

Can you rebase?
It needs some work. Please wait a moment |
Conflicts were resolved. |
Change-Id: I0d595dab49da9929b1a2f9eafb127df39ada3748
Change-Id: Ic62b9277008ad3062e4bff013f9cc42a011fc6f2
Change-Id: Id14a23364e83b10c185d3a4dc487b9e7d810f9e0
Change-Id: If6c7183e825e1d9444e80d3cb190c953a44ad2b2
…ken_forward_time` Change-Id: Id8c758cb4434bef9c2dd2ff101aaf43aed9e2051
Sounds good. Let me know when this is ready for review again. Cc @sufeng-buaa as well.

The metrics and collection method for …
    async def openai_v1_completions(request: CompletionRequest, raw_request: Request):
        """OpenAI-compatible text completion endpoint."""
        # Timestamp when the HTTP request is received and handed off to the tokenizer
        tokenizer_rev_request_time = time.time()
Somewhat redundant with #13432
Yes, I think this variable is essentially no different from received_time, and it can also be used for embedding requests placed within the serving_base.
Yes, it's redundant. I will reuse received_time.
merrymercy left a comment:
Now we have many observability features. There seems to be some redundancy among them. All the related code is scattered around the whole codebase without documentation. The existing related code:
- TimeStats (python/sglang/srt/metrics/collector.py)
- Per-request tracing (python/sglang/srt/tracing/trace.py)
- Scheduler metrics for prefill and decode batches (python/sglang/srt/managers/scheduler_metrics_mixin.py)
Questions:
- Can we move all observability code into a single place?
- How can we avoid ad-hoc changes that insert metrics collection code all around the codebase? We should have a unified interface to collect the latency breakdown of a request.
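One possible shape for such a unified interface, purely as a sketch (the RequestLatencyBreakdown class and its methods are hypothetical names, not existing sglang APIs): each request carries a single breakdown object, and every component only marks its stage boundary instead of scattering its own time.time() bookkeeping.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RequestLatencyBreakdown:
    """Collect all stage timestamps of one request in a single object (hypothetical sketch)."""

    stages: List[Tuple[str, float]] = field(default_factory=list)

    def mark(self, stage: str) -> None:
        # One call per stage boundary, using a monotonic clock.
        self.stages.append((stage, time.perf_counter()))

    def breakdown(self) -> Dict[str, float]:
        # Elapsed time between consecutive stage marks.
        return {
            f"{prev}->{curr}": t_curr - t_prev
            for (prev, t_prev), (curr, t_curr) in zip(self.stages, self.stages[1:])
        }


# Hypothetical usage across the request lifecycle:
lat = RequestLatencyBreakdown()
lat.mark("received")
lat.mark("tokenizer_ready")
lat.mark("forward_started")
lat.mark("first_token")
print(lat.breakdown())
```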
    while True:
        # Receive requests
        self.iter_start_time = time.time()
- Do not delete the comment.
- Use a perf counter?
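For context, time.perf_counter() is a monotonic, high-resolution clock, so it is generally preferred over time.time() for measuring elapsed intervals (wall-clock time can jump if NTP adjusts the system clock mid-measurement). A minimal sketch of the difference:

```python
import time

# Wall clock: fine for absolute timestamps, but it can move backwards or
# forwards if the system clock is adjusted mid-measurement.
wall_start = time.time()

# Monotonic, high-resolution clock: preferred for durations such as the
# per-iteration scheduler timings discussed here.
start = time.perf_counter()
time.sleep(0.01)  # stand-in for one scheduler iteration
elapsed = time.perf_counter() - start
print(f"iteration took {elapsed * 1e3:.2f} ms")
```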
    def log_prefill_dp_balance_stats(self: Scheduler, batch: ScheduleBatch) -> None:
        """Log DP-level load-balancing metrics for the prefill stage."""
    run_batch_time = self.iter_forward_finish_time - self.iter_forward_start_time
    iter_token_process_time = generation_time - run_batch_time

    # Add metrics related to DP balancing
split this big function into multiple smaller ones
I already split log_decode_stats into sub log_decode_* functions.
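The resulting structure presumably looks something like the sketch below; the helper names are illustrative, derived from the log_decode_* naming mentioned above, and are not necessarily the exact functions in this PR.

```python
class Scheduler:
    def log_decode_stats(self, batch) -> None:
        """Thin orchestrator: each metric group lives in its own small helper."""
        self.log_decode_batch_size(batch)
        self.log_decode_throughput(batch)
        self.log_decode_dp_balance(batch)

    def log_decode_batch_size(self, batch) -> None:
        ...  # e.g. observe running batch size / decode_bs_util

    def log_decode_throughput(self, batch) -> None:
        ...  # e.g. observe generated tokens per second

    def log_decode_dp_balance(self, batch) -> None:
        ...  # e.g. observe per-DP-rank load imbalance
```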
We definitely need a unified location for these metrics, but it's also necessary to store some per-request metrics as attributes within the tracing. What do you think?
@merrymercy I'm currently trying to integrate these. I might submit a draft PR in about 2-3 weeks. Another question: are the request-stage metrics and tracing data exported in real time, or all at once when the request finishes? I suspect the latter might cause the scheduler to block due to potentially large data volumes.
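For the "metrics as tracing attributes" idea, a generic illustration with the OpenTelemetry API is sketched below; sglang's own tracing module (python/sglang/srt/tracing/trace.py) may expose a different interface, so treat this as an assumption about the approach rather than this PR's code.

```python
from opentelemetry import trace

tracer = trace.get_tracer("sglang.scheduler")


def traced_prefill_step(run_batch_time: float, queue_time: float) -> None:
    # Attach per-request stage timings as span attributes so they are exported
    # together with the trace instead of through a separate metrics pipeline.
    with tracer.start_as_current_span("prefill_step") as span:
        span.set_attribute("sglang.run_batch_time_s", run_batch_time)
        span.set_attribute("sglang.queue_time_s", queue_time)
```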
        labelnames=labels.keys(),
        multiprocess_mode="mostrecent",
    )
    self.run_batch_time = Gauge(
Do generation_time, run_batch_time, and iter_token_process_time have any practical observational value in your experience? In event_loop_overlap, the batch in run_batch and the batch in process_batch_result are not the same, so I don't think adding this observation is necessary. If you simply want to measure the time spent in process_inputs, run_batch, and process_batch_result, it would suffice to record timestamps at the beginning and end of each function. Even then, however, I still find the value limited, since their latency is directly correlated with batch size.
@sufeng-buaa Hello. Yes. First, the timing data collected here is indeed quite fine-grained, which has helped us pinpoint some difficult performance issues in our production environment. Second, in overlap mode these values are not entirely accurate, and we haven't found a good way to collect them precisely. In our actual usage we have compared these metrics against nsys profile data, and they generally match the end-to-end times in the profile (as long as the batch size does not change frequently and significantly). This still helps us quickly identify performance problems and has great reference value. For now we can measure this part of the time with this method, and we welcome suggestions for better implementations.
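If per-function timing along the lines suggested above were enough, a small decorator would keep the timestamp bookkeeping out of the event loop itself. A sketch, assuming a hypothetical observe(name, value) method on the metrics collector:

```python
import functools
import time


def timed(metric_name: str):
    """Record the wall time of a method call into a metrics sink (sketch)."""

    def wrapper(func):
        @functools.wraps(func)
        def inner(self, *args, **kwargs):
            start = time.perf_counter()
            try:
                return func(self, *args, **kwargs)
            finally:
                # self.metrics_collector.observe() is assumed here; swap in the real sink.
                self.metrics_collector.observe(metric_name, time.perf_counter() - start)

        return inner

    return wrapper


# Illustrative usage on the three functions mentioned above:
# @timed("sglang:run_batch_time")
# def run_batch(self, batch): ...
```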
        and self.enable_metrics
    ):
        self.metrics_collector.observe_request_time_to_tokenizer_ready(
            tokenizer_ready_time - tokenizer_rev_request_time
        )
Instead of passing it through layer by layer, could we use obj.received_time?
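A minimal sketch of that suggestion, assuming the incoming request object already carries the received_time attribute mentioned earlier in this thread (the wrapper function name is hypothetical; observe_request_time_to_tokenizer_ready is the collector call shown in the diff above):

```python
import time


def maybe_observe_time_to_tokenizer_ready(self, obj) -> None:
    # Read the timestamp stored on the request object when it arrived,
    # instead of threading tokenizer_rev_request_time through every layer.
    received_time = getattr(obj, "received_time", None)
    if self.enable_metrics and received_time is not None:
        self.metrics_collector.observe_request_time_to_tokenizer_ready(
            time.time() - received_time
        )
```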
Change-Id: Ie066fa8be7e6a6d304a9ddc8667f694600ec1639
Change-Id: Ica88de24840ecf4ef140008a5a7b1fcb9815d752
Change-Id: I80d8655d23d1ae7deb7919dd3a4e13636f595c7a
…o pr-monitor Change-Id: I68206a67b5c6e892b93cdb9a634f3744427c419a
Motivation
These are practical monitoring metrics used in our production environment:
sglang:prefill_chunk_utilserver_args.chunked_prefill_size.sglang:request_waiting_timewaiting_queue.bootstrap queueuntil the start of the forward pass.transferring queueuntil the start of the forward pass.sglang:request_zmq_time_secondssglang:request_first_token_forward_timewaiting_queueto the completion ofrun_batch.bootstrap queueto the completion ofrun_batch.sglang:request_time_to_tokenizer_ready_secondssglang:all_gather_latency_usall_gatheroperation before eachrun_batchin the Decode phase (during DP attention/MLP preparation). This indicates the overhead of Data Parallel (DP) communication and synchronization during decoding.sglang:decode_bs_utilsglang:dp_balancesglang:generation_timesglang:idle_batch_ratiosglang:run_batch_timesglang:iter_token_process_timegeneration_time - run_batch_time.Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist