[Feature] Add Observability for KV Eviction/Reload and Chunked Prefill (new_token_ratio, eviction/load_back durations, prefill loop count) #10218

@ShawnKung

Checklist

Motivation

In production, we observed intermittent throughput drops and tail-latency spikes during bursts of long-context requests. We lacked visibility into whether the regressions were compute-bound or due to KV cache movement between GPU and CPU. To address this, we added four metrics to improve diagnosis and tuning:

  • sglang:new_token_ratio (Gauge): Tracks the proportion of newly generated tokens relative to reused tokens, helping distinguish steady decoding vs. frequent context rebuilds or cache misses.
  • sglang:eviction_duration_seconds (Histogram): Time spent evicting KV cache from GPU to CPU memory, exposing memory pressure and paging overhead.
  • sglang:load_back_duration_seconds (Histogram): Time spent loading KV cache from CPU memory back to GPU, highlighting thrashing and hot-cache misses.
  • sglang:chunked_prefill_loop_count (Histogram): Number of loops in chunked prefill, indicating how fragmented or oversized prefill segments are under load.
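
To make the histogram metrics above concrete, here is a minimal sketch of how durations and loop counts could be recorded. It is not SGLang's implementation: the `Histogram` class is a dependency-free stand-in for a Prometheus client histogram, the bucket boundaries are made up, `evict_to_cpu` is a hypothetical placeholder for the real GPU-to-CPU copy, and the loop count is assumed to be one loop per prefill chunk.

```python
import math
import time
from bisect import bisect_left
from contextlib import contextmanager


class Histogram:
    """Tiny in-process histogram with Prometheus-style upper bounds (stand-in only)."""

    def __init__(self, name, buckets):
        self.name = name
        self.bounds = sorted(buckets) + [math.inf]  # final bucket catches everything
        self.counts = [0] * len(self.bounds)        # per-bucket (non-cumulative) counts
        self.total = 0.0                            # sum of all observed values

    def observe(self, value):
        # bisect_left returns the first bound >= value, i.e. the "le" bucket.
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    @contextmanager
    def time(self):
        # Record the wall-clock duration of the wrapped block, even if it raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observe(time.perf_counter() - start)


# Bucket boundaries are illustrative assumptions, not values from SGLang.
DURATION_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]  # seconds
eviction_duration = Histogram("sglang:eviction_duration_seconds", DURATION_BUCKETS)
load_back_duration = Histogram("sglang:load_back_duration_seconds", DURATION_BUCKETS)
prefill_loops = Histogram("sglang:chunked_prefill_loop_count", [1, 2, 4, 8, 16, 32])


def chunked_prefill_loops(prompt_tokens, chunk_size):
    """Assumption: the scheduler runs one loop per chunk of at most chunk_size tokens."""
    return math.ceil(prompt_tokens / chunk_size)


def evict_to_cpu(num_tokens):
    """Hypothetical placeholder for the real GPU-to-CPU KV cache copy."""
    time.sleep(0.001)


with eviction_duration.time():
    evict_to_cpu(4096)

prefill_loops.observe(chunked_prefill_loops(prompt_tokens=20000, chunk_size=8192))
```

With a 20,000-token prompt and 8,192-token chunks, the sketch records 3 loops, which lands in the bucket with upper bound 4.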

With these, we pinpointed periods where eviction and load-back dominated latency (e.g., P95 eviction ~80 ms and P95 load-back ~120 ms during spikes), as well as periods where prefill loop counts frequently exceeded what we expected, signaling suboptimal chunk sizing. After tuning chunk sizes and eviction thresholds, we saw improved throughput and reduced P99 latency in our internal deployments.

Example usages:

  • Alert if the P95 of eviction_duration_seconds or load_back_duration_seconds stays above 50 ms for N minutes.
  • Alert when new_token_ratio drops below an expected band (e.g., <0.3) during bursts, signaling excessive cache rebuilds.
  • Watch for chunked_prefill_loop_count > 5 as a heuristic for suboptimal prefill chunking.
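
As a rough illustration of the checks above, the sketch below estimates a P95 from cumulative histogram buckets by linear interpolation within the bucket (the same general approach Prometheus's `histogram_quantile` takes) and then applies the example thresholds. The bucket counts are invented sample data, `check_alerts` is a hypothetical helper, and the "sustained for N minutes" condition is left out for brevity.

```python
import math


def histogram_quantile(q, bounds, cumulative_counts):
    """Estimate the q-quantile from cumulative bucket counts.

    bounds are bucket upper limits ending with math.inf; values are assumed
    to be non-negative, so the first bucket starts at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative_counts[-1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper, count in zip(bounds, cumulative_counts):
        if count >= rank:
            if math.isinf(upper):
                return lower_bound  # cannot interpolate inside the +Inf bucket
            in_bucket = count - lower_count
            return lower_bound + (upper - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = upper, count
    return lower_bound


def check_alerts(p95_eviction_s, p95_load_back_s, new_token_ratio, prefill_loop_count):
    """Hypothetical helper applying the example thresholds from this issue."""
    alerts = []
    if p95_eviction_s > 0.050:
        alerts.append("eviction P95 above 50 ms")
    if p95_load_back_s > 0.050:
        alerts.append("load-back P95 above 50 ms")
    if new_token_ratio < 0.3:
        alerts.append("new_token_ratio below 0.3")
    if prefill_loop_count > 5:
        alerts.append("chunked prefill loop count above 5")
    return alerts


# Invented sample data: 100 eviction observations, cumulative per bucket.
bounds = [0.01, 0.025, 0.05, 0.1, 0.25, math.inf]
eviction_cumulative = [40, 70, 90, 98, 100, 100]

p95_eviction = histogram_quantile(0.95, bounds, eviction_cumulative)  # about 0.081 s
alerts = check_alerts(p95_eviction, p95_load_back_s=0.02,
                      new_token_ratio=0.45, prefill_loop_count=3)
print(alerts)
```

Because the estimate interpolates within a bucket, its accuracy depends on how finely the buckets cover the 50 ms region, so bucket boundaries near the alert threshold are worth choosing deliberately.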

Related resources

No response
