Search Latency Tracking - Coordinator Slow Logs #9642

@dzane17

Description

Is your feature request related to a problem? Please describe.
As of today, we track search request latencies at the shard level via node stats. After each query or fetch phase completes on a shard, we record the time taken, accumulate those values, and maintain an overall average that is exposed under stats.
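The shard-level accumulation described above can be sketched roughly as follows. This is only an illustration: `PhaseStats` and its fields are hypothetical names, not the actual node stats classes.

```python
from dataclasses import dataclass


@dataclass
class PhaseStats:
    """Hypothetical accumulator for one search phase on one shard."""
    count: int = 0          # number of completed phase executions
    total_millis: int = 0   # sum of the time taken by those executions

    def record(self, took_millis: int) -> None:
        # Called once each time the phase completes on the shard.
        self.count += 1
        self.total_millis += took_millis

    @property
    def average_millis(self) -> float:
        # Overall average; 0.0 when nothing has been recorded yet.
        return self.total_millis / self.count if self.count else 0.0


query_stats = PhaseStats()
for took in (12, 30, 18):
    query_stats.record(took)
print(query_stats.average_millis)  # 20.0
```

An average like this hides individual slow requests, which is part of why request-level visibility is being asked for.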

But we don’t have a mechanism to track search latencies at the coordinator node. The coordinator node plays an important role in fanning out requests to individual shards/data nodes, aggregating their responses, and eventually sending the response back to the client. We have seen multiple issues in the past where it was hard or impossible to reason about latency-related problems because of the lack of insight into coordinator-level stats, and we ended up spending a lot of unnecessary time and bandwidth figuring them out. Clients using the search API can rely only on the overall took time (present in the search response), which doesn’t offer much insight into the time taken by the different phases.

Parent RFC: #7334

Describe the solution you'd like
Slow logs at coordinator level: As of now, we can only enable slow logs at the shard level for a desired search phase (query and fetch). See this. Setting this threshold is tricky when customers usually see latency spikes at the request level. Also, shard-level slow logs don't offer a holistic view. So as part of this, we will also add the capability to capture slow logs at the request level, along with the different search phases, from the coordinator node's perspective.
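To make the request-level idea concrete, here is a minimal sketch of how a coordinator could time the overall request and each phase. `RequestTimings` and its methods are hypothetical names chosen for illustration; the real implementation may look quite different.

```python
import time
from contextlib import contextmanager


class RequestTimings:
    """Hypothetical per-request timing record kept on the coordinator."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable so it can be faked in tests
        self._start = clock()          # request start as seen by the coordinator
        self.phase_millis = {}         # phase name -> accumulated milliseconds

    @contextmanager
    def phase(self, name):
        # Time one coordinator-side phase (e.g. "query" or "fetch"),
        # including fan-out and waiting for shard responses.
        started = self._clock()
        try:
            yield
        finally:
            elapsed = (self._clock() - started) * 1000.0
            self.phase_millis[name] = self.phase_millis.get(name, 0.0) + elapsed

    def total_millis(self):
        # Time since the request started on the coordinator.
        return (self._clock() - self._start) * 1000.0


timings = RequestTimings()
with timings.phase("query"):
    pass  # fan out the query phase to shards and collect results
with timings.phase("fetch"):
    pass  # fetch the selected documents
print(timings.phase_millis, timings.total_millis())
```

The per-phase and total values recorded this way are what a request-level slow log would compare against its thresholds.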

Additional context
Coordinator slow logs will be governed by cluster settings. We will offer thresholds for the following three intervals:

  1. Overall request
  2. Query phase
  3. Fetch phase

// Settings at the whole-request level
cluster.search.request.slowlog.threshold.warn: 10s
cluster.search.request.slowlog.threshold.info: 5s
cluster.search.request.slowlog.threshold.debug: 2s
cluster.search.request.slowlog.threshold.trace: 500ms

// Minimum level to print
cluster.search.request.slowlog.level: "trace"
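
One possible reading of how these settings interact is sketched below. The semantics are an assumption, not something this proposal specifies: a request is logged at the most severe level whose threshold it meets or exceeds, and it is suppressed if that level is below the configured minimum. `parse_millis` and `slowlog_level` are hypothetical helpers.

```python
import re

# Levels ordered from least to most severe.
LEVELS = ["trace", "debug", "info", "warn"]

PREFIX = "cluster.search.request.slowlog"

# Values copied from the proposed settings above.
SETTINGS = {
    f"{PREFIX}.threshold.warn": "10s",
    f"{PREFIX}.threshold.info": "5s",
    f"{PREFIX}.threshold.debug": "2s",
    f"{PREFIX}.threshold.trace": "500ms",
    f"{PREFIX}.level": "trace",
}


def parse_millis(value):
    """Parse a simple time value such as '500ms' or '10s' into milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s)", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * (1000 if unit == "s" else 1)


def slowlog_level(took_millis, settings=SETTINGS):
    """Return the level to log at, or None if the request is not logged."""
    min_index = LEVELS.index(settings[f"{PREFIX}.level"])
    # Check the most severe level first so the highest matching one wins.
    for level in reversed(LEVELS):
        threshold = parse_millis(settings[f"{PREFIX}.threshold.{level}"])
        if took_millis >= threshold:
            return level if LEVELS.index(level) >= min_index else None
    return None


print(slowlog_level(12_000))  # warn
print(slowlog_level(700))     # trace
print(slowlog_level(100))     # None
```

Under this reading, raising `cluster.search.request.slowlog.level` to "info" would suppress the 700 ms request while still logging the 12 s one.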

Metadata

Labels: Search, enhancement, feature, v2.12.0

Status: Done