
[Concurrent Segment Search] shard_min_doc_count is not enforced at the slice level #11847

@jed326

Description

Currently, we do not enforce shard_min_doc_count at the slice level for concurrent segment search. The purpose of this issue is as follows:

  1. Explain what shard_min_doc_count does (this is missing from https://opensearch.org/docs/latest/aggregations/bucket/terms/)
  2. Explain the impact of not enforcing shard_min_doc_count at the slice level for concurrent segment search.
  3. Provide a place to gather feedback and use cases on whether this behavior needs to change.

1. What is shard_min_doc_count?

The min_doc_count option limits the terms returned to only those with at least min_doc_count documents in the bucket. shard_min_doc_count is the shard-level companion to this: it limits the terms returned from each shard to the coordinator node to only those with at least shard_min_doc_count documents. The shard-level option matters because an individual shard has no information about the global document count, so a higher shard_min_doc_count may eliminate terms that are globally above the frequency threshold but not locally. This can be especially problematic when a bucket ordering other than descending _count is used (including significant terms aggregations), as there is no bound on the number of documents that may be eliminated this way. To compensate, the shard_size parameter can be increased to allow more candidate terms on the shards; however, this is a resource and accuracy tradeoff that the user needs to make (I will use "resource" as an umbrella term for memory consumption, network usage, CPU runtime, etc.).
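As a toy illustration (plain Python, not OpenSearch code; the shard_top_terms helper is hypothetical), the sketch below models how a shard-level cutoff can eliminate a term that is globally frequent but below the threshold on every individual shard:

```python
from collections import Counter

def shard_top_terms(docs, shard_size, shard_min_doc_count):
    """Toy model of per-shard term collection: keep only terms with at
    least shard_min_doc_count local docs, then return the top shard_size."""
    counts = Counter(docs)
    kept = {t: c for t, c in counts.items() if c >= shard_min_doc_count}
    return dict(sorted(kept.items(), key=lambda kv: -kv[1])[:shard_size])

# "rare" appears 3 times on each of 3 shards (9 docs globally), but a
# shard_min_doc_count of 4 eliminates it on every shard before the reduce.
shards = [["common"] * 10 + ["rare"] * 3 for _ in range(3)]

reduced = Counter()
for shard in shards:
    reduced.update(shard_top_terms(shard, shard_size=10, shard_min_doc_count=4))

print(dict(reduced))  # {'common': 30} -- "rare" never reaches the coordinator
```

Raising shard_size (and lowering shard_min_doc_count) lets more candidates through at the cost of more work per shard, which is the tradeoff described above.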

Neither this parameter nor min_doc_count is documented in https://opensearch.org/docs/latest/aggregations/bucket/terms/ or https://opensearch.org/docs/latest/aggregations/bucket/significant-terms/, so I will create a separate docs issue to remedy that -- opensearch-project/documentation-website#6129.

2. How does shard_min_doc_count relate to concurrent segment search?

Previously, we decided not to enforce either shard_min_doc_count or shard_size at the slice level for concurrent segment search, to ensure that extra buckets are not eliminated at the slice level. See:

However, after running some benchmarks we realized this was problematic. In short, we would collect up to term_cardinality buckets on each segment slice, so on high-cardinality data this led to large runtime regressions. For example, a terms aggregation using concurrent search on a field with cardinality 100,000 would create 100,000 buckets per segment slice, regardless of the supplied shard_size parameter. See:
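A minimal sketch of the problem (plain Python, not OpenSearch code): with no slice-level size cap, a slice materializes one bucket per distinct term it collects, so memory and runtime scale with field cardinality rather than with the requested shard_size.

```python
# Simulate one segment slice collecting a high-cardinality field.
cardinality = 100_000
slice_docs = [i % cardinality for i in range(300_000)]  # this slice's doc values

slice_buckets = {}
for term in slice_docs:
    # Without a slice-level cap, every distinct term gets a bucket.
    slice_buckets[term] = slice_buckets.get(term, 0) + 1

print(len(slice_buckets))  # 100000 buckets on this one slice,
                           # no matter how small shard_size is
```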

To fix this, we decided to enforce a slice_size heuristic where slice_size == shard_size and also properly calculate the additional doc_count_error introduced by this. See:
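The heuristic and its error accounting can be sketched as follows (plain Python, not OpenSearch code; the error bookkeeping is one plausible accounting, analogous to the terms aggregation's shard-level doc_count_error): any term a slice drops could have had at most as many docs as the smallest bucket that slice kept, so that count is added to the error bound.

```python
def slice_reduce(slice_buckets, slice_size):
    """Toy model of the slice_size == shard_size heuristic: each slice
    keeps its top slice_size buckets, and each truncated slice contributes
    the count of its smallest kept bucket to the doc_count_error bound."""
    merged = {}
    error = 0
    for buckets in slice_buckets:
        top = sorted(buckets.items(), key=lambda kv: -kv[1])[:slice_size]
        if len(buckets) > slice_size:
            error += top[-1][1]  # max possible count of a dropped term
        for term, count in top:
            merged[term] = merged.get(term, 0) + count
    return merged, error

slices = [
    {"a": 10, "b": 7, "c": 5},
    {"a": 9, "c": 6, "d": 4},
]
merged, err = slice_reduce(slices, slice_size=2)
print(merged, err)  # {'a': 19, 'b': 7, 'c': 6} 13
```

Here "c" was dropped by the first slice and "d" by the second, so each slice's reported counts may understate the truth by up to 7 and 6 docs respectively, hence the error bound of 13.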

To bring it all together: whenever a small shard_size (shard_size << term_cardinality) is used together with shard_min_doc_count, a user may get unexpected results, because the top shard_size buckets returned from each slice may include buckets with doc_count < shard_min_doc_count, and these buckets are later eliminated during the shard-level reduce phase.
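This interaction can be sketched as a toy model (plain Python, not OpenSearch code; the shard_reduce helper is hypothetical): low-count terms occupy slots in each slice's top shard_size list, and only the shard-level reduce applies shard_min_doc_count, so the shard can end up returning fewer buckets than requested.

```python
def shard_reduce(slice_buckets, shard_size, shard_min_doc_count):
    """Toy model: slices keep their top shard_size buckets WITHOUT
    enforcing shard_min_doc_count; the filter is only applied at the
    shard-level reduce."""
    merged = {}
    for buckets in slice_buckets:
        top = sorted(buckets.items(), key=lambda kv: -kv[1])[:shard_size]
        for term, count in top:
            merged[term] = merged.get(term, 0) + count
    # shard_min_doc_count is enforced only here, after slice reduction
    return {t: c for t, c in merged.items() if c >= shard_min_doc_count}

# shard_size (2) << cardinality (4): low-count terms fill the per-slice
# top lists, then get filtered away during the shard-level reduce.
slices = [{"a": 5, "x": 2, "y": 2, "z": 2}, {"a": 5, "x": 2, "y": 2, "z": 2}]
result = shard_reduce(slices, shard_size=2, shard_min_doc_count=6)
print(result)  # {'a': 10} -- the shard returns fewer than shard_size buckets
```

In this sketch, "x" wastes a top-shard_size slot on both slices and is then eliminated by the shard_min_doc_count filter, so the shard returns one bucket even though shard_size is 2.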

3. Discussion

Today we do not enforce shard_min_doc_count at the slice level, to avoid eliminating buckets that are frequent across the shard as a whole but not within an individual slice. For shards, users can use a shard routing key (or even create separate indices) to control how their documents are organized across shards; however, the same control is not possible for slices. As new segments are created and old segments are merged, the document layout across slices can change, so the same term may not always be locally frequent.

One purpose of this issue is to gather discussion and feedback on whether this behavior should change.

Related component

Search:Aggregations
