
[Concurrent Segment Search] shard_min_doc_count is not enforced at the slice level #11847

@jed326

Description

Currently, we do not enforce shard_min_doc_count at the slice level for concurrent segment search. The purpose of this issue is as follows:

  1. Explain what shard_min_doc_count does (this is missing from https://opensearch.org/docs/latest/aggregations/bucket/terms/)
  2. Explain the impact of not enforcing shard_min_doc_count at the slice level for concurrent segment search.
  3. Provide a place to gather feedback and use cases on whether this behavior needs to change.

1. What is shard_min_doc_count?

The min_doc_count option limits the terms returned to only those with at least min_doc_count documents in the bucket. shard_min_doc_count is the shard-level companion to this: it limits the terms returned from each shard to the coordinator node to only those with at least shard_min_doc_count documents. The shard-level option matters because an individual shard has no information about the global document count, so a higher shard_min_doc_count may eliminate terms that are globally above the frequency threshold but not locally. This can be especially problematic when a bucket ordering other than descending _count is used (including significant terms aggregations), as there is no bound on the number of documents that may be eliminated this way. To compensate, the shard_size parameter can be increased to allow more candidate terms on the shards; however, this is a resource and accuracy tradeoff that the user needs to make (I will use "resource" as an umbrella term for memory consumption, network usage, CPU runtime, etc.).
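As a toy illustration (plain Python, not OpenSearch code; the shard_top_terms helper is hypothetical), the sketch below models how a shard-level cutoff can eliminate a term that is globally frequent but below the threshold on every individual shard:

```python
from collections import Counter

def shard_top_terms(docs, shard_size, shard_min_doc_count):
    """Toy model of per-shard term collection: keep only terms with at
    least shard_min_doc_count local docs, then return the top shard_size."""
    counts = Counter(docs)
    kept = {t: c for t, c in counts.items() if c >= shard_min_doc_count}
    return dict(sorted(kept.items(), key=lambda kv: -kv[1])[:shard_size])

# "rare" appears 3 times on each of 3 shards (9 docs globally), but a
# shard_min_doc_count of 4 eliminates it on every shard before the reduce.
shards = [["common"] * 10 + ["rare"] * 3 for _ in range(3)]

reduced = Counter()
for shard in shards:
    reduced.update(shard_top_terms(shard, shard_size=10, shard_min_doc_count=4))

print(dict(reduced))  # {'common': 30} -- "rare" never reaches the coordinator
```

Raising shard_size (and lowering shard_min_doc_count) lets more candidates through at the cost of more work per shard, which is the tradeoff described above.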

Neither this parameter nor min_doc_count is documented in https://opensearch.org/docs/latest/aggregations/bucket/terms/ or https://opensearch.org/docs/latest/aggregations/bucket/significant-terms/, so I will create a separate docs issue to remedy that -- opensearch-project/documentation-website#6129.

2. How does shard_min_doc_count relate to concurrent segment search?

Previously, we decided not to enforce either shard_min_doc_count or shard_size at the slice level for concurrent segment search, to ensure that extra buckets are not eliminated at the slice level. See:

However, after running some benchmarks we realized this was problematic. In short, we would collect up to term_cardinality buckets on each segment slice, so on high-cardinality data this led to large runtime regressions. For example, a terms aggregation using concurrent search on a field with cardinality 100,000 would create 100,000 buckets per segment slice, regardless of the supplied shard_size parameter. See:
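A minimal sketch of the problem (plain Python, not OpenSearch code): with no slice-level size cap, a slice materializes one bucket per distinct term it collects, so memory and runtime scale with field cardinality rather than with the requested shard_size.

```python
# Simulate one segment slice collecting a high-cardinality field.
cardinality = 100_000
slice_docs = [i % cardinality for i in range(300_000)]  # this slice's doc values

slice_buckets = {}
for term in slice_docs:
    # Without a slice-level cap, every distinct term gets a bucket.
    slice_buckets[term] = slice_buckets.get(term, 0) + 1

print(len(slice_buckets))  # 100000 buckets on this one slice,
                           # no matter how small shard_size is
```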

To fix this, we decided to enforce a slice_size heuristic where slice_size == shard_size and also properly calculate the additional doc_count_error introduced by this. See:
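The heuristic and its error accounting can be sketched as follows (plain Python, not OpenSearch code; the error bookkeeping is one plausible accounting, analogous to the terms aggregation's shard-level doc_count_error): any term a slice drops could have had at most as many docs as the smallest bucket that slice kept, so that count is added to the error bound.

```python
def slice_reduce(slice_buckets, slice_size):
    """Toy model of the slice_size == shard_size heuristic: each slice
    keeps its top slice_size buckets, and each truncated slice contributes
    the count of its smallest kept bucket to the doc_count_error bound."""
    merged = {}
    error = 0
    for buckets in slice_buckets:
        top = sorted(buckets.items(), key=lambda kv: -kv[1])[:slice_size]
        if len(buckets) > slice_size:
            error += top[-1][1]  # max possible count of a dropped term
        for term, count in top:
            merged[term] = merged.get(term, 0) + count
    return merged, error

slices = [
    {"a": 10, "b": 7, "c": 5},
    {"a": 9, "c": 6, "d": 4},
]
merged, err = slice_reduce(slices, slice_size=2)
print(merged, err)  # {'a': 19, 'b': 7, 'c': 6} 13
```

Here "c" was dropped by the first slice and "d" by the second, so each slice's reported counts may understate the truth by up to 7 and 6 docs respectively, hence the error bound of 13.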

To bring it all together: whenever a small shard_size (shard_size << term_cardinality) is used together with shard_min_doc_count, a user may get unexpected results, because the top shard_size buckets returned from each slice may include buckets with doc_count < shard_min_doc_count, and these buckets are later eliminated during the shard-level reduce phase.
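This interaction can be sketched as a toy model (plain Python, not OpenSearch code; the shard_reduce helper is hypothetical): low-count terms occupy slots in each slice's top shard_size list, and only the shard-level reduce applies shard_min_doc_count, so the shard can end up returning fewer buckets than requested.

```python
def shard_reduce(slice_buckets, shard_size, shard_min_doc_count):
    """Toy model: slices keep their top shard_size buckets WITHOUT
    enforcing shard_min_doc_count; the filter is only applied at the
    shard-level reduce."""
    merged = {}
    for buckets in slice_buckets:
        top = sorted(buckets.items(), key=lambda kv: -kv[1])[:shard_size]
        for term, count in top:
            merged[term] = merged.get(term, 0) + count
    # shard_min_doc_count is enforced only here, after slice reduction
    return {t: c for t, c in merged.items() if c >= shard_min_doc_count}

# shard_size (2) << cardinality (4): low-count terms fill the per-slice
# top lists, then get filtered away during the shard-level reduce.
slices = [{"a": 5, "x": 2, "y": 2, "z": 2}, {"a": 5, "x": 2, "y": 2, "z": 2}]
result = shard_reduce(slices, shard_size=2, shard_min_doc_count=6)
print(result)  # {'a': 10} -- the shard returns fewer than shard_size buckets
```

In this sketch, "x" wastes a top-shard_size slot on both slices and is then eliminated by the shard_min_doc_count filter, so the shard returns one bucket even though shard_size is 2.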

3. Discussion

Today we do not enforce shard_min_doc_count at the slice level, to avoid eliminating buckets that are frequent across the shard as a whole but not within an individual slice. For shards, users can use a shard routing key (or even create separate indices) to control how their documents are organized across shards; however, the same control is not possible for slices. As new segments are created and old segments are merged, the document layout across slices can change, so the same term may not always be locally frequent.

One purpose of this issue is to gather discussion and feedback on whether this behavior should change.

Related component

Search:Aggregations
