Investigate the document order variance for benchmark

When indexing large datasets such as big5, we use the log merge policy. This policy merges only adjacent segments and inserts the merged segment at the position of the first segment involved in the merge. Together, these two properties should in theory produce perfectly aligned document order both within and across segments.
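The reasoning above can be checked with a toy model. The sketch below is not Lucene's implementation; it only assumes the two properties stated (adjacent-only merges, merged segment placed at the position of the first input) and models a segment as a plain list of source positions. The names `merge_adjacent` and `simulate` are hypothetical.

```python
import random


def merge_adjacent(segments, start, count):
    """Merge `count` adjacent segments beginning at index `start`.

    The merged segment is placed at the position of the first segment
    involved, and documents are concatenated in their existing order.
    """
    merged = [doc for seg in segments[start:start + count] for doc in seg]
    return segments[:start] + [merged] + segments[start + count:]


def simulate(num_docs=1000, flush_size=10, seed=0):
    """Build segments in source order, then merge random adjacent runs."""
    rng = random.Random(seed)
    # Each flush creates a segment holding the next docs in source order.
    segments = [
        list(range(i, min(i + flush_size, num_docs)))
        for i in range(0, num_docs, flush_size)
    ]
    # Keep merging randomly chosen runs of adjacent segments until one is left.
    while len(segments) > 1:
        count = rng.randint(2, min(10, len(segments)))
        start = rng.randint(0, len(segments) - count)
        segments = merge_adjacent(segments, start, count)
    return segments[0]


final_segment = simulate()
# True when the final segment is still in source order.
print(final_segment == list(range(1000)))
```

In this model the order is preserved for any merge schedule, which is why the disruption seen after a force merge suggests that at least one of the two properties does not hold on that path.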

In this context, document order alignment means that the docId order of the indexed documents perfectly matches their order in the source data. We can read DocValues on the `@timestamp` field to check the alignment. We expect this alignment only when a single bulk indexing client is used and the index is refreshed before the next bulk request is sent.
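Once the `@timestamp` values have been read out in docId order, the comparison itself is simple. The sketch below assumes both sequences are already available as Python lists (how they are extracted from DocValues is out of scope here); `first_mismatch` is a hypothetical helper and the timestamps are made-up illustrative values, not real big5 data.

```python
from itertools import zip_longest


def first_mismatch(source_timestamps, indexed_timestamps):
    """Return (position, source_value, indexed_value) for the first position
    where the two sequences differ, or None if they are identical.

    A missing value (one sequence is shorter) is reported as None.
    """
    pairs = zip_longest(source_timestamps, indexed_timestamps)
    for pos, (src, idx) in enumerate(pairs):
        if src != idx:
            return pos, src, idx
    return None


# Made-up values for illustration only.
source = ["2023-01-01T00:00:00", "2023-01-01T00:00:01", "2023-01-02T00:00:00"]
aligned = list(source)
disrupted = ["2023-01-13T00:00:00", "2023-01-01T00:00:00", "2023-01-01T00:00:01"]

print(first_mismatch(source, aligned))
print(first_mismatch(source, disrupted))
```

Reporting the first mismatching position, rather than a plain yes/no, makes it easier to see where in the segment the disruption starts.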

However, an issue has been discovered (see [GitHub issue #17404](https://github.com/opensearch-project/OpenSearch/issues/17404)) where forcing a merge to a single segment results in significant document order disruption. Specifically, while the source data for big5 starts with documents from 2023-01-01, the single merged segment can begin with documents from dates like 01-13 or 01-12.

We hypothesize that this problem is caused by logic that applies only when force merging to a single segment. Further investigation is needed to determine whether the issue persists when force merging to a target segment count greater than one.
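To test the hypothesis, the same index can be force merged to different target segment counts through the `_forcemerge` API and its `max_num_segments` parameter. The sketch below only builds the requests and does not send them; the host `http://localhost:9200`, the index name `big5`, the segment counts, and the helper `build_force_merge_request` are assumptions for illustration.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_force_merge_request(index, max_num_segments, host="http://localhost:9200"):
    """Build (but do not send) a POST request for the _forcemerge API."""
    if max_num_segments < 1:
        raise ValueError("max_num_segments must be at least 1")
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{host}/{quote(index, safe='')}/_forcemerge?{query}"
    return Request(url, method="POST")


# Assumed segment counts to compare: 1 (the failing case) and two larger values.
for target in (1, 2, 5):
    req = build_force_merge_request("big5", target)
    print(req.get_method(), req.full_url)
```

Each request could then be sent with `urllib.request.urlopen(req)` against a test cluster, followed by the document order check on the resulting segments.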

### Why we want to do this

Force merging to 1 segment is a common operation when we want to compare search performance between 2 versions of code that use different Lucene codecs.
Without a force merge to 1 segment, the size and number of segments are hard to control, which causes variance in performance results. (Previous investigation: https://github.com/opensearch-project/opensearch-benchmark/issues/398)

This investigation is expected to give us a deeper understanding of the log merge policy's behavior and help us determine whether the variance in performance results comes from indexing/merging or from search.

