Description
When indexing large datasets such as big5, we use the log merge policy. This policy merges only adjacent segments, and the merged segment is inserted at the position of the first segment involved in the merge. Together, these two properties should in theory preserve document order both within and across segments.
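The reasoning above can be illustrated with a small simulation. This is only a sketch, not Lucene code: segments are modeled as plain lists of source-order document ordinals, and the helper names (`merge_adjacent`, `merge_scattered`, `is_source_ordered`) are hypothetical. It shows that merging adjacent segments and placing the result at the first segment's position keeps the overall order, while merging non-adjacent segments does not.

```python
from typing import List

Segment = List[int]  # source-order ordinals of the documents in one segment


def merge_adjacent(segments: List[Segment], start: int, count: int) -> List[Segment]:
    """Merge `count` adjacent segments beginning at `start`.

    The merged segment takes the position of the first merged segment,
    mirroring the two properties described in the text.
    """
    if count < 2 or start < 0 or start + count > len(segments):
        raise ValueError("merge range must cover at least two existing segments")
    merged = [doc for seg in segments[start:start + count] for doc in seg]
    return segments[:start] + [merged] + segments[start + count:]


def merge_scattered(segments: List[Segment], picks: List[int]) -> List[Segment]:
    """Contrast case: merge arbitrary (possibly non-adjacent) segments.

    The result is still placed at the position of the first picked segment.
    """
    ordered = sorted(picks)
    merged = [doc for i in ordered for doc in segments[i]]
    rest = [seg for i, seg in enumerate(segments) if i not in ordered]
    return rest[:ordered[0]] + [merged] + rest[ordered[0]:]


def is_source_ordered(segments: List[Segment]) -> bool:
    """True if walking all segments in order yields strictly increasing ordinals."""
    flat = [doc for seg in segments for doc in seg]
    return all(a < b for a, b in zip(flat, flat[1:]))


segments = [[0, 1], [2, 3], [4, 5], [6, 7]]
after_adjacent = merge_adjacent(segments, start=1, count=2)
after_scattered = merge_scattered(segments, picks=[0, 2])

print(after_adjacent, is_source_ordered(after_adjacent))
print(after_scattered, is_source_ordered(after_scattered))
```

In this toy model, only the adjacent merge keeps the flattened ordinals increasing, which is why the disruption observed after a force merge is unexpected.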
In this context, document order alignment means that the docId order of documents in the index exactly matches their order in the source data. We can check the alignment by reading DocValues on the @timestamp field. The alignment is expected only when a single bulk indexing client is used and the index is refreshed before the next bulk request is sent.
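A minimal sketch of such an alignment check follows. It assumes the @timestamp values have already been extracted into two lists, one in source order and one in docId order (for example, read from DocValues); the extraction itself is not shown, and the function name `first_mismatch` and the sample timestamps are hypothetical.

```python
from typing import List, Optional


def first_mismatch(source: List[str], indexed: List[str]) -> Optional[int]:
    """Return the first docId at which the indexed timestamp differs from the
    timestamp at the same position in the source data, or None if aligned."""
    for doc_id, (expected, actual) in enumerate(zip(source, indexed)):
        if expected != actual:
            return doc_id
    if len(source) != len(indexed):
        # One sequence is a strict prefix of the other.
        return min(len(source), len(indexed))
    return None


# Illustrative values only, not taken from the real big5 data.
source_ts = [
    "2023-01-01T00:00:00",
    "2023-01-01T00:00:01",
    "2023-01-12T08:00:00",
    "2023-01-13T09:30:00",
]
aligned_ts = list(source_ts)
disrupted_ts = [source_ts[3], source_ts[2], source_ts[0], source_ts[1]]

print(first_mismatch(source_ts, aligned_ts))
print(first_mismatch(source_ts, disrupted_ts))
```

Comparing against the source sequence directly, rather than checking that timestamps are sorted, avoids assuming that the source data itself is ordered by timestamp.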
However, an issue has been discovered (see GitHub issue #17404) where force merging to a single segment significantly disrupts document order. Specifically, while the source data for big5 starts with documents from 2023-01-01, the single merged segment can begin with documents from later dates such as 01-13 or 01-12.
We hypothesize that the problem is caused by logic specific to force merging to a single segment. Further investigation is also needed to determine whether the issue persists when force merging to an arbitrary number of segments greater than one.
Why we want to do this
Force merging to 1 segment is a common operation when we want to compare search performance between 2 versions of code that use different Lucene codecs.
Without force merging to 1 segment, the segment size and count are not easily controllable, which causes variance in performance results (previous investigation: opensearch-project/opensearch-benchmark#398).
This issue is expected to give us a deeper understanding of the behavior of the log merge policy and to help us isolate whether performance result variance comes from indexing/merging or from search.