Optimize 2 keyword multi-terms aggregation#13929
sandeshkr419 wants to merge 1 commit into opensearch-project:main from
Conversation
❌ Gradle check result for bbd49c6: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
For the POC, I ran the query below for the big5 workload and saw a 50% reduction in service time. Benchmark with big5 workload: Total docs in index: 116000000 (11.6*10^7)
null - indicates the request timed out. I tried to benchmark this against the eventdata workload, since the query time in the big5 workload above was too high and I needed a smaller dataset to establish gains, but sadly it doesn't look like the change improves the results. It may actually end up worsening performance. Total docs in index: 20000000 (2*10^7)
```java
while (postings1.docID() != PostingsEnum.NO_MORE_DOCS && postings2.docID() != PostingsEnum.NO_MORE_DOCS) {
    // Count of intersecting docs to get number of docs in each bucket
    if (postings1.docID() == postings2.docID()) {
        bucketCount++;
        postings1.nextDoc();
        postings2.nextDoc();
    } else if (postings1.docID() < postings2.docID()) {
        postings1.advance(postings2.docID());
    } else {
        postings2.advance(postings1.docID());
    }
}
```
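The leapfrog loop above can be exercised outside Lucene with plain sorted int arrays standing in for posting lists. This is a hedged, self-contained sketch rather than the PR's code: `NO_MORE_DOCS` is modeled as `Integer.MAX_VALUE`, and `advance` is a linear skip (real Lucene postings use skip data to jump ahead).

```java
public class LeapfrogIntersection {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    // Minimal stand-in for PostingsEnum over a sorted doc-id array (illustrative only).
    static final class Postings {
        private final int[] docs;
        private int pos = -1;
        Postings(int[] docs) { this.docs = docs; }
        int docID() { return pos < 0 ? -1 : (pos >= docs.length ? NO_MORE_DOCS : docs[pos]); }
        int nextDoc() { pos++; return docID(); }
        // Advance to the first doc >= target; linear here, skip-list-backed in Lucene.
        int advance(int target) {
            do { pos++; } while (pos < docs.length && docs[pos] < target);
            return docID();
        }
    }

    // Leapfrog intersection: count doc ids present in both posting lists.
    static int intersectionCount(Postings p1, Postings p2) {
        int count = 0;
        p1.nextDoc();
        p2.nextDoc();
        while (p1.docID() != NO_MORE_DOCS && p2.docID() != NO_MORE_DOCS) {
            if (p1.docID() == p2.docID()) {
                count++;
                p1.nextDoc();
                p2.nextDoc();
            } else if (p1.docID() < p2.docID()) {
                p1.advance(p2.docID());   // skip p1 forward to p2's position
            } else {
                p2.advance(p1.docID());   // skip p2 forward to p1's position
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Postings a = new Postings(new int[]{1, 3, 5, 8, 13});
        Postings b = new Postings(new int[]{2, 3, 8, 9, 13, 21});
        System.out.println(intersectionCount(a, b)); // docs 3, 8, 13 -> prints 3
    }
}
```

As the review below notes, the cost of this walk grows with list length and with how interleaved the two lists are, since every comparison touches at most one new document.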
We need to optimize this method.
Could you create a FixedBitSet and use intersectionCount? https://github.com/apache/lucene/blob/ebea2e1492c95b5d6b1e1032485598f901bda286/lucene/core/src/java/org/apache/lucene/util/FixedBitSet.java#L74
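The suggestion replaces doc-by-doc leapfrogging with a word-wise AND-and-popcount over two bitsets, which is what Lucene's `FixedBitSet.intersectionCount` does over packed `long` words. A hedged stand-alone sketch of the same idea using the JDK's `java.util.BitSet` instead of Lucene's class (the `docsA`/`docsB` names are illustrative):

```java
import java.util.BitSet;

public class BitSetIntersection {
    // Count docs set in both bitsets, analogous in spirit to FixedBitSet.intersectionCount:
    // AND the underlying words and popcount, instead of walking posting lists doc by doc.
    static long intersectionCount(BitSet a, BitSet b) {
        BitSet tmp = (BitSet) a.clone(); // work on a copy so the inputs stay intact
        tmp.and(b);
        return tmp.cardinality();
    }

    public static void main(String[] args) {
        BitSet docsA = new BitSet();
        BitSet docsB = new BitSet();
        for (int d : new int[]{1, 3, 5, 8, 13}) docsA.set(d);
        for (int d : new int[]{2, 3, 8, 9, 13, 21}) docsB.set(d);
        System.out.println(intersectionCount(docsA, docsB)); // docs 3, 8, 13 -> prints 3
    }
}
```

The trade-off: bitsets cost O(maxDoc) memory per term but make each pairwise intersection a branch-free scan over 64-bit words, which is typically much cheaper than leapfrogging when lists are long or dense.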
Agreed. The complexity of intersection logic is highly dependent on the documents in the posting lists. With larger datasets and higher cardinality, the leapfrogging method for intersection evaluation would require more frequent iterations over these lists, which can be expensive.
This PR is stalled because it has been open for 30 days with no activity.
Closing, as there are no plans to continue down this optimization path.
Description
Optimize multi-terms aggregation for the case:
The optimization changes how buckets are collected for a segment. For the above case, the aggregator currently reads the values of each aggregated field for every document, computes the composite key, and then updates the bucket count. The optimization instead reads posting enums directly, so composite keys are no longer computed per document: each composite key is created only once, and its bucket count is obtained by counting the intersection of the corresponding posting lists.
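The described scheme can be sketched as: materialize each term's doc-id set once per segment, then count each composite bucket (a, b) as the size of the intersection of the two sets, rather than building a composite key per document. A hedged, in-memory illustration — the field/term names and the `"|"` key separator are made up, and real postings would be Lucene enums, not `Set<Integer>`:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class CompositeBucketCount {
    // Per-field postings: term -> set of doc ids (a stand-in for per-term PostingsEnum).
    static Map<String, Long> bucketCounts(Map<String, Set<Integer>> fieldA,
                                          Map<String, Set<Integer>> fieldB) {
        Map<String, Long> buckets = new HashMap<>();
        for (var a : fieldA.entrySet()) {
            for (var b : fieldB.entrySet()) {
                // Bucket count = |docs(a) ∩ docs(b)|, computed once per composite key.
                long count = a.getValue().stream().filter(b.getValue()::contains).count();
                if (count > 0) buckets.put(a.getKey() + "|" + b.getKey(), count);
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> status = Map.of(
            "200", Set.of(1, 2, 3, 5),
            "404", Set.of(4, 6));
        Map<String, Set<Integer>> method = Map.of(
            "GET", Set.of(1, 2, 4),
            "PUT", Set.of(3, 5, 6));
        System.out.println(bucketCounts(status, method));
        // e.g. {200|GET=2, 200|PUT=2, 404|GET=1, 404|PUT=1} (map order unspecified)
    }
}
```

Note the cost model this implies: work becomes proportional to the number of composite keys times the intersection cost, instead of the number of documents, which is why the benchmark results above depend so strongly on field cardinality.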
Related Issues
Resolves #13120
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.