
[Feature Request] Improve ApproximatePointRangeQuery Traversal for Skewed Datasets with DFS Strategy #18341

@prudhvigodithi

Description


Is your feature request related to a problem? Please describe

Background

Coming from #18313 (comment): we identified a regression in the approximation framework, especially with ascending sorts when the dataset is skewed (for example, http_logs). The nightly benchmark dashboard also shows the regression for asc_sort_timestamp.

While working on #14406 to extend approximation to other numeric types (starting with Integer), I tested http_logs asc_sort_size after extending approximation and noticed that it introduced a regression:

| Metric | Task | Value | Unit |
| --- | --- | ---: | --- |
| Heap used for points | | 0 | MB |
| Heap used for stored fields | | 0 | MB |
| Segment count | | 35 | |
| Min Throughput | asc_sort_size | 0.5 | ops/s |
| Mean Throughput | asc_sort_size | 0.5 | ops/s |
| Median Throughput | asc_sort_size | 0.5 | ops/s |
| Max Throughput | asc_sort_size | 0.5 | ops/s |
| 50th percentile latency | asc_sort_size | 32.4552 | ms |
| 90th percentile latency | asc_sort_size | 35.7239 | ms |
| 99th percentile latency | asc_sort_size | 57.71 | ms |
| 100th percentile latency | asc_sort_size | 58.15 | ms |
| 50th percentile service time | asc_sort_size | 30.0029 | ms |
| 90th percentile service time | asc_sort_size | 33.4095 | ms |
| 99th percentile service time | asc_sort_size | 54.7088 | ms |
| 100th percentile service time | asc_sort_size | 55.0696 | ms |
| error rate | asc_sort_size | 0 | % |

The nightly benchmark dashboards clearly show that the default Lucene optimization is faster.

Overview

The current implementation of ApproximatePointRangeQuery in OpenSearch shows a good performance improvement for workloads like big5, but it does not perform as well on skewed datasets like http_logs.

In fact, on skewed datasets like http_logs, Lucene's original PointRangeQuery outperformed ApproximatePointRangeQuery, especially for ascending sort queries. This is because of Lucene's getInverseIntersectVisitor strategy, which includes optimizations such as a fast path for dense matches that returns DocIdSetIterator.all(maxDoc). With approximation this is not the case: even though we short-circuit, we still have to traverse the dense nodes, rather than, as Lucene does, returning all docs and removing the non-matching ones.
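The inverse-intersect idea can be illustrated with a minimal, self-contained sketch (this is not Lucene's actual code; the class and method names here are hypothetical). When most documents in a segment match the range, it is cheaper to mark the few non-matching docs and return "all docs minus marked" than to visit every matching doc:

```java
import java.util.BitSet;

public class InverseIntersectSketch {
    /**
     * Returns the docs whose value falls in [min, max], built inversely:
     * mark the (few) NON-matching docs, then subtract them from "all docs",
     * mirroring the DocIdSetIterator.all(maxDoc) fast path for dense matches.
     */
    static BitSet matchDense(long[] values, long min, long max) {
        int maxDoc = values.length;
        BitSet nonMatching = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (values[doc] < min || values[doc] > max) {
                nonMatching.set(doc); // cheap when the range is dense
            }
        }
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc);       // analogue of DocIdSetIterator.all(maxDoc)
        all.andNot(nonMatching);  // remove the non-matching docs
        return all;
    }

    public static void main(String[] args) {
        long[] values = {1, 2, 3, 100, 4, 5};
        System.out.println(matchDense(values, 1, 10).cardinality()); // 5
    }
}
```

The key point is that the cost scales with the number of non-matching docs, which is small exactly in the dense case where the approximation framework currently still walks the dense nodes.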

For http_logs, the following are the split ratios; notice that the tree is skewed and unbalanced:

(used the BKDTreeInspector from https://github.com/msfroh/lucene-university/pull/28/files to inspect the tree)

Level split values (sampling):
Node 1 (Level 0): Range [897249601000 to 897854400000], Size: 36297158
  Split value: 897585682000
  Split ratios: left=0.56, right=0.44
Node 2 (Level 1): Range [897249601000 to 897585682000], Size: 19520000
  Split value: 897504395000
  Split ratios: left=0.76, right=0.24
Node 4 (Level 2): Range [897249601000 to 897504395000], Size: 11131392
  Split value: 897485609000
  Split ratios: left=0.93, right=0.07
Node 8 (Level 3): Range [897249601000 to 897485609000], Size: 6937088
  Split value: 897413980000
  Split ratios: left=0.70, right=0.30
Node 16 (Level 4): Range [897249601000 to 897413980000], Size: 4194304

But notice that for big5 the dataset is not skewed:

Level split values (sampling):
Node 1 (Level 0): Range [1673571600000 to 1673575198000], Size: 372372
  Split value: 1673573936000
  Split ratios: left=0.65, right=0.35
Node 2 (Level 1): Range [1673571600000 to 1673573936000], Size: 241664
  Split value: 1673572867000
  Split ratios: left=0.54, right=0.46
Node 4 (Level 2): Range [1673571600000 to 1673572867000], Size: 131072
  Split value: 1673572235000
  Split ratios: left=0.50, right=0.50
Node 8 (Level 3): Range [1673571600000 to 1673572235000], Size: 65536
  Split value: 1673571917000
  Split ratios: left=0.50, right=0.50
Node 16 (Level 4): Range [1673571600000 to 1673571917000], Size: 32768
  Split value: 1673571759000
  Split ratios: left=0.50, right=0.50
Node 32 (Level 5): Range [1673571600000 to 1673571759000], Size: 16384
  Split value: 1673571681000
  Split ratios: left=0.51, right=0.49
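A node's split ratio is simply the fraction of its points that fall on each side of the split value. The inspector dump above reports the ratios directly; a hypothetical helper that derives them from a parent size and left-child size (the left-child count below is illustrative, chosen to match the ~0.93/0.07 split of Node 4 in the http_logs dump) looks like:

```java
public class SplitRatioSketch {
    /** Returns {leftRatio, rightRatio} for a BKD node split. */
    static double[] splitRatios(long parentSize, long leftSize) {
        double left = (double) leftSize / parentSize;
        return new double[] { left, 1.0 - left };
    }

    public static void main(String[] args) {
        // Node 4 from the http_logs dump: 11131392 points, ~93% on the left
        double[] r = splitRatios(11131392, 10352194);
        System.out.printf("left=%.2f, right=%.2f%n", r[0], r[1]);
    }
}
```

A balanced tree (like big5's) keeps both ratios near 0.50 at every level, which is what makes level-by-level traversal cheap there and expensive on http_logs.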

Describe the solution you'd like

Example tree

                  N0 (1000 docs)
                 /             \
         N1 (900)              N2 (100)
        /      \
  N3 (800)     L3 (100)
   /     \
L1(600)  L2(200)

The existing code has a custom traversal, the intersectLeft method, which is BFS-like in nature. It traverses level by level, always exploring the "left" subtree first at each branch, but it still effectively scans broad portions of the tree before diving deeper.

A DFS-based strategy naturally produces results in sorted order for numeric fields. It immediately descends to the leftmost leaf (the smallest values; for timestamp fields, the oldest) and starts visiting documents. If, say, the query only needs the top 100 results, the DFS can find those in the leftmost leaves before visiting higher-value regions. As soon as the collector has gathered 100 docs, it can trigger early termination, and the traversal can be aborted without ever exploring the large portions of the tree that correspond to higher timestamp values. In other words, DFS yields earlier and more efficient short-circuiting because it finds the required values first.

N0
└── N1
    └── N3
        └── L1  (collect 100 docs)
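The proposed left-first DFS with early termination can be sketched over a toy BKD-like tree (a simplified model, not the ApproximatePointRangeQuery implementation; class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

public class DfsCollectSketch {
    static class Node {
        Node left, right; // null for leaves
        int[] docs;       // docs in a leaf, ordered by value
        Node(Node left, Node right) { this.left = left; this.right = right; }
        Node(int[] docs) { this.docs = docs; }
    }

    /** Depth-first, left-first collection that stops once `limit` docs are gathered. */
    static void collect(Node node, List<Integer> out, int limit) {
        if (node == null || out.size() >= limit) return; // early termination
        if (node.docs != null) {
            for (int doc : node.docs) {
                if (out.size() >= limit) return;
                out.add(doc);
            }
            return;
        }
        collect(node.left, out, limit);  // smallest values first
        collect(node.right, out, limit); // skipped entirely if limit already hit
    }

    public static void main(String[] args) {
        // Mirrors the example tree: leftmost leaf holds the smallest values
        Node l1 = new Node(new int[]{1, 2, 3});
        Node l2 = new Node(new int[]{4, 5});
        Node l3 = new Node(new int[]{6, 7});
        Node root = new Node(new Node(l1, l2), l3);

        List<Integer> out = new ArrayList<>();
        collect(root, out, 2);
        System.out.println(out); // [1, 2] — right subtrees never visited
    }
}
```

Because the recursion unwinds as soon as the limit is reached, a skewed tree like http_logs' costs only the depth of the leftmost path plus one leaf, instead of a broad level-by-level scan.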

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response
