Description
Is your feature request related to a problem? Please describe
In the current concurrent search paradigm we create slice_count times as many collectors as in non-concurrent search, and collect is called on these collectors concurrently. Afterwards, reduce is called sequentially on the search threadpool across all of these collectors.
There are two main scenarios where this sequential behavior can be problematic: [1] when the aggregators are nested and buildAggregation needs to BFS/DFS through the collector tree, and [2] when buildAggregation itself is an expensive operation for the given collector.
One such example is the keyword-terms-numeric-terms query, which is a nested terms aggregation. Even in the non-concurrent search case, nested terms aggregations suffer from a combinatorial explosion of buckets with each additional nesting level, and for concurrent search this problem is essentially multiplied by slice_count because the bucket creation during buildAggregation happens sequentially. The combinatorial explosion was partially addressed by #11585; however, the sequential work is still a bottleneck.
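To make the shape of the problem concrete, here is a minimal sketch of the current flow using plain java.util.concurrent. SliceCollector and ConcurrentSearchFlow are hypothetical stand-ins, not the actual OpenSearch classes: collect fans out across slices, but the per-collector buildAggregation pass afterwards runs on a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

// Hypothetical stand-in for a per-slice aggregation collector.
class SliceCollector {
    void collect() { /* per-slice doc collection, runs concurrently */ }

    long buildAggregation() {
        // Stand-in for the expensive bucket-building work; for nested
        // terms aggregations this walks the whole sub-aggregator tree.
        return ThreadLocalRandom.current().nextLong(1000);
    }
}

public class ConcurrentSearchFlow {
    public static void main(String[] args) throws Exception {
        int sliceCount = 4;
        ExecutorService indexSearcherPool = Executors.newFixedThreadPool(sliceCount);
        List<SliceCollector> collectors = new ArrayList<>();
        List<Future<?>> collects = new ArrayList<>();

        // Phase 1: collect() fans out across the index_searcher pool.
        for (int i = 0; i < sliceCount; i++) {
            SliceCollector c = new SliceCollector();
            collectors.add(c);
            collects.add(indexSearcherPool.submit(c::collect));
        }
        for (Future<?> f : collects) f.get();

        // Phase 2 (today): buildAggregation() runs sequentially on the
        // search thread for every slice collector -- the bottleneck this
        // issue describes, costing roughly sliceCount * buildAggregation.
        for (SliceCollector c : collectors) {
            c.buildAggregation();
        }
        indexSearcherPool.shutdown();
    }
}
```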
Here is a basic query profiler breakdown to further illustrate this point:
| Metric | weather-data-2016, concurrent search disabled, size=500 (in ns) | (in ms) | weather-data-2016, slice=2, slice_size=1150 (= 1.5 * shard_size + 10, size=500) (in ns) | (in ms) |
|---|---|---|---|---|
| Took | | 10918 | | 17567 |
| ConstantScoreQuery — time_in_nanos | 3021298801 | 3021.2988 | | 8023 |
| MatchAllDocsQuery — time_in_nanos | 1087697018 | 1087.69702 | | 2338 |
| rewrite_time | 27406 | 0.02741 | 10733 | 0.01073 |
| collector — time_in_nanos | 4758597960 | 4758.59796 | 3919422108 | 3919.42211 |
| GlobalOrdinalsStringTermsAggregator — time_in_nanos | 7655394436 | 7655.39444 | 9238307858 | 9238.30786 |
| collect | 2670211530 | 2670.21153 | 2349533379 | 2349.53338 |
| avg_collect | | | 2223026375 | 2223.02638 |
| collect_count | 33659481 | | 33659481 | |
| build_aggregation | 4976192164 | 4976.19216 | 12500064156 | 12500.06416 |
| avg_build_aggregation | | | 6250026755 | 6250.02676 |
| build_aggregation_count | 1 | | 2 | |
| NumericTermsAggregator — time_in_nanos | 4167152226 | 4167.15223 | 5953806596 | 5953.8066 |
| collect | 281349935 | 281.34994 | 7061608537 | 7061.60854 |
| avg_collect | | | 271087497 | 271.0875 |
| collect_count | 832960 | | 1329446 | |
| build_aggregation | 3885328997 | 3885.329 | 11216763404 | 11216.7634 |
| avg_build_aggregation | | | 5220390427 | 5220.39043 |
| build_aggregation_count | 1 | | 2 | |
| MaxAggregator — time_in_nanos | 540335 | 0.54034 | 2229363 | 2.22936 |
| collect | 458491 | 0.45849 | 5622515353 | 5622.51535 |
| avg_collect | | | 1576283 | 1.57628 |
| collect_count | 8360 | | 59800 | |
| build_aggregation | 69834 | 0.06983 | 5617813957 | 5617.81396 |
| avg_build_aggregation | | | 251549 | 0.25155 |
| build_aggregation_count | 1 | | 2 | |
We can see that for the NumericTermsAggregator, build_aggregation takes roughly 3x as long (11216.76 ms vs. 3885.33 ms) and still happens sequentially, and the combinatorial explosion is visible in the collect_count (1329446 vs. 832960).
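For intuition on where the extra collect_count and bucket work comes from, here is a back-of-the-envelope sketch. The inner bucket count is an illustrative assumption, not a measurement from the workload; only the shard_size/slice_size values follow the formula above.

```java
public class BucketExplosion {
    public static void main(String[] args) {
        // Illustrative assumption, not a workload measurement.
        int innerBucketsPerOuter = 50; // hypothetical numeric sub-buckets per keyword bucket

        int shardSize = (int) (1.5 * 500 + 10);       // default shard_size for size=500 -> 760
        int sliceSize = (int) (1.5 * shardSize + 10); // current slice_size formula -> 1150
        int sliceCount = 2;

        // Non-concurrent: one collector holds up to shard_size outer buckets.
        long nonConcurrent = (long) shardSize * innerBucketsPerOuter;

        // Concurrent: each of slice_count collectors holds up to slice_size
        // outer buckets, and today all of them are built on one thread.
        long concurrent = (long) sliceCount * sliceSize * innerBucketsPerOuter;

        System.out.println("buckets built, non-concurrent: " + nonConcurrent); // 38000
        System.out.println("buckets built, concurrent:     " + concurrent);   // 115000
    }
}
```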
Describe the solution you'd like
The additional combinatorial explosion is due to slice_size > shard_size, which is something we can revisit (i.e. does slice_size really need to be 1.5 * shard_size + 10, or can slice_size == shard_size?). However, the long pole is actually build_aggregation as a whole, which takes a long time because it happens sequentially. I propose that we move the buildAggregation step into processPostCollection so that it can happen in parallel on the index_searcher thread, as sketched below.
I will follow up with a PR for this change, along with some benchmarking data, to discuss further.
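As a rough illustration of the proposed change (again with hypothetical stand-in code rather than the real SearchContext/processPostCollection signatures), each slice would run buildAggregation on the index_searcher pool right after its collect, leaving only a cheap merge on the search thread:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;

public class ParallelBuildAggregation {
    public static void main(String[] args) throws Exception {
        int sliceCount = 4;
        ExecutorService indexSearcherPool = Executors.newFixedThreadPool(sliceCount);
        List<Callable<Long>> perSliceWork = new ArrayList<>();

        for (int i = 0; i < sliceCount; i++) {
            perSliceWork.add(() -> {
                // ... collect() for this slice runs here ...
                // then buildAggregation() immediately after, still on the
                // index_searcher thread (the processPostCollection idea).
                return expensiveBuildAggregation();
            });
        }

        // Each slice's buckets are built concurrently; the wall-clock cost
        // is ~1x buildAggregation instead of sliceCount * buildAggregation.
        List<Future<Long>> built = indexSearcherPool.invokeAll(perSliceWork);

        long merged = 0;
        for (Future<Long> f : built) merged += f.get(); // cheap sequential merge
        System.out.println("merged result: " + merged);
        indexSearcherPool.shutdown();
    }

    private static long expensiveBuildAggregation() {
        // Stand-in for bucket construction over a nested aggregator tree.
        long acc = 0;
        for (int i = 0; i < 1_000_000; i++) acc += i;
        return acc;
    }
}
```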
Related component
Search:Performance
Describe alternatives you've considered
No response
Additional context
No response