
[Concurrent Segment Search] Perform buildAggregation in parallel #11673

@jed326

Description


Is your feature request related to a problem? Please describe

In the current concurrent search paradigm we create slice_count times as many collectors as non-concurrent search, and collect is called on these collectors concurrently. Afterwards, reduce is called sequentially on the search threadpool for all of these collectors.
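To make the two phases concrete, here is a minimal sketch of the pattern described above (the types and method names are illustrative, not the actual OpenSearch classes): per-slice collectors collect in parallel, then the reduce phase walks them one at a time on a single thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class SliceReduceSketch {
    // Hypothetical stand-in for a per-slice collector.
    public interface SliceCollector {
        void collect();            // phase 1: runs concurrently, one task per slice
        long buildAggregation();   // phase 2: runs sequentially during reduce
    }

    public static long searchThenReduce(List<SliceCollector> collectors, ExecutorService indexSearcherPool)
            throws Exception {
        // Phase 1: collect concurrently on the index_searcher threadpool.
        List<Future<?>> futures = new ArrayList<>();
        for (SliceCollector c : collectors) {
            futures.add(indexSearcherPool.submit(c::collect));
        }
        for (Future<?> f : futures) f.get();

        // Phase 2: reduce sequentially on the search thread. This is the
        // bottleneck the issue describes when buildAggregation is expensive:
        // the per-slice work runs back-to-back instead of overlapping.
        long total = 0;
        for (SliceCollector c : collectors) {
            total += c.buildAggregation();
        }
        return total;
    }
}
```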

There are two main scenarios where this sequential behavior is problematic: [1] when the aggregators are nested and buildAggregation needs to BFS/DFS through the collector tree, and [2] when buildAggregation itself is an expensive operation for the given collector.

One such example is the keyword-terms-numeric-terms workload, a nested terms aggregation. Even in the non-concurrent search case, nested terms aggregations suffer from a combinatorial explosion of buckets with each additional nested layer, and for concurrent search this problem is essentially multiplied by slice_count because the bucket creation during buildAggregation is done sequentially. The combinatorial explosion was partially addressed by #11585; however, the sequential work is still a bottleneck.

Here is some basic query profiler breakdown to further illustrate this point:

| Metric | weather-data-2016, disabled, size=500 | weather-data-2016, slice=2, slice_size=1150 (1.5 * shard_size + 10; size=500) |
| --- | --- | --- |
| Took (ms) | 10918 | 17567 |
| ConstantScoreQuery — time_in_nanos | 3021298801 (3021.2988 ms) | 8023 |
| MatchAllDocsQuery — time_in_nanos | 1087697018 (1087.69702 ms) | 2338 |
| rewrite_time | 27406 | 10733 |
| collector — time_in_nanos | 4758597960 (4758.59796 ms) | 3919422108 (3919.42211 ms) |
| GlobalOrdinalsStringTermsAggregator — time_in_nanos | 7655394436 (7655.39444 ms) | 9238307858 (9238.30786 ms) |
| collect | 2670211530 (2670.21153 ms) | 2349533379 (2349.53338 ms) |
| avg_collect | — | 2223026375 (2223.02638 ms) |
| collect_count | 33659481 | 33659481 |
| build_aggregation | 4976192164 (4976.19216 ms) | 12500064156 (12500.06416 ms) |
| avg_build_aggregation | — | 6250026755 (6250.02676 ms) |
| build_aggregation_count | 1 | 2 |
| NumericTermsAggregator — time_in_nanos | 4167152226 (4167.15223 ms) | 5953806596 (5953.8066 ms) |
| collect | 281349935 (281.34994 ms) | 7061608537 (7061.60854 ms) |
| avg_collect | — | 271087497 (271.0875 ms) |
| collect_count | 832960 | 1329446 |
| build_aggregation | 3885328997 (3885.329 ms) | 11216763404 (11216.7634 ms) |
| avg_build_aggregation | — | 5220390427 (5220.39043 ms) |
| build_aggregation_count | 1 | 2 |
| MaxAggregator — time_in_nanos | 540335 (0.54034 ms) | 2229363 (2.22936 ms) |
| collect | 458491 (0.45849 ms) | 5622515353 (5622.51535 ms) |
| avg_collect | — | 1576283 (1.57628 ms) |
| collect_count | 8360 | 59800 |
| build_aggregation | 69834 (0.06983 ms) | 5617813957 (5617.81396 ms) |
| avg_build_aggregation | — | 251549 (0.25155 ms) |
| build_aggregation_count | 1 | 2 |

We can see that for the NumericTermsAggregator, build_aggregation takes 3-4x as long and happens sequentially, in addition to the combinatorial explosion visible in collect_count.
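As an aside, the slice_size=1150 in the table header can be reproduced by applying the shard-level overhead heuristic (1.5 * size + 10) twice: once to get shard_size from the requested size, and once more to get slice_size from shard_size. This is a small arithmetic sketch of that assumption, not the actual OpenSearch implementation:

```java
// Sketch of the size-overhead arithmetic assumed above:
// shard_size = 1.5 * 500 + 10 = 760, slice_size = 1.5 * 760 + 10 = 1150.
public class SliceSizeMath {
    // Hypothetical helper mirroring the 1.5 * size + 10 heuristic.
    public static int overhead(int size) {
        return (int) (1.5 * size) + 10;
    }

    public static void main(String[] args) {
        int size = 500;
        int shardSize = overhead(size);      // 760
        int sliceSize = overhead(shardSize); // 1150
        System.out.println(shardSize + " " + sliceSize);
    }
}
```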

Describe the solution you'd like

The additional combinatorial explosion is due to slice_size > shard_size, which is something we can revisit (i.e. does slice_size really need to be 1.5 * shard_size + 10, or can slice_size == shard_size?). However, the long pole is actually buildAggregation as a whole taking a long time and happening sequentially. I propose that we move the buildAggregation step into processPostCollection so that it can happen in parallel on the index_searcher threadpool.
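A hedged sketch of the proposed change, again with illustrative types rather than the actual OpenSearch API: each slice's buildAggregation is submitted to the index_searcher threadpool during post-collection, so the expensive per-slice work overlaps, and only the final merge stays sequential.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class ParallelBuildAggregation {
    // Hypothetical stand-in for a per-slice aggregator.
    public interface SliceAggregator {
        long buildAggregation(); // expensive per-slice step
    }

    public static long processPostCollection(List<SliceAggregator> slices, ExecutorService indexSearcherPool)
            throws Exception {
        // Submit each slice's buildAggregation to the index_searcher
        // threadpool so the per-slice work runs in parallel.
        List<Future<Long>> futures = new ArrayList<>();
        for (SliceAggregator s : slices) {
            futures.add(indexSearcherPool.submit((Callable<Long>) s::buildAggregation));
        }
        // Only the final merge remains sequential on the search thread.
        long total = 0;
        for (Future<Long> f : futures) {
            total += f.get();
        }
        return total;
    }
}
```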

I will follow up with a PR containing this change, along with some benchmarking data for further discussion.

Related component

Search:Performance

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Labels

Search:Performance, enhancement (Enhancement or improvement to existing feature or request), v2.14.0, v3.0.0 (Issues and PRs related to version 3.0.0)

Status

Done
