Optimizing Multi-terms over keyword fields using global ordinals #21051

@sandeshkr419

Description

Is your feature request related to a problem? Please describe

The MultiTermsAggregator currently serializes composite bucket keys as variable-length BytesRef arrays via BytesStreamOutput, storing them in a BytesRefHash through BytesKeyedBucketOrds. Each unique composite key incurs a full byte-array copy in the hash, plus per-document serialization/deserialization overhead. For high-cardinality multi-terms aggregations, this creates significant JVM memory pressure.

This optimization replaces the serialized BytesRef composite keys with packed ordinal tuples when all fields in the aggregation support global ordinals (ValuesSource.Bytes.WithOrdinals). Instead of storing "electronics|laptop|premium" as ~30+ bytes, we store the per-field global ordinals in fixed-width form: packed into a single long for 2 fields, or a compact BytesRef of fixed-width ordinals for N fields. This eliminates the variable-length byte serialization entirely and reduces per-bucket memory from O(key_length) to O(N * 8) bytes, where N is the number of fields.

For mixed-type configurations (keyword + numeric fields), the optimization falls back to the existing BytesRef-based approach, ensuring full backward compatibility.

Describe the solution you'd like

The idea is to replace the serialized BytesRef composite keys with packed ordinal tuples when all fields in the aggregation support global ordinals (ValuesSource.Bytes.WithOrdinals). Instead of a variable-length serialized key such as "electronics|laptop|premium" (~30+ bytes), we store the per-field global ordinals as a compact BytesRef of fixed-width values. This eliminates the variable-length byte serialization entirely and reduces per-bucket memory from O(key_length) to O(N * 8) bytes, where N is the number of fields.
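The fixed-width idea can be sketched roughly as follows. This is hypothetical standalone code, not the actual MultiTermsAggregator/BytesKeyedBucketOrds implementation: each per-field global ordinal is written as a fixed 8-byte value, so an N-field composite key is always N * 8 bytes regardless of the underlying term lengths.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class OrdinalKeyPacking {
    // Pack N global ordinals into a fixed-width N * 8 byte key.
    static byte[] packOrdinals(long[] ordinals) {
        ByteBuffer buf = ByteBuffer.allocate(ordinals.length * Long.BYTES);
        for (long ord : ordinals) {
            buf.putLong(ord); // fixed 8 bytes per field, no variable-length encoding
        }
        return buf.array();
    }

    // Recover the per-field ordinals from a packed key.
    static long[] unpackOrdinals(byte[] key, int numFields) {
        ByteBuffer buf = ByteBuffer.wrap(key);
        long[] ordinals = new long[numFields];
        for (int i = 0; i < numFields; i++) {
            ordinals[i] = buf.getLong();
        }
        return ordinals;
    }

    public static void main(String[] args) {
        // e.g. hypothetical ordinals of "electronics", "laptop", "premium"
        long[] ords = { 3, 17, 4096 };
        byte[] key = packOrdinals(ords);
        System.out.println(key.length);                          // 24 (3 fields * 8 bytes)
        System.out.println(Arrays.toString(unpackOrdinals(key, 3))); // [3, 17, 4096]
    }
}
```

In the real aggregator the ordinals would only be resolved back to terms once, when building the final buckets, rather than per document.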

POC: #21032
This was the first iteration of the planned improvement, and it resulted in a substantial gain:

| Metric | Operation | Baseline | Contender | Diff | Unit |
| --- | --- | --- | --- | --- | --- |
| Min Throughput | multi_terms-keyword | 1.30403 | 2.00055 | 0.69653 | ops/s |
| Mean Throughput | multi_terms-keyword | 1.32316 | 2.00091 | 0.67775 | ops/s |
| Median Throughput | multi_terms-keyword | 1.32546 | 2.00083 | 0.67536 | ops/s |
| Max Throughput | multi_terms-keyword | 1.33388 | 2.00161 | 0.66773 | ops/s |
| 50th percentile latency | multi_terms-keyword | 25248.1 | 171.678 | -25076.4 | ms |
| 90th percentile latency | multi_terms-keyword | 34857.1 | 176.62 | -34680.5 | ms |
| 99th percentile latency | multi_terms-keyword | 37025.8 | 212.795 | -36813 | ms |
| 100th percentile latency | multi_terms-keyword | 37146.3 | 232.514 | -36913.7 | ms |
| 50th percentile service time | multi_terms-keyword | 734.633 | 170.571 | -564.062 | ms |
| 90th percentile service time | multi_terms-keyword | 752.307 | 175.58 | -576.727 | ms |
| 99th percentile service time | multi_terms-keyword | 829.96 | 210.92 | -619.039 | ms |
| 100th percentile service time | multi_terms-keyword | 848.685 | 231.12 | -617.564 | ms |
| error rate | multi_terms-keyword | 0 | 0 | 0 | % |

Related component

Search:Aggregations

Describe alternatives you've considered

Optimization 1: 2 keywords packed into a single long

Packing the 2 keyword ordinals into a single 64-bit long works fine for up to a few million unique terms per keyword. This avoids converting to a BytesRef entirely, since the single long itself serves as the composite key for maintaining the buckets.

POC: #21021
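The two-field case can be sketched with plain bit shifting. This is a hypothetical illustration, not the POC code; it assumes each ordinal fits in 32 bits (cardinality below 2^32, comfortably above "a few million"), and a real implementation would need to validate that bound before choosing this path.

```java
public class TwoOrdinalPacking {
    // Pack two global ordinals (each < 2^32) into one 64-bit composite key.
    static long pack(long ord1, long ord2) {
        return (ord1 << 32) | (ord2 & 0xFFFFFFFFL);
    }

    // Recover the first ordinal from the upper 32 bits.
    static long unpackFirst(long packed) {
        return packed >>> 32;
    }

    // Recover the second ordinal from the lower 32 bits.
    static long unpackSecond(long packed) {
        return packed & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        long packed = pack(123456, 789012);
        System.out.println(unpackFirst(packed));  // 123456
        System.out.println(unpackSecond(packed)); // 789012
    }
}
```

With the key being a single long, a LongKeyedBucketOrds-style long hash can be used directly instead of a BytesRefHash.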

Optimization 2: N keywords packed into a single long, or 2 longs if required

Pack multiple ordinals into a single long, extending into a second long if needed. This is inspired by the LongLongHash approach but designed for multiple keywords in a multi-terms aggregation request.

POC: #21033
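One way the "1 long, spill to 2 longs" idea could look is sketched below. This is hypothetical code, not the POC: each field is assigned just enough bits for its cardinality, fields are laid out so no field straddles the 64-bit word boundary, and the key spills into a second long only when the total bit budget exceeds 64.

```java
public class NOrdinalPacking {
    // Pack ordinals into at most two longs, giving bits[i] bits to field i.
    // Returns { low word, high word }; throws if the total exceeds 128 bits.
    static long[] pack(long[] ordinals, int[] bits) {
        long low = 0, high = 0;
        int shift = 0; // bit offset across the 128-bit space
        for (int i = 0; i < ordinals.length; i++) {
            if (shift < 64 && shift + bits[i] > 64) {
                shift = 64; // avoid straddling the word boundary: start in the high long
            }
            if (shift + bits[i] > 128) {
                throw new IllegalArgumentException("ordinals do not fit in 2 longs");
            }
            if (shift < 64) {
                low |= ordinals[i] << shift;
            } else {
                high |= ordinals[i] << (shift - 64);
            }
            shift += bits[i];
        }
        return new long[] { low, high };
    }

    // Recover the ordinal of one field by replaying the same layout.
    static long unpack(long[] packed, int[] bits, int field) {
        int shift = 0;
        for (int i = 0; i <= field; i++) {
            if (shift < 64 && shift + bits[i] > 64) {
                shift = 64;
            }
            if (i == field) {
                long word = shift < 64 ? packed[0] : packed[1];
                int s = shift < 64 ? shift : shift - 64;
                long mask = bits[i] == 64 ? -1L : (1L << bits[i]) - 1;
                return (word >>> s) & mask;
            }
            shift += bits[i];
        }
        throw new IllegalStateException("unreachable");
    }

    public static void main(String[] args) {
        // Three fields of ~1M cardinality each need 20 bits apiece: 60 bits,
        // so everything fits in the low long and the high long stays 0.
        int[] bits = { 20, 20, 20 };
        long[] packed = pack(new long[] { 1_000_000L, 42L, 7L }, bits);
        System.out.println(packed[1]);                 // 0
        System.out.println(unpack(packed, bits, 0));   // 1000000
        System.out.println(unpack(packed, bits, 2));   // 7
    }
}
```

When the high word is known to be zero at query-setup time, the aggregator could fall back to a plain long hash; otherwise a LongLongHash-style structure keyed on both words would be used.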


| Metric | Baseline | Generic N-long optimization | 2 keyword - single long | Multiple fields, 1 long (2 if it overflows 1 long) | Unit |
| --- | --- | --- | --- | --- | --- |
| Min Throughput | 1.30403 | 2.00055 | 2.00212 | 2.00086 | ops/s |
| Mean Throughput | 1.32316 | 2.00091 | 2.00346 | 2.00143 | ops/s |
| Median Throughput | 1.32546 | 2.00083 | 2.00313 | 2.00132 | ops/s |
| Max Throughput | 1.33388 | 2.00161 | 2.00615 | 2.00258 | ops/s |
| 50th percentile latency | 25248.1 | 171.678 | 122.332 | 127.833 | ms |
| 90th percentile latency | 34857.1 | 176.62 | 127.911 | 132.191 | ms |
| 99th percentile latency | 37025.8 | 212.795 | 186.781 | 200.869 | ms |
| 100th percentile latency | 37146.3 | 232.514 | 186.826 | 209.26 | ms |
| 50th percentile service time | 734.633 | 170.571 | 121.029 | 126.69 | ms |
| 90th percentile service time | 752.307 | 175.58 | 126.345 | 131.003 | ms |
| 99th percentile service time | 829.96 | 210.92 | 185.912 | 199.778 | ms |
| 100th percentile service time | 848.685 | 231.12 | 186.051 | 208.258 | ms |
| error rate | 0 | 0 | 0 | 0 | % |

Additional context

While the numbers from the 3 approaches are fairly similar, as evident from the multiple benchmark runs you can find on the PRs themselves, I prefer the last approach since it handles very high cardinality cases easily in 1 (at most 2) longs.

Basically, a 1M-cardinality field requires 20 bits to pack (2^20 ≈ 1M), so a single long (64 bits) will suffice in most observed cases.
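As a back-of-envelope check (hypothetical helper, not from the POCs), the bit budget per field and the number of ~1M-cardinality fields that fit into one long can be computed like this:

```java
public class BitBudget {
    // Smallest number of bits b with 2^b >= cardinality (minimum 1 bit).
    static int bitsNeeded(long cardinality) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(cardinality - 1));
    }

    public static void main(String[] args) {
        int bits = bitsNeeded(1_000_000L);
        System.out.println(bits);        // 20
        System.out.println(64 / bits);   // 3 fields of ~1M cardinality fit in one long
    }
}
```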

While the single-long approach for exactly 2 keywords is great and should cover the majority of cases, the 1-or-2-long approach for N keywords also optimizes for more than 2 fields while keeping the memory footprint predictable.
