Optimizing Multi-terms over keyword fields using global ordinals #21051
Description
Is your feature request related to a problem? Please describe
The MultiTermsAggregator currently serializes composite bucket keys as variable-length BytesRef arrays via BytesStreamOutput, storing them in a BytesRefHash through BytesKeyedBucketOrds. Each unique composite key incurs a full byte-array copy in the hash, plus per-document serialization/deserialization overhead. For high-cardinality multi-terms aggregations, this creates significant JVM memory pressure.
This optimization replaces the serialized BytesRef composite keys with packed ordinal tuples when all fields in the aggregation support global ordinals (ValuesSource.Bytes.WithOrdinals). Instead of storing "electronics|laptop|premium" as ~30+ bytes, we store the corresponding long ordinals, packed into a single long (for 2 fields) or into a compact BytesRef of fixed-width ordinals (for N fields). This eliminates the variable-length byte serialization entirely and reduces per-bucket memory from O(key_length) to O(N * 8) bytes, where N is the number of fields.
For mixed-type configurations (keyword + numeric fields), the optimization falls back to the existing BytesRef-based approach, ensuring full backward compatibility.
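To make the memory difference concrete, here is a minimal sketch (class and method names are illustrative, not the actual OpenSearch code) contrasting the size of a delimiter-joined composite key with a fixed-width ordinal key:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compares the byte cost of a serialized composite key
// with the fixed-width ordinal alternative described above.
public class KeySizeSketch {
    // Baseline-style key: join the raw field values into one variable-length blob.
    static byte[] serializedKey(String... values) {
        return String.join("|", values).getBytes(StandardCharsets.UTF_8);
    }

    // Ordinal-style key: 8 bytes per field, independent of the term lengths.
    static int ordinalKeyLength(int fieldCount) {
        return fieldCount * Long.BYTES;
    }

    public static void main(String[] args) {
        byte[] key = serializedKey("electronics", "laptop", "premium");
        System.out.println("serialized key bytes: " + key.length);          // 26, grows with value length
        System.out.println("ordinal key bytes:    " + ordinalKeyLength(3)); // 24, always N * 8
    }
}
```

The gap widens as the terms get longer: the serialized key scales with the concatenated value lengths, while the ordinal key stays at N * 8 bytes.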
Describe the solution you'd like
The idea is to replace the serialized BytesRef composite keys with packed ordinal tuples when all fields in the aggregation support global ordinals (ValuesSource.Bytes.WithOrdinals). Instead of storing "electronics|laptop|premium" as ~30+ bytes, we store the three long ordinals in a compact BytesRef of fixed-width ordinals (for N fields). This eliminates the variable-length byte serialization entirely and reduces per-bucket memory from O(key_length) to O(N * 8) bytes, where N is the number of fields.
POC: #21032
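The fixed-width encoding described above can be sketched roughly as follows (a hypothetical illustration under my own naming, not the code from the POC): each global ordinal is written into its own 8-byte slot, so the composite key is always N * 8 bytes and round-trips without any variable-length framing.

```java
import java.util.Arrays;

// Hypothetical sketch of the N-field variant: every ordinal occupies a fixed
// 8-byte big-endian slot, so no length prefixes or delimiters are needed.
public class FixedWidthOrdinalKey {
    static byte[] encode(long[] ordinals) {
        byte[] key = new byte[ordinals.length * Long.BYTES];
        for (int i = 0; i < ordinals.length; i++) {
            for (int b = 0; b < Long.BYTES; b++) {
                // Write the ordinal's bytes from most to least significant.
                key[i * Long.BYTES + b] = (byte) (ordinals[i] >>> (56 - 8 * b));
            }
        }
        return key;
    }

    static long[] decode(byte[] key) {
        long[] ordinals = new long[key.length / Long.BYTES];
        for (int i = 0; i < ordinals.length; i++) {
            long v = 0;
            for (int b = 0; b < Long.BYTES; b++) {
                v = (v << 8) | (key[i * Long.BYTES + b] & 0xFFL);
            }
            ordinals[i] = v;
        }
        return ordinals;
    }

    public static void main(String[] args) {
        long[] ords = {17L, 4L, 123_456L};
        byte[] key = encode(ords);
        System.out.println(key.length);                       // 24
        System.out.println(Arrays.equals(decode(key), ords)); // true
    }
}
```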
This was the first iteration of the planned improvement, and it resulted in a massive gain:
| Metric | Operation | Baseline | Contender | Diff | Unit |
|---|---|---|---|---|---|
| Min Throughput | multi_terms-keyword | 1.30403 | 2.00055 | 0.69653 | ops/s |
| Mean Throughput | multi_terms-keyword | 1.32316 | 2.00091 | 0.67775 | ops/s |
| Median Throughput | multi_terms-keyword | 1.32546 | 2.00083 | 0.67536 | ops/s |
| Max Throughput | multi_terms-keyword | 1.33388 | 2.00161 | 0.66773 | ops/s |
| 50th percentile latency | multi_terms-keyword | 25248.1 | 171.678 | -25076.4 | ms |
| 90th percentile latency | multi_terms-keyword | 34857.1 | 176.62 | -34680.5 | ms |
| 99th percentile latency | multi_terms-keyword | 37025.8 | 212.795 | -36813 | ms |
| 100th percentile latency | multi_terms-keyword | 37146.3 | 232.514 | -36913.7 | ms |
| 50th percentile service time | multi_terms-keyword | 734.633 | 170.571 | -564.062 | ms |
| 90th percentile service time | multi_terms-keyword | 752.307 | 175.58 | -576.727 | ms |
| 99th percentile service time | multi_terms-keyword | 829.96 | 210.92 | -619.039 | ms |
| 100th percentile service time | multi_terms-keyword | 848.685 | 231.12 | -617.564 | ms |
| error rate | multi_terms-keyword | 0 | 0 | 0 | % |
Related component
Search:Aggregations
Describe alternatives you've considered
Optimization 1: 2 Keyword packed into a single long
Packing 2 keyword ordinals into a single long (e.g., 32 bits each) works perfectly fine for up to a few million cardinality per keyword. This avoids converting to a BytesRef entirely, since the single long itself serves as the composite key required to maintain the buckets.
POC: #21021
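A minimal sketch of this 2-field packing (the 32-bits-per-ordinal split is an illustrative assumption, not necessarily the exact scheme in the POC):

```java
// Hypothetical sketch of Optimization 1: two global ordinals packed into one
// 64-bit long, 32 bits each, supporting per-field cardinalities up to 2^32.
// The single long itself acts as the composite bucket key.
public class TwoOrdinalPack {
    static long pack(long ord1, long ord2) {
        if ((ord1 >>> 32) != 0 || (ord2 >>> 32) != 0) {
            throw new IllegalArgumentException("ordinal exceeds 32 bits");
        }
        return (ord1 << 32) | ord2;
    }

    static long first(long packed)  { return packed >>> 32; }
    static long second(long packed) { return packed & 0xFFFF_FFFFL; }

    public static void main(String[] args) {
        long key = pack(123_456L, 789L);
        System.out.println(first(key));  // 123456
        System.out.println(second(key)); // 789
    }
}
```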
Optimization 2: N Keyword packed into a single long, or 2 longs if required
Packing multiple ordinals into a single long, extending into a second long if needed. This is inspired by the LongLongHash approach but designed for multiple keywords in a multi_terms aggregation request.
POC: #21033
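This variant can be sketched as follows (an illustrative interpretation, not the POC code): each field gets just enough bits for its cardinality, and the ordinals are concatenated into one long, spilling into a second long only when the total exceeds 64 bits.

```java
// Hypothetical sketch of Optimization 2: variable-width bit packing of N
// ordinals into one long, with overflow into a second long (LongLongHash-style).
public class NOrdinalPack {
    // Bits needed to represent ordinals in [0, cardinality).
    static int bitsFor(long cardinality) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(cardinality - 1));
    }

    // Returns a 1- or 2-element array holding the packed composite key.
    static long[] pack(long[] ordinals, int[] bits) {
        int total = 0;
        for (int b : bits) total += b;
        if (total > 128) throw new IllegalArgumentException("key exceeds two longs");
        long[] key = new long[]{0L, 0L};
        int pos = 0; // bits consumed so far, filled from the top of key[0] downward
        for (int i = 0; i < ordinals.length; i++) {
            for (int b = bits[i] - 1; b >= 0; b--) {
                long bit = (ordinals[i] >>> b) & 1L;
                key[pos / 64] |= bit << (63 - pos % 64);
                pos++;
            }
        }
        return pos <= 64 ? new long[]{key[0]} : key;
    }

    public static void main(String[] args) {
        // 3 (binary 11) in 2 bits, then 5 (binary 101) in 3 bits -> 5 bits total, 1 long.
        long[] key = pack(new long[]{3L, 5L}, new int[]{2, 3});
        System.out.println(key.length); // 1
        System.out.println(bitsFor(1_000_000L)); // 20
    }
}
```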
| Metric | Baseline | Generic N-long optimization | 2 keyword - single long | multiple fields, 1 long (2 if it overflows 1 long) | Unit |
|---|---|---|---|---|---|
| Min Throughput | 1.30403 | 2.00055 | 2.00212 | 2.00086 | ops/s |
| Mean Throughput | 1.32316 | 2.00091 | 2.00346 | 2.00143 | ops/s |
| Median Throughput | 1.32546 | 2.00083 | 2.00313 | 2.00132 | ops/s |
| Max Throughput | 1.33388 | 2.00161 | 2.00615 | 2.00258 | ops/s |
| 50th percentile latency | 25248.1 | 171.678 | 122.332 | 127.833 | ms |
| 90th percentile latency | 34857.1 | 176.62 | 127.911 | 132.191 | ms |
| 99th percentile latency | 37025.8 | 212.795 | 186.781 | 200.869 | ms |
| 100th percentile latency | 37146.3 | 232.514 | 186.826 | 209.26 | ms |
| 50th percentile service time | 734.633 | 170.571 | 121.029 | 126.69 | ms |
| 90th percentile service time | 752.307 | 175.58 | 126.345 | 131.003 | ms |
| 99th percentile service time | 829.96 | 210.92 | 185.912 | 199.778 | ms |
| 100th percentile service time | 848.685 | 231.12 | 186.051 | 208.258 | ms |
| error rate | 0 | 0 | 0 | 0 | % |
Additional context
While the numbers from the 3 approaches are fairly similar, as evident from the multiple benchmark runs you can find on the PRs themselves, I prefer the last approach, as it handles very high cardinality cases easily in 1 (at most 2) longs.
Basically, a 1M-cardinality field requires 20 bits to pack [2^20 ~ 1M], so a single long (64 bits) will suffice in most observed cases.
While the single-long approach for exactly 2 keywords is great and should cover the majority of cases, the 1/2-long approach for N keywords offers an optimized path for more than 2 fields as well, keeping the memory footprint predictable.
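The sizing argument above can be checked with a few lines of arithmetic (an illustrative calculation, not code from the PRs):

```java
// Illustrative sizing: bits per field from its cardinality, then how many
// longs the packed composite key needs (1 up to 64 bits, 2 up to 128).
public class PackSizing {
    static int bitsNeeded(long cardinality) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(cardinality - 1));
    }

    static int longsNeeded(long[] cardinalities) {
        int total = 0;
        for (long c : cardinalities) total += bitsNeeded(c);
        return (total + 63) / 64;
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(1_000_000L)); // 20 bits for a 1M-cardinality field
        // 3 fields of 1M cardinality -> 60 bits -> fits in 1 long.
        System.out.println(longsNeeded(new long[]{1_000_000L, 1_000_000L, 1_000_000L}));
        // 5 such fields -> 100 bits -> spills into 2 longs.
        System.out.println(longsNeeded(new long[]{1_000_000L, 1_000_000L, 1_000_000L, 1_000_000L, 1_000_000L}));
    }
}
```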