Optimizing Multi-terms over keyword fields using global ordinals #21051
Description
Is your feature request related to a problem? Please describe
The MultiTermsAggregator currently serializes composite bucket keys as variable-length BytesRef arrays via BytesStreamOutput, storing them in a BytesRefHash through BytesKeyedBucketOrds. Each unique composite key incurs a full byte-array copy in the hash, plus per-document serialization/deserialization overhead. For high-cardinality multi-terms aggregations, this creates significant JVM memory pressure.
This optimization replaces the serialized BytesRef composite keys with packed ordinal tuples when all fields in the aggregation support global ordinals (ValuesSource.Bytes.WithOrdinals). Instead of storing "electronics|laptop|premium" as ~30+ bytes, we store the corresponding long ordinals, packed into a single long (for 2 fields) or into a compact BytesRef of fixed-width ordinals (for N fields). This eliminates the variable-length byte serialization entirely and reduces per-bucket memory from O(key_length) to O(N * 8) bytes, where N is the number of fields.
For mixed-type configurations (keyword + numeric fields), the optimization falls back to the existing BytesRef-based approach, ensuring full backward compatibility.
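To make the memory difference concrete, here is a minimal sketch (class and method names are illustrative, not the actual OpenSearch code) contrasting the size of a delimiter-joined composite key with a fixed-width ordinal key:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: compares the byte cost of a serialized composite key
// with the fixed-width ordinal alternative described above.
public class KeySizeSketch {
    // Baseline-style key: join the raw field values into one variable-length blob.
    static byte[] serializedKey(String... values) {
        return String.join("|", values).getBytes(StandardCharsets.UTF_8);
    }

    // Ordinal-style key: 8 bytes per field, independent of the term lengths.
    static int ordinalKeyLength(int fieldCount) {
        return fieldCount * Long.BYTES;
    }

    public static void main(String[] args) {
        byte[] key = serializedKey("electronics", "laptop", "premium");
        System.out.println("serialized key bytes: " + key.length);          // 26, grows with value length
        System.out.println("ordinal key bytes:    " + ordinalKeyLength(3)); // 24, always N * 8
    }
}
```

The gap widens as the terms get longer: the serialized key scales with the concatenated value lengths, while the ordinal key stays at N * 8 bytes.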
Describe the solution you'd like
The idea is to replace the serialized BytesRef composite keys with packed ordinal tuples when all fields in the aggregation support global ordinals (ValuesSource.Bytes.WithOrdinals). Instead of storing "electronics|laptop|premium" as ~30+ bytes, we store the three long ordinals in a compact BytesRef of fixed-width ordinals (for N fields). This eliminates the variable-length byte serialization entirely and reduces per-bucket memory from O(key_length) to O(N * 8) bytes, where N is the number of fields.
POC: #21032
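The fixed-width encoding described above can be sketched roughly as follows (a hypothetical illustration under my own naming, not the code from the POC): each global ordinal is written into its own 8-byte slot, so the composite key is always N * 8 bytes and round-trips without any variable-length framing.

```java
import java.util.Arrays;

// Hypothetical sketch of the N-field variant: every ordinal occupies a fixed
// 8-byte big-endian slot, so no length prefixes or delimiters are needed.
public class FixedWidthOrdinalKey {
    static byte[] encode(long[] ordinals) {
        byte[] key = new byte[ordinals.length * Long.BYTES];
        for (int i = 0; i < ordinals.length; i++) {
            for (int b = 0; b < Long.BYTES; b++) {
                // Write the ordinal's bytes from most to least significant.
                key[i * Long.BYTES + b] = (byte) (ordinals[i] >>> (56 - 8 * b));
            }
        }
        return key;
    }

    static long[] decode(byte[] key) {
        long[] ordinals = new long[key.length / Long.BYTES];
        for (int i = 0; i < ordinals.length; i++) {
            long v = 0;
            for (int b = 0; b < Long.BYTES; b++) {
                v = (v << 8) | (key[i * Long.BYTES + b] & 0xFFL);
            }
            ordinals[i] = v;
        }
        return ordinals;
    }

    public static void main(String[] args) {
        long[] ords = {17L, 4L, 123_456L};
        byte[] key = encode(ords);
        System.out.println(key.length);                       // 24
        System.out.println(Arrays.equals(decode(key), ords)); // true
    }
}
```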
This was the first iteration of the planned improvement, and it resulted in a massive gain:
| Metric | Operation | Baseline | Contender | Diff | Unit |
|---|---|---|---|---|---|
| Min Throughput | multi_terms-keyword | 1.30403 | 2.00055 | 0.69653 | ops/s |
| Mean Throughput | multi_terms-keyword | 1.32316 | 2.00091 | 0.67775 | ops/s |
| Median Throughput | multi_terms-keyword | 1.32546 | 2.00083 | 0.67536 | ops/s |
| Max Throughput | multi_terms-keyword | 1.33388 | 2.00161 | 0.66773 | ops/s |
| 50th percentile latency | multi_terms-keyword | 25248.1 | 171.678 | -25076.4 | ms |
| 90th percentile latency | multi_terms-keyword | 34857.1 | 176.62 | -34680.5 | ms |
| 99th percentile latency | multi_terms-keyword | 37025.8 | 212.795 | -36813 | ms |
| 100th percentile latency | multi_terms-keyword | 37146.3 | 232.514 | -36913.7 | ms |
| 50th percentile service time | multi_terms-keyword | 734.633 | 170.571 | -564.062 | ms |
| 90th percentile service time | multi_terms-keyword | 752.307 | 175.58 | -576.727 | ms |
| 99th percentile service time | multi_terms-keyword | 829.96 | 210.92 | -619.039 | ms |
| 100th percentile service time | multi_terms-keyword | 848.685 | 231.12 | -617.564 | ms |
| error rate | multi_terms-keyword | 0 | 0 | 0 | % |
Related component
Search:Aggregations
Describe alternatives you've considered
Optimization 1: 2 Keyword packed into a single long
Packing 2 keyword ordinals into a single long (e.g., 32 bits each) works perfectly fine for up to a few million cardinality per keyword. This avoids converting to a BytesRef entirely, since the single long itself serves as the composite key required to maintain the buckets.
POC: #21021
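A minimal sketch of this 2-field packing (the 32-bits-per-ordinal split is an illustrative assumption, not necessarily the exact scheme in the POC):

```java
// Hypothetical sketch of Optimization 1: two global ordinals packed into one
// 64-bit long, 32 bits each, supporting per-field cardinalities up to 2^32.
// The single long itself acts as the composite bucket key.
public class TwoOrdinalPack {
    static long pack(long ord1, long ord2) {
        if ((ord1 >>> 32) != 0 || (ord2 >>> 32) != 0) {
            throw new IllegalArgumentException("ordinal exceeds 32 bits");
        }
        return (ord1 << 32) | ord2;
    }

    static long first(long packed)  { return packed >>> 32; }
    static long second(long packed) { return packed & 0xFFFF_FFFFL; }

    public static void main(String[] args) {
        long key = pack(123_456L, 789L);
        System.out.println(first(key));  // 123456
        System.out.println(second(key)); // 789
    }
}
```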
Optimization 2: N Keyword packed into a single long, or 2 longs if required
Packing multiple ordinals into a single long, extending into a second long if needed. This is inspired by the LongLongHash approach but designed for multiple keywords in a multi_terms aggregation request.
POC: #21033
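This variant can be sketched as follows (an illustrative interpretation, not the POC code): each field gets just enough bits for its cardinality, and the ordinals are concatenated into one long, spilling into a second long only when the total exceeds 64 bits.

```java
// Hypothetical sketch of Optimization 2: variable-width bit packing of N
// ordinals into one long, with overflow into a second long (LongLongHash-style).
public class NOrdinalPack {
    // Bits needed to represent ordinals in [0, cardinality).
    static int bitsFor(long cardinality) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(cardinality - 1));
    }

    // Returns a 1- or 2-element array holding the packed composite key.
    static long[] pack(long[] ordinals, int[] bits) {
        int total = 0;
        for (int b : bits) total += b;
        if (total > 128) throw new IllegalArgumentException("key exceeds two longs");
        long[] key = new long[]{0L, 0L};
        int pos = 0; // bits consumed so far, filled from the top of key[0] downward
        for (int i = 0; i < ordinals.length; i++) {
            for (int b = bits[i] - 1; b >= 0; b--) {
                long bit = (ordinals[i] >>> b) & 1L;
                key[pos / 64] |= bit << (63 - pos % 64);
                pos++;
            }
        }
        return pos <= 64 ? new long[]{key[0]} : key;
    }

    public static void main(String[] args) {
        // 3 (binary 11) in 2 bits, then 5 (binary 101) in 3 bits -> 5 bits total, 1 long.
        long[] key = pack(new long[]{3L, 5L}, new int[]{2, 3});
        System.out.println(key.length); // 1
        System.out.println(bitsFor(1_000_000L)); // 20
    }
}
```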
| Metric | Baseline | Generic N-long optimization | 2 keyword - single long | multiple fields, 1 long (2 if it overflows 1 long) | Unit |
|---|---|---|---|---|---|
| Min Throughput | 1.30403 | 2.00055 | 2.00212 | 2.00086 | ops/s |
| Mean Throughput | 1.32316 | 2.00091 | 2.00346 | 2.00143 | ops/s |
| Median Throughput | 1.32546 | 2.00083 | 2.00313 | 2.00132 | ops/s |
| Max Throughput | 1.33388 | 2.00161 | 2.00615 | 2.00258 | ops/s |
| 50th percentile latency | 25248.1 | 171.678 | 122.332 | 127.833 | ms |
| 90th percentile latency | 34857.1 | 176.62 | 127.911 | 132.191 | ms |
| 99th percentile latency | 37025.8 | 212.795 | 186.781 | 200.869 | ms |
| 100th percentile latency | 37146.3 | 232.514 | 186.826 | 209.26 | ms |
| 50th percentile service time | 734.633 | 170.571 | 121.029 | 126.69 | ms |
| 90th percentile service time | 752.307 | 175.58 | 126.345 | 131.003 | ms |
| 99th percentile service time | 829.96 | 210.92 | 185.912 | 199.778 | ms |
| 100th percentile service time | 848.685 | 231.12 | 186.051 | 208.258 | ms |
| error rate | 0 | 0 | 0 | 0 | % |
Additional context
While the numbers from the 3 approaches are fairly similar, as evident from the multiple benchmark runs you can find on the PRs themselves, I prefer the last approach, as it handles very high cardinality cases easily in 1 (at most 2) longs.
Basically, a 1M-cardinality field requires 20 bits to pack [2^20 ~ 1M], so a single long (64 bits) will suffice in most observed cases.
While the single-long approach for exactly 2 keywords is great and should cover the majority of cases, the 1/2-long approach for N keywords offers an optimized path for more than 2 fields as well, keeping the memory footprint predictable.
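The sizing argument above can be checked with a few lines of arithmetic (an illustrative calculation, not code from the PRs):

```java
// Illustrative sizing: bits per field from its cardinality, then how many
// longs the packed composite key needs (1 up to 64 bits, 2 up to 128).
public class PackSizing {
    static int bitsNeeded(long cardinality) {
        return Math.max(1, 64 - Long.numberOfLeadingZeros(cardinality - 1));
    }

    static int longsNeeded(long[] cardinalities) {
        int total = 0;
        for (long c : cardinalities) total += bitsNeeded(c);
        return (total + 63) / 64;
    }

    public static void main(String[] args) {
        System.out.println(bitsNeeded(1_000_000L)); // 20 bits for a 1M-cardinality field
        // 3 fields of 1M cardinality -> 60 bits -> fits in 1 long.
        System.out.println(longsNeeded(new long[]{1_000_000L, 1_000_000L, 1_000_000L}));
        // 5 such fields -> 100 bits -> spills into 2 longs.
        System.out.println(longsNeeded(new long[]{1_000_000L, 1_000_000L, 1_000_000L, 1_000_000L, 1_000_000L}));
    }
}
```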