# [RFC] Intra-Segment Search: Query and Aggregation Performance Analysis
This RFC analyzes which OpenSearch queries and aggregations benefit from Lucene's intra-segment search feature (introduced in Lucene PR #13542) and which ones regress. The goal is to identify optimization opportunities and document expected behavior.
## Background
Lucene's intra-segment search partitions large segments into smaller doc ID ranges that can be processed in parallel. This benefits compute-heavy operations but can hurt lightweight or state-dependent operations due to:
- Duplicated per-partition setup work (e.g., stored fields decompression).
- Coordination/merge overhead exceeding parallelization gains.
- Shared mutable state (e.g., terms agg bucket counters).
- Early termination conflicts (sorted queries).
- BKD tree traversal overhead (range queries).
- Duplicated per-segment work across partitions.
- Incompatible data structures.
## Implementation for enabling intra-segment search
### Sample Shard-Level Slice-to-Partition Assignment
When intra-segment search is enabled, large segments are partitioned and the partitions are distributed across slices, so that no slice holds two partitions of the same segment.
```mermaid
flowchart TB
    subgraph Shard["Shard"]
        subgraph Segments
            S1["Segment 1<br/>1.2M docs"]
            S2["Segment 2<br/>600K docs"]
            S3["Segment 3<br/>100K docs"]
        end
    end
    subgraph Partitions
        S1 --> P1a["S1: 0-400K"]
        S1 --> P1b["S1: 400K-800K"]
        S1 --> P1c["S1: 800K-1.2M"]
        S2 --> P2a["S2: 0-300K"]
        S2 --> P2b["S2: 300K-600K"]
        S3 --> P3["S3: 0-100K<br/>(not partitioned)"]
    end
    subgraph Slices["Slices (Parallel Execution)"]
        subgraph Slice1["Slice 1 - Thread 1"]
            T1P1["S1: 0-400K"]
            T1P2["S2: 300K-600K"]
        end
        subgraph Slice2["Slice 2 - Thread 2"]
            T2P1["S1: 400K-800K"]
            T2P2["S3: 0-100K"]
        end
        subgraph Slice3["Slice 3 - Thread 3"]
            T3P1["S1: 800K-1.2M"]
            T3P2["S2: 0-300K"]
        end
    end
    P1a --> T1P1
    P2b --> T1P2
    P1b --> T2P1
    P3 --> T2P2
    P1c --> T3P1
    P2a --> T3P2
```
**Key Rule:** Same-segment partitions are always placed in different slices (e.g., Segment 1's partitions are spread across Slices 1, 2, and 3).
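The assignment rule above can be sketched as a round-robin deal of each segment's partitions across consecutive slices (an illustrative sketch with a hypothetical `assign_to_slices` helper, not the actual OpenSearch slicing code):

```python
def assign_to_slices(partitions, slice_count):
    """Deal each segment's partitions to consecutive slices, round-robin.

    `partitions` is a list of (segment_id, num_partitions) pairs. Because the
    decision logic caps partitions per segment at `slice_count`, dealing a
    segment's partitions to consecutive slices guarantees that two partitions
    of the same segment never share a slice.
    (Illustrative sketch, not the actual OpenSearch implementation.)
    """
    slices = [[] for _ in range(slice_count)]
    next_slice = 0
    for segment_id, num_parts in partitions:
        for p in range(num_parts):
            # Consecutive slots modulo slice_count are distinct while
            # num_parts <= slice_count.
            slices[(next_slice + p) % slice_count].append((segment_id, p))
        next_slice = (next_slice + num_parts) % slice_count
    return slices

# Shape of the diagram above: 3 + 2 + 1 partitions dealt into 3 slices.
slices = assign_to_slices([(1, 3), (2, 2), (3, 1)], 3)
```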
### Partition Decision Logic
```mermaid
flowchart LR
    A[Segment] --> B{docs >= floor_size AND<br/>docs > targetSize}
    B -->|No| C[Keep whole]
    B -->|Yes| D["N = min(ceil(docs/targetSize), sliceCount)"]
    D --> E[Split into N partitions]
```
- `floor_size` = `search.intra_segment_search.min_segment_size` (a floor setting to avoid partitioning small segments)
- `targetSize` = `totalDocs / sliceCount` (used to decide whether a segment needs to be partitioned)
- `sliceCount` = `search.concurrent_segment_search.target_max_slice_count` (existing slice count; same code path as concurrent segment search)
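The decision logic above can be sketched in Python (an illustrative sketch with a hypothetical `plan_partitions` helper, not the actual OpenSearch implementation):

```python
import math

def plan_partitions(segment_doc_counts, slice_count, floor_size=500_000):
    """Decide, per segment, how many partitions to create.

    Mirrors the decision diagram above: a segment is split only when it has
    at least `floor_size` docs AND more docs than the average docs-per-slice.
    (Illustrative sketch, not the actual OpenSearch code.)
    """
    total_docs = sum(segment_doc_counts)
    target_size = total_docs / slice_count
    plan = {}
    for i, docs in enumerate(segment_doc_counts):
        if docs >= floor_size and docs > target_size:
            n = min(math.ceil(docs / target_size), slice_count)
        else:
            n = 1  # keep the segment whole
        plan[i] = n
    return plan

# Segments of 1.2M, 600K, and 100K docs with 3 slices:
# target_size ≈ 633K, so only the 1.2M-doc segment gets split.
print(plan_partitions([1_200_000, 600_000, 100_000], 3))
```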
## Configuration
| Setting | Default | Description |
|---|---|---|
| `search.intra_segment_search.enabled` | `true` | Enable intra-segment partitioning |
| `search.intra_segment_search.min_segment_size` | `500,000` | Minimum segment size to partition |
## Query Performance Analysis
The following queries were tested on the big5 dataset, force-merged to a single segment to better understand the dynamics and query performance of intra-segment concurrency. The EC2 instance used for testing was an r5.xlarge with 4 vCPUs.
### Queries that improve with intra-segment search
We tested some existing queries and added several new ones to evaluate intra-segment concurrency; the changes are in PR #725. We also validated a few queries using plain curl requests and measured `took` time.
#### Span Queries (improved)
Used the big5 `span-near-query` query.
```
## without intra-segment
| 50th percentile service time | span-near-query | 104.127 | ms |
| 90th percentile service time | span-near-query | 104.796 | ms |

## with intra-segment
| 50th percentile service time | span-near-query | 55.2068 | ms |
| 90th percentile service time | span-near-query | 77.3854 | ms |
```
#### Intervals ordered query (improved)
Covers the LowIntervalsOrdered, MedIntervalsOrdered, and HighIntervalsOrdered patterns.
```
## without intra-segment
| 50th percentile service time | intervals-ordered-message | 91.1661 | ms |
| 90th percentile service time | intervals-ordered-message | 91.6819 | ms |

## with intra-segment
| 50th percentile service time | intervals-ordered-message | 48.3255 | ms |
| 90th percentile service time | intervals-ordered-message | 67.1245 | ms |
```
#### Phrase Queries (improved match_phrase)
`bool-must-double-phrase` (MedSloppyPhrase)
```
## without intra-segment
| 50th percentile service time | bool-must-double-phrase | 127.807 | ms |
| 90th percentile service time | bool-must-double-phrase | 129.298 | ms |

## with intra-segment
| 50th percentile service time | bool-must-double-phrase | 66.7707 | ms |
| 90th percentile service time | bool-must-double-phrase | 92.5768 | ms |
```

#### Boolean AND Queries with High Cardinality (improved)
The AndHighHigh and AndHighMed patterns. Intersection work across multiple high-frequency terms parallelizes well, so boolean AND queries with BM25 scoring improved.
```
## without intra-segment
| 50th percentile service time | bool-must-match-message | 58.7766 | ms |
| 90th percentile service time | bool-must-match-message | 59.3151 | ms |

## with intra-segment
| 50th percentile service time | bool-must-match-message | 33.205 | ms |
| 90th percentile service time | bool-must-match-message | 51.3522 | ms |
```

```
## without intra-segment
| 50th percentile service time | bool-must-term-keyword | 58.9581 | ms |
| 90th percentile service time | bool-must-term-keyword | 59.2319 | ms |

## with intra-segment
| 50th percentile service time | bool-must-term-keyword | 33.1571 | ms |
| 90th percentile service time | bool-must-term-keyword | 39.0753 | ms |
```

#### Boolean OR Queries (improved)
The OrHighHigh and OrHighMed patterns. Union operations with scoring benefit from intra-segment parallelism.
```
## without intra-segment
| 50th percentile service time | bool-should-match-med | 66.7524 | ms |
| 90th percentile service time | bool-should-match-med | 67.5462 | ms |

## with intra-segment
| 50th percentile service time | bool-should-match-med | 37.641 | ms |
| 90th percentile service time | bool-should-match-med | 53.3731 | ms |
```

Saw an improvement in curl `took` time from 67 ms to 37 ms.
```shell
# OrHighHigh
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"track_total_hits":false,"query":{"bool":{"should":[{"match":{"message":"monkey"}},{"match":{"message":"rabbit"}}],"minimum_should_match":1}}}'

# OrHighMed
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"track_total_hits":false,"query":{"bool":{"should":[{"match":{"message":"monkey"}},{"match":{"message":"spirit"}}],"minimum_should_match":1}}}'
```

#### Query string (improved)
Directly used the `query-string-on-message` query from the big5 workload.
```
## without intra-segment
| 50th percentile service time | query-string-on-message | 146.762 | ms |
| 90th percentile service time | query-string-on-message | 148.851 | ms |

## with intra-segment
| 50th percentile service time | query-string-on-message | 79.3278 | ms |
| 90th percentile service time | query-string-on-message | 115.854 | ms |
```

## Aggregation Performance Analysis
### Metric Aggregations
Metric aggregations benefit because DocValues scanning and computation parallelize well. Each partition computes partial results that merge cheaply at the end.
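Why per-partition partial results merge cleanly can be seen with a toy `stats` reduction (hypothetical helpers `partial_stats`/`merge_stats` for illustration, not OpenSearch code):

```python
def partial_stats(values):
    """Per-partition partial result for a toy `stats` aggregation."""
    return {
        "count": len(values),
        "sum": float(sum(values)),
        "min": min(values),
        "max": max(values),
    }

def merge_stats(partials):
    """Merge partition-level partials into the final stats.

    The merge is associative and order-independent, which is why each
    partition can be scanned by a different thread.
    """
    merged = {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
    }
    merged["avg"] = merged["sum"] / merged["count"]
    return merged

# Two partitions of one segment produce the same stats as a single scan.
print(merge_stats([partial_stats([1, 2, 3]), partial_stats([4, 5])]))
```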
#### stats aggregation (improved)
The curl query took time dropped from 2859 ms to 1494 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "metrics_stats": { "stats": { "field": "metrics.size" } } } }'
```

#### sum (improved)
The curl query took time dropped from 2046 ms to 1022 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"tsum":{"sum":{"field":"metrics.size"}}}}'
```

#### size_percentiles (improved)
The curl query took time dropped from 12843 ms to 6515 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "size_percentiles": { "percentiles": { "field": "metrics.size" } } } }'
```

#### extended_stats (improved)
The curl query took time dropped from 3718 ms to 1858 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"size_extended":{"extended_stats":{"field":"metrics.size"}}}}'
```

#### Multiple metric aggregations (improved metrics-agg-no-range)
```
## without intra-segment
| 50th percentile service time | metrics-agg-no-range | 9904.66 | ms |
| 90th percentile service time | metrics-agg-no-range | 9978.55 | ms |

## with intra-segment
| 50th percentile service time | metrics-agg-no-range | 4971.42 | ms |
| 90th percentile service time | metrics-agg-no-range | 5088.13 | ms |
```
```shell
curl -s "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "tsum": { "sum": { "field": "metrics.size" } },
    "tmin": { "min": { "field": "metrics.tmin" } },
    "tavg": { "avg": { "field": "metrics.size" } },
    "tmax": { "max": { "field": "metrics.size" } },
    "tstats": { "stats": { "field": "metrics.size" } }
  }
}'
```

#### value_count (improved)
The curl query took time dropped from 789 ms to 395 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"count_values":{"value_count":{"field":"metrics.size"}}}}'
```

#### percentile_ranks (improved)
The curl query took time dropped from 12747 ms to 6432 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"size_ranks":{"percentile_ranks":{"field":"metrics.size","values":[100,500,1000]}}}}'
```

#### matrix_stats aggregation (improved)
The curl query took time dropped from 43773 ms to 22178 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"matrix":{"matrix_stats":{"fields":["metrics.size","metrics.tmin"]}}}}'
```

#### top_docs (improved)
The curl query took time dropped from 5618 ms to 2920 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":5},"aggs":{"top_docs":{"top_hits":{"size":3}}}}}}'
```

#### cardinality high aggregation (improved)

```
## with intra-segment
| 50th percentile service time | cardinality-agg-high | 665.718 | ms |
| 90th percentile service time | cardinality-agg-high | 723.834 | ms |

## without intra-segment
| 50th percentile service time | cardinality-agg-high | 1312.29 | ms |
| 90th percentile service time | cardinality-agg-high | 1323.5 | ms |
```

#### top_hits nested under terms (improved)
The curl query took time dropped from 5430 ms to 2829 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":5},"aggs":{"top_docs":{"top_hits":{"size":3}}}}}}'
```

### Bucket Aggregations
#### terms (improved)
Saw a slight improvement with curl queries.

```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":10}}}}'
```

#### significant_terms (improved)
The curl query took time dropped from 5970 ms to 3170 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_process":{"terms":{"field":"process.name","size":10},"aggs":{"significant_regions":{"significant_terms":{"field":"cloud.region"}}}}}}'
```

#### multi_terms (improved)
The curl query took time dropped from 50068 ms to 25922 ms
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"multi_terms":{"terms":[{"field":"process.name"},{"field":"cloud.region"}],"size":10}}}}'
```

#### histogram (improved)
The curl query took time dropped from 3355 ms to 1691 ms
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"histogram":{"field":"metrics.size","interval":100}}}}'
```

#### date_histogram and auto_date_histogram (already fast since using BKD)
Not much improvement, as these already use the BKD tree.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}}}' | jq '.'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"auto_date_histogram":{"field":"@timestamp","buckets":10}}}}' | jq '.'
```

#### range and date_range (using BKD)
Not much improvement, as these already use the BKD tree.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"range":{"field":"metrics.size","ranges":[{"to":100},{"from":100,"to":500},{"from":500}]}}}}'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"date_range":{"field":"@timestamp","ranges":[{"to":"2023-06-01"},{"from":"2023-06-01"}]}}}}'
```

#### composite
Showed a regression with intra-segment search.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"composite":{"sources":[{"region":{"terms":{"field":"cloud.region"}}},{"process":{"terms":{"field":"process.name"}}}]}}}}'

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"my_buckets":{"composite":{"size":100,"sources":[{"date":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}},{"region":{"terms":{"field":"cloud.region"}}}]}}}}'
```

#### filter and filters (improved)
The curl query took time dropped from 788 ms to 447 ms for filter and from 1251 ms to 650 ms for filters
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"filter":{"term":{"cloud.region":"us-east-1"}}}}}'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"filters":{"filters":{"east":{"term":{"cloud.region":"us-east-1"}},"west":{"term":{"cloud.region":"us-west-2"}}}}}}}'
```

#### adjacency_matrix (improved)
The curl query took time dropped from 3411 ms to 1712 ms.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"adjacency_matrix":{"filters":{"east":{"term":{"cloud.region":"us-east-1"}},"west":{"term":{"cloud.region":"us-west-2"}},"kernel":{"term":{"process.name":"kernel"}}}}}}}'
```

#### sampler and diversified_sampler
Not much improvement with intra-segment search.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"sample":{"sampler":{"shard_size":1000},"aggs":{"terms":{"terms":{"field":"process.name"}}}}}}'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"sample":{"diversified_sampler":{"shard_size":1000,"field":"cloud.region"},"aggs":{"terms":{"terms":{"field":"process.name"}}}}}}'
```

#### Global aggregations
The curl query took time dropped from 3444 ms to 1770 ms.
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d '{"query":{"match":{"message":"monkey"}},"aggs":{"filtered_avg":{"avg":{"field":"metrics.size"}},"all_docs":{"global":{},"aggs":{"total_avg":{"avg":{"field":"metrics.size"}}}}}}'
```

The curl query took time dropped from 3301 ms to 1668 ms.

```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d '{"size":0,"aggs":{"all_docs":{"global":{},"aggs":{"total_avg":{"avg":{"field":"metrics.size"}}}}}}'
```

### Pipeline Aggregations
Experiments are pending; benchmarks have not yet been run.
### Composite Aggregation (Regresses)
Why it cannot benefit:
- Global queue state: Maintains sorted iteration across all values - can't be safely updated from multiple threads
- Early termination: Needs global view of all values to know when to stop
- Index sort optimization: Broken by partition boundaries
- After key pagination: Requires sequential processing to maintain cursor position
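The early-termination and pagination points can be made concrete with a toy model of a single composite page (`composite_page` is a hypothetical illustration, not OpenSearch code):

```python
def composite_page(keys, size, after=None):
    """One page of a toy composite agg: sorted distinct keys past the cursor.

    The real aggregation walks values in globally sorted order and stops
    after `size` buckets; `after` is the pagination cursor.
    """
    ks = sorted({k for k in keys if after is None or k > after})
    return ks[:size]

# Splitting the doc stream does not preserve early termination: each
# partition must still over-collect `size` candidates past the cursor
# before a correct global merge, so per-partition work is not reduced.
left = composite_page([1, 3, 5, 7], size=2)
right = composite_page([2, 4, 6, 8], size=2)
merged = sorted(left + right)[:2]
assert merged == composite_page([1, 2, 3, 4, 5, 6, 7, 8], size=2)
```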
```shell
# composite aggregation
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"my_buckets":{"composite":{"size":100,"sources":[{"date":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}},{"region":{"terms":{"field":"cloud.region"}}}]}}}}'
```

### BKD-Optimized Aggregations (Minimal Improvement)
These already use BKD tree optimization and are very fast - little room for improvement:
- `date_histogram` (with filter rewrite)
- `range` (with filter rewrite)
- `auto_date_histogram`
### Single Term Queries
The HighTerm, MedTerm, and LowTerm patterns. Postings iteration is I/O-bound and inherently sequential; the inverted index stores doc IDs in sorted order, so parallelism adds overhead without benefit.
```shell
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"term":{"message":"monkey"}}}'
```

### Range Queries
IntNRQ pattern. BKD tree traversal is duplicated per partition. Each partition must traverse the same tree structure.
```shell
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"range":{"metrics.size":{"gte":100,"lte":1000}}}}'
```

### Wildcard/Prefix Queries
Prefix3, Wildcard pattern. Automaton construction is duplicated per partition.
```shell
curl -X GET "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "prefix": { "message": "mon" }
  }
}'

curl -X GET "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": { "message": "mon*ey" }
  }
}'
```

## Related Issues and PRs
- Optimize `PointRangeQuery` for intra-segment concurrency with segment-level `DocIdSet` caching: apache/lucene#15446
- Support `DocIdSetBuilder` with partition bounds: apache/lucene#15383
- Basic default implementation for enabling intra-segment search: #19704
- @ajleong623's intra-segment concurrent search experimenting: #20068
- Handle collectors for variance in a few aggregation types: #18854
- Lucene intra-segment search benchmarks: http://github.com/apache/lucene/pull/13542#issuecomment-2332114836
- OpenSearch intra-segment search slicing mechanism: #18851
- Lucene default distribution with intra-segment search: #18851 (comment)
- Open META issue in OpenSearch for intra-segment search: #18852
- Draft PR in opensearch-benchmark-workloads with new queries to test intra-segment search: opensearch-benchmark-workloads#725
- Add standalone query and aggregation operations for isolated performance benchmarking: opensearch-benchmark-workloads#726
## Test Results
- Intra-segment tests with multiple search clients: #20202 (comment)
- Intra-segment tests with scaling slices and index search thread pool: #20202 (comment)
- Intra-segment tests with multiple segments: #20202 (comment)
- Intra-segment slice skew tests: #20202 (comment)