[RFC] Intra-Segment Search: Query and Aggregation Performance Analysis #20202

@prudhvigodithi

Description

This RFC analyzes which OpenSearch queries and aggregations benefit from Lucene's intra-segment search feature (introduced in Lucene PR #13542) and which ones regress. The goal is to identify optimization opportunities and document expected behavior.

Background

Lucene's intra-segment search partitions large segments into smaller doc ID ranges that can be processed in parallel. This benefits compute-heavy operations but can hurt lightweight or state-dependent operations due to:

  • Duplicated per-partition setup work (e.g., stored fields decompression).
  • Coordination/merge overhead exceeding parallelization gains.
  • Shared mutable state (e.g., terms agg bucket counters).
  • Early termination conflicts (sorted queries).
  • BKD tree traversal overhead (range queries).
  • Duplicated per-segment work across partitions.
  • Incompatible data structures.

Implementation for enabling intra-segment search

Sample Shard-Level Slice to Partition Assignment

When intra-segment search is enabled, large segments are partitioned and the partitions are distributed across slices; partitions of the same segment are never placed in the same slice.

flowchart TB
    subgraph Shard["Shard"]
        subgraph Segments
            S1["Segment 1<br/>1.2M docs"]
            S2["Segment 2<br/>600K docs"]
            S3["Segment 3<br/>100K docs"]
        end
    end

    subgraph Partitions
        S1 --> P1a["S1: 0-400K"]
        S1 --> P1b["S1: 400K-800K"]
        S1 --> P1c["S1: 800K-1.2M"]
        S2 --> P2a["S2: 0-300K"]
        S2 --> P2b["S2: 300K-600K"]
        S3 --> P3["S3: 0-100K<br/>(not partitioned)"]
    end

    subgraph Slices["Slices (Parallel Execution)"]
        subgraph Slice1["Slice 1 - Thread 1"]
            T1P1["S1: 0-400K"]
            T1P2["S2: 300K-600K"]
        end
        subgraph Slice2["Slice 2 - Thread 2"]
            T2P1["S1: 400K-800K"]
            T2P2["S3: 0-100K"]
        end
        subgraph Slice3["Slice 3 - Thread 3"]
            T3P1["S1: 800K-1.2M"]
            T3P2["S2: 0-300K"]
        end
    end

    P1a --> T1P1
    P2b --> T1P2
    P1b --> T2P1
    P3 --> T2P2
    P1c --> T3P1
    P2a --> T3P2

Key Rule: Same-segment partitions are always in different slices (e.g., S1 partitions spread across Slice 1, 2, 3).
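The key rule can be illustrated with a minimal sketch (hypothetical helper names; this is not Lucene's actual assignment code, which also balances slices by estimated cost):

```python
from collections import defaultdict

def assign_partitions_to_slices(partitions, num_slices):
    # Group partitions by segment, then hand them out with one rotating
    # cursor. Because a segment produces at most num_slices partitions
    # (N is capped at sliceCount), consecutive slots are enough to keep
    # same-segment partitions in distinct slices.
    by_segment = defaultdict(list)
    for segment_id, doc_range in partitions:
        by_segment[segment_id].append((segment_id, doc_range))

    slices = [[] for _ in range(num_slices)]
    cursor = 0
    for segment_partitions in by_segment.values():
        for partition in segment_partitions:
            slices[cursor % num_slices].append(partition)
            cursor += 1
    return slices
```

With the sample above (three S1 partitions, two S2 partitions, one S3 partition, three slices), each slice ends up holding partitions from two different segments.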


Partition Decision Logic

flowchart LR
    A[Segment] --> B{docs >= floor_size AND<br/>docs > targetSize}
    B -->|No| C[Keep whole]
    B -->|Yes| D["N = min(ceil(docs/targetSize), sliceCount)"]
    D --> E[Split into N partitions]
  • floor_size = search.intra_segment_search.min_segment_size (a floor setting to avoid partitioning small segments)
  • targetSize = totalDocs / sliceCount (calculated to decide whether a segment needs to be partitioned)
  • sliceCount = search.concurrent_segment_search.target_max_slice_count (existing slice count; same code used by concurrent segment search)
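The decision logic above can be sketched as follows (semantics assumed from the flowchart; `plan_partitions` and its defaults are illustrative, not the actual OpenSearch code):

```python
import math

def plan_partitions(doc_count, total_docs, slice_count, floor_size=500_000):
    """Return how many partitions a segment should be split into.

    A segment is split only when it is both above the floor setting and
    larger than the per-slice target; the partition count is capped at
    the slice count so same-segment partitions can occupy distinct slices.
    """
    target_size = total_docs // slice_count
    if doc_count < floor_size or doc_count <= target_size:
        return 1  # keep the segment whole
    return min(math.ceil(doc_count / target_size), slice_count)
```

Note that a mid-sized segment (above the floor but below targetSize) is still kept whole, since splitting it would add coordination overhead without filling a slice.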

Configuration

| Setting | Default | Description |
|---|---|---|
| `search.intra_segment_search.enabled` | `true` | Enable intra-segment partitioning |
| `search.intra_segment_search.min_segment_size` | `500,000` | Minimum segment size to partition |

Query Performance Analysis

The following queries were tested on the big5 dataset, force-merged to a single segment to better isolate the dynamics and query performance of intra-segment concurrency. The EC2 instance used for testing was an r5.xlarge with 4 vCPUs.

Queries that improve with intra-segment search

We tested some existing queries and added several new ones to evaluate intra-segment concurrency; the changes are in PR #725. We also validated a few queries with plain curl requests and measured took time.

Span Queries (improved)

Used big5 span-near-query query.

## without intra-segment
|                                   50th percentile service time | span-near-query |     104.127 |     ms |
|                                   90th percentile service time | span-near-query |     104.796 |     ms |

## with intra-segment
|                                   50th percentile service time | span-near-query |     55.2068 |     ms |
|                                   90th percentile service time | span-near-query |     77.3854 |     ms |

Intervals ordered query (improved)

LowIntervalsOrdered, MedIntervalsOrdered, HighIntervalsOrdered

## without intra-segment
|                                   50th percentile service time | intervals-ordered-message |     91.1661 |     ms |
|                                   90th percentile service time | intervals-ordered-message |     91.6819 |     ms |

## with intra-segment
|                                   50th percentile service time | intervals-ordered-message |     48.3255 |     ms |
|                                   90th percentile service time | intervals-ordered-message |     67.1245 |     ms |

Phrase Queries (improved match_phrase)

bool-must-double-phrase (MedSloppyPhrase)

## without intra-segment
|                                   50th percentile service time | bool-must-double-phrase |     127.807 |     ms |
|                                   90th percentile service time | bool-must-double-phrase |     129.298 |     ms |
## with intra-segment
|                                   50th percentile service time | bool-must-double-phrase |     66.7707 |     ms |
|                                   90th percentile service time | bool-must-double-phrase |     92.5768 |     ms |

Boolean AND Queries with High Cardinality (improved)

The AndHighHigh and AndHighMed queries. Intersection across multiple high-frequency terms parallelizes well, so boolean AND queries with BM25 scoring improved.

## without intra-segment
|                                   50th percentile service time | bool-must-match-message |     58.7766 |     ms |
|                                   90th percentile service time | bool-must-match-message |     59.3151 |     ms |
## with intra-segment
|                                   50th percentile service time | bool-must-match-message |      33.205 |     ms |
|                                   90th percentile service time | bool-must-match-message |     51.3522 |     ms |
## without intra-segment
|                                   50th percentile service time | bool-must-term-keyword |     58.9581 |     ms |
|                                   90th percentile service time | bool-must-term-keyword |     59.2319 |     ms |
## with intra-segment
|                                   50th percentile service time | bool-must-term-keyword |     33.1571 |     ms |
|                                   90th percentile service time | bool-must-term-keyword |     39.0753 |     ms |

Boolean OR Queries (improved)

The OrHighHigh and OrHighMed patterns. Union operations with scoring benefit from intra-segment parallelism.

## without intra-segment
|                                   50th percentile service time | bool-should-match-med |     66.7524 |     ms |
|                                   90th percentile service time | bool-should-match-med |     67.5462 |     ms |
## with intra-segment
|                                   50th percentile service time | bool-should-match-med |      37.641 |     ms |
|                                   90th percentile service time | bool-should-match-med |     53.3731 |     ms |
Also seen improvement with curl: "took" time dropped from 67 ms to 37 ms.
# OrHighHigh
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"track_total_hits":false,"query":{"bool":{"should":[{"match":{"message":"monkey"}},{"match":{"message":"rabbit"}}],"minimum_should_match":1}}}'

# OrHighMed
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"track_total_hits":false,"query":{"bool":{"should":[{"match":{"message":"monkey"}},{"match":{"message":"spirit"}}],"minimum_should_match":1}}}'

Query string (improved)

Directly used the query-string-on-message query from the big5 workload.

## without intra-segment
|                                   50th percentile service time | query-string-on-message |     146.762 |     ms |
|                                   90th percentile service time | query-string-on-message |     148.851 |     ms |
## with intra-segment
|                                   50th percentile service time | query-string-on-message |     79.3278 |     ms |
|                                   90th percentile service time | query-string-on-message |     115.854 |     ms |

Aggregation Performance Analysis

Metric Aggregations

Metric aggregations benefit because DocValues scanning and computation parallelize well. Each partition computes partial results that are cheap to merge.
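For example, a stats aggregation reduces to per-partition partials that merge associatively. This hypothetical sketch (not OpenSearch's collector code) shows why the reduce step is trivial relative to the per-document scan:

```python
def partition_stats(values):
    """Per-partition partial result for a stats-style aggregation."""
    return {
        "count": len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }

def merge_stats(partials):
    # Merging is O(number of partitions), while each partial costs
    # O(docs in partition) to build -- so the scan dominates and
    # parallelizes cleanly across partitions.
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
    }
```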

stats aggregation (improved)

The curl query took time dropped from 2859 ms to 1494 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "metrics_stats": { "stats": { "field": "metrics.size" } } } }'
sum (improved)

The curl query took time dropped from 2046 ms to 1022 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"tsum":{"sum":{"field":"metrics.size"}}}}'
size_percentiles (improved)

The curl query took time dropped from 12843 ms to 6515 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "size_percentiles": { "percentiles": { "field": "metrics.size" } } } }'
extended_stats (improved)

The curl query took time dropped from 3718 ms to 1858 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"size_extended":{"extended_stats":{"field":"metrics.size"}}}}'
Multiple metric aggregations (improved metrics-agg-no-range)
## without intra-segment
|                                   50th percentile service time | metrics-agg-no-range |     9904.66 |     ms |
|                                   90th percentile service time | metrics-agg-no-range |     9978.55 |     ms |
## with intra-segment
|                                   50th percentile service time | metrics-agg-no-range |     4971.42 |     ms |
|                                   90th percentile service time | metrics-agg-no-range |     5088.13 |     ms |
curl -s "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "tsum": { "sum": { "field": "metrics.size" } },
    "tmin": { "min": { "field": "metrics.tmin" } },
    "tavg": { "avg": { "field": "metrics.size" } },
    "tmax": { "max": { "field": "metrics.size" } },
    "tstats": { "stats": { "field": "metrics.size" } }
  }
}'
value_count (improved)

The curl query took time dropped from 789 ms to 395 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"count_values":{"value_count":{"field":"metrics.size"}}}}'
percentile_ranks (improved)

The curl query took time dropped from 12747 ms to 6432 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"size_ranks":{"percentile_ranks":{"field":"metrics.size","values":[100,500,1000]}}}}'
matrix_stats aggregation (improved)

The curl query took time dropped from 43773 ms to 22178 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"matrix":{"matrix_stats":{"fields":["metrics.size","metrics.tmin"]}}}}'
top_docs (improved)

The curl query took time dropped from 5618 ms to 2920 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":5},"aggs":{"top_docs":{"top_hits":{"size":3}}}}}}'
cardinality high aggregation (improved)

## without intra-segment
|                                   50th percentile service time | cardinality-agg-high |     1312.29 |     ms |
|                                   90th percentile service time | cardinality-agg-high |      1323.5 |     ms |

## with intra-segment
|                                   50th percentile service time | cardinality-agg-high |     665.718 |     ms |
|                                   90th percentile service time | cardinality-agg-high |     723.834 |     ms |
top_hits nested under terms (improved)

The curl query took time dropped from 5430 ms to 2829 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":5},"aggs":{"top_docs":{"top_hits":{"size":3}}}}}}'

Bucket Aggregations

terms (improved)

Seen slight improvement with curl queries.

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":10}}}}'
significant_terms (improved)

The curl query took time dropped from 5970 ms to 3170 ms

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_process":{"terms":{"field":"process.name","size":10},"aggs":{"significant_regions":{"significant_terms":{"field":"cloud.region"}}}}}}'
multi_terms (improved)

The curl query took time dropped from 50068 ms to 25922 ms

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"multi_terms":{"terms":[{"field":"process.name"},{"field":"cloud.region"}],"size":10}}}}'
histogram (improved)

The curl query took time dropped from 3355 ms to 1691 ms

 curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"histogram":{"field":"metrics.size","interval":100}}}}'
date_histogram and auto_date_histogram (already fast since using BKD)

Not much improvement, as these already use the BKD tree.

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}}}' | jq '.'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"auto_date_histogram":{"field":"@timestamp","buckets":10}}}}' | jq '.'
range and date_range (using BKD)

Not much improvement, as these already use the BKD tree.

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"range":{"field":"metrics.size","ranges":[{"to":100},{"from":100,"to":500},{"from":500}]}}}}' 

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"date_range":{"field":"@timestamp","ranges":[{"to":"2023-06-01"},{"from":"2023-06-01"}]}}}}' 
composite (regresses)

Regression observed with intra-segment search.

 curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"composite":{"sources":[{"region":{"terms":{"field":"cloud.region"}}},{"process":{"terms":{"field":"process.name"}}}]}}}}'

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"my_buckets":{"composite":{"size":100,"sources":[{"date":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}},{"region":{"terms":{"field":"cloud.region"}}}]}}}}'
filter and filters (improved)

The curl query took time dropped from 788 ms to 447 ms for filter and from 1251 ms to 650 ms for filters.

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"filter":{"term":{"cloud.region":"us-east-1"}}}}}' 

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"filters":{"filters":{"east":{"term":{"cloud.region":"us-east-1"}},"west":{"term":{"cloud.region":"us-west-2"}}}}}}}'
adjacency_matrix (improved)

The curl query took time dropped from 3411 ms to 1712 ms.

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"adjacency_matrix":{"filters":{"east":{"term":{"cloud.region":"us-east-1"}},"west":{"term":{"cloud.region":"us-west-2"}},"kernel":{"term":{"process.name":"kernel"}}}}}}}'
sampler and diversified_sampler

Not much improvement with intra-segment search.

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"sample":{"sampler":{"shard_size":1000},"aggs":{"terms":{"terms":{"field":"process.name"}}}}}}' 

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"sample":{"diversified_sampler":{"shard_size":1000,"field":"cloud.region"},"aggs":{"terms":{"terms":{"field":"process.name"}}}}}}'
Global aggregations (improved)

The curl query took time dropped from 3444 ms to 1770 ms.

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d '{"query":{"match":{"message":"monkey"}},"aggs":{"filtered_avg":{"avg":{"field":"metrics.size"}},"all_docs":{"global":{},"aggs":{"total_avg":{"avg":{"field":"metrics.size"}}}}}}'

The curl query took time dropped from 3301 ms to 1668 ms.

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d '{"size":0,"aggs":{"all_docs":{"global":{},"aggs":{"total_avg":{"avg":{"field":"metrics.size"}}}}}}'

Pipeline Aggregations

Not yet tested; benchmarks are pending.

Composite Aggregation (Regresses)

Why it cannot benefit:

  • Global queue state: Maintains sorted iteration across all values - can't be safely updated from multiple threads
  • Early termination: Needs global view of all values to know when to stop
  • Index sort optimization: Broken by partition boundaries
  • After key pagination: Requires sequential processing to maintain cursor position
# composite aggregation
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"my_buckets":{"composite":{"size":100,"sources":[{"date":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}},{"region":{"terms":{"field":"cloud.region"}}}]}}}}'
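A rough sketch of why the after-key pagination above forces sequential, globally ordered processing (illustrative only, not OpenSearch's implementation): every partition's candidate keys must flow through one globally sorted, deduplicated view before any bucket can be emitted, so no partition can independently decide which of its buckets belong to the current page.

```python
import heapq

def composite_page(partition_keys, size, after=None):
    # Each partition contributes its own locally sorted key stream; a
    # correct page requires merging all streams into one global order,
    # skipping past the after key, and deduplicating keys seen by
    # multiple partitions. Early termination is only safe once this
    # global view exists -- exactly the shared state that blocks
    # per-partition parallelism.
    merged = heapq.merge(*[sorted(keys) for keys in partition_keys])
    page = []
    for key in merged:
        if after is not None and key <= after:
            continue
        if not page or page[-1] != key:
            page.append(key)
        if len(page) == size:
            break
    return page
```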

BKD-Optimized Aggregations (Minimal Improvement)

These already use BKD tree optimization and are very fast - little room for improvement:

  • date_histogram (with filter rewrite)
  • range (with filter rewrite)
  • auto_date_histogram

Single Term Queries

HighTerm, MedTerm, LowTerm pattern. Postings iteration is I/O-bound and inherently sequential. The inverted index stores doc IDs in sorted order - parallelism adds overhead without benefit.

curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"term":{"message":"monkey"}}}'

Range Queries

IntNRQ pattern. BKD tree traversal is duplicated per partition. Each partition must traverse the same tree structure.

curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"range":{"metrics.size":{"gte":100,"lte":1000}}}}'

Wildcard/Prefix Queries

Prefix3, Wildcard pattern. Automaton construction is duplicated per partition.

curl -X GET "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "prefix": { "message": "mon" }
  }
}'

curl -X GET "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": { "message": "mon*ey" }
  }
}'

Related Issues and PRs

Test Results
