# [RFC] Intra-Segment Search: Query and Aggregation Performance Analysis
This RFC analyzes which OpenSearch queries and aggregations benefit from Lucene's intra-segment search feature (introduced in Lucene PR #13542) and which ones regress. The goal is to identify optimization opportunities and document expected behavior.
## Background
Lucene's intra-segment search partitions large segments into smaller doc ID ranges that can be processed in parallel. This benefits compute-heavy operations but can hurt lightweight or state-dependent operations due to:
- Duplicated per-partition setup work (e.g., stored fields decompression).
- Coordination/merge overhead exceeding parallelization gains.
- Shared mutable state (e.g., terms agg bucket counters).
- Early termination conflicts (sorted queries).
- BKD tree traversal overhead (range queries).
- Duplicated per-segment work across partitions.
- Incompatible data structures.
## Implementation for enabling intra-segment search
### Sample Shard-Level Slice-to-Partition Assignment
When intra-segment search is enabled, large segments are partitioned and the partitions are distributed across slices, so that no slice holds two partitions of the same segment.
```mermaid
flowchart TB
    subgraph Shard["Shard"]
        subgraph Segments
            S1["Segment 1<br/>1.2M docs"]
            S2["Segment 2<br/>600K docs"]
            S3["Segment 3<br/>100K docs"]
        end
    end
    subgraph Partitions
        S1 --> P1a["S1: 0-400K"]
        S1 --> P1b["S1: 400K-800K"]
        S1 --> P1c["S1: 800K-1.2M"]
        S2 --> P2a["S2: 0-300K"]
        S2 --> P2b["S2: 300K-600K"]
        S3 --> P3["S3: 0-100K<br/>(not partitioned)"]
    end
    subgraph Slices["Slices (Parallel Execution)"]
        subgraph Slice1["Slice 1 - Thread 1"]
            T1P1["S1: 0-400K"]
            T1P2["S2: 300K-600K"]
        end
        subgraph Slice2["Slice 2 - Thread 2"]
            T2P1["S1: 400K-800K"]
            T2P2["S3: 0-100K"]
        end
        subgraph Slice3["Slice 3 - Thread 3"]
            T3P1["S1: 800K-1.2M"]
            T3P2["S2: 0-300K"]
        end
    end
    P1a --> T1P1
    P2b --> T1P2
    P1b --> T2P1
    P3 --> T2P2
    P1c --> T3P1
    P2a --> T3P2
```
**Key Rule:** Same-segment partitions are always placed in different slices (e.g., Segment 1's partitions are spread across Slices 1, 2, and 3).
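The assignment rule above can be sketched as a round-robin deal of each segment's partitions across consecutive slices (an illustrative sketch with a hypothetical `assign_to_slices` helper, not the actual OpenSearch slicing code):

```python
def assign_to_slices(partitions, slice_count):
    """Deal each segment's partitions to consecutive slices, round-robin.

    `partitions` is a list of (segment_id, num_partitions) pairs. Because the
    decision logic caps partitions per segment at `slice_count`, dealing a
    segment's partitions to consecutive slices guarantees that two partitions
    of the same segment never share a slice.
    (Illustrative sketch, not the actual OpenSearch implementation.)
    """
    slices = [[] for _ in range(slice_count)]
    next_slice = 0
    for segment_id, num_parts in partitions:
        for p in range(num_parts):
            # Consecutive slots modulo slice_count are distinct while
            # num_parts <= slice_count.
            slices[(next_slice + p) % slice_count].append((segment_id, p))
        next_slice = (next_slice + num_parts) % slice_count
    return slices

# Shape of the diagram above: 3 + 2 + 1 partitions dealt into 3 slices.
slices = assign_to_slices([(1, 3), (2, 2), (3, 1)], 3)
```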
### Partition Decision Logic
```mermaid
flowchart LR
    A[Segment] --> B{docs >= floor_size AND<br/>docs > targetSize}
    B -->|No| C[Keep whole]
    B -->|Yes| D["N = min(ceil(docs/targetSize), sliceCount)"]
    D --> E[Split into N partitions]
```
- `floor_size` = `search.intra_segment_search.min_segment_size` (a floor setting to avoid partitioning small segments)
- `targetSize` = `totalDocs / sliceCount` (used to decide whether a segment needs to be partitioned)
- `sliceCount` = `search.concurrent_segment_search.target_max_slice_count` (existing slice count; same code path as concurrent segment search)
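The decision logic above can be sketched in Python (an illustrative sketch with a hypothetical `plan_partitions` helper, not the actual OpenSearch implementation):

```python
import math

def plan_partitions(segment_doc_counts, slice_count, floor_size=500_000):
    """Decide, per segment, how many partitions to create.

    Mirrors the decision diagram above: a segment is split only when it has
    at least `floor_size` docs AND more docs than the average docs-per-slice.
    (Illustrative sketch, not the actual OpenSearch code.)
    """
    total_docs = sum(segment_doc_counts)
    target_size = total_docs / slice_count
    plan = {}
    for i, docs in enumerate(segment_doc_counts):
        if docs >= floor_size and docs > target_size:
            n = min(math.ceil(docs / target_size), slice_count)
        else:
            n = 1  # keep the segment whole
        plan[i] = n
    return plan

# Segments of 1.2M, 600K, and 100K docs with 3 slices:
# target_size ≈ 633K, so only the 1.2M-doc segment gets split.
print(plan_partitions([1_200_000, 600_000, 100_000], 3))
```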
## Configuration
| Setting | Default | Description |
|---|---|---|
| `search.intra_segment_search.enabled` | `true` | Enable intra-segment partitioning |
| `search.intra_segment_search.min_segment_size` | `500,000` | Minimum segment size to partition |
## Query Performance Analysis
The following queries were tested on the big5 dataset, force-merged to a single segment to better understand the dynamics and query performance of intra-segment concurrency. The EC2 instance used for testing was an r5.xlarge with 4 vCPUs.
### Queries that improve with intra-segment search
We tested some existing queries and added several new ones to evaluate intra-segment concurrency; the changes are in PR #725. We also validated a few queries using plain curl requests and measured `took` time.
#### Span Queries (improved)
Used the big5 `span-near-query` query.
```
## without intra-segment
| 50th percentile service time | span-near-query | 104.127 | ms |
| 90th percentile service time | span-near-query | 104.796 | ms |

## with intra-segment
| 50th percentile service time | span-near-query | 55.2068 | ms |
| 90th percentile service time | span-near-query | 77.3854 | ms |
```
#### Intervals ordered query (improved)
Covers the LowIntervalsOrdered, MedIntervalsOrdered, and HighIntervalsOrdered patterns.
```
## without intra-segment
| 50th percentile service time | intervals-ordered-message | 91.1661 | ms |
| 90th percentile service time | intervals-ordered-message | 91.6819 | ms |

## with intra-segment
| 50th percentile service time | intervals-ordered-message | 48.3255 | ms |
| 90th percentile service time | intervals-ordered-message | 67.1245 | ms |
```
#### Phrase Queries (improved match_phrase)
`bool-must-double-phrase` (MedSloppyPhrase)
```
## without intra-segment
| 50th percentile service time | bool-must-double-phrase | 127.807 | ms |
| 90th percentile service time | bool-must-double-phrase | 129.298 | ms |

## with intra-segment
| 50th percentile service time | bool-must-double-phrase | 66.7707 | ms |
| 90th percentile service time | bool-must-double-phrase | 92.5768 | ms |
```

#### Boolean AND Queries with High Cardinality (improved)
The AndHighHigh and AndHighMed patterns. Intersection work across multiple high-frequency terms parallelizes well, so boolean AND queries with BM25 scoring improved.
```
## without intra-segment
| 50th percentile service time | bool-must-match-message | 58.7766 | ms |
| 90th percentile service time | bool-must-match-message | 59.3151 | ms |

## with intra-segment
| 50th percentile service time | bool-must-match-message | 33.205 | ms |
| 90th percentile service time | bool-must-match-message | 51.3522 | ms |
```

```
## without intra-segment
| 50th percentile service time | bool-must-term-keyword | 58.9581 | ms |
| 90th percentile service time | bool-must-term-keyword | 59.2319 | ms |

## with intra-segment
| 50th percentile service time | bool-must-term-keyword | 33.1571 | ms |
| 90th percentile service time | bool-must-term-keyword | 39.0753 | ms |
```

#### Boolean OR Queries (improved)
The OrHighHigh and OrHighMed patterns. Union operations with scoring benefit from intra-segment parallelism.
```
## without intra-segment
| 50th percentile service time | bool-should-match-med | 66.7524 | ms |
| 90th percentile service time | bool-should-match-med | 67.5462 | ms |

## with intra-segment
| 50th percentile service time | bool-should-match-med | 37.641 | ms |
| 90th percentile service time | bool-should-match-med | 53.3731 | ms |
```

Saw an improvement in curl `took` time from 67 ms to 37 ms.
```shell
# OrHighHigh
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"track_total_hits":false,"query":{"bool":{"should":[{"match":{"message":"monkey"}},{"match":{"message":"rabbit"}}],"minimum_should_match":1}}}'

# OrHighMed
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"track_total_hits":false,"query":{"bool":{"should":[{"match":{"message":"monkey"}},{"match":{"message":"spirit"}}],"minimum_should_match":1}}}'
```

#### Query string (improved)
Directly used the `query-string-on-message` query from the big5 workload.
```
## without intra-segment
| 50th percentile service time | query-string-on-message | 146.762 | ms |
| 90th percentile service time | query-string-on-message | 148.851 | ms |

## with intra-segment
| 50th percentile service time | query-string-on-message | 79.3278 | ms |
| 90th percentile service time | query-string-on-message | 115.854 | ms |
```

## Aggregation Performance Analysis
### Metric Aggregations
Metric aggregations benefit because DocValues scanning and computation parallelize well. Each partition computes partial results that merge cheaply at the end.
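Why per-partition partial results merge cleanly can be seen with a toy `stats` reduction (hypothetical helpers `partial_stats`/`merge_stats` for illustration, not OpenSearch code):

```python
def partial_stats(values):
    """Per-partition partial result for a toy `stats` aggregation."""
    return {
        "count": len(values),
        "sum": float(sum(values)),
        "min": min(values),
        "max": max(values),
    }

def merge_stats(partials):
    """Merge partition-level partials into the final stats.

    The merge is associative and order-independent, which is why each
    partition can be scanned by a different thread.
    """
    merged = {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
        "min": min(p["min"] for p in partials),
        "max": max(p["max"] for p in partials),
    }
    merged["avg"] = merged["sum"] / merged["count"]
    return merged

# Two partitions of one segment produce the same stats as a single scan.
print(merge_stats([partial_stats([1, 2, 3]), partial_stats([4, 5])]))
```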
#### stats aggregation (improved)
The curl query took time dropped from 2859 ms to 1494 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "metrics_stats": { "stats": { "field": "metrics.size" } } } }'
```

#### sum (improved)
The curl query took time dropped from 2046 ms to 1022 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"tsum":{"sum":{"field":"metrics.size"}}}}'
```

#### size_percentiles (improved)
The curl query took time dropped from 12843 ms to 6515 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d' { "size": 0, "aggs": { "size_percentiles": { "percentiles": { "field": "metrics.size" } } } }'
```

#### extended_stats (improved)
The curl query took time dropped from 3718 ms to 1858 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"size_extended":{"extended_stats":{"field":"metrics.size"}}}}'
```

#### Multiple metric aggregations (improved metrics-agg-no-range)
```
## without intra-segment
| 50th percentile service time | metrics-agg-no-range | 9904.66 | ms |
| 90th percentile service time | metrics-agg-no-range | 9978.55 | ms |

## with intra-segment
| 50th percentile service time | metrics-agg-no-range | 4971.42 | ms |
| 90th percentile service time | metrics-agg-no-range | 5088.13 | ms |
```
```shell
curl -s "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "track_total_hits": false,
  "aggs": {
    "tsum": { "sum": { "field": "metrics.size" } },
    "tmin": { "min": { "field": "metrics.tmin" } },
    "tavg": { "avg": { "field": "metrics.size" } },
    "tmax": { "max": { "field": "metrics.size" } },
    "tstats": { "stats": { "field": "metrics.size" } }
  }
}'
```

#### value_count (improved)
The curl query took time dropped from 789 ms to 395 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"count_values":{"value_count":{"field":"metrics.size"}}}}'
```

#### percentile_ranks (improved)
The curl query took time dropped from 12747 ms to 6432 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"size_ranks":{"percentile_ranks":{"field":"metrics.size","values":[100,500,1000]}}}}'
```

#### matrix_stats aggregation (improved)
The curl query took time dropped from 43773 ms to 22178 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"matrix":{"matrix_stats":{"fields":["metrics.size","metrics.tmin"]}}}}'
```

#### top_docs (improved)
The curl query took time dropped from 5618 ms to 2920 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":5},"aggs":{"top_docs":{"top_hits":{"size":3}}}}}}'
```

#### cardinality high aggregation (improved)

```
## with intra-segment
| 50th percentile service time | cardinality-agg-high | 665.718 | ms |
| 90th percentile service time | cardinality-agg-high | 723.834 | ms |

## without intra-segment
| 50th percentile service time | cardinality-agg-high | 1312.29 | ms |
| 90th percentile service time | cardinality-agg-high | 1323.5 | ms |
```

#### top_hits nested under terms (improved)
The curl query took time dropped from 5430 ms to 2829 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":5},"aggs":{"top_docs":{"top_hits":{"size":3}}}}}}'
```

### Bucket Aggregations
#### terms (improved)
Saw a slight improvement with curl queries.

```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"by_region":{"terms":{"field":"cloud.region","size":10}}}}'
```

#### significant_terms (improved)
The curl query took time dropped from 5970 ms to 3170 ms
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"track_total_hits":false,"aggs":{"by_process":{"terms":{"field":"process.name","size":10},"aggs":{"significant_regions":{"significant_terms":{"field":"cloud.region"}}}}}}'
```

#### multi_terms (improved)
The curl query took time dropped from 50068 ms to 25922 ms
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"multi_terms":{"terms":[{"field":"process.name"},{"field":"cloud.region"}],"size":10}}}}'
```

#### histogram (improved)
The curl query took time dropped from 3355 ms to 1691 ms
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"histogram":{"field":"metrics.size","interval":100}}}}'
```

#### date_histogram and auto_date_histogram (already fast since using BKD)
Not much improvement, as these already use the BKD tree.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}}}' | jq '.'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"auto_date_histogram":{"field":"@timestamp","buckets":10}}}}' | jq '.'
```

#### range and date_range (using BKD)
Not much improvement, as these already use the BKD tree.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"range":{"field":"metrics.size","ranges":[{"to":100},{"from":100,"to":500},{"from":500}]}}}}'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"date_range":{"field":"@timestamp","ranges":[{"to":"2023-06-01"},{"from":"2023-06-01"}]}}}}'
```

#### composite
Showed a regression with intra-segment search.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"composite":{"sources":[{"region":{"terms":{"field":"cloud.region"}}},{"process":{"terms":{"field":"process.name"}}}]}}}}'

curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"my_buckets":{"composite":{"size":100,"sources":[{"date":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}},{"region":{"terms":{"field":"cloud.region"}}}]}}}}'
```

#### filter and filters (improved)
The curl query took time dropped from 788 ms to 447 ms for filter and from 1251 ms to 650 ms for filters
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"filter":{"term":{"cloud.region":"us-east-1"}}}}}'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"filters":{"filters":{"east":{"term":{"cloud.region":"us-east-1"}},"west":{"term":{"cloud.region":"us-west-2"}}}}}}}'
```

#### adjacency_matrix (improved)
The curl query took time dropped from 3411 ms to 1712 ms.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"a":{"adjacency_matrix":{"filters":{"east":{"term":{"cloud.region":"us-east-1"}},"west":{"term":{"cloud.region":"us-west-2"}},"kernel":{"term":{"process.name":"kernel"}}}}}}}'
```

#### sampler and diversified_sampler
Not much improvement with intra-segment search.
```shell
curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"sample":{"sampler":{"shard_size":1000},"aggs":{"terms":{"terms":{"field":"process.name"}}}}}}'

curl -s "localhost:9200/big5/_search" -H 'Content-Type: application/json' -d'
{"size":0,"track_total_hits":false,"aggs":{"sample":{"diversified_sampler":{"shard_size":1000,"field":"cloud.region"},"aggs":{"terms":{"terms":{"field":"process.name"}}}}}}'
```

#### Global aggregations
The curl query took time dropped from 3444 ms to 1770 ms.
```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d '{"query":{"match":{"message":"monkey"}},"aggs":{"filtered_avg":{"avg":{"field":"metrics.size"}},"all_docs":{"global":{},"aggs":{"total_avg":{"avg":{"field":"metrics.size"}}}}}}'
```

The curl query took time dropped from 3301 ms to 1668 ms.

```shell
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d '{"size":0,"aggs":{"all_docs":{"global":{},"aggs":{"total_avg":{"avg":{"field":"metrics.size"}}}}}}'
```

### Pipeline Aggregations
Experiments are pending; benchmarks have not yet been run.
### Composite Aggregation (Regresses)
Why it cannot benefit:
- Global queue state: Maintains sorted iteration across all values - can't be safely updated from multiple threads
- Early termination: Needs global view of all values to know when to stop
- Index sort optimization: Broken by partition boundaries
- After key pagination: Requires sequential processing to maintain cursor position
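The early-termination and pagination points can be made concrete with a toy model of a single composite page (`composite_page` is a hypothetical illustration, not OpenSearch code):

```python
def composite_page(keys, size, after=None):
    """One page of a toy composite agg: sorted distinct keys past the cursor.

    The real aggregation walks values in globally sorted order and stops
    after `size` buckets; `after` is the pagination cursor.
    """
    ks = sorted({k for k in keys if after is None or k > after})
    return ks[:size]

# Splitting the doc stream does not preserve early termination: each
# partition must still over-collect `size` candidates past the cursor
# before a correct global merge, so per-partition work is not reduced.
left = composite_page([1, 3, 5, 7], size=2)
right = composite_page([2, 4, 6, 8], size=2)
merged = sorted(left + right)[:2]
assert merged == composite_page([1, 2, 3, 4, 5, 6, 7, 8], size=2)
```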
```shell
# composite aggregation
curl -s "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"size":0,"aggs":{"my_buckets":{"composite":{"size":100,"sources":[{"date":{"date_histogram":{"field":"@timestamp","calendar_interval":"day"}}},{"region":{"terms":{"field":"cloud.region"}}}]}}}}'
```

### BKD-Optimized Aggregations (Minimal Improvement)
These already use BKD tree optimization and are very fast - little room for improvement:
- `date_histogram` (with filter rewrite)
- `range` (with filter rewrite)
- `auto_date_histogram`
### Single Term Queries
The HighTerm, MedTerm, and LowTerm patterns. Postings iteration is I/O-bound and inherently sequential; the inverted index stores doc IDs in sorted order, so parallelism adds overhead without benefit.
```shell
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"term":{"message":"monkey"}}}'
```

### Range Queries
IntNRQ pattern. BKD tree traversal is duplicated per partition. Each partition must traverse the same tree structure.
```shell
curl -X GET "localhost:9200/big5/_search?pretty" -H 'Content-Type: application/json' -d'{"query":{"range":{"metrics.size":{"gte":100,"lte":1000}}}}'
```

### Wildcard/Prefix Queries
Prefix3, Wildcard pattern. Automaton construction is duplicated per partition.
```shell
curl -X GET "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "prefix": { "message": "mon" }
  }
}'

curl -X GET "localhost:9200/index/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "wildcard": { "message": "mon*ey" }
  }
}'
```

## Related Issues and PRs
- Optimize `PointRangeQuery` for intra-segment concurrency with segment-level `DocIdSet` caching: apache/lucene#15446
- Support `DocIdSetBuilder` with partition bounds: apache/lucene#15383
- Basic default implementation for enabling intra-segment search: #19704
- @ajleong623's intra-segment concurrent search experimenting: #20068
- Handle collectors for variance in a few aggregation types: #18854
- Lucene intra-segment search benchmarks: http://github.com/apache/lucene/pull/13542#issuecomment-2332114836
- OpenSearch intra-segment search slicing mechanism: #18851
- Lucene default distribution with intra-segment search: #18851 (comment)
- Open META issue in OpenSearch for intra-segment search: #18852
- Draft PR in opensearch-benchmark-workloads with new queries to test intra-segment search: opensearch-benchmark-workloads#725
- Add standalone query and aggregation operations for isolated performance benchmarking: opensearch-benchmark-workloads#726
## Test Results
- Intra-segment tests with multiple search clients: #20202 (comment)
- Intra-segment tests with scaling slices and index search thread pool: #20202 (comment)
- Intra-segment tests with multiple segments: #20202 (comment)
- Intra-segment slice skew tests: #20202 (comment)