Description
Is your feature request related to a problem? Please describe
Parent-child join inner_hits has a significant performance bottleneck. When inner_hits is enabled, the Fetch Phase executes N independent searches for N parent hits - one search per parent to retrieve its child documents.
Root cause: In the Query Phase, Lucene's JoinUtil.createJoinQuery uses global ordinals for child→parent matching. Child document IDs are aggregated into a LongBitSet, and the mapping of which child matched which parent is discarded. Therefore, InnerHitsPhase must re-query for each parent hit.
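To illustrate why the association is unrecoverable, here is a toy analog of the collection step (plain `java.util.BitSet` standing in for Lucene's `LongBitSet`; `JoinBitSetDemo`, `childToParentOrd`, and the other names are hypothetical, not the actual Lucene code):

```java
import java.util.BitSet;
import java.util.List;
import java.util.Map;

// Illustrative analog of the Query Phase join: the collector records only
// WHICH parent ordinals had a matching child, one bit each, so the
// child->parent mapping is gone once collection finishes.
public class JoinBitSetDemo {
    // childToParentOrd: child doc id -> parent global ordinal
    public static BitSet collectMatchedParents(Map<Integer, Integer> childToParentOrd,
                                               List<Integer> matchingChildren) {
        BitSet matchedParentOrds = new BitSet();
        for (int child : matchingChildren) {
            // Only the parent's ordinal bit is set; the child id is discarded here.
            matchedParentOrds.set(childToParentOrd.get(child));
        }
        // No way to recover which children matched which parent from this bit set.
        return matchedParentOrds;
    }
}
```

Because only bits survive, `InnerHitsPhase` has no choice but to re-run a search per parent to find that parent's children.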
Current execution flow (`InnerHitsPhase.hitExecute()`):

```java
for (Map.Entry<String, InnerHitSubContext> entry : innerHits.entrySet()) {
    InnerHitSubContext innerHitsContext = entry.getValue();
    TopDocsAndMaxScore topDoc = innerHitsContext.topDocs(hit); // ← N times
    fetchPhase.execute(innerHitsContext);                      // ← N times
}
```

Each `topDocs(hit)` call:
- Constructs a new Query
- Creates a new Weight
- Scans ALL segments
- Collects results with TopDocsCollector
Cost: O(N × S) where N = parent hits, S = segments. With 100 parent hits and 10 segments, this means 1000 segment scans instead of 10.
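As a sanity check on that arithmetic, the two scan counts can be written as trivial helpers (illustrative only, not OpenSearch code):

```java
// Back-of-the-envelope scan counts from the O(N × S) cost model above.
public class InnerHitsCost {
    // Current: every parent hit triggers a scan of every segment.
    public static long currentScans(long parentHits, long segments) {
        return parentHits * segments;
    }

    // Proposed: one batched query visits each segment once,
    // regardless of the number of parent hits.
    public static long batchedScans(long parentHits, long segments) {
        return segments;
    }
}
```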
Related issues:
- #16878 reported the same problem (3+ second FetchPhase with inner_hits). It was addressed with cache optimization in PR #16937, but that doesn't reduce the number of searches.
Describe the solution you'd like
Batch the N searches into 1 search using TermInSetQuery.
| Operation | Current | Proposed |
|---|---|---|
| Query construction | N times | 1 time |
| Weight creation | N times | 1 time |
| Segment scan | N × S times | S times |
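A minimal in-memory analog of the batching idea (all names here, `BatchedChildLookup`, `ChildDoc`, `childrenByParent`, are hypothetical): instead of one pass over the documents per parent, a single pass tests each document's parent ID against a set, which is conceptually what `TermInSetQuery` does at the index level, and groups the matches by parent:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory sketch of replacing N per-parent lookups with one set-membership pass.
public class BatchedChildLookup {
    public record ChildDoc(String parentId, String childId) {}

    // One scan over all docs, grouping matching children by parent ID.
    public static Map<String, List<String>> childrenByParent(List<ChildDoc> docs,
                                                             Set<String> parentIds) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (ChildDoc d : docs) {                     // single pass over the data
            if (parentIds.contains(d.parentId())) {   // set membership, like TermInSetQuery
                grouped.computeIfAbsent(d.parentId(), k -> new ArrayList<>())
                       .add(d.childId());
            }
        }
        return grouped;
    }
}
```

The grouping step mirrors step 4 of the proposed implementation: the single result stream is fanned back out to per-parent `inner_hits` buckets.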
Implementation
- Add a batch API to `InnerHitSubContext`:

```java
// New method with a default fallback to the existing per-hit behavior
public Map<SearchHit, TopDocsAndMaxScore> topDocs(List<SearchHit> hits) throws IOException {
    Map<SearchHit, TopDocsAndMaxScore> results = new HashMap<>();
    for (SearchHit hit : hits) {
        results.put(hit, topDocs(hit)); // fallback: one search per hit
    }
    return results;
}

public boolean supportsBatchExecution() { return false; }
```

- Implement the batch method in `JoinFieldInnerHitSubContext`:
```java
@Override
public Map<SearchHit, TopDocsAndMaxScore> topDocs(List<SearchHit> hits) throws IOException {
    // 1. Collect all parent IDs
    List<BytesRef> parentIds = hits.stream().map(h -> new BytesRef(h.getId())).toList();

    // 2. Build a single TermInSetQuery over all parent IDs
    Query batchQuery = new BooleanQuery.Builder()
        .add(new TermInSetQuery(parentIdField, parentIds), BooleanClause.Occur.FILTER)
        .add(joinFieldQuery, BooleanClause.Occur.FILTER)
        .build();

    // 3. Run one search: a single scan of all segments
    // 4. Group the collected child hits by parent ID
    return groupedResults;
}

@Override
public boolean supportsBatchExecution() { return true; }
```

- Update `InnerHitsPhase` to use the batch API when available.
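A minimal sketch of that dispatch, using plain `String`s in place of `SearchHit`/`TopDocsAndMaxScore` (all types here are hypothetical stand-ins, not the real OpenSearch classes):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how InnerHitsPhase could choose between the
// batch API and the existing one-search-per-hit loop.
public class InnerHitsDispatch {
    interface SubContext {
        List<String> topDocs(String hit); // per-hit search (existing behavior)

        // Batch API with a default fallback: N sequential searches.
        default Map<String, List<String>> topDocs(List<String> hits) {
            Map<String, List<String>> out = new LinkedHashMap<>();
            for (String h : hits) out.put(h, topDocs(h));
            return out;
        }

        default boolean supportsBatchExecution() { return false; }
    }

    // Use the batch API only when the sub-context opts in.
    public static Map<String, List<String>> execute(SubContext ctx, List<String> hits) {
        if (ctx.supportsBatchExecution()) {
            return ctx.topDocs(hits); // one batched search
        }
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String h : hits) out.put(h, ctx.topDocs(h)); // N searches, as today
        return out;
    }

    // Toy per-hit implementation used in the usage example below.
    public static class PerHit implements SubContext {
        public List<String> topDocs(String hit) { return List.of(hit + "-child"); }
    }
}
```

With the default `supportsBatchExecution() == false`, existing plugin implementations keep today's per-hit loop untouched.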
Side Effects Verified
| Concern | Investigation | Impact |
|---|---|---|
| Memory | Batch holds all child docs temporarily | Bounded by the `batch_size` cap (see Enablement) |
| Latency | Current impl is NOT streaming (Fetch Phase completes before response) | ✅ No change |
| Error handling | Current impl aborts on first error, no partial results | ✅ No change |
| Result order | Depends on implementation | ✅ Can maintain order |
| Scoring | inner_hits scores don't affect parent scores | ✅ No impact |
Enablement (Feature Flag)
Default: disabled (existing behavior preserved)
```yaml
search.inner_hits.batch_enabled: false  # cluster setting
search.inner_hits.batch_size: 1000      # max batch size; falls back to sequential if exceeded
```

Per-request override:
```json
{
  "query": {
    "has_child": {
      "inner_hits": { "batch": true }
    }
  }
}
```

Related component
Search:Performance
Describe alternatives you've considered
Alternative 1: Retain child doc IDs in Query Phase
Store Map<ordinal, List<childDocId>> instead of LongBitSet in GlobalOrdinalsCollector.
Rejected: Requires Lucene-level changes, significantly increases Query Phase memory, needs context passing between phases.
Alternative 2: Application-side two-stage query
1st query: has_child without inner_hits → get parent IDs
2nd query: has_parent → get children
Rejected: Requires app-side join logic, no consistency guarantee between queries, difficult to integrate scoring.
Alternative 3: Internal two-stage query (has_child → has_parent)
Engine automatically executes two queries and joins results.
Rejected: Score calculation disconnect (parent scores from 1st query, child scores from 2nd query are in different contexts), IDF mismatch. For non-scoring cases, app-side msearch achieves the same result.
Alternative 4: Concurrent execution (parallelize N searches)
Run the N searches in parallel instead of sequentially.
Rejected: Doesn't reduce total CPU/IO work; it only runs the searches concurrently. #16878 proposed this approach, but it was ultimately addressed differently (via caching in PR #16937).
Additional context
Related Work
- #15185 - Join support in OpenSearch: a large-scale RFC for cross-index JOIN. Complementary to this RFC - #15185 enables new use cases, while this RFC optimizes existing parent-child inner_hits without API changes.
- #16937: Cache optimization for inner_hits. Complementary to this RFC - cache optimization improves each search's efficiency, while batching reduces the search count.
- #14546: Request to allow plugins to customize InnerHitsPhase.
Compatibility
- API: Fully backward compatible. No user-facing API changes.
- Behavior: Results identical. Same-score document order may vary (not guaranteed by spec).
- Plugins: The default fallback ensures existing `InnerHitSubContext` implementations work unchanged.
Implementation Notes
Per CONTRIBUTING.md:

- Feature flag protection (`search.inner_hits.batch_enabled: false` by default)
- Consider the sandbox (`-Dsandbox.enabled=true`) since `InnerHitsPhase` is a core class
- `@opensearch.experimental` annotation for new APIs
- Benchmark plan: compare latency/memory with varying parent hits (10-1000), segments (1-100), and child-to-parent ratios