Skip to content

[Feature Request] Batched Inner Hits for Parent-Child Join #20563

@tkykenmt

Description

@tkykenmt

Is your feature request related to a problem? Please describe

Parent-child join inner_hits has a significant performance bottleneck. When inner_hits is enabled, the Fetch Phase executes N independent searches for N parent hits - one search per parent to retrieve its child documents.

Root cause: In the Query Phase, Lucene's JoinUtil.createJoinQuery uses global ordinals for child→parent matching. Child document IDs are aggregated into a LongBitSet, and the mapping of which child matched which parent is discarded. Therefore, InnerHitsPhase must re-query for each parent hit.

Current execution flow (InnerHitsPhase.hitExecute()):

for (Map.Entry<String, InnerHitSubContext> entry : innerHits.entrySet()) {
    TopDocsAndMaxScore topDoc = innerHitsContext.topDocs(hit);  // ← N times
    fetchPhase.execute(innerHitsContext);                        // ← N times
}

Each topDocs(hit) call:

  1. Constructs a new Query
  2. Creates a new Weight
  3. Scans ALL segments
  4. Collects results with TopDocsCollector

Cost: O(N × S) where N = parent hits, S = segments. With 100 parent hits and 10 segments, this means 1000 segment scans instead of 10.

Related issues:

  • #16878 reported the same problem (3+ second FetchPhase with inner_hits). It was addressed with cache optimization in PR #16937, but that doesn't reduce the number of searches.

Describe the solution you'd like

Batch the N searches into 1 search using TermInSetQuery.

Operation Current Proposed
Query construction N times 1 time
Weight creation N times 1 time
Segment scan N × S times S times

Implementation

  1. Add batch API to InnerHitSubContext:
// New method with default fallback to existing behavior
public Map<SearchHit, TopDocsAndMaxScore> topDocs(List<SearchHit> hits) throws IOException {
    Map<SearchHit, TopDocsAndMaxScore> results = new HashMap<>();
    for (SearchHit hit : hits) {
        results.put(hit, topDocs(hit));  // fallback
    }
    return results;
}

public boolean supportsBatchExecution() { return false; }
  1. Implement batch in JoinFieldInnerHitSubContext:
@Override
public Map<SearchHit, TopDocsAndMaxScore> topDocs(List<SearchHit> hits) throws IOException {
    // 1. Collect all parent IDs
    List<BytesRef> parentIds = hits.stream().map(h -> new BytesRef(h.getId())).toList();

    // 2. Single TermInSetQuery
    Query batchQuery = new BooleanQuery.Builder()
        .add(new TermInSetQuery(parentIdField, parentIds), FILTER)
        .add(joinFieldQuery, FILTER)
        .build();

    // 3. Single scan of all segments
    // 4. Group results by parent ID
    return groupedResults;
}

@Override
public boolean supportsBatchExecution() { return true; }
  1. Update InnerHitsPhase to use batch API when available.

Side Effects Verified

Concern Investigation Impact
Memory Batch holds all child docs temporarily ⚠️ O(N×C) increase - mitigate with batch size limit
Latency Current impl is NOT streaming (Fetch Phase completes before response) ✅ No change
Error handling Current impl aborts on first error, no partial results ✅ No change
Result order Depends on implementation ✅ Can maintain order
Scoring inner_hits scores don't affect parent scores ✅ No impact

Enablement (Feature Flag)

Default: disabled (existing behavior preserved)

search.inner_hits.batch_enabled: false  # cluster setting
search.inner_hits.batch_size: 1000      # max batch size, fallback to sequential if exceeded

Per-request override:

{
  "query": {
    "has_child": {
      "inner_hits": { "batch": true }
    }
  }
}

Related component

Search:Performance

Describe alternatives you've considered

Alternative 1: Retain child doc IDs in Query Phase

Store Map<ordinal, List<childDocId>> instead of LongBitSet in GlobalOrdinalsCollector.

Rejected: Requires Lucene-level changes, significantly increases Query Phase memory, needs context passing between phases.

Alternative 2: Application-side two-stage query

1st query: has_child without inner_hits → get parent IDs
2nd query: has_parent → get children

Rejected: Requires app-side join logic, no consistency guarantee between queries, difficult to integrate scoring.

Alternative 3: Internal two-stage query (has_child → has_parent)

Engine automatically executes two queries and joins results.

Rejected: Score calculation disconnect (parent scores from 1st query, child scores from 2nd query are in different contexts), IDF mismatch. For non-scoring cases, app-side msearch achieves the same result.

Alternative 4: Concurrent execution (parallelize N searches)

Run the N searches in parallel instead of sequentially.

Rejected: Doesn't reduce total CPU/IO - just parallelizes. #16878 proposed this but was addressed differently.

Additional context

Related Work

  • #15185 - Join support in OpenSearch: Large-scale RFC for cross-index JOIN. This RFC is complementary - [RFC] Join support in OpenSearch #15185 enables new use cases while this RFC optimizes existing parent-child inner_hits without API changes.

  • #16937: Cache optimization for inner_hits. Complementary to this RFC - cache optimization improves each search's efficiency, batching reduces search count.

  • #14546: Request to customize InnerHitsPhase from plugins.

Compatibility

  • API: Fully backward compatible. No user-facing API changes.
  • Behavior: Results identical. Same-score document order may vary (not guaranteed by spec).
  • Plugins: Default fallback ensures existing InnerHitSubContext implementations work unchanged.

Implementation Notes

Per CONTRIBUTING.md:

  • Feature flag protection (search.inner_hits.batch_enabled: false default)
  • Consider sandbox (-Dsandbox.enabled=true) since InnerHitsPhase is a core class
  • @opensearch.experimental annotation for new APIs
  • Benchmark plan: Compare latency/memory with varying parent hits (10-1000), segments (1-100), and child-to-parent ratios

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    Search:PerformancediscussIssues intended to help drive brainstorming and decision makingenhancementEnhancement or improvement to existing feature or request

    Type

    No type

    Projects

    Status

    🆕 New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions