[Feature Request] Batched Inner Hits for Parent-Child Join

### Is your feature request related to a problem? Please describe

Parent-child join `inner_hits` has a significant performance bottleneck. When `inner_hits` is enabled, the Fetch Phase executes N independent searches for N parent hits - one search per parent to retrieve its child documents.

**Root cause**: In the Query Phase, Lucene's `JoinUtil.createJoinQuery` uses global ordinals for child→parent matching. Child document IDs are aggregated into a `LongBitSet`, and **the mapping of which child matched which parent is discarded**. Therefore, `InnerHitsPhase` must re-query for each parent hit.

**Current execution flow** (`InnerHitsPhase.hitExecute()`):
```java
for (Map.Entry<String, InnerHitSubContext> entry : innerHits.entrySet()) {
    TopDocsAndMaxScore topDoc = innerHitsContext.topDocs(hit);  // ← N times
    fetchPhase.execute(innerHitsContext);                        // ← N times
}
```

Each `topDocs(hit)` call:
1. Constructs a new Query
2. Creates a new Weight
3. Scans ALL segments
4. Collects results with TopDocsCollector

**Cost**: O(N × S) where N = parent hits, S = segments. With 100 parent hits and 10 segments, this means 1000 segment scans instead of 10.

**Related issues**:
- [#16878](https://github.com/opensearch-project/OpenSearch/issues/16878) reported the same problem (3+ second FetchPhase with inner_hits). It was addressed with cache optimization in [PR #16937](https://github.com/opensearch-project/OpenSearch/pull/16937), but that doesn't reduce the number of searches.

### Describe the solution you'd like

**Batch the N searches into 1 search** using `TermInSetQuery`.

| Operation | Current | Proposed |
|-----------|---------|----------|
| Query construction | N times | 1 time |
| Weight creation | N times | 1 time |
| Segment scan | N × S times | S times |

### Implementation

1. **Add batch API to `InnerHitSubContext`**:
```java
// New method with default fallback to existing behavior
public Map<SearchHit, TopDocsAndMaxScore> topDocs(List<SearchHit> hits) throws IOException {
    Map<SearchHit, TopDocsAndMaxScore> results = new HashMap<>();
    for (SearchHit hit : hits) {
        results.put(hit, topDocs(hit));  // fallback
    }
    return results;
}

public boolean supportsBatchExecution() { return false; }
```

2. **Implement batch in `JoinFieldInnerHitSubContext`**:
```java
@Override
public Map<SearchHit, TopDocsAndMaxScore> topDocs(List<SearchHit> hits) throws IOException {
    // 1. Collect all parent IDs
    List<BytesRef> parentIds = hits.stream().map(h -> new BytesRef(h.getId())).toList();

    // 2. Single TermInSetQuery
    Query batchQuery = new BooleanQuery.Builder()
        .add(new TermInSetQuery(parentIdField, parentIds), FILTER)
        .add(joinFieldQuery, FILTER)
        .build();

    // 3. Single scan of all segments
    // 4. Group results by parent ID
    return groupedResults;
}

@Override
public boolean supportsBatchExecution() { return true; }
```

3. **Update `InnerHitsPhase`** to use batch API when available.

### Side Effects Verified

| Concern | Investigation | Impact |
|---------|---------------|--------|
| Memory | Batch holds all child docs temporarily | ⚠️ O(N×C) increase - mitigate with batch size limit |
| Latency | Current impl is NOT streaming (Fetch Phase completes before response) | ✅ No change |
| Error handling | Current impl aborts on first error, no partial results | ✅ No change |
| Result order | Depends on implementation | ✅ Can maintain order |
| Scoring | inner_hits scores don't affect parent scores | ✅ No impact |

### Enablement (Feature Flag)

**Default: disabled** (existing behavior preserved)

```yaml
search.inner_hits.batch_enabled: false  # cluster setting
search.inner_hits.batch_size: 1000      # max batch size, fallback to sequential if exceeded
```

Per-request override:
```json
{
  "query": {
    "has_child": {
      "inner_hits": { "batch": true }
    }
  }
}
```

### Related component

Search:Performance

### Describe alternatives you've considered

### Alternative 1: Retain child doc IDs in Query Phase
Store `Map<ordinal, List<childDocId>>` instead of `LongBitSet` in `GlobalOrdinalsCollector`.

**Rejected**: Requires Lucene-level changes, significantly increases Query Phase memory, needs context passing between phases.

### Alternative 2: Application-side two-stage query
1st query: `has_child` without inner_hits → get parent IDs
2nd query: `has_parent` → get children

**Rejected**: Requires app-side join logic, no consistency guarantee between queries, difficult to integrate scoring.

### Alternative 3: Internal two-stage query (has_child → has_parent)
Engine automatically executes two queries and joins results.

**Rejected**: Score calculation disconnect (parent scores from 1st query, child scores from 2nd query are in different contexts), IDF mismatch. For non-scoring cases, app-side `msearch` achieves the same result.

### Alternative 4: Concurrent execution (parallelize N searches)
Run the N searches in parallel instead of sequentially.

**Rejected**: Doesn't reduce total CPU/IO - just parallelizes. [#16878](https://github.com/opensearch-project/OpenSearch/issues/16878) proposed this but was addressed differently.

### Additional context

### Related Work

- **[#15185](https://github.com/opensearch-project/OpenSearch/issues/15185) - Join support in OpenSearch**: Large-scale RFC for cross-index JOIN. This RFC is complementary - #15185 enables new use cases while this RFC optimizes existing parent-child inner_hits without API changes.

- **[#16937](https://github.com/opensearch-project/OpenSearch/pull/16937)**: Cache optimization for inner_hits. Complementary to this RFC - cache optimization improves each search's efficiency, batching reduces search count.

- **[#14546](https://github.com/opensearch-project/OpenSearch/issues/14546)**: Request to customize InnerHitsPhase from plugins.

### Compatibility

- **API**: Fully backward compatible. No user-facing API changes.
- **Behavior**: Results identical. Same-score document order may vary (not guaranteed by spec).
- **Plugins**: Default fallback ensures existing `InnerHitSubContext` implementations work unchanged.

### Implementation Notes

Per [CONTRIBUTING.md](https://github.com/opensearch-project/OpenSearch/blob/main/CONTRIBUTING.md#tips-for-success):
- Feature flag protection (`search.inner_hits.batch_enabled: false` default)
- Consider sandbox (`-Dsandbox.enabled=true`) since `InnerHitsPhase` is a core class
- `@opensearch.experimental` annotation for new APIs
- Benchmark plan: Compare latency/memory with varying parent hits (10-1000), segments (1-100), and child-to-parent ratios

### References

- [InnerHitsPhase.java](https://github.com/opensearch-project/OpenSearch/blob/main/server/src/main/java/org/opensearch/search/fetch/subphase/InnerHitsPhase.java)
- [ParentChildInnerHitContextBuilder.java](https://github.com/opensearch-project/OpenSearch/blob/main/modules/parent-join/src/main/java/org/opensearch/join/query/ParentChildInnerHitContextBuilder.java)
- [Lucene GlobalOrdinalsCollector.java](https://github.com/apache/lucene/blob/main/lucene/join/src/java/org/apache/lucene/search/join/GlobalOrdinalsCollector.java)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature Request] Batched Inner Hits for Parent-Child Join #20563

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Implementation

Side Effects Verified

Enablement (Feature Flag)

Related component

Describe alternatives you've considered

Alternative 1: Retain child doc IDs in Query Phase

Alternative 2: Application-side two-stage query

Alternative 3: Internal two-stage query (has_child → has_parent)

Alternative 4: Concurrent execution (parallelize N searches)

Additional context

Related Work

Compatibility

Implementation Notes

References

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Operation	Current	Proposed
Query construction	N times	1 time
Weight creation	N times	1 time
Segment scan	N × S times	S times

Concern	Investigation	Impact
Memory	Batch holds all child docs temporarily	⚠️ O(N×C) increase - mitigate with batch size limit
Latency	Current impl is NOT streaming (Fetch Phase completes before response)	✅ No change
Error handling	Current impl aborts on first error, no partial results	✅ No change
Result order	Depends on implementation	✅ Can maintain order
Scoring	inner_hits scores don't affect parent scores	✅ No impact

[Feature Request] Batched Inner Hits for Parent-Child Join #20563

Description

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Implementation

Side Effects Verified

Enablement (Feature Flag)

Related component

Describe alternatives you've considered

Alternative 1: Retain child doc IDs in Query Phase

Alternative 2: Application-side two-stage query

Alternative 3: Internal two-stage query (has_child → has_parent)

Alternative 4: Concurrent execution (parallelize N searches)

Additional context

Related Work

Compatibility

Implementation Notes

References

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions