[BUG] madvise from stored field query path could cause mmap lock contention on kernel 5.10

## Summary

Search thread pool queue depth and search latency spike on OpenSearch 3.1 while CPU utilization stays well below 50%.

Root cause: 
In our **internal patched** Lucene ([modeled after apache/lucene#14512](https://github.com/apache/lucene/pull/14512)), search threads call `madvise(MADV_SEQUENTIAL)` via `SourceLookup` → `getMergeInstance()`, which on kernel 5.10 acquires `mmap_lock` in exclusive WRITE mode. This blocks all concurrent `mincore()` readers (Lucene 10's `prefetch()`) — a convoy stall where threads serialize behind a single lock, stalling search for 10-18 seconds.

PR #20827 fixes the immediate trigger for script queries that don't use `_source`.

This issue explains the mechanism and tracks the remaining exposure for paths that still call `getMergeInstance()` from search threads (scripts reading `_source`, fetch phase, derived fields).

## How it happens

```mermaid
flowchart TD
 A["script_score scores a document<br/>ScoreScript.setDocument()"]
 B["SourceLookup.setSegmentAndDocument() called unconditionally — even when script never reads _source"]
 C["getMergeInstance() on segment transition triggers madvise(MADV_SEQUENTIAL)"]

 A --> B --> C

 C --> W["madvise: acquires mmap_lock WRITE (2-6 search threads)"]
 C -.->|"other search threads"| R["prefetch → mincore: needs mmap_lock READ (26-46 search threads)"]

 W --> L["mmap_lock"]
 R --> L

 L --> S["Convoy stall: 10-18s<br/>Writer-preference rwsem queues all readers behind writers"]
```

**The call chain:** `ScriptScoreFunction.score()` → `ScoreScript.setDocument()` → `LeafSearchLookup.setDocument()` → `SourceLookup.setSegmentAndDocument()`. On each segment transition, `SourceLookup` eagerly calls `getSequentialStoredFieldsReader()` → `StoredFieldsReader.getMergeInstance()`. This pattern originated in [ES PR #62509](https://github.com/elastic/elasticsearch/pull/62509) as a fetch-phase optimization for sequential `_source` access.

**The madvise trigger:** `getMergeInstance()` creates a `Lucene90CompressingStoredFieldsReader` with `merging=true`, whose constructor calls `fieldsStream.updateReadAdvice(ReadAdvice.SEQUENTIAL)` → `madvise(MADV_SEQUENTIAL)`. This read advice change was added to OpenSearch's Lucene fork to fix a stored fields merge regression ([modeled after apache/lucene#14512](https://github.com/apache/lucene/pull/14512), which is still open upstream and not merged into Lucene). The assumption was that only merge threads call `getMergeInstance()`, but `SourceLookup` has been calling it from search threads since ES 7.x.

**The lock contention:** On kernel 5.10, `madvise(SEQUENTIAL)` takes `mmap_lock` in WRITE mode (`madvise_need_mmap_write()` returns 1 for the default case in `mm/madvise.c`). Linux's rwsem is writer-preferring: once a writer is waiting, all new readers must wait too. So a single `madvise` call can block dozens of search threads stuck in `mincore()`. Search threads are both the victims and the perpetrators.
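
The queueing effect can be reproduced in user space as an analogy. The sketch below is not the kernel rwsem and not OpenSearch code: it uses Java's `ReentrantReadWriteLock` in fair mode, where a new reader also queues behind a waiting writer, and the class and method names are made up for illustration. One reader holds the lock (standing in for an in-flight `mincore()`), a writer queues behind it (standing in for `madvise`), and a late reader then fails to get in even though no writer holds the lock yet.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriterConvoyDemo {

    /** Returns true if a late reader got the read lock while a writer was queued. */
    static boolean lateReaderGetsIn() {
        // Fair mode makes new readers queue behind a waiting writer. It is used
        // here only as a stand-in for the rwsem behavior described above.
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
        AtomicBoolean acquired = new AtomicBoolean(false);

        lock.readLock().lock(); // in-flight reader (think: mincore)

        Thread writer = new Thread(() -> {
            lock.writeLock().lock(); // think: madvise; waits for the reader to leave
            lock.writeLock().unlock();
        });
        Thread lateReader = new Thread(() -> {
            try {
                // The timed tryLock honors fairness; the untimed tryLock() would barge.
                if (lock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                    acquired.set(true);
                    lock.readLock().unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        try {
            writer.start();
            while (!lock.hasQueuedThread(writer)) {
                Thread.sleep(1); // wait until the writer is actually queued
            }
            lateReader.start();
            lateReader.join();        // the late reader times out behind the writer
            lock.readLock().unlock(); // first reader leaves, writer can proceed
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        System.out.println("late reader acquired: " + lateReaderGetsIn());
    }
}
```

Run as written, this prints `late reader acquired: false`: the late reader is compatible with the current lock holder but still waits because a writer is ahead of it in the queue. With dozens of search threads in the late reader's position, the same ordering produces the convoy described above.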

**Kernel note:** Starting with kernel 6.1 (VMA management rework), `madvise(SEQUENTIAL)` no longer requires the global `mmap_lock` in WRITE mode. However, the `MADV_SEQUENTIAL` flag still tells the kernel to evict pages behind reads, which hurts random-access search patterns.

**Example triggering query** — uses only `doc['created']` (doc values) and `_score`, never `_source`:

```json
{
 "script_score": {
 "query": { "match": { "title": "search terms" } },
 "script": {
 "source": "Math.max(_score, 0) * (doc['created'].size() == 0 ? 1 : Math.max(params.min, ((doc['created'].value.toInstant().toEpochMilli() - params.currentDate) / params.scale) + 1))",
 "params": { "min": 0.4, "currentDate": 1771279503795, "scale": 8.6724E9 }
 }
 }
}
```

## Mitigation

PR [#20827](https://github.com/opensearch-project/OpenSearch/pull/20827) applies a lazy init fix. It was validated under a load that reproduces the stall: the search queue spike goes away completely.

## Known limitations

The lazy init fix eliminates the trigger for the specific workload (scripts using only `doc` values). The madvise path remains reachable for:

- **Scripts that read `_source`** — will still call `getMergeInstance()` → `madvise(SEQUENTIAL)`
- **Fetch phase** (~41% of madvise observations during stalls) — also calls `SourceLookup.setSegmentAndDocument()`
- **finishMerge() never called by search threads** — `getMergeInstance()` sets `MADV_SEQUENTIAL` on the file region, but only `finishMerge()` reverts it to `MADV_RANDOM`. Search threads never call `finishMerge()`, so the SEQUENTIAL advisory persists, telling the kernel to evict pages behind reads — harmful for random-access search patterns even on newer kernels where the WRITE lock is not an issue.

In the workload that triggered this issue, the lazy init fix was sufficient because script_score queries call `setDocument()` on every scored document per segment, making them the dominant madvise source. Other paths (fetch phase) are lower frequency and did not reproduce the stall after patching.
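
To make the lazy init idea concrete, here is a minimal sketch. It is not the code from PR #20827: `LazySourceLookup`, `SequentialReader`, and `readersCreated` are hypothetical names, and the `Supplier` stands in for the expensive `getMergeInstance()` call. The point it illustrates is that a segment transition only drops the old reader, and a new one is created on the first `_source` read, if there is one.

```java
import java.util.function.Supplier;

public class LazyReaderSketch {

    /** Hypothetical stand-in for the sequential stored fields reader. */
    static final class SequentialReader {
        String source(int docId) {
            return "{\"doc\":" + docId + "}";
        }
    }

    /** Hypothetical lookup that defers reader creation until _source is read. */
    static final class LazySourceLookup {
        private final Supplier<SequentialReader> factory; // stands in for getMergeInstance()
        private SequentialReader reader;                  // null until first _source access
        private int docId = -1;

        LazySourceLookup(Supplier<SequentialReader> factory) {
            this.factory = factory;
        }

        /** Segment transition: drop the old reader, create nothing yet. */
        void setSegment() {
            reader = null;
        }

        void setDocument(int docId) {
            this.docId = docId;
        }

        /** First _source read in a segment pays for the reader; later reads reuse it. */
        String source() {
            if (reader == null) {
                reader = factory.get();
            }
            return reader.source(docId);
        }
    }

    /** Counts reader creations over a scan of several segments. */
    static int readersCreated(int segments, int docsPerSegment, boolean readsSource) {
        int[] created = {0};
        LazySourceLookup lookup = new LazySourceLookup(() -> {
            created[0]++;
            return new SequentialReader();
        });
        for (int s = 0; s < segments; s++) {
            lookup.setSegment();
            for (int d = 0; d < docsPerSegment; d++) {
                lookup.setDocument(d);
                if (readsSource) {
                    lookup.source();
                }
            }
        }
        return created[0];
    }

    public static void main(String[] args) {
        System.out.println(readersCreated(3, 5, false)); // doc-values-only script
        System.out.println(readersCreated(3, 5, true));  // script that reads _source
    }
}
```

In this model a script that never reads `_source` creates no reader at all, while one that does still creates one reader per segment. That matches the limitation listed above: laziness removes the unnecessary calls, not the calls made by paths that really need stored fields.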

<details>
<summary>Evidence: jstack and kernel stacks</summary>

### jstack: madvise writer (script_score → getMergeInstance)

```
"opensearch[...][search][T#13]" runnable
 at o.a.l.store.PosixNativeAccess.madvise(:141)
 at o.a.l.store.MemorySegmentIndexInput.updateReadAdvice(:370)
 at o.a.l.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init>(:110)
 at o.a.l.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.getMergeInstance(:716)
 at o.o.common.lucene.index.SequentialStoredFieldsLeafReader.getSequentialStoredFieldsReader(:74)
 at o.o.search.lookup.SourceLookup.setSegmentAndDocument(:143)
 at o.o.script.ScoreScript.setDocument(:162)
 at o.o.common.lucene.search.function.ScriptScoreFunction$1.score(:100)
 at ... → QueryPhase.execute
```

### jstack: mincore victim (prefetch → isLoaded0)

```
"opensearch[...][search][T#1]" runnable
 at java.nio.MappedMemoryUtils.isLoaded0(Native Method) ← mincore() syscall
 at jdk.internal.foreign.MappedMemorySegmentImpl.isLoaded(:87)
 at o.a.l.store.MemorySegmentIndexInput.prefetch(:349)
 at o.a.l.codecs.lucene101.Lucene101PostingsReader.prefetchPostings(:1394)
 at o.a.l.search.TermQuery$TermWeight$2.get(:164)
 at ... → QueryPhase.execute
```

### Kernel stacks: mmap_lock contention

```
# mincore readers blocked (up to 46 threads during stalls):
[<0>] __do_sys_mincore+0xdc/0x2f0 ← blocked at down_read(&mm->mmap_lock)
[<0>] __arm64_sys_mincore+0x20/0x60

# madvise writers blocked (2-6 threads during stalls):
[<0>] rwsem_down_write_slowpath+0x334/0x75c ← acquiring WRITE lock
[<0>] do_madvise+0xf8/0x4d4
[<0>] __arm64_sys_madvise+0x28/0x40
```

Zero `do_mmap`/`do_munmap` calls appear during stalls, so madvise is the sole WRITE lock holder. TID correlation confirms a 1:1 mapping between kernel `do_madvise` threads and Java search threads in `ScriptScoreFunction.score()`. madvise threads are present in **100%** of stall snapshots (across 28 stall windows) and absent in **100%** of non-stall snapshots.
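
For anyone who wants to repeat this kind of snapshot classification, the following is a rough sketch and not the tooling used for the numbers above. It assumes a jstack dump in which thread blocks are separated by blank lines and each block starts with a quoted thread name. It buckets threads by the two frames shown in the stacks above; `StallSnapshotCounter` and `classify` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StallSnapshotCounter {

    /** Buckets the threads of one jstack dump into madvise / mincore / other. */
    static Map<String, Integer> classify(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("madvise", 0);
        counts.put("mincore", 0);
        counts.put("other", 0);

        // Assumption: thread blocks are separated by one or more blank lines.
        for (String block : dump.split("\\R\\s*\\R")) {
            if (!block.trim().startsWith("\"")) {
                continue; // skip headers and anything that is not a thread block
            }
            String key;
            if (block.contains("PosixNativeAccess.madvise")) {
                key = "madvise"; // writer side: updateReadAdvice -> madvise
            } else if (block.contains("MappedMemoryUtils.isLoaded0")) {
                key = "mincore"; // reader side: prefetch -> isLoaded -> mincore
            } else {
                key = "other";
            }
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String dump = String.join("\n",
            "\"search-T#13\" runnable",
            " at o.a.l.store.PosixNativeAccess.madvise(:141)",
            "",
            "\"search-T#1\" runnable",
            " at java.nio.MappedMemoryUtils.isLoaded0(Native Method)",
            "",
            "\"search-T#2\" waiting",
            " at java.lang.Object.wait(Native Method)");
        System.out.println(classify(dump));
    }
}
```

A snapshot could then be flagged as a stall candidate when the `madvise` bucket is non-zero and the `mincore` bucket is large, which is the pattern reported above.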

</details>
