[BUG] madvise from stored field query path could cause mmap lock contention on kernel 5.10 #20933

@bowenlan-amzn

Summary

Search thread pool queue and latency spikes on OpenSearch 3.1 while CPU is well below 50%.

Root cause:
In our internal patched Lucene (modeled after apache/lucene#14512), search threads call madvise(MADV_SEQUENTIAL) via SourceLookup → getMergeInstance(), which on kernel 5.10 acquires mmap_lock in exclusive WRITE mode. This blocks all concurrent mincore() readers (Lucene 10's prefetch()), producing a convoy stall in which threads serialize behind a single lock and search stalls for 10-18 seconds.

PR #20827 fixes the immediate trigger for script queries that don't use _source.

This issue explains the mechanism and tracks the remaining exposure for paths that still call getMergeInstance() from search threads (scripts reading _source, fetch phase, derived field).

How it happens

flowchart TD
    A["script_score scores a document<br/>ScoreScript.setDocument()"]
    B["SourceLookup.setSegmentAndDocument()<br/>called unconditionally — even when<br/>script never reads _source"]
    C["getMergeInstance() on segment transition<br/>triggers madvise(MADV_SEQUENTIAL)"]

    A --> B --> C

    C --> W["madvise: acquires mmap_lock <b>WRITE</b><br/>(2-6 search threads)"]
    C -.->|"other search threads"| R["prefetch → mincore: needs mmap_lock <b>READ</b><br/>(26-46 search threads)"]

    W --> L["mmap_lock"]
    R --> L

    L --> S["Convoy stall: 10-18s<br/>Writer-preference rwsem queues<br/>all readers behind writers"]

The call chain: ScriptScoreFunction.score() → ScoreScript.setDocument() → LeafSearchLookup.setDocument() → SourceLookup.setSegmentAndDocument(). On each segment transition, SourceLookup eagerly calls getSequentialStoredFieldsReader() → StoredFieldsReader.getMergeInstance(). This pattern originated in ES PR #62509 as a fetch-phase optimization for sequential _source access.

The madvise trigger: getMergeInstance() creates a Lucene90CompressingStoredFieldsReader with merging=true, whose constructor calls fieldsStream.updateReadAdvice(ReadAdvice.SEQUENTIAL) → madvise(MADV_SEQUENTIAL). This read advice change was added to OpenSearch's Lucene fork to fix a stored fields merge regression (modeled after apache/lucene#14512, which is still open upstream and not merged into Lucene). The assumption was that only merge threads call getMergeInstance(), but SourceLookup has called it from search threads since ES 7.x.

The lock contention: On kernel 5.10, madvise(SEQUENTIAL) takes mmap_lock in WRITE mode (madvise_need_mmap_write() returns 1 for the default case in mm/madvise.c). Linux's rwsem is writer-preferring: once a writer is waiting, all new readers must wait too. So a single madvise call can block dozens of search threads stuck in mincore(). Search threads are both the victims and the perpetrators.
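The queueing behavior described above can be illustrated in user space. The sketch below is only an analogy: it uses a fair `java.util.concurrent.locks.ReentrantReadWriteLock` as a stand-in for mmap_lock, not the kernel rwsem itself, and the class and method names are made up for illustration. An early reader holds the lock, a writer queues behind it, and a late reader then finds that it cannot share the read lock even though only a reader currently holds it.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Analogy only: a fair JDK read-write lock standing in for mmap_lock.
class RwLockConvoySketch {

    /** Returns true if a late reader got the read lock while a writer was queued. */
    static boolean lateReaderAcquired() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair mode
        AtomicBoolean acquired = new AtomicBoolean(false);

        lock.readLock().lock(); // early reader, e.g. a mincore() already in progress
        try {
            Thread writer = new Thread(() -> {
                lock.writeLock().lock();   // e.g. madvise() needing exclusive access
                lock.writeLock().unlock();
            });
            writer.start();
            while (!lock.hasQueuedThread(writer)) {
                Thread.sleep(1); // wait until the writer is parked in the queue
            }

            Thread lateReader = new Thread(() -> {
                try {
                    // Timed tryLock honors fairness, so it queues behind the writer.
                    if (lock.readLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                        acquired.set(true);
                        lock.readLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            lateReader.start();
            lateReader.join();

            lock.readLock().unlock(); // early reader leaves; the writer can now proceed
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        System.out.println("late reader acquired: " + lateReaderAcquired());
    }
}
```

In this model the late reader times out because it is ordered behind the waiting writer. With dozens of readers arriving per writer, that ordering is what turns one exclusive request into a convoy.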

Kernel note: Starting with kernel 6.1 (VMA management rework), madvise(SEQUENTIAL) no longer requires the global mmap_lock in WRITE mode. However, the MADV_SEQUENTIAL flag still tells the kernel to evict pages behind reads, which hurts random-access search patterns.

Example triggering query — uses only doc['created'] (doc values) and _score, never _source:

{
  "script_score": {
    "query": { "match": { "title": "search terms" } },
    "script": {
      "source": "Math.max(_score, 0) * (doc['created'].size() == 0 ? 1 : Math.max(params.min, ((doc['created'].value.toInstant().toEpochMilli() - params.currentDate) / params.scale) + 1))",
      "params": { "min": 0.4, "currentDate": 1771279503795, "scale": 8.6724E9 }
    }
  }
}

Mitigation

PR #20827: validated under a load that reproduces the stall; the search queue spike goes away completely.

Known limitations

The lazy init fix eliminates the trigger for the specific workload (scripts using only doc values). The madvise path remains reachable for:

  • Scripts that read _source: these still call getMergeInstance() → madvise(SEQUENTIAL)
  • Fetch phase (~41% of madvise observations during stalls): also calls SourceLookup.setSegmentAndDocument()
  • finishMerge() never called by search threads: getMergeInstance() sets MADV_SEQUENTIAL on the file region, but only finishMerge() reverts it to MADV_RANDOM. Search threads never call finishMerge(), so the SEQUENTIAL advice persists and keeps telling the kernel to evict pages behind reads. That is harmful for random-access search patterns even on newer kernels where the WRITE lock is not an issue.

In the workload that triggered this issue, the lazy init fix was sufficient because script_score queries call setDocument() on every scored document per segment, making them the dominant madvise source. Other paths (fetch phase) are lower frequency and did not reproduce the stall after patching.
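The lazy init idea can be sketched as follows. This is a minimal illustration, not the code from PR #20827: `LazySourceLookupSketch`, `sourceReader()` and `factoryCalls()` are hypothetical names, and the `Supplier` stands in for the expensive getMergeInstance() path. The point is that a segment transition only invalidates the cached reader, and the reader is built on first _source access.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a per-segment source lookup; not OpenSearch's SourceLookup.
class LazySourceLookupSketch<R> {
    private final Supplier<R> readerFactory; // stands in for the getMergeInstance() path
    private R reader;                         // built on first source access per segment
    private int segment = -1;
    private int docId = -1;

    LazySourceLookupSketch(Supplier<R> readerFactory) {
        this.readerFactory = readerFactory;
    }

    void setSegmentAndDocument(int segment, int docId) {
        if (segment != this.segment) {
            this.segment = segment;
            this.reader = null; // forget the old reader, but do not build a new one yet
        }
        this.docId = docId;
    }

    R sourceReader() {
        if (reader == null) {
            reader = readerFactory.get(); // the expensive call happens only here
        }
        return reader;
    }

    /** Scores 2 segments x 3 docs and returns how often the factory ran. */
    static int factoryCalls(boolean scriptReadsSource) {
        AtomicInteger calls = new AtomicInteger();
        LazySourceLookupSketch<Object> lookup = new LazySourceLookupSketch<>(() -> {
            calls.incrementAndGet();
            return new Object();
        });
        for (int seg = 0; seg < 2; seg++) {
            for (int doc = 0; doc < 3; doc++) {
                lookup.setSegmentAndDocument(seg, doc);
                if (scriptReadsSource) {
                    lookup.sourceReader();
                }
            }
        }
        return calls.get();
    }

    public static void main(String[] args) {
        System.out.println("doc-values-only script: " + factoryCalls(false));
        System.out.println("script reading _source: " + factoryCalls(true));
    }
}
```

A script that never touches _source triggers zero factory calls in this model, while one that does still pays once per segment, which matches the remaining exposure listed above.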

Evidence: jstack and kernel stacks

jstack: madvise writer (script_score → getMergeInstance)

"opensearch[...][search][T#13]" runnable
  at o.a.l.store.PosixNativeAccess.madvise(:141)
  at o.a.l.store.MemorySegmentIndexInput.updateReadAdvice(:370)
  at o.a.l.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init>(:110)
  at o.a.l.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.getMergeInstance(:716)
  at o.o.common.lucene.index.SequentialStoredFieldsLeafReader.getSequentialStoredFieldsReader(:74)
  at o.o.search.lookup.SourceLookup.setSegmentAndDocument(:143)
  at o.o.script.ScoreScript.setDocument(:162)
  at o.o.common.lucene.search.function.ScriptScoreFunction$1.score(:100)
  at ... → QueryPhase.execute

jstack: mincore victim (prefetch → isLoaded0)

"opensearch[...][search][T#1]" runnable
  at java.nio.MappedMemoryUtils.isLoaded0(Native Method)       ← mincore() syscall
  at jdk.internal.foreign.MappedMemorySegmentImpl.isLoaded(:87)
  at o.a.l.store.MemorySegmentIndexInput.prefetch(:349)
  at o.a.l.codecs.lucene101.Lucene101PostingsReader.prefetchPostings(:1394)
  at o.a.l.search.TermQuery$TermWeight$2.get(:164)
  at ... → QueryPhase.execute

Kernel stacks: mmap_lock contention

# mincore readers blocked (up to 46 threads during stalls):
[<0>] __do_sys_mincore+0xdc/0x2f0      ← blocked at down_read(&mm->mmap_lock)
[<0>] __arm64_sys_mincore+0x20/0x60

# madvise writers blocked (2-6 threads during stalls):
[<0>] rwsem_down_write_slowpath+0x334/0x75c   ← acquiring WRITE lock
[<0>] do_madvise+0xf8/0x4d4
[<0>] __arm64_sys_madvise+0x28/0x40

Zero do_mmap/do_munmap during stalls — madvise is the sole WRITE lock holder. TID correlation confirms 1:1 mapping between kernel do_madvise threads and Java search threads in ScriptScoreFunction.score(). madvise threads present in 100% of stall snapshots (across 28 stall windows), absent in 100% of non-stall snapshots.
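For anyone repeating this analysis, a small helper can tally the two thread populations in a jstack snapshot. The sketch below is an assumption about how such a tally could be done, not the tooling used for this issue: `StallSnapshotTally` is a hypothetical name, it assumes thread entries are separated by blank lines, and it matches on the two frame names shown in the stacks above.

```java
// Hypothetical helper: counts madvise "writers" and mincore "readers" in one jstack dump.
class StallSnapshotTally {
    static final String WRITER_FRAME = "PosixNativeAccess.madvise";
    static final String READER_FRAME = "MappedMemoryUtils.isLoaded0";

    /** Returns {writers, readers}: threads whose stack contains the respective frame. */
    static int[] tally(String dump) {
        int writers = 0;
        int readers = 0;
        // Assumes each thread's entry is separated from the next by a blank line.
        for (String block : dump.split("\\R\\s*\\R")) {
            if (block.contains(WRITER_FRAME)) {
                writers++;
            } else if (block.contains(READER_FRAME)) {
                readers++;
            }
        }
        return new int[] {writers, readers};
    }

    /** Made-up sample: one madvise thread, two prefetch threads, one unrelated thread. */
    static String sampleDump() {
        return String.join("\n",
            "\"search][T#13]\" runnable",
            "  at org.apache.lucene.store.PosixNativeAccess.madvise(PosixNativeAccess.java)",
            "",
            "\"search][T#1]\" runnable",
            "  at java.nio.MappedMemoryUtils.isLoaded0(Native Method)",
            "",
            "\"search][T#2]\" runnable",
            "  at java.nio.MappedMemoryUtils.isLoaded0(Native Method)",
            "",
            "\"management][T#1]\" waiting",
            "  at java.lang.Thread.sleep(Native Method)");
    }

    public static void main(String[] args) {
        int[] counts = tally(sampleDump());
        System.out.println("madvise writers=" + counts[0] + ", mincore readers=" + counts[1]);
    }
}
```

Running the same tally over stall and non-stall snapshots gives a quick check of whether madvise threads appear only during stalls.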
