Summary
Search thread pool queue and latency spikes on OpenSearch 3.1 while CPU is well below 50%.
Root cause:
In our internal patched Lucene (modeled after apache/lucene#14512), search threads call madvise(MADV_SEQUENTIAL) via SourceLookup → getMergeInstance(), which on kernel 5.10 acquires mmap_lock in exclusive WRITE mode. This blocks all concurrent mincore() readers (Lucene 10's prefetch()) — a convoy stall where threads serialize behind a single lock, stalling search for 10-18 seconds.
PR #20827 fixes the immediate trigger for script queries that don't use _source.
This issue explains the mechanism and tracks the remaining exposure for paths that still call getMergeInstance() from search threads (scripts reading _source, fetch phase, derived field).
How it happens
flowchart TD
A["script_score scores a document<br/>ScoreScript.setDocument()"]
B["SourceLookup.setSegmentAndDocument()<br/>called unconditionally — even when<br/>script never reads _source"]
C["getMergeInstance() on segment transition<br/>triggers madvise(MADV_SEQUENTIAL)"]
A --> B --> C
C --> W["madvise: acquires mmap_lock <b>WRITE</b><br/>(2-6 search threads)"]
C -.->|"other search threads"| R["prefetch → mincore: needs mmap_lock <b>READ</b><br/>(26-46 search threads)"]
W --> L["mmap_lock"]
R --> L
L --> S["Convoy stall: 10-18s<br/>Writer-preference rwsem queues<br/>all readers behind writers"]
The call chain: ScriptScoreFunction.score() → ScoreScript.setDocument() → LeafSearchLookup.setDocument() → SourceLookup.setSegmentAndDocument(). On each segment transition, SourceLookup eagerly calls getSequentialStoredFieldsReader() → StoredFieldsReader.getMergeInstance(). This pattern originated in ES PR #62509 as a fetch-phase optimization for sequential _source access.
The madvise trigger: getMergeInstance() creates a Lucene90CompressingStoredFieldsReader with merging=true, whose constructor calls fieldsStream.updateReadAdvice(ReadAdvice.SEQUENTIAL) → madvise(MADV_SEQUENTIAL). This read advice change was added to OpenSearch's Lucene fork to fix a stored fields merge regression (modeled after apache/lucene#14512, which is still open upstream and not merged into Lucene). The assumption was only merge threads call getMergeInstance() — but SourceLookup calls it from search threads since ES 7.x.
The lock contention: On kernel 5.10, madvise(SEQUENTIAL) takes mmap_lock in WRITE mode (madvise_need_mmap_write() returns 1 for the default case in mm/madvise.c). Linux's rwsem is writer-preferring: once a writer is waiting, all new readers must wait too. So a single madvise call can block dozens of search threads stuck in mincore(). Search threads are both the victims and the perpetrators.
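The writer-preference effect can be illustrated in user space. The following is only an analogy, a minimal sketch assuming that a fair java.util.concurrent ReentrantReadWriteLock behaves closely enough to the kernel rwsem for this purpose: one reader holds the lock, one writer queues behind it, and a late reader is then refused even though no writer holds the lock. The class and method names are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Analogy only: a fair ReentrantReadWriteLock stands in for the kernel's mmap_lock.
final class MmapLockConvoyDemo {

    // Returns true if a late reader obtained the read lock while one reader
    // held it and one writer was already queued.
    static boolean lateReaderGetsLock() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true); // fair mode
        AtomicBoolean lateReaderAcquired = new AtomicBoolean(false);

        lock.readLock().lock(); // an in-flight "mincore" reader
        Thread writer = new Thread(() -> {
            lock.writeLock().lock(); // the "madvise" writer waits for the reader above
            lock.writeLock().unlock();
        }, "madvise-writer");
        try {
            writer.start();
            while (!lock.hasQueuedThread(writer)) {
                Thread.onSpinWait(); // wait until the writer is really queued
            }
            Thread lateReader = new Thread(() -> {
                try {
                    // The timed tryLock honors fairness; the untimed tryLock() would barge.
                    if (lock.readLock().tryLock(50, TimeUnit.MILLISECONDS)) {
                        lateReaderAcquired.set(true);
                        lock.readLock().unlock();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "mincore-reader");
            lateReader.start();
            lateReader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            lock.readLock().unlock(); // release so the queued writer can finish
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lateReaderAcquired.get();
    }

    public static void main(String[] args) {
        System.out.println("late reader acquired read lock: " + lateReaderGetsLock());
    }
}
```

In this sketch the late reader is expected to time out, because the queued writer sits ahead of it. With dozens of readers arriving per second, the same ordering is what turns a few madvise calls into a convoy.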
Kernel note: Starting with kernel 6.1 (VMA management rework), madvise(SEQUENTIAL) no longer requires the global mmap_lock in WRITE mode. However, the MADV_SEQUENTIAL flag still tells the kernel to evict pages behind reads, which hurts random-access search patterns.
Example triggering query — uses only doc['created'] (doc values) and _score, never _source:
{
"script_score": {
"query": { "match": { "title": "search terms" } },
"script": {
"source": "Math.max(_score, 0) * (doc['created'].size() == 0 ? 1 : Math.max(params.min, ((doc['created'].value.toInstant().toEpochMilli() - params.currentDate) / params.scale) + 1))",
"params": { "min": 0.4, "currentDate": 1771279503795, "scale": 8.6724E9 }
}
}
}
Mitigation
PR #20827
Validated under load that reproduces the stall — search queue spike goes away completely.
Known limitations
The lazy init fix eliminates the trigger for the specific workload (scripts using only doc values). The madvise path remains reachable for:
- Scripts that read _source: these will still call getMergeInstance() → madvise(SEQUENTIAL).
- Fetch phase (~41% of madvise observations during stalls): also calls SourceLookup.setSegmentAndDocument().
- finishMerge() never called by search threads: getMergeInstance() sets MADV_SEQUENTIAL on the file region, but only finishMerge() reverts it to MADV_RANDOM. Search threads never call finishMerge(), so the SEQUENTIAL advisory persists, telling the kernel to evict pages behind reads. This is harmful for random-access search patterns even on newer kernels where the WRITE lock is not an issue.
In the workload that triggered this issue, the lazy init fix was sufficient because script_score queries call setDocument() on every scored document per segment, making them the dominant madvise source. Other paths (fetch phase) are lower frequency and did not reproduce the stall after patching.
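The lazy-initialization idea can be sketched as follows. This is not the code from PR #20827, only a minimal illustration under stated assumptions: LazySourceLookupSketch, readerFactory and readerCreations are hypothetical names, and the supplier stands in for the expensive getMergeInstance() path.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a per-segment source lookup; names are illustrative,
// not the OpenSearch SourceLookup API.
final class LazySourceLookupSketch {
    private final Supplier<String> readerFactory; // stands in for the getMergeInstance() path
    private String reader;            // null until _source is actually requested
    private int currentSegment = -1;
    private int currentDoc = -1;
    private int readerCreations = 0;  // how many times the expensive path ran

    LazySourceLookupSketch(Supplier<String> readerFactory) {
        this.readerFactory = readerFactory;
    }

    // Called for every scored document: only records the position.
    void setSegmentAndDocument(int segment, int doc) {
        if (segment != currentSegment) {
            currentSegment = segment;
            reader = null; // drop the old segment's reader, do not build a new one yet
        }
        currentDoc = doc;
    }

    // Called only by scripts that read _source: builds the reader on first use.
    String source() {
        if (reader == null) {
            reader = readerFactory.get();
            readerCreations++;
        }
        return reader + "#" + currentSegment + ":" + currentDoc;
    }

    int readerCreations() {
        return readerCreations;
    }

    public static void main(String[] args) {
        LazySourceLookupSketch lookup = new LazySourceLookupSketch(() -> "sequential-reader");
        for (int seg = 0; seg < 3; seg++) {
            for (int doc = 0; doc < 100; doc++) {
                lookup.setSegmentAndDocument(seg, doc); // doc-values-only script: no reader built
            }
        }
        System.out.println("creations without _source access: " + lookup.readerCreations());
        lookup.source();
        System.out.println("creations after one _source access: " + lookup.readerCreations());
    }
}
```

The design choice is that segment transitions only invalidate state. The expensive reader, and with it the madvise call, is paid for only by callers that actually read _source, which is why the remaining exposure is limited to those paths.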
Evidence: jstack and kernel stacks
jstack: madvise writer (script_score → getMergeInstance)
"opensearch[...][search][T#13]" runnable
at o.a.l.store.PosixNativeAccess.madvise(:141)
at o.a.l.store.MemorySegmentIndexInput.updateReadAdvice(:370)
at o.a.l.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.<init>(:110)
at o.a.l.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsReader.getMergeInstance(:716)
at o.o.common.lucene.index.SequentialStoredFieldsLeafReader.getSequentialStoredFieldsReader(:74)
at o.o.search.lookup.SourceLookup.setSegmentAndDocument(:143)
at o.o.script.ScoreScript.setDocument(:162)
at o.o.common.lucene.search.function.ScriptScoreFunction$1.score(:100)
at ... → QueryPhase.execute
jstack: mincore victim (prefetch → isLoaded0)
"opensearch[...][search][T#1]" runnable
at java.nio.MappedMemoryUtils.isLoaded0(Native Method) ← mincore() syscall
at jdk.internal.foreign.MappedMemorySegmentImpl.isLoaded(:87)
at o.a.l.store.MemorySegmentIndexInput.prefetch(:349)
at o.a.l.codecs.lucene101.Lucene101PostingsReader.prefetchPostings(:1394)
at o.a.l.search.TermQuery$TermWeight$2.get(:164)
at ... → QueryPhase.execute
Kernel stacks: mmap_lock contention
# mincore readers blocked (up to 46 threads during stalls):
[<0>] __do_sys_mincore+0xdc/0x2f0 ← blocked at down_read(&mm->mmap_lock)
[<0>] __arm64_sys_mincore+0x20/0x60
# madvise writers blocked (2-6 threads during stalls):
[<0>] rwsem_down_write_slowpath+0x334/0x75c ← acquiring WRITE lock
[<0>] do_madvise+0xf8/0x4d4
[<0>] __arm64_sys_madvise+0x28/0x40
No do_mmap/do_munmap frames appear during stalls, so madvise is the sole WRITE lock holder. TID correlation confirms a 1:1 mapping between kernel do_madvise threads and Java search threads in ScriptScoreFunction.score(). madvise threads are present in 100% of stall snapshots (across 28 stall windows) and absent in 100% of non-stall snapshots.
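Counting writers and readers per snapshot can be automated. The following is a minimal sketch, assuming the kernel stacks are available as text lines in the format shown above; KernelStackClassifier and its Role enum are hypothetical helpers, and the matched symbol names are taken from the stacks in this issue.

```java
import java.util.List;

// Hypothetical helper: classifies one thread's kernel stack by the frames it contains.
final class KernelStackClassifier {
    enum Role { MADVISE_WRITER, MINCORE_READER, OTHER }

    static Role classify(List<String> frames) {
        boolean madvise = false;
        boolean mincore = false;
        for (String frame : frames) {
            if (frame.contains("do_madvise")) {
                madvise = true; // thread is inside the madvise syscall
            }
            if (frame.contains("__do_sys_mincore")) {
                mincore = true; // thread is inside the mincore syscall
            }
        }
        if (madvise) {
            return Role.MADVISE_WRITER;
        }
        return mincore ? Role.MINCORE_READER : Role.OTHER;
    }

    public static void main(String[] args) {
        List<String> writer = List.of(
            "[<0>] rwsem_down_write_slowpath+0x334/0x75c",
            "[<0>] do_madvise+0xf8/0x4d4",
            "[<0>] __arm64_sys_madvise+0x28/0x40");
        List<String> reader = List.of(
            "[<0>] __do_sys_mincore+0xdc/0x2f0",
            "[<0>] __arm64_sys_mincore+0x20/0x60");
        System.out.println(classify(writer));
        System.out.println(classify(reader));
    }
}
```

Tallying the roles per snapshot, keyed by thread ID, gives the writer and reader counts that can then be matched against the jstack output.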