Bug fix: unpin blocks pinned by clones #129
Merged
udabhas merged 1 commit into opensearch-project:main from Dec 19, 2025
Conversation
Signed-off-by: Gulshan <kumargu@amazon.com>
81403b4 to d7344ab
Collaborator | Author
Indexing sees no impact (as below), but queries like multi_term are seeing higher latency and I'm out of ideas, so we will focus on correctness over performance.
udabhas
approved these changes
Dec 19, 2025
prudhvigodithi added a commit to prudhvigodithi/opensearch-storage-encryption that referenced this pull request on Mar 17, 2026
Replace per-read pin/unpin CAS cycle with thread-local PinScope that holds one pinned block per thread. Since PinScope owns the pin, slices keep their currentBlock reference across reads — the fast path (blockOffset == currentBlockOffset) hits on every same-block read with zero overhead.

JMH results (aggregationPatternReadLong, 512 reads/block, 8 threads):
- Baseline (PR opensearch-project#129): 19.0 ops/s (bufferpool/mmap = 4.4%)
- PinScope: 84.5 ops/s (bufferpool/mmap = 19.3%)
- Improvement: 4.5x faster

Changes:
- New PinScope.java: ThreadLocal single-block pin holder
- CachedMemorySegmentIndexInput: slices delegate pin lifecycle to PinScope; releasePinnedBlockIfSlice is now a no-op
- close(): slices clear local ref only, PinScope owns the pin
- Tests: clear PinScope in setUp to prevent cross-test stale state
- New aggregationPatternReadLong JMH benchmark for validation
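A hypothetical sketch of the PinScope idea from the commit message above: a ThreadLocal holder that keeps at most one pinned block per thread, so repeated reads of the same block skip the per-read pin/unpin CAS cycle. All names here (PinScope, Block, acquire, clear) are illustrative and not taken from the actual change.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class PinScope {
    /** Illustrative cache block; a real pool would CAS against eviction in pin(). */
    static final class Block {
        final long offset;
        final AtomicInteger pins = new AtomicInteger();
        Block(long offset) { this.offset = offset; }
        boolean pin() { pins.incrementAndGet(); return true; }
        void unpin() { pins.decrementAndGet(); }
    }

    // At most one pinned block per thread; PinScope owns the pin.
    private static final ThreadLocal<Block> CURRENT = new ThreadLocal<>();

    /** Pin the given block for this thread, releasing any previously held pin. */
    static Block acquire(Block block) {
        Block held = CURRENT.get();
        if (held == block) return held;   // fast path: same block, no CAS
        if (held != null) held.unpin();   // swap: release the old pin first
        if (block.pin()) { CURRENT.set(block); return block; }
        CURRENT.remove();
        return null;
    }

    /** Drop this thread's pin, e.g. in test setUp to avoid cross-test stale state. */
    static void clear() {
        Block held = CURRENT.get();
        if (held != null) held.unpin();
        CURRENT.remove();
    }

    public static void main(String[] args) {
        Block a = new Block(0), b = new Block(4096);
        acquire(a);
        acquire(a);  // same block: fast path, pin count stays at 1
        System.out.println("a after repeat acquire: " + a.pins.get());
        acquire(b);  // different block: a is unpinned, b is pinned
        System.out.println("a after swap: " + a.pins.get() + ", b: " + b.pins.get());
        clear();
        System.out.println("b after clear: " + b.pins.get());
    }
}
```

The fast path matches what the commit message describes: a same-block read compares against the thread's held block and returns without touching the pin count.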
prudhvigodithi added a commit to prudhvigodithi/opensearch-storage-encryption that referenced this pull request on Mar 18, 2026
Same commit message as the Mar 17 commit above.
Signed-off-by: Prudhvi Godithi <pgodithi@amazon.com>
Description
Lucene IndexInput slices are lightweight views over a shared file and are heavily used across codec paths. However, slices are not guaranteed to be closed promptly. In practice, many slices remain reachable for the lifetime of a query or reader, and in some cases are only closed when the owning IndexInput or segment reader is closed. Retaining pinned cache blocks in slice state until close() therefore makes pin lifetime proportional to slice fan-out rather than actual read activity.
This caused a real failure during testing of the https_hourly_aggr_filter workload on a t3.medium instance, where a single aggregation query created roughly 50,000 slices. Each slice pinned one cache block and retained it for the duration of the query, resulting in tens of thousands of simultaneously pinned segments. This exhausted the memory segment pool, led to repeated pin failures, and ultimately caused OOM and shard failures. The issue is fundamentally unbounded and can occur on any instance size given sufficient slice fan-out.
To fix this, we treat master IndexInputs and slices differently. Master inputs retain the existing one-block pin optimization, keeping a single block pinned across calls for fast sequential access. Slices, however, no longer retain pinned blocks in instance state. A block is pinned only for the duration of an individual read operation and is unpinned immediately once the read completes. This ensures that pin count scales with concurrent reads rather than the number of live slices, eliminating unbounded pin accumulation while preserving the performance characteristics of non-slice inputs.
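A minimal sketch of the fix described above, with hypothetical names (the real classes are CachedMemorySegmentIndexInput and its slices): master inputs keep one block pinned across reads, while slices pin a block only for the duration of a single read and unpin it before the read returns.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SlicePinDemo {
    /** Illustrative cache block with a pin count standing in for the segment pool's CAS. */
    static final class Block {
        final AtomicInteger pins = new AtomicInteger();
        final byte[] data = new byte[4096];
        void pin() { pins.incrementAndGet(); }
        void unpin() { pins.decrementAndGet(); }
    }

    static final class Input {
        private final Block block;
        private final boolean isSlice;
        private Block current;  // retained one-block pin (master inputs only)

        Input(Block block, boolean isSlice) { this.block = block; this.isSlice = isSlice; }

        byte readByte(int pos) {
            if (isSlice) {
                block.pin();                 // pin scoped to this read only
                try { return block.data[pos]; }
                finally { block.unpin(); }   // unpinned before the call returns
            }
            // Master path: keep a single block pinned across sequential reads.
            if (current == null) { block.pin(); current = block; }
            return current.data[pos];
        }

        void close() {
            if (current != null) { current.unpin(); current = null; }
        }
    }

    public static void main(String[] args) {
        Block block = new Block();
        Input master = new Input(block, false);
        Input slice = new Input(block, true);
        master.readByte(0);
        System.out.println("pins with master open: " + block.pins.get());
        slice.readByte(0);  // slice leaves nothing pinned behind
        System.out.println("pins after slice read: " + block.pins.get());
        master.close();
        System.out.println("pins after close: " + block.pins.get());
    }
}
```

Under this scheme, 50,000 live slices contribute zero standing pins; the pin count is bounded by the number of reads in flight, not by slice fan-out.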
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
Commits are signed per the DCO using --signoff. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.