Bug fix: unpin blocks pinned by clones#129

Merged
udabhas merged 1 commit into opensearch-project:main from kumargu:fix_pin_leaks
Dec 19, 2025
Conversation

@kumargu kumargu commented Dec 17, 2025

Description

Lucene IndexInput slices are lightweight views over a shared file and are heavily used across codec paths. However, slices are not guaranteed to be closed promptly. In practice, many slices remain reachable for the lifetime of a query or reader, and in some cases are only closed when the owning IndexInput or segment reader is closed. Retaining pinned cache blocks in slice state until close() therefore makes pin lifetime proportional to slice fan-out rather than actual read activity.

This caused a real failure during testing of the https_hourly_aggr_filter workload on a t3.medium instance, where a single aggregation query created roughly 50,000 slices. Each slice pinned one cache block and retained it for the duration of the query, resulting in tens of thousands of simultaneously pinned segments. This exhausted the memory segment pool, led to repeated pin failures, and ultimately caused OOM and shard failures. The issue is fundamentally unbounded and can occur on any instance size given sufficient slice fan-out.
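The failure mode above can be made concrete with a small model. This is a minimal sketch, not the actual `CachedMemorySegmentIndexInput` code: the `Slice` class and `pinnedBlocks` counter are hypothetical stand-ins showing how a pin taken on first read and held until `close()` makes the live pin count track slice fan-out rather than read activity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal model of the leak: each live slice holds one pinned cache block
// until close(), so pin count scales with slice fan-out, not read activity.
public class SlicePinLeakDemo {
    static final AtomicInteger pinnedBlocks = new AtomicInteger();

    // Hypothetical stand-in for a cache-backed IndexInput slice.
    static final class Slice implements AutoCloseable {
        private boolean holdsPin = false;

        void read() {
            if (!holdsPin) {                 // pin on first read...
                pinnedBlocks.incrementAndGet();
                holdsPin = true;
            }
            // ...and keep the pin until close(), mirroring the old behavior
        }

        @Override
        public void close() {
            if (holdsPin) {
                pinnedBlocks.decrementAndGet();
                holdsPin = false;
            }
        }
    }

    public static void main(String[] args) {
        List<Slice> slices = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {   // the fan-out seen in the workload
            Slice s = new Slice();
            s.read();
            slices.add(s);                   // slices stay reachable for the whole query
        }
        System.out.println("pinned=" + pinnedBlocks.get());
        slices.forEach(Slice::close);
        System.out.println("pinned=" + pinnedBlocks.get());
    }
}
```

With 50,000 live slices this prints `pinned=50000` until the slices close, which is exactly the pool exhaustion described above.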

To fix this, we treat master IndexInputs and slices differently. Master inputs retain the existing one-block pin optimization, keeping a single block pinned across calls for fast sequential access. Slices, however, no longer retain pinned blocks in instance state. A block is pinned only for the duration of an individual read operation and is unpinned immediately once the read completes. This ensures that pin count scales with concurrent reads rather than the number of live slices, eliminating unbounded pin accumulation while preserving the performance characteristics of non-slice inputs.
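The slice-side discipline can be sketched as a pin/read/unpin cycle scoped to a single call. The helper names (`pinBlock`, `unpinBlock`, `readFromSlice`) are illustrative, not the PR's actual method names; the point is that the pin is released in a `finally` block before the read returns, so at most one pin per in-flight read is ever live.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the per-read pin discipline for slices: a block is pinned only
// across one read operation and released in finally, while a master input
// may keep its single-block pin between calls for sequential access.
public class PerReadPinDemo {
    static final AtomicInteger pins = new AtomicInteger();

    static int pinBlock(long offset)  { pins.incrementAndGet(); return 42; } // stand-in block handle
    static void unpinBlock(int block) { pins.decrementAndGet(); }

    static int readFromSlice(long offset) {
        int block = pinBlock(offset);        // pin just for this read
        try {
            return block;                    // real code would copy bytes out here
        } finally {
            unpinBlock(block);               // released before the call returns
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 50_000; i++) {
            readFromSlice(i);                // 50k reads, never more than 1 pin live
        }
        System.out.println("pinned after reads=" + pins.get());
    }
}
```

This trades a pin/unpin per read (the CAS cost the later PinScope commit addresses) for a hard bound on concurrent pins.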

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gulshan <kumargu@amazon.com>
@kumargu kumargu marked this pull request as ready for review December 17, 2025 11:22
kumargu commented Dec 17, 2025

Indexing sees no impact (results below), but queries like multi_term are seeing higher latency and I'm out of ideas there, so we will prioritize correctness over performance.

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |         Task |      Value |   Unit |
|---------------------------------------------------------------:|-------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |              |    87.2833 |    min |
|             Min cumulative indexing time across primary shards |              |          0 |    min |
|          Median cumulative indexing time across primary shards |              |    0.79315 |    min |
|             Max cumulative indexing time across primary shards |              |    12.9289 |    min |
|            Cumulative indexing throttle time of primary shards |              |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |              |          0 |    min |
| Median cumulative indexing throttle time across primary shards |              |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |              |          0 |    min |
|                        Cumulative merge time of primary shards |              |    47.3679 |    min |
|                       Cumulative merge count of primary shards |              |        219 |        |
|                Min cumulative merge time across primary shards |              |          0 |    min |
|             Median cumulative merge time across primary shards |              |   0.130533 |    min |
|                Max cumulative merge time across primary shards |              |    9.59567 |    min |
|               Cumulative merge throttle time of primary shards |              |    31.5924 |    min |
|       Min cumulative merge throttle time across primary shards |              |          0 |    min |
|    Median cumulative merge throttle time across primary shards |              |  0.0534583 |    min |
|       Max cumulative merge throttle time across primary shards |              |    6.92777 |    min |
|                      Cumulative refresh time of primary shards |              |    4.31093 |    min |
|                     Cumulative refresh count of primary shards |              |        430 |        |
|              Min cumulative refresh time across primary shards |              |          0 |    min |
|           Median cumulative refresh time across primary shards |              |  0.0332083 |    min |
|              Max cumulative refresh time across primary shards |              |   0.654267 |    min |
|                        Cumulative flush time of primary shards |              |    9.90028 |    min |
|                       Cumulative flush count of primary shards |              |        109 |        |
|                Min cumulative flush time across primary shards |              |          0 |    min |
|             Median cumulative flush time across primary shards |              | 0.00775833 |    min |
|                Max cumulative flush time across primary shards |              |    1.92945 |    min |
|                                        Total Young Gen GC time |              |     10.559 |      s |
|                                       Total Young Gen GC count |              |        769 |        |
|                                          Total Old Gen GC time |              |          0 |      s |
|                                         Total Old Gen GC count |              |          0 |        |
|                                                     Store size |              |    19.4197 |     GB |
|                                                  Translog size |              |    1.18146 |     GB |
|                                         Heap used for segments |              |          0 |     MB |
|                                       Heap used for doc values |              |          0 |     MB |
|                                            Heap used for terms |              |          0 |     MB |
|                                            Heap used for norms |              |          0 |     MB |
|                                           Heap used for points |              |          0 |     MB |
|                                    Heap used for stored fields |              |          0 |     MB |
|                                                  Segment count |              |        364 |        |
|                                                 Min Throughput | index-append |     336361 | docs/s |
|                                                Mean Throughput | index-append |     370226 | docs/s |
|                                              Median Throughput | index-append |     362883 | docs/s |
|                                                 Max Throughput | index-append |     431022 | docs/s |
|                                        50th percentile latency | index-append |    63.0692 |     ms |
|                                        90th percentile latency | index-append |    292.299 |     ms |
|                                        99th percentile latency | index-append |     599.61 |     ms |
|                                      99.9th percentile latency | index-append |    1099.86 |     ms |
|                                     99.99th percentile latency | index-append |    1241.28 |     ms |
|                                       100th percentile latency | index-append |    1363.83 |     ms |
|                                   50th percentile service time | index-append |    63.0692 |     ms |
|                                   90th percentile service time | index-append |    292.299 |     ms |
|                                   99th percentile service time | index-append |     599.61 |     ms |
|                                 99.9th percentile service time | index-append |    1099.86 |     ms |
|                                99.99th percentile service time | index-append |    1241.28 |     ms |
|                                  100th percentile service time | index-append |    1363.83 |     ms |
|                                                     error rate | index-append |          0 |      % |


---------------------------------
[INFO] SUCCESS (took 769 seconds)
---------------------------------

@udabhas udabhas merged commit 46661cd into opensearch-project:main Dec 19, 2025
6 checks passed
prudhvigodithi added a commit to prudhvigodithi/opensearch-storage-encryption that referenced this pull request Mar 17, 2026
Replace per-read pin/unpin CAS cycle with thread-local PinScope that
holds one pinned block per thread. Since PinScope owns the pin, slices
keep their currentBlock reference across reads — the fast path
(blockOffset == currentBlockOffset) hits on every same-block read with
zero overhead.

JMH results (aggregationPatternReadLong, 512 reads/block, 8 threads):
  Baseline (PR opensearch-project#129):  19.0 ops/s  (bufferpool/mmap = 4.4%)
  PinScope:            84.5 ops/s  (bufferpool/mmap = 19.3%)
  Improvement:        4.5x faster

Changes:
- New PinScope.java: ThreadLocal single-block pin holder
- CachedMemorySegmentIndexInput: slices delegate pin lifecycle to
  PinScope; releasePinnedBlockIfSlice is now a no-op
- close(): slices clear local ref only, PinScope owns the pin
- Tests: clear PinScope in setUp to prevent cross-test stale state
- New aggregationPatternReadLong JMH benchmark for validation
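The thread-local idea in the commit above can be sketched as follows. This is an illustrative model, not the actual `PinScope.java`: a per-thread holder owns at most one pinned block, so repeated reads within the same block take the fast path (`blockOffset == pinnedOffset`) and skip the pin/unpin CAS cycle entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a thread-local single-block pin holder: the scope owns the pin,
// so same-block reads hit a fast path with no CAS traffic.
public class PinScopeDemo {
    static final AtomicInteger pinOps = new AtomicInteger(); // counts pin/unpin transitions

    static final class PinScope {
        long pinnedOffset = -1;                     // -1: nothing pinned yet

        void acquire(long blockOffset) {
            if (blockOffset == pinnedOffset) {
                return;                             // fast path: block already pinned
            }
            if (pinnedOffset != -1) {
                pinOps.incrementAndGet();           // unpin the previously held block
            }
            pinOps.incrementAndGet();               // pin the new block
            pinnedOffset = blockOffset;
        }
    }

    // One scope per thread; the scope, not the slice, owns the pin.
    static final ThreadLocal<PinScope> SCOPE = ThreadLocal.withInitial(PinScope::new);

    public static void main(String[] args) {
        PinScope scope = SCOPE.get();
        for (int i = 0; i < 512; i++) {             // 512 reads/block, as in the JMH run
            scope.acquire(0L);                      // all hit the same block
        }
        System.out.println("pin ops=" + pinOps.get());
    }
}
```

512 same-block reads cost a single pin transition here, which is the source of the reported 4.5x throughput gain over the per-read pin/unpin cycle.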
prudhvigodithi added a commit to prudhvigodithi/opensearch-storage-encryption that referenced this pull request Mar 18, 2026
Signed-off-by: Prudhvi Godithi <pgodithi@amazon.com>