Bug fix: unpin blocks pinned by clones#129

Merged
udabhas merged 1 commit into opensearch-project:main from kumargu:fix_pin_leaks
Dec 19, 2025
Conversation

@kumargu kumargu commented Dec 17, 2025

Description

Lucene IndexInput slices are lightweight views over a shared file and are heavily used across codec paths. However, slices are not guaranteed to be closed promptly. In practice, many slices remain reachable for the lifetime of a query or reader, and in some cases are only closed when the owning IndexInput or segment reader is closed. Retaining pinned cache blocks in slice state until close() therefore makes pin lifetime proportional to slice fan-out rather than actual read activity.

This caused a real failure during testing of the https_hourly_aggr_filter workload on a t3.medium instance, where a single aggregation query created roughly 50,000 slices. Each slice pinned one cache block and retained it for the duration of the query, resulting in tens of thousands of simultaneously pinned segments. This exhausted the memory segment pool, led to repeated pin failures, and ultimately caused OOM and shard failures. The issue is fundamentally unbounded and can occur on any instance size given sufficient slice fan-out.
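The failure mode above can be made concrete with a small model. This is a minimal sketch, not the actual `CachedMemorySegmentIndexInput` code: the `Slice` class and `pinnedBlocks` counter are hypothetical stand-ins showing how a pin taken on first read and held until `close()` makes the live pin count track slice fan-out rather than read activity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal model of the leak: each live slice holds one pinned cache block
// until close(), so pin count scales with slice fan-out, not read activity.
public class SlicePinLeakDemo {
    static final AtomicInteger pinnedBlocks = new AtomicInteger();

    // Hypothetical stand-in for a cache-backed IndexInput slice.
    static final class Slice implements AutoCloseable {
        private boolean holdsPin = false;

        void read() {
            if (!holdsPin) {                 // pin on first read...
                pinnedBlocks.incrementAndGet();
                holdsPin = true;
            }
            // ...and keep the pin until close(), mirroring the old behavior
        }

        @Override
        public void close() {
            if (holdsPin) {
                pinnedBlocks.decrementAndGet();
                holdsPin = false;
            }
        }
    }

    public static void main(String[] args) {
        List<Slice> slices = new ArrayList<>();
        for (int i = 0; i < 50_000; i++) {   // the fan-out seen in the workload
            Slice s = new Slice();
            s.read();
            slices.add(s);                   // slices stay reachable for the whole query
        }
        System.out.println("pinned=" + pinnedBlocks.get());
        slices.forEach(Slice::close);
        System.out.println("pinned=" + pinnedBlocks.get());
    }
}
```

With 50,000 live slices this prints `pinned=50000` until the slices close, which is exactly the pool exhaustion described above.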

To fix this, we treat master IndexInputs and slices differently. Master inputs retain the existing one-block pin optimization, keeping a single block pinned across calls for fast sequential access. Slices, however, no longer retain pinned blocks in instance state. A block is pinned only for the duration of an individual read operation and is unpinned immediately once the read completes. This ensures that pin count scales with concurrent reads rather than the number of live slices, eliminating unbounded pin accumulation while preserving the performance characteristics of non-slice inputs.
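The slice-side discipline can be sketched as a pin/read/unpin cycle scoped to a single call. The helper names (`pinBlock`, `unpinBlock`, `readFromSlice`) are illustrative, not the PR's actual method names; the point is that the pin is released in a `finally` block before the read returns, so at most one pin per in-flight read is ever live.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the per-read pin discipline for slices: a block is pinned only
// across one read operation and released in finally, while a master input
// may keep its single-block pin between calls for sequential access.
public class PerReadPinDemo {
    static final AtomicInteger pins = new AtomicInteger();

    static int pinBlock(long offset)  { pins.incrementAndGet(); return 42; } // stand-in block handle
    static void unpinBlock(int block) { pins.decrementAndGet(); }

    static int readFromSlice(long offset) {
        int block = pinBlock(offset);        // pin just for this read
        try {
            return block;                    // real code would copy bytes out here
        } finally {
            unpinBlock(block);               // released before the call returns
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 50_000; i++) {
            readFromSlice(i);                // 50k reads, never more than 1 pin live
        }
        System.out.println("pinned after reads=" + pins.get());
    }
}
```

This trades a pin/unpin per read (the CAS cost the later PinScope commit addresses) for a hard bound on concurrent pins.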

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Gulshan <kumargu@amazon.com>
@kumargu kumargu marked this pull request as ready for review December 17, 2025 11:22
kumargu commented Dec 17, 2025

Indexing sees no impact (results below), but queries like multi_term are seeing higher latency and I'm out of ideas there, so we will prioritize correctness over performance.

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------

|                                                         Metric |         Task |      Value |   Unit |
|---------------------------------------------------------------:|-------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |              |    87.2833 |    min |
|             Min cumulative indexing time across primary shards |              |          0 |    min |
|          Median cumulative indexing time across primary shards |              |    0.79315 |    min |
|             Max cumulative indexing time across primary shards |              |    12.9289 |    min |
|            Cumulative indexing throttle time of primary shards |              |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |              |          0 |    min |
| Median cumulative indexing throttle time across primary shards |              |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |              |          0 |    min |
|                        Cumulative merge time of primary shards |              |    47.3679 |    min |
|                       Cumulative merge count of primary shards |              |        219 |        |
|                Min cumulative merge time across primary shards |              |          0 |    min |
|             Median cumulative merge time across primary shards |              |   0.130533 |    min |
|                Max cumulative merge time across primary shards |              |    9.59567 |    min |
|               Cumulative merge throttle time of primary shards |              |    31.5924 |    min |
|       Min cumulative merge throttle time across primary shards |              |          0 |    min |
|    Median cumulative merge throttle time across primary shards |              |  0.0534583 |    min |
|       Max cumulative merge throttle time across primary shards |              |    6.92777 |    min |
|                      Cumulative refresh time of primary shards |              |    4.31093 |    min |
|                     Cumulative refresh count of primary shards |              |        430 |        |
|              Min cumulative refresh time across primary shards |              |          0 |    min |
|           Median cumulative refresh time across primary shards |              |  0.0332083 |    min |
|              Max cumulative refresh time across primary shards |              |   0.654267 |    min |
|                        Cumulative flush time of primary shards |              |    9.90028 |    min |
|                       Cumulative flush count of primary shards |              |        109 |        |
|                Min cumulative flush time across primary shards |              |          0 |    min |
|             Median cumulative flush time across primary shards |              | 0.00775833 |    min |
|                Max cumulative flush time across primary shards |              |    1.92945 |    min |
|                                        Total Young Gen GC time |              |     10.559 |      s |
|                                       Total Young Gen GC count |              |        769 |        |
|                                          Total Old Gen GC time |              |          0 |      s |
|                                         Total Old Gen GC count |              |          0 |        |
|                                                     Store size |              |    19.4197 |     GB |
|                                                  Translog size |              |    1.18146 |     GB |
|                                         Heap used for segments |              |          0 |     MB |
|                                       Heap used for doc values |              |          0 |     MB |
|                                            Heap used for terms |              |          0 |     MB |
|                                            Heap used for norms |              |          0 |     MB |
|                                           Heap used for points |              |          0 |     MB |
|                                    Heap used for stored fields |              |          0 |     MB |
|                                                  Segment count |              |        364 |        |
|                                                 Min Throughput | index-append |     336361 | docs/s |
|                                                Mean Throughput | index-append |     370226 | docs/s |
|                                              Median Throughput | index-append |     362883 | docs/s |
|                                                 Max Throughput | index-append |     431022 | docs/s |
|                                        50th percentile latency | index-append |    63.0692 |     ms |
|                                        90th percentile latency | index-append |    292.299 |     ms |
|                                        99th percentile latency | index-append |     599.61 |     ms |
|                                      99.9th percentile latency | index-append |    1099.86 |     ms |
|                                     99.99th percentile latency | index-append |    1241.28 |     ms |
|                                       100th percentile latency | index-append |    1363.83 |     ms |
|                                   50th percentile service time | index-append |    63.0692 |     ms |
|                                   90th percentile service time | index-append |    292.299 |     ms |
|                                   99th percentile service time | index-append |     599.61 |     ms |
|                                 99.9th percentile service time | index-append |    1099.86 |     ms |
|                                99.99th percentile service time | index-append |    1241.28 |     ms |
|                                  100th percentile service time | index-append |    1363.83 |     ms |
|                                                     error rate | index-append |          0 |      % |


---------------------------------
[INFO] SUCCESS (took 769 seconds)
---------------------------------

@udabhas udabhas merged commit 46661cd into opensearch-project:main Dec 19, 2025
6 checks passed
prudhvigodithi added a commit to prudhvigodithi/opensearch-storage-encryption that referenced this pull request Mar 17, 2026
Replace per-read pin/unpin CAS cycle with thread-local PinScope that
holds one pinned block per thread. Since PinScope owns the pin, slices
keep their currentBlock reference across reads — the fast path
(blockOffset == currentBlockOffset) hits on every same-block read with
zero overhead.

JMH results (aggregationPatternReadLong, 512 reads/block, 8 threads):
  Baseline (PR opensearch-project#129):  19.0 ops/s  (bufferpool/mmap = 4.4%)
  PinScope:            84.5 ops/s  (bufferpool/mmap = 19.3%)
  Improvement:        4.5x faster

Changes:
- New PinScope.java: ThreadLocal single-block pin holder
- CachedMemorySegmentIndexInput: slices delegate pin lifecycle to
  PinScope; releasePinnedBlockIfSlice is now a no-op
- close(): slices clear local ref only, PinScope owns the pin
- Tests: clear PinScope in setUp to prevent cross-test stale state
- New aggregationPatternReadLong JMH benchmark for validation
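The thread-local idea in the commit above can be sketched as follows. This is an illustrative model, not the actual `PinScope.java`: a per-thread holder owns at most one pinned block, so repeated reads within the same block take the fast path (`blockOffset == pinnedOffset`) and skip the pin/unpin CAS cycle entirely.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a thread-local single-block pin holder: the scope owns the pin,
// so same-block reads hit a fast path with no CAS traffic.
public class PinScopeDemo {
    static final AtomicInteger pinOps = new AtomicInteger(); // counts pin/unpin transitions

    static final class PinScope {
        long pinnedOffset = -1;                     // -1: nothing pinned yet

        void acquire(long blockOffset) {
            if (blockOffset == pinnedOffset) {
                return;                             // fast path: block already pinned
            }
            if (pinnedOffset != -1) {
                pinOps.incrementAndGet();           // unpin the previously held block
            }
            pinOps.incrementAndGet();               // pin the new block
            pinnedOffset = blockOffset;
        }
    }

    // One scope per thread; the scope, not the slice, owns the pin.
    static final ThreadLocal<PinScope> SCOPE = ThreadLocal.withInitial(PinScope::new);

    public static void main(String[] args) {
        PinScope scope = SCOPE.get();
        for (int i = 0; i < 512; i++) {             // 512 reads/block, as in the JMH run
            scope.acquire(0L);                      // all hit the same block
        }
        System.out.println("pin ops=" + pinOps.get());
    }
}
```

512 same-block reads cost a single pin transition here, which is the source of the reported 4.5x throughput gain over the per-read pin/unpin cycle.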
prudhvigodithi added a commit to prudhvigodithi/opensearch-storage-encryption that referenced this pull request Mar 18, 2026
Signed-off-by: Prudhvi Godithi <pgodithi@amazon.com>