
Modify BufferPool prefetch to async load blocks#149

Open
asimmahmood1 wants to merge 26 commits into opensearch-project:main from asimmahmood1:prefetchAsync

Conversation


@asimmahmood1 asimmahmood1 commented Mar 2, 2026

Description

1. Async Prefetch Architecture

  • CaffeineBlockCache: Added prefetchExecutor field and async/sync execution split
    • loadMissingBlocks() - async when executor present, sync otherwise
    • loadMissingBlocksSync() - private method with actual logic
    • Returns long (count) instead of Map<>
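
The async/sync split above can be sketched as follows. This is a simplified illustration without the Caffeine dependency; the method names mirror the PR description, but the class shape and the stubbed load logic are assumptions, not the actual implementation.

```java
import java.util.concurrent.ExecutorService;

// Sketch of the async/sync execution split: loadMissingBlocks() dispatches to
// the prefetch executor when one is configured, otherwise it runs inline.
class PrefetchingCache {
    private final ExecutorService prefetchExecutor; // null => synchronous fallback

    PrefetchingCache(ExecutorService prefetchExecutor) {
        this.prefetchExecutor = prefetchExecutor;
    }

    /** Returns the number of blocks loaded (sync), or 0 immediately (async). */
    long loadMissingBlocks(long offset, long length) {
        if (prefetchExecutor != null) {
            prefetchExecutor.submit(() -> loadMissingBlocksSync(offset, length));
            return 0L; // async: the caller does not wait for the I/O
        }
        return loadMissingBlocksSync(offset, length);
    }

    // Private method holding the actual load logic (stubbed here: it just
    // counts how many blocks the [offset, offset+length) range spans).
    private long loadMissingBlocksSync(long offset, long length) {
        long blockSize = 8192;
        return (length + blockSize - 1) / blockSize;
    }
}
```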

2. Prefetch Cache for Deduplication

  • Global shared cache: ConcurrentHashMap<BlockCacheKey, Boolean> created in BlockCacheBuilder
  • Check-then-load pattern: prefetchCache.putIfAbsent() before checking main cache
  • Cleanup after load: Removes keys from prefetch cache after successful loadAllBlocks()
  • Passed through: BlockCacheBuilder → PoolResources → CaffeineBlockCache
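
The check-then-load pattern above can be sketched like this; the key type is simplified to `Long` (the real code uses `BlockCacheKey`) and the disk load is a stub, so this only illustrates the `putIfAbsent()` winner-selection and the cleanup step.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of prefetch deduplication: putIfAbsent() picks a single winner per
// key, and the key is removed after the load so a later prefetch of an
// evicted block is not blocked forever.
class PrefetchDedup {
    private final Map<Long, Boolean> prefetchCache = new ConcurrentHashMap<>();
    final AtomicInteger loads = new AtomicInteger();

    void prefetch(long blockOffset) {
        if (prefetchCache.putIfAbsent(blockOffset, Boolean.TRUE) != null) {
            return; // another thread is already loading this block
        }
        try {
            loads.incrementAndGet(); // stand-in for the real disk load
        } finally {
            prefetchCache.remove(blockOffset); // cleanup after successful load
        }
    }
}
```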

3. Smart Block Loading

  • loadMissingBlocks(): Checks cache first, only loads missing blocks, combines consecutive ranges
  • loadAllBlocks(): Single I/O call without cache check (for readahead)
  • Changed from FileBlockCacheKey[]: Stores full keys instead of offsets to enable cleanup
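
The range-combining step can be sketched as below. The 8KB block size and the `long[]{offset, length}` range representation are illustrative assumptions; the point is that runs of consecutive missing blocks collapse into one I/O request each.

```java
import java.util.ArrayList;
import java.util.List;

// Given the block offsets that are NOT in the cache (sorted ascending), emit
// one {offset, length} I/O request per run of consecutive blocks instead of
// one request per block.
final class RangeCombiner {
    static final long BLOCK_SIZE = 8192;

    static List<long[]> combine(List<Long> missingOffsets) {
        List<long[]> ranges = new ArrayList<>();
        int i = 0;
        while (i < missingOffsets.size()) {
            long start = missingOffsets.get(i);
            long end = start;
            // extend the run while the next missing block is exactly one block away
            while (i + 1 < missingOffsets.size() && missingOffsets.get(i + 1) == end + BLOCK_SIZE) {
                end = missingOffsets.get(++i);
            }
            ranges.add(new long[] { start, end - start + BLOCK_SIZE }); // {offset, length}
            i++;
        }
        return ranges;
    }
}
```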

4. Prefetch Threadpool Configuration

  • New threadpool: CRYPTO_PLUGIN_THREADPOOL_PREFETCH using FixedExecutorBuilder
  • Configurable settings:
    • node.store.crypto.prefetch.queue_size (default: threads × 1000)
    • node.store.crypto.prefetch.thread_count (default: processors × 4)
      • this accounts for approximately processors × 1.5 search threads and processors × 2 index_searcher threads
  • Wired through: CryptoDirectoryPlugin.getExecutorBuilders() → CryptoDirectoryFactory.setThreadPool()
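
The default sizing described above works out to the following arithmetic. In the plugin these become node settings registered through `FixedExecutorBuilder`; here they are computed directly for illustration, so the method names are hypothetical.

```java
// Defaults for the prefetch threadpool as described above:
// thread_count = processors x 4, queue_size = threads x 1000.
final class PrefetchPoolDefaults {
    static int defaultThreadCount(int availableProcessors) {
        // x4 leaves headroom for ~1.5x search and ~2x index_searcher threads
        return availableProcessors * 4;
    }

    static int defaultQueueSize(int threadCount) {
        // knn-style workloads can enqueue ~1000 prefetch calls per thread
        return threadCount * 1000;
    }
}
```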

5. API Changes

  • CachedMemorySegmentIndexInput: Added prefetch(offset, length) method
  • BlockCache interface:
    • loadForPrefetch() → loadMissingBlocks() (checks cache first)
    • Added loadAllBlocks() (no cache check)
    • Return type changed from Map<> to long (count)
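
The revised cache surface might look roughly like this; the method signatures are a sketch inferred from the description (the real interface is generic over the block value type), illustrating the `long` return type.

```java
// Sketch of the revised BlockCache surface: both load methods now return a
// count (long) rather than a Map of loaded blocks.
interface BlockCacheSketch {
    /** Checks the cache first and loads only missing blocks. Returns blocks loaded. */
    long loadMissingBlocks(long offset, long length);

    /** Single I/O call with no cache check, used by readahead. Returns blocks loaded. */
    long loadAllBlocks(long offset, long length);
}
```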

6. Test Updates

  • CaffeineBlockCacheTests: +3 async tests (async execution, deduplication, cleanup)

Key Architectural Improvements:

  1. Separation of concerns: Async execution moved from input layer to cache layer
  2. Efficient deduplication: Per-block HashMap check before expensive cache lookup
  3. Resource cleanup: Prefetch cache entries removed after successful load
  4. Configurable threading: Allows tuning for different workloads
  5. Smart loading: Only loads missing blocks, combines consecutive ranges

Related Issues

Resolves #119

Testing

jmh

Test setup in https://github.com/asimmahmood1/opensearch-storage-encryption/tree/jmhPrefetch

Sequential Prefetch: noKms vs jmhPrefetch (async)

  • baseline loadForPrefetch does disk I/O for every call even if the block is already cached (it loads first, then checks putIfAbsent)
  • The ~600-2700x gap is entirely from avoiding redundant disk reads
| Threads | Cache | baseline | Prefetch async | Speedup |
|---:|---:|---:|---:|---:|
| 1 | 1000 | 1.09 | 2,909 | ~2,670x |
| 1 | 10000 | 1.09 | 2,770 | ~2,541x |
| 4 | 1000 | 2.95 | 2,260 | ~766x |
| 4 | 10000 | 2.96 | 1,810 | ~612x |
| 8 | 1000 | 2.95 | 2,178 | ~738x |
| 8 | 10000 | 2.95 | 1,275 | ~432x |
| 16 | 1000 | 2.93 | 2,219 | ~757x |
| 16 | 10000 | 2.94 | 2,009 | ~683x |

OSB

Ran big5.

| Metric | Task | Baseline | Contender | Diff % | Diff | Unit |
|---|---|---:|---:|---:|---:|---|
| Store size | | 4.75663 | 4.75663 | 0.00% | 0 | GB |
| Segment count | | 9 | 9 | 0.00% | 0 | |
| Min Throughput | cardinality-agg-high | 2.00198 | 2.00281 | 0.04% | 0.00083 | ops/s |
| Mean Throughput | cardinality-agg-high | 2.00938 | 2.01331 | 0.20% | 0.00393 | ops/s |
| Median Throughput | cardinality-agg-high | 2.00393 | 2.00556 | 0.08% | 0.00162 | ops/s |
| Max Throughput | cardinality-agg-high | 2.14201 | 2.2065 | 3.01% | 0.06449 | ops/s |
| 50th percentile latency | cardinality-agg-high | 332.176 | 322.432 | -2.93% | -9.7437 | ms |
| 90th percentile latency | cardinality-agg-high | 372.484 | 356.947 | -4.17% | -15.5369 | ms |
| 99th percentile latency | cardinality-agg-high | 428.933 | 429.896 | 0.22% | 0.9636 | ms |
| 100th percentile latency | cardinality-agg-high | 469.811 | 444.854 | -5.31% 🟢 | -24.9569 | ms |
| 50th percentile service time | cardinality-agg-high | 331.222 | 321.437 | -2.95% | -9.78525 | ms |
| 90th percentile service time | cardinality-agg-high | 371.801 | 355.639 | -4.35% | -16.1622 | ms |
| 99th percentile service time | cardinality-agg-high | 428.085 | 428.869 | 0.18% | 0.78466 | ms |
| 100th percentile service time | cardinality-agg-high | 468.645 | 443.522 | -5.36% 🟢 | -25.1223 | ms |
| error rate | cardinality-agg-high | 0 | 0 | 0.00% | 0 | % |

Full Run

Details
Metric Task Baseline Contender Diff Unit
Cumulative indexing time of primary shards 0 0 0.00% 0 min
Min cumulative indexing time across primary shard 0 0 0.00% 0 min
Median cumulative indexing time across primary shard 0 0 0.00% 0 min
Max cumulative indexing time across primary shard 0 0 0.00% 0 min
Cumulative indexing throttle time of primary shards 0 0 0.00% 0 min
Min cumulative indexing throttle time across primary shard 0 0 0.00% 0 min
Median cumulative indexing throttle time across primary shard 0 0 0.00% 0 min
Max cumulative indexing throttle time across primary shard 0 0 0.00% 0 min
Cumulative merge time of primary shards 0 0 0.00% 0 min
Cumulative merge count of primary shards 0 0 0.00% 0
Min cumulative merge time across primary shard 0 0 0.00% 0 min
Median cumulative merge time across primary shard 0 0 0.00% 0 min
Max cumulative merge time across primary shard 0 0 0.00% 0 min
Cumulative merge throttle time of primary shards 0 0 0.00% 0 min
Min cumulative merge throttle time across primary shard 0 0 0.00% 0 min
Median cumulative merge throttle time across primary shard 0 0 0.00% 0 min
Max cumulative merge throttle time across primary shard 0 0 0.00% 0 min
Cumulative refresh time of primary shards 0 0 0.00% 0 min
Cumulative refresh count of primary shards 2 2 0.00% 0
Min cumulative refresh time across primary shard 0 0 0.00% 0 min
Median cumulative refresh time across primary shard 0 0 0.00% 0 min
Max cumulative refresh time across primary shard 0 0 0.00% 0 min
Cumulative flush time of primary shards 0 0 0.00% 0 min
Cumulative flush count of primary shards 1 1 0.00% 0
Min cumulative flush time across primary shard 0 0 0.00% 0 min
Median cumulative flush time across primary shard 0 0 0.00% 0 min
Max cumulative flush time across primary shard 0 0 0.00% 0 min
Total Young Gen GC time 0.203 0.326 0.06% 0.123 s
Total Young Gen GC count 24 25 4.17% 1
Total Old Gen GC time 0 0 0.00% 0 s
Total Old Gen GC count 0 0 0.00% 0
Store size 4.75663 4.75663 0.00% 0 GB
Translog size 5.12227e-08 5.12227e-08 0.00% 0 GB
Heap used for segments 0 0 0.00% 0 MB
Heap used for doc values 0 0 0.00% 0 MB
Heap used for terms 0 0 0.00% 0 MB
Heap used for norms 0 0 0.00% 0 MB
Heap used for points 0 0 0.00% 0 MB
Heap used for stored fields 0 0 0.00% 0 MB
Segment count 9 9 0.00% 0
50th percentile service time match-all 5.33123 5.86739 +10.06% 🔴 0.53617 ms
90th percentile service time match-all 5.78563 6.45272 +11.53% 🔴 0.66709 ms
99th percentile service time match-all 6.36024 6.902 +8.52% 🔴 0.54176 ms
100th percentile service time match-all 6.89619 7.40267 +7.34% 🔴 0.50648 ms
error rate match-all 0 0 0.00% 0 %
50th percentile service time desc_sort_timestamp 7.06757 7.43808 +5.24% 🔴 0.37051 ms
90th percentile service time desc_sort_timestamp 7.47604 8.17836 +9.39% 🔴 0.70232 ms
99th percentile service time desc_sort_timestamp 8.2023 10.5148 +28.19% 🔴 2.31246 ms
100th percentile service time desc_sort_timestamp 8.41588 11.1232 +32.17% 🔴 2.70729 ms
error rate desc_sort_timestamp 0 0 0.00% 0 %
50th percentile service time asc_sort_timestamp 6.07399 6.71714 +10.59% 🔴 0.64315 ms
90th percentile service time asc_sort_timestamp 6.60525 7.02824 +6.40% 🔴 0.42299 ms
99th percentile service time asc_sort_timestamp 7.25477 8.01705 +10.51% 🔴 0.76227 ms
100th percentile service time asc_sort_timestamp 7.47516 9.18894 +22.93% 🔴 1.71378 ms
error rate asc_sort_timestamp 0 0 0.00% 0 %
50th percentile service time desc_sort_with_after_timestamp 6.7219 7.1929 +7.01% 🔴 0.47099 ms
90th percentile service time desc_sort_with_after_timestamp 7.21773 7.65726 +6.09% 🔴 0.43952 ms
99th percentile service time desc_sort_with_after_timestamp 8.62742 7.94864 -7.87% 🟢 -0.67878 ms
100th percentile service time desc_sort_with_after_timestamp 8.72136 10.1188 +16.02% 🔴 1.39744 ms
error rate desc_sort_with_after_timestamp 0 0 0.00% 0 %
50th percentile service time asc_sort_with_after_timestamp 5.66367 6.20629 +9.58% 🔴 0.54262 ms
90th percentile service time asc_sort_with_after_timestamp 6.01183 6.50703 +8.24% 🔴 0.4952 ms
99th percentile service time asc_sort_with_after_timestamp 7.2456 6.77196 -6.54% 🟢 -0.47365 ms
100th percentile service time asc_sort_with_after_timestamp 7.33387 8.25086 +12.50% 🔴 0.91699 ms
error rate asc_sort_with_after_timestamp 0 0 0.00% 0 %
50th percentile service time desc_sort_timestamp_can_match_shortcut 15.7101 16.1844 3.02% 0.47429 ms
90th percentile service time desc_sort_timestamp_can_match_shortcut 16.2406 16.6953 2.80% 0.45475 ms
99th percentile service time desc_sort_timestamp_can_match_shortcut 18.4206 20.031 +8.74% 🔴 1.61044 ms
100th percentile service time desc_sort_timestamp_can_match_shortcut 22.6156 20.7142 -8.41% 🟢 -1.9014 ms
error rate desc_sort_timestamp_can_match_shortcut 0 0 0.00% 0 %
50th percentile service time desc_sort_timestamp_no_can_match_shortcut 15.6206 15.767 0.94% 0.14637 ms
90th percentile service time desc_sort_timestamp_no_can_match_shortcut 16.2876 16.3407 0.33% 0.05304 ms
99th percentile service time desc_sort_timestamp_no_can_match_shortcut 16.791 16.8323 0.25% 0.04133 ms
100th percentile service time desc_sort_timestamp_no_can_match_shortcut 21.8805 16.8549 -22.97% 🟢 -5.02559 ms
error rate desc_sort_timestamp_no_can_match_shortcut 0 0 0.00% 0 %
50th percentile service time asc_sort_timestamp_can_match_shortcut 8.71334 8.77906 0.75% 0.06572 ms
90th percentile service time asc_sort_timestamp_can_match_shortcut 8.99686 9.16502 1.87% 0.16816 ms
99th percentile service time asc_sort_timestamp_can_match_shortcut 9.22067 9.7336 +5.56% 🔴 0.51293 ms
100th percentile service time asc_sort_timestamp_can_match_shortcut 9.49736 12.0311 +26.68% 🔴 2.53375 ms
error rate asc_sort_timestamp_can_match_shortcut 0 0 0.00% 0 %
50th percentile service time asc_sort_timestamp_no_can_match_shortcut 8.47544 8.88359 4.82% 0.40815 ms
90th percentile service time asc_sort_timestamp_no_can_match_shortcut 8.74425 9.31907 +6.57% 🔴 0.57482 ms
99th percentile service time asc_sort_timestamp_no_can_match_shortcut 9.73853 12.4136 +27.47% 🔴 2.6751 ms
100th percentile service time asc_sort_timestamp_no_can_match_shortcut 12.4721 12.6794 1.66% 0.20726 ms
error rate asc_sort_timestamp_no_can_match_shortcut 0 0 0.00% 0 %
50th percentile service time term 2.10747 2.12477 0.82% 0.01731 ms
90th percentile service time term 2.25896 2.2814 0.99% 0.02244 ms
99th percentile service time term 2.481 2.53553 2.20% 0.05453 ms
100th percentile service time term 2.48144 2.69251 +8.51% 🔴 0.21107 ms
error rate term 0 0 0.00% 0 %
50th percentile service time multi_terms-keyword 3.05296 3.51322 +15.08% 🔴 0.46026 ms
90th percentile service time multi_terms-keyword 3.22996 3.83942 +18.87% 🔴 0.60946 ms
99th percentile service time multi_terms-keyword 3.58289 4.19302 +17.03% 🔴 0.61012 ms
100th percentile service time multi_terms-keyword 3.61061 4.20848 +16.56% 🔴 0.59787 ms
error rate multi_terms-keyword 0 0 0.00% 0 %
50th percentile service time keyword-terms 13.2875 14.3673 +8.13% 🔴 1.07975 ms
90th percentile service time keyword-terms 14.0648 14.9372 +6.20% 🔴 0.87248 ms
99th percentile service time keyword-terms 14.3675 15.911 +10.74% 🔴 1.54351 ms
100th percentile service time keyword-terms 15.1458 16.1412 +6.57% 🔴 0.9954 ms
error rate keyword-terms 0 0 0.00% 0 %
50th percentile service time keyword-terms-low-cardinality 7.58433 8.28041 +9.18% 🔴 0.69607 ms
90th percentile service time keyword-terms-low-cardinality 8.45918 9.26455 +9.52% 🔴 0.80538 ms
99th percentile service time keyword-terms-low-cardinality 9.61154 9.58736 -0.25% -0.02418 ms
100th percentile service time keyword-terms-low-cardinality 10.1142 9.70683 -4.03% -0.40735 ms
error rate keyword-terms-low-cardinality 0 0 0.00% 0 %
50th percentile service time composite-terms 2.81982 3.50044 +24.14% 🔴 0.68063 ms
90th percentile service time composite-terms 3.04717 3.75163 +23.12% 🔴 0.70446 ms
99th percentile service time composite-terms 3.37939 4.01992 +18.95% 🔴 0.64053 ms
100th percentile service time composite-terms 3.48056 4.21379 +21.07% 🔴 0.73323 ms
error rate composite-terms 0 0 0.00% 0 %
50th percentile service time composite_terms-keyword 2.7505 3.56073 +29.46% 🔴 0.81023 ms
90th percentile service time composite_terms-keyword 3.01512 3.89274 +29.11% 🔴 0.87762 ms
99th percentile service time composite_terms-keyword 3.22946 4.02166 +24.53% 🔴 0.79219 ms
100th percentile service time composite_terms-keyword 3.4604 4.27569 +23.56% 🔴 0.81529 ms
error rate composite_terms-keyword 0 0 0.00% 0 %
50th percentile service time composite-date_histogram-daily 3.06497 3.72072 +21.40% 🔴 0.65575 ms
90th percentile service time composite-date_histogram-daily 3.36834 3.96079 +17.59% 🔴 0.59245 ms
99th percentile service time composite-date_histogram-daily 3.63777 4.26223 +17.17% 🔴 0.62446 ms
100th percentile service time composite-date_histogram-daily 3.66295 4.3295 +18.20% 🔴 0.66656 ms
error rate composite-date_histogram-daily 0 0 0.00% 0 %
50th percentile service time range 4.92059 4.99095 1.43% 0.07036 ms
90th percentile service time range 5.16957 5.27564 2.05% 0.10606 ms
99th percentile service time range 5.47204 6.28938 +14.94% 🔴 0.81735 ms
100th percentile service time range 6.45927 6.4094 -0.77% -0.04987 ms
error rate range 0 0 0.00% 0 %
50th percentile service time range-numeric 1.65534 1.75821 +6.21% 🔴 0.10288 ms
90th percentile service time range-numeric 1.81575 1.9125 +5.33% 🔴 0.09675 ms
99th percentile service time range-numeric 1.96901 2.07095 +5.18% 🔴 0.10194 ms
100th percentile service time range-numeric 2.12584 2.12926 0.16% 0.00342 ms
error rate range-numeric 0 0 0.00% 0 %
50th percentile service time keyword-in-range 47.7598 48.1159 0.75% 0.35606 ms
90th percentile service time keyword-in-range 48.8703 49.1258 0.52% 0.25549 ms
99th percentile service time keyword-in-range 53.6782 53.6714 -0.01% -0.00679 ms
100th percentile service time keyword-in-range 54.8347 56.0567 2.23% 1.22204 ms
error rate keyword-in-range 0 0 0.00% 0 %
50th percentile service time date_histogram_hourly_agg 3.78211 3.83351 1.36% 0.0514 ms
90th percentile service time date_histogram_hourly_agg 4.07845 4.08291 0.11% 0.00446 ms
99th percentile service time date_histogram_hourly_agg 4.26438 4.34943 1.99% 0.08505 ms
100th percentile service time date_histogram_hourly_agg 4.41574 4.5001 1.91% 0.08437 ms
error rate date_histogram_hourly_agg 0 0 0.00% 0 %
50th percentile service time date_histogram_hourly_with_filter_agg 76.7156 80.5914 +5.05% 🔴 3.8758 ms
90th percentile service time date_histogram_hourly_with_filter_agg 95.2174 94.1445 -1.13% -1.07295 ms
99th percentile service time date_histogram_hourly_with_filter_agg 103.482 99.4576 -3.89% -4.02471 ms
100th percentile service time date_histogram_hourly_with_filter_agg 111.089 105.018 -5.47% 🟢 -6.07109 ms
error rate date_histogram_hourly_with_filter_agg 0 0 0.00% 0 %
50th percentile service time date_histogram_minute_agg 19.6701 21.1711 +7.63% 🔴 1.50103 ms
90th percentile service time date_histogram_minute_agg 21.0316 22.176 +5.44% 🔴 1.14435 ms
99th percentile service time date_histogram_minute_agg 21.5778 22.7066 +5.23% 🔴 1.12883 ms
100th percentile service time date_histogram_minute_agg 23.5867 22.9667 -2.63% -0.61996 ms
error rate date_histogram_minute_agg 0 0 0.00% 0 %
50th percentile service time scroll 406.894 405.263 -0.40% -1.63079 ms
90th percentile service time scroll 412.009 414.52 0.61% 2.51122 ms
99th percentile service time scroll 467.127 448.316 -4.03% -18.8106 ms
100th percentile service time scroll 467.739 459.127 -1.84% -8.61277 ms
error rate scroll 0 0 0.00% 0 %
50th percentile service time query-string-on-message 4.54705 4.80666 +5.71% 🔴 0.25961 ms
90th percentile service time query-string-on-message 4.78296 5.07629 +6.13% 🔴 0.29333 ms
99th percentile service time query-string-on-message 5.71624 5.35042 -6.40% 🟢 -0.36582 ms
100th percentile service time query-string-on-message 6.4403 5.89987 -8.39% 🟢 -0.54043 ms
error rate query-string-on-message 0 0 0.00% 0 %
50th percentile service time query-string-on-message-filtered 2.55509 3.24915 +27.16% 🔴 0.69406 ms
90th percentile service time query-string-on-message-filtered 2.81586 3.57133 +26.83% 🔴 0.75547 ms
99th percentile service time query-string-on-message-filtered 3.00406 3.7635 +25.28% 🔴 0.75944 ms
100th percentile service time query-string-on-message-filtered 3.93152 3.81243 -3.03% -0.11908 ms
error rate query-string-on-message-filtered 0 0 0.00% 0 %
50th percentile service time query-string-on-message-filtered-sorted-num 2.57539 3.29158 +27.81% 🔴 0.71619 ms
90th percentile service time query-string-on-message-filtered-sorted-num 2.83471 3.58323 +26.41% 🔴 0.74852 ms
99th percentile service time query-string-on-message-filtered-sorted-num 3.18062 4.11627 +29.42% 🔴 0.93565 ms
100th percentile service time query-string-on-message-filtered-sorted-num 3.28402 4.43583 +35.07% 🔴 1.15181 ms
error rate query-string-on-message-filtered-sorted-num 0 0 0.00% 0 %
50th percentile service time sort_keyword_can_match_shortcut 3.49719 3.51419 0.49% 0.017 ms
90th percentile service time sort_keyword_can_match_shortcut 3.74281 3.72567 -0.46% -0.01714 ms
99th percentile service time sort_keyword_can_match_shortcut 4.42667 4.21578 -4.76% -0.21089 ms
100th percentile service time sort_keyword_can_match_shortcut 4.51188 4.32403 -4.16% -0.18786 ms
error rate sort_keyword_can_match_shortcut 0 0 0.00% 0 %
50th percentile service time sort_keyword_no_can_match_shortcut 3.46648 3.42706 -1.14% -0.03943 ms
90th percentile service time sort_keyword_no_can_match_shortcut 3.67565 3.5882 -2.38% -0.08745 ms
99th percentile service time sort_keyword_no_can_match_shortcut 4.13905 3.83441 -7.36% 🟢 -0.30465 ms
100th percentile service time sort_keyword_no_can_match_shortcut 4.16243 4.73659 +13.79% 🔴 0.57416 ms
error rate sort_keyword_no_can_match_shortcut 0 0 0.00% 0 %
50th percentile service time sort_numeric_desc 3.63547 3.71957 2.31% 0.0841 ms
90th percentile service time sort_numeric_desc 3.84452 3.93636 2.39% 0.09183 ms
99th percentile service time sort_numeric_desc 4.06919 4.79127 +17.74% 🔴 0.72208 ms
100th percentile service time sort_numeric_desc 4.73341 5.01556 +5.96% 🔴 0.28215 ms
error rate sort_numeric_desc 0 0 0.00% 0 %
50th percentile service time sort_numeric_asc 3.65323 3.62205 -0.85% -0.03118 ms
90th percentile service time sort_numeric_asc 3.79218 3.84065 1.28% 0.04847 ms
99th percentile service time sort_numeric_asc 3.92672 3.99433 1.72% 0.06761 ms
100th percentile service time sort_numeric_asc 4.96029 4.11201 -17.10% 🟢 -0.84828 ms
error rate sort_numeric_asc 0 0 0.00% 0 %
50th percentile service time sort_numeric_desc_with_match 1.52568 1.62595 +6.57% 🔴 0.10027 ms
90th percentile service time sort_numeric_desc_with_match 1.66199 1.76939 +6.46% 🔴 0.1074 ms
99th percentile service time sort_numeric_desc_with_match 1.77765 1.87154 +5.28% 🔴 0.09389 ms
100th percentile service time sort_numeric_desc_with_match 1.77796 1.87976 +5.73% 🔴 0.10181 ms
error rate sort_numeric_desc_with_match 0 0 0.00% 0 %
50th percentile service time sort_numeric_asc_with_match 1.49831 1.53507 2.45% 0.03676 ms
90th percentile service time sort_numeric_asc_with_match 1.6647 1.71718 3.15% 0.05248 ms
99th percentile service time sort_numeric_asc_with_match 1.86646 1.83928 -1.46% -0.02718 ms
100th percentile service time sort_numeric_asc_with_match 2.11189 1.87457 -11.24% 🟢 -0.23732 ms
error rate sort_numeric_asc_with_match 0 0 0.00% 0 %
50th percentile service time range_field_conjunction_big_range_big_term_query 1.43486 1.44379 0.62% 0.00893 ms
90th percentile service time range_field_conjunction_big_range_big_term_query 1.57305 1.57702 0.25% 0.00397 ms
99th percentile service time range_field_conjunction_big_range_big_term_query 1.76164 1.66972 -5.22% 🟢 -0.09192 ms
100th percentile service time range_field_conjunction_big_range_big_term_query 1.9082 1.79807 -5.77% 🟢 -0.11013 ms
error rate range_field_conjunction_big_range_big_term_query 0 0 0.00% 0 %
50th percentile service time range_field_disjunction_big_range_small_term_query 1.56878 1.58749 1.19% 0.01871 ms
90th percentile service time range_field_disjunction_big_range_small_term_query 1.68525 1.77193 +5.14% 🔴 0.08668 ms
99th percentile service time range_field_disjunction_big_range_small_term_query 1.79692 1.90525 +6.03% 🔴 0.10833 ms
100th percentile service time range_field_disjunction_big_range_small_term_query 1.823 1.92929 +5.83% 🔴 0.1063 ms
error rate range_field_disjunction_big_range_small_term_query 0 0 0.00% 0 %
50th percentile service time range_field_conjunction_small_range_small_term_query 1.51183 1.58039 4.53% 0.06856 ms
90th percentile service time range_field_conjunction_small_range_small_term_query 1.64027 1.71293 4.43% 0.07266 ms
99th percentile service time range_field_conjunction_small_range_small_term_query 1.85023 1.89982 2.68% 0.0496 ms
100th percentile service time range_field_conjunction_small_range_small_term_query 1.86626 2.01001 +7.70% 🔴 0.14374 ms
error rate range_field_conjunction_small_range_small_term_query 0 0 0.00% 0 %
50th percentile service time range_field_conjunction_small_range_big_term_query 1.41612 1.41227 -0.27% -0.00385 ms
90th percentile service time range_field_conjunction_small_range_big_term_query 1.54616 1.52956 -1.07% -0.0166 ms
99th percentile service time range_field_conjunction_small_range_big_term_query 1.67479 1.65814 -0.99% -0.01666 ms
100th percentile service time range_field_conjunction_small_range_big_term_query 1.7091 1.76788 3.44% 0.05877 ms
error rate range_field_conjunction_small_range_big_term_query 0 0 0.00% 0 %
50th percentile service time range-auto-date-histo 490.565 495.548 1.02% 4.98328 ms
90th percentile service time range-auto-date-histo 525.215 524.636 -0.11% -0.57877 ms
99th percentile service time range-auto-date-histo 621.443 615.539 -0.95% -5.90403 ms
100th percentile service time range-auto-date-histo 702.496 695.363 -1.02% -7.13283 ms
error rate range-auto-date-histo 0 0 0.00% 0 %
50th percentile service time range-with-metrics 1875.43 1884.39 0.48% 8.9616 ms
90th percentile service time range-with-metrics 2000.93 1946.21 -2.73% -54.7181 ms
99th percentile service time range-with-metrics 2186.49 2183.67 -0.13% -2.82462 ms
100th percentile service time range-with-metrics 2267.92 2287.38 0.86% 19.451 ms
error rate range-with-metrics 0 0 0.00% 0 %
50th percentile service time range-auto-date-histo-with-metrics 1809.78 1815.64 0.32% 5.86349 ms
90th percentile service time range-auto-date-histo-with-metrics 1872.58 1870.32 -0.12% -2.25941 ms
99th percentile service time range-auto-date-histo-with-metrics 1963.78 2070.85 +5.45% 🔴 107.066 ms
100th percentile service time range-auto-date-histo-with-metrics 1996.35 2279.66 +14.19% 🔴 283.303 ms
error rate range-auto-date-histo-with-metrics 0 0 0.00% 0 %
50th percentile service time range-agg-1 1.70609 1.64496 -3.58% -0.06113 ms
90th percentile service time range-agg-1 1.82613 1.83282 0.37% 0.00669 ms
99th percentile service time range-agg-1 2.02114 1.93038 -4.49% -0.09076 ms
100th percentile service time range-agg-1 2.02428 2.0062 -0.89% -0.01808 ms
error rate range-agg-1 0 0 0.00% 0 %
50th percentile service time range-agg-2 1.64055 1.7353 +5.78% 🔴 0.09475 ms
90th percentile service time range-agg-2 1.80868 1.9107 +5.64% 🔴 0.10202 ms
99th percentile service time range-agg-2 1.99279 2.08039 4.40% 0.0876 ms
100th percentile service time range-agg-2 2.06418 2.21432 +7.27% 🔴 0.15014 ms
error rate range-agg-2 0 0 0.00% 0 %
50th percentile service time cardinality-agg-low 2.24735 3.51218 +56.28% 🔴 1.26483 ms
90th percentile service time cardinality-agg-low 2.39313 3.83441 +60.23% 🔴 1.44129 ms
99th percentile service time cardinality-agg-low 2.57683 3.99596 +55.07% 🔴 1.41913 ms
100th percentile service time cardinality-agg-low 2.65259 4.23045 +59.48% 🔴 1.57785 ms
error rate cardinality-agg-low 0 0 0.00% 0 %
50th percentile service time cardinality-agg-high 380.392 388.443 2.12% 8.05097 ms
90th percentile service time cardinality-agg-high 454.406 427.199 -5.99% 🟢 -27.2071 ms
99th percentile service time cardinality-agg-high 455.958 464.742 1.93% 8.7845 ms
100th percentile service time cardinality-agg-high 456.387 546.761 +19.80% 🔴 90.3746 ms
error rate cardinality-agg-high 0 0 0.00% 0 %

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • [n/a] API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • [n/a] Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

* length param is honored
* use OS threadpool model, new fixed threadpool is called 'crypto_plugin_prefetch_threadpool'
* currently uses available processors
* TODO: will add a config to make it a factor of available processors,
  likely defaulting to 2.0
* single async call that loads all blocks, can be modified to load all
  blocks in parallel if needed
* will do more performance tests

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
@asimmahmood1
Author

JMH - cold path

* block size: 4KB
* write using direct I/O so it's not mmapped (40kb)
* open using mmap channel I/O
* loop: prefetch blocks 1-10, then read blocks 1-10

JMH - hot path

* warm up, then prefetch and read

* prefetch is mostly IO work so threads will be blocked
* also prefetch will only be called in search path, which has fixed threads
* check cache first before prefetch
* the cache check may act to dedup, not sure if dedicated dedup strategy
  is needed
* will add JMH benchmarks, osb isn't showing any change

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* similar to Lucene's stored field reader
* use a long[32] array of startOffset, which is checked first
* this array is created per file; slices share the array
* a scaling threadpool doesn't have a queue, so switch back to a fixed one

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
final FileBlockCacheKey firstBlockKey = new FileBlockCacheKey(path, startBlockOffset);
if (blockCache.get(firstBlockKey) != null) {
return;
}

Can you please explain why the check for first block is needed here ? If first block is present that doesn't mean others are also present in the queue right ?

Author

@asimmahmood1 asimmahmood1 Mar 3, 2026


I should explain that in a comment as well: this is the simplest approach, and it might reload some blocks. What is the probability that subsequent blocks are missing if the first one is already cached? I would argue it is low, since prefetch is most useful for sequential reads.

Alternatives are:

  1. Load everything after the 1st missing block - still simple. I'm ok to go with this alternative; even though the blockCache lookup is sync, it's still cheaper than I/O.
  2. Check each missing block, then load them separately. To reduce the number of I/O calls, it would be better to collect contiguous blocks. Is it worth the effort?

Ideally I would add some metrics and test out all 3 approaches using benchmarks. On the other hand, I'm not familiar with non-search use cases like knn.

Contributor


Another thing: as we improve our readahead algorithm, readaheads should also be able to catch up for subsequent consecutive blocks.

* found a bug in loadForPrefetch, it doesn't check cache first

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* before, it was loading (I/O) all blocks regardless of cache entries
* now it loads only missing cache values
* contiguous missing blocks are combined into a single load call
* TODO: add metrics

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>

// Use cache size to determine, but double it so we're more aggressive than read ahead
if (queueSize == -1) {
queueSize = ReadAheadSizingPolicy.calculateQueueSize(maxCacheBlocks) * 2;

prefetch should be called for blocks which are deterministic to be accessed instead of being speculative like read ahead. With that in mind, should we keep the queue size same as maxCacheBlocks as that is the number of blocks we should be able to prefetch as part of one or more search requests. Whereas currently read-ahead is done in speculative manner without being IOContext aware which can lead to unnecessary cache churns. @abiesps thoughts ?


// Use cache size to determine, but double it so we're more aggressive than read ahead
if (queueSize == -1) {
queueSize = ReadAheadSizingPolicy.calculateQueueSize(maxCacheBlocks) * 2;
Collaborator


With that in mind, should we keep the queue size same as maxCacheBlocks as that is the number of blocks we should be able to prefetch as part of one or more search requests.

maxCacheBlocks will be in the 10s of thousands (remember each block is 8kb); a queue that large is better off as an UnboundedQueue. Anyway, if you think you need to prefetch all the cache blocks, your caching is really screwed up.

@kumargu
Collaborator

kumargu commented Mar 4, 2026

Please run an OSB http_log or big_5 and post results with and without this change.

* Based on the discussion, will estimate the default threadpool size to be (search + index_searcher) × 4. Since this prefetch will mostly be blocked on IO, and it's trying to help the search path by prefetching, we want to be more aggressive.

* For queue size: for search, Lucene itself only calls with block size 1 and there might be 10s of calls per query, but for knn the worst case can be much bigger, e.g. with 32 neighbors there can be 1000 calls. So we'll estimate threads × 1000 as the default. Will tune this in the future based on benchmark results.

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* use a concurrent hashmap to dedup
* the map is created per file but shared across slices; this avoids a shared
  map across each directory, keeps the concurrency load low, and uses a
  simple offset (long) as the key
* FastUtil would be even faster, but I don't want to introduce a new
  dependency
* added unit tests, will add JMH to prove the improvement

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* read ahead is only called when there is a cache miss
* while search prefetch may already be cached

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* now the executor is passed into CaffeineBlockCache
* single map per node, instead of per file

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Comment on lines +186 to +188
if (prefetchCache != null) {
prefetchCache.keySet().removeIf(key -> key instanceof FileBlockCacheKey fk && fk.filePath().equals(normalized));
}


In what case would a key being invalidated be in prefetchCache? If it is in prefetchCache, that means it was not found in the block cache and the download for the key is in progress.

* @param maxBlocks the maximum number of blocks to cache (currently unused but kept for API compatibility)
*/
public CaffeineBlockCache(Cache<BlockCacheKey, BlockCacheValue<T>> cache, BlockLoader<V> blockLoader, long maxBlocks) {
this(cache, blockLoader, maxBlocks, null, null);

Why do we want to support a null cache and executor inside the block cache? Not supporting them would also keep the load methods simpler by avoiding all the null checks.

@asimmahmood1
Author

JMH Test 1

https://github.com/asimmahmood1/opensearch-storage-encryption/blob/1d1f3cbf9fe54b3a4c3dc6560704ec9f7c0e04b2/src/jmh/java/org/opensearch/index/store/benchmark/PrefetchBufferpoolVsMMapBenchmark.java

File: 100MB encrypted file (12,800 blocks × 8KB)

Cache layers:

  • Large enough to hold entire file
  • L1: BlockSlotTinyCache — 32-slot direct-mapped, per-IndexInput clone
  • L2: CaffeineBlockCache — 15,000 blocks (~120MB), shared across all threads
  • Pool: 256MB MemorySegmentPool

Threads:

  • 1, 4 reader threads
  • 32 prefetch worker threads (fixed threadpool)

Read pattern:

  • Sequential: Each thread starts at offset 0
  • Per invocation: seek(offset) → readLong() × 1024 (one full block)
  • Advance offset by one block
  • If prefetchEnabled: prefetch block at offset + BLOCK_SIZE (1 block ahead)
  • On wrap (offset > fileLength): reset to 0, increment pass counter
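The per-invocation read pattern above can be sketched roughly as follows; `IndexInputLike` is a hypothetical stand-in for the benchmark's input type, not the real API.

```java
// Illustrative sketch of the benchmark's read loop: read one full block,
// optionally prefetch one block ahead, advance, and wrap at end of file.
public class ReadPatternSketch {
    static final int BLOCK_SIZE = 8192;
    static final int LONGS_PER_BLOCK = BLOCK_SIZE / Long.BYTES; // 1024

    interface IndexInputLike {
        void seek(long pos);
        long readLong();
        void prefetch(long offset, long length);
    }

    /** Reads one block at `offset`; returns the next offset (0 on wrap). */
    static long readOneBlock(IndexInputLike in, long offset, long fileLength, boolean prefetchEnabled) {
        in.seek(offset);
        long sum = 0;
        for (int i = 0; i < LONGS_PER_BLOCK; i++) {
            sum += in.readLong();
        }
        if (prefetchEnabled) {
            in.prefetch(offset + BLOCK_SIZE, BLOCK_SIZE); // 1 block ahead
        }
        long next = offset + BLOCK_SIZE;
        return next > fileLength ? 0 : next; // wrap: reset to 0
    }

    public static void main(String[] args) {
        IndexInputLike fake = new IndexInputLike() {
            long pos;
            public void seek(long p) { pos = p; }
            public long readLong() { pos += Long.BYTES; return 1; }
            public void prefetch(long offset, long length) { /* no-op in this sketch */ }
        };
        System.out.println(readOneBlock(fake, 0, 100L * 1024 * 1024, true)); // 8192
    }
}
```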

Other changes:

  • encryption - already disabled by commenting it out
  • no unpin per read block

JMH config: 1 warmup + 1 measurement iteration, 10s each, 1 fork, throughput mode

Threads (cacheWarm) (mode) (prefetchEnabled) Mode Score Units
1 TRUE bufferpool TRUE thrpt 103.563 ops/ms
1 TRUE bufferpool FALSE thrpt 152.53 ops/ms
1 FALSE bufferpool TRUE thrpt 2.207 ops/ms
1 FALSE bufferpool FALSE thrpt 1.346 ops/ms
4 TRUE bufferpool TRUE thrpt 315.867 ops/ms
4 TRUE bufferpool FALSE thrpt 622.721 ops/ms
4 FALSE bufferpool TRUE thrpt 6.928 ops/ms
4 FALSE bufferpool FALSE thrpt 5.387 ops/ms
1 FALSE mmap TRUE thrpt 283.351 ops/ms
1 FALSE mmap FALSE thrpt 186.264 ops/ms
4 FALSE mmap TRUE thrpt 1126.549 ops/ms
4 FALSE mmap FALSE thrpt 1132.028 ops/ms

Summary

  • Prefetch delivers real value on cold cache: +64% (1T), +29% (4T)
  • Prefetch is a significant tax on warm cache: -32% (1T), -49% (4T)
  • Prefetch hurts thread scaling from 4x → 3x due to executor contention
  • The net benefit depends on your workload's cache hit ratio — if it's above ~70-80%, prefetch is likely a net negative

Next Steps / issues

  1. We should disable writing to the cache during writes as well.
  2. With sequential reads, kernel readahead will also trigger.
  3. We should remove readaheads from both the Bufferpool and mmap. For mmap this can be done by passing the file access hint as Random.
  4. Why is dedup not 0 for block_ahead=1? [Turns out offsets were not disjoint; all threads were starting at 0]
  5. Use cache.getOrLoad() to dedup I/O (no contiguous I/O): although we're trying to reduce prefetch I/O by loading contiguous ranges, the Caffeine cache cannot dedup them.
  6. Remove CPU burn.
  7. Prefetch block count in a loop (1..16); it doesn't need to be contiguous, call in a loop.
  8. Prefetch block count: single call vs a loop over prefetch.
    1. then loop to read some X # of bytes
  9. Cold cache: invalidate after each block.
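One way to picture item 5: per-key dedup can be built with `computeIfAbsent` over CompletableFutures, but because each key loads independently, it cannot coalesce contiguous offsets into one larger I/O. A minimal sketch under that assumption; all names are illustrative, not the plugin's API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongFunction;

// Per-key I/O dedup: concurrent requests for the same block offset share one
// in-flight future. Contiguous offsets still map to separate loads.
public class PerKeyDedup {
    private final ConcurrentHashMap<Long, CompletableFuture<byte[]>> inFlight = new ConcurrentHashMap<>();
    private final LongFunction<byte[]> loader;

    public PerKeyDedup(LongFunction<byte[]> loader) {
        this.loader = loader;
    }

    public CompletableFuture<byte[]> load(long offset) {
        // computeIfAbsent guarantees exactly one future per offset at a time
        CompletableFuture<byte[]> f = inFlight.computeIfAbsent(
            offset, off -> CompletableFuture.supplyAsync(() -> loader.apply(off)));
        // once finished (success or failure), drop the entry so it can reload later
        f.whenComplete((result, error) -> inFlight.remove(offset, f));
        return f;
    }

    public static void main(String[] args) {
        PerKeyDedup d = new PerKeyDedup(off -> new byte[(int) off + 1]);
        System.out.println(d.load(7).join().length); // 8
    }
}
```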

Detailed stats:

Details

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, mode = bufferpool, prefetchEnabled = true)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=88
[STATS] CaffineCache[size =12800,hits=2268662, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=1134287, requested=1134287, loaded=0, deduped=0, cacheHit=1134287, hitRatio=100.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=1134375, misses=0, total=1134375, l1Rate=0.00%, l2Rate=100.00%]
113.433 ops/ms
Iteration 1:
[STATS] passes=81
[STATS] CaffineCache[size =12800,hits=2071199, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=1035559, requested=1035559, loaded=0, deduped=0, cacheHit=1035559, hitRatio=100.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=1035640, misses=0, total=1035640, l1Rate=0.00%, l2Rate=100.00%]
103.563 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_1Threads":
103.563 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, mode = bufferpool, prefetchEnabled = false)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=117
[STATS] CaffineCache[size =12800,hits=1510218, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=1510218, misses=0, total=1510218, l1Rate=0.00%, l2Rate=100.00%]
151.015 ops/ms
Iteration 1:
[STATS] passes=120
[STATS] CaffineCache[size =12800,hits=1525315, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=1525315, misses=0, total=1525315, l1Rate=0.00%, l2Rate=100.00%]
152.530 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_1Threads":
152.530 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, mode = bufferpool, prefetchEnabled = true)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=2
[STATS] CaffineCache[size =8185,hits=40266, misses=26113, hitRate=60.66%, loads=12706, evictions=0, avgLoadTime=0.75ms]
[STATS] Prefetch[calls=33783, requested=33783, loaded=513, deduped=0, cacheHit=19192, hitRatio=56.81%, loadRatio=1.52%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12802, free=4617, unallocated=19966, utilization=25.0%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=20977, misses=12808, total=33785, l1Rate=0.00%, l2Rate=62.09%]
3.378 ops/ms
Iteration 1:
[STATS] passes=2
[STATS] CaffineCache[size =5005,hits=31906, misses=37006, hitRate=46.30%, loads=10782, evictions=0, avgLoadTime=0.91ms]
[STATS] Prefetch[calls=22067, requested=22067, loaded=11639, deduped=0, cacheHit=20619, hitRatio=93.44%, loadRatio=52.74%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12802, free=7796, unallocated=19966, utilization=15.3%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=11285, misses=10784, total=22069, l1Rate=0.00%, l2Rate=51.14%]
2.207 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_1Threads":
2.207 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, mode = bufferpool, prefetchEnabled = false)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=2
[STATS] CaffineCache[size =7778,hits=20571, misses=25696, hitRate=44.46%, loads=12807, evictions=0, avgLoadTime=0.75ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12801, free=5023, unallocated=19967, utilization=23.7%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=20489, misses=12889, total=33378, l1Rate=0.00%, l2Rate=61.38%]
3.338 ops/ms
Iteration 1:
[STATS] passes=1
[STATS] CaffineCache[size =8436,hits=0, misses=26916, hitRate=0.00%, loads=13458, evictions=0, avgLoadTime=0.73ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12801, free=4365, unallocated=19967, utilization=25.7%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=0, misses=13458, total=13458, l1Rate=0.00%, l2Rate=0.00%]
1.346 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_1Threads":
1.346 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, mode = bufferpool, prefetchEnabled = true)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=235
[STATS] CaffineCache[size =12800,hits=6060495, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=3042069, requested=3042069, loaded=0, deduped=361, cacheHit=3041708, hitRatio=99.99%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=23517, l2Hits=3018787, misses=0, total=3042304, l1Rate=0.77%, l2Rate=99.23%]
304.202 ops/ms
Iteration 1:
[STATS] passes=247
[STATS] CaffineCache[size =12800,hits=6302699, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=3158513, requested=3158513, loaded=0, deduped=19, cacheHit=3158494, hitRatio=100.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=14555, l2Hits=3144205, misses=0, total=3158760, l1Rate=0.46%, l2Rate=99.54%]
315.867 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
315.867 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, mode = bufferpool, prefetchEnabled = false)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=455
[STATS] CaffineCache[size =12800,hits=4060033, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=1795374, l2Hits=4060033, misses=0, total=5855407, l1Rate=30.66%, l2Rate=69.34%]
585.505 ops/ms
Iteration 1:
[STATS] passes=485
[STATS] CaffineCache[size =12800,hits=4065823, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
[STATS] TinyCache[l1Hits=2161468, l2Hits=4065823, misses=0, total=6227291, l1Rate=34.71%, l2Rate=65.29%]
622.721 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
622.721 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, mode = bufferpool, prefetchEnabled = true)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=8
[STATS] CaffineCache[size =5835,hits=141813, misses=80085, hitRate=63.91%, loads=12440, evictions=0, avgLoadTime=0.78ms]
[STATS] Prefetch[calls=125723, requested=125723, loaded=6615, deduped=55527, cacheHit=51146, hitRatio=40.68%, loadRatio=5.26%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12801, free=6966, unallocated=19967, utilization=17.8%, allocation=39.1%]
[STATS] TinyCache[l1Hits=22624, l2Hits=54512, misses=48595, total=125731, l1Rate=17.99%, l2Rate=43.36%]
12.572 ops/ms
Iteration 1:
[STATS] passes=4
[STATS] CaffineCache[size =10357,hits=42862, misses=80643, hitRate=34.70%, loads=12663, evictions=0, avgLoadTime=0.77ms]
[STATS] Prefetch[calls=69288, requested=69288, loaded=4659, deduped=51965, cacheHit=1, hitRatio=0.00%, loadRatio=6.72%, inflight=1]
[STATS] PoolStats[max=32768, allocated=12801, free=2443, unallocated=19967, utilization=31.6%, allocation=39.1%]
[STATS] TinyCache[l1Hits=13768, l2Hits=4866, misses=50658, total=69292, l1Rate=19.87%, l2Rate=7.02%]
6.928 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
6.928 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, mode = bufferpool, prefetchEnabled = false)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=8
[STATS] CaffineCache[size =88,hits=66382, misses=64377, hitRate=50.77%, loads=12876, evictions=0, avgLoadTime=0.74ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=12712, unallocated=19968, utilization=0.3%, allocation=39.1%]
[STATS] TinyCache[l1Hits=23496, l2Hits=27757, misses=51501, total=102754, l1Rate=22.87%, l2Rate=27.01%]
10.275 ops/ms
Iteration 1:
[STATS] passes=4
[STATS] CaffineCache[size =757,hits=40405, misses=67339, hitRate=37.50%, loads=13468, evictions=0, avgLoadTime=0.73ms]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0]
[STATS] PoolStats[max=32768, allocated=12800, free=12043, unallocated=19968, utilization=2.3%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=2, misses=53871, total=53873, l1Rate=0.00%, l2Rate=0.00%]
5.387 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
5.387 ops/ms
Benchmark (cacheWarm) (mode) (prefetchEnabled) Mode Cnt Score Error Units
PrefetchBufferpoolVsMMapBenchmark.read_1Threads true bufferpool true thrpt 103.563 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_1Threads true bufferpool false thrpt 152.530 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_1Threads false bufferpool true thrpt 2.207 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_1Threads false bufferpool false thrpt 1.346 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads true bufferpool true thrpt 315.867 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads true bufferpool false thrpt 622.721 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads false bufferpool true thrpt 6.928 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads false bufferpool false thrpt 5.387 ops/ms

PrefetchBufferpoolVsMMapBenchmark.read_1Threads false mmap true thrpt 283.351 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_1Threads false mmap false thrpt 186.264 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads false mmap true thrpt 1126.549 ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads false mmap false thrpt 1132.028 ops/ms

Benchmark result is saved to /workplace/asimmahm/opensearch-storage-encryption/build/jmh-results/jmh_20260310_232238.json

@asimmahmood1
Author

asimmahmood1 commented Mar 12, 2026

JMH Test #2

https://github.com/asimmahmood1/opensearch-storage-encryption/blob/46b93109c1e5ec8f0dc51d3d265f4120d51a6f26/src/jmh/java/org/opensearch/index/store/benchmark/PrefetchBufferpoolVsMMapBenchmark.java

Same setup as #1, except:

  1. Prefetch 16 blocks in a loop, 16 blocks ahead
  2. disjoint offsets per thread; prefetch dedup is now 0

Results:

  1. Prefetch is still slow; the profiler shows a good amount of time is spent in executor.execute().
  2. Now I'm wondering if the whole dedup logic is worth it.
  3. Also found that FileBlockCacheKey.init() is taking time and memory because it calls path.getAbsolutePath().normalize(). This should be done once, before opening the file.
PrefetchBufferpoolVsMMapBenchmark.read_4Threads         true  bufferpool               true  thrpt         72.109          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads         true  bufferpool              false  thrpt        401.508          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads         true        mmap               true  thrpt       5177.424          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads         true        mmap              false  thrpt       7714.644          ops/ms
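The fix suggested in point 3 above can be sketched as normalizing the path once at file-open time and reusing it for every block key. `BlockKey` and `FileHandle` are hypothetical stand-ins for FileBlockCacheKey and the file-opening code, not the plugin's real types.

```java
import java.nio.file.Path;

// A block key that simply carries a precomputed, already-normalized path.
record BlockKey(Path path, long offset) {}

class FileHandle {
    private final Path normalized; // computed once at open time

    FileHandle(Path p) {
        this.normalized = p.toAbsolutePath().normalize();
    }

    // Every block key reuses the precomputed path: no per-key normalization.
    BlockKey keyFor(long offset) {
        return new BlockKey(normalized, offset);
    }
}
```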

@asimmahmood1
Author

JMH Test #3

  • compare the OpenSearch executor vs a plain JDK executor
    • the OpenSearch executor comes with the cost of ThreadContext create and destroy, which may explain why readahead also uses a JDK executor
  • new mode inline_check, which checks the cache before calling prefetch, avoiding any dedup logic inside the async thread
    • this is simpler logic
Benchmark                                              (cacheWarm)  (executorType)      (mode)  (prefetchMode)   Mode  Cnt    Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool           async  thrpt        74.814          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool    inline_check  thrpt       195.235          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool             off  thrpt       156.403          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true             jdk  bufferpool           async  thrpt        58.838          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true             jdk  bufferpool    inline_check  thrpt       163.626          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true             jdk  bufferpool             off  thrpt       520.743          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool           async  thrpt       0.170          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool    inline_check  thrpt       0.171          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool             off  thrpt       0.192          ops/ms
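A minimal sketch of what the inline_check mode amounts to, assuming a map-backed stand-in for the block cache; none of these names are the plugin's real API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;

// inline_check: the caller checks the cache on its own thread and only pays
// the executor handoff when the block is actually missing.
public class InlineCheckPrefetch {
    private final Map<Long, byte[]> cache = new ConcurrentHashMap<>();
    private final Executor executor;

    public InlineCheckPrefetch(Executor executor) {
        this.executor = executor;
    }

    /** Returns true if an async load was scheduled, false on a cache hit. */
    public boolean prefetch(long offset) {
        if (cache.containsKey(offset)) {
            return false; // warm path: no async handoff at all
        }
        // stand-in load: the real code would read and decrypt the block
        executor.execute(() -> cache.computeIfAbsent(offset, off -> new byte[8192]));
        return true;
    }

    public static void main(String[] args) {
        InlineCheckPrefetch p = new InlineCheckPrefetch(Runnable::run); // direct executor for the demo
        System.out.println(p.prefetch(0)); // true: miss, load scheduled
        System.out.println(p.prefetch(0)); // false: warm hit, no handoff
    }
}
```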
Details

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = async)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=57606
[STATS] prefetch[calls=11521520, totalMs=33041.35, avgUs=2.87]
[STATS] Prefetch[calls=11521520, requested=11521520, loaded=618, deduped=0, cacheHit=11520720, hitRatio=99.99%, loadRatio=0.01%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=23042058, misses=1165, hitRate=99.99%, loads=182, evictions=0, avgLoadTime=1.84ms]
[STATS] PoolStats[max=32768, allocated=802, free=2, unallocated=31966, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=11521337, misses=183, total=11521520, l1Rate=0.00%, l2Rate=100.00%]
75.615 ops/ms
Iteration 1:
[STATS] passes=57550
[STATS] prefetch[calls=11509984, totalMs=33597.56, avgUs=2.92]
[STATS] Prefetch[calls=11509984, requested=11509984, loaded=0, deduped=0, cacheHit=11509984, hitRatio=100.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=23019968, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=802, free=2, unallocated=31966, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=11509984, misses=0, total=11509984, l1Rate=0.00%, l2Rate=100.00%]
74.814 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
74.814 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=142703
[STATS] prefetch[calls=28540968, totalMs=4344.38, avgUs=0.15]
[STATS] Prefetch[calls=800, requested=800, loaded=623, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=77.88%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=57080959, misses=1958, hitRate=100.00%, loads=177, evictions=0, avgLoadTime=1.93ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=28540787, misses=181, total=28540968, l1Rate=0.00%, l2Rate=100.00%]
186.187 ops/ms
Iteration 1:
[STATS] passes=150181
[STATS] prefetch[calls=30036200, totalMs=3170.59, avgUs=0.11]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=60072400, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=30036200, misses=0, total=30036200, l1Rate=0.00%, l2Rate=100.00%]
195.235 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
195.235 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = off)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=115417
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=23082920, misses=1600, hitRate=99.99%, loads=800, evictions=0, avgLoadTime=1.16ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=23082920, misses=800, total=23083720, l1Rate=0.00%, l2Rate=100.00%]
151.410 ops/ms
Iteration 1:
[STATS] passes=120310
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=24062128, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=24062128, misses=0, total=24062128, l1Rate=0.00%, l2Rate=100.00%]
156.403 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
156.403 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = jdk, mode = bufferpool, prefetchMode = async)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=45118
[STATS] prefetch[calls=9024096, totalMs=34671.42, avgUs=3.84]
[STATS] Prefetch[calls=9024096, requested=9024096, loaded=638, deduped=0, cacheHit=7879790, hitRatio=87.32%, loadRatio=0.01%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=16904328, misses=1130, hitRate=99.99%, loads=162, evictions=0, avgLoadTime=3.09ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=9023926, misses=170, total=9024096, l1Rate=0.00%, l2Rate=100.00%]
58.653 ops/ms
Iteration 1:
[STATS] passes=45261
[STATS] prefetch[calls=9052344, totalMs=35857.89, avgUs=3.96]
[STATS] Prefetch[calls=9052344, requested=9052344, loaded=0, deduped=0, cacheHit=8203285, hitRatio=90.62%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=17256560, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=9052344, misses=0, total=9052344, l1Rate=0.00%, l2Rate=100.00%]
58.838 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
58.838 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = jdk, mode = bufferpool, prefetchMode = inline_check)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=123391
[STATS] prefetch[calls=24678552, totalMs=3219.24, avgUs=0.13]
[STATS] Prefetch[calls=800, requested=800, loaded=667, deduped=0, cacheHit=3, hitRatio=0.38%, loadRatio=83.38%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=49356174, misses=1870, hitRate=100.00%, loads=133, evictions=0, avgLoadTime=2.27ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=24678412, misses=140, total=24678552, l1Rate=0.00%, l2Rate=100.00%]
160.402 ops/ms
Iteration 1:
[STATS] passes=125866
[STATS] prefetch[calls=25173312, totalMs=2473.89, avgUs=0.10]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=50346624, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=25173312, misses=0, total=25173312, l1Rate=0.00%, l2Rate=100.00%]
163.626 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
163.626 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = jdk, mode = bufferpool, prefetchMode = off)

Fork: 1 of 1

Warmup Iteration 1:

[STATS] passes=366843
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=73368104, misses=1600, hitRate=100.00%, loads=800, evictions=0, avgLoadTime=1.49ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=73368104, misses=800, total=73368904, l1Rate=0.00%, l2Rate=100.00%]
476.871 ops/ms
Iteration 1:
[STATS] passes=400571
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=80114408, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=80114408, misses=0, total=80114408, l1Rate=0.00%, l2Rate=100.00%]
520.743 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
520.743 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = async)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=137
[STATS] prefetch[calls=27752, totalMs=425.84, avgUs=15.34]
[STATS] Prefetch[calls=27752, requested=27752, loaded=23780, deduped=0, cacheHit=13, hitRatio=0.05%, loadRatio=85.69%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=23793, misses=35706, hitRate=39.99%, loads=3972, evictions=0, avgLoadTime=9.90ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=23757, misses=3995, total=27752, l1Rate=0.00%, l2Rate=85.60%]
0.182 ops/ms
Iteration 1:
[STATS] passes=131
[STATS] prefetch[calls=26328, totalMs=33.40, avgUs=1.27]
[STATS] Prefetch[calls=26328, requested=26328, loaded=22578, deduped=0, cacheHit=10, hitRatio=0.04%, loadRatio=85.76%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=22588, misses=33823, hitRate=40.04%, loads=3750, evictions=0, avgLoadTime=10.66ms]
[STATS] PoolStats[max=32768, allocated=802, free=2, unallocated=31966, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=22573, misses=3755, total=26328, l1Rate=0.00%, l2Rate=85.74%]
0.170 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.170 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=133
[STATS] prefetch[calls=26920, totalMs=699.95, avgUs=26.00]
[STATS] Prefetch[calls=26920, requested=26920, loaded=23096, deduped=0, cacheHit=6, hitRatio=0.02%, loadRatio=85.79%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=23102, misses=61496, hitRate=27.31%, loads=3824, evictions=0, avgLoadTime=10.24ms]
[STATS] PoolStats[max=32768, allocated=803, free=3, unallocated=31965, utilization=2.4%, allocation=2.5%]
[STATS] TinyCache[l1Hits=0, l2Hits=23082, misses=3838, total=26920, l1Rate=0.00%, l2Rate=85.74%]
0.177 ops/ms
Iteration 1:
[STATS] passes=132
[STATS] prefetch[calls=26336, totalMs=36.50, avgUs=1.39]
[STATS] Prefetch[calls=26336, requested=26336, loaded=22558, deduped=0, cacheHit=11, hitRatio=0.04%, loadRatio=85.65%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=22569, misses=60220, hitRate=27.26%, loads=3778, evictions=0, avgLoadTime=10.57ms]
[STATS] PoolStats[max=32768, allocated=803, free=3, unallocated=31965, utilization=2.4%, allocation=2.5%]
[STATS] TinyCache[l1Hits=0, l2Hits=22555, misses=3781, total=26336, l1Rate=0.00%, l2Rate=85.64%]
0.171 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.171 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = off)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=160
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=0, misses=64448, hitRate=0.00%, loads=32224, evictions=0, avgLoadTime=1.23ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=0, misses=32224, total=32224, l1Rate=0.00%, l2Rate=0.00%]
0.211 ops/ms
Iteration 1:
[STATS] passes=148
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=0, misses=59168, hitRate=0.00%, loads=29584, evictions=0, avgLoadTime=1.35ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=0, misses=29584, total=29584, l1Rate=0.00%, l2Rate=0.00%]
0.192 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.192 ops/ms

@asimmahmood1
Author

JMH Test #3 (flame graphs)

https://github.com/asimmahmood1/opensearch-storage-encryption/blob/dbd365fed384d1c745b07fde445eed13425de1c1/src/jmh/java/org/opensearch/index/store/benchmark/PrefetchBufferpoolVsMMapBenchmark.java

PrefetchBufferpoolVsMMapBenchmark.read_4Threads-Throughput-cacheWarm-true-executorType-opensearch-mode-bufferpool-prefetchMode-async:
flame-cpu-forward.html

PrefetchBufferpoolVsMMapBenchmark.read_4Threads-Throughput-cacheWarm-true-executorType-opensearch-mode-bufferpool-prefetchMode-inline_check
flame-cpu-forward.html

PrefetchBufferpoolVsMMapBenchmark.read_4Threads-Throughput-cacheWarm-true-executorType-opensearch-mode-bufferpool-prefetchMode-off
flame-cpu-forward.html

@asimmahmood1
Author

asimmahmood1 commented Mar 12, 2026

My interpretation so far is that, since there is a 1:1 ratio of prefetches to reads within a block, the async hand-off is not cheap in this benchmark. There are over 2 million calls to the executor relative to the actual work the prefetch threads need to do.

In the real world, will there be this many prefetch calls relative to read calls? If there are many more reads than prefetches, then this cost should be low.

Another option is still to do the cache check inline in the search path, and asynchronously load only what's missing.
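A minimal sketch of that option (names like `InlineCheckPrefetcher` and `loadBlock` are illustrative, not the PR's API): check the cache inline on the search path, dedupe in-flight loads with `putIfAbsent` as described in the PR summary, and hand off only the missing blocks to the executor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

final class InlineCheckPrefetcher {
    private final ConcurrentHashMap<Long, byte[]> cache = new ConcurrentHashMap<>();
    // In-flight set for deduplication, analogous to the PR's prefetch cache.
    private final ConcurrentHashMap<Long, Boolean> inflight = new ConcurrentHashMap<>();
    private final ExecutorService executor;

    InlineCheckPrefetcher(ExecutorService executor) {
        this.executor = executor;
    }

    /** Returns the number of blocks handed off for async loading. */
    long prefetch(long firstBlock, int blockCount) {
        List<Long> missing = new ArrayList<>();
        for (long b = firstBlock; b < firstBlock + blockCount; b++) {
            // Inline cache check: skip blocks already cached; putIfAbsent
            // dedupes concurrent prefetches of the same block.
            if (!cache.containsKey(b) && inflight.putIfAbsent(b, Boolean.TRUE) == null) {
                missing.add(b);
            }
        }
        if (!missing.isEmpty()) {
            executor.submit(() -> {
                try {
                    for (long b : missing) {
                        cache.put(b, loadBlock(b)); // real code would combine consecutive ranges into one I/O
                    }
                } finally {
                    missing.forEach(inflight::remove); // cleanup so future reloads are possible
                }
            });
        }
        return missing.size();
    }

    private byte[] loadBlock(long block) {
        return new byte[8192]; // stand-in for the real read-and-decrypt
    }
}
```

With this shape, a fully warm cache pays only the `containsKey` check per block and never touches the executor, which is exactly the cost the benchmark is trying to isolate.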

JMH Test 4 - prefetch 16 blocks, read 16 times each block

Benchmark                                              (cacheWarm)  (executorType)      (mode)  (prefetchMode)   Mode  Cnt     Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool           async  thrpt         73.535          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool    inline_check  thrpt        374.221          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool     inline_load  thrpt        370.601          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool             off  thrpt        385.770          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap           async  thrpt       2587.772          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap    inline_check  thrpt       1950.268          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap     inline_load  thrpt       2082.947          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap             off  thrpt       7937.569          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool           async  thrpt          0.170          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool    inline_check  thrpt          0.171          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool     inline_load  thrpt          0.194          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool             off  thrpt          0.193          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap           async  thrpt         14.234          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap    inline_check  thrpt         11.220          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap     inline_load  thrpt          0.195          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap             off  thrpt         14.153          ops/ms
Details

dev-dsk-asimmahm-2c-a6d21262 % ./jmh_compact.sh

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = async)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=56073
[STATS] prefetch[calls=11215016, totalMs=32598.23, avgUs=2.91]
[STATS] Prefetch[calls=11215016, requested=11215016, loaded=630, deduped=0, cacheHit=11214216, hitRatio=99.99%, loadRatio=0.01%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=22429062, misses=1146, hitRate=99.99%, loads=170, evictions=0, avgLoadTime=1.72ms]
[STATS] PoolStats[max=32768, allocated=802, free=2, unallocated=31966, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=11214840, misses=176, total=11215016, l1Rate=0.00%, l2Rate=100.00%]
73.561 ops/ms
Iteration 1:
[STATS] passes=56567
[STATS] prefetch[calls=11313224, totalMs=33420.45, avgUs=2.95]
[STATS] Prefetch[calls=11313224, requested=11313224, loaded=0, deduped=0, cacheHit=11313224, hitRatio=100.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=22626448, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=802, free=2, unallocated=31966, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=11313224, misses=0, total=11313224, l1Rate=0.00%, l2Rate=100.00%]
73.535 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
73.535 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=278737
[STATS] prefetch[calls=55747928, totalMs=5611.72, avgUs=0.10]
[STATS] Prefetch[calls=800, requested=800, loaded=650, deduped=0, cacheHit=8, hitRatio=1.00%, loadRatio=81.25%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=111494914, misses=1897, hitRate=100.00%, loads=150, evictions=0, avgLoadTime=1.51ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=55747773, misses=155, total=55747928, l1Rate=0.00%, l2Rate=100.00%]
362.341 ops/ms
Iteration 1:
[STATS] passes=287863
[STATS] prefetch[calls=57572552, totalMs=4888.69, avgUs=0.08]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=115145104, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=57572552, misses=0, total=57572552, l1Rate=0.00%, l2Rate=100.00%]
374.221 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
374.221 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = inline_load)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=274780
[STATS] prefetch[calls=54956288, totalMs=6666.70, avgUs=0.12]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=109911776, misses=800, hitRate=100.00%, loads=800, evictions=0, avgLoadTime=1.05ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=54956288, misses=0, total=54956288, l1Rate=0.00%, l2Rate=100.00%]
357.194 ops/ms
Iteration 1:
[STATS] passes=285078
[STATS] prefetch[calls=57015792, totalMs=5354.89, avgUs=0.09]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=114031584, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=57015792, misses=0, total=57015792, l1Rate=0.00%, l2Rate=100.00%]
370.601 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
370.601 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = off)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=291510
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=58301552, misses=1600, hitRate=100.00%, loads=800, evictions=0, avgLoadTime=1.06ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=58301552, misses=800, total=58302352, l1Rate=0.00%, l2Rate=100.00%]
378.942 ops/ms
Iteration 1:
[STATS] passes=296749
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=59349864, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=59349864, misses=0, total=59349864, l1Rate=0.00%, l2Rate=100.00%]
385.770 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
385.770 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = async)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=1977521
[STATS] prefetch[calls=395504552, totalMs=10437.35, avgUs=0.03]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
2570.600 ops/ms
Iteration 1:
[STATS] passes=1990620
[STATS] prefetch[calls=398124096, totalMs=10490.30, avgUs=0.03]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
2587.772 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
2587.772 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = inline_check)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=1477068
[STATS] prefetch[calls=295413920, totalMs=14638.67, avgUs=0.05]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=295413920, hitRate=0.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
1920.412 ops/ms
Iteration 1:
[STATS] passes=1500223
[STATS] prefetch[calls=300044792, totalMs=14772.10, avgUs=0.05]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=300044792, hitRate=0.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
1950.268 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
1950.268 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = inline_load)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=1483658
[STATS] prefetch[calls=296731888, totalMs=20420.98, avgUs=0.07]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=296731088, misses=800, hitRate=100.00%, loads=800, evictions=0, avgLoadTime=1.10ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
1928.651 ops/ms
Iteration 1:
[STATS] passes=1602286
[STATS] prefetch[calls=320457440, totalMs=19915.07, avgUs=0.06]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=320457440, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
2082.947 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
2082.947 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = off)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=6117138
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
7952.378 ops/ms
Iteration 1:
[STATS] passes=6105911
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
7937.569 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
7937.569 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = async)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=138
[STATS] prefetch[calls=27856, totalMs=389.94, avgUs=14.00]
[STATS] Prefetch[calls=27856, requested=27856, loaded=23911, deduped=0, cacheHit=6, hitRatio=0.02%, loadRatio=85.84%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=23917, misses=35753, hitRate=40.08%, loads=3945, evictions=0, avgLoadTime=10.01ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=23898, misses=3958, total=27856, l1Rate=0.00%, l2Rate=85.79%]
0.181 ops/ms
Iteration 1:
[STATS] passes=131
[STATS] prefetch[calls=26328, totalMs=31.21, avgUs=1.19]
[STATS] Prefetch[calls=26328, requested=26328, loaded=22556, deduped=0, cacheHit=6, hitRatio=0.02%, loadRatio=85.67%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=22562, misses=33868, hitRate=39.98%, loads=3772, evictions=0, avgLoadTime=10.60ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=22554, misses=3774, total=26328, l1Rate=0.00%, l2Rate=85.67%]
0.170 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.170 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=136
[STATS] prefetch[calls=27456, totalMs=440.29, avgUs=16.04]
[STATS] Prefetch[calls=27456, requested=27456, loaded=23614, deduped=0, cacheHit=1, hitRatio=0.00%, loadRatio=86.01%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=23615, misses=62616, hitRate=27.39%, loads=3842, evictions=0, avgLoadTime=10.27ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=23593, misses=3863, total=27456, l1Rate=0.00%, l2Rate=85.93%]
0.178 ops/ms
Iteration 1:
[STATS] passes=132
[STATS] prefetch[calls=26352, totalMs=29.49, avgUs=1.12]
[STATS] Prefetch[calls=26352, requested=26352, loaded=22654, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=85.97%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=22654, misses=60103, hitRate=27.37%, loads=3698, evictions=0, avgLoadTime=10.81ms]
[STATS] PoolStats[max=32768, allocated=801, free=1, unallocated=31967, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=22651, misses=3701, total=26352, l1Rate=0.00%, l2Rate=85.96%]
0.171 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.171 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = inline_load)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=158
[STATS] prefetch[calls=31952, totalMs=39987.42, avgUs=1251.48]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=31952, misses=31952, hitRate=50.00%, loads=31952, evictions=0, avgLoadTime=1.24ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=31952, misses=0, total=31952, l1Rate=0.00%, l2Rate=100.00%]
0.207 ops/ms
Iteration 1:
[STATS] passes=149
[STATS] prefetch[calls=29928, totalMs=40028.61, avgUs=1337.50]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=29928, misses=29928, hitRate=50.00%, loads=29928, evictions=0, avgLoadTime=1.33ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=29928, misses=0, total=29928, l1Rate=0.00%, l2Rate=100.00%]
0.194 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.194 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = off)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=160
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=0, misses=64960, hitRate=0.00%, loads=32480, evictions=0, avgLoadTime=1.22ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=0, misses=32480, total=32480, l1Rate=0.00%, l2Rate=0.00%]
0.211 ops/ms
Iteration 1:
[STATS] passes=149
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=0, misses=59632, hitRate=0.00%, loads=29816, evictions=0, avgLoadTime=1.34ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
[STATS] TinyCache[l1Hits=0, l2Hits=0, misses=29816, total=29816, l1Rate=0.00%, l2Rate=0.00%]
0.193 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.193 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = async)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=10254
[STATS] prefetch[calls=2050800, totalMs=63.36, avgUs=0.03]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
13.324 ops/ms
Iteration 1:
[STATS] passes=10954
[STATS] prefetch[calls=2190800, totalMs=60.85, avgUs=0.03]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
14.234 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
14.234 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = inline_check)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=9420
[STATS] prefetch[calls=1884096, totalMs=209.39, avgUs=0.11]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=1884096, hitRate=0.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
12.239 ops/ms
Iteration 1:
[STATS] passes=8637
[STATS] prefetch[calls=1727304, totalMs=101.16, avgUs=0.06]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=1727304, hitRate=0.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
11.220 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
11.220 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = inline_load)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=159
[STATS] prefetch[calls=32136, totalMs=38801.14, avgUs=1207.40]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=0, misses=32136, hitRate=0.00%, loads=32136, evictions=0, avgLoadTime=1.20ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
0.208 ops/ms
Iteration 1:
[STATS] passes=150
[STATS] prefetch[calls=30080, totalMs=39533.60, avgUs=1314.28]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=0, misses=30080, hitRate=0.00%, loads=30080, evictions=0, avgLoadTime=1.31ms]
[STATS] PoolStats[max=32768, allocated=800, free=0, unallocated=31968, utilization=2.4%, allocation=2.4%]
0.195 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
0.195 ops/ms

Warmup: 1 iterations, 10 s each

Benchmark mode: Throughput, ops/time

Parameters: (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = off)

Fork: 1 of 1

Warmup Iteration 1: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.

[STATS] passes=11123
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
14.455 ops/ms
Iteration 1:
[STATS] passes=10894
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =0,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=0, free=0, unallocated=32768, utilization=0.0%, allocation=0.0%]
14.153 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
14.153 ops/ms

@asimmahmood1
Author

JMH Test 5 - 16 prefetch blocks, then full read

Benchmark                                              (cacheWarm)  (executorType)      (mode)  (prefetchMode)   Mode  Cnt    Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool           async  thrpt        31.124          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool    inline_check  thrpt        38.980          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool     inline_load  thrpt        44.019          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool             off  thrpt        44.259          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap           async  thrpt        74.412          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap    inline_check  thrpt        66.912          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap     inline_load  thrpt       418.439          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap             off  thrpt        92.577          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool           async  thrpt         0.171          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool    inline_check  thrpt         0.169          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool     inline_load  thrpt         0.194          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool             off  thrpt         0.195          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap           async  thrpt        12.632          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap    inline_check  thrpt        11.826          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap     inline_load  thrpt         0.194          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap             off  thrpt        10.679          ops/ms

@asimmahmood1
Author

asimmahmood1 commented Mar 13, 2026

JMH TEST 5

  • 100mb file, block size 8k
  • prefetch 16 blocks sequentially
  • read those 16 blocks completely, sequentially
  • prefetch to read ratio is 1:1024
  • for mmap, lucene's back off is disabled (via reflection)

PrefetchMode:

  • baseline - current code that issues I/O on every prefetch
  • async - the candidate in this PR
  • off - no prefetch; comparing against it isolates the overhead of the async path when everything is already in the cache and there is nothing to load

cacheWarm:

  • true - entire 100mb is filled in cache
  • false - the cache key is invalidated before prefetch, or madvise(MADV_DONTNEED) is called for mmap before the prefetched block

Result:

  • async is much faster than baseline (28.7 vs 2.0 ops/ms), but not as fast as prefetch off (45.3)

https://github.com/asimmahmood1/opensearch-storage-encryption/blob/351330ef2f87b2d29dd2d1f7b90b1fe33a66f5cf/src/jmh/java/org/opensearch/index/store/benchmark/PrefetchBufferpoolVsMMapBenchmark.java

Benchmark                                              (cacheWarm)  (executorType)      (mode)  (prefetchMode)   Mode  Cnt   Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool           async  thrpt       28.741          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool        baseline  thrpt        2.006          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool             off  thrpt       45.294          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap           async  thrpt       74.431          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap             off  thrpt       67.105          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool           async  thrpt        0.171          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool        baseline  thrpt        0.205          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool             off  thrpt        0.185          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap           async  thrpt        2.743          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap             off  thrpt        0.942          ops/ms

@asimmahmood1
Author

asimmahmood1 commented Mar 23, 2026

@sohami

Cost of prefetch async call when cache is warm (score 28.741)
flame-cpu-forward.html

Compared to stubbed/noop prefetch call (score 45.294)
flame-cpu-forward.html

The LinkedTransferQueue is not highly performant, so it might be worth trying higher-throughput queues. But before I do that, I will focus on reducing the I/O calls in the search path.

@abiesps recently added a new L1 cache (array based), so I will check that first in the read path.
On a miss, make an async call to prefetch into l2.getOrLoad(). But this is a follow-up.
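That follow-up idea could look roughly like the sketch below (names like `TieredBlockReader` and the array-based L1 layout are illustrative assumptions, not the actual L1 cache implementation): consult the array-based L1 first, and only on a miss go through the executor to the L2 `getOrLoad`-style path.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

final class TieredBlockReader {
    private final long[] l1Keys;       // direct-mapped, array-based L1
    private final byte[][] l1Values;
    private final ConcurrentHashMap<Long, byte[]> l2 = new ConcurrentHashMap<>();
    private final ExecutorService executor;

    TieredBlockReader(int l1Size, ExecutorService executor) {
        this.l1Keys = new long[l1Size];
        Arrays.fill(l1Keys, -1L);      // -1 marks an empty slot
        this.l1Values = new byte[l1Size][];
        this.executor = executor;
    }

    byte[] read(long block) {
        int slot = (int) (block % l1Keys.length);
        if (l1Keys[slot] == block) {
            return l1Values[slot];     // L1 hit: no executor hand-off at all
        }
        // L1 miss: async call into the L2 getOrLoad-style path.
        byte[] v = CompletableFuture
            .supplyAsync(() -> l2.computeIfAbsent(block, this::loadBlock), executor)
            .join();
        l1Keys[slot] = block;          // promote into L1 for subsequent reads
        l1Values[slot] = v;
        return v;
    }

    private byte[] loadBlock(long block) {
        return new byte[8192];         // stand-in for the real read-and-decrypt
    }
}
```

The point of the tier is that repeated reads of a hot block never pay the queue hand-off cost; only the first miss per slot does.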

Screenshot 2026-03-23 at 4 19 51 PM

https://github.com/asimmahmood1/opensearch-storage-encryption/blob/351330ef2f87b2d29dd2d1f7b90b1fe33a66f5cf/src/jmh/java/org/opensearch/index/store/benchmark/PrefetchBufferpoolVsMMapBenchmark.java

@asimmahmood1
Author

asimmahmood1 commented Mar 25, 2026

JMH Test 6 - Added getOrLoad mode

  • warm cache: similar or worse, since all the extra time is spent in LinkedTransferQueue
  • cold cache: shows improvement
Benchmark                                              (cacheWarm)  (executorType)      (mode)   (prefetchMode)   Mode  Cnt    Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool            async  thrpt        31.038          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool              off  thrpt        42.343          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch  bufferpool  async_getOrLoad  thrpt        28.809          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap            async  thrpt       521.461          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch        mmap              off  thrpt       663.659          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool            async  thrpt         0.171          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool              off  thrpt         0.195          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch  bufferpool  async_getOrLoad  thrpt         0.192          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap            async  thrpt         1.646          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads              false      opensearch        mmap              off  thrpt         0.758          ops/ms

asimmahmood1@6339ead#diff-b91741cefaa9515609d6aa33e0e3d9b7d37c4f64200d4b98ca6d0e1833cf23fd

Details
# Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = async)

Iteration   1:
[STATS] passes=23875
[STATS] prefetch[calls=4775176, totalMs=10981.35, avgUs=2.30]
[STATS] Prefetch[calls=4775176, requested=4775176, loaded=0, deduped=0, cacheHit=4775176, hitRatio=100.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=9550352, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=12800, free=12000, unallocated=19968, utilization=2.4%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=4775176, misses=0, total=4775176, l1Rate=0.00%, l2Rate=100.00%]
31.038 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
  31.038 ops/ms

# Parameters: (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = async_getOrLoad)
# Fork: 1 of 1


Iteration   1:
[STATS] passes=22162
[STATS] prefetch[calls=4432352, totalMs=12403.92, avgUs=2.80]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=8864704, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=12800, free=12000, unallocated=19968, utilization=2.4%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=4432352, misses=0, total=4432352, l1Rate=0.00%, l2Rate=100.00%]
28.809 ops/ms



# Parameters: (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = async)
# Fork: 1 of 1

Iteration   1:
[STATS] passes=1266
[STATS] prefetch[calls=253280, totalMs=9.48, avgUs=0.04]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =12800,hits=0, misses=0, hitRate=100.00%, loads=0, evictions=0, avgLoadTime=0.00ms]
[STATS] PoolStats[max=32768, allocated=12800, free=0, unallocated=19968, utilization=39.1%, allocation=39.1%]
1.646 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
  1.646 ops/ms


# Parameters: (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = async_getOrLoad)

Iteration   1:
[STATS] passes=150
[STATS] prefetch[calls=29600, totalMs=11.09, avgUs=0.37]
[STATS] Prefetch[calls=0, requested=0, loaded=0, deduped=0, cacheHit=0, hitRatio=0.00%, loadRatio=0.00%, inflight=0, rejections=0]
[STATS] CaffineCache[size =800,hits=29600, misses=51321, hitRate=36.58%, loads=29600, evictions=0, avgLoadTime=11.04ms]
[STATS] PoolStats[max=32768, allocated=12800, free=12000, unallocated=19968, utilization=2.4%, allocation=39.1%]
[STATS] TinyCache[l1Hits=0, l2Hits=7879, misses=21721, total=29600, l1Rate=0.00%, l2Rate=26.62%]
0.192 ops/ms
Result "org.opensearch.index.store.benchmark.PrefetchBufferpoolVsMMapBenchmark.read_4Threads":
  0.192 ops/ms


</details>

@asimmahmood1
Author

asimmahmood1 commented Mar 25, 2026

Hotpath Summary

  • without prefetch at all, bufferpool scores 42 vs mmap's 663: most of the time is spent in readLong, which is not JIT-compiled
  • @prudhvigodithi has some JIT-friendly changes; I will test them out so we can make sure bufferpool prefetch is as close to mmap as possible
  • plus time spent in TinyBlockCache
Screenshot 2026-03-25 at 10 07 06 AM
  • mmap:
Screenshot 2026-03-25 at 10 12 24 AM

@asimmahmood1
Author

JMH Test 7 - Compare no prefetch, with tiny block cache vs radix

Benchmark                                              (cacheWarm)  (executorType)  (l1CacheType)      (mode)  (prefetchMode)   Mode  Cnt   Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch      tinyCache  bufferpool             off  thrpt       41.393          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch          radix  bufferpool             off  thrpt       47.207          ops/ms

It's better, but still far from mmap's ~500 score.

3bdd4b6

@asimmahmood1
Author

asimmahmood1 commented Mar 25, 2026

JMH Test 8 - use Unsafe instead of Segment

  • goes from 47 to 81, still far from mmap's ~500
Benchmark                                              (cacheWarm)  (executorType)  (l1CacheType)      (mode)  (prefetchMode)   Mode  Cnt   Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch      tinyCache  bufferpool             off  thrpt    2  67.378          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch          radix  bufferpool             off  thrpt    2  81.860          ops/ms

In the flamegraph, CachedMemorySegmentIndexInput.readLong is 4345 samples, and CachedMemorySegmentIndexInput.getCacheBlockWithOffset is 2803 of those, which is 68%.

Of that, I separated the currentBlock-miss handling into a separate method, CachedMemorySegmentIndexInput.getCacheBlockWithOffsetSlow, which is only 165 samples (only ~5% of getCacheBlockWithOffset). So the limiting factor is not the L1 or L2 lookup, but evaluating currentBlock itself.
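
The split described here is a common JIT-friendliness pattern: keep the hot method small enough to inline and push miss handling out-of-line. A schematic sketch, with illustrative names and an assumed 4 KiB block size (not the actual CachedMemorySegmentIndexInput code):

```java
// Schematic of the hot-path split: the common case stays small enough to inline.
final class BlockReader {
    private long currentBlockStart = -1;
    private long currentBlockEnd = -1;
    private byte[] currentBlock;          // stands in for the pinned MemorySegment
    private int slowPathHits;             // miss counter, for illustration

    byte readByte(long pos) {
        // Hot path: one range check against the cached current block.
        if (pos >= currentBlockStart && pos < currentBlockEnd) {
            return currentBlock[(int) (pos - currentBlockStart)];
        }
        return readByteSlow(pos);         // rare: kept out-of-line so readByte stays inlinable
    }

    private byte readByteSlow(long pos) {
        slowPathHits++;
        long blockStart = pos & ~4095L;   // 4 KiB blocks, an assumed size
        currentBlock = loadBlock(blockStart);
        currentBlockStart = blockStart;
        currentBlockEnd = blockStart + currentBlock.length;
        return currentBlock[(int) (pos - blockStart)];
    }

    private byte[] loadBlock(long blockStart) {
        // Stand-in for an L1/L2 cache lookup: fill with a deterministic pattern.
        byte[] b = new byte[4096];
        for (int i = 0; i < b.length; i++) b[i] = (byte) ((blockStart + i) & 0xFF);
        return b;
    }

    int slowPathHits() { return slowPathHits; }
}
```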

flame-cpu-forward.html

Screenshot 2026-03-25 at 4 35 27 PM

@asimmahmood1
Author

asimmahmood1 commented Mar 27, 2026

JMH Test 9 - simulate slow IO, reduce readLong

  • setup: 4 (down from 16) block prefetches in a loop, 16 blocks apart, 4 (down from 8) readLong calls
  • cache warm: async is 151 vs 154; no IO calls are made since everything is an L2 hit
  • cache cold: without IO delay, no prefetch is faster (0.751 vs 0.740); with IO delay, prefetch is faster (0.746 vs 0.740); will try with a higher IO delay
Benchmark                                              (awaitPrefetch)  (cacheWarm)  (executorType)  (l1CacheType)      (mode)  (prefetchMode)  (simulatedIoLatencyUs)   Mode  Cnt    Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false         true      opensearch      tinyCache  bufferpool           async                       0  thrpt       111.680          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false         true      opensearch      tinyCache  bufferpool           async                     500  thrpt       110.929          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false         true      opensearch      tinyCache  bufferpool             off                       0  thrpt       154.941          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false         true      opensearch      tinyCache  bufferpool             off                     500  thrpt       156.734          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool           async                       0  thrpt         0.740          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool           async                     500  thrpt         0.746          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool             off                       0  thrpt         0.751          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool             off                     500  thrpt         0.751          ops/ms

@asimmahmood1
Author

asimmahmood1 commented Mar 27, 2026

JMH Test 10 - slower IO comparison

  • Prefetch helps with slower IO when the cache is cold
  • For the warm cache, I will test without any reads, compared to mmap
Benchmark                                              (awaitPrefetch)  (cacheWarm)  (executorType)  (l1CacheType)      (mode)  (prefetchMode)  (simulatedIoLatencyUs)   Mode  Cnt    Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false         true      opensearch      tinyCache  bufferpool           async                       0  thrpt       108.694          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false         true      opensearch      tinyCache  bufferpool             off                       0  thrpt       159.598          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool           async                       0  thrpt         0.745          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool           async                    1000  thrpt         0.746          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool           async                    2000  thrpt         0.735          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool             off                       0  thrpt         0.752          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool             off                    1000  thrpt         0.548          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads                  false        false      opensearch      tinyCache  bufferpool             off                    2000  thrpt         0.353          ops/ms

@asimmahmood1
Author

asimmahmood1 commented Mar 27, 2026

JMH 11 - Hot path cost of async

  • for mmap async, the madvise backoff is disabled via mmapPrefetchField.setInt(ts.threadInput, 0);
Benchmark                                              (cacheWarm)  (l1CacheType)      (mode)  (prefetchMode)  (skipRead)   Mode  Cnt       Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      tinyCache  bufferpool           async        true  thrpt          301.786          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      tinyCache  bufferpool             off        true  thrpt       342376.546          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      tinyCache        mmap           async        true  thrpt         1022.119          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      tinyCache        mmap             off        true  thrpt       332041.566          ops/ms

asimmahmood1@7c7a093

@abiesps
Contributor

abiesps commented Mar 27, 2026

I am trying to see how we can make the hot path of prefetch faster than mmap.
Can we try a benchmark which:
a) Pre-warms the L1 and L2 cache.
b) Uses RadixBlockTable as the L1 cache.
c) With bufferpool, the first check in the prefetch API is a plain read on L1 (without tryPin); if L1 has the data, skip the prefetch.
d) Another variant on RadixBlockTable with tryPin (if we use memory pooling).
e) With mmap, it's the existing madvise and MemorySegment.isLoaded call without file-level backoff.

@asimmahmood1
Author

asimmahmood1 commented Mar 29, 2026

JMH 12 - Hotpath: forkjoin is 3x faster than opensearch threadpool

  • tried both ArrayBlockingQueue and forkjoin; forkjoin is 3x faster even without the L1 check

    ForkJoinPool is the clear winner:

    • forkjoin ~2500-3000 ops/ms vs opensearch ~320-435 ops/ms — 7-8x faster submission throughput
    • Even beats mmap's prefetch path (1003 ops/ms) by 3x

    L1 check helps with radix + opensearch:

    • opensearch/radix/l1_then_prefetch = 435 vs opensearch/radix/async = 349 — 25% improvement
    • The L1 radix contains() check is cheap enough to save the executor submission overhead
    • With forkjoin the difference disappears because submission is already so cheap

    Radix > tinyCache with opensearch executor:

    • opensearch/radix/async = 349 vs opensearch/tinyCache/async = 316 — radix is ~10% faster
    • The radix contains() is two plain array loads vs tinyCache's stamp acquire-load + hash compare

    Summary for production recommendation:

    Change                                    Impact
    Switch prefetch executor to ForkJoinPool  7-8x faster prefetch submission
    Add L1 check before prefetch (radix)      25% faster with current executor, free insurance with ForkJoinPool
    Radix L1 cache over tinyCache             10% faster overall

    The biggest win by far is the executor change. The l1_then_prefetch check is a nice optimization on top: it avoids unnecessary work even when submission is cheap, and becomes more valuable under real IO load, where each avoided submission saves an actual EFS round-trip.
Benchmark                                              (cacheWarm)  (executorType)  (l1CacheType)      (mode)    (prefetchMode)  (skipRead)   Mode  Cnt     Score   Error   Units
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch      tinyCache  bufferpool             async        true  thrpt        316.538          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch      tinyCache  bufferpool  l1_then_prefetch        true  thrpt        320.108          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch      tinyCache        mmap             async        true  thrpt       1003.410          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch          radix  bufferpool             async        true  thrpt        349.937          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch          radix  bufferpool  l1_then_prefetch        true  thrpt        435.632          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true      opensearch          radix        mmap             async        true  thrpt        976.015          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true        forkjoin      tinyCache  bufferpool             async        true  thrpt       2489.217          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true        forkjoin      tinyCache  bufferpool  l1_then_prefetch        true  thrpt       2982.549          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true        forkjoin          radix  bufferpool             async        true  thrpt       3000.462          ops/ms
PrefetchBufferpoolVsMMapBenchmark.read_4Threads               true        forkjoin          radix  bufferpool  l1_then_prefetch        true  thrpt       2705.352          ops/ms

https://github.com/asimmahmood1/opensearch-storage-encryption/blob/2b237e44bde629e67669f6a1ba0ffb36b0dcac9f/src/jmh/java/org/opensearch/index/store/benchmark/PrefetchBufferpoolVsMMapBenchmark.java

…d style to dedup load with search

* to avoid duplication with search requests, use getOrLoad
* we lose the IO-collapsing option, but that will be redone at a lower IO layer
* ForkJoin shows a 10x improvement compared to the LinkedTransferQueue used by the fixed opensearch threadpool
* 3x improvement compared to mmap prefetch, with isLoaded and without backoff
* added more metrics

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
final long endFileOffset = absoluteBaseOffset + offset + length;
final long endBlockOffset = (endFileOffset + CACHE_BLOCK_MASK) & ~CACHE_BLOCK_MASK;
final long blockCount = (endBlockOffset - startBlockOffset) >>> CACHE_BLOCK_SIZE_POWER;
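
The quoted range arithmetic can be sanity-checked in isolation. In this sketch the block size (8 KiB, so CACHE_BLOCK_SIZE_POWER = 13) is an assumed value for illustration, not necessarily the plugin's:

```java
// Stand-alone check of the block-range arithmetic from the diff context above.
final class BlockRangeMath {
    static final int CACHE_BLOCK_SIZE_POWER = 13;                       // assumed: 8 KiB blocks
    static final long CACHE_BLOCK_MASK = (1L << CACHE_BLOCK_SIZE_POWER) - 1;

    /** Number of cache blocks touched by [base+offset, base+offset+length). */
    static long blockCount(long absoluteBaseOffset, long offset, long length) {
        final long startFileOffset = absoluteBaseOffset + offset;
        final long startBlockOffset = startFileOffset & ~CACHE_BLOCK_MASK;                   // round down
        final long endFileOffset = absoluteBaseOffset + offset + length;
        final long endBlockOffset = (endFileOffset + CACHE_BLOCK_MASK) & ~CACHE_BLOCK_MASK;  // round up
        return (endBlockOffset - startBlockOffset) >>> CACHE_BLOCK_SIZE_POWER;
    }
}
```

For example, a read of 1000 bytes starting at offset 8000 straddles the 8192 boundary, so it touches two blocks.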


Are we planning to add an L1 cache short-circuit here? I was thinking of adding it now, even with BlockSlotTinyCache, so that once we replace the tiny cache with the Radix table we don't miss it.

@abiesps
Contributor

abiesps commented Mar 30, 2026

I am also wondering: why is ForkJoinPool so much better than the opensearch threadpool? Do we know the reason behind it?

@abiesps
Contributor

abiesps commented Mar 30, 2026

On the cold path of prefetch,

How much is the delay? Is there a way I can check how first-byte read latency varies between mmap and bufferpool as we increase this delay, starting from zero?

The ratio of first-byte read latency to actual IO latency (with delay) could be a good metric and benchmark for measuring cold-path prefetch performance. I want to be sure that we don't have any application-level bottlenecks adding to this delay unnecessarily. We could also run this benchmark on a host with a slower file system to see how this delay varies with file-system IO latencies.

@asimmahmood1
Author

asimmahmood1 commented Mar 30, 2026

I am also wondering why is ForkJoinPool so much better than opensearch threadpool ? Do we know the reason behind it ?

The main reason is that fork join doesn't have a single shared queue: each worker has its own queue. On submission, the task is randomly distributed to a worker; if a worker runs out of tasks, it steals from the others. The downside is that fork join doesn't provide strict FIFO. Also, the queues are unbounded, so I added the inflightCount check before submitting, so that under slow IO we start dropping prefetch requests. inflightMap.size() isn't a constant-time check, so I use a dedicated atomic int for the count.

If we do want stricter FIFO, there are other queue options that provide higher throughput (e.g. the JCTools queues used by Netty).
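
Since ForkJoinPool's per-worker deques are unbounded, the inflight counter described above is what provides backpressure. A minimal sketch of that drop-on-overload pattern (class and field names, and the cap policy, are assumptions, not the exact PR code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: cap inflight prefetch tasks with an O(1) counter instead of queue size.
final class BoundedPrefetcher {
    private final ExecutorService pool = new ForkJoinPool(4);   // work-stealing, per-worker deques
    private final AtomicInteger inflight = new AtomicInteger();
    private final AtomicInteger rejections = new AtomicInteger();
    private final int maxInflight;

    BoundedPrefetcher(int maxInflight) { this.maxInflight = maxInflight; }

    /** Submits the load unless too many are already inflight; drops (never blocks) on overload. */
    boolean tryPrefetch(Runnable load) {
        if (inflight.incrementAndGet() > maxInflight) {
            inflight.decrementAndGet();
            rejections.incrementAndGet();    // slow IO: shed prefetch work rather than queue it
            return false;
        }
        pool.execute(() -> {
            try {
                load.run();
            } finally {
                inflight.decrementAndGet();  // always release the slot, even if the load throws
            }
        });
        return true;
    }

    int rejections() { return rejections.get(); }
    void shutdown() { pool.shutdown(); }
}
```

Dropping a prefetch is harmless here because a prefetch is only a hint; the read path will still load the block on demand.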

Screenshot 2026-03-30 at 2 52 38 PM Screenshot 2026-03-30 at 2 59 08 PM

@asimmahmood1
Author

JMH Cold

Final comparison (both prewarmed, truly cold):

Component                 Bufferpool  Mmap
Prefetch call overhead    ~40µs       ~38µs (madvise)
IO time (inside loader)   ~818µs      N/A (kernel)
ReadByte (post-prefetch)  ~115µs      ~475µs
Total (JMH score)         ~900µs      ~620µs

Key findings:

  1. Prefetch call overhead is nearly identical: ~40µs for both. Bufferpool's executor dispatch is not a bottleneck.
  2. Mmap readByte is ~475µs: this is the page-fault cost. madvise(WILLNEED) queues kernel readahead, but it hasn't completed by the time readByte is called ~38µs later, so the read blocks on the page fault, waiting for disk IO.
  3. Bufferpool readByte is ~115µs, 4x faster than mmap, because by the time we call readByte the async prefetch has already completed (we polled until the block appeared in the cache). The read is just an L1 cache lookup plus an off-heap memory access.
  4. The total gap is ~280µs (900 vs 620). Bufferpool's total is higher because it waits for the full async IO to complete before the benchmark returns, while mmap's total = madvise (38µs) + readByte (475µs), where readByte absorbs the IO wait via the page fault.
  5. The real apples-to-apples comparison is prefetch+readByte: bufferpool = ~900µs (prefetch completes, then a 115µs read), mmap = 38 + 475 = ~513µs. The ~400µs gap is FileChannel.open(DIRECT) per block, the single biggest optimization opportunity.

    Bottom line: No application-level bottleneck in the prefetch path. The 40µs executor overhead is negligible. The gap vs mmap is almost entirely the cost of opening a DirectIO file descriptor per load.
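
One common way to attack the per-block FileChannel.open(DIRECT) cost called out above is to cache one channel per file and rely on positional reads. A sketch of that idea (not the PR's implementation; ExtendedOpenOption.DIRECT is deliberately omitted so the snippet runs anywhere):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: reuse one FileChannel per file instead of opening one per block load.
final class ChannelCache implements AutoCloseable {
    private final ConcurrentHashMap<Path, FileChannel> channels = new ConcurrentHashMap<>();

    FileChannel channelFor(Path path) {
        return channels.computeIfAbsent(path, p -> {
            try {
                // The real loader would add ExtendedOpenOption.DIRECT here.
                return FileChannel.open(p, StandardOpenOption.READ);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
    }

    int readBlock(Path path, long offset, ByteBuffer dst) throws IOException {
        // Positional read: does not touch the channel's file position,
        // so one channel is safe to share across prefetch threads.
        return channelFor(path).read(dst, offset);
    }

    @Override public void close() throws IOException {
        for (FileChannel ch : channels.values()) ch.close();
    }
}
```

The trade-off is file-descriptor lifetime management (the channel must be evicted when the underlying file is deleted), which is why this belongs in the cold-path IO layer rather than the benchmark.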

Setup

Benchmark Setup (per trial):

  1. Creates a 1MB test file via BufferPoolDirectory.createOutput()
  2. Opens both directories:
    - Bufferpool: BufferPoolDirectory with ForkJoinPool(4), CaffeineBlockCache(1000 blocks), MemorySegmentPool(32MB), QueuingWorker, TimestampingBlockLoader wrapping CryptoDirectIOBlockLoader
    - Mmap: MMapDirectory (Lucene standard), resolves consecutivePrefetchHitCount field via reflection
  3. Prewarms 20 iterations to JIT-compile hot paths and start ForkJoinPool threads
  4. Drops page cache / clears block cache after prewarm

Per invocation (cold setup):

  • Bufferpool: blockCache.invalidate(key) — removes the target block from Caffeine cache
  • Mmap: madvise(MADV_DONTNEED) on mmap'd segments + posix_fadvise(FADV_DONTNEED) on fd + reset Lucene's backoff counter to 0

Call Path

Benchmark thread                          ForkJoinPool worker thread
  ─────────────────                         ──────────────────────────
  t0 = nanoTime()
  │
  sharedInput.prefetch(offset, BLOCK_SIZE)
  │
  └─ CachedMemorySegmentIndexInput.prefetch()
     └─ blockCache.loadMissingBlocks(path, offset, 1)
        ├─ prefetchTracker.recordPrefetchCall(1)
        └─ prefetchTracker.execute(runnable)          ──→  ForkJoinPool picks up task
           └─ executor.execute(runnable)                   │
              [returns immediately, ~40µs from t0]         │
                                                           loadMissingBlocksSync()
                                                           │
                                                           ├─ prefetchTracker.putIfAbsent(key)  [dedup check]
                                                           │
                                                           └─ caffeineCache.get(key, loader)
                                                              │
                                                              └─ TimestampingBlockLoader.load()
                                                                 ├─ preSyscallNanos = nanoTime()  ← [~40µs after t0]
                                                                 │
                                                                 └─ CryptoDirectIOBlockLoader.load()
                                                                    ├─ FileChannel.open(path, DIRECT)
                                                                    ├─ directIOReadAligned()
                                                                    │  └─ channel.read(buffer, offset)  ← actual pread syscall
                                                                    ├─ segmentPool.tryAcquire()
                                                                    ├─ MemorySegment.copy(read → pooled)
                                                                    └─ return RefCountedMemorySegment[]
                                                                 │
                                                                 postSyscallNanos = nanoTime()  ← [~860µs after t0]
                                                                 │
                                                           block inserted into Caffeine cache
  │
  │ [polling loop]
  while (blockCache.get(key) == null)
      Thread.onSpinWait();
  │
  tDone = nanoTime()                                   ← [~870µs after t0]
  │
  sharedInput.seek(offset)
  tRead0 = nanoTime()
  sharedInput.readByte()
  │
  └─ CachedMemorySegmentIndexInput.readByte()
     └─ getCacheBlockWithOffset()
        └─ l1Cache.acquireRefCountedValue()  ← L1 cache hit (block just loaded)
           └─ segment.get(LAYOUT_BYTE, offset)  ← off-heap memory read
  │
  readByteNs = nanoTime() - tRead0            ← [~115µs]

Mmap call path:

  Benchmark thread
  ─────────────────
  t0 = nanoTime()
  │
  sharedInput.prefetch(offset, BLOCK_SIZE)
  │
  └─ MemorySegmentIndexInput.prefetch()
     ├─ consecutivePrefetchHitCount++ (reset to 0 by setupInvocation)
     ├─ segment.isLoaded()                    ← mincore syscall (returns false, pages evicted)
     │  └─ consecutivePrefetchHitCount = 0    [cache miss detected]
     └─ nativeAccess.madviseWillNeed(segment) ← madvise(MADV_WILLNEED) syscall
        [kernel queues readahead IO, returns immediately]
  │
  tDone = nanoTime()                          ← [~38µs after t0]
  │
  sharedInput.seek(offset)
  tRead0 = nanoTime()
  sharedInput.readByte()
  │
  └─ MemorySegmentIndexInput.readByte()
     └─ curSegment.get(LAYOUT_BYTE, curPosition)
        └─ [PAGE FAULT]                       ← kernel readahead hasn't finished yet
           └─ kernel blocks thread, waits for disk IO to complete
           └─ page loaded into physical memory
           └─ returns byte value
  │
  readByteNs = nanoTime() - tRead0            ← [~475µs, dominated by page fault IO wait]

Benchmark                                             (mode)  (simulatedIoDelayUs)  Mode  Cnt  Score   Error  Units
ColdPrefetchLatencyBenchmark.coldPrefetchLatency  bufferpool                     0    ss    3  0.901 ± 0.766  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency  bufferpool                   250    ss    3  1.273 ± 0.893  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency  bufferpool                   500    ss    3  1.646 ± 2.282  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency  bufferpool                  1000    ss    3  2.021 ± 0.876  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency  bufferpool                  2000    ss    3  3.031 ± 0.946  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency  bufferpool                  4000    ss    3  5.068 ± 1.503  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency  bufferpool                  8000    ss    3  9.137 ± 0.747  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency        mmap                     0    ss    3  0.621 ± 0.656  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency        mmap                   250    ss    3  0.619 ± 0.402  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency        mmap                   500    ss    3  0.601 ± 0.242  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency        mmap                  1000    ss    3  0.644 ± 0.111  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency        mmap                  2000    ss    3  0.639 ± 0.525  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency        mmap                  4000    ss    3  0.614 ± 0.279  ms/op
ColdPrefetchLatencyBenchmark.coldPrefetchLatency        mmap                  8000    ss    3  0.650 ± 0.344  ms/op

https://github.com/asimmahmood1/opensearch-storage-encryption/blob/3d013c1e6f40e2f430b42d12d02f64f7d21793af/src/jmh/java/org/opensearch/index/store/benchmark/ColdPrefetchLatencyBenchmark.java

* added an L1BlockCache interface with a contains method
* the L1 Radix table can be swapped in later
* contains is a cheap array lookup, no pinning

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
@abiesps
Contributor

abiesps commented Mar 31, 2026

So the time to submit IO is more or less the same with bufferpool and mmap, ~40 microseconds. The issues are in the IO path, which should be covered by cold-path optimizations for bufferpool.



Development

Successfully merging this pull request may close these issues.

[FEATURE] CachedMemorySegment should implement prefetch api of IndexInput

4 participants