Modify BufferPool prefetch to async load blocks #149
asimmahmood1 wants to merge 26 commits into opensearch-project:main from
Conversation
* length param is honored
* use OS threadpool model; the new fixed threadpool is called 'crypto_plugin_prefetch_threadpool'
* currently uses available processors
* TODO: will add a config to make it a factor of available processors, likely defaulting to 2.0
* single async call that loads all blocks; can be modified to load all blocks in parallel if needed
* will do more performance tests

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
JMH - cold path
* block size 4kb
* write using directIO so it's not mmaped
* 40kp open using mmap channel io
* loop: read 1-10 hot
* warm up, then prefetch and read
* prefetch is mostly IO work, so threads will be blocked
* also, prefetch will only be called in the search path, which has fixed threads
* check cache first before prefetch
* the cache check may act as dedup; not sure if a dedicated dedup strategy is needed
* will add JMH benchmarks; osb isn't showing any change

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* similar to Lucene's stored fields reader
* use a long[32] array of startOffset, which is checked first
* this array is created per file; slices share the array
* scaling threadpool doesn't have a queue; switch back to using fixed

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
final FileBlockCacheKey firstBlockKey = new FileBlockCacheKey(path, startBlockOffset);
if (blockCache.get(firstBlockKey) != null) {
    return;
}
Can you please explain why the check for the first block is needed here? If the first block is present, that doesn't mean the others are also present in the queue, right?
I should explain that in a comment as well: this is the simplest approach, and it might reload some blocks. What is the probability that subsequent blocks are missing if the 1st one is available? I would argue it is unlikely, since prefetch is useful for sequential reads.
Alternatives are:
- Load after the 1st missing block - should still be simple. I'm ok to go with this alternative. Even though the blockCache lookup is sync, it's still cheaper than IO.
- Check each missing block, then load them separately. To reduce the IO calls, it'd be better to collect contiguous blocks. Is it worth the effort?
Ideally I would add some metrics and test out all 3 using some benchmarks. On the other hand, I'm not familiar with non-search use cases like knn.
Another thing: as we improve our readahead algorithm, readaheads should also be able to catch up for subsequent consecutive blocks.
* found a bug in loadForPrefetch: it doesn't check the cache first

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* before, it was loading (IO) all blocks regardless of cache entries
* now it loads only missing cache values
* contiguous missing blocks are combined into a single load call
* TODO: add metrics

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
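A minimal sketch of the "combine contiguous missing blocks into one load call" idea described in this commit. The names here (`PrefetchCoalesce`, `BlockRange`, `coalesceMissing`) are illustrative, not the PR's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: given the block offsets a prefetch wants and the set of
// offsets already cached, emit one load range per contiguous run of misses.
public class PrefetchCoalesce {

    // A half-open range of file offsets [start, end) to load with a single IO call.
    public record BlockRange(long start, long end) {}

    public static List<BlockRange> coalesceMissing(long[] blockOffsets, Set<Long> cached, long blockSize) {
        List<BlockRange> ranges = new ArrayList<>();
        long runStart = -1;
        for (long off : blockOffsets) {
            if (cached.contains(off)) {
                if (runStart != -1) {            // a cached block closes the current run
                    ranges.add(new BlockRange(runStart, off));
                    runStart = -1;
                }
            } else if (runStart == -1) {
                runStart = off;                  // first miss opens a new run
            }
        }
        if (runStart != -1) {                    // close a run that reaches the end
            long last = blockOffsets[blockOffsets.length - 1];
            ranges.add(new BlockRange(runStart, last + blockSize));
        }
        return ranges;
    }

    public static void main(String[] args) {
        long bs = 8192;
        long[] offsets = {0, bs, 2 * bs, 3 * bs};
        // block at offset 8192 already cached -> two ranges: [0,8192) and [16384,32768)
        System.out.println(coalesceMissing(offsets, Set.of(bs), bs));
    }
}
```

With this shape, the "check each missing block, load separately" alternative collapses into at most one IO call per run of misses, at the cost of one cache lookup per block.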
// Use cache size to determine, but double it so we're more aggressive than read ahead
if (queueSize == -1) {
    queueSize = ReadAheadSizingPolicy.calculateQueueSize(maxCacheBlocks) * 2;
Prefetch should be called for blocks that are deterministically going to be accessed, instead of being speculative like read-ahead. With that in mind, should we keep the queue size the same as maxCacheBlocks, since that is the number of blocks we should be able to prefetch as part of one or more search requests? Whereas currently read-ahead is done speculatively, without being IOContext aware, which can lead to unnecessary cache churn. @abiesps thoughts?
// Use cache size to determine, but double it so we're more aggressive than read ahead
if (queueSize == -1) {
    queueSize = ReadAheadSizingPolicy.calculateQueueSize(maxCacheBlocks) * 2;
> With that in mind, should we keep the queue size same as maxCacheBlocks as that is the number of blocks we should be able to prefetch as part of one or more search requests.

maxCacheBlocks will be in the 10s of thousands (remember each block is 8kb); a queue that large is better off as an unbounded queue. Anyway, if you think you need to prefetch all the cache blocks, your caching is really screwed up.
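The trade-off being debated above (bounded queue sized off the cache, vs effectively unbounded at maxCacheBlocks) can be sketched with a plain `ThreadPoolExecutor`. The sizing formula and `calculateQueueSize` stand-in below are assumptions for illustration, not the plugin's actual code:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: a bounded queue derived from the cache size, with a discard
// policy so prefetch requests are dropped (not blocked on) when IO falls behind.
public class PrefetchPoolSketch {

    // Hypothetical stand-in for ReadAheadSizingPolicy.calculateQueueSize(maxCacheBlocks):
    // cap well below maxCacheBlocks, since a queue of tens of thousands of entries
    // would behave like an unbounded queue in practice.
    static int calculateQueueSize(long maxCacheBlocks) {
        return (int) Math.min(maxCacheBlocks / 64, 4096);
    }

    static ThreadPoolExecutor newPrefetchPool(int threads, long maxCacheBlocks) {
        int queueSize = calculateQueueSize(maxCacheBlocks) * 2; // 2x: more aggressive than read-ahead
        return new ThreadPoolExecutor(
            threads, threads, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(queueSize),
            new ThreadPoolExecutor.DiscardPolicy()   // drop prefetches rather than stall callers
        );
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newPrefetchPool(4, 80_000);
        System.out.println(pool.getQueue().remainingCapacity()); // 2 * min(80000/64, 4096) = 2500
        pool.shutdown();
    }
}
```

The discard policy matters as much as the size: a full queue should shed speculative work instead of back-pressuring the search path.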
Please run an OSB http_log or big_5 and post results with and without this change.
* Based on the discussion, will estimate the default threadpool size to be (search + index_searcher) * 4. Since this prefetch will mostly be blocked on IO, and it's trying to help the search path by prefetching, we want to be more aggressive.
* For queue size: for search, Lucene itself only calls with block size 1 and there might be 10s of calls per query, but knn can be much worse, e.g. with 32 neighbors there can be 1000 calls. So we'll estimate threads * 1000 as the default. Will tune this in the future based on benchmark results.

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
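The defaults described above reduce to two small formulas. A sketch, with illustrative method names (the actual setting names in the plugin may differ):

```java
// Sketch of the default sizing from the commit message above.
public class PrefetchSizing {

    // prefetch threads: (search + index_searcher) * 4, since the work is mostly blocked on IO
    static int prefetchThreads(int searchThreads, int indexSearcherThreads) {
        return (searchThreads + indexSearcherThreads) * 4;
    }

    // queue: threads * 1000, to absorb the worst case (e.g. knn issuing ~1000 calls per query)
    static int prefetchQueueSize(int prefetchThreads) {
        return prefetchThreads * 1000;
    }

    public static void main(String[] args) {
        // e.g. a node with 13 search threads and 4 index_searcher threads (illustrative numbers)
        int threads = prefetchThreads(13, 4);
        System.out.println(threads);                    // 68
        System.out.println(prefetchQueueSize(threads)); // 68000
    }
}
```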
* use a ConcurrentHashMap to dedup
* the map is created per file but shared across slices; this avoids a shared map across each directory, keeps the concurrency load low, and uses a simple offset (long) as key
* FastUtil would be even faster, but don't want to introduce a new dependency
* added unit tests; will add JMH to prove the improvement

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
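The per-file dedup map described above can be sketched as follows; `tryClaim`/`release` are illustrative names, not the PR's actual methods:

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch: the block's file offset (long) is the key, and putIfAbsent decides
// which caller actually issues the load. One instance per file, shared by slices.
public class PrefetchDedup {

    private final ConcurrentHashMap<Long, Boolean> inFlight = new ConcurrentHashMap<>();

    // returns true if this caller won the race and should perform the load
    public boolean tryClaim(long blockOffset) {
        return inFlight.putIfAbsent(blockOffset, Boolean.TRUE) == null;
    }

    // called when the load completes (or fails) so the block can be claimed again
    public void release(long blockOffset) {
        inFlight.remove(blockOffset);
    }

    public static void main(String[] args) {
        PrefetchDedup d = new PrefetchDedup();
        System.out.println(d.tryClaim(8192)); // true: first caller loads
        System.out.println(d.tryClaim(8192)); // false: duplicate prefetch is skipped
        d.release(8192);
        System.out.println(d.tryClaim(8192)); // true again after release
    }
}
```

Boxing a `long` key is the cost being traded against the FastUtil dependency mentioned above; a primitive-keyed map would avoid the allocation.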
* read ahead is only called when there is a cache miss
* while search prefetch may already be cached

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
* now the executor is passed into CaffeineBlockCache
* single map per node, instead of per file

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
if (prefetchCache != null) {
    prefetchCache.keySet().removeIf(key -> key instanceof FileBlockCacheKey fk && fk.filePath().equals(normalized));
}
In what case will a key that is getting invalidated be in prefetchCache? If it is in prefetchCache, that means it was not found in the block cache and the download for the key is in progress.
 * @param maxBlocks the maximum number of blocks to cache (currently unused but kept for API compatibility)
 */
public CaffeineBlockCache(Cache<BlockCacheKey, BlockCacheValue<T>> cache, BlockLoader<V> blockLoader, long maxBlocks) {
    this(cache, blockLoader, maxBlocks, null, null);
Why do we want to support a null cache and executor inside the block cache? Avoiding that would also help keep the code simple in the load methods, by avoiding all the null checks.
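One common way to address this without changing callers is a null-object default: store a never-null executor and let the inline case be a same-thread `Executor`. The class and field names below are illustrative, not CaffeineBlockCache's actual code:

```java
import java.util.concurrent.Executor;

// Sketch: default a nullable constructor argument to a same-thread executor,
// so load methods never need a null check before prefetchExecutor.execute(...).
public class NullSafeCacheSketch {

    private final Executor prefetchExecutor;

    public NullSafeCacheSketch(Executor prefetchExecutor) {
        // Runnable::run is a valid Executor that runs the task inline on the caller's thread.
        this.prefetchExecutor = prefetchExecutor != null ? prefetchExecutor : Runnable::run;
    }

    public String loadMode() {
        StringBuilder where = new StringBuilder();
        prefetchExecutor.execute(() -> where.append("ran"));
        // "ran" when the inline fallback is used; a real pool may run the task later
        return where.toString();
    }

    public static void main(String[] args) {
        System.out.println(new NullSafeCacheSketch(null).loadMode()); // "ran" — inline fallback
    }
}
```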
JMH Test 1
File: 100MB encrypted file (12,800 blocks × 8KB)
Cache layers:
Threads:
Read pattern:
Other changes:
JMH config: 1 warmup + 1 measurement iteration, 10s each, 1 fork, throughput mode
Summary
Next Steps / issues
Detailed stats:

Details (Warmup: 1 iteration, 10 s each; Benchmark mode: Throughput, ops/time; Fork 1 of 1)
* (cacheWarm = true, mode = bufferpool, prefetchEnabled = true): passes=88
* (cacheWarm = true, mode = bufferpool, prefetchEnabled = false): passes=117
* (cacheWarm = false, mode = bufferpool, prefetchEnabled = true): passes=2
* (cacheWarm = false, mode = bufferpool, prefetchEnabled = false): passes=2
* (cacheWarm = true, mode = bufferpool, prefetchEnabled = true): passes=235
* (cacheWarm = true, mode = bufferpool, prefetchEnabled = false): passes=455
* (cacheWarm = false, mode = bufferpool, prefetchEnabled = true): passes=8
* (cacheWarm = false, mode = bufferpool, prefetchEnabled = false): passes=8

PrefetchBufferpoolVsMMapBenchmark.read_1Threads false mmap true thrpt 283.351 ops/ms
Benchmark result is saved to /workplace/asimmahm/opensearch-storage-encryption/build/jmh-results/jmh_20260310_232238.json
JMH Test #2
Same setup as #1, except:
Results:
JMH Test #3
Details (Warmup: 1 iteration, 10 s each; Benchmark mode: Throughput, ops/time; Fork 1 of 1; Warmup Iteration 1 logged: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.)
* (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = async): passes=57606
* (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check): passes=142703
* (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = off): passes=115417
* (cacheWarm = true, executorType = jdk, mode = bufferpool, prefetchMode = async): passes=45118
* (cacheWarm = true, executorType = jdk, mode = bufferpool, prefetchMode = inline_check): passes=123391
* (cacheWarm = true, executorType = jdk, mode = bufferpool, prefetchMode = off): passes=366843
* (parameters truncated in log): passes=137
* (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check): passes=133
* (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = off): passes=160
JMH TEST 3
PrefetchBufferpoolVsMMapBenchmark.read_4Threads-Throughput-cacheWarm-true-executorType-opensearch-mode-bufferpool-prefetchMode-async:
PrefetchBufferpoolVsMMapBenchmark.read_4Threads-Throughput-cacheWarm-true-executorType-opensearch-mode-bufferpool-prefetchMode-inline_check
PrefetchBufferpoolVsMMapBenchmark.read_4Threads-Throughput-cacheWarm-true-executorType-opensearch-mode-bufferpool-prefetchMode-off
My interpretation so far is that, since there is a 1:1 ratio of prefetches to reads within the block, the async hand-off is not cheap in this benchmark. There are over 2MM calls to the executor vs the actual work the prefetch threads need to do. In the real world, will there be this many prefetch calls vs read calls? If there are many more reads than prefetches, then this cost should be low. Another option is still to just do the cache check in the search path, and load what's missing async.

JMH Test 4 - prefetch 16 blocks, read 16 times each block

Details (dev-dsk-asimmahm-2c-a6d21262 % ./jmh_compact.sh; Warmup: 1 iteration, 10 s each; Benchmark mode: Throughput, ops/time; Fork 1 of 1; Warmup Iteration 1 logged: WARNING: Runtime environment or build system does not support multi-release JARs. This will impact location-based features.)
* (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = async): passes=56073
* (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check): passes=278737
* (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = inline_load): passes=274780
* (cacheWarm = true, executorType = opensearch, mode = bufferpool, prefetchMode = off): passes=291510
* (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = async): passes=1977521
* (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = inline_check): passes=1477068
* (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = inline_load): passes=1483658
* (cacheWarm = true, executorType = opensearch, mode = mmap, prefetchMode = off): passes=6117138
* (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = async): passes=138
* (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = inline_check): passes=136
* (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = inline_load): passes=158
* (cacheWarm = false, executorType = opensearch, mode = bufferpool, prefetchMode = off): passes=160
* (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = async): passes=10254
* (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = inline_check): passes=9420
* (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = inline_load): passes=159
* (cacheWarm = false, executorType = opensearch, mode = mmap, prefetchMode = off): passes=11123
JMH Test 5 - 16 prefetch blocks, then full read
PrefetchMode:
cacheWarm:
Result:
Cost of the prefetch async call when the cache is warm (score 28.741), compared to a stubbed/noop prefetch call (score 45.294). The LinkedTransferQueue is not highly performant, so it might be worth trying higher-throughput queues. But before I do that, I'll focus on trying to reduce the IO calls in the search path. @abiesps recently added a new L1 cache (array based), so I'll check that first in the read path.
JMH Test 6 - Added getOrLoad mode
asimmahmood1@6339ead#diff-b91741cefaa9515609d6aa33e0e3d9b7d37c4f64200d4b98ca6d0e1833cf23fd
Details
Hotpath Summary
JMH Test 7 - Compare no prefetch, with tiny block cache vs radix
It's better, but still far away from the 500 score for mmap.
JMH Test 9 - simulate slow IO, reduce readLong
JMH Test 10 - slower IO comparison
JMH 11 - Hot path cost of async
I am trying to see how we can make the hot path of prefetch faster than mmap.
JMH 12 - Hotpath: forkjoin is 3x faster than opensearch threadpool
…d style to dedup load with search
* to avoid duplication with search requests, use getOrLoad
* we lose the IO-collapsing option, but that will be redone at a lower IO layer
* ForkJoin shows 10x improvement compared to the LinkedTransferQueue used by the fixed opensearch threadpool
* 3x improvement compared to mmap prefetch, with isLoaded and without backoff
* added more metrics

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
final long endFileOffset = absoluteBaseOffset + offset + length;
final long endBlockOffset = (endFileOffset + CACHE_BLOCK_MASK) & ~CACHE_BLOCK_MASK;
final long blockCount = (endBlockOffset - startBlockOffset) >>> CACHE_BLOCK_SIZE_POWER;
Are we planning to add the L1 cache short-circuit here? I was thinking to add it now, even with BlockSlotTinyCache, so that once we replace the tiny cache with the Radix table we don't miss it.
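The rounding arithmetic in the quoted snippet can be sanity-checked with a small worked example, assuming 8KB blocks (CACHE_BLOCK_SIZE_POWER = 13, CACHE_BLOCK_MASK = 8191 — an assumption; the PR's actual constants may differ):

```java
// Worked example of the block-range math from the quoted snippet above.
public class BlockMath {
    static final int CACHE_BLOCK_SIZE_POWER = 13;                       // assumed 8KB blocks
    static final long CACHE_BLOCK_MASK = (1L << CACHE_BLOCK_SIZE_POWER) - 1;

    static long blockCount(long startBlockOffset, long absoluteBaseOffset, long offset, long length) {
        final long endFileOffset = absoluteBaseOffset + offset + length;
        // round the end up to the next block boundary, then count blocks by shifting
        final long endBlockOffset = (endFileOffset + CACHE_BLOCK_MASK) & ~CACHE_BLOCK_MASK;
        return (endBlockOffset - startBlockOffset) >>> CACHE_BLOCK_SIZE_POWER;
    }

    public static void main(String[] args) {
        // a 10,000-byte read from offset 0 ends at 10,000, which rounds up to 16,384 -> 2 blocks
        System.out.println(blockCount(0, 0, 0, 10_000)); // 2
        // a read ending exactly on a block boundary does not round up -> still 2 blocks
        System.out.println(blockCount(0, 0, 0, 16_384)); // 2
    }
}
```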
I am also wondering why ForkJoinPool is so much better than the opensearch threadpool. Do we know the reason behind it?
On the cold path of prefetch, how much is the delay? Is there a way I can check how the first-byte read latency varies with mmap and bufferpool as we increase this delay, starting from zero? The ratio of first-byte read latency to actual IO latency (with delay) could be a good metric and benchmark for measuring cold-path prefetch performance. I want to be sure that we don't have any application-level bottlenecks that are adding to this delay unnecessarily. We could also run this benchmark on a host with a 'slower' file system to see how this 'delay' varies with IO latencies from the file system.
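A rough sketch of the measurement being asked for: time-to-first-byte on a cold read with an injected IO delay, so the ratio first-byte-latency / delay can be tracked as the delay grows. The loader below is a stand-in for the real bufferpool/mmap read paths, not the PR's code:

```java
import java.util.function.LongSupplier;

// Sketch: measure first-byte latency around a loader that itself includes an
// injected IO delay; ratio ~1.0 means no application-level overhead on top of the IO.
public class FirstByteLatency {

    static long measure(long ioDelayNanos, LongSupplier loader) {
        long start = System.nanoTime();
        loader.getAsLong();              // simulated cold read, including the injected delay
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long delayNanos = 2_000_000;     // 2 ms injected "disk" latency
        long observed = measure(delayNanos, () -> {
            long end = System.nanoTime() + delayNanos;
            while (System.nanoTime() < end) { /* busy-wait: simulated IO */ }
            return 1;
        });
        System.out.printf("ratio = %.2f%n", observed / (double) delayNanos);
    }
}
```

Sweeping `delayNanos` from zero upward and plotting the ratio for both mmap and bufferpool would show whether the gap is IO or application overhead.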
The main reason is that ForkJoin doesn't have a shared queue; each worker has its own queue. On submission it randomly distributes tasks to a worker; if a worker runs out of tasks, it steals from others. The downside is that ForkJoin doesn't have strict FIFO. Also, its queues are unbounded, so I added the inflightCount check before submitting, so that in case of slow IO we start dropping prefetch requests. inflightMap.size() isn't a constant-time check, so I use a dedicated atomic int to count. If we do want more FIFO, there are other queue options that provide higher throughput (e.g. the JCTools queues used by Netty).
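The in-flight cap described above can be sketched like this: an `AtomicInteger` (O(1), unlike `ConcurrentHashMap.size()`) gates submissions to an unbounded ForkJoinPool, so prefetch requests are dropped instead of piling up when IO is slow. Names and the cap value are illustrative, not the PR's code:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: bound in-flight prefetch loads with a dedicated counter, since the
// ForkJoinPool's own queues are unbounded.
public class BoundedPrefetch {

    private final ForkJoinPool pool = ForkJoinPool.commonPool();
    private final AtomicInteger inFlight = new AtomicInteger();
    private final int maxInFlight;

    public BoundedPrefetch(int maxInFlight) {
        this.maxInFlight = maxInFlight;
    }

    // returns false if the request was dropped because too many loads are in flight
    public boolean submit(Runnable load) {
        if (inFlight.incrementAndGet() > maxInFlight) {
            inFlight.decrementAndGet();   // over the cap: undo and drop
            return false;
        }
        pool.execute(() -> {
            try {
                load.run();
            } finally {
                inFlight.decrementAndGet();
            }
        });
        return true;
    }

    public static void main(String[] args) {
        BoundedPrefetch p = new BoundedPrefetch(2);
        Runnable slow = () -> { try { Thread.sleep(200); } catch (InterruptedException ignored) {} };
        System.out.println(p.submit(slow)); // true
        System.out.println(p.submit(slow)); // true
        System.out.println(p.submit(slow)); // false — cap reached, request dropped
        p.pool.awaitQuiescence(1, TimeUnit.SECONDS);
    }
}
```

Because the counter is incremented synchronously in `submit`, the cap holds even before the pool has started running the first task.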
JMH Cold
Final comparison (both prewarmed, truly cold):
Key findings:
Setup
Benchmark Setup (per trial):
Per invocation (cold setup):
Call Path
Mmap call path:
* added L1BlockCache interface with a contains method
* the L1 Radix table can be swapped in later
* contains is a cheap array lookup, no pinning

Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
Signed-off-by: Asim Mahmood <asim.seng@gmail.com>
So the time to submit IO is more or less the same with bufferpool and mmap, ~40 microseconds. The issues are in the 'IO path', which should be covered by the cold-path optimizations for bufferpool.
Description
1. Async Prefetch Architecture
2. Prefetch Cache for Deduplication
3. Smart Block Loading
4. Prefetch Threadpool Configuration
5. API Changes
6. Test Updates
Key Architectural Improvements:
Related Issues
Resolves #119
Testing
jmh
Test setup in https://github.com/asimmahmood1/opensearch-storage-encryption/tree/jmhPrefetch
Sequential Prefetch: noKms vs jmhPrefetch (async)
OSB
Ran big5.
Full Run
Details
Check List
--signoff. By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.