Skip to content

Cache Tags objects to eliminate per-call allocations on metric hot paths#74

Open
sakrah wants to merge 5 commits intoopensearch-project:mainfrom
sakrah:sakrah/tags-caching
Open

Cache Tags objects to eliminate per-call allocations on metric hot paths#74
sakrah wants to merge 5 commits intoopensearch-project:mainfrom
sakrah:sakrah/tags-caching

Conversation

@sakrah
Copy link
Copy Markdown
Contributor

@sakrah sakrah commented Feb 23, 2026

Description

Profiling under production-like bulk indexing load revealed that a bulk of CPU was spent in the metrics hot path, primarily due to repeated Tags.create().addTag(...) allocations on every request. This PR caches Tags instances so that repeated metric calls with the same tag values reuse the same object, achieving zero-allocation metric recording on cache hits.

Changes

  • SourceBuilderVisitor: Replace inline Tags.create().addTag("mode", ...) with static final Tags fields for the two pushdown mode values
  • PipelineStageExecutor: Cache Tags per (stageName, stageType, location) in a static ConcurrentHashMap so pipeline stage metrics reuse the same instance across executions
  • RestPromQLAction / RestM3QLAction: Cache Tags by request tag map content (HashMap.equals/hashCode) in a static ConcurrentHashMap, avoiding Tags.create() on every request for repeated tag combinations
  • RestPromQLActionTests: Add test for getOrCreateTags caching behavior

Performance Impact

Flame graph analysis (30K-sample profiles under bulk indexing):

Baseline With Caching
Hot-path metric CPU 12.48% 1.85%
Total metric CPU 15.82% 5.19%

85% reduction in hot-path metric CPU overhead.

Test Plan

  • Existing unit tests pass
  • Added testGetOrCreateTagsCachesAndReturns for RestPromQLAction tag caching
  • Verified via flame graph profiling under production-like load

Issues Resolved

List any issues this PR will resolve, e.g. Closes [...].

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@codecov
Copy link
Copy Markdown

codecov bot commented Feb 23, 2026

Codecov Report

❌ Patch coverage is 95.65217% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 89.01%. Comparing base (14685f3) to head (59c2ff8).
⚠️ Report is 21 commits behind head on main.

Files with missing lines Patch % Lines
...g/opensearch/tsdb/query/rest/RestPromQLAction.java 83.33% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main      #74      +/-   ##
============================================
+ Coverage     88.98%   89.01%   +0.03%     
- Complexity     4574     4581       +7     
============================================
  Files           302      302              
  Lines         13903    13912       +9     
  Branches       2093     2093              
============================================
+ Hits          12371    12384      +13     
+ Misses          928      924       -4     
  Partials        604      604              
Flag Coverage Δ
unittests 89.01% <95.65%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...nsearch/tsdb/lang/m3/dsl/SourceBuilderVisitor.java 93.15% <100.00%> (+0.03%) ⬆️
...org/opensearch/tsdb/query/rest/RestM3QLAction.java 92.69% <100.00%> (+0.08%) ⬆️
...search/tsdb/query/stage/PipelineStageExecutor.java 100.00% <100.00%> (ø)
...g/opensearch/tsdb/query/rest/RestPromQLAction.java 20.12% <83.33%> (+2.91%) ⬆️

... and 3 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sakrah sakrah force-pushed the sakrah/tags-caching branch 2 times, most recently from 440f25f to 9959b2a Compare February 23, 2026 18:17
@sakrah sakrah changed the title Sakrah/tags caching Cache Tags objects to eliminate per-call allocations on metric hot paths Feb 23, 2026
akrahdan and others added 3 commits February 23, 2026 10:27
SourceBuilderVisitor: replace Tags.create().addTag() calls with static
final Tags fields for the two pushdown mode values.

PipelineStageExecutor: cache Tags per (stageName, stageType, location)
in a static ConcurrentHashMap so pipeline stage metrics reuse the same
Tags instance across executions.

RestPromQLAction / RestM3QLAction: cache Tags by request tag map content
(HashMap.equals/hashCode) in a static ConcurrentHashMap, avoiding
Tags.create() on every request for repeated tag combinations.

These changes ensure the same Tags instance flows through to
M3MetricsRegistry's tag cache, which in turn feeds the same Map object
to M3MetricsCollector's identity-based metric cache — achieving
zero-allocation metric recording on cache hits.

Signed-off-by: sakrah <sakrah@uber.com>
Signed-off-by: Sam Akrah <sakrah@uber.com>
Signed-off-by: Sam Akrah <sakrah@uber.com>
Signed-off-by: Sam Akrah <sakrah@uber.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Sam Akrah <sakrah@uber.com>
@sakrah sakrah force-pushed the sakrah/tags-caching branch from 9959b2a to 5d1ed0a Compare February 23, 2026 18:27
}

static Tags getOrCreateTags(Map<String, String> tags) {
return requestTagsCache.computeIfAbsent(tags, m -> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mutable HashMap stored as ConcurrentHashMap key. If the map is ever mutated after this call, the cached entry becomes unreachable (hashCode drift). Consider Map.copyOf(tags) as the key.

Same applies to RestM3QLAction.incrementMetrics.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tags as a local variable in prepareRequest will return after prepareRequest returns. There is no reference to it anywhere else. And your recommended fix — Map.copyOf(tags) — will create a new immutable map on every single call, including cache hits. That's an allocation on the hot path, which is exactly what this PR is trying to eliminate.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can wrap the tags in Collections.unmodifiableMap, which should be cheap, it's just an interface wrapper that doesn't allow .put etc calls. it doesn't protect against the data changing from underneath it if someone holds a reference to the underlying map, but like you said, in this case, it's local variable, so it should be fine.
and if for whatever reason someone tries to get the key from the map and mutate it later, it won't allow the mutation.

but same comment as the other PR, think this will slowly leak memory over time.

since this pattern is recurring in so many places, we should consider building the cache utility directly into the TSDBMetrics class so it's managed in a central place. not necessarily to automatically cache everything, but to initalize the cache and provide methods to easily access it and manage the life cycle etc.

private void incrementMetrics(Map<String, String> tags) {
Tags finalTags = Tags.create();
tags.forEach(finalTags::addTag);
Tags finalTags = requestTagsCache.computeIfAbsent(tags, m -> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplicates the caching pattern from RestPromQLAction.getOrCreateTags. Consider extracting a shared utility (e.g. on TSDBMetrics) so the logic lives in one place.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both classes have their own cache on purpose — PromQL and M3QL are separate actions with separate tag sets. Merging them into one shared cache mixes unrelated data. And if we just share the conversion logic but keep separate caches, we're extracting a 3-line lambda. Not worth it for two call sites.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. I didn't mean on every call. Map.copyOf only needs to be inside the computeIfAbsent lambda (the cache-miss path). But I guess it's fine

tags.put("query_type", "instant");
tags.put("status", "200");

Tags result1 = RestPromQLAction.getOrCreateTags(tags);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This passes the same HashMap instance twice. Production creates a new HashMap per request. Use two separate maps with equal content to actually exercise the content-based cache lookup:

Map<String, String> tags2 = new HashMap<>(tags);
Tags result2 = RestPromQLAction.getOrCreateTags(tags2);
assertSame(result1, result2);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix, thanks

…nces

Signed-off-by: Sam Akrah <sakrah@uber.com>
Comment on lines +111 to +118
* freshly-created {@link Tags} object. The downstream M3 bridge caches
* the {@link Tags}&nbsp;&rarr;&nbsp;{@code Map<String,String>} conversion
* and the resolved Tally metric handle using <b>object identity</b>
* (OpenSearch's {@link Tags} does not override {@code equals}/{@code hashCode}).
* Passing the same instance on every call keeps the hot path allocation-free;
* a new instance with identical content will still work correctly but will
* bypass both caches, re-allocating a {@code HashMap} and re-resolving the
* Tally scope on every call.</p>
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

downstream M3 bridge

this implementation is telemetry implementation agnostic, so this should be reworded and all "m3" and "tally" references should be removed. the optimizations should be applicable to all implementations

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed here. Maybe we don't even need this, Npw that we have equal/Hashcode added to Tags class.

Signed-off-by: Sam Akrah <sakrah@uber.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants