Cache Tags objects to eliminate per-call allocations on metric hot paths #74
sakrah wants to merge 5 commits into opensearch-project:main
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #74 +/- ##
============================================
+ Coverage 88.98% 89.01% +0.03%
- Complexity 4574 4581 +7
============================================
Files 302 302
Lines 13903 13912 +9
Branches 2093 2093
============================================
+ Hits 12371 12384 +13
+ Misses 928 924 -4
Partials 604 604
- SourceBuilderVisitor: replace `Tags.create().addTag()` calls with `static final Tags` fields for the two pushdown mode values.
- PipelineStageExecutor: cache `Tags` per `(stageName, stageType, location)` in a static `ConcurrentHashMap` so pipeline stage metrics reuse the same `Tags` instance across executions.
- RestPromQLAction / RestM3QLAction: cache `Tags` by request tag map content (`HashMap.equals`/`hashCode`) in a static `ConcurrentHashMap`, avoiding `Tags.create()` on every request for repeated tag combinations.

These changes ensure the same `Tags` instance flows through to M3MetricsRegistry's tag cache, which in turn feeds the same `Map` object to M3MetricsCollector's identity-based metric cache, achieving zero-allocation metric recording on cache hits.

Signed-off-by: Sam Akrah <sakrah@uber.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
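The content-keyed cache described above can be sketched in a few lines. This is a minimal runnable illustration of the pattern, not the PR's code: the `Tags` class here is a hypothetical stand-in (the real OpenSearch `Tags` API differs), and `getOrCreateTags` mirrors the shape described for `RestPromQLAction`.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TagsCacheSketch {
    // Stand-in for OpenSearch's Tags: an append-only tag container that does
    // not override equals/hashCode, so downstream caches can key on identity.
    static final class Tags {
        private final Map<String, String> values = new HashMap<>();
        Tags addTag(String k, String v) { values.put(k, v); return this; }
    }

    // Keyed by tag-map content (HashMap.equals/hashCode), so equal maps
    // built by different requests resolve to the same cached Tags instance.
    private static final ConcurrentHashMap<Map<String, String>, Tags> requestTagsCache =
            new ConcurrentHashMap<>();

    static Tags getOrCreateTags(Map<String, String> tags) {
        return requestTagsCache.computeIfAbsent(tags, m -> {
            Tags t = new Tags();       // allocated only on a cache miss
            m.forEach(t::addTag);
            return t;
        });
    }

    public static void main(String[] args) {
        Map<String, String> req1 = new HashMap<>();
        req1.put("query_type", "instant");
        Map<String, String> req2 = new HashMap<>(req1); // distinct map, equal content
        // Cache hit: the same Tags instance, no allocation on the hot path.
        System.out.println(getOrCreateTags(req1) == getOrCreateTags(req2));
    }
}
```

Because the second lookup hits the cache, the identity comparison is true, which is exactly the property the downstream identity-based metric cache relies on.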
```java
    }

    static Tags getOrCreateTags(Map<String, String> tags) {
        return requestTagsCache.computeIfAbsent(tags, m -> {
```
A mutable HashMap is stored as the ConcurrentHashMap key. If the map is ever mutated after this call, the cached entry becomes unreachable (its hashCode drifts). Consider Map.copyOf(tags) as the key.
The same applies to RestM3QLAction.incrementMetrics.
tags is a local variable in prepareRequest and becomes unreachable once prepareRequest returns; nothing else holds a reference to it. And your recommended fix, Map.copyOf(tags), would create a new immutable map on every single call, including cache hits. That's an allocation on the hot path, which is exactly what this PR is trying to eliminate.
You can wrap the tags in Collections.unmodifiableMap, which should be cheap: it's just an interface wrapper that rejects .put and similar calls. It doesn't protect against the data changing underneath it if someone holds a reference to the underlying map, but as you said, here it's a local variable, so that should be fine.
And if for whatever reason someone retrieves the key from the map and tries to mutate it later, the wrapper won't allow the mutation.
Same comment as on the other PR, though: I think this will slowly leak memory over time.
Since this pattern recurs in so many places, we should consider building the cache utility directly into the TSDBMetrics class so it's managed in a central place. Not necessarily to cache everything automatically, but to initialize the cache and provide methods to access it and manage its lifecycle.
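A hedged sketch of what that centralized utility could look like. Everything here is illustrative, not the project's API: the `TSDBMetrics` host class, the `MAX_ENTRIES` bound (a crude guard against the leak mentioned above), and the stand-in `Tags` class are all assumptions.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CentralTagsCacheSketch {
    // Stand-in for the real Tags class.
    static final class Tags {
        private final Map<String, String> values = new HashMap<>();
        Tags addTag(String k, String v) { values.put(k, v); return this; }
    }

    // Hypothetical central owner of the cache, in the spirit of the
    // "build it into TSDBMetrics" suggestion above.
    static final class TSDBMetrics {
        private static final int MAX_ENTRIES = 10_000; // crude leak guard
        private static final ConcurrentHashMap<Map<String, String>, Tags> cache =
                new ConcurrentHashMap<>();

        static Tags cachedTags(Map<String, String> raw) {
            if (cache.size() >= MAX_ENTRIES) {
                // Unbounded growth is the leak risk; a real implementation
                // would prefer LRU eviction or weak references over clear().
                cache.clear();
            }
            return cache.computeIfAbsent(
                    Collections.unmodifiableMap(raw), // key rejects later mutation
                    m -> {
                        Tags t = new Tags();
                        m.forEach(t::addTag);
                        return t;
                    });
        }
    }

    public static void main(String[] args) {
        Map<String, String> tags = new HashMap<>();
        tags.put("stage", "parse");
        // Repeated lookups with equal content return the same cached instance.
        System.out.println(TSDBMetrics.cachedTags(tags) == TSDBMetrics.cachedTags(tags));
    }
}
```

One central class owning the map makes it possible to bound its size and instrument hit rates in one place, instead of auditing each action class separately.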
```diff
 private void incrementMetrics(Map<String, String> tags) {
-    Tags finalTags = Tags.create();
-    tags.forEach(finalTags::addTag);
+    Tags finalTags = requestTagsCache.computeIfAbsent(tags, m -> {
```
This duplicates the caching pattern from RestPromQLAction.getOrCreateTags. Consider extracting a shared utility (e.g. on TSDBMetrics) so the logic lives in one place.
Both classes have their own cache on purpose — PromQL and M3QL are separate actions with separate tag sets. Merging them into one shared cache mixes unrelated data. And if we just share the conversion logic but keep separate caches, we're extracting a 3-line lambda. Not worth it for two call sites.
No, I didn't mean on every call. Map.copyOf only needs to happen on the cache-miss path, not on cache hits. But I guess it's fine.
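A sketch of that miss-path-only copy. One subtlety worth noting: computeIfAbsent stores whatever key object the caller passes, so keeping the caller's mutable map out of the cache requires a get() before copying. The `Tags` class here is a hypothetical stand-in for the real one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class MissPathCopySketch {
    static final class Tags { // stand-in for the real Tags class
        private final Map<String, String> values = new HashMap<>();
        Tags addTag(String k, String v) { values.put(k, v); return this; }
    }

    private static final ConcurrentHashMap<Map<String, String>, Tags> cache =
            new ConcurrentHashMap<>();

    static Tags getOrCreateTags(Map<String, String> tags) {
        Tags hit = cache.get(tags);                   // hot path: no allocation
        if (hit != null) return hit;
        Map<String, String> key = Map.copyOf(tags);   // immutable key, miss path only
        return cache.computeIfAbsent(key, m -> {
            Tags t = new Tags();
            m.forEach(t::addTag);
            return t;
        });
    }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("status", "200");
        Tags first = getOrCreateTags(a);
        a.put("mutated", "later");  // caller mutates its map afterwards...
        a.remove("mutated");        // ...the cached immutable key is unaffected
        // An equal-content map from a later request still hits the cache.
        System.out.println(first == getOrCreateTags(new HashMap<>(Map.of("status", "200"))));
    }
}
```

This keeps the hit path a single ConcurrentHashMap.get while still guaranteeing the stored key can never drift.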
```java
tags.put("query_type", "instant");
tags.put("status", "200");

Tags result1 = RestPromQLAction.getOrCreateTags(tags);
```
This passes the same HashMap instance twice. Production creates a new HashMap per request. Use two separate maps with equal content to actually exercise the content-based cache lookup:

```java
Map<String, String> tags2 = new HashMap<>(tags);
Tags result2 = RestPromQLAction.getOrCreateTags(tags2);
assertSame(result1, result2);
```
```java
 * freshly-created {@link Tags} object. The downstream M3 bridge caches
 * the {@link Tags} → {@code Map<String,String>} conversion
 * and the resolved Tally metric handle using <b>object identity</b>
 * (OpenSearch's {@link Tags} does not override {@code equals}/{@code hashCode}).
 * Passing the same instance on every call keeps the hot path allocation-free;
 * a new instance with identical content will still work correctly but will
 * bypass both caches, re-allocating a {@code HashMap} and re-resolving the
 * Tally scope on every call.</p>
```
downstream M3 bridge
This implementation is telemetry-implementation agnostic, so this doc should be reworded and all "M3" and "Tally" references removed. The optimizations should be applicable to all implementations.
Agreed here. Maybe we don't even need this, now that equals/hashCode have been added to the Tags class.
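To make concrete what value-based equals/hashCode on Tags buys: two independently built instances with the same content compare equal, so a content-keyed cache works even when callers don't reuse the exact instance. The `Tags` class below is a stand-in sketch, not the real OpenSearch implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class TagsEqualitySketch {
    static final class Tags { // stand-in; real Tags API differs
        private final Map<String, String> values = new HashMap<>();
        Tags addTag(String k, String v) { values.put(k, v); return this; }

        // Value-based equality: content decides, not object identity.
        @Override public boolean equals(Object o) {
            return o instanceof Tags t && values.equals(t.values);
        }
        @Override public int hashCode() { return Objects.hash(values); }
    }

    public static void main(String[] args) {
        Tags a = new Tags().addTag("mode", "pushdown");
        Tags b = new Tags().addTag("mode", "pushdown");
        // Distinct instances, equal content: usable as hash-map keys.
        System.out.println(a != b && a.equals(b));
    }
}
```

With this in place, downstream caches no longer need the "same instance on every call" contract the javadoc describes; equal-content instances hit the same cache entry.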
Description
Profiling under production-like bulk indexing load revealed that a large share of CPU was spent in the metrics hot path, primarily due to repeated `Tags.create().addTag(...)` allocations on every request. This PR caches `Tags` instances so that repeated metric calls with the same tag values reuse the same object, achieving zero-allocation metric recording on cache hits.

Changes

- SourceBuilderVisitor: replace inline `Tags.create().addTag("mode", ...)` with `static final Tags` fields for the two pushdown mode values
- PipelineStageExecutor: cache `Tags` per `(stageName, stageType, location)` in a static `ConcurrentHashMap` so pipeline stage metrics reuse the same instance across executions
- RestPromQLAction / RestM3QLAction: cache `Tags` by request tag map content (`HashMap.equals`/`hashCode`) in a static `ConcurrentHashMap`, avoiding `Tags.create()` on every request for repeated tag combinations
- RestPromQLActionTests: add a test for `getOrCreateTags` caching behavior

Performance Impact
Flame graph analysis (30K-sample profiles under bulk indexing):
85% reduction in hot-path metric CPU overhead.
Test Plan
- `testGetOrCreateTagsCachesAndReturns` for `RestPromQLAction` tag caching

Issues Resolved
List any issues this PR will resolve, e.g. Closes [...].
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.