
Add node-level JVM and CPU runtime metrics#20844

Merged
msfroh merged 8 commits into opensearch-project:main from sakrah:sakrah/node-runtime-jvm-metrics-upstream
Mar 12, 2026

Conversation

@sakrah
Contributor

@sakrah sakrah commented Mar 11, 2026

Description

Motivation

OpenSearch's telemetry framework (MetricsRegistry) currently exposes no JVM runtime metrics. Operators monitoring production clusters rely on external mechanisms -- the _nodes/stats REST API (polling-based, adds HTTP overhead), standalone JMX exporters (separate process, separate config), or the OTel Java agent (bytecode instrumentation, heavyweight dependency). None of these integrate with the telemetry backend that OpenSearch is already configured to use.

This means if a cluster has the OTel plugin (or any future telemetry plugin) enabled, JVM metrics like heap pressure, GC pause time, and thread exhaustion still require a separate collection pipeline. There's no reason for that gap -- the data is already available via JvmService.stats(), which caches JvmStats snapshots with a 1-second TTL.

What this PR does

Adds NodeRuntimeMetrics, a Closeable component that registers ~30 pull-based gauges through MetricsRegistry, covering JVM memory, GC, buffer pools, threads, class loading, and CPU. Since it uses the framework-agnostic MetricsRegistry API, these metrics are automatically available through whatever telemetry backend the cluster is configured with -- no additional infrastructure required.

Metric names follow OpenTelemetry JVM semantic conventions (jvm.memory.used, jvm.gc.duration, jvm.thread.count, etc.) so they align with the broader ecosystem and are immediately recognizable to operators already using OTel-based tooling.
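The pull-based registration shape can be sketched with only the JDK. `MiniRegistry` below is an illustrative stand-in for the real `MetricsRegistry` (whose `createGauge` also takes a description, unit, and `Tags`, as the code excerpts later in this thread show); every name in this snippet is hypothetical, not OpenSearch code.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class GaugeDemo {
    // Stand-in for MetricsRegistry: a pull-based gauge is just a named Supplier
    // that the telemetry backend polls on each collection sweep.
    interface MiniRegistry {
        Closeable createGauge(String name, String description, String unit, Supplier<Double> value);
    }

    public static void main(String[] args) throws Exception {
        List<String> polled = new ArrayList<>();
        MiniRegistry registry = (name, desc, unit, value) -> {
            polled.add(name + "=" + value.get()); // a real backend polls periodically
            return () -> { };                     // close() would deregister the gauge
        };
        Runtime rt = Runtime.getRuntime();
        // OTel-conventional name, pull-based supplier, no separate push pipeline.
        Closeable handle = registry.createGauge("jvm.memory.used", "JVM heap in use", "By",
            () -> (double) (rt.totalMemory() - rt.freeMemory()));
        System.out.println(polled);
        handle.close();
    }
}
```

Because the supplier is only evaluated when the backend collects, registering ~30 such gauges costs nothing between sweeps.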

Metrics registered

| Category | Metrics | Tags |
| --- | --- | --- |
| Memory | jvm.memory.used, jvm.memory.committed, jvm.memory.limit | type (heap/non_heap), pool (per memory pool) |
| GC | jvm.gc.duration, jvm.gc.count | gc (per collector) |
| Buffer pools | jvm.buffer.memory.used, jvm.buffer.memory.limit, jvm.buffer.count | pool (direct/mapped) |
| Threads | jvm.thread.count (total + per-state breakdown) | state (runnable, waiting, etc.) |
| Classes | jvm.class.count, jvm.class.loaded | -- |
| CPU | jvm.cpu.time, jvm.cpu.recent_utilization, process.uptime | -- |

Design

  • All gauge suppliers read through JvmService.stats(), which caches JvmStats with a 1-second TTL -- a single collection sweep reuses one snapshot across all gauges with no redundant MXBean calls.
  • Memory pools, GC collectors, and buffer pools are discovered dynamically from the initial JvmStats snapshot and tagged by name, so gauges work across G1, Parallel, CMS, ZGC, and other collectors without configuration.
  • Per-state thread counts use a separate synchronized cache (1s TTL) to avoid redundant getThreadInfo() calls across the 6 per-state gauges.
  • NodeRuntimeMetrics implements Closeable with an idempotent AtomicBoolean guard and is added to Node's resourcesToClose.
  • ThreadMXBean is injectable via a package-private constructor for unit testing.
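The per-state thread-count bullet can be illustrated with a standalone JDK-only sketch; the class and method names here are illustrative, not the PR's actual code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative sketch of a 1s-TTL per-state thread-count cache: one
// getThreadInfo() sweep serves all six per-state gauges within the TTL window.
public class ThreadStateCache {
    private static final long CACHE_TTL_MS = 1000;
    private final ThreadMXBean threadMXBean;
    private long[] counts = new long[Thread.State.values().length];
    private long timestamp = 0; // 0 forces a refresh on the first call

    public ThreadStateCache(ThreadMXBean bean) {
        this.threadMXBean = bean; // injectable bean, as in the PR, for unit testing
    }

    public synchronized long count(Thread.State state) {
        long now = System.currentTimeMillis();
        if (now - timestamp > CACHE_TTL_MS) {
            long[] fresh = new long[Thread.State.values().length];
            for (ThreadInfo info : threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds())) {
                if (info != null) { // threads may die between the two MXBean calls
                    fresh[info.getThreadState().ordinal()]++;
                }
            }
            counts = fresh;
            timestamp = now;
        }
        return counts[state.ordinal()];
    }

    public static void main(String[] args) {
        ThreadStateCache cache = new ThreadStateCache(ManagementFactory.getThreadMXBean());
        // The sampling thread itself is RUNNABLE, so at least one is expected.
        System.out.println("runnable=" + cache.count(Thread.State.RUNNABLE));
    }
}
```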

Related Issues

N/A (new functionality)

Check List

  • New functionality includes testing
  • New functionality has been documented (Javadoc)
  • API changes companion pull request - N/A (no public API change)

Testing

Unit tests in NodeRuntimeMetricsTests verify:

  • All expected gauges are registered with correct names, units, and tags
  • Gauge values reflect JvmStats snapshot data
  • Per-state thread counts are correctly computed from ThreadInfo[]
  • close() releases all gauge handles and is idempotent

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 11, 2026

PR Reviewer Guide 🔍

(Review updated until commit f23fd7b)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
📝 TODO sections

🔀 No multiple PR themes
⚡ Recommended focus areas for review

Linear Search Performance

The helper methods getPoolByName, getCollectorByName, and getBufferPoolByName each perform a linear scan through all pools/collectors on every gauge read. Since jvmService.stats() (which already has a 1-second TTL cache) is called inside each of these methods, and the methods run once per gauge during a collection sweep, consider caching the lookup results in a map keyed by name to avoid an O(n) scan per gauge invocation.

private JvmStats.MemoryPool getPoolByName(String poolName) {
    for (JvmStats.MemoryPool pool : jvmService.stats().getMem()) {
        if (pool.getName().equals(poolName)) {
            return pool;
        }
    }
    return null;
}

// ---- GC (cumulative gauges) ----
// TODO: Add event-driven GC pause duration histograms via GarbageCollectorMXBean
// notification listeners. Cumulative gauges give rate-of-GC-time; histograms would
// enable percentile analysis of individual pause durations (p50, p99).

private void registerGcGauges(MetricsRegistry registry) {
    JvmStats stats = jvmService.stats();
    for (JvmStats.GarbageCollector gc : stats.getGc()) {
        String collectorName = gc.getName();
        Tags gcTags = Tags.of(TAG_GC, collectorName);
        gaugeHandles.add(registry.createGauge(JVM_GC_DURATION, "GC cumulative collection time", UNIT_SECONDS, () -> {
            JvmStats.GarbageCollector c = getCollectorByName(collectorName);
            return c == null ? 0.0 : c.getCollectionTime().getMillis() / 1000.0;
        }, gcTags));
        gaugeHandles.add(registry.createGauge(JVM_GC_COUNT, "GC collection count", UNIT_1, () -> {
            JvmStats.GarbageCollector c = getCollectorByName(collectorName);
            return c == null ? 0.0 : (double) c.getCollectionCount();
        }, gcTags));
    }
}

private JvmStats.GarbageCollector getCollectorByName(String collectorName) {
    for (JvmStats.GarbageCollector gc : jvmService.stats().getGc()) {
        if (gc.getName().equals(collectorName)) {
            return gc;
        }
    }
    return null;
}

// ---- Buffer pools ----

private void registerBufferPoolGauges(MetricsRegistry registry) {
    JvmStats stats = jvmService.stats();
    for (JvmStats.BufferPool bp : stats.getBufferPools()) {
        String bpName = bp.getName();
        Tags bpTags = Tags.of(TAG_POOL, bpName);
        gaugeHandles.add(registry.createGauge(JVM_BUFFER_MEMORY_USED, "Buffer pool memory used", UNIT_BYTES, () -> {
            JvmStats.BufferPool b = getBufferPoolByName(bpName);
            return b == null ? 0.0 : (double) b.getUsed().getBytes();
        }, bpTags));
        gaugeHandles.add(registry.createGauge(JVM_BUFFER_MEMORY_LIMIT, "Buffer pool total capacity", UNIT_BYTES, () -> {
            JvmStats.BufferPool b = getBufferPoolByName(bpName);
            return b == null ? 0.0 : (double) b.getTotalCapacity().getBytes();
        }, bpTags));
        gaugeHandles.add(registry.createGauge(JVM_BUFFER_COUNT, "Buffer pool buffer count", UNIT_1, () -> {
            JvmStats.BufferPool b = getBufferPoolByName(bpName);
            return b == null ? 0.0 : (double) b.getCount();
        }, bpTags));
    }
}

private JvmStats.BufferPool getBufferPoolByName(String bpName) {
    for (JvmStats.BufferPool bp : jvmService.stats().getBufferPools()) {
        if (bp.getName().equals(bpName)) {
            return bp;
        }
    }
    return null;
}
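The map-based caching the review suggests can be sketched with stand-in types. The nested Pool record below is illustrative (not JvmStats.MemoryPool), and the identity check exploits the fact that jvmService.stats() returns the same cached snapshot object within a TTL window.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rebuild the name->pool index only when the cached snapshot object changes,
// so each gauge read is an O(1) map lookup instead of an O(n) scan.
public class SnapshotIndex {
    record Pool(String name, long usedBytes) {} // stand-in for JvmStats.MemoryPool

    private List<Pool> lastSnapshot;            // identity of the last indexed snapshot
    private Map<String, Pool> byName = new HashMap<>();

    synchronized Pool poolByName(List<Pool> snapshot, String name) {
        if (snapshot != lastSnapshot) {         // reference check: TTL cache reuses one object
            Map<String, Pool> fresh = new HashMap<>();
            for (Pool p : snapshot) {
                fresh.put(p.name(), p);
            }
            byName = fresh;
            lastSnapshot = snapshot;
        }
        return byName.get(name);                // null if the pool disappeared, as before
    }

    public static void main(String[] args) {
        SnapshotIndex idx = new SnapshotIndex();
        List<Pool> snap = List.of(new Pool("old_gen", 7), new Pool("eden", 42));
        System.out.println(idx.poolByName(snap, "old_gen").usedBytes());
    }
}
```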
Thread Safety

The field threadStateCounts is written in refreshThreadStateCounts() (called from synchronized getThreadStateCount) but the write threadStateCounts = counts is not synchronized itself — it replaces the array reference. While getThreadStateCount is synchronized, refreshThreadStateCounts is not, and the array reference assignment is not volatile. If the reference is read outside the synchronized block in any future code path, this could cause visibility issues. Consider making threadStateCounts volatile or keeping the refresh fully within the synchronized method.

private synchronized long getThreadStateCount(Thread.State state) {
    long now = System.currentTimeMillis();
    if (now - threadStateTimestamp > CACHE_TTL_MS) {
        refreshThreadStateCounts();
        threadStateTimestamp = now;
    }
    return threadStateCounts[state.ordinal()];
}

private void refreshThreadStateCounts() {
    long[] counts = new long[Thread.State.values().length];
    try {
        long[] threadIds = threadMXBean.getAllThreadIds();
        ThreadInfo[] infos = threadMXBean.getThreadInfo(threadIds);
        for (ThreadInfo info : infos) {
            if (info != null) {
                counts[info.getThreadState().ordinal()]++;
            }
        }
    } catch (Exception e) {
        logger.debug("Failed to collect thread state counts", e);
    }
    threadStateCounts = counts;
}
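The volatile fix the review proposes is small enough to show in isolation. This standalone sketch mirrors the field and access pattern from the excerpt above with illustrative names; it is not the PR's code.

```java
// A volatile reference makes the array-swap publication safe even if a future
// code path reads the counts outside the synchronized block.
public class ThreadStateCounts {
    private volatile long[] threadStateCounts = new long[Thread.State.values().length];

    void publish(long[] counts) {
        threadStateCounts = counts;                 // volatile write: happens-before any later read
    }

    long get(Thread.State state) {
        return threadStateCounts[state.ordinal()];  // volatile read, no lock required
    }

    public static void main(String[] args) {
        ThreadStateCounts c = new ThreadStateCounts();
        long[] counts = new long[Thread.State.values().length];
        counts[Thread.State.RUNNABLE.ordinal()] = 5;
        c.publish(counts);
        System.out.println(c.get(Thread.State.RUNNABLE)); // prints 5
    }
}
```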
Missing Non-Heap Limit

A jvm.memory.limit gauge is registered for heap (with heapTags) but no corresponding jvm.memory.limit gauge is registered for non-heap memory (with nonHeapTags). This is an asymmetry compared to the heap gauges (used, committed, limit) vs non-heap gauges (used, committed only). If intentional, it should be documented; otherwise the missing non-heap limit gauge should be added.

gaugeHandles.add(
    registry.createGauge(
        JVM_MEMORY_USED,
        "JVM non-heap memory used",
        UNIT_BYTES,
        () -> (double) jvmService.stats().getMem().getNonHeapUsed().getBytes(),
        nonHeapTags
    )
);
gaugeHandles.add(
    registry.createGauge(
        JVM_MEMORY_COMMITTED,
        "JVM non-heap memory committed",
        UNIT_BYTES,
        () -> (double) jvmService.stats().getMem().getNonHeapCommitted().getBytes(),
        nonHeapTags
    )
);
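Whether JvmStats exposes a non-heap maximum is not shown in this thread, but the standard MemoryMXBean does, so a supplier for the missing gauge could look like the sketch below (class and method names are illustrative; the negative-value handling mirrors the pool-max gauge elsewhere in the PR).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class NonHeapLimit {
    // The JVM reports -1 when the non-heap limit is undefined; map that to 0.0
    // the same way the pool-max gauge handles negative values.
    static double nonHeapLimitBytes(MemoryMXBean bean) {
        long max = bean.getNonHeapMemoryUsage().getMax();
        return max < 0 ? 0.0 : (double) max;
    }

    public static void main(String[] args) {
        System.out.println(nonHeapLimitBytes(ManagementFactory.getMemoryMXBean()));
    }
}
```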
Gauge Registration Timing

Memory pool, GC collector, and buffer pool gauges are discovered from the JVM stats snapshot taken at construction time. If the JVM later introduces new pools or collectors (e.g., with certain GC configurations or after class data sharing), those will not be picked up. The class Javadoc mentions this is dynamic discovery from the "initial snapshot," but operators should be aware that new pools appearing at runtime won't be tracked without a restart.

JvmStats stats = jvmService.stats();
for (JvmStats.MemoryPool pool : stats.getMem()) {
    String poolName = pool.getName();
    Tags poolTags = Tags.of(TAG_POOL, poolName);
    gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED, "JVM memory pool used", UNIT_BYTES, () -> {
        JvmStats.MemoryPool p = getPoolByName(poolName);
        return p == null ? 0.0 : (double) p.getUsed().getBytes();
    }, poolTags));
    gaugeHandles.add(registry.createGauge(JVM_MEMORY_LIMIT, "JVM memory pool max", UNIT_BYTES, () -> {
        JvmStats.MemoryPool p = getPoolByName(poolName);
        if (p == null) return 0.0;
        long bytes = p.getMax().getBytes();
        return bytes < 0 ? 0.0 : (double) bytes;
    }, poolTags));
    gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
        JvmStats.MemoryPool p = getPoolByName(poolName);
        return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
    }, poolTags));
}

@github-actions
Contributor

github-actions bot commented Mar 11, 2026

PR Code Suggestions ✨

Latest suggestions up to f23fd7b

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

The call to p.getLastGcStats() may return null for certain memory pools (e.g., young
generation pools that don't track post-GC usage), which would cause a
NullPointerException at runtime when the gauge supplier is invoked. Add a null check
for getLastGcStats() before calling .getUsed().

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: getLastGcStats() can legitimately return null for certain memory pools, and calling .getUsed() on a null reference would cause a NullPointerException at runtime during metric collection. This is a real potential bug that should be guarded against.

Medium
Synchronize close method to prevent concurrent modification

The closeQuietly() method iterates over gaugeHandles and then calls
gaugeHandles.clear(), but it is also called from the constructor's catch block
before all gauges are registered. This is safe, but the close() method sets closed
to true and then calls closeQuietly(), which clears the list. If close() is called
concurrently, the second call returns early due to the AtomicBoolean check, but
closeQuietly() in the constructor path bypasses this guard. More critically,
closeQuietly() is not synchronized, so concurrent calls (e.g., from close() and a
constructor failure) could cause a ConcurrentModificationException on gaugeHandles.
Consider synchronizing closeQuietly() or using a thread-safe list.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

-private void closeQuietly() {
+private synchronized void closeQuietly() {
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 4


Why: The concurrency concern is valid but the scenario is unlikely in practice since closeQuietly() is only called from the constructor (single-threaded) and from close() (protected by AtomicBoolean). The suggestion is technically sound but addresses a low-probability edge case.

Low
General
Prevent retry storm on persistent refresh failures

The threadStateTimestamp is initialized to 0, so the first call will always trigger
a refresh (since now - 0 will be much greater than CACHE_TTL_MS). However, if
refreshThreadStateCounts() throws an unexpected unchecked exception,
threadStateTimestamp will not be updated, causing every subsequent call to retry the
refresh. Consider updating the timestamp before or in a finally block to prevent a
tight retry loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5


Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() prevents a tight retry loop if the refresh consistently throws an unchecked exception, which is a valid defensive improvement. However, refreshThreadStateCounts() already catches all exceptions internally, making the unchecked exception scenario unlikely.

Low
Clamp CPU ratio to valid range

The clampCpuPercent method only guards against negative values but does not cap the
upper bound. If pct exceeds 100 (which can happen on some platforms or under certain
conditions), the returned ratio will exceed 1.0, violating the documented contract
of the metric. Add an upper bound clamp to ensure the value stays within [0.0, 1.0].

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [405-407]

 private static double clampCpuPercent(short pct) {
-    return pct < 0 ? 0.0 : pct / 100.0;
+    if (pct < 0) return 0.0;
+    if (pct > 100) return 1.0;
+    return pct / 100.0;
+}
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that pct values above 100 could produce ratios exceeding 1.0, violating the metric's documented contract. Adding an upper bound clamp ensures the value stays within [0.0, 1.0] as expected.

Low

Previous suggestions

Suggestions up to commit e1ad2a6
Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

p.getLastGcStats() may return null if no GC has occurred yet for that pool, which
would cause a NullPointerException at runtime when the gauge supplier is invoked. A
null check should be added before calling getUsed() on the result.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: If getLastGcStats() returns null (e.g., before any GC has occurred), calling getUsed() on it would throw a NullPointerException at runtime when the gauge supplier is invoked. Adding a null check is a meaningful defensive fix.

Medium
Prevent double-close after constructor failure

The closeQuietly() method iterates over gaugeHandles while it may be called from the
constructor (on failure) and also from close(). However, iterating and then clearing
is not thread-safe if close() could be called concurrently. More critically, when
called from the constructor on failure, closed flag is never set to true, so a
subsequent call to close() would attempt to close already-closed handles. The
closeQuietly() should also set the closed flag to prevent double-close.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
+    closed.set(true);
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 6


Why: When closeQuietly() is called from the constructor on failure, the closed flag remains false, so a subsequent call to close() would attempt to close already-cleared handles. Setting closed.set(true) in closeQuietly() prevents this double-close scenario.

Low
General
Update cache timestamp before refresh to avoid retry storms

On the very first call, threadStateTimestamp is 0, so now - 0 will be much larger
than CACHE_TTL_MS and refreshThreadStateCounts() will be called correctly. However,
if refreshThreadStateCounts() throws an unexpected unchecked exception,
threadStateTimestamp will not be updated, causing every subsequent call to retry the
refresh. While the method catches Exception, consider ensuring threadStateTimestamp
is always updated to avoid a tight retry loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5


Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() ensures that even if an unchecked exception escapes (despite the internal catch), the timestamp is updated and avoids a tight retry loop. This is a valid defensive improvement, though the current catch (Exception e) already handles most failure cases.

Low
Suggestions up to commit a1ad84c
Category | Suggestion | Impact
Possible issue
Prevent double-close after constructor failure

The closeQuietly() method is called both from close() (after the AtomicBoolean
guard) and from the constructor's catch block. However, if closeQuietly() is called
from the constructor on failure, and then close() is later called on the
partially-constructed object, the AtomicBoolean guard will prevent a second cleanup
since closed was never set to true in the constructor path. This means the closed
flag should also be set when cleaning up in the constructor failure path.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
+    closed.set(true);
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 7


Why: Valid concern: if the constructor fails and calls closeQuietly(), the closed flag remains false, so a subsequent close() call would invoke closeQuietly() again on an already-cleared gaugeHandles list. Setting closed.set(true) in closeQuietly() prevents this double-close scenario.

Medium
Guard against null last GC stats

The call to p.getLastGcStats() may return null for some memory pools (e.g., pools
that have never been collected), which would cause a NullPointerException at runtime
when the gauge supplier is invoked. A null check should be added before calling
getUsed() on the result.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: getLastGcStats() can return null for memory pools that have never undergone GC, which would cause a NullPointerException at runtime when the gauge supplier is invoked. Adding a null check is a valid defensive fix for a real potential runtime error.

Medium
General
Ensure thread-safe cache refresh within synchronized block

The threadStateTimestamp is initialized to 0, so now - 0 will always be greater than
CACHE_TTL_MS on the first call, which is correct. However,
refreshThreadStateCounts() is called while holding the synchronized lock, and it can
be slow (iterating all threads). More critically, threadStateCounts is a
non-volatile field read outside the synchronized block in refreshThreadStateCounts
(which assigns to it). Since refreshThreadStateCounts is called from within the
synchronized method, this is safe, but the field assignment threadStateCounts =
counts inside refreshThreadStateCounts (a non-synchronized method) could be a
visibility issue if called from other contexts. Consider making
refreshThreadStateCounts private and ensuring it's only called within the
synchronized context, or inline it.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
-        refreshThreadStateCounts();
+        long[] counts = new long[Thread.State.values().length];
+        try {
+            long[] threadIds = threadMXBean.getAllThreadIds();
+            ThreadInfo[] infos = threadMXBean.getThreadInfo(threadIds);
+            for (ThreadInfo info : infos) {
+                if (info != null) {
+                    counts[info.getThreadState().ordinal()]++;
+                }
+            }
+        } catch (Exception e) {
+            logger.debug("Failed to collect thread state counts", e);
+        }
+        threadStateCounts = counts;
         threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 4


Why: The refreshThreadStateCounts() method is already private and only called from within the synchronized method getThreadStateCount, so visibility of threadStateCounts is already guaranteed. Inlining it is a minor refactoring with limited practical impact on correctness.

Low
Suggestions up to commit 461bfe9
Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

The call to p.getLastGcStats() may return null if no GC has occurred yet for that
pool, which would cause a NullPointerException at runtime when the gauge supplier is
invoked. A null check should be added before calling .getUsed() on the result.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: If getLastGcStats() returns null before any GC has occurred for a pool, calling .getUsed() on it would cause a NullPointerException at runtime when the gauge supplier is invoked. This is a real potential bug that should be guarded against.

Medium
Prevent concurrent modification during close

The closeQuietly() method is called both from the constructor (on failure) and from
close(). When called from the constructor, it iterates and clears gaugeHandles,
which is correct. However, when called from close() after closed is set to true, a
second call to close() is guarded by the AtomicBoolean, but closeQuietly() itself is
not thread-safe since it iterates and clears the list without synchronization.
Consider using a local copy of the list before clearing to avoid potential
ConcurrentModificationException if close is called concurrently.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
-    for (Closeable handle : gaugeHandles) {
+    List<Closeable> handles = new ArrayList<>(gaugeHandles);
+    gaugeHandles.clear();
+    for (Closeable handle : handles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
-    gaugeHandles.clear();
 }
Suggestion importance[1-10]: 4


Why: While the suggestion is technically valid, the close() method is guarded by AtomicBoolean.compareAndSet, making concurrent double-close safe. The ConcurrentModificationException risk is low in practice since closeQuietly is only called from close() or the constructor. The improvement is minor but does make the code more robust.

Low
General
Clamp CPU utilization ratio upper bound

The clampCpuPercent method only guards against negative values but does not cap the
upper bound. If getProcessCpuPercent() or getSystemCpuPercent() returns a value
greater than 100 (e.g., due to a platform-specific anomaly), the returned ratio
would exceed 1.0, violating the documented contract of the metric. An upper bound
clamp should be added.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [405-407]

 private static double clampCpuPercent(short pct) {
-    return pct < 0 ? 0.0 : pct / 100.0;
+    if (pct < 0) return 0.0;
+    if (pct > 100) return 1.0;
+    return pct / 100.0;
+}
Suggestion importance[1-10]: 5


Why: Adding an upper bound clamp prevents the metric from exceeding 1.0 in edge cases where platform-specific anomalies return values above 100. This is a reasonable defensive improvement for metric correctness, though short values above 100 from these probes are unlikely in practice.

Low
Document synchronization requirement for refresh method

The threadStateTimestamp is initialized to 0, so the first call will always trigger
a refresh, which is correct. However, refreshThreadStateCounts() assigns a new array
to threadStateCounts without synchronization on the field itself. Since
getThreadStateCount is synchronized but refreshThreadStateCounts is not, and
threadStateCounts is a non-volatile field written inside a synchronized method, this
is safe. But the initial value of threadStateTimestamp = 0 means now - 0 > 1000 is
always true initially, which is the intended behavior. This is fine, but consider
initializing threadStateTimestamp to ensure clarity. More critically,
refreshThreadStateCounts should be called only within the synchronized block —
verify it is not called from outside synchronized context.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
         refreshThreadStateCounts();
         threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
 
+private void refreshThreadStateCounts() {
+    // Must be called from within synchronized(this)
+    long[] counts = new long[Thread.State.values().length];
+    try {
+        long[] threadIds = threadMXBean.getAllThreadIds();
+        ThreadInfo[] infos = threadMXBean.getThreadInfo(threadIds);
+        for (ThreadInfo info : infos) {
+            if (info != null) {
+                counts[info.getThreadState().ordinal()]++;
+            }
+        }
+    } catch (Exception e) {
+        logger.debug("Failed to collect thread state counts", e);
+    }
+    threadStateCounts = counts;
+}
+
Suggestion importance[1-10]: 1


Why: The improved_code is essentially the same as the existing_code with only a comment added, which falls into the category of adding comments. The suggestion doesn't fix any real bug and the existing_code and improved_code are functionally identical.

Low
Suggestions up to commit 484b3bd
Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

p.getLastGcStats() may return null if no GC has occurred yet for that pool, which
would cause a NullPointerException when .getUsed() is called. A null check should be
added before accessing getLastGcStats().getUsed().

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: getLastGcStats() can return null when no GC has occurred yet for a pool, making this a real NPE risk in production. The null check is a straightforward and necessary defensive fix.

Medium
Prevent double-close of gauge handles

The closeQuietly() method is called both from close() (after the AtomicBoolean
guard) and from the constructor's catch block. However, if close() is called
concurrently while the constructor's catch is running, the closed flag won't be set
yet, allowing closeQuietly() to run twice and potentially double-close handles. The
constructor cleanup path should not rely on closeQuietly() since it bypasses the
closed guard; consider using a separate internal cleanup method or setting closed to
true before calling closeQuietly() in the constructor's catch block.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
-    for (Closeable handle : gaugeHandles) {
+    List<Closeable> snapshot = new ArrayList<>(gaugeHandles);
+    gaugeHandles.clear();
+    for (Closeable handle : snapshot) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
-    gaugeHandles.clear();
 }
 
+@Override
+public void close() throws IOException {
+    if (closed.compareAndSet(false, true) == false) {
+        return;
+    }
+    closeQuietly();
+}
+
Suggestion importance[1-10]: 4

__

Why: The concern about double-close is valid but the actual risk is low since the constructor's catch block runs before the object is published to other threads. The improved code clears the list before iterating, which is a minor improvement but the core issue (concurrent access during construction) is not a realistic concern in practice.

Low
General
Update cache timestamp before refresh to avoid retry storms

The initial value of threadStateTimestamp is 0, so the first call will always
trigger a refresh, which is correct. However, if refreshThreadStateCounts() throws
an unexpected unchecked exception, threadStateTimestamp will not be updated, causing
every subsequent call to retry the refresh. More critically, if the refresh fails
silently (exception caught inside), stale zero-counts are returned without any
indication. Consider updating threadStateTimestamp before or after the refresh
regardless of outcome to avoid a tight retry loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5

__

Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() is a valid improvement to prevent tight retry loops if the refresh consistently fails, since the exception is caught inside refreshThreadStateCounts() and would otherwise cause every call to retry.

Low
Clamp CPU ratio upper bound to 1.0

The CPU percent value is a short that can theoretically exceed 100 (e.g., on systems
that report aggregate CPU across cores). The method only clamps the lower bound but
not the upper bound, potentially returning values greater than 1.0 for the
utilization ratio. An upper clamp should be added to ensure the returned value stays
within [0.0, 1.0].

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [405-407]

 private static double clampCpuPercent(short pct) {
-    return pct < 0 ? 0.0 : pct / 100.0;
+    if (pct < 0) return 0.0;
+    if (pct > 100) return 1.0;
+    return pct / 100.0;
 }
Suggestion importance[1-10]: 4

__

Why: While theoretically valid, the ProcessProbe.getProcessCpuPercent() and OsProbe.getSystemCpuPercent() methods in OpenSearch typically return values in [0, 100] or -1, making values above 100 unlikely. The upper clamp is a minor defensive improvement.

Low
Suggestions up to commit 2debaed
Category | Suggestion | Impact
Possible issue
Prevent retry storm on persistent refresh failures

The threadStateTimestamp is initialized to 0, so the very first call will always
trigger a refresh (since now - 0 > 1000). However, if refreshThreadStateCounts()
throws an unexpected unchecked exception, threadStateTimestamp will not be updated,
causing every subsequent call to retry the refresh. Consider updating
threadStateTimestamp before or inside a try-finally block to prevent a tight retry
loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [324-331]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5

__

Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() prevents a tight retry loop if an unchecked exception is thrown during refresh. However, refreshThreadStateCounts() already catches all exceptions internally, so the risk of an unchecked exception escaping is low, making this a minor defensive improvement.

Low
General
Add missing non-heap committed memory gauge

The non-heap JVM_MEMORY_USED gauge is registered but there is no corresponding
JVM_MEMORY_COMMITTED gauge for non-heap memory, even though non-heap committed
memory is a meaningful metric (it represents the actual OS memory reserved for
non-heap). This asymmetry means non-heap committed memory is silently omitted, which
could mislead users monitoring memory pressure.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [162-170]

 gaugeHandles.add(
     registry.createGauge(
         JVM_MEMORY_USED,
         "JVM non-heap memory used",
         UNIT_BYTES,
         () -> (double) jvmService.stats().getMem().getNonHeapUsed().getBytes(),
         nonHeapTags
     )
 );
+gaugeHandles.add(
+    registry.createGauge(
+        JVM_MEMORY_COMMITTED,
+        "JVM non-heap memory committed",
+        UNIT_BYTES,
+        () -> (double) jvmService.stats().getMem().getNonHeapCommitted().getBytes(),
+        nonHeapTags
+    )
+);
Suggestion importance[1-10]: 4

__

Why: The suggestion points out a real asymmetry where non-heap committed memory is not tracked while heap committed memory is. This is a valid completeness concern, but it's a feature addition rather than a bug fix, and the PR author may have intentionally omitted it.

Low
Ensure closed flag is set during cleanup

When closeQuietly() is called from the constructor's catch block (before closed is
set to true), a subsequent call to close() will pass the compareAndSet check and
call closeQuietly() again on an already-cleared list. This is harmless but
inconsistent. More critically, if close() is called concurrently during construction
failure, the closed flag won't be set, so close() will attempt to close
already-cleared handles. Consider setting closed to true inside closeQuietly() as
well.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [430-439]

 private void closeQuietly() {
+    closed.set(true);
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 4

__

Why: Setting closed to true inside closeQuietly() prevents a subsequent close() call from re-entering cleanup after a constructor failure. However, the constructor failure path clears gaugeHandles, so the double-close scenario is harmless in practice, making this a minor defensive improvement.

Low

Register pull-based gauges in NodeRuntimeMetrics covering JVM memory
(heap, non-heap, per-pool), GC collectors, buffer pools, threads
(including per-state counts), class loading, uptime, and CPU usage.

Metric names follow OpenTelemetry semantic conventions (jvm.memory.used,
jvm.gc.duration, jvm.thread.count, etc.) for consistency with the
broader observability ecosystem.

All gauge suppliers read through JvmService.stats(), which caches the
JvmStats snapshot with a 1-second TTL, so a single collection sweep
reuses one snapshot across all gauges. Thread state counts use a
separate synchronized cache to avoid redundant getThreadInfo() calls.
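
The TTL-cache idea described above can be sketched as a small generic wrapper. This is a minimal illustration only: the class name and structure here are hypothetical, not the actual code in NodeRuntimeMetrics.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Minimal sketch of a TTL-cached supplier (illustrative only: the class name
// and structure are hypothetical, not the actual NodeRuntimeMetrics code).
final class TtlCachedSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long ttlNanos;
    private T cached;
    private long lastRefreshNanos;

    TtlCachedSupplier(Supplier<T> delegate, long ttlMillis) {
        this.delegate = delegate;
        this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
    }

    @Override
    public synchronized T get() {
        long now = System.nanoTime();
        // Update the timestamp before refreshing so a throwing delegate
        // cannot cause a tight retry loop on every subsequent call.
        if (cached == null || now - lastRefreshNanos > ttlNanos) {
            lastRefreshNanos = now;
            cached = delegate.get();
        }
        return cached;
    }
}
```

Within one collection sweep, every gauge reading through such a wrapper sees the same snapshot instead of triggering a fresh probe per gauge.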

Memory pools, GC collectors, and buffer pools are discovered dynamically
from the initial JvmStats snapshot and tagged by name, so the gauges
work identically across G1, Parallel, CMS, ZGC, and other collectors.
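
The dynamic-discovery idea can be shown with the standard `java.lang.management` MXBean API. This is an independent sketch under that assumption; the PR itself reads pools from a cached JvmStats snapshot rather than calling the MXBeans directly.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the PR reads pools from a cached JvmStats snapshot,
// but the same discovery pattern can be shown with the standard MXBean API.
final class PoolDiscovery {
    static Map<String, Long> currentPoolUsage() {
        Map<String, Long> used = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            // getUsage() may return null for a pool that has become invalid.
            if (usage != null) {
                // Tag by pool name rather than hard-coding a fixed set, so the
                // same gauges work under G1, Parallel, ZGC, and others.
                used.put(pool.getName(), usage.getUsed());
            }
        }
        return used;
    }
}
```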

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@sakrah sakrah force-pushed the sakrah/node-runtime-jvm-metrics-upstream branch from 1f4e7a1 to d0a1714 on March 11, 2026 19:32
@github-actions
Contributor

Persistent review updated to latest commit d0a1714

@sakrah sakrah changed the title Sakrah/node runtime jvm metrics upstream Add node-level JVM and CPU runtime metrics Mar 11, 2026
@github-actions
Contributor

❌ Gradle check result for d0a1714: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

sakrah pushed a commit to sakrah/OpenSearch that referenced this pull request Mar 11, 2026
- Clamp CPU probe values to 0.0 when platform returns -1 (unavailable)
- Clamp memory pool max to 0.0 when pool has no defined upper bound
- Wrap gauge registration in try-catch to close already-created handles
  if construction fails partway through
- Add CHANGELOG entry for PR opensearch-project#20844
- Add tests for negative CPU guard and constructor cleanup on failure

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit b107cd1

@github-actions
Contributor

❌ Gradle check result for b107cd1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

- Clamp CPU probe values to 0.0 when platform returns -1 (unavailable)
- Clamp memory pool max to 0.0 when pool has no defined upper bound
- Wrap gauge registration in try-catch to close already-created handles
  if construction fails partway through
- Add CHANGELOG entry for PR opensearch-project#20844
- Add tests for negative CPU guard and constructor cleanup on failure

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@sakrah sakrah force-pushed the sakrah/node-runtime-jvm-metrics-upstream branch from b107cd1 to 2debaed on March 11, 2026 20:38
@github-actions
Contributor

Persistent review updated to latest commit 2debaed

@github-actions
Contributor

✅ Gradle check result for 2debaed: SUCCESS

@codecov

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 83.73494% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.28%. Comparing base (db61a88) to head (f23fd7b).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...ava/org/opensearch/monitor/NodeRuntimeMetrics.java 83.22% 12 Missing and 15 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20844      +/-   ##
============================================
- Coverage     73.30%   73.28%   -0.03%     
- Complexity    72252    72272      +20     
============================================
  Files          5795     5796       +1     
  Lines        330056   330222     +166     
  Branches      47643    47662      +19     
============================================
+ Hits         241947   241994      +47     
- Misses        68695    68816     +121     
+ Partials      19414    19412       -2     

☔ View full report in Codecov by Sentry.

…erage

- Add jvm.memory.used_after_last_gc per-pool gauge (post-GC heap usage
  shows true heap pressure vs lazy-GC-inflated current usage)
- Replace boolean-flag helpers (poolBytes, gcMetric, bufferPoolMetric)
  with domain-object-returning helpers (getPoolByName,
  getCollectorByName, getBufferPoolByName) per reviewer feedback
- Remove BufferPoolField enum — no longer needed
- Add supplier-invocation tests for memory pools, GC, buffer pools,
  and class loading to close code coverage gaps

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
- Add jvm.memory.committed gauge for non-heap memory (was missing,
  unlike heap which had all three: used, committed, limit)
- Add Objects.requireNonNull for all constructor parameters

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit 484b3bd

@github-actions
Contributor

❌ Gradle check result for 484b3bd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sakrah
Contributor Author

sakrah commented Mar 12, 2026

CI Failure — Unrelated Flaky Test

The Gradle check failed on :qa:rolling-upgrade:v2.19.5-remote#twoThirdsUpgradedTest:

process was found dead while waiting for cluster health yellow

Root cause is an AssertionError in IndexShard.updateShardState during a rolling upgrade from 2.19.5 to 3.6.0:

fatal error in thread [opensearch[v2.19.5-remote-0][clusterApplierService#updateTask][T#1]], exiting
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode
  [closed_index_replica_allocation][0]
  at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:929)

Related flaky test issues in the rolling-upgrade suite:

Retrying CI with an empty commit.

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit 461bfe9

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit a1ad84c

@github-actions
Contributor

❌ Gradle check result for a1ad84c: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sam Akrah <sakrah@uber.com>
@github-actions
Contributor

Persistent review updated to latest commit e1ad2a6

Signed-off-by: Sam Akrah <sakrah@uber.com>
@github-actions
Contributor

Persistent review updated to latest commit f23fd7b

@github-actions
Contributor

❕ Gradle check result for f23fd7b: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@msfroh msfroh merged commit 6e92a76 into opensearch-project:main Mar 12, 2026
35 checks passed