
Add node-level JVM and CPU runtime metrics#20844

Merged
msfroh merged 8 commits into opensearch-project:main from sakrah:sakrah/node-runtime-jvm-metrics-upstream
Mar 12, 2026

Conversation

@sakrah
Contributor

@sakrah sakrah commented Mar 11, 2026

Description

Motivation

OpenSearch's telemetry framework (MetricsRegistry) currently exposes no JVM runtime metrics. Operators monitoring production clusters rely on external mechanisms -- the _nodes/stats REST API (polling-based, adds HTTP overhead), standalone JMX exporters (separate process, separate config), or the OTel Java agent (bytecode instrumentation, heavyweight dependency). None of these integrate with the telemetry backend that OpenSearch is already configured to use.

This means if a cluster has the OTel plugin (or any future telemetry plugin) enabled, JVM metrics like heap pressure, GC pause time, and thread exhaustion still require a separate collection pipeline. There's no reason for that gap -- the data is already available via JvmService.stats(), which caches JvmStats snapshots with a 1-second TTL.

What this PR does

Adds NodeRuntimeMetrics, a Closeable component that registers ~30 pull-based gauges through MetricsRegistry, covering JVM memory, GC, buffer pools, threads, class loading, and CPU. Since it uses the framework-agnostic MetricsRegistry API, these metrics are automatically available through whatever telemetry backend the cluster is configured with -- no additional infrastructure required.

Metric names follow OpenTelemetry JVM semantic conventions (jvm.memory.used, jvm.gc.duration, jvm.thread.count, etc.) so they align with the broader ecosystem and are immediately recognizable to operators already using OTel-based tooling.
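The pull-based registration shape can be sketched with only the JDK. `MiniRegistry` below is an illustrative stand-in for the real `MetricsRegistry` (whose `createGauge` also takes a description, unit, and `Tags`, as the code excerpts later in this thread show); every name in this snippet is hypothetical, not OpenSearch code.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class GaugeDemo {
    // Stand-in for MetricsRegistry: a pull-based gauge is just a named Supplier
    // that the telemetry backend polls on each collection sweep.
    interface MiniRegistry {
        Closeable createGauge(String name, String description, String unit, Supplier<Double> value);
    }

    public static void main(String[] args) throws Exception {
        List<String> polled = new ArrayList<>();
        MiniRegistry registry = (name, desc, unit, value) -> {
            polled.add(name + "=" + value.get()); // a real backend polls periodically
            return () -> { };                     // close() would deregister the gauge
        };
        Runtime rt = Runtime.getRuntime();
        // OTel-conventional name, pull-based supplier, no separate push pipeline.
        Closeable handle = registry.createGauge("jvm.memory.used", "JVM heap in use", "By",
            () -> (double) (rt.totalMemory() - rt.freeMemory()));
        System.out.println(polled);
        handle.close();
    }
}
```

Because the supplier is only evaluated when the backend collects, registering ~30 such gauges costs nothing between sweeps.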

Metrics registered

| Category | Metrics | Tags |
| --- | --- | --- |
| Memory | jvm.memory.used, jvm.memory.committed, jvm.memory.limit | type (heap/non_heap), pool (per memory pool) |
| GC | jvm.gc.duration, jvm.gc.count | gc (per collector) |
| Buffer pools | jvm.buffer.memory.used, jvm.buffer.memory.limit, jvm.buffer.count | pool (direct/mapped) |
| Threads | jvm.thread.count (total + per-state breakdown) | state (runnable, waiting, etc.) |
| Classes | jvm.class.count, jvm.class.loaded | -- |
| CPU | jvm.cpu.time, jvm.cpu.recent_utilization, process.uptime | -- |

Design

  • All gauge suppliers read through JvmService.stats(), which caches JvmStats with a 1-second TTL -- a single collection sweep reuses one snapshot across all gauges with no redundant MXBean calls.
  • Memory pools, GC collectors, and buffer pools are discovered dynamically from the initial JvmStats snapshot and tagged by name, so gauges work across G1, Parallel, CMS, ZGC, and other collectors without configuration.
  • Per-state thread counts use a separate synchronized cache (1s TTL) to avoid redundant getThreadInfo() calls across the 6 per-state gauges.
  • NodeRuntimeMetrics implements Closeable with an idempotent AtomicBoolean guard and is added to Node's resourcesToClose.
  • ThreadMXBean is injectable via a package-private constructor for unit testing.
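The per-state thread-count bullet can be illustrated with a standalone JDK-only sketch; the class and method names here are illustrative, not the PR's actual code.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative sketch of a 1s-TTL per-state thread-count cache: one
// getThreadInfo() sweep serves all six per-state gauges within the TTL window.
public class ThreadStateCache {
    private static final long CACHE_TTL_MS = 1000;
    private final ThreadMXBean threadMXBean;
    private long[] counts = new long[Thread.State.values().length];
    private long timestamp = 0; // 0 forces a refresh on the first call

    public ThreadStateCache(ThreadMXBean bean) {
        this.threadMXBean = bean; // injectable bean, as in the PR, for unit testing
    }

    public synchronized long count(Thread.State state) {
        long now = System.currentTimeMillis();
        if (now - timestamp > CACHE_TTL_MS) {
            long[] fresh = new long[Thread.State.values().length];
            for (ThreadInfo info : threadMXBean.getThreadInfo(threadMXBean.getAllThreadIds())) {
                if (info != null) { // threads may die between the two MXBean calls
                    fresh[info.getThreadState().ordinal()]++;
                }
            }
            counts = fresh;
            timestamp = now;
        }
        return counts[state.ordinal()];
    }

    public static void main(String[] args) {
        ThreadStateCache cache = new ThreadStateCache(ManagementFactory.getThreadMXBean());
        // The sampling thread itself is RUNNABLE, so at least one is expected.
        System.out.println("runnable=" + cache.count(Thread.State.RUNNABLE));
    }
}
```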

Related Issues

N/A (new functionality)

Check List

  • New functionality includes testing
  • New functionality has been documented (Javadoc)
  • API changes companion pull request - N/A (no public API change)

Testing

Unit tests in NodeRuntimeMetricsTests verify:

  • All expected gauges are registered with correct names, units, and tags
  • Gauge values reflect JvmStats snapshot data
  • Per-state thread counts are correctly computed from ThreadInfo[]
  • close() releases all gauge handles and is idempotent

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

github-actions bot commented Mar 11, 2026

PR Reviewer Guide 🔍

(Review updated until commit f23fd7b)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
📝 TODO sections

🔀 No multiple PR themes
⚡ Recommended focus areas for review

Linear Search Performance

The helper methods getPoolByName, getCollectorByName, and getBufferPoolByName each perform a linear scan through all pools/collectors on every gauge read. Since jvmService.stats() (which already has a 1-second TTL cache) is called inside each of these methods, and the methods run once per gauge during a collection sweep, consider caching the lookup results in a map keyed by name to avoid an O(n) scan per gauge invocation.

private JvmStats.MemoryPool getPoolByName(String poolName) {
    for (JvmStats.MemoryPool pool : jvmService.stats().getMem()) {
        if (pool.getName().equals(poolName)) {
            return pool;
        }
    }
    return null;
}

// ---- GC (cumulative gauges) ----
// TODO: Add event-driven GC pause duration histograms via GarbageCollectorMXBean
// notification listeners. Cumulative gauges give rate-of-GC-time; histograms would
// enable percentile analysis of individual pause durations (p50, p99).

private void registerGcGauges(MetricsRegistry registry) {
    JvmStats stats = jvmService.stats();
    for (JvmStats.GarbageCollector gc : stats.getGc()) {
        String collectorName = gc.getName();
        Tags gcTags = Tags.of(TAG_GC, collectorName);
        gaugeHandles.add(registry.createGauge(JVM_GC_DURATION, "GC cumulative collection time", UNIT_SECONDS, () -> {
            JvmStats.GarbageCollector c = getCollectorByName(collectorName);
            return c == null ? 0.0 : c.getCollectionTime().getMillis() / 1000.0;
        }, gcTags));
        gaugeHandles.add(registry.createGauge(JVM_GC_COUNT, "GC collection count", UNIT_1, () -> {
            JvmStats.GarbageCollector c = getCollectorByName(collectorName);
            return c == null ? 0.0 : (double) c.getCollectionCount();
        }, gcTags));
    }
}

private JvmStats.GarbageCollector getCollectorByName(String collectorName) {
    for (JvmStats.GarbageCollector gc : jvmService.stats().getGc()) {
        if (gc.getName().equals(collectorName)) {
            return gc;
        }
    }
    return null;
}

// ---- Buffer pools ----

private void registerBufferPoolGauges(MetricsRegistry registry) {
    JvmStats stats = jvmService.stats();
    for (JvmStats.BufferPool bp : stats.getBufferPools()) {
        String bpName = bp.getName();
        Tags bpTags = Tags.of(TAG_POOL, bpName);
        gaugeHandles.add(registry.createGauge(JVM_BUFFER_MEMORY_USED, "Buffer pool memory used", UNIT_BYTES, () -> {
            JvmStats.BufferPool b = getBufferPoolByName(bpName);
            return b == null ? 0.0 : (double) b.getUsed().getBytes();
        }, bpTags));
        gaugeHandles.add(registry.createGauge(JVM_BUFFER_MEMORY_LIMIT, "Buffer pool total capacity", UNIT_BYTES, () -> {
            JvmStats.BufferPool b = getBufferPoolByName(bpName);
            return b == null ? 0.0 : (double) b.getTotalCapacity().getBytes();
        }, bpTags));
        gaugeHandles.add(registry.createGauge(JVM_BUFFER_COUNT, "Buffer pool buffer count", UNIT_1, () -> {
            JvmStats.BufferPool b = getBufferPoolByName(bpName);
            return b == null ? 0.0 : (double) b.getCount();
        }, bpTags));
    }
}

private JvmStats.BufferPool getBufferPoolByName(String bpName) {
    for (JvmStats.BufferPool bp : jvmService.stats().getBufferPools()) {
        if (bp.getName().equals(bpName)) {
            return bp;
        }
    }
    return null;
}
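The map-based caching the review suggests can be sketched with stand-in types. The nested Pool record below is illustrative (not JvmStats.MemoryPool), and the identity check exploits the fact that jvmService.stats() returns the same cached snapshot object within a TTL window.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rebuild the name->pool index only when the cached snapshot object changes,
// so each gauge read is an O(1) map lookup instead of an O(n) scan.
public class SnapshotIndex {
    record Pool(String name, long usedBytes) {} // stand-in for JvmStats.MemoryPool

    private List<Pool> lastSnapshot;            // identity of the last indexed snapshot
    private Map<String, Pool> byName = new HashMap<>();

    synchronized Pool poolByName(List<Pool> snapshot, String name) {
        if (snapshot != lastSnapshot) {         // reference check: TTL cache reuses one object
            Map<String, Pool> fresh = new HashMap<>();
            for (Pool p : snapshot) {
                fresh.put(p.name(), p);
            }
            byName = fresh;
            lastSnapshot = snapshot;
        }
        return byName.get(name);                // null if the pool disappeared, as before
    }

    public static void main(String[] args) {
        SnapshotIndex idx = new SnapshotIndex();
        List<Pool> snap = List.of(new Pool("old_gen", 7), new Pool("eden", 42));
        System.out.println(idx.poolByName(snap, "old_gen").usedBytes());
    }
}
```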
Thread Safety

The field threadStateCounts is written in refreshThreadStateCounts() (called from synchronized getThreadStateCount) but the write threadStateCounts = counts is not synchronized itself — it replaces the array reference. While getThreadStateCount is synchronized, refreshThreadStateCounts is not, and the array reference assignment is not volatile. If the reference is read outside the synchronized block in any future code path, this could cause visibility issues. Consider making threadStateCounts volatile or keeping the refresh fully within the synchronized method.

private synchronized long getThreadStateCount(Thread.State state) {
    long now = System.currentTimeMillis();
    if (now - threadStateTimestamp > CACHE_TTL_MS) {
        refreshThreadStateCounts();
        threadStateTimestamp = now;
    }
    return threadStateCounts[state.ordinal()];
}

private void refreshThreadStateCounts() {
    long[] counts = new long[Thread.State.values().length];
    try {
        long[] threadIds = threadMXBean.getAllThreadIds();
        ThreadInfo[] infos = threadMXBean.getThreadInfo(threadIds);
        for (ThreadInfo info : infos) {
            if (info != null) {
                counts[info.getThreadState().ordinal()]++;
            }
        }
    } catch (Exception e) {
        logger.debug("Failed to collect thread state counts", e);
    }
    threadStateCounts = counts;
}
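The volatile fix the review proposes is small enough to show in isolation. This standalone sketch mirrors the field and access pattern from the excerpt above with illustrative names; it is not the PR's code.

```java
// A volatile reference makes the array-swap publication safe even if a future
// code path reads the counts outside the synchronized block.
public class ThreadStateCounts {
    private volatile long[] threadStateCounts = new long[Thread.State.values().length];

    void publish(long[] counts) {
        threadStateCounts = counts;                 // volatile write: happens-before any later read
    }

    long get(Thread.State state) {
        return threadStateCounts[state.ordinal()];  // volatile read, no lock required
    }

    public static void main(String[] args) {
        ThreadStateCounts c = new ThreadStateCounts();
        long[] counts = new long[Thread.State.values().length];
        counts[Thread.State.RUNNABLE.ordinal()] = 5;
        c.publish(counts);
        System.out.println(c.get(Thread.State.RUNNABLE)); // prints 5
    }
}
```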
Missing Non-Heap Limit

A jvm.memory.limit gauge is registered for heap (with heapTags) but no corresponding jvm.memory.limit gauge is registered for non-heap memory (with nonHeapTags). This is an asymmetry compared to the heap gauges (used, committed, limit) vs non-heap gauges (used, committed only). If intentional, it should be documented; otherwise the missing non-heap limit gauge should be added.

gaugeHandles.add(
    registry.createGauge(
        JVM_MEMORY_USED,
        "JVM non-heap memory used",
        UNIT_BYTES,
        () -> (double) jvmService.stats().getMem().getNonHeapUsed().getBytes(),
        nonHeapTags
    )
);
gaugeHandles.add(
    registry.createGauge(
        JVM_MEMORY_COMMITTED,
        "JVM non-heap memory committed",
        UNIT_BYTES,
        () -> (double) jvmService.stats().getMem().getNonHeapCommitted().getBytes(),
        nonHeapTags
    )
);
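Whether JvmStats exposes a non-heap maximum is not shown in this thread, but the standard MemoryMXBean does, so a supplier for the missing gauge could look like the sketch below (class and method names are illustrative; the negative-value handling mirrors the pool-max gauge elsewhere in the PR).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class NonHeapLimit {
    // The JVM reports -1 when the non-heap limit is undefined; map that to 0.0
    // the same way the pool-max gauge handles negative values.
    static double nonHeapLimitBytes(MemoryMXBean bean) {
        long max = bean.getNonHeapMemoryUsage().getMax();
        return max < 0 ? 0.0 : (double) max;
    }

    public static void main(String[] args) {
        System.out.println(nonHeapLimitBytes(ManagementFactory.getMemoryMXBean()));
    }
}
```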
Gauge Registration Timing

Memory pool, GC collector, and buffer pool gauges are discovered from the JVM stats snapshot taken at construction time. If the JVM later introduces new pools or collectors (e.g., with certain GC configurations or after class data sharing), those will not be picked up. The class Javadoc mentions this is dynamic discovery from the "initial snapshot," but operators should be aware that new pools appearing at runtime won't be tracked without a restart.

JvmStats stats = jvmService.stats();
for (JvmStats.MemoryPool pool : stats.getMem()) {
    String poolName = pool.getName();
    Tags poolTags = Tags.of(TAG_POOL, poolName);
    gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED, "JVM memory pool used", UNIT_BYTES, () -> {
        JvmStats.MemoryPool p = getPoolByName(poolName);
        return p == null ? 0.0 : (double) p.getUsed().getBytes();
    }, poolTags));
    gaugeHandles.add(registry.createGauge(JVM_MEMORY_LIMIT, "JVM memory pool max", UNIT_BYTES, () -> {
        JvmStats.MemoryPool p = getPoolByName(poolName);
        if (p == null) return 0.0;
        long bytes = p.getMax().getBytes();
        return bytes < 0 ? 0.0 : (double) bytes;
    }, poolTags));
    gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
        JvmStats.MemoryPool p = getPoolByName(poolName);
        return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
    }, poolTags));
}

@github-actions
Contributor

github-actions bot commented Mar 11, 2026

PR Code Suggestions ✨

Latest suggestions up to f23fd7b

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

The call to p.getLastGcStats() may return null for certain memory pools (e.g., young
generation pools that don't track post-GC usage), which would cause a
NullPointerException at runtime when the gauge supplier is invoked. Add a null check
for getLastGcStats() before calling .getUsed().

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: getLastGcStats() can legitimately return null for certain memory pools, and calling .getUsed() on a null reference would cause a NullPointerException at runtime during metric collection. This is a real potential bug that should be guarded against.

Medium
Synchronize close method to prevent concurrent modification

The closeQuietly() method iterates over gaugeHandles and then calls
gaugeHandles.clear(), but it is also called from the constructor's catch block
before all gauges are registered. This is safe, but the close() method sets closed
to true and then calls closeQuietly(), which clears the list. If close() is called
concurrently, the second call returns early due to the AtomicBoolean check, but
closeQuietly() in the constructor path bypasses this guard. More critically,
closeQuietly() is not synchronized, so concurrent calls (e.g., from close() and a
constructor failure) could cause a ConcurrentModificationException on gaugeHandles.
Consider synchronizing closeQuietly() or using a thread-safe list.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

-private void closeQuietly() {
+private synchronized void closeQuietly() {
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 4


Why: The concurrency concern is valid but the scenario is unlikely in practice since closeQuietly() is only called from the constructor (single-threaded) and from close() (protected by AtomicBoolean). The suggestion is technically sound but addresses a low-probability edge case.

Low
General
Prevent retry storm on persistent refresh failures

The threadStateTimestamp is initialized to 0, so the first call will always trigger
a refresh (since now - 0 will be much greater than CACHE_TTL_MS). However, if
refreshThreadStateCounts() throws an unexpected unchecked exception,
threadStateTimestamp will not be updated, causing every subsequent call to retry the
refresh. Consider updating the timestamp before or in a finally block to prevent a
tight retry loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5


Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() prevents a tight retry loop if the refresh consistently throws an unchecked exception, which is a valid defensive improvement. However, refreshThreadStateCounts() already catches all exceptions internally, making the unchecked exception scenario unlikely.

Low
Clamp CPU ratio to valid range

The clampCpuPercent method only guards against negative values but does not cap the
upper bound. If pct exceeds 100 (which can happen on some platforms or under certain
conditions), the returned ratio will exceed 1.0, violating the documented contract
of the metric. Add an upper bound clamp to ensure the value stays within [0.0, 1.0].

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [405-407]

 private static double clampCpuPercent(short pct) {
-    return pct < 0 ? 0.0 : pct / 100.0;
+    if (pct < 0) return 0.0;
+    if (pct > 100) return 1.0;
+    return pct / 100.0;
+}
Suggestion importance[1-10]: 5


Why: The suggestion correctly identifies that pct values above 100 could produce ratios exceeding 1.0, violating the metric's documented contract. Adding an upper bound clamp ensures the value stays within [0.0, 1.0] as expected.

Low

Previous suggestions

Suggestions up to commit e1ad2a6
Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

p.getLastGcStats() may return null if no GC has occurred yet for that pool, which
would cause a NullPointerException at runtime when the gauge supplier is invoked. A
null check should be added before calling getUsed() on the result.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: If getLastGcStats() returns null (e.g., before any GC has occurred), calling getUsed() on it would throw a NullPointerException at runtime when the gauge supplier is invoked. Adding a null check is a meaningful defensive fix.

Medium
Prevent double-close after constructor failure

The closeQuietly() method iterates over gaugeHandles while it may be called from the
constructor (on failure) and also from close(). However, iterating and then clearing
is not thread-safe if close() could be called concurrently. More critically, when
called from the constructor on failure, closed flag is never set to true, so a
subsequent call to close() would attempt to close already-closed handles. The
closeQuietly() should also set the closed flag to prevent double-close.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
+    closed.set(true);
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 6


Why: When closeQuietly() is called from the constructor on failure, the closed flag remains false, so a subsequent call to close() would attempt to close already-cleared handles. Setting closed.set(true) in closeQuietly() prevents this double-close scenario.

Low
General
Update cache timestamp before refresh to avoid retry storms

On the very first call, threadStateTimestamp is 0, so now - 0 will be much larger
than CACHE_TTL_MS and refreshThreadStateCounts() will be called correctly. However,
if refreshThreadStateCounts() throws an unexpected unchecked exception,
threadStateTimestamp will not be updated, causing every subsequent call to retry the
refresh. While the method catches Exception, consider ensuring threadStateTimestamp
is always updated to avoid a tight retry loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5


Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() ensures that even if an unchecked exception escapes (despite the internal catch), the timestamp is updated and avoids a tight retry loop. This is a valid defensive improvement, though the current catch (Exception e) already handles most failure cases.

Low
Suggestions up to commit a1ad84c
Category | Suggestion | Impact
Possible issue
Prevent double-close after constructor failure

The closeQuietly() method is called both from close() (after the AtomicBoolean
guard) and from the constructor's catch block. However, if closeQuietly() is called
from the constructor on failure, and then close() is later called on the
partially-constructed object, the AtomicBoolean guard will prevent a second cleanup
since closed was never set to true in the constructor path. This means the closed
flag should also be set when cleaning up in the constructor failure path.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
+    closed.set(true);
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 7


Why: Valid concern: if the constructor fails and calls closeQuietly(), the closed flag remains false, so a subsequent close() call would invoke closeQuietly() again on an already-cleared gaugeHandles list. Setting closed.set(true) in closeQuietly() prevents this double-close scenario.

Medium
Guard against null last GC stats

The call to p.getLastGcStats() may return null for some memory pools (e.g., pools
that have never been collected), which would cause a NullPointerException at runtime
when the gauge supplier is invoked. A null check should be added before calling
getUsed() on the result.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: getLastGcStats() can return null for memory pools that have never undergone GC, which would cause a NullPointerException at runtime when the gauge supplier is invoked. Adding a null check is a valid defensive fix for a real potential runtime error.

Medium
General
Ensure thread-safe cache refresh within synchronized block

The threadStateTimestamp is initialized to 0, so now - 0 will always be greater than
CACHE_TTL_MS on the first call, which is correct. However,
refreshThreadStateCounts() is called while holding the synchronized lock, and it can
be slow (iterating all threads). More critically, threadStateCounts is a
non-volatile field read outside the synchronized block in refreshThreadStateCounts
(which assigns to it). Since refreshThreadStateCounts is called from within the
synchronized method, this is safe, but the field assignment threadStateCounts =
counts inside refreshThreadStateCounts (a non-synchronized method) could be a
visibility issue if called from other contexts. Consider making
refreshThreadStateCounts private and ensuring it's only called within the
synchronized context, or inline it.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
-        refreshThreadStateCounts();
+        long[] counts = new long[Thread.State.values().length];
+        try {
+            long[] threadIds = threadMXBean.getAllThreadIds();
+            ThreadInfo[] infos = threadMXBean.getThreadInfo(threadIds);
+            for (ThreadInfo info : infos) {
+                if (info != null) {
+                    counts[info.getThreadState().ordinal()]++;
+                }
+            }
+        } catch (Exception e) {
+            logger.debug("Failed to collect thread state counts", e);
+        }
+        threadStateCounts = counts;
         threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 4


Why: The refreshThreadStateCounts() method is already private and only called from within the synchronized method getThreadStateCount, so visibility of threadStateCounts is already guaranteed. Inlining it is a minor refactoring with limited practical impact on correctness.

Low
Suggestions up to commit 461bfe9
Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

The call to p.getLastGcStats() may return null if no GC has occurred yet for that
pool, which would cause a NullPointerException at runtime when the gauge supplier is
invoked. A null check should be added before calling .getUsed() on the result.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: If getLastGcStats() returns null before any GC has occurred for a pool, calling .getUsed() on it would cause a NullPointerException at runtime when the gauge supplier is invoked. This is a real potential bug that should be guarded against.

Medium
Prevent concurrent modification during close

The closeQuietly() method is called both from the constructor (on failure) and from
close(). When called from the constructor, it iterates and clears gaugeHandles,
which is correct. However, when called from close() after closed is set to true, a
second call to close() is guarded by the AtomicBoolean, but closeQuietly() itself is
not thread-safe since it iterates and clears the list without synchronization.
Consider using a local copy of the list before clearing to avoid potential
ConcurrentModificationException if close is called concurrently.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
-    for (Closeable handle : gaugeHandles) {
+    List<Closeable> handles = new ArrayList<>(gaugeHandles);
+    gaugeHandles.clear();
+    for (Closeable handle : handles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
-    gaugeHandles.clear();
 }
Suggestion importance[1-10]: 4


Why: While the suggestion is technically valid, the close() method is guarded by AtomicBoolean.compareAndSet, making concurrent double-close safe. The ConcurrentModificationException risk is low in practice since closeQuietly is only called from close() or the constructor. The improvement is minor but does make the code more robust.

Low
General
Clamp CPU utilization ratio upper bound

The clampCpuPercent method only guards against negative values but does not cap the
upper bound. If getProcessCpuPercent() or getSystemCpuPercent() returns a value
greater than 100 (e.g., due to a platform-specific anomaly), the returned ratio
would exceed 1.0, violating the documented contract of the metric. An upper bound
clamp should be added.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [405-407]

 private static double clampCpuPercent(short pct) {
-    return pct < 0 ? 0.0 : pct / 100.0;
+    if (pct < 0) return 0.0;
+    if (pct > 100) return 1.0;
+    return pct / 100.0;
+}
Suggestion importance[1-10]: 5


Why: Adding an upper bound clamp prevents the metric from exceeding 1.0 in edge cases where platform-specific anomalies return values above 100. This is a reasonable defensive improvement for metric correctness, though short values above 100 from these probes are unlikely in practice.

Low
Document synchronization requirement for refresh method

The threadStateTimestamp is initialized to 0, so the first call will always trigger
a refresh, which is correct. However, refreshThreadStateCounts() assigns a new array
to threadStateCounts without synchronization on the field itself. Since
getThreadStateCount is synchronized but refreshThreadStateCounts is not, and
threadStateCounts is a non-volatile field written inside a synchronized method, this
is safe. But the initial value of threadStateTimestamp = 0 means now - 0 > 1000 is
always true initially, which is the intended behavior. This is fine, but consider
initializing threadStateTimestamp to ensure clarity. More critically,
refreshThreadStateCounts should be called only within the synchronized block —
verify it is not called from outside synchronized context.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
         refreshThreadStateCounts();
         threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
 
+private void refreshThreadStateCounts() {
+    // Must be called from within synchronized(this)
+    long[] counts = new long[Thread.State.values().length];
+    try {
+        long[] threadIds = threadMXBean.getAllThreadIds();
+        ThreadInfo[] infos = threadMXBean.getThreadInfo(threadIds);
+        for (ThreadInfo info : infos) {
+            if (info != null) {
+                counts[info.getThreadState().ordinal()]++;
+            }
+        }
+    } catch (Exception e) {
+        logger.debug("Failed to collect thread state counts", e);
+    }
+    threadStateCounts = counts;
+}
+
Suggestion importance[1-10]: 1


Why: The improved_code is essentially the same as the existing_code with only a comment added, which falls into the category of adding comments. The suggestion doesn't fix any real bug and the existing_code and improved_code are functionally identical.

Low
Suggestions up to commit 484b3bd
Category | Suggestion | Impact
Possible issue
Guard against null last GC stats

p.getLastGcStats() may return null if no GC has occurred yet for that pool, which
would cause a NullPointerException when .getUsed() is called. A null check should be
added before accessing getLastGcStats().getUsed().

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [201-204]

 gaugeHandles.add(registry.createGauge(JVM_MEMORY_USED_AFTER_LAST_GC, "JVM memory pool used after last GC", UNIT_BYTES, () -> {
     JvmStats.MemoryPool p = getPoolByName(poolName);
-    return p == null ? 0.0 : (double) p.getLastGcStats().getUsed().getBytes();
+    if (p == null || p.getLastGcStats() == null) return 0.0;
+    return (double) p.getLastGcStats().getUsed().getBytes();
 }, poolTags));
Suggestion importance[1-10]: 7


Why: getLastGcStats() can return null when no GC has occurred yet for a pool, making this a real NPE risk in production. The null check is a straightforward and necessary defensive fix.

Medium
Prevent double-close of gauge handles

The closeQuietly() method is called both from close() (after the AtomicBoolean
guard) and from the constructor's catch block. However, if close() is called
concurrently while the constructor's catch is running, the closed flag won't be set
yet, allowing closeQuietly() to run twice and potentially double-close handles. The
constructor cleanup path should not rely on closeQuietly() since it bypasses the
closed guard; consider using a separate internal cleanup method or setting closed to
true before calling closeQuietly() in the constructor's catch block.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [417-426]

 private void closeQuietly() {
-    for (Closeable handle : gaugeHandles) {
+    List<Closeable> snapshot = new ArrayList<>(gaugeHandles);
+    gaugeHandles.clear();
+    for (Closeable handle : snapshot) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
-    gaugeHandles.clear();
 }
 
+@Override
+public void close() throws IOException {
+    if (closed.compareAndSet(false, true) == false) {
+        return;
+    }
+    closeQuietly();
+}
+
Suggestion importance[1-10]: 4

__

Why: The concern about double-close is valid but the actual risk is low since the constructor's catch block runs before the object is published to other threads. The improved code clears the list before iterating, which is a minor improvement but the core issue (concurrent access during construction) is not a realistic concern in practice.

Low
General
Update cache timestamp before refresh to avoid retry storms

The initial value of threadStateTimestamp is 0, so the first call will always
trigger a refresh, which is correct. However, if refreshThreadStateCounts() throws
an unexpected unchecked exception, threadStateTimestamp will not be updated, causing
every subsequent call to retry the refresh. More critically, if the refresh fails
silently (exception caught inside), stale zero-counts are returned without any
indication. Consider updating threadStateTimestamp before or after the refresh
regardless of outcome to avoid a tight retry loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [311-318]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5

__

Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() is a valid improvement to prevent tight retry loops if the refresh consistently fails, since the exception is caught inside refreshThreadStateCounts() and would otherwise cause every call to retry.

Low
Clamp CPU ratio upper bound to 1.0

The CPU percent value is a short that can theoretically exceed 100 (e.g., on systems
that report aggregate CPU across cores). The method only clamps the lower bound but
not the upper bound, potentially returning values greater than 1.0 for the
utilization ratio. An upper clamp should be added to ensure the returned value stays
within [0.0, 1.0].

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [405-407]

 private static double clampCpuPercent(short pct) {
-    return pct < 0 ? 0.0 : pct / 100.0;
+    if (pct < 0) return 0.0;
+    if (pct > 100) return 1.0;
+    return pct / 100.0;
 }
Suggestion importance[1-10]: 4

__

Why: While theoretically valid, the ProcessProbe.getProcessCpuPercent() and OsProbe.getSystemCpuPercent() methods in OpenSearch typically return values in [0, 100] or -1, making values above 100 unlikely. The upper clamp is a minor defensive improvement.

Low
Suggestions up to commit 2debaed
Category | Suggestion | Impact
Possible issue
Prevent retry storm on persistent refresh failures

The threadStateTimestamp is initialized to 0, so the very first call will always
trigger a refresh (since now - 0 > 1000). However, if refreshThreadStateCounts()
throws an unexpected unchecked exception, threadStateTimestamp will not be updated,
causing every subsequent call to retry the refresh. Consider updating
threadStateTimestamp before or inside a try-finally block to prevent a tight retry
loop on persistent failures.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [324-331]

 private synchronized long getThreadStateCount(Thread.State state) {
     long now = System.currentTimeMillis();
     if (now - threadStateTimestamp > CACHE_TTL_MS) {
+        threadStateTimestamp = now;
         refreshThreadStateCounts();
-        threadStateTimestamp = now;
     }
     return threadStateCounts[state.ordinal()];
 }
Suggestion importance[1-10]: 5

__

Why: Moving threadStateTimestamp = now before refreshThreadStateCounts() prevents a tight retry loop if an unchecked exception is thrown during refresh. However, refreshThreadStateCounts() already catches all exceptions internally, so the risk of an unchecked exception escaping is low, making this a minor defensive improvement.

Low
General
Add missing non-heap committed memory gauge

The non-heap JVM_MEMORY_USED gauge is registered but there is no corresponding
JVM_MEMORY_COMMITTED gauge for non-heap memory, even though non-heap committed
memory is a meaningful metric (it represents the actual OS memory reserved for
non-heap). This asymmetry means non-heap committed memory is silently omitted, which
could mislead users monitoring memory pressure.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [162-170]

 gaugeHandles.add(
     registry.createGauge(
         JVM_MEMORY_USED,
         "JVM non-heap memory used",
         UNIT_BYTES,
         () -> (double) jvmService.stats().getMem().getNonHeapUsed().getBytes(),
         nonHeapTags
     )
 );
+gaugeHandles.add(
+    registry.createGauge(
+        JVM_MEMORY_COMMITTED,
+        "JVM non-heap memory committed",
+        UNIT_BYTES,
+        () -> (double) jvmService.stats().getMem().getNonHeapCommitted().getBytes(),
+        nonHeapTags
+    )
+);
Suggestion importance[1-10]: 4

__

Why: The suggestion points out a real asymmetry where non-heap committed memory is not tracked while heap committed memory is. This is a valid completeness concern, but it's a feature addition rather than a bug fix, and the PR author may have intentionally omitted it.

Low
Ensure closed flag is set during cleanup

When closeQuietly() is called from the constructor's catch block (before closed is
set to true), a subsequent call to close() will pass the compareAndSet check and
call closeQuietly() again on an already-cleared list. This is harmless but
inconsistent. More critically, if close() is called concurrently during construction
failure, the closed flag won't be set, so close() will attempt to close
already-cleared handles. Consider setting closed to true inside closeQuietly() as
well.

server/src/main/java/org/opensearch/monitor/NodeRuntimeMetrics.java [430-439]

 private void closeQuietly() {
+    closed.set(true);
     for (Closeable handle : gaugeHandles) {
         try {
             handle.close();
         } catch (IOException e) {
             logger.debug("Failed to close gauge handle", e);
         }
     }
     gaugeHandles.clear();
 }
Suggestion importance[1-10]: 4

__

Why: Setting closed to true inside closeQuietly() prevents a subsequent close() call from re-entering cleanup after a constructor failure. However, the constructor failure path clears gaugeHandles, so the double-close scenario is harmless in practice, making this a minor defensive improvement.

Low

Register pull-based gauges in NodeRuntimeMetrics covering JVM memory
(heap, non-heap, per-pool), GC collectors, buffer pools, threads
(including per-state counts), class loading, uptime, and CPU usage.

Metric names follow OpenTelemetry semantic conventions (jvm.memory.used,
jvm.gc.duration, jvm.thread.count, etc.) for consistency with the
broader observability ecosystem.

All gauge suppliers read through JvmService.stats(), which caches the
JvmStats snapshot with a 1-second TTL, so a single collection sweep
reuses one snapshot across all gauges. Thread state counts use a
separate synchronized cache to avoid redundant getThreadInfo() calls.
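
The TTL-cache idea described above can be sketched as a small generic wrapper. This is a minimal illustration only: the class name and structure here are hypothetical, not the actual code in NodeRuntimeMetrics.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Minimal sketch of a TTL-cached supplier (illustrative only: the class name
// and structure are hypothetical, not the actual NodeRuntimeMetrics code).
final class TtlCachedSupplier<T> implements Supplier<T> {
    private final Supplier<T> delegate;
    private final long ttlNanos;
    private T cached;
    private long lastRefreshNanos;

    TtlCachedSupplier(Supplier<T> delegate, long ttlMillis) {
        this.delegate = delegate;
        this.ttlNanos = TimeUnit.MILLISECONDS.toNanos(ttlMillis);
    }

    @Override
    public synchronized T get() {
        long now = System.nanoTime();
        // Update the timestamp before refreshing so a throwing delegate
        // cannot cause a tight retry loop on every subsequent call.
        if (cached == null || now - lastRefreshNanos > ttlNanos) {
            lastRefreshNanos = now;
            cached = delegate.get();
        }
        return cached;
    }
}
```

Within one collection sweep, every gauge reading through such a wrapper sees the same snapshot instead of triggering a fresh probe per gauge.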

Memory pools, GC collectors, and buffer pools are discovered dynamically
from the initial JvmStats snapshot and tagged by name, so the gauges
work identically across G1, Parallel, CMS, ZGC, and other collectors.
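
The dynamic-discovery idea can be shown with the standard `java.lang.management` MXBean API. This is an independent sketch under that assumption; the PR itself reads pools from a cached JvmStats snapshot rather than calling the MXBeans directly.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the PR reads pools from a cached JvmStats snapshot,
// but the same discovery pattern can be shown with the standard MXBean API.
final class PoolDiscovery {
    static Map<String, Long> currentPoolUsage() {
        Map<String, Long> used = new LinkedHashMap<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            // getUsage() may return null for a pool that has become invalid.
            if (usage != null) {
                // Tag by pool name rather than hard-coding a fixed set, so the
                // same gauges work under G1, Parallel, ZGC, and others.
                used.put(pool.getName(), usage.getUsed());
            }
        }
        return used;
    }
}
```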

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@sakrah sakrah force-pushed the sakrah/node-runtime-jvm-metrics-upstream branch from 1f4e7a1 to d0a1714 on March 11, 2026 19:32
@github-actions
Contributor

Persistent review updated to latest commit d0a1714

@sakrah sakrah changed the title Sakrah/node runtime jvm metrics upstream Add node-level JVM and CPU runtime metrics Mar 11, 2026
@github-actions
Contributor

❌ Gradle check result for d0a1714: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

sakrah pushed a commit to sakrah/OpenSearch that referenced this pull request Mar 11, 2026
- Clamp CPU probe values to 0.0 when platform returns -1 (unavailable)
- Clamp memory pool max to 0.0 when pool has no defined upper bound
- Wrap gauge registration in try-catch to close already-created handles
  if construction fails partway through
- Add CHANGELOG entry for PR opensearch-project#20844
- Add tests for negative CPU guard and constructor cleanup on failure

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit b107cd1

@github-actions
Contributor

❌ Gradle check result for b107cd1: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

- Clamp CPU probe values to 0.0 when platform returns -1 (unavailable)
- Clamp memory pool max to 0.0 when pool has no defined upper bound
- Wrap gauge registration in try-catch to close already-created handles
  if construction fails partway through
- Add CHANGELOG entry for PR opensearch-project#20844
- Add tests for negative CPU guard and constructor cleanup on failure

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@sakrah sakrah force-pushed the sakrah/node-runtime-jvm-metrics-upstream branch from b107cd1 to 2debaed on March 11, 2026 20:38
@github-actions
Contributor

Persistent review updated to latest commit 2debaed

@github-actions
Contributor

✅ Gradle check result for 2debaed: SUCCESS

@codecov

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 83.73494% with 27 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.28%. Comparing base (db61a88) to head (f23fd7b).
⚠️ Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
...ava/org/opensearch/monitor/NodeRuntimeMetrics.java 83.22% 12 Missing and 15 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #20844      +/-   ##
============================================
- Coverage     73.30%   73.28%   -0.03%     
- Complexity    72252    72272      +20     
============================================
  Files          5795     5796       +1     
  Lines        330056   330222     +166     
  Branches      47643    47662      +19     
============================================
+ Hits         241947   241994      +47     
- Misses        68695    68816     +121     
+ Partials      19414    19412       -2     

☔ View full report in Codecov by Sentry.

…erage

- Add jvm.memory.used_after_last_gc per-pool gauge (post-GC heap usage
  shows true heap pressure vs lazy-GC-inflated current usage)
- Replace boolean-flag helpers (poolBytes, gcMetric, bufferPoolMetric)
  with domain-object-returning helpers (getPoolByName,
  getCollectorByName, getBufferPoolByName) per reviewer feedback
- Remove BufferPoolField enum — no longer needed
- Add supplier-invocation tests for memory pools, GC, buffer pools,
  and class loading to close code coverage gaps

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
- Add jvm.memory.committed gauge for non-heap memory (was missing,
  unlike heap which had all three: used, committed, limit)
- Add Objects.requireNonNull for all constructor parameters

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit 484b3bd

@github-actions
Contributor

❌ Gradle check result for 484b3bd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sakrah
Contributor Author

sakrah commented Mar 12, 2026

CI Failure — Unrelated Flaky Test

The Gradle check failed on :qa:rolling-upgrade:v2.19.5-remote#twoThirdsUpgradedTest:

process was found dead while waiting for cluster health yellow

Root cause is an AssertionError in IndexShard.updateShardState during a rolling upgrade from 2.19.5 to 3.6.0:

fatal error in thread [opensearch[v2.19.5-remote-0][clusterApplierService#updateTask][T#1]], exiting
java.lang.AssertionError: a started primary with non-pending operation term must be in primary mode
  [closed_index_replica_allocation][0]
  at org.opensearch.index.shard.IndexShard.updateShardState(IndexShard.java:929)

Related flaky test issues in the rolling-upgrade suite:

Retrying CI with an empty commit.

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit 461bfe9

Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
@github-actions
Contributor

Persistent review updated to latest commit a1ad84c

@github-actions
Contributor

❌ Gradle check result for a1ad84c: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sam Akrah <sakrah@uber.com>
@github-actions
Contributor

Persistent review updated to latest commit e1ad2a6

Signed-off-by: Sam Akrah <sakrah@uber.com>
@github-actions
Contributor

Persistent review updated to latest commit f23fd7b

@github-actions
Contributor

❕ Gradle check result for f23fd7b: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@msfroh msfroh merged commit 6e92a76 into opensearch-project:main Mar 12, 2026
35 checks passed