Add node-level JVM and CPU runtime metrics #20844
msfroh merged 8 commits into opensearch-project:main
Register pull-based gauges in NodeRuntimeMetrics covering JVM memory (heap, non-heap, per-pool), GC collectors, buffer pools, threads (including per-state counts), class loading, uptime, and CPU usage.
Metric names follow OpenTelemetry semantic conventions (jvm.memory.used, jvm.gc.duration, jvm.thread.count, etc.) for consistency with the broader observability ecosystem.
All gauge suppliers read through JvmService.stats(), which caches the JvmStats snapshot with a 1-second TTL, so a single collection sweep reuses one snapshot across all gauges. Thread state counts use a separate synchronized cache to avoid redundant getThreadInfo() calls.
Memory pools, GC collectors, and buffer pools are discovered dynamically from the initial JvmStats snapshot and tagged by name, so the gauges work identically across G1, Parallel, CMS, ZGC, and other collectors.
Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
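For readers unfamiliar with the pattern, the snapshot caching described in the commit message above can be sketched roughly as follows. This is an illustration only, not the JvmService implementation: the class name TtlCachedSupplier, the injectable clock, and the constructor shape are all assumptions made for the example.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an OpenSearch class): caches one value and
 * reloads it only after the time-to-live has elapsed.
 */
final class TtlCachedSupplier<T> implements Supplier<T> {
    private final Supplier<T> loader;      // produces a fresh snapshot
    private final long ttlNanos;           // how long a snapshot stays valid
    private final LongSupplier nanoClock;  // injectable clock, e.g. System::nanoTime
    private T cached;
    private long loadedAtNanos;
    private boolean loaded;

    TtlCachedSupplier(Supplier<T> loader, long ttlNanos, LongSupplier nanoClock) {
        this.loader = loader;
        this.ttlNanos = ttlNanos;
        this.nanoClock = nanoClock;
    }

    /** Returns the cached value, reloading it first if it is missing or expired. */
    @Override
    public synchronized T get() {
        long now = nanoClock.getAsLong();
        if (!loaded || now - loadedAtNanos >= ttlNanos) {
            cached = loader.get();
            loadedAtNanos = now;
            loaded = true;
        }
        return cached;
    }
}
```

With a 1-second TTL, many gauge suppliers invoked during the same collection sweep would all call get() and share a single loaded snapshot. The synchronized method keeps concurrent readers from triggering duplicate reloads, at the cost of briefly serializing them.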
1f4e7a1 to
d0a1714
❌ Gradle check result for d0a1714: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
- Clamp CPU probe values to 0.0 when platform returns -1 (unavailable)
- Clamp memory pool max to 0.0 when pool has no defined upper bound
- Wrap gauge registration in try-catch to close already-created handles if construction fails partway through
- Add CHANGELOG entry for PR opensearch-project#20844
- Add tests for negative CPU guard and constructor cleanup on failure
Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
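The two defensive measures in this commit (clamping "unavailable" sentinels and cleaning up after a partial registration failure) could look roughly like the sketch below. The class and method names are hypothetical and the registration step is modeled as a plain Supplier<Closeable>; the actual NodeRuntimeMetrics code is not shown in this conversation and may differ.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch; names are illustrative, not taken from the PR. */
final class GaugeRegistrationSketch {

    /** Maps a negative "unavailable" sentinel (such as -1) to 0.0; other values pass through. */
    static double clampNonNegative(double value) {
        return value < 0 ? 0.0 : value;
    }

    /**
     * Runs each registration in order. If one throws, every handle created
     * so far is closed before the original exception is rethrown.
     */
    static List<Closeable> registerAll(List<Supplier<Closeable>> registrations) {
        List<Closeable> handles = new ArrayList<>();
        try {
            for (Supplier<Closeable> registration : registrations) {
                handles.add(registration.get());
            }
            return handles;
        } catch (RuntimeException e) {
            for (Closeable handle : handles) {
                try {
                    handle.close();
                } catch (IOException | RuntimeException closeFailure) {
                    // Keep the original failure primary; record cleanup errors alongside it.
                    e.addSuppressed(closeFailure);
                }
            }
            throw e;
        }
    }
}
```

The same clamp would cover both cases named in the commit message, since the standard java.lang.management.MemoryUsage.getMax() also reports -1 when a pool has no defined maximum.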
❌ Gradle check result for b107cd1: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
b107cd1 to
2debaed
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #20844 +/- ##
============================================
- Coverage 73.30% 73.28% -0.03%
- Complexity 72252 72272 +20
============================================
Files 5795 5796 +1
Lines 330056 330222 +166
Branches 47643 47662 +19
============================================
+ Hits 241947 241994 +47
- Misses 68695 68816 +121
+ Partials 19414 19412 -2
…erage
- Add jvm.memory.used_after_last_gc per-pool gauge (post-GC heap usage shows true heap pressure vs lazy-GC-inflated current usage)
- Replace boolean-flag helpers (poolBytes, gcMetric, bufferPoolMetric) with domain-object-returning helpers (getPoolByName, getCollectorByName, getBufferPoolByName) per reviewer feedback
- Remove BufferPoolField enum (no longer needed)
- Add supplier-invocation tests for memory pools, GC, buffer pools, and class loading to close code coverage gaps
Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
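As a rough illustration of the helper refactor mentioned in this commit, a lookup that returns the domain object lets each gauge pick the field it needs, instead of routing through a flag-driven method. Only the name getPoolByName comes from the commit message; the Pool type, the Optional return type, and the 0.0 fallback are assumptions for this sketch.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch of a domain-object-returning lookup helper. */
final class PoolLookupSketch {

    /** Stand-in for a per-pool stats entry; not the real JvmStats type. */
    static final class Pool {
        final String name;
        final long usedBytes;

        Pool(String name, long usedBytes) {
            this.name = name;
            this.usedBytes = usedBytes;
        }
    }

    /** Finds the pool with the given name in the current snapshot, if present. */
    static Optional<Pool> getPoolByName(List<Pool> pools, String name) {
        return pools.stream().filter(p -> p.name.equals(name)).findFirst();
    }

    /** Gauge-style reader: used bytes for the named pool, or 0.0 if the pool is absent. */
    static double usedBytes(List<Pool> pools, String name) {
        return getPoolByName(pools, name).map(p -> (double) p.usedBytes).orElse(0.0);
    }
}
```

Each additional per-pool metric then becomes a one-line reader over the same lookup rather than another branch inside a shared helper.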
- Add jvm.memory.committed gauge for non-heap memory (was missing, unlike heap which had all three: used, committed, limit)
- Add Objects.requireNonNull for all constructor parameters
Signed-off-by: Sam Akrah <sakrah@uber.com>
Made-with: Cursor
❌ Gradle check result for 484b3bd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
CI Failure: Unrelated Flaky Test
The Gradle check failed on Root cause is an Related flaky test issues in the rolling-upgrade suite:
Retrying CI with an empty commit. |
Signed-off-by: Sam Akrah <sakrah@uber.com> Made-with: Cursor
Signed-off-by: Sam Akrah <sakrah@uber.com> Made-with: Cursor
❌ Gradle check result for a1ad84c: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Sam Akrah <sakrah@uber.com>
Signed-off-by: Sam Akrah <sakrah@uber.com>
❕ Gradle check result for f23fd7b: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Description
Motivation
OpenSearch's telemetry framework (MetricsRegistry) currently exposes no JVM runtime metrics. Operators monitoring production clusters rely on external mechanisms -- the _nodes/stats REST API (polling-based, adds HTTP overhead), standalone JMX exporters (separate process, separate config), or the OTel Java agent (bytecode instrumentation, heavyweight dependency). None of these integrate with the telemetry backend that OpenSearch is already configured to use.
This means if a cluster has the OTel plugin (or any future telemetry plugin) enabled, JVM metrics like heap pressure, GC pause time, and thread exhaustion still require a separate collection pipeline. There's no reason for that gap -- the data is already available via JvmService.stats(), which caches JvmStats snapshots with a 1-second TTL.
What this PR does
Adds NodeRuntimeMetrics, a Closeable component that registers ~30 pull-based gauges through MetricsRegistry, covering JVM memory, GC, buffer pools, threads, class loading, and CPU. Since it uses the framework-agnostic MetricsRegistry API, these metrics are automatically available through whatever telemetry backend the cluster is configured with -- no additional infrastructure required.
Metric names follow OpenTelemetry JVM semantic conventions (jvm.memory.used, jvm.gc.duration, jvm.thread.count, etc.) so they align with the broader ecosystem and are immediately recognizable to operators already using OTel-based tooling.
Metrics registered
- jvm.memory.used, jvm.memory.committed, jvm.memory.limit; tags: type (heap/non_heap), pool (per memory pool)
- jvm.gc.duration, jvm.gc.count; tags: gc (per collector)
- jvm.buffer.memory.used, jvm.buffer.memory.limit, jvm.buffer.count; tags: pool (direct/mapped)
- jvm.thread.count (total + per-state breakdown); tags: state (runnable, waiting, etc.)
- jvm.class.count, jvm.class.loaded
- jvm.cpu.time, jvm.cpu.recent_utilization, process.uptime
Design
- All gauge suppliers read through JvmService.stats(), which caches JvmStats with a 1-second TTL -- a single collection sweep reuses one snapshot across all gauges with no redundant MXBean calls.
- Memory pools, GC collectors, and buffer pools are discovered dynamically from the initial JvmStats snapshot and tagged by name, so gauges work across G1, Parallel, CMS, ZGC, and other collectors without configuration.
- Thread state counts use a separate synchronized cache to avoid redundant getThreadInfo() calls across the 6 per-state gauges.
- NodeRuntimeMetrics implements Closeable with an idempotent AtomicBoolean guard and is added to Node's resourcesToClose.
- ThreadMXBean is injectable via a package-private constructor for unit testing.
Related Issues
N/A (new functionality)
Check List
Testing
Unit tests in NodeRuntimeMetricsTests verify:
- … JvmStats snapshot data
- … ThreadInfo[]
- close() releases all gauge handles and is idempotent
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.
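The idempotent close() behavior that the description and tests refer to can be sketched as below. This is a minimal illustration assuming the gauge handles are plain Closeable objects; the class name IdempotentCloser is hypothetical, and wrapping IOException in UncheckedIOException is a choice made for the example, not something the PR states.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch: releases its handles at most once, however often close() is called. */
final class IdempotentCloser implements Closeable {
    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final List<Closeable> handles;

    IdempotentCloser(List<Closeable> handles) {
        this.handles = List.copyOf(handles); // defensive, immutable copy
    }

    @Override
    public void close() {
        // compareAndSet succeeds only for the first caller; later calls return immediately.
        if (!closed.compareAndSet(false, true)) {
            return;
        }
        for (Closeable handle : handles) {
            try {
                handle.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
```

Using compareAndSet rather than a plain boolean check means two threads racing to close cannot both release the handles.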