Improve metrics performance and reliability #391
base: santi/grpc_retry_policy
Conversation
Move common validator logic for gRPC to test/common/nsolid-grpc-agent/validators.js. Add a new validator for OTLP Metrics: `checkOTLPMetricsData()`. Add also a helper method `hasThreadAttributes()` to check whether a metric datapoint is a thread or a process. Refactor `test/agents/test-grpc-metrics.mjs` to use these new helpers.
Introduce MetricDataBatch class to consolidate metric collection before export, reducing per-metric overhead and improving batch processing efficiency. Update OTLP/grpc agents to accumulate metrics in batches and dump them into ScopeMetrics during export. This change reduces the number of individual metric exports and improves performance for high-frequency metric generation. Changes: - Introduce MetricDataBatch class for metric accumulation. - Update OTLP agents to use batched metric collection. - Modify test expectations to handle batched datapoints.
Refactor DelegateAsyncExport to accept generic stub, event, and response types as template parameters, enabling reuse across different gRPC services beyond the original NSolidService. This generalization prepares the codebase for supporting the OTLP MetricsService exporter while maintaining backward compatibility with existing event exports. Key changes: - Template DelegateAsyncExport with <Stub, EventT, ResponseT> params. - Update GrpcAsyncCallData to use generic ResponseType. - Change callback signature from bool return to void return. - Update all existing DelegateAsyncExport call sites. - Remove unnecessary return true statements from callbacks.
Allow RingBuffer to resize at runtime by moving the newest entries into a freshly allocated buffer and resetting head/tail bookkeeping. Require strictly positive capacities via CHECK_GT and drop the const on capacity_ so it can reflect the current size. Add gtest coverage for all resizing scenarios (no-op, grow empty/full, shrink with and without wraparound, death test for resize(0)) to prevent regressions.
Introduce GrpcMetricsExporter that uses DelegateAsyncExport + AsyncTSQueue to flush MetricDataBatch instances off the main thread and retry transient gRPC failures with backpressure. Add NSOLID_METRICS_BUFFER_SIZE option (plumbed through lib/nsolid.js, reconfigure proto/pb.cc/pb.h, and node.gyp) plus new test-grpc-metrics-buffer-size.mjs and retry coverage. It defaults to 100 elements, following the ZmqAgent behavior. Wire GrpcAgent config/reconfigure paths to construct the exporter once and call resize_buffer on buffer-size changes, ensuring the new queue depth propagates without losing metrics.
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch; check the settings in the CodeRabbit UI.
Walkthrough: The PR refactors the gRPC agent's metrics export system from immediate per-call exports to a batched, buffered approach. It introduces a new GrpcMetricsExporter class, extends the reconfigure proto with metrics batch and buffer size fields, adds dynamic RingBuffer resizing, and includes comprehensive test coverage for validation.
Sequence Diagram
sequenceDiagram
participant Agent as GrpcAgent
participant Batch as MetricDataBatch
participant Exporter as GrpcMetricsExporter
participant Queue as Metrics Queue
participant gRPC as gRPC Stub
rect rgb(200, 220, 255)
Note over Agent,gRPC: Metric Collection & Batching
Agent->>Batch: AddDataPoint(metric_data)
Batch->>Batch: IncrementPoints()
Note over Batch: total_points < max_points
Agent->>Batch: ShouldFlush()?
Batch-->>Agent: false (below threshold)
Agent->>Batch: AddDataPoint(next_metric)
end
rect rgb(200, 255, 220)
Note over Agent,gRPC: Batch Threshold Reached or Timer Fired
Agent->>Batch: DumpMetricsAndReset()
Batch-->>Agent: ResourceMetrics
Agent->>Exporter: enqueue(metrics)
Exporter->>Queue: push(metrics)
Exporter->>Exporter: export_current()
end
rect rgb(255, 220, 200)
Note over Agent,gRPC: Async Export in Flight
Exporter->>gRPC: DelegateAsyncExport(ExportMetricsServiceRequest)
Note over gRPC: Async operation
gRPC-->>Exporter: on_metrics_export_complete(Status)
Exporter->>Exporter: Pop from queue if success
Exporter->>Exporter: export_current() (next batch)
end
rect rgb(255, 200, 220)
Note over Agent,gRPC: Pause/Flush Path
Agent->>Agent: metrics_paused_ = true
Agent->>Batch: DumpMetricsAndReset()
Agent->>Exporter: enqueue(remaining_metrics)
Agent->>Exporter: flush()
Note over Exporter: Ensures all queued metrics exported
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
agents/otlp/src/otlp_common.cc (1)
297-400: http_client summary quantiles are swapped (p50 vs p99)
In fill_env_metrics(), the http_client summary currently does:
{{ 0.5, stor.http_client99_ptile }, { 0.99, stor.http_client_median }},
All other summaries (GC, DNS, http_server) map 0.5 -> median and 0.99 -> 99th percentile. Here those two are reversed, so exported http_client metrics will report the 99th percentile under the 0.5 quantile and vice versa. Recommend swapping the fields:
- {{ 0.5, stor.http_client99_ptile },
-  { 0.99, stor.http_client_median }},
+ {{ 0.5, stor.http_client_median },
+  { 0.99, stor.http_client99_ptile }},
🧹 Nitpick comments (8)
src/nsolid/nsolid_util.h (1)
226-233: RingBuffer resize + move-push semantics look correct; only a minor readability nit
The ctor check for s > 0, the move-based push(T&&), and the resize() logic (preserve order on grow, keep the newest elements on shrink, reset head/tail/size) all look consistent and are well covered by the new tests. If you want to polish later, you could hoist start_idx out of the for loop in resize() (it is loop-invariant) and/or factor the copy logic into a small helper to make future changes safer, but there is no functional issue here.
Also applies to: 266-274, 276-312, 315-320
lib/nsolid.js (2)
3-16: Primordials Number shadowing the global Number trips lint; consider renaming
You're pulling Number from primordials, which is fine semantically, but it shadows the global Number and triggers noShadowRestrictedNames. If you want to keep linters quiet and reduce ambiguity, consider renaming the binding (e.g., PrimordialsNumber) and using that instead.
672-723: JS config validation for metricsBatchSize/metricsBufferSize is permissive (silent ignore)
The updateConfig() branch for metricsBatchSize/metricsBufferSize correctly restricts JS-provided values to positive finite numbers, but invalid values are just skipped rather than rejected. If callers misconfigure (e.g., pass 0, -1, NaN, or Infinity), they'll get the previous or default value with no signal. If the intent per the PR description is "validate and reject bad inputs", consider either:
- Throwing on invalid values, or
- At least logging a debug/warning so misconfigurations are discoverable.
As-is it's functionally safe but a bit opaque from an API standpoint.
test/agents/test-grpc-metrics-buffer-size.mjs (1)
59-67: Avoid an async timeout handler so the assertion is not just an unhandled rejection
Using an async function in setTimeout means assert.fail() only rejects the returned Promise, which the timer machinery ignores. Depending on unhandledRejection behavior, this may not reliably fail the test. You can keep the semantics but make the failure synchronous by dropping async/await and treating shutdown/close as best-effort:
- const timeoutId = setTimeout(async () => {
-   console.log('Timeout reached - test failed!');
-   await child.shutdown(0);
-   if (grpcServer) {
-     grpcServer.close();
-   }
-
-   assert.fail(`Test timeout: Expected at least ${targetMetrics} metrics exports with buffer size ${bufferSize}, but got ${metricsReceived}`);
- }, 5000);
+ const timeoutId = setTimeout(() => {
+   console.log('Timeout reached - test failed!');
+   child.shutdown(0);
+   if (grpcServer) {
+     grpcServer.close();
+   }
+   assert.fail(
+     `Test timeout: Expected at least ${targetMetrics} metrics ` +
+     `exports with buffer size ${bufferSize}, but got ${metricsReceived}`
+   );
+ }, 5000);
test/common/nsolid-grpc-agent/index.js (1)
26-27: Consider treating 0 as a valid explicit port
start(cb, port = null) currently computes:
const args = port ? [port.toString()] : [];
This treats port === 0 as "no explicit port", which may be surprising if you ever want to force the server to bind to port 0. If that's a valid use case, consider an explicit null/undefined check instead:
const args = (port === null || port === undefined) ? [] : [String(port)];
If you never intend to pass 0, the current code is fine; this is just a small robustness tweak.
agents/grpc/src/grpc_metrics_exporter.cc (1)
55-60: Clarify exporter-level retry behavior for failed exports
The queue + in_flight_ logic guarantees at most one export in flight and pops only on status.ok(), leaving failed items at the front:
- enqueue() triggers export_current() only when !in_flight_.
- export_current() sets in_flight_ = true only when DelegateAsyncExport returns an OK immediate status.
- on_metrics_export_complete() pops and chains to export_current() only on success; on error, it just clears in_flight_.
This means a permanently failing front element can pin the queue until some higher-level retry/flush path decides what to do, and newer metrics can be retried repeatedly behind it. That may be intentional (to preserve ordering and allow metrics_retry_cb_ to drive retries), but it would be good to:
- Confirm this matches the intended backoff/retry policy at the GrpcAgent layer.
- Consider tracking a per-item retry count or deciding when to drop/skip entries to avoid long-term queue pinning under repeated failures.
Also applies to: 62-108, 118-124
test/agents/test-grpc-metrics-batch.mjs (1)
136-185: runTest always sets phaseExpectFlushed = true, making phaseExpectedBatchSize effectively unused
In runTest, all phases set:
let phaseExpectedBatchSize; let phaseExpectFlushed; // Phase 1-4: phaseExpectedBatchSize = ...; phaseExpectFlushed = true;
But in checkOTLPMetricsData, when expectFlushedBatch is true, the datapoint count is validated only against [min, max] (or expectFlushRange) and ignores expectedBatchSize. That means:
- The "partial" vs "flushed" phase comments and initialBatchSize/newBatchSize values are not actually asserted.
- All 8 exports are only checked to fall within the same generic range, which weakens the tests for the steady-state batch behavior.
Consider:
- Using phaseExpectFlushed = false for phases where you want to assert an exact batchSize, and true only for the real flush range, or
- Passing a different expectFlushRange per phase if the intention is to tolerate partial batches everywhere.
This would align the test logic with the comments and make regressions in batch sizing easier to catch.
Also applies to: 189-197, 213-236
agents/otlp/src/otlp_common.h (1)
47-88: Tighten MetricDataBatch invariants around max_points_ and construction
MetricDataBatch's interface is clean and fits the batching use-case well. Two small robustness nits:
Resize(std::size_t limit) directly assigns max_points_ = limit; with no validation. Given higher layers already validate metricsBatchSize > 0, this is probably safe, but you may want to either:
- Treat limit == 0 explicitly as "no auto-flush" and document that, or assert(limit > 0) (or similar) so misuse is caught close to the source.
- Consider having the constructor delegate to Resize(limit) so the limit semantics live in one place.
None of this blocks the current change, but it would make the batching primitive more self-contained and harder to misuse in future call sites.
Also applies to: 89-102, 112-121
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
agents/grpc/proto/reconfigure.proto (1 hunks)
agents/grpc/src/grpc_agent.cc (20 hunks)
agents/grpc/src/grpc_agent.h (6 hunks)
agents/grpc/src/grpc_client.h (4 hunks)
agents/grpc/src/grpc_metrics_exporter.cc (1 hunks)
agents/grpc/src/grpc_metrics_exporter.h (1 hunks)
agents/grpc/src/proto/reconfigure.pb.cc (18 hunks)
agents/grpc/src/proto/reconfigure.pb.h (4 hunks)
agents/otlp/src/otlp_common.cc (14 hunks)
agents/otlp/src/otlp_common.h (3 hunks)
agents/otlp/src/otlp_metrics.cc (2 hunks)
lib/nsolid.js (4 hunks)
node.gyp (1 hunks)
src/nsolid/nsolid_util.h (3 hunks)
test/agents/test-grpc-metrics-batch.mjs (1 hunks)
test/agents/test-grpc-metrics-buffer-size.mjs (1 hunks)
test/agents/test-grpc-metrics-retry.mjs (1 hunks)
test/agents/test-grpc-metrics.mjs (2 hunks)
test/agents/test-otlp-grpc-metrics.mjs (3 hunks)
test/agents/test-otlp-metrics.mjs (1 hunks)
test/cctest/test_nsolid_ring_buffer.cc (1 hunks)
test/common/nsolid-grpc-agent/index.js (3 hunks)
test/common/nsolid-grpc-agent/server.mjs (2 hunks)
test/common/nsolid-grpc-agent/validators.js (1 hunks)
test/parallel/test-nsolid-start-metrics-buffer.js (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-08T14:47:34.724Z
Learnt from: santigimeno
Repo: nodesource/nsolid PR: 339
File: test/addons/nsolid-elf-utils/nsolid-elf-utils.js:16-18
Timestamp: 2025-07-08T14:47:34.724Z
Learning: In the nsolid test suite, external tool invocations (e.g., execSync to run `readelf`) are intentionally left uncaught so that any failure causes the test to fail rather than being skipped.
Applied to files:
test/agents/test-grpc-metrics-retry.mjs
test/agents/test-grpc-metrics-buffer-size.mjs
📚 Learning: 2025-10-27T13:06:17.506Z
Learnt from: santigimeno
Repo: nodesource/nsolid PR: 0
File: :0-0
Timestamp: 2025-10-27T13:06:17.506Z
Learning: In the nodesource/nsolid repository, when reviewing PRs that add or modify V8 patches (e.g., in deps/v8/patches/), focus review comments on the patch management infrastructure (configure.py, apply-patches.sh, README.md, etc.) and not on the V8 code changes contained within the .patch files themselves, as the patch content is out of scope for the patch management PR.
Applied to files:
node.gyp
📚 Learning: 2025-07-08T14:48:04.827Z
Learnt from: santigimeno
Repo: nodesource/nsolid PR: 339
File: test/addons/nsolid-elf-utils/binding.cc:13-28
Timestamp: 2025-07-08T14:48:04.827Z
Learning: In nsolid test native addons (e.g., `test/addons/*/binding.cc`), additional JavaScript-facing argument validation is typically omitted because the tests supply well-formed inputs.
Applied to files:
test/agents/test-grpc-metrics-buffer-size.mjs
test/common/nsolid-grpc-agent/validators.js
🪛 Biome (2.1.2)
lib/nsolid.js
[error] 9-9: Do not shadow the global "Number" property.
Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.
(lint/suspicious/noShadowRestrictedNames)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test-tarball-linux
- GitHub Check: coverage-windows
- GitHub Check: test-linux (ubuntu-24.04-arm)
- GitHub Check: test-macOS
🔇 Additional comments (24)
test/cctest/test_nsolid_ring_buffer.cc (1)
41-188: Good coverage of RingBuffer resize semanticsThe new tests exercise same-capacity, grow/shrink (with and without wrap-around), and the death-on-zero path, and they match the implemented
resize() behavior (keeping the newest N elements when shrinking). This gives solid protection against regressions.
agents/otlp/src/otlp_common.cc (1)
3-11: Metric helpers + MetricDataBatch::AddDataPoint wiring look sound
The new add_counter/add_gauge/add_summary helpers correctly construct SumPointData, LastValuePointData, and SummaryPointData and feed them through MetricDataBatch::AddDataPoint() with appropriate instrument types and temporalities. AddDataPoint()'s name-based dedup into metrics_ also aligns with batching by metric name. As long as MetricDataBatch's internal point-counting/limit logic is consistent with how IncrementPoints() and ShouldFlush() are implemented, this structure should behave as intended. It would be good to double-check that MetricDataBatch's internal point counter is incremented in a way that actually reflects the number of PointDataAttributes added (either inside AddDataPoint or via IncrementPoints()), so batch limits are enforced as expected.
Also applies to: 30-45, 80-101, 103-123, 125-144, 146-171
agents/grpc/src/proto/reconfigure.pb.cc (1)
29-52: Generated proto wiring for metricsBatchSize/metricsBufferSize looks consistent
The new metricsbatchsize_ and metricsbuffersize_ fields are threaded through Impl_ initialization, offset/has-bits tables, descriptor text, parsing tables, Clear/Merge/Swap, serialization, and size computation in the same pattern as existing optional scalar fields. Assuming this file is fully regenerated from the updated reconfigure.proto (and not hand-edited), it looks good to go. If not already done, please confirm this file (and the corresponding .pb.h) were regenerated with your canonical protoc/toolchain version so future proto changes don't get out of sync.
Also applies to: 104-144, 147-200, 344-408, 410-438, 449-478, 590-603, 668-684, 728-745, 767-773
agents/grpc/src/proto/reconfigure.pb.h (1)
233-234: LGTM! Generated protobuf code is consistent. The additions for
metricsBatchSize and metricsBufferSize follow standard protobuf patterns with correct field numbering, has-bit masks, and accessor implementations.
Also applies to: 393-414, 420-420, 454-455, 1188-1242
node.gyp (1)
449-450: LGTM! New source files properly added to build. The GrpcMetricsExporter source files are correctly added to the build in the appropriate section.
agents/grpc/proto/reconfigure.proto (1)
20-21: LGTM! Proto schema extended correctly. The new optional
metricsBatchSize and metricsBufferSize fields are appropriately typed as uint32 and numbered sequentially, maintaining backward compatibility.
18-18: LGTM! Port handling enhances test flexibility. The updated server now supports configurable port binding with proper ephemeral port allocation (default 0) and reports the actual bound port, enabling deterministic test scenarios while maintaining backward compatibility.
Also applies to: 216-229, 231-231
agents/otlp/src/otlp_metrics.cc (2)
107-118: LGTM! Process metrics correctly use batching. The refactor from
std::vector<MetricData> to MetricDataBatch with DumpMetricsAndReset() aligns with the batched export architecture, reducing per-metric overhead.
120-135: LGTM! Thread metrics batching follows the same pattern. Consistent batching approach for thread metrics using
MetricDataBatch and DumpMetricsAndReset(), maintaining parallelism with the process metrics implementation.
test/agents/test-grpc-metrics.mjs (2)
5-5: LGTM! Test refactor improves maintainability. Consolidating OTLP metrics validation into the reusable
checkOTLPMetricsData helper reduces duplication and improves test clarity without changing validation coverage.
Also applies to: 23-23
test/parallel/test-nsolid-start-metrics-buffer.js (2)
1-44: LGTM! Comprehensive validation test coverage.The test thoroughly validates numeric inputs, string parsing, invalid value rejection, and reset behavior for the new metrics configuration fields.
24-37: Review comment is verified and valid. The validation logic at lines 713-719 of
lib/nsolid.js correctly rejects 0 (and other invalid values) through the condition if (NumberIsFinite(value) && value > 0). Since 0 fails the value > 0 check, it is not assigned to the config and the previous value persists, which aligns with the test expectations that config values remain unchanged when invalid inputs are provided.
13-96: LGTM! Retry test logic is well-structured. The test effectively validates GrpcMetricsExporter retry behavior with:
- Clear state tracking (metricsReceived, serverKilled, serverRestarted)
- Appropriate timing (5x interval wait, 5s timeout)
- Thread vs process metrics differentiation via
hasThreadAttributes
- Reasonable thresholds (≥2 metrics before restart, ≥8 total)
test/agents/test-grpc-metrics-buffer-size.mjs (1)
12-55: Solid coverage of metrics buffering behavior. The test wiring (per-bufferSize env, GRPCServer/TestClient harness, validator usage, and early shutdown on
targetMetrics) looks consistent with the rest of the suite and should give good signal on buffer sizing behavior.
Also applies to: 72-94
test/agents/test-otlp-metrics.mjs (1)
491-522: Thread-aware datapoint selection looks correct. The updated
checkMetricsData correctly handles multiple datapoints per metric by:
- Selecting the point matching
context.threadId,- Verifying
thread.name(main vs worker),- Consuming the datapoint and only dropping the metric once all datapoints are used.
This lines up with the batching behavior and keeps the state machine over
metrics/expectedcoherent.test/agents/test-otlp-grpc-metrics.mjs (2)
489-527: Consistent per-thread datapoint handling for gRPC metrics. The new logic to:
- Locate the datapoint by
thread.id,- Assert
thread.namefor main vs worker,- Splice out the consumed datapoint and defer metric removal until all points are consumed,
keeps the gRPC test aligned with the HTTP OTLP variant and with the batching semantics.
592-617: Confirmmetric.datashape inupdateStateAndExpected
updateStateAndExpected()infers whether the batch contains thread metrics via:const hasThreadMetrics = context.metrics.some((metric) => { if (!metric || !metric[metric.data] || !metric[metric.data].dataPoints || !metric[metric.data].dataPoints[0] || !metric[metric.data].dataPoints[0].attributes) { return false; } const attributes = metric[metric.data].dataPoints[0].attributes; return attributes.some((attr) => attr.key === 'thread.id'); });This assumes each
metriccarries a.datadiscriminator that names the active field ('sum','gauge','summary', etc.). If the actual objects only exposemetric.sum/metric.gauge/metric.summary(as in the commented example),metric.datawould beundefined, andhasThreadMetricswould remain false, breaking the state detection.Please double‑check the OTLP gRPC metric object shape produced by
OTLPGRPCServerand confirm that:
metric.datais indeed present and holds the active data field name, or- Otherwise, this should be refactored to derive the active field another way (e.g., by checking known keys or reusing the
aggregationinformation).Also worth confirming that by the time a batch with thread metrics arrives,
context.threadListhas been populated with all relevant thread IDs so theassert.ok(context.threadList.length > 0, 'No more threads available');path cannot trigger spuriously.agents/grpc/src/grpc_metrics_exporter.h (1)
33-69: GrpcMetricsExporter interface looks well-scoped
The exporter's public surface (init, enqueue, flush, resize_buffer) together with the RingBuffer queue and AsyncTSQueue<grpc::Status> for completions is clean and matches the intended batched, off-thread export model. Deleted copy ctor/assignment and enable_shared_from_this are appropriate for an async, shared-ownership component.
test/common/nsolid-grpc-agent/index.js (1)
9-16: Centralizing validator logic via validators.js is a good cleanup
Re-using checkExitData, checkResource, checkOTLPMetricsData, hasThreadAttributes, expectedProcMetrics, and expectedThreadMetrics from validators.js and re-exporting them here keeps the gRPC test harness thin and avoids duplication across tests.
Also applies to: 532-538
agents/grpc/src/grpc_metrics_exporter.cc (2)
38-50: Ensure init() is always called after constructing a shared_ptr
init() uses weak_from_this() and initializes metrics_completion_q_. If GrpcMetricsExporter is ever used (enqueue/flush) before init() runs, or constructed outside a shared_ptr, this will either throw (bad_weak_ptr) or dereference a null metrics_completion_q_. Please confirm all call sites always construct via std::make_shared<GrpcMetricsExporter> and immediately call init() before any other method.
25-36: Overall async export path looks solid
Constructor wiring (metrics_service_stub_, bounded metrics_q_), arena allocation for each request, and the use of AsyncTSQueue + weak pointers to deliver completion results back to on_metrics_export_complete() all look consistent and safe for the single-threaded agent loop model.
Also applies to: 74-83, 85-101
agents/grpc/src/grpc_agent.h (1)
153-154: Confirm initialization and lifecycle of new metrics batching/exporter fields
The new metrics fields (otlp_grpc_client_, otlp_grpc_client_options_, metrics_paused_, proc_metrics_batch_, thr_metrics_batch_, metrics_exporter_, plus kMaxMetricsExportRetries, metrics_retry_cb_, and on_metrics_timer()) are a good fit for the new batched/exporter design. Please double-check in grpc_agent.cc that:
- metrics_paused_ is explicitly initialized (likely false) and correctly toggled on reconfigure.
- proc_metrics_batch_/thr_metrics_batch_ are constructed with the intended batch size and resized on metricsBatchSize changes.
- metrics_exporter_ is always created with the same loop as the agent thread and init() is called before use.
- kMaxMetricsExportRetries and metrics_retry_cb_ are actually used to bound retries and avoid unbounded requeueing.
Also applies to: 190-195, 266-267, 324-328, 338-349
test/agents/test-grpc-metrics-batch.mjs (1)
12-18: Static batch + pause/flush test flow looks reasonable
The runSimpleTest harness (server/client wiring, metrics listener with mustCall, and optional pauseMetricsAt reconfigure) provides good coverage for static batching and pause-flush behavior. The guard on both process and thread metrics before shutdown, plus the explicit expected export count, looks sane.
Also applies to: 31-37, 38-87, 89-107
test/common/nsolid-grpc-agent/validators.js (1)
154-171: Re-check use of dataPoint.value vs. asInt/asDouble in OTLP datapoints
In checkOTLPMetricsData you assert:
assert.strictEqual(dataPoint.value, expectedMetric[2]); // expectedMetric[2] is 'asInt' / 'asDouble'
while also validating the numeric fields via:
if (type === 'asInt') { validateNumber(parseInt(dataPoint[type], 10), `${name}.${type}`); } else { validateNumber(dataPoint[type], `${name}.${type}`); }
For standard OTLP NumberDataPoint JSON, you'd typically have asInt/asDouble (or similar) but not a separate value string field indicating which one is set. If dataPoint.value is not actually present or doesn't contain 'asInt'/'asDouble', these assertions will systematically fail even though the numeric fields are correct. Suggested adjustment:
- Drop or relax the dataPoint.value === expectedMetric[2] assertions, and rely on checking the presence and type of dataPoint.asInt/dataPoint.asDouble (and/or metric.data / aggregation) instead.
Please confirm the exact shape of the decoded OTLP metrics objects produced by your test server and adjust the assertions accordingly so they match the real wire format.
Also applies to: 173-234, 236-263
-  ResourceMetrics data;
-  data.resource_ = otlp::GetResource();
-  std::vector<MetricData> metrics;
   ThreadMetricsStor stor;
   while (agent->thr_metrics_msg_q_.dequeue(stor)) {
-    otlp::fill_env_metrics(metrics, stor, false);
+    otlp::fill_env_metrics(agent->thr_metrics_batch_, stor, false);
     agent->thr_metrics_cache_.insert_or_assign(stor.thread_id, std::move(stor));
   }
-  data.scope_metric_data_ =
-      std::vector<ScopeMetrics>{{otlp::GetScope(), metrics}};
-  auto result = agent->metrics_exporter_->Export(data);
-  Debug("# ThreadMetrics Exported. Result: %d\n", static_cast<int>(result));
+  if (agent->thr_metrics_batch_.ShouldFlush()) {
+    auto metric_data = agent->thr_metrics_batch_.DumpMetricsAndReset();
+    if (!metric_data.empty()) {
+      ResourceMetrics data;
+      data.resource_ = otlp::GetResource();
+      data.scope_metric_data_ =
+          std::vector<ScopeMetrics>{{ otlp::GetScope(), std::move(metric_data) }};
+      agent->metrics_exporter_->enqueue(std::move(data));
+    }
+  }
Avoid null dereference during shutdown in metrics_msg_cb_.
metrics_msg_cb_ always dereferences metrics_exporter_, but do_stop() resets that pointer before closing metrics_msg_. If a thread metrics update lands right as we begin shutting down, the pending async callback will run after the reset and we’ll crash. Please either guard every use of metrics_exporter_ in this callback or defer the reset until after metrics_msg_ (and its queue) are drained/closed. A simple if (!metrics_exporter_) return; ahead of the batching/export path would also work.
Also applies to: 1332-1340
🤖 Prompt for AI Agents
In agents/grpc/src/grpc_agent.cc around lines 918 to 934 (and similarly
1332-1340), the metrics_msg_cb_ callback unconditionally dereferences
metrics_exporter_, which can be reset by do_stop() before the message queue is
closed, causing a null dereference during shutdown; fix this by guarding uses of
metrics_exporter_ at the top of the callback (e.g., if (!metrics_exporter_)
return;) or by checking before enqueueing and before any dereference, ensuring
you return early if metrics_exporter_ is null, or alternatively defer resetting
metrics_exporter_ in do_stop() until after metrics_msg_ and its queue are
drained/closed.
// Metrics batch size
nsolidConfig.metricsBatchSize =
  +process.env.NSOLID_METRICS_BATCH_SIZE ||
  pkgConfig.nsolid.metricsBatchSize ||
  DEFAULT_METRICS_BATCH_SIZE;

// Metrics buffer size
nsolidConfig.metricsBufferSize =
  +process.env.NSOLID_METRICS_BUFFER_SIZE ||
  pkgConfig.nsolid.metricsBufferSize ||
  DEFAULT_METRICS_BUFFER_SIZE;
Env/pkg metricsBatchSize/metricsBufferSize aren’t checked for “finite & >0”
Unlike the JS updateConfig() path, initializeConfig() takes the first truthy of env, pkg, or default:
nsolidConfig.metricsBatchSize =
+process.env.NSOLID_METRICS_BATCH_SIZE ||
pkgConfig.nsolid.metricsBatchSize ||
DEFAULT_METRICS_BATCH_SIZE;
This means values like Infinity or negative numbers from env/pkg can slip through (they're truthy), even though the PR aims to enforce "positive finite" sizes. That's especially risky for buffer sizing and batching.
I’d suggest normalizing this path to the same validation as updateConfig():
- Parse env/pkg to a number.
- Only accept it if NumberIsFinite(value) && value > 0.
- Otherwise fall back to pkg/default.
You can factor this into a small helper to avoid duplication between initializeConfig() and updateConfig().
🤖 Prompt for AI Agents
In lib/nsolid.js around lines 862 to 873, the initializeConfig() assignments for
metricsBatchSize and metricsBufferSize use the first truthy value and therefore
accept Infinity or negative numbers from env/pkg; change this to parse env and
pkg values to numbers and only accept them if Number.isFinite(value) && value >
0, otherwise fall back to pkg or default, and factor the logic into a small
helper (e.g., normalizePositiveFinite(envVal, pkgVal, defaultVal)) so
updateConfig() and initializeConfig() share the same validation.
} else if (key === 'metricsBatchSize' || key === 'metricsBufferSize') {
  const raw = config[key];
  if (typeof raw === 'number' || (typeof raw === 'string' && raw.trim() !== '')) {
    const value = Number(raw);
Should we check if the value is NaN after this conversion?
Force-pushed from a011ab5 to c540acb.
Introduce custom OTLP metrics serialization in gRPC agent, replacing SDK with BatchedMetricData and direct protobuf generation. Provides per-data-point timestamps, custom batching, and improved performance for high-frequency metrics. Add debug logging, fix thread ID type mismatches, implement robust batch processing for partial metrics, and update test expectations to resolve inconsistent processing and hangs.
Require positive finite numbers when metricsBatchSize/metricsBufferSize are set via nsolid.start() or live reconfigure, so odd JS values (true, objects, NaN, etc.) are ignored. Add a test-nsolid-start-metrics-buffer.js test to cover this.
Force-pushed from c540acb to 662c5c5.