Improve metrics performance and reliability #391
base: santi/grpc_retry_policy
Conversation
Move common validator logic for gRPC to test/common/nsolid-grpc-agent/validators.js. Add a new validator for OTLP Metrics: `checkOTLPMetricsData()`. Add also a helper method `hasThreadAttributes()` to check whether a metric datapoint is a thread or a process. Refactor `test/agents/test-grpc-metrics.mjs` to use these new helpers.
Introduce MetricDataBatch class to consolidate metric collection before export, reducing per-metric overhead and improving batch processing efficiency. Update OTLP/grpc agents to accumulate metrics in batches and dump them into ScopeMetrics during export. This change reduces the number of individual metric exports and improves performance for high-frequency metric generation. Changes: - Introduce MetricDataBatch class for metric accumulation. - Update OTLP agents to use batched metric collection. - Modify test expectations to handle batched datapoints.
Refactor DelegateAsyncExport to accept generic stub, event, and response types as template parameters, enabling reuse across different gRPC services beyond the original NSolidService. This generalization prepares the codebase for supporting the OTLP MetricsService exporter while maintaining backward compatibility with existing event exports. Key changes: - Template DelegateAsyncExport with <Stub, EventT, ResponseT> params. - Update GrpcAsyncCallData to use generic ResponseType. - Change callback signature from bool return to void return. - Update all existing DelegateAsyncExport call sites. - Remove unnecessary return true statements from callbacks.
Allow RingBuffer to resize at runtime by moving the newest entries into a freshly allocated buffer and resetting head/tail bookkeeping. Require strictly positive capacities via CHECK_GT and drop the const on capacity_ so it can reflect the current size. Add gtest coverage for all resizing scenarios (no-op, grow empty/full, shrink with and without wraparound, death test for resize(0)) to prevent regressions.
Introduce GrpcMetricsExporter that uses DelegateAsyncExport + AsyncTSQueue to flush MetricDataBatch instances off the main thread and retry transient gRPC failures with backpressure. Add NSOLID_METRICS_BUFFER_SIZE option (plumbed through lib/nsolid.js, reconfigure proto/pb.cc/pb.h, and node.gyp) plus new test-grpc-metrics-buffer-size.mjs and retry coverage. It defaults to 100 elements, following the ZmqAgent behavior. Wire GrpcAgent config/reconfigure paths to construct the exporter once and call resize_buffer on buffer-size changes, ensuring the new queue depth propagates without losing metrics.
Important: Review skipped. Auto reviews are disabled on base/target branches other than the default branch; check the settings in the CodeRabbit UI.
Walkthrough: The PR refactors the gRPC agent's metrics export system from immediate per-call exports to a batched, buffered approach. It introduces a new GrpcMetricsExporter class, extends the reconfigure proto with metrics batch and buffer size fields, adds dynamic RingBuffer resizing, and includes comprehensive test coverage for validation.
Sequence Diagram
sequenceDiagram
participant Agent as GrpcAgent
participant Batch as MetricDataBatch
participant Exporter as GrpcMetricsExporter
participant Queue as Metrics Queue
participant gRPC as gRPC Stub
rect rgb(200, 220, 255)
Note over Agent,gRPC: Metric Collection & Batching
Agent->>Batch: AddDataPoint(metric_data)
Batch->>Batch: IncrementPoints()
Note over Batch: total_points < max_points
Agent->>Batch: ShouldFlush()?
Batch-->>Agent: false (below threshold)
Agent->>Batch: AddDataPoint(next_metric)
end
rect rgb(200, 255, 220)
Note over Agent,gRPC: Batch Threshold Reached or Timer Fired
Agent->>Batch: DumpMetricsAndReset()
Batch-->>Agent: ResourceMetrics
Agent->>Exporter: enqueue(metrics)
Exporter->>Queue: push(metrics)
Exporter->>Exporter: export_current()
end
rect rgb(255, 220, 200)
Note over Agent,gRPC: Async Export in Flight
Exporter->>gRPC: DelegateAsyncExport(ExportMetricsServiceRequest)
Note over gRPC: Async operation
gRPC-->>Exporter: on_metrics_export_complete(Status)
Exporter->>Exporter: Pop from queue if success
Exporter->>Exporter: export_current() (next batch)
end
rect rgb(255, 200, 220)
Note over Agent,gRPC: Pause/Flush Path
Agent->>Agent: metrics_paused_ = true
Agent->>Batch: DumpMetricsAndReset()
Agent->>Exporter: enqueue(remaining_metrics)
Agent->>Exporter: flush()
Note over Exporter: Ensures all queued metrics exported
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
agents/otlp/src/otlp_common.cc (1)
297-400: http_client summary quantiles are swapped (p50 vs p99)
In fill_env_metrics(), the http_client summary currently does:
{{ 0.5, stor.http_client99_ptile }, { 0.99, stor.http_client_median }},
All other summaries (GC, DNS, http_server) map 0.5 -> median and 0.99 -> 99th percentile. Here those two are reversed, so exported http_client metrics will report the 99th percentile under the 0.5 quantile and vice versa. Recommend swapping the fields:
- {{ 0.5, stor.http_client99_ptile },
-  { 0.99, stor.http_client_median }},
+ {{ 0.5, stor.http_client_median },
+  { 0.99, stor.http_client99_ptile }},
🧹 Nitpick comments (8)
src/nsolid/nsolid_util.h (1)
226-233: RingBuffer resize + move-push semantics look correct; only a minor readability nit
The ctor check for s > 0, the move-based push(T&&), and the resize() logic (preserve order on grow, keep the newest elements on shrink, reset head/tail/size) all look consistent and are well covered by the new tests. If you want to polish later, you could hoist start_idx out of the for loop in resize() (it is loop-invariant) and/or factor the copy logic into a small helper to make future changes safer, but there is no functional issue here.
Also applies to: 266-274, 276-312, 315-320
lib/nsolid.js (2)
3-16: Primordials Number shadowing the global Number trips lint; consider renaming
You're pulling Number from primordials, which is fine semantically, but it shadows the global Number and triggers noShadowRestrictedNames. If you want to keep linters quiet and reduce ambiguity, consider renaming the binding (e.g., PrimordialsNumber) and using that instead.
672-723: JS config validation for metricsBatchSize/metricsBufferSize is permissive (silent ignore)
The updateConfig() branch for metricsBatchSize/metricsBufferSize correctly restricts JS-provided values to positive finite numbers, but invalid values are just skipped rather than rejected. If callers misconfigure (e.g., pass 0, -1, NaN, or Infinity), they'll get the previous or default value with no signal. If the intent per the PR description is "validate and reject bad inputs", consider either:
- Throwing on invalid values, or
- At least logging a debug/warning so misconfigurations are discoverable.
As-is it's functionally safe but a bit opaque from an API standpoint.
test/agents/test-grpc-metrics-buffer-size.mjs (1)
59-67: Avoid an async timeout handler so the assertion is not just an unhandled rejection
Using an async function in setTimeout means assert.fail() only rejects the returned Promise, which the timer machinery ignores. Depending on unhandledRejection behavior, this may not reliably fail the test. You can keep the semantics but make the failure synchronous by dropping async/await and treating shutdown/close as best-effort:
- const timeoutId = setTimeout(async () => {
-   console.log('Timeout reached - test failed!');
-   await child.shutdown(0);
-   if (grpcServer) {
-     grpcServer.close();
-   }
-
-   assert.fail(`Test timeout: Expected at least ${targetMetrics} metrics exports with buffer size ${bufferSize}, but got ${metricsReceived}`);
- }, 5000);
+ const timeoutId = setTimeout(() => {
+   console.log('Timeout reached - test failed!');
+   child.shutdown(0);
+   if (grpcServer) {
+     grpcServer.close();
+   }
+   assert.fail(
+     `Test timeout: Expected at least ${targetMetrics} metrics ` +
+     `exports with buffer size ${bufferSize}, but got ${metricsReceived}`
+   );
+ }, 5000);
test/common/nsolid-grpc-agent/index.js (1)
26-27: Consider treating 0 as a valid explicit port
start(cb, port = null) currently computes:
const args = port ? [port.toString()] : [];
This treats port === 0 as "no explicit port", which may be surprising if you ever want to force the server to bind to port 0. If that's a valid use case, consider an explicit null/undefined check instead:
const args = (port === null || port === undefined) ? [] : [String(port)];
If you never intend to pass 0, the current code is fine; this is just a small robustness tweak.
agents/grpc/src/grpc_metrics_exporter.cc (1)
55-60: Clarify exporter-level retry behavior for failed exports
The queue + in_flight_ logic guarantees at most one export in flight and pops only on status.ok(), leaving failed items at the front:
- enqueue() triggers export_current() only when !in_flight_.
- export_current() sets in_flight_ = true only when DelegateAsyncExport returns an OK immediate status.
- on_metrics_export_complete() pops and chains to export_current() only on success; on error, it just clears in_flight_.
This means a permanently failing front element can pin the queue until some higher-level retry/flush path decides what to do, and newer metrics can be retried repeatedly behind it. That may be intentional (to preserve ordering and allow metrics_retry_cb_ to drive retries), but it would be good to:
- Confirm this matches the intended backoff/retry policy at the GrpcAgent layer.
- Consider tracking a per-item retry count or deciding when to drop/skip entries to avoid long-term queue pinning under repeated failures.
Also applies to: 62-108, 118-124
test/agents/test-grpc-metrics-batch.mjs (1)
136-185: runTest always sets phaseExpectFlushed = true, making phaseExpectedBatchSize effectively unused
In runTest, all phases set:
let phaseExpectedBatchSize; let phaseExpectFlushed; // Phase 1-4: phaseExpectedBatchSize = ...; phaseExpectFlushed = true;
But in checkOTLPMetricsData, when expectFlushedBatch is true, the datapoint count is validated only against [min, max] (or expectFlushRange) and ignores expectedBatchSize. That means:
- The "partial" vs "flushed" phase comments and initialBatchSize/newBatchSize values are not actually asserted.
- All 8 exports are only checked to fall within the same generic range, which weakens the tests for the steady-state batch behavior.
Consider:
- Using phaseExpectFlushed = false for phases where you want to assert an exact batchSize, and true only for the real flush range, or
- Passing a different expectFlushRange per phase if the intention is to tolerate partial batches everywhere.
This would align the test logic with the comments and make regressions in batch sizing easier to catch.
Also applies to: 189-197, 213-236
agents/otlp/src/otlp_common.h (1)
47-88: Tighten MetricDataBatch invariants around max_points_ and construction
MetricDataBatch's interface is clean and fits the batching use-case well. Two small robustness nits:
Resize(std::size_t limit) directly assigns max_points_ = limit; with no validation. Given higher layers already validate metricsBatchSize > 0, this is probably safe, but you may want to either:
- Treat limit == 0 explicitly as "no auto-flush" and document that, or assert(limit > 0) (or similar) so misuse is caught close to the source.
- Consider having the constructor delegate to Resize(limit) so the limit semantics live in one place.
None of this blocks the current change, but it would make the batching primitive more self-contained and harder to misuse in future call sites.
Also applies to: 89-102, 112-121
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (25)
agents/grpc/proto/reconfigure.proto (1 hunks)
agents/grpc/src/grpc_agent.cc (20 hunks)
agents/grpc/src/grpc_agent.h (6 hunks)
agents/grpc/src/grpc_client.h (4 hunks)
agents/grpc/src/grpc_metrics_exporter.cc (1 hunks)
agents/grpc/src/grpc_metrics_exporter.h (1 hunks)
agents/grpc/src/proto/reconfigure.pb.cc (18 hunks)
agents/grpc/src/proto/reconfigure.pb.h (4 hunks)
agents/otlp/src/otlp_common.cc (14 hunks)
agents/otlp/src/otlp_common.h (3 hunks)
agents/otlp/src/otlp_metrics.cc (2 hunks)
lib/nsolid.js (4 hunks)
node.gyp (1 hunks)
src/nsolid/nsolid_util.h (3 hunks)
test/agents/test-grpc-metrics-batch.mjs (1 hunks)
test/agents/test-grpc-metrics-buffer-size.mjs (1 hunks)
test/agents/test-grpc-metrics-retry.mjs (1 hunks)
test/agents/test-grpc-metrics.mjs (2 hunks)
test/agents/test-otlp-grpc-metrics.mjs (3 hunks)
test/agents/test-otlp-metrics.mjs (1 hunks)
test/cctest/test_nsolid_ring_buffer.cc (1 hunks)
test/common/nsolid-grpc-agent/index.js (3 hunks)
test/common/nsolid-grpc-agent/server.mjs (2 hunks)
test/common/nsolid-grpc-agent/validators.js (1 hunks)
test/parallel/test-nsolid-start-metrics-buffer.js (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-08T14:47:34.724Z
Learnt from: santigimeno
Repo: nodesource/nsolid PR: 339
File: test/addons/nsolid-elf-utils/nsolid-elf-utils.js:16-18
Timestamp: 2025-07-08T14:47:34.724Z
Learning: In the nsolid test suite, external tool invocations (e.g., execSync to run `readelf`) are intentionally left uncaught so that any failure causes the test to fail rather than being skipped.
Applied to files:
test/agents/test-grpc-metrics-retry.mjs
test/agents/test-grpc-metrics-buffer-size.mjs
📚 Learning: 2025-10-27T13:06:17.506Z
Learnt from: santigimeno
Repo: nodesource/nsolid PR: 0
File: :0-0
Timestamp: 2025-10-27T13:06:17.506Z
Learning: In the nodesource/nsolid repository, when reviewing PRs that add or modify V8 patches (e.g., in deps/v8/patches/), focus review comments on the patch management infrastructure (configure.py, apply-patches.sh, README.md, etc.) and not on the V8 code changes contained within the .patch files themselves, as the patch content is out of scope for the patch management PR.
Applied to files:
node.gyp
📚 Learning: 2025-07-08T14:48:04.827Z
Learnt from: santigimeno
Repo: nodesource/nsolid PR: 339
File: test/addons/nsolid-elf-utils/binding.cc:13-28
Timestamp: 2025-07-08T14:48:04.827Z
Learning: In nsolid test native addons (e.g., `test/addons/*/binding.cc`), additional JavaScript-facing argument validation is typically omitted because the tests supply well-formed inputs.
Applied to files:
test/agents/test-grpc-metrics-buffer-size.mjs
test/common/nsolid-grpc-agent/validators.js
🪛 Biome (2.1.2)
lib/nsolid.js
[error] 9-9: Do not shadow the global "Number" property.
Consider renaming this variable. It's easy to confuse the origin of variables when they're named after a known global.
(lint/suspicious/noShadowRestrictedNames)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: test-tarball-linux
- GitHub Check: coverage-windows
- GitHub Check: test-linux (ubuntu-24.04-arm)
- GitHub Check: test-macOS
🔇 Additional comments (24)
test/cctest/test_nsolid_ring_buffer.cc (1)
41-188: Good coverage of RingBuffer resize semanticsThe new tests exercise same-capacity, grow/shrink (with and without wrap-around), and the death-on-zero path, and they match the implemented
resize() behavior (keeping the newest N elements when shrinking). This gives solid protection against regressions.
agents/otlp/src/otlp_common.cc (1)
3-11: Metric helpers + MetricDataBatch::AddDataPoint wiring look sound
The new add_counter/add_gauge/add_summary helpers correctly construct SumPointData, LastValuePointData, and SummaryPointData and feed them through MetricDataBatch::AddDataPoint() with appropriate instrument types and temporalities. AddDataPoint()'s name-based dedup into metrics_ also aligns with batching by metric name. As long as MetricDataBatch's internal point-counting/limit logic is consistent with how IncrementPoints() and ShouldFlush() are implemented, this structure should behave as intended. It would be good to double-check that MetricDataBatch's internal point counter is incremented in a way that actually reflects the number of PointDataAttributes added (either inside AddDataPoint or via IncrementPoints()), so batch limits are enforced as expected.
Also applies to: 30-45, 80-101, 103-123, 125-144, 146-171
agents/grpc/src/proto/reconfigure.pb.cc (1)
29-52: Generated proto wiring for metricsBatchSize/metricsBufferSize looks consistent
The new metricsbatchsize_ and metricsbuffersize_ fields are threaded through Impl_ initialization, offset/has-bits tables, descriptor text, parsing tables, Clear/Merge/Swap, serialization, and size computation in the same pattern as existing optional scalar fields. Assuming this file is fully regenerated from the updated reconfigure.proto (and not hand-edited), it looks good to go. If not already done, please confirm this file (and the corresponding .pb.h) were regenerated with your canonical protoc/toolchain version so future proto changes don't get out of sync.
Also applies to: 104-144, 147-200, 344-408, 410-438, 449-478, 590-603, 668-684, 728-745, 767-773
agents/grpc/src/proto/reconfigure.pb.h (1)
233-234: LGTM! Generated protobuf code is consistent. The additions for
metricsBatchSize and metricsBufferSize follow standard protobuf patterns with correct field numbering, has-bit masks, and accessor implementations.
Also applies to: 393-414, 420-420, 454-455, 1188-1242
node.gyp (1)
449-450: LGTM! New source files properly added to build. The GrpcMetricsExporter source files are correctly added to the build in the appropriate section.
agents/grpc/proto/reconfigure.proto (1)
20-21: LGTM! Proto schema extended correctly. The new optional
metricsBatchSize and metricsBufferSize fields are appropriately typed as uint32 and numbered sequentially, maintaining backward compatibility.
18-18: LGTM! Port handling enhances test flexibility. The updated server now supports configurable port binding with proper ephemeral port allocation (default 0) and reports the actual bound port, enabling deterministic test scenarios while maintaining backward compatibility.
Also applies to: 216-229, 231-231
agents/otlp/src/otlp_metrics.cc (2)
107-118: LGTM! Process metrics correctly use batching. The refactor from
std::vector<MetricData> to MetricDataBatch with DumpMetricsAndReset() aligns with the batched export architecture, reducing per-metric overhead.
120-135: LGTM! Thread metrics batching follows the same pattern. Consistent batching approach for thread metrics using
MetricDataBatch and DumpMetricsAndReset(), maintaining parallelism with the process metrics implementation.
test/agents/test-grpc-metrics.mjs (2)
5-5: LGTM! Test refactor improves maintainability. Consolidating OTLP metrics validation into the reusable
checkOTLPMetricsData helper reduces duplication and improves test clarity without changing validation coverage.
Also applies to: 23-23
test/parallel/test-nsolid-start-metrics-buffer.js (2)
1-44: LGTM! Comprehensive validation test coverage.The test thoroughly validates numeric inputs, string parsing, invalid value rejection, and reset behavior for the new metrics configuration fields.
24-37: Review comment is verified and valid. The validation logic at lines 713-719 of
lib/nsolid.js correctly rejects 0 (and other invalid values) through the condition if (NumberIsFinite(value) && value > 0). Since 0 fails the value > 0 check, it is not assigned to the config and the previous value persists, which aligns with the test expectations that config values remain unchanged when invalid inputs are provided.
13-96: LGTM! Retry test logic is well-structured. The test effectively validates GrpcMetricsExporter retry behavior with:
- Clear state tracking (metricsReceived, serverKilled, serverRestarted)
- Appropriate timing (5x interval wait, 5s timeout)
- Thread vs process metrics differentiation via
hasThreadAttributes
- Reasonable thresholds (≥2 metrics before restart, ≥8 total)
test/agents/test-grpc-metrics-buffer-size.mjs (1)
12-55: Solid coverage of metrics buffering behavior. The test wiring (per-bufferSize env, GRPCServer/TestClient harness, validator usage, and early shutdown on
targetMetrics) looks consistent with the rest of the suite and should give good signal on buffer sizing behavior.
Also applies to: 72-94
test/agents/test-otlp-metrics.mjs (1)
491-522: Thread-aware datapoint selection looks correct. The updated
checkMetricsData correctly handles multiple datapoints per metric by:
- Selecting the point matching
context.threadId,- Verifying
thread.name(main vs worker),- Consuming the datapoint and only dropping the metric once all datapoints are used.
This lines up with the batching behavior and keeps the state machine over
metrics/expectedcoherent.test/agents/test-otlp-grpc-metrics.mjs (2)
489-527: Consistent per-thread datapoint handling for gRPC metrics. The new logic to:
- Locate the datapoint by
thread.id,- Assert
thread.namefor main vs worker,- Splice out the consumed datapoint and defer metric removal until all points are consumed,
keeps the gRPC test aligned with the HTTP OTLP variant and with the batching semantics.
592-617: Confirmmetric.datashape inupdateStateAndExpected
updateStateAndExpected()infers whether the batch contains thread metrics via:const hasThreadMetrics = context.metrics.some((metric) => { if (!metric || !metric[metric.data] || !metric[metric.data].dataPoints || !metric[metric.data].dataPoints[0] || !metric[metric.data].dataPoints[0].attributes) { return false; } const attributes = metric[metric.data].dataPoints[0].attributes; return attributes.some((attr) => attr.key === 'thread.id'); });This assumes each
metriccarries a.datadiscriminator that names the active field ('sum','gauge','summary', etc.). If the actual objects only exposemetric.sum/metric.gauge/metric.summary(as in the commented example),metric.datawould beundefined, andhasThreadMetricswould remain false, breaking the state detection.Please double‑check the OTLP gRPC metric object shape produced by
OTLPGRPCServerand confirm that:
metric.datais indeed present and holds the active data field name, or- Otherwise, this should be refactored to derive the active field another way (e.g., by checking known keys or reusing the
aggregationinformation).Also worth confirming that by the time a batch with thread metrics arrives,
context.threadListhas been populated with all relevant thread IDs so theassert.ok(context.threadList.length > 0, 'No more threads available');path cannot trigger spuriously.agents/grpc/src/grpc_metrics_exporter.h (1)
33-69: GrpcMetricsExporter interface looks well-scoped
The exporter's public surface (init, enqueue, flush, resize_buffer) together with the RingBuffer queue and AsyncTSQueue<grpc::Status> for completions is clean and matches the intended batched, off-thread export model. Deleted copy ctor/assignment and enable_shared_from_this are appropriate for an async, shared-ownership component.
test/common/nsolid-grpc-agent/index.js (1)
9-16: Centralizing validator logic via validators.js is a good cleanup
Re-using checkExitData, checkResource, checkOTLPMetricsData, hasThreadAttributes, expectedProcMetrics, and expectedThreadMetrics from validators.js and re-exporting them here keeps the gRPC test harness thin and avoids duplication across tests.
Also applies to: 532-538
agents/grpc/src/grpc_metrics_exporter.cc (2)
38-50: Ensure init() is always called after constructing a shared_ptr
init() uses weak_from_this() and initializes metrics_completion_q_. If GrpcMetricsExporter is ever used (enqueue/flush) before init() runs, or constructed outside a shared_ptr, this will either throw (bad_weak_ptr) or dereference a null metrics_completion_q_. Please confirm all call sites always construct via std::make_shared<GrpcMetricsExporter> and immediately call init() before any other method.
25-36: Overall async export path looks solid
Constructor wiring (metrics_service_stub_, bounded metrics_q_), arena allocation for each request, and the use of AsyncTSQueue + weak pointers to deliver completion results back to on_metrics_export_complete() all look consistent and safe for the single-threaded agent loop model.
Also applies to: 74-83, 85-101
agents/grpc/src/grpc_agent.h (1)
153-154: Confirm initialization and lifecycle of new metrics batching/exporter fields
The new metrics fields (otlp_grpc_client_, otlp_grpc_client_options_, metrics_paused_, proc_metrics_batch_, thr_metrics_batch_, metrics_exporter_, plus kMaxMetricsExportRetries, metrics_retry_cb_, and on_metrics_timer()) are a good fit for the new batched/exporter design. Please double-check in grpc_agent.cc that:
- metrics_paused_ is explicitly initialized (likely false) and correctly toggled on reconfigure.
- proc_metrics_batch_/thr_metrics_batch_ are constructed with the intended batch size and resized on metricsBatchSize changes.
- metrics_exporter_ is always created with the same loop as the agent thread and init() is called before use.
- kMaxMetricsExportRetries and metrics_retry_cb_ are actually used to bound retries and avoid unbounded requeueing.
Also applies to: 190-195, 266-267, 324-328, 338-349
test/agents/test-grpc-metrics-batch.mjs (1)
12-18: Static batch + pause/flush test flow looks reasonable
The runSimpleTest harness (server/client wiring, metrics listener with mustCall, and optional pauseMetricsAt reconfigure) provides good coverage for static batching and pause-flush behavior. The guard on both process and thread metrics before shutdown, plus the explicit expected export count, looks sane.
Also applies to: 31-37, 38-87, 89-107
test/common/nsolid-grpc-agent/validators.js (1)
154-171: Re-check use of dataPoint.value vs. asInt/asDouble in OTLP datapoints
In checkOTLPMetricsData you assert:
assert.strictEqual(dataPoint.value, expectedMetric[2]); // expectedMetric[2] is 'asInt' / 'asDouble'
while also validating the numeric fields via:
if (type === 'asInt') { validateNumber(parseInt(dataPoint[type], 10), `${name}.${type}`); } else { validateNumber(dataPoint[type], `${name}.${type}`); }
For standard OTLP NumberDataPoint JSON, you'd typically have asInt/asDouble (or similar) but not a separate value string field indicating which one is set. If dataPoint.value is not actually present or doesn't contain 'asInt'/'asDouble', these assertions will systematically fail even though the numeric fields are correct. Suggested adjustment:
- Drop or relax the dataPoint.value === expectedMetric[2] assertions, and rely on checking the presence and type of dataPoint.asInt/dataPoint.asDouble (and/or metric.data / aggregation) instead.
Please confirm the exact shape of the decoded OTLP metrics objects produced by your test server and adjust the assertions accordingly so they match the real wire format.
Also applies to: 173-234, 236-263
-  ResourceMetrics data;
-  data.resource_ = otlp::GetResource();
-  std::vector<MetricData> metrics;
   ThreadMetricsStor stor;
   while (agent->thr_metrics_msg_q_.dequeue(stor)) {
-    otlp::fill_env_metrics(metrics, stor, false);
+    otlp::fill_env_metrics(agent->thr_metrics_batch_, stor, false);
     agent->thr_metrics_cache_.insert_or_assign(stor.thread_id, std::move(stor));
   }
-  data.scope_metric_data_ =
-      std::vector<ScopeMetrics>{{otlp::GetScope(), metrics}};
-  auto result = agent->metrics_exporter_->Export(data);
-  Debug("# ThreadMetrics Exported. Result: %d\n", static_cast<int>(result));
+  if (agent->thr_metrics_batch_.ShouldFlush()) {
+    auto metric_data = agent->thr_metrics_batch_.DumpMetricsAndReset();
+    if (!metric_data.empty()) {
+      ResourceMetrics data;
+      data.resource_ = otlp::GetResource();
+      data.scope_metric_data_ =
+          std::vector<ScopeMetrics>{{ otlp::GetScope(), std::move(metric_data) }};
+      agent->metrics_exporter_->enqueue(std::move(data));
+    }
+  }
Avoid null dereference during shutdown in metrics_msg_cb_.
metrics_msg_cb_ always dereferences metrics_exporter_, but do_stop() resets that pointer before closing metrics_msg_. If a thread metrics update lands right as we begin shutting down, the pending async callback will run after the reset and we’ll crash. Please either guard every use of metrics_exporter_ in this callback or defer the reset until after metrics_msg_ (and its queue) are drained/closed. A simple if (!metrics_exporter_) return; ahead of the batching/export path would also work.
Also applies to: 1332-1340
🤖 Prompt for AI Agents
In agents/grpc/src/grpc_agent.cc around lines 918 to 934 (and similarly
1332-1340), the metrics_msg_cb_ callback unconditionally dereferences
metrics_exporter_, which can be reset by do_stop() before the message queue is
closed, causing a null dereference during shutdown; fix this by guarding uses of
metrics_exporter_ at the top of the callback (e.g., if (!metrics_exporter_)
return;) or by checking before enqueueing and before any dereference, ensuring
you return early if metrics_exporter_ is null, or alternatively defer resetting
metrics_exporter_ in do_stop() until after metrics_msg_ and its queue are
drained/closed.
// Metrics batch size
nsolidConfig.metricsBatchSize =
  +process.env.NSOLID_METRICS_BATCH_SIZE ||
  pkgConfig.nsolid.metricsBatchSize ||
  DEFAULT_METRICS_BATCH_SIZE;

// Metrics buffer size
nsolidConfig.metricsBufferSize =
  +process.env.NSOLID_METRICS_BUFFER_SIZE ||
  pkgConfig.nsolid.metricsBufferSize ||
  DEFAULT_METRICS_BUFFER_SIZE;
Env/pkg metricsBatchSize/metricsBufferSize aren’t checked for “finite & >0”
Unlike the JS updateConfig() path, initializeConfig() takes the first truthy of env, pkg, or default:
nsolidConfig.metricsBatchSize =
+process.env.NSOLID_METRICS_BATCH_SIZE ||
pkgConfig.nsolid.metricsBatchSize ||
DEFAULT_METRICS_BATCH_SIZE;
This means values like Infinity or negative numbers from env/pkg can slip through (they're truthy), even though the PR aims to enforce "positive finite" sizes. That's especially risky for buffer sizing and batching.
I’d suggest normalizing this path to the same validation as updateConfig():
- Parse env/pkg to a number.
- Only accept it if NumberIsFinite(value) && value > 0.
- Otherwise fall back to pkg/default.
You can factor this into a small helper to avoid duplication between initializeConfig() and updateConfig().
🤖 Prompt for AI Agents
In lib/nsolid.js around lines 862 to 873, the initializeConfig() assignments for
metricsBatchSize and metricsBufferSize use the first truthy value and therefore
accept Infinity or negative numbers from env/pkg; change this to parse env and
pkg values to numbers and only accept them if Number.isFinite(value) && value >
0, otherwise fall back to pkg or default, and factor the logic into a small
helper (e.g., normalizePositiveFinite(envVal, pkgVal, defaultVal)) so
updateConfig() and initializeConfig() share the same validation.
} else if (key === 'metricsBatchSize' || key === 'metricsBufferSize') {
  const raw = config[key];
  if (typeof raw === 'number' || (typeof raw === 'string' && raw.trim() !== '')) {
    const value = Number(raw);
Should we check if the value is NaN after this conversion?
Force-pushed from a011ab5 to c540acb.
Introduce custom OTLP metrics serialization in gRPC agent, replacing SDK with BatchedMetricData and direct protobuf generation. Provides per-data-point timestamps, custom batching, and improved performance for high-frequency metrics. Add debug logging, fix thread ID type mismatches, implement robust batch processing for partial metrics, and update test expectations to resolve inconsistent processing and hangs.
Require positive finite numbers when metricsBatchSize/metricsBufferSize are set via nsolid.start() or live reconfigure, so odd JS values (true, objects, NaN, etc.) are ignored. Add a test-nsolid-start-metrics-buffer.js test to cover this.
Force-pushed from c540acb to 662c5c5.