Fix MaxExternalSpansPerTest cap bypass when Activity.Parent chain is broken#5352
Fix MaxExternalSpansPerTest cap bypass when Activity.Parent chain is broken#5352
Conversation
Up to standards ✅🟢 Issues
|
| Metric | Results |
|---|---|
| Complexity | 8 |
TIP This summary will be updated as you push new changes. Give us feedback
…broken When libraries like Npgsql use async connection pooling, Activity.Parent may be null/broken by the time OnActivityStopped fires, causing FindTestCaseAncestor to return null and external spans to bypass the cap. Three-layer fix: 1. Register test case span IDs on ActivityStarted (before children stop) 2. Fallback in FindTestCaseAncestor to check ParentSpanId against known test case span IDs when Activity.Parent walk fails 3. Per-trace cap as ultimate safety net for completely broken chains Agent-Logs-Url: https://github.com/thomhurst/TUnit/sessions/1c7f9e0c-eca2-4452-a1b9-b94f99ac238c Co-authored-by: thomhurst <30480171+thomhurst@users.noreply.github.com>
There was a problem hiding this comment.
Code Review
This is a well-reasoned fix for a real-world bug (Npgsql async pooling breaking Activity.Parent chains) with clear documentation. The three-layer defense strategy is sound. A few issues worth addressing:
Issue 1: Memory leak in _testCaseSpanIds (and _externalSpanCountsByTrace)
_testCaseSpanIds is populated in OnActivityStarted but never cleaned up. For large test runs (hundreds/thousands of tests), this dictionary accumulates span ID strings indefinitely for the lifetime of the collector.
The fix is straightforward — since ActivityStopped fires for parents after all children have stopped, it's safe to remove the entry when the test case span stops:
private void OnActivityStopped(Activity activity)
{
// ... existing logic ...
// Cleanup: remove test case span from tracking set once it stops.
// All child spans will have already stopped by this point.
if (isTUnit && activity.GetTagItem("tunit.test.node_uid") is not null)
{
_testCaseSpanIds.TryRemove(activity.SpanId.ToString(), out _);
// Also clean up the trace-level fallback counter
_externalSpanCountsByTrace.TryRemove(activity.TraceId.ToString(), out _);
}
}The same applies to _externalSpanCountsByTrace — it grows forever too. Note that the existing _externalSpanCountsByTest has the same issue, so this is a pre-existing pattern, but adding a third unbounded dictionary amplifies the concern.
Issue 2: MaxExternalSpansPerTest reused for a per-trace concept
The constant MaxExternalSpansPerTest is semantically "per test" but is reused as the per-trace fallback cap:
var count = _externalSpanCountsByTrace.AddOrUpdate(traceId, 1, (_, c) => c + 1);
if (count > MaxExternalSpansPerTest) // ← misleading name hereConsider a dedicated constant or renaming:
private const int MaxExternalSpansPerTrace = 100; // fallback cap when test association is unknownThis makes the intent clearer and allows tuning them independently if needed.
Issue 3: Fallback only resolves one level, not the full ancestor chain
The new ParentSpanId fallback checks only if the direct parent is a known test case span. If a span's chain is broken two levels up (e.g., span → middleware span → test case, where both parent references are null), it won't be associated with the test and falls into the trace-level cap.
This is likely acceptable for the Npgsql case (which breaks the immediate parent reference), but the comment should note this limitation explicitly:
// Fallback: check if the direct ParentSpanId is a known test case span.
// Note: only one level — deeper broken chains fall through to the per-trace cap.
// This handles Npgsql async pooling where the direct parent reference is broken
// but W3C ParentSpanId is still correct.Issue 4: Double-counting risk allows 2× the intended max per trace
Spans in a trace can be counted in both _externalSpanCountsByTest (those that resolve) AND _externalSpanCountsByTrace (those that don't). Both caps are independent with the same limit. So a single trace could accept up to 2 × MaxExternalSpansPerTest = 200 external spans — potentially still enough to cause report bloat for extreme cases.
This is minor for now, but if the goal is a hard cap per trace, consider a combined strategy.
Overall
The fix correctly addresses the reported crash and is well-documented. The three-layer approach is a pragmatic defensive strategy. The main concerns are the memory leak (Issue 1 — should be fixed) and naming clarity (Issue 2 — easy win). Issues 3 and 4 are informational.
|
@copilot read the latest review on the pr |
…t, clarify fallback comment - Clean up _testCaseSpanIds and _externalSpanCountsByTrace when test case span stops (children stop before parents, so cleanup is safe) - Add separate MaxExternalSpansPerTrace constant for the per-trace fallback cap instead of reusing MaxExternalSpansPerTest - Update FindTestCaseAncestor comment to note single-level limitation Agent-Logs-Url: https://github.com/thomhurst/TUnit/sessions/e846703a-5b5a-4f60-afc3-cffadcc724c5 Co-authored-by: thomhurst <30480171+thomhurst@users.noreply.github.com>
Addressed all three actionable issues from the review in 947392d:
|
There was a problem hiding this comment.
Follow-up Review (after 947392d)
The previous review identified four issues. The Copilot agent has addressed Issues 1, 2, and 3 in commit 947392d. Here is the updated assessment:
Issue 1 (Memory leak) — RESOLVED
_testCaseSpanIds and _externalSpanCountsByTrace are now cleaned up in OnActivityStopped when the test case span stops. The ordering guarantee (children stop before parents) makes this safe.
Issue 2 (Naming) — RESOLVED
MaxExternalSpansPerTrace is now a dedicated constant, separate from MaxExternalSpansPerTest. The intent is now clear and the two limits can be tuned independently.
Issue 3 (Comment clarity) — RESOLVED
The FindTestCaseAncestor fallback comment now explicitly notes the single-level limitation.
Issue 4 (Double-counting / combined trace cap) — Still informational
This remains unaddressed, but is acceptable for now. A single trace could theoretically accept up to MaxExternalSpansPerTest + MaxExternalSpansPerTrace = 200 external spans in the pathological case where some spans resolve correctly and some do not. For the Npgsql scenario described, this is unlikely to cause practical problems, but worth keeping in mind if report bloat resurfaces.
One new observation: _externalSpanCountsByTest has the same pre-existing leak
The cleanup added in 947392d removes entries from _externalSpanCountsByTrace when the test case span stops — but _externalSpanCountsByTest (which existed before this PR) receives the same treatment via the TryRemove call. Wait — actually it does NOT. Looking at the cleanup block:
if (isTUnit && activity.GetTagItem("tunit.test.node_uid") is not null)
{
_testCaseSpanIds.TryRemove(activity.SpanId.ToString(), out _);
_externalSpanCountsByTrace.TryRemove(traceId, out _);
}_externalSpanCountsByTest is not cleaned up here (and was not cleaned up before this PR either). For a long test run, this dictionary accumulates one entry per completed test and is never pruned. This is a pre-existing issue, but it would be a low-hanging improvement to also add _externalSpanCountsByTest.TryRemove(activity.SpanId.ToString(), out _) to the same cleanup block, since the test case span ID is available at that point.
Overall
The substantive concerns from the first review are resolved. The fix is well-reasoned, well-documented, and now includes proper cleanup. The remaining items (Issue 4 and the _externalSpanCountsByTest pre-existing leak) are minor and non-blocking. This is ready to merge.
|
@copilot read the newest review |
…re-existing leak Agent-Logs-Url: https://github.com/thomhurst/TUnit/sessions/99eadeb1-f26a-4f94-8341-dc05e1ad3fe9 Co-authored-by: thomhurst <30480171+thomhurst@users.noreply.github.com>
The follow-up review confirmed all substantive issues are resolved. It also noted a pre-existing leak in |
FindTestCaseAncestorwalksactivity.Parentin-memory to find the owning test case span. Libraries like Npgsql with async connection pooling break this chain —Activity.Parentis null by the timeOnActivityStoppedfires, so spans fall into the uncapped infrastructure bucket. This grew HTML reports to 121+ MB, crashing withSystem.OverflowExceptioninConsole.WriteLine.Three-layer defense in
ActivityCollector:OnActivityStartedcallback — pre-registers test case span IDs in_testCaseSpanIdsso they're available before child spans stop (children stop before parents)FindTestCaseAncestorParentSpanId fallback — whenActivity.Parentwalk fails, checksactivity.ParentSpanIdagainst pre-registered test case span IDs (W3C trace context is intact even when the object reference is broken)traceIdvia_externalSpanCountsByTraceinstead of leaving them uncapped