Skip to content

Fix MaxExternalSpansPerTest cap bypass when Activity.Parent chain is broken#5352

Merged
thomhurst merged 4 commits intomainfrom
copilot/fix-max-external-spans-cap
Apr 3, 2026
Merged

Fix MaxExternalSpansPerTest cap bypass when Activity.Parent chain is broken#5352
thomhurst merged 4 commits intomainfrom
copilot/fix-max-external-spans-cap

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 3, 2026

FindTestCaseAncestor walks activity.Parent in-memory to find the owning test case span. Libraries like Npgsql with async connection pooling break this chain — Activity.Parent is null by the time OnActivityStopped fires, so spans fall into the uncapped infrastructure bucket. This grew HTML reports to 121+ MB, crashing with System.OverflowException in Console.WriteLine.

Three-layer defense in ActivityCollector:

  • OnActivityStarted callback — pre-registers test case span IDs in _testCaseSpanIds so they're available before child spans stop (children stop before parents)
  • FindTestCaseAncestor ParentSpanId fallback — when Activity.Parent walk fails, checks activity.ParentSpanId against pre-registered test case span IDs (W3C trace context is intact even when the object reference is broken)
  • Per-trace cap — when neither lookup resolves to a test case, caps external spans per traceId via _externalSpanCountsByTrace instead of leaving them uncapped
// Before: returns null → uncapped
private static string? FindTestCaseAncestor(Activity activity)
{
    var current = activity.Parent; // null when chain is broken
    while (current is not null) { ... }
    return null; // ← uncapped bucket
}

// After: falls back to W3C ParentSpanId lookup
private string? FindTestCaseAncestor(Activity activity)
{
    var current = activity.Parent;
    while (current is not null) { ... }

    // W3C ParentSpanId survives async context breaks
    if (activity.ParentSpanId != default)
    {
        var parentSpanId = activity.ParentSpanId.ToString();
        if (_testCaseSpanIds.ContainsKey(parentSpanId))
            return parentSpanId;
    }
    return null; // → per-trace cap applied by caller
}

@codacy-production
Copy link
Copy Markdown

codacy-production bot commented Apr 3, 2026

Up to standards ✅

🟢 Issues 0 issues

Results:
0 new issues

View in Codacy

🟢 Metrics 8 complexity

Metric Results
Complexity 8

View in Codacy

TIP This summary will be updated as you push new changes. Give us feedback

…broken

When libraries like Npgsql use async connection pooling, Activity.Parent
may be null/broken by the time OnActivityStopped fires, causing
FindTestCaseAncestor to return null and external spans to bypass the cap.

Three-layer fix:
1. Register test case span IDs on ActivityStarted (before children stop)
2. Fallback in FindTestCaseAncestor to check ParentSpanId against known
   test case span IDs when Activity.Parent walk fails
3. Per-trace cap as ultimate safety net for completely broken chains

Agent-Logs-Url: https://github.com/thomhurst/TUnit/sessions/1c7f9e0c-eca2-4452-a1b9-b94f99ac238c

Co-authored-by: thomhurst <30480171+thomhurst@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix MaxExternalSpansPerTest cap bypass in HtmlReporter Fix MaxExternalSpansPerTest cap bypass when Activity.Parent chain is broken Apr 3, 2026
Copilot AI requested a review from thomhurst April 3, 2026 08:32
@thomhurst thomhurst marked this pull request as ready for review April 3, 2026 09:04
Copilot AI temporarily deployed to Pull Requests April 3, 2026 09:04 Inactive
Copilot AI temporarily deployed to Pull Requests April 3, 2026 09:04 Inactive
Copilot AI temporarily deployed to Pull Requests April 3, 2026 09:04 Inactive
Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This is a well-reasoned fix for a real-world bug (Npgsql async pooling breaking Activity.Parent chains) with clear documentation. The three-layer defense strategy is sound. A few issues worth addressing:


Issue 1: Memory leak in _testCaseSpanIds (and _externalSpanCountsByTrace)

_testCaseSpanIds is populated in OnActivityStarted but never cleaned up. For large test runs (hundreds/thousands of tests), this dictionary accumulates span ID strings indefinitely for the lifetime of the collector.

The fix is straightforward — since ActivityStopped fires for parents after all children have stopped, it's safe to remove the entry when the test case span stops:

private void OnActivityStopped(Activity activity)
{
    // ... existing logic ...
    
    // Cleanup: remove test case span from tracking set once it stops.
    // All child spans will have already stopped by this point.
    if (isTUnit && activity.GetTagItem("tunit.test.node_uid") is not null)
    {
        _testCaseSpanIds.TryRemove(activity.SpanId.ToString(), out _);
        // Also clean up the trace-level fallback counter
        _externalSpanCountsByTrace.TryRemove(activity.TraceId.ToString(), out _);
    }
}

The same applies to _externalSpanCountsByTrace — it grows forever too. Note that the existing _externalSpanCountsByTest has the same issue, so this is a pre-existing pattern, but adding a third unbounded dictionary amplifies the concern.


Issue 2: MaxExternalSpansPerTest reused for a per-trace concept

The constant MaxExternalSpansPerTest is semantically "per test" but is reused as the per-trace fallback cap:

var count = _externalSpanCountsByTrace.AddOrUpdate(traceId, 1, (_, c) => c + 1);
if (count > MaxExternalSpansPerTest)  // ← misleading name here

Consider a dedicated constant or renaming:

private const int MaxExternalSpansPerTrace = 100; // fallback cap when test association is unknown

This makes the intent clearer and allows tuning them independently if needed.


Issue 3: Fallback only resolves one level, not the full ancestor chain

The new ParentSpanId fallback checks only if the direct parent is a known test case span. If a span's chain is broken two levels up (e.g., span → middleware span → test case, where both parent references are null), it won't be associated with the test and falls into the trace-level cap.

This is likely acceptable for the Npgsql case (which breaks the immediate parent reference), but the comment should note this limitation explicitly:

// Fallback: check if the direct ParentSpanId is a known test case span.
// Note: only one level — deeper broken chains fall through to the per-trace cap.
// This handles Npgsql async pooling where the direct parent reference is broken
// but W3C ParentSpanId is still correct.

Issue 4: Double-counting risk allows 2× the intended max per trace

Spans in a trace can be counted in both _externalSpanCountsByTest (those that resolve) AND _externalSpanCountsByTrace (those that don't). Both caps are independent with the same limit. So a single trace could accept up to 2 × MaxExternalSpansPerTest = 200 external spans — potentially still enough to cause report bloat for extreme cases.

This is minor for now, but if the goal is a hard cap per trace, consider a combined strategy.


Overall

The fix correctly addresses the reported crash and is well-documented. The three-layer approach is a pragmatic defensive strategy. The main concerns are the memory leak (Issue 1 — should be fixed) and naming clarity (Issue 2 — easy win). Issues 3 and 4 are informational.

@thomhurst
Copy link
Copy Markdown
Owner

@copilot read the latest review on the pr

…t, clarify fallback comment

- Clean up _testCaseSpanIds and _externalSpanCountsByTrace when test case
  span stops (children stop before parents, so cleanup is safe)
- Add separate MaxExternalSpansPerTrace constant for the per-trace fallback
  cap instead of reusing MaxExternalSpansPerTest
- Update FindTestCaseAncestor comment to note single-level limitation

Agent-Logs-Url: https://github.com/thomhurst/TUnit/sessions/e846703a-5b5a-4f60-afc3-cffadcc724c5

Co-authored-by: thomhurst <30480171+thomhurst@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 3, 2026

@copilot read the latest review on the pr

Addressed all three actionable issues from the review in 947392d:

  1. Memory leak: Added cleanup in OnActivityStopped — removes entries from _testCaseSpanIds and _externalSpanCountsByTrace when the test case span stops (safe since children stop before parents).
  2. Naming: Added dedicated MaxExternalSpansPerTrace constant for the per-trace fallback cap.
  3. Comment clarity: Updated FindTestCaseAncestor fallback comment to explicitly note single-level resolution limitation.

Copy link
Copy Markdown
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up Review (after 947392d)

The previous review identified four issues. The Copilot agent has addressed Issues 1, 2, and 3 in commit 947392d. Here is the updated assessment:


Issue 1 (Memory leak) — RESOLVED

_testCaseSpanIds and _externalSpanCountsByTrace are now cleaned up in OnActivityStopped when the test case span stops. The ordering guarantee (children stop before parents) makes this safe.


Issue 2 (Naming) — RESOLVED

MaxExternalSpansPerTrace is now a dedicated constant, separate from MaxExternalSpansPerTest. The intent is now clear and the two limits can be tuned independently.


Issue 3 (Comment clarity) — RESOLVED

The FindTestCaseAncestor fallback comment now explicitly notes the single-level limitation.


Issue 4 (Double-counting / combined trace cap) — Still informational

This remains unaddressed, but is acceptable for now. A single trace could theoretically accept up to MaxExternalSpansPerTest + MaxExternalSpansPerTrace = 200 external spans in the pathological case where some spans resolve correctly and some do not. For the Npgsql scenario described, this is unlikely to cause practical problems, but worth keeping in mind if report bloat resurfaces.


One new observation: _externalSpanCountsByTest has the same pre-existing leak

The cleanup added in 947392d removes entries from _externalSpanCountsByTrace when the test case span stops — but _externalSpanCountsByTest (which existed before this PR) receives the same treatment via the TryRemove call. Wait — actually it does NOT. Looking at the cleanup block:

if (isTUnit && activity.GetTagItem("tunit.test.node_uid") is not null)
{
    _testCaseSpanIds.TryRemove(activity.SpanId.ToString(), out _);
    _externalSpanCountsByTrace.TryRemove(traceId, out _);
}

_externalSpanCountsByTest is not cleaned up here (and was not cleaned up before this PR either). For a long test run, this dictionary accumulates one entry per completed test and is never pruned. This is a pre-existing issue, but it would be a low-hanging improvement to also add _externalSpanCountsByTest.TryRemove(activity.SpanId.ToString(), out _) to the same cleanup block, since the test case span ID is available at that point.


Overall

The substantive concerns from the first review are resolved. The fix is well-reasoned, well-documented, and now includes proper cleanup. The remaining items (Issue 4 and the _externalSpanCountsByTest pre-existing leak) are minor and non-blocking. This is ready to merge.

@thomhurst
Copy link
Copy Markdown
Owner

@copilot read the newest review

…re-existing leak

Agent-Logs-Url: https://github.com/thomhurst/TUnit/sessions/99eadeb1-f26a-4f94-8341-dc05e1ad3fe9

Co-authored-by: thomhurst <30480171+thomhurst@users.noreply.github.com>
Copy link
Copy Markdown
Contributor Author

Copilot AI commented Apr 3, 2026

@copilot read the newest review

The follow-up review confirmed all substantive issues are resolved. It also noted a pre-existing leak in _externalSpanCountsByTest — addressed in 59a4217 by adding _externalSpanCountsByTest.TryRemove(spanId, out _) to the same cleanup block. All three tracking dictionaries are now cleaned up when test case spans stop.

Copilot AI temporarily deployed to Pull Requests April 3, 2026 10:48 Inactive
@thomhurst thomhurst enabled auto-merge (squash) April 3, 2026 10:51
Copilot AI temporarily deployed to Pull Requests April 3, 2026 12:27 Inactive
@thomhurst thomhurst merged commit feeaf9e into main Apr 3, 2026
15 of 17 checks passed
@thomhurst thomhurst deleted the copilot/fix-max-external-spans-cap branch April 3, 2026 13:06
@claude claude bot mentioned this pull request Apr 3, 2026
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HtmlReporter: MaxExternalSpansPerTest cap bypassed when Activity.Parent chain is broken (Npgsql/async)

2 participants