Skip to content

fix: test_nightly_fault_recovery_speed off-by-GATEWAYS node indexing#3881

Merged
sanity merged 3 commits intomainfrom
fix/nightly-fault-recovery-test-indexing
Apr 15, 2026
Merged

fix: test_nightly_fault_recovery_speed off-by-GATEWAYS node indexing#3881
sanity merged 3 commits intomainfrom
fix/nightly-fault-recovery-test-indexing

Conversation

@sanity
Copy link
Copy Markdown
Collaborator

@sanity sanity commented Apr 15, 2026

Problem

test_nightly_fault_recovery_speed deterministically fails its Phase 2 assertion on seed 0x3511FA170003:

assertion `left == right` failed: Phase 2: expected connection managers
for all 50 nodes, got 46. Seed: 0x3511FA170003

This was the headline failure on the 2026-04-07 → 2026-04-15 9-night red streak on Simulation Tests (Nightly).

Root cause

The test iterates (0..NODES) and constructs NodeLabel::node(NETWORK_NAME, i) at every query site. But SimNetwork::config_nodes (crates/core/src/node/testing_impl.rs:1580) builds regular-node labels with:

for node_no in self.number_of_gateways..num + self.number_of_gateways {
    let label = NodeLabel::node(&self.name, node_no);
    ...
}

For the test's GATEWAYS=4 / NODES=50 configuration, the live regular-node labels are therefore NodeLabel::node(NETWORK_NAME, 4..54), not 0..50. The test's query overlap with the actual labels is exactly 4..49 — 46 entries — which is precisely the failure number.

How this got misdiagnosed

The first pass on this regression (PR #3871) attributed it to a SimNetwork shared_cm capture race, because the symptom (4 nodes' connection managers missing from connection_managers) was consistent with both root causes. PR #3871 added a yield-poll helper, generic capture_shared_slot<T>, loud panic on timeout, and unit tests. That fix is still beneficial — it makes the startup slot-capture robust and unit-testable, and the suspend-detector half of #3871 (which addressed five unrelated debug_assert! panics in get-path tests) was correct and worked.

But the post-merge nightly (run 24435694147, head 8c4eb0ec which includes #3871) still hit 46/50 on the same seed with no "SimNetwork: node did not publish" warnings in the log, proving the harness was correctly publishing all 54 (4 gateway + 50 node) labels. Querying (0..NODES) was missing the top 4 regular-node labels by indexing convention, not by race.

Fix

Introduce a let node_ids = || GATEWAYS..GATEWAYS + NODES; closure at the top of the test and iterate through it at all three phase query sites (Phase 1 pre-fault, Phase 2 during-fault, Phase 3 post-recovery). The closure documents the invariant once and keeps each phase block clean.

Three call sites updated. The Phase 1 site has a long comment explaining the indexing convention so a future contributor doesn't reintroduce the same off-by-GATEWAYS bug.

Why this is a test fix, not a harness fix

The config_nodes shifted-index convention is longstanding. Other tests in the same file already rely on it (e.g. simulation_integration.rs:2070+ and :2284+ spell out NodeLabel::node(NETWORK_NAME, 1) when their gateway count is 1, using the next-after-gateways slot). Changing the harness to start regular nodes at index 0 would be a cross-cutting rewrite affecting dozens of test sites with unclear payoff. A 3-line surgical fix to the broken test is the right scope.

A follow-up issue could clean up the harness convention later if there's appetite, but this PR's job is unblocking the nightly.

Testing

Local cargo check --features "simulation_tests,testing,nightly_tests" --tests passes clean. The actual test_nightly_fault_recovery_speed simulation takes ~4 minutes of wall time which is too slow for local iteration, but the indexing fix is self-evidently correct from the math: with GATEWAYS=4 and NODES=50, iterating 4..54 yields exactly 50 labels matching the live node_4..node_53 set, so len() == NODES is now satisfied for all three phase assertions.

The next nightly run will validate end-to-end.

Fixes

The deterministic 46/50 failure on test_nightly_fault_recovery_speed that has been red on every nightly run since PR #3795 (2026-04-08) strengthened the assertion. Combined with the previously-merged PR #3871 (suspend-detector + Matrix→River notifier), this should restore the Simulation Tests (Nightly) workflow to green.

[AI-assisted - Claude]

The nightly test was iterating `(0..NODES)` and constructing
`NodeLabel::node(NETWORK_NAME, i)` for every query site, but
`SimNetwork::config_nodes` (testing_impl.rs:1580) actually builds
regular-node labels in the range
`number_of_gateways..number_of_gateways + num`. For the test's
GATEWAYS=4 / NODES=50 configuration that means the live labels are
`NodeLabel::node(NETWORK_NAME, 4..54)`, not `0..50`.

The test's query therefore hit only the overlap `4..49` — exactly 46
labels — and the strengthened Phase 2 / Phase 3 assertions added by
PR #3795 deterministically failed with `expected 50, got 46`.

This was misdiagnosed on the first pass as a `SimNetwork`
`shared_cm` capture race and "fixed" in PR #3871 with a yield-poll
helper. That fix is still beneficial (it makes the startup slot
capture robust and adds unit tests for the polling contract), but it
was not the actual root cause of this particular failure — the 2026-
04-15 post-merge nightly run (24435694147) still hit `46/50` on the
same seed with no "did not publish" warnings in the log, proving the
harness was correctly publishing all 54 `(4 gateway + 50 node)`
labels. The real bug is in the test's query range.

Fix: introduce a `let node_ids = || GATEWAYS..GATEWAYS + NODES;`
closure at the top of the three phase blocks and iterate through
that everywhere a regular-node label is constructed. The closure
documents the invariant in one place and keeps each phase block
clean.

Why this is a test fix, not a harness fix: the
`config_nodes` shifted-index convention is longstanding and several
other tests (e.g. `simulation_integration.rs:2070+`,
`simulation_integration.rs:2284+`) already rely on it by spelling
out e.g. `NodeLabel::node(NETWORK_NAME, 1)` when their gateway count
is 1. Changing the harness would be a cross-cutting rewrite with
unclear payoff; fixing the broken test is a 3-line surgical change.

Verified locally with `cargo check --features
"simulation_tests,testing,nightly_tests" --tests`. The actual
`test_nightly_fault_recovery_speed` assertion run takes ~4 minutes
of simulation wall time which is too slow for local iteration, but
the indexing fix is self-evidently correct from the math above —
the next nightly run will validate end-to-end.

[AI-assisted - Claude]
@sanity sanity enabled auto-merge April 15, 2026 06:08
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 15, 2026

Rule Review: No issues found

Rules checked: git-workflow.md, code-style.md, testing.md
Files reviewed: 1 (crates/core/tests/simulation_integration.rs)

No rule violations detected.

Summary of what was verified:

  • fix: PR + regression tests: The PR adds two regression tests. sim_network_regular_node_labels_start_at_gateway_count directly exercises the off-by-GATEWAYS bug (verifies node-0/node-1 are NOT live for a 2-gateway network, and node-2..node-4 ARE live). node_label_node_and_gateway_formats_are_stable is a plain #[test] added to satisfy the CI rule-lint regex, and independently pins the NodeLabel string encoding. Together they satisfy the regression-test requirement.
  • All three (0..NODES) sites replaced: Phase 1, Phase 2, and Phase 3 query ranges all now use the node_ids() closure.
  • Test annotation: #[test_log::test(tokio::test(flavor = "current_thread"))] is the standard async-test annotation in this file (used at ~15 existing test sites); no deviation from local convention.
  • Nightly-gate concern: test_nightly_fault_recovery_speed is #[cfg(feature = "nightly_tests")] so full end-to-end validation of the nightly test requires CI. The two new regression tests run without the feature gate. Using cargo check for the gated test is appropriate given the ~4-minute wall-time constraint.
  • No production code changes: All modifications are in test code; code-style rules for production paths (unwrap, spawn, retry, time/rng) are not applicable.
  • Commit messages: All three commits follow conventional-commit format with well-reasoned WHY explanations.

Rule review against .claude/rules/. WARNING findings block merge.

sanity added 2 commits April 15, 2026 12:21
Rule-lint requires `fix:` PRs to include at least one new `#[test]`
function. The previous commit only modified an existing test's query
ranges, which the lint correctly does not count.

Add `sim_network_regular_node_labels_start_at_gateway_count` — a
minimal (2 gateways, 3 regular nodes, 0.33s runtime) test that pins
the `SimNetwork::config_nodes` indexing convention. It verifies:

  * Labels `node-0` and `node-1` (inside the gateway ID range) are
    NOT live regular-node labels — the same class of mistake as the
    nightly failure would fail this assertion immediately.
  * Labels `node-2`, `node-3`, `node-4` (the `GATEWAYS..GATEWAYS +
    NODES` range) ARE live.
  * `node-5` (one past the end) is NOT live.

This is a targeted regression test for the convention itself, not
for `test_nightly_fault_recovery_speed` specifically. Any future
test that iterates `0..N` instead of `GATEWAYS..GATEWAYS+N` will now
have this test as prior-art documentation of the correct pattern,
and any refactor of `config_nodes` that changes the indexing will
immediately break this pin rather than silently corrupting the
downstream assertions of every test in the file.

[AI-assisted - Claude]
The previous regression test was `#[test_log::test(tokio::test(flavor =
"current_thread"))]` which the CI rule-lint regex
`#\[(tokio::)?test\]` does not recognize as a new test (it only
matches bare `#[test]` or `#[tokio::test]`). Rule-lint therefore
rejected the PR with "fix: PR must include at least one new
regression test" even though a new function had been added.

Add `node_label_node_and_gateway_formats_are_stable` as a plain
`#[test]` that pins the label-string encoding (`net-node-5`,
`net-gateway-2`) and the disjointness of the gateway / regular-node
label spaces at the same numeric ID. Both invariants are load-
bearing for every other test in this file and for
`config_nodes` itself. Any refactor that conflates the two label
types — or that changes the string encoding — would silently
corrupt every connection-count assertion downstream; now it would
immediately break this pin.

The companion async test
(`sim_network_regular_node_labels_start_at_gateway_count`) still
exists and still pins the live-label ID range, but the rule-lint
regex won't see it as "a test" because of its non-standard
annotation. A doc comment on each test now cross-references the
other so the two stay paired.

[AI-assisted - Claude]
@sanity sanity added this pull request to the merge queue Apr 15, 2026
Merged via the queue into main with commit a90f9fe Apr 15, 2026
12 checks passed
@sanity sanity deleted the fix/nightly-fault-recovery-test-indexing branch April 15, 2026 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant