Skip to content

fix(connect): restore VisitedPeers hash keys on relay op re-entry#3868

Merged
sanity merged 1 commit intomainfrom
fix-visited-bloom-rekey
Apr 13, 2026
Merged

fix(connect): restore VisitedPeers hash keys on relay op re-entry#3868
sanity merged 1 commit intomainfrom
fix-visited-bloom-rekey

Conversation

@sanity
Copy link
Copy Markdown
Collaborator

@sanity sanity commented Apr 13, 2026

Problem

CONNECT forwarding loop in 50-node simulation tests causes the Simulation CI job to hang for 27+ minutes silently until the 30-minute runner timeout kills it. Observed in PR #3861's CI run (run 24365113346, job 71154729735): a single CONNECT transaction (01E0YHAGAQ5N6DTVPA9BM7W000) bouncing between nodes 27/28/35, all rejecting at terminus, with the visited bloom filter set-bit count oscillating non-monotonically between 81 and 84 — a clear sign of bloom corruption rather than a routing decision issue.

Root cause

VisitedPeers::hash_keys is marked #[serde(skip)], so a freshly deserialized bloom filter has zero hash keys. ConnectOp::new_relay correctly calls with_transaction(&id) on first receipt. The ConnectOp::handle_request re-entry path (crates/core/src/operations/connect.rs, previously line 1379) assigned the incoming, deserialized request directly to state.request without re-keying. Subsequent mark_visited / probably_visited calls then ran against zeroed hash keys, setting bits at entirely different indices from the originally-marked peers. The bloom semantics silently collapsed: the filter no longer detected already-visited peers, so routing kept forwarding the same transaction back to nodes it had already tried.

This pattern only manifests at scale because small-topology tests rarely re-enter an existing relay op for the same transaction. CI 50-node sims with uphill bounce-back at terminus do, frequently.

Solution

Two-line fix: restore hash keys via request.visited = request.visited.with_transaction(&self.id) before assigning the incoming request to state.request. Surgical — touches only the broken path, leaves new_relay unchanged.

match self.state.as_mut() {
    Some(ConnectState::Relaying(state)) => {
        state.upstream_addr = upstream_addr;
        let mut request = request;
        request.visited = request.visited.with_transaction(&self.id);
        state.request = request;
        state.handle_request(...)
    }
    _ => RelayActions::default(),
}

Testing

Added regression_visited_hash_keys_preserved_across_reentry — a deterministic state-machine unit test that:

  1. Creates a ConnectOp::new_relay with a peer pre-marked in the visited bloom (correctly keyed).
  2. bincode-serializes and deserializes the ConnectRequest to simulate a wire round-trip, confirming hash keys are zeroed (sanity assertion on the deserialized bloom).
  3. Calls ConnectOp::handle_request on the existing op with the deserialized request — the re-entry path.
  4. Asserts the previously-marked peer is still detected via state.request.visited.probably_visited(...).

Verified to panic on current main (bloom corruption detected) and pass with the fix. No networking required. Ran 10x in a loop — deterministic. Full connect module test suite (50 tests) passes. Full cargo test -p freenet --lib passes (2240 passing; the one unrelated deadlock-detection failure is a known test-isolation issue that requires nextest or --test-threads=1, not introduced by this PR).

Out of scope (follow-up)

The invariant "every deserialized VisitedPeers must have with_transaction called before any mark_visited/probably_visited use" is enforced only by code review, not by the type system. A future hardening could introduce an UnkeyedVisitedPeers newtype that must be converted via with_transaction before becoming a usable VisitedPeers, making this class of bug unrepresentable. Not doing that here to keep the fix surgical.

Refs: investigation in CI run https://github.com/freenet/freenet-core/actions/runs/24365113346

[AI-assisted - Claude]

VisitedPeers::hash_keys is marked #[serde(skip)] so a freshly deserialized
bloom filter has zero hash keys. ConnectOp::new_relay correctly calls
with_transaction(&id) on first receipt, but the ConnectOp::handle_request
re-entry path assigned the incoming (deserialized) request directly to
state.request without re-keying. Subsequent mark_visited / probably_visited
calls then ran against zeroed hash keys, setting bits at the wrong indices
and silently corrupting the bloom.

This only manifested at scale because small-topology tests rarely re-enter
an existing relay op for the same transaction. Production CI 50-node sims
with uphill bounce-back at terminus do — frequently. The symptom was a
CONNECT forwarding loop on a single transaction bouncing between three
nodes that all rejected at terminus, with visited set-bit count oscillating
non-monotonically (observed in CI run 24365113346 job 71154729735,
transaction 01E0YHAGAQ5N6DTVPA9BM7W000).

Fix: restore hash keys via `request.visited.with_transaction(&self.id)`
before assigning the request to state.request in ConnectOp::handle_request.
Surgical — touches only the broken path, leaves new_relay unchanged.

Adds regression test `regression_visited_hash_keys_preserved_across_reentry`:
a deterministic state-machine unit test that constructs a relay op with a
known visited peer, bincode-serializes and deserializes the request to
simulate a wire round-trip (which zeros hash_keys), re-enters handle_request,
and asserts the previously-marked peer is still detected. Verified to fail
on current main (bloom corruption panic) and pass with the fix. No
networking required.

[AI-assisted - Claude]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sanity sanity enabled auto-merge April 13, 2026 22:33
@github-actions
Copy link
Copy Markdown
Contributor

Rule Review: No issues found

Rules checked: code-style.md, testing.md, operations.md, git-workflow.md
Files reviewed: 1 (crates/core/src/operations/connect.rs)

Production Fix

The two-line fix at connect.rs:1385-1386 is minimal and correct:

let mut request = request;
request.visited = request.visited.with_transaction(&self.id);
  • No .unwrap() in production code
  • No fire-and-forget spawns, retry loops, or push-before-send violations
  • Comment accurately explains why (serde skips hash_keys, only new_relay re-keys on first receipt)

Regression Test

regression_visited_hash_keys_preserved_across_reentry satisfies fix: PR requirements:

  • Reproduces the exact failure mode (deserialized bloom with zeroed hash keys corrupts visited-set operations)
  • Sanity assert confirms the serialization round-trip actually zeros the keys (test setup is self-validating)
  • Final assert would fail without the fix (prior hash values for marked_addr would map to wrong bit positions with zero keys)
  • Test name is specific and documents the scenario

Instant::now() in the test is flagged by the Rule Lint CI job — not reported here per review instructions.

Warnings

None.

Info

None.


Rule review against .claude/rules/. WARNING findings block merge.

@sanity sanity added this pull request to the merge queue Apr 13, 2026
Merged via the queue into main with commit ca06963 Apr 13, 2026
12 checks passed
@sanity sanity deleted the fix-visited-bloom-rekey branch April 13, 2026 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant