fix(connect): restore VisitedPeers hash keys on relay op re-entry#3868
Merged
fix(connect): restore VisitedPeers hash keys on relay op re-entry#3868
Conversation
VisitedPeers::hash_keys is marked #[serde(skip)] so a freshly deserialized bloom filter has zero hash keys. ConnectOp::new_relay correctly calls with_transaction(&id) on first receipt, but the ConnectOp::handle_request re-entry path assigned the incoming (deserialized) request directly to state.request without re-keying. Subsequent mark_visited / probably_visited calls then ran against zeroed hash keys, setting bits at the wrong indices and silently corrupting the bloom. This only manifested at scale because small-topology tests rarely re-enter an existing relay op for the same transaction. Production CI 50-node sims with uphill bounce-back at terminus do — frequently. The symptom was a CONNECT forwarding loop on a single transaction bouncing between three nodes that all rejected at terminus, with visited set-bit count oscillating non-monotonically (observed in CI run 24365113346 job 71154729735, transaction 01E0YHAGAQ5N6DTVPA9BM7W000). Fix: restore hash keys via `request.visited.with_transaction(&self.id)` before assigning the request to state.request in ConnectOp::handle_request. Surgical — touches only the broken path, leaves new_relay unchanged. Adds regression test `regression_visited_hash_keys_preserved_across_reentry`: a deterministic state-machine unit test that constructs a relay op with a known visited peer, bincode-serializes and deserializes the request to simulate a wire round-trip (which zeros hash_keys), re-enters handle_request, and asserts the previously-marked peer is still detected. Verified to fail on current main (bloom corruption panic) and pass with the fix. No networking required. [AI-assisted - Claude] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor
Rule Review: No issues foundRules checked: Production FixThe two-line fix at let mut request = request;
request.visited = request.visited.with_transaction(&self.id);
Regression Test
WarningsNone. InfoNone. Rule review against |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
CONNECT forwarding loop in 50-node simulation tests causes the Simulation CI job to hang for 27+ minutes silently until the 30-minute runner timeout kills it. Observed in PR #3861's CI run (run 24365113346, job 71154729735): a single CONNECT transaction (
01E0YHAGAQ5N6DTVPA9BM7W000) bouncing between nodes 27/28/35, all rejecting at terminus, with thevisitedbloom filter set-bit count oscillating non-monotonically between 81 and 84 — a clear sign of bloom corruption rather than a routing decision issue.Root cause
VisitedPeers::hash_keysis marked#[serde(skip)], so a freshly deserialized bloom filter has zero hash keys.ConnectOp::new_relaycorrectly callswith_transaction(&id)on first receipt. TheConnectOp::handle_requestre-entry path (crates/core/src/operations/connect.rs, previously line 1379) assigned the incoming, deserialized request directly tostate.requestwithout re-keying. Subsequentmark_visited/probably_visitedcalls then ran against zeroed hash keys, setting bits at entirely different indices from the originally-marked peers. The bloom semantics silently collapsed: the filter no longer detected already-visited peers, so routing kept forwarding the same transaction back to nodes it had already tried.This pattern only manifests at scale because small-topology tests rarely re-enter an existing relay op for the same transaction. CI 50-node sims with uphill bounce-back at terminus do, frequently.
Solution
Two-line fix: restore hash keys via
request.visited = request.visited.with_transaction(&self.id)before assigning the incoming request tostate.request. Surgical — touches only the broken path, leavesnew_relayunchanged.Testing
Added
regression_visited_hash_keys_preserved_across_reentry— a deterministic state-machine unit test that:ConnectOp::new_relaywith a peer pre-marked in the visited bloom (correctly keyed).bincode-serializes and deserializes theConnectRequestto simulate a wire round-trip, confirming hash keys are zeroed (sanity assertion on the deserialized bloom).ConnectOp::handle_requeston the existing op with the deserialized request — the re-entry path.state.request.visited.probably_visited(...).Verified to panic on current main (bloom corruption detected) and pass with the fix. No networking required. Ran 10x in a loop — deterministic. Full connect module test suite (50 tests) passes. Full
cargo test -p freenet --libpasses (2240 passing; the one unrelated deadlock-detection failure is a known test-isolation issue that requires nextest or--test-threads=1, not introduced by this PR).Out of scope (follow-up)
The invariant "every deserialized
VisitedPeersmust havewith_transactioncalled before anymark_visited/probably_visiteduse" is enforced only by code review, not by the type system. A future hardening could introduce anUnkeyedVisitedPeersnewtype that must be converted viawith_transactionbefore becoming a usableVisitedPeers, making this class of bug unrepresentable. Not doing that here to keep the fix surgical.Refs: investigation in CI run https://github.com/freenet/freenet-core/actions/runs/24365113346
[AI-assisted - Claude]