Conversation
`test_ping_blocked_peers` was flaky in the merge queue because the task-per-tx subscribe driver dispatched its initial Request through `OpCtx::send_and_await`, which routes via `op_execution_sender` and loops the message back to the local event loop as an `InboundMessage`. When the contract was already cached locally (every client node had just finished a GET), `process_message` short-circuited at "Subscribe completed (originator has contract locally)" and synthesized a success reply without ever sending the Request to the gateway. The gateway therefore never registered the node as a downstream subscriber, so subsequent UPDATE broadcasts never reached it. The asymmetric failure pattern (one client node receives all updates, the other only its own) is explained by `auto_subscribe_on_get_response` falling back to the legacy `request_subscribe` path on GET completion: one client wins the race and registers via the legacy path, the other loses and is left depending on the broken explicit Subscribe. Fix: - Add `OpCtx::send_to_and_await(target_addr, msg)`. The op-execution channel payload now carries `Option<SocketAddr>`. When set, `handle_op_execution` emits `OutboundMessageWithTarget` instead of `InboundMessage`, dispatching the message to the named peer over the wire while still registering the reply callback in `pending_op_results`. - Have `op_ctx_task` call `send_to_and_await(current_target_addr, ...)` for client-initiated subscribe Requests. The task already selected `current_target_addr` via `prepare_initial_request`; previously that address was being dropped on the floor. - On `Subscribed` reply, mirror the legacy Response handler's local side effects (`ring.subscribe`, `complete_subscription_request`, `interest_manager.add_local_client`) inside `op_ctx_task`. The task-per-tx reply forwarding bypass in `node.rs` skips `handle_op_request` for terminal Responses, so without these calls the local interest manager never learns about the subscription. - Preserve the existing `send_and_await(msg)` (target=None) variant for non-routed call sites and tests. The local-originator branch in `process_message` is preserved intentionally — it remains correct for true standalone/sole-holder nodes (PR #2327, #2004). The fix prevents the network-target path from entering it. Wire format is unchanged: this is purely an internal dispatch fix. Regression test: `send_to_and_await_forwards_target_address` pins the channel-level invariant (target propagates through the op execution channel). The end-to-end behavior is covered by `test_ping_blocked_peers`, which previously hung for ~280 s and now passes in ~12 s. Also includes unrelated cleanup: - `sqlite.rs`: replace `let _ =` on a sqlx Result with `drop(...)` to satisfy clippy's `let_underscore_must_use` (newer clippy than CI). Closes #3838 [AI-assisted - Claude]
Rule Review: No blocking issues; minor test coverage gap notedRules checked: WarningsNone. Info
Notes on what was checked and found clean:
Rule review against |
Problem
test_ping_blocked_peers(freenet-ping-app::run_app_blocked_peers) is flaky in the merge queue. PR #3837 hit it for an unrelated shell-page change, all 3 nextest retries failed, and the failure pattern matches what #2254 and #3490 tried to address. Locally the test fails in ~280 s (full timeout) on a fresh build.The failure is asymmetric: gateway and one client node end with all three tags, but the other client node has only its own. In ~2/3 of failures it's Node2 stuck, ~1/3 Node1.
Root cause
The Phase 2b task-per-tx subscribe driver (
op_ctx_task, PR #3806, #1454) was dispatching its initial Request throughOpCtx::send_and_await, which routes viaop_execution_senderand loops the message back to the local event loop as anInboundMessagewithsource_addr = None.Both client nodes had just done a GET, so the contract was cached locally. When
process_messageran on the loop-back, it took therequester_addr.is_none() && has_contractbranch atsubscribe.rs:1417("Subscribe completed (originator has contract locally)") and synthesized aSubscribedreply via the callback path innode.rs:1191-1220. The Request never reached the gateway. The gateway therefore never registered the node as a downstream subscriber, and subsequent UPDATE broadcasts never reached it.The asymmetry is
auto_subscribe_on_get_response(operations.rs:646) racing the explicit Subscribe: when GET completes, if not already subscribed, it falls back to the legacystart_subscription_request/request_subscribepath, which usesnotify_op_changeand DOES dispatch over the wire. One client wins that race and registers the legacy way; the other loses and depends on the broken explicit Subscribe.The comment at
subscribe.rs:480-487inprepare_initial_requestexplicitly documents the invariant the task-per-tx path was violating:Approach
Fix at the
OpCtxsend layer, not atprocess_message. The driver already selectscurrent_target_addrviaprepare_initial_request; thesend_and_awaitabstraction was throwing it away.OpCtx::send_to_and_await(target_addr, msg)— new variant that carries an explicit target through the op-execution channel. The channel payload is now(reply_sender, msg, Option<SocketAddr>). When the target isSome,handle_op_executionemitsOutboundMessageWithTarget { target_addr, msg }and dispatches to that peer's connection. When it'sNone, the existingInboundMessageloop-back path is preserved for non-routed call sites and tests.op_ctx_taskswitches tosend_to_and_awaitfor the initial Request, passingcurrent_target_addr. The reply still flows back through the samepending_op_resultscallbackhandle_op_executionregistered.Mirror legacy local side effects on
Subscribedinop_ctx_task. The task-per-tx reply forwarding bypass atnode.rs:1161skipshandle_op_requestfor terminal Responses, so without this the local interest manager (ring.subscribe,complete_subscription_request,interest_manager.add_local_client) never learned about the subscription. These are the same calls the legacySubscribeMsg::Responsearm makes.Preserve the local-originator branch in
process_message. It remains correct for true standalone/sole-holder nodes (PRs fix: complete Subscribe locally when contract is cached #2327, fix: register subscribers for locally-cached contracts (issue #2001) #2004 deliberately added it). The fix prevents the network-target path from entering it.Wire format is unchanged. The
SubscribeMsg::Requestbytes on the network are identical; only the internal dispatch path differs.Alternatives considered
process_messageto forward instead of short-circuit when a target is available. Requires plumbing a target through the op state, deeper change, and risks breaking the legitimate standalone case.notify_op_changefromop_ctx_task. Undercuts the task-per-tx model — would mix state ownership back intoOpManager.ops. Worse architectural fit.Codex independently arrived at the same conclusion when consulted as a second opinion.
Testing
New unit test:
send_to_and_await_forwards_target_address(crates/core/src/operations/op_ctx.rs) — pins the channel-level invariant. Asserts that the target address provided by the caller arrives intact at the op execution receiver. A regression that drops the target on the floor (the original bug) would fail this test in milliseconds.Why didn't CI catch this?
test_ping_blocked_peersexists and does exercise this path, but it depends on a race between the legacy GET-fallback subscribe and the explicit task-per-tx subscribe. CI usually wins one race and flakes only when timing slips. The new unit test pins the contract directly so future regressions surface deterministically before the integration test does.test_ping_blocked_peerslocally:Full test suites:
cargo test -p freenet --lib: 2221 passedcargo test -p freenet --lib subscribe: 113 passedcargo fmt && cargo clippy --all-targets --all-features: cleanIncidental cleanup
crates/core/src/contract/storages/sqlite.rs: replacedlet _ =on asqlx::query(...).execute(...).awaitwithdrop(...)to satisfy clippy'slet_underscore_must_use(newer local clippy than CI; pre-existing on main but only fires on stricter clippy versions).Fixes
Closes #3838
[AI-assisted - Claude]