fix: route initial subscribe Request to home node via OpCtx target by sanity · Pull Request #3864 · freenet/freenet-core

sanity · 2026-04-13T21:03:16Z

Problem

test_ping_blocked_peers (freenet-ping-app::run_app_blocked_peers) is flaky in the merge queue. PR #3837 hit it for an unrelated shell-page change, all 3 nextest retries failed, and the failure pattern matches what #2254 and #3490 tried to address. Locally the test fails in ~280 s (full timeout) on a fresh build.

The failure is asymmetric: gateway and one client node end with all three tags, but the other client node has only its own. In ~2/3 of failures it's Node2 stuck, ~1/3 Node1.

Root cause

The Phase 2b task-per-tx subscribe driver (op_ctx_task, PR #3806, #1454) was dispatching its initial Request through OpCtx::send_and_await, which routes via op_execution_sender and loops the message back to the local event loop as an InboundMessage with source_addr = None.

Both client nodes had just done a GET, so the contract was cached locally. When process_message ran on the loop-back, it took the requester_addr.is_none() && has_contract branch at subscribe.rs:1417 ("Subscribe completed (originator has contract locally)") and synthesized a Subscribed reply via the callback path in node.rs:1191-1220. The Request never reached the gateway. The gateway therefore never registered the node as a downstream subscriber, and subsequent UPDATE broadcasts never reached it.

The asymmetry is auto_subscribe_on_get_response (operations.rs:646) racing the explicit Subscribe: when GET completes, if not already subscribed, it falls back to the legacy start_subscription_request / request_subscribe path, which uses notify_op_change and DOES dispatch over the wire. One client wins that race and registers the legacy way; the other loses and depends on the broken explicit Subscribe.

The comment at subscribe.rs:480-487 in prepare_initial_request explicitly documents the invariant the task-per-tx path was violating:

Even if we have the contract locally (e.g., from PUT forwarding), we MUST send a Subscribe::Request to the network to register ourselves as a downstream subscriber in the subscription tree. Otherwise, when the contract is updated at the source, we won't receive the update because we're not registered in the upstream peer's subscriber list.

Approach

Fix at the OpCtx send layer, not at process_message. The driver already selects current_target_addr via prepare_initial_request; the send_and_await abstraction was throwing it away.

OpCtx::send_to_and_await(target_addr, msg) — new variant that carries an explicit target through the op-execution channel. The channel payload is now (reply_sender, msg, Option<SocketAddr>). When the target is Some, handle_op_execution emits OutboundMessageWithTarget { target_addr, msg } and dispatches to that peer's connection. When it's None, the existing InboundMessage loop-back path is preserved for non-routed call sites and tests.
op_ctx_task switches to send_to_and_await for the initial Request, passing current_target_addr. The reply still flows back through the same pending_op_results callback handle_op_execution registered.
Mirror legacy local side effects on Subscribed in op_ctx_task. The task-per-tx reply forwarding bypass at node.rs:1161 skips handle_op_request for terminal Responses, so without this the local interest manager (ring.subscribe, complete_subscription_request, interest_manager.add_local_client) never learned about the subscription. These are the same calls the legacy SubscribeMsg::Response arm makes.
Preserve the local-originator branch in process_message. It remains correct for true standalone/sole-holder nodes (PRs fix: complete Subscribe locally when contract is cached #2327, fix: register subscribers for locally-cached contracts (issue #2001) #2004 deliberately added it). The fix prevents the network-target path from entering it.

Wire format is unchanged. The SubscribeMsg::Request bytes on the network are identical; only the internal dispatch path differs.

Alternatives considered

Teach process_message to forward instead of short-circuit when a target is available. Requires plumbing a target through the op state, deeper change, and risks breaking the legitimate standalone case.
Use legacy notify_op_change from op_ctx_task. Undercuts the task-per-tx model — would mix state ownership back into OpManager.ops. Worse architectural fit.

Codex independently arrived at the same conclusion when consulted as a second opinion.

Testing

New unit test: send_to_and_await_forwards_target_address (crates/core/src/operations/op_ctx.rs) — pins the channel-level invariant. Asserts that the target address provided by the caller arrives intact at the op execution receiver. A regression that drops the target on the floor (the original bug) would fail this test in milliseconds.
Why didn't CI catch this? test_ping_blocked_peers exists and does exercise this path, but it depends on a race between the legacy GET-fallback subscribe and the explicit task-per-tx subscribe. CI usually wins one race and flakes only when timing slips. The new unit test pins the contract directly so future regressions surface deterministically before the integration test does.
test_ping_blocked_peers locally:
- Before fix: ~280-293 s, full propagation timeout (Node2 stuck with only its own tag)
- After fix: ~12-37 s, all tags propagate within first poll round (×4 consecutive runs)
Full test suites:
- cargo test -p freenet --lib: 2221 passed
- cargo test -p freenet --lib subscribe: 113 passed
- cargo fmt && cargo clippy --all-targets --all-features: clean

Incidental cleanup

crates/core/src/contract/storages/sqlite.rs: replaced let _ = on a sqlx::query(...).execute(...).await with drop(...) to satisfy clippy's let_underscore_must_use (newer local clippy than CI; pre-existing on main but only fires on stricter clippy versions).

Fixes

Closes #3838

[AI-assisted - Claude]

`test_ping_blocked_peers` was flaky in the merge queue because the task-per-tx subscribe driver dispatched its initial Request through `OpCtx::send_and_await`, which routes via `op_execution_sender` and loops the message back to the local event loop as an `InboundMessage`. When the contract was already cached locally (every client node had just finished a GET), `process_message` short-circuited at "Subscribe completed (originator has contract locally)" and synthesized a success reply without ever sending the Request to the gateway. The gateway therefore never registered the node as a downstream subscriber, so subsequent UPDATE broadcasts never reached it. The asymmetric failure pattern (one client node receives all updates, the other only its own) is explained by `auto_subscribe_on_get_response` falling back to the legacy `request_subscribe` path on GET completion: one client wins the race and registers via the legacy path, the other loses and is left depending on the broken explicit Subscribe. Fix: - Add `OpCtx::send_to_and_await(target_addr, msg)`. The op-execution channel payload now carries `Option<SocketAddr>`. When set, `handle_op_execution` emits `OutboundMessageWithTarget` instead of `InboundMessage`, dispatching the message to the named peer over the wire while still registering the reply callback in `pending_op_results`. - Have `op_ctx_task` call `send_to_and_await(current_target_addr, ...)` for client-initiated subscribe Requests. The task already selected `current_target_addr` via `prepare_initial_request`; previously that address was being dropped on the floor. - On `Subscribed` reply, mirror the legacy Response handler's local side effects (`ring.subscribe`, `complete_subscription_request`, `interest_manager.add_local_client`) inside `op_ctx_task`. The task-per-tx reply forwarding bypass in `node.rs` skips `handle_op_request` for terminal Responses, so without these calls the local interest manager never learns about the subscription. - Preserve the existing `send_and_await(msg)` (target=None) variant for non-routed call sites and tests. The local-originator branch in `process_message` is preserved intentionally — it remains correct for true standalone/sole-holder nodes (PR #2327, #2004). The fix prevents the network-target path from entering it. Wire format is unchanged: this is purely an internal dispatch fix. Regression test: `send_to_and_await_forwards_target_address` pins the channel-level invariant (target propagates through the op execution channel). The end-to-end behavior is covered by `test_ping_blocked_peers`, which previously hung for ~280 s and now passes in ~12 s. Also includes unrelated cleanup: - `sqlite.rs`: replace `let _ =` on a sqlx Result with `drop(...)` to satisfy clippy's `let_underscore_must_use` (newer clippy than CI). Closes #3838 [AI-assisted - Claude]

github-actions · 2026-04-13T21:06:19Z

Rule Review: No blocking issues; minor test coverage gap noted

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md, contracts.md
Files reviewed: 8

Warnings

None.

Info

crates/core/src/operations/subscribe/op_ctx_task.rs:455-474 — The Subscribed branch side effects added here (ring.subscribe, complete_subscription_request, add_local_client, broadcast_change_interests) have no dedicated unit test. A future regression that removes any of these calls would only be caught by the integration test test_ping_blocked_peers, not by the new unit test send_to_and_await_forwards_target_address. The CI minimum (new #[test] function) is met and the end-to-end path is covered, but a focused unit test that exercises the reply-handling path in drive_client_subscribe_inner would make the coverage more robust. (rule: testing.md — regression tests should "catch this exact bug if reintroduced")
crates/core/src/operations/op_ctx.rs:419 — send_and_await is now #[allow(dead_code)] on a pub method. The comment ("Kept as stable API surface for later task-per-tx phases (Refactor core interfaces #1454)") is a reasonable justification, but if Phase 6 of Refactor core interfaces #1454 removes the legacy path without calling this method, it should be cleaned up at that point rather than left with a suppression. (rule: code-style.md — avoid backwards-compatibility stubs beyond their useful life)

Notes on what was checked and found clean:

No .unwrap() in new production code paths
No fire-and-forget spawns; broadcast_change_interests is properly awaited
No new biased; select usage
Push-before-send invariant is preserved: op_ctx(attempt_tx) is called (registering the tx) before send_to_and_await
sqlite.rs cleanup (drop(...) replacing let _ =) is correct and satisfies let_underscore_must_use
Commit message follows conventional commits format and is under 72 chars in subject
OpExecutionPayload type alias is well-documented with the motivating issue reference
All test call sites updated correctly with None for the new third tuple field

Rule review against .claude/rules/. WARNING findings block merge.

sanity merged commit f89bb60 into main Apr 13, 2026
12 checks passed

sanity deleted the fix-3838 branch April 13, 2026 21:21

This was referenced Apr 15, 2026

Migrate relay GETs to task-per-tx driver (#1454 follow-up to Phase 3b) #3883

Open

refactor(ops): migrate client-initiated GET to task-per-tx driver (#1454 phase 3b) #3884

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: route initial subscribe Request to home node via OpCtx target#3864

fix: route initial subscribe Request to home node via OpCtx target#3864
sanity merged 1 commit intomainfrom
fix-3838

sanity commented Apr 13, 2026

Uh oh!

github-actions Bot commented Apr 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sanity commented Apr 13, 2026

Problem

Root cause

Approach

Alternatives considered

Testing

Incidental cleanup

Fixes

Uh oh!

github-actions Bot commented Apr 13, 2026

Rule Review: No blocking issues; minor test coverage gap noted

Warnings

Info

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant