Skip to content

fix: route initial subscribe Request to home node via OpCtx target#3864

Merged
sanity merged 1 commit intomainfrom
fix-3838
Apr 13, 2026
Merged

fix: route initial subscribe Request to home node via OpCtx target#3864
sanity merged 1 commit intomainfrom
fix-3838

Conversation

@sanity
Copy link
Copy Markdown
Collaborator

@sanity sanity commented Apr 13, 2026

Problem

test_ping_blocked_peers (freenet-ping-app::run_app_blocked_peers) is flaky in the merge queue. PR #3837 hit it for an unrelated shell-page change, all 3 nextest retries failed, and the failure pattern matches what #2254 and #3490 tried to address. Locally the test fails in ~280 s (full timeout) on a fresh build.

The failure is asymmetric: gateway and one client node end with all three tags, but the other client node has only its own. In ~2/3 of failures it's Node2 stuck, ~1/3 Node1.

Root cause

The Phase 2b task-per-tx subscribe driver (op_ctx_task, PR #3806, #1454) was dispatching its initial Request through OpCtx::send_and_await, which routes via op_execution_sender and loops the message back to the local event loop as an InboundMessage with source_addr = None.

Both client nodes had just done a GET, so the contract was cached locally. When process_message ran on the loop-back, it took the requester_addr.is_none() && has_contract branch at subscribe.rs:1417 ("Subscribe completed (originator has contract locally)") and synthesized a Subscribed reply via the callback path in node.rs:1191-1220. The Request never reached the gateway. The gateway therefore never registered the node as a downstream subscriber, and subsequent UPDATE broadcasts never reached it.

The asymmetry is auto_subscribe_on_get_response (operations.rs:646) racing the explicit Subscribe: when GET completes, if not already subscribed, it falls back to the legacy start_subscription_request / request_subscribe path, which uses notify_op_change and DOES dispatch over the wire. One client wins that race and registers the legacy way; the other loses and depends on the broken explicit Subscribe.

The comment at subscribe.rs:480-487 in prepare_initial_request explicitly documents the invariant the task-per-tx path was violating:

Even if we have the contract locally (e.g., from PUT forwarding), we MUST send a Subscribe::Request to the network to register ourselves as a downstream subscriber in the subscription tree. Otherwise, when the contract is updated at the source, we won't receive the update because we're not registered in the upstream peer's subscriber list.

Approach

Fix at the OpCtx send layer, not at process_message. The driver already selects current_target_addr via prepare_initial_request; the send_and_await abstraction was throwing it away.

  • OpCtx::send_to_and_await(target_addr, msg) — new variant that carries an explicit target through the op-execution channel. The channel payload is now (reply_sender, msg, Option<SocketAddr>). When the target is Some, handle_op_execution emits OutboundMessageWithTarget { target_addr, msg } and dispatches to that peer's connection. When it's None, the existing InboundMessage loop-back path is preserved for non-routed call sites and tests.

  • op_ctx_task switches to send_to_and_await for the initial Request, passing current_target_addr. The reply still flows back through the same pending_op_results callback handle_op_execution registered.

  • Mirror legacy local side effects on Subscribed in op_ctx_task. The task-per-tx reply forwarding bypass at node.rs:1161 skips handle_op_request for terminal Responses, so without this the local interest manager (ring.subscribe, complete_subscription_request, interest_manager.add_local_client) never learned about the subscription. These are the same calls the legacy SubscribeMsg::Response arm makes.

  • Preserve the local-originator branch in process_message. It remains correct for true standalone/sole-holder nodes (PRs fix: complete Subscribe locally when contract is cached #2327, fix: register subscribers for locally-cached contracts (issue #2001) #2004 deliberately added it). The fix prevents the network-target path from entering it.

Wire format is unchanged. The SubscribeMsg::Request bytes on the network are identical; only the internal dispatch path differs.

Alternatives considered

  • Teach process_message to forward instead of short-circuit when a target is available. Requires plumbing a target through the op state, deeper change, and risks breaking the legitimate standalone case.
  • Use legacy notify_op_change from op_ctx_task. Undercuts the task-per-tx model — would mix state ownership back into OpManager.ops. Worse architectural fit.

Codex independently arrived at the same conclusion when consulted as a second opinion.

Testing

  • New unit test: send_to_and_await_forwards_target_address (crates/core/src/operations/op_ctx.rs) — pins the channel-level invariant. Asserts that the target address provided by the caller arrives intact at the op execution receiver. A regression that drops the target on the floor (the original bug) would fail this test in milliseconds.

  • Why didn't CI catch this? test_ping_blocked_peers exists and does exercise this path, but it depends on a race between the legacy GET-fallback subscribe and the explicit task-per-tx subscribe. CI usually wins one race and flakes only when timing slips. The new unit test pins the contract directly so future regressions surface deterministically before the integration test does.

  • test_ping_blocked_peers locally:

    • Before fix: ~280-293 s, full propagation timeout (Node2 stuck with only its own tag)
    • After fix: ~12-37 s, all tags propagate within first poll round (×4 consecutive runs)
  • Full test suites:

    • cargo test -p freenet --lib: 2221 passed
    • cargo test -p freenet --lib subscribe: 113 passed
    • cargo fmt && cargo clippy --all-targets --all-features: clean

Incidental cleanup

  • crates/core/src/contract/storages/sqlite.rs: replaced let _ = on a sqlx::query(...).execute(...).await with drop(...) to satisfy clippy's let_underscore_must_use (newer local clippy than CI; pre-existing on main but only fires on stricter clippy versions).

Fixes

Closes #3838

[AI-assisted - Claude]

`test_ping_blocked_peers` was flaky in the merge queue because the
task-per-tx subscribe driver dispatched its initial Request through
`OpCtx::send_and_await`, which routes via `op_execution_sender` and
loops the message back to the local event loop as an `InboundMessage`.
When the contract was already cached locally (every client node had
just finished a GET), `process_message` short-circuited at
"Subscribe completed (originator has contract locally)" and synthesized
a success reply without ever sending the Request to the gateway. The
gateway therefore never registered the node as a downstream subscriber,
so subsequent UPDATE broadcasts never reached it.

The asymmetric failure pattern (one client node receives all updates,
the other only its own) is explained by `auto_subscribe_on_get_response`
falling back to the legacy `request_subscribe` path on GET completion:
one client wins the race and registers via the legacy path, the other
loses and is left depending on the broken explicit Subscribe.

Fix:

- Add `OpCtx::send_to_and_await(target_addr, msg)`. The op-execution
  channel payload now carries `Option<SocketAddr>`. When set,
  `handle_op_execution` emits `OutboundMessageWithTarget` instead of
  `InboundMessage`, dispatching the message to the named peer over
  the wire while still registering the reply callback in
  `pending_op_results`.

- Have `op_ctx_task` call `send_to_and_await(current_target_addr, ...)`
  for client-initiated subscribe Requests. The task already selected
  `current_target_addr` via `prepare_initial_request`; previously that
  address was being dropped on the floor.

- On `Subscribed` reply, mirror the legacy Response handler's local
  side effects (`ring.subscribe`, `complete_subscription_request`,
  `interest_manager.add_local_client`) inside `op_ctx_task`. The
  task-per-tx reply forwarding bypass in `node.rs` skips
  `handle_op_request` for terminal Responses, so without these calls
  the local interest manager never learns about the subscription.

- Preserve the existing `send_and_await(msg)` (target=None) variant
  for non-routed call sites and tests.

The local-originator branch in `process_message` is preserved
intentionally — it remains correct for true standalone/sole-holder
nodes (PR #2327, #2004). The fix prevents the network-target path
from entering it.

Wire format is unchanged: this is purely an internal dispatch fix.

Regression test: `send_to_and_await_forwards_target_address` pins the
channel-level invariant (target propagates through the op execution
channel). The end-to-end behavior is covered by `test_ping_blocked_peers`,
which previously hung for ~280 s and now passes in ~12 s.

Also includes unrelated cleanup:
- `sqlite.rs`: replace `let _ =` on a sqlx Result with `drop(...)`
  to satisfy clippy's `let_underscore_must_use` (newer clippy than CI).

Closes #3838

[AI-assisted - Claude]
@github-actions
Copy link
Copy Markdown
Contributor

Rule Review: No blocking issues; minor test coverage gap noted

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md, contracts.md
Files reviewed: 8


Warnings

None.


Info

  • crates/core/src/operations/subscribe/op_ctx_task.rs:455-474 — The Subscribed branch side effects added here (ring.subscribe, complete_subscription_request, add_local_client, broadcast_change_interests) have no dedicated unit test. A future regression that removes any of these calls would only be caught by the integration test test_ping_blocked_peers, not by the new unit test send_to_and_await_forwards_target_address. The CI minimum (new #[test] function) is met and the end-to-end path is covered, but a focused unit test that exercises the reply-handling path in drive_client_subscribe_inner would make the coverage more robust. (rule: testing.md — regression tests should "catch this exact bug if reintroduced")

  • crates/core/src/operations/op_ctx.rs:419send_and_await is now #[allow(dead_code)] on a pub method. The comment ("Kept as stable API surface for later task-per-tx phases (Refactor core interfaces #1454)") is a reasonable justification, but if Phase 6 of Refactor core interfaces #1454 removes the legacy path without calling this method, it should be cleaned up at that point rather than left with a suppression. (rule: code-style.md — avoid backwards-compatibility stubs beyond their useful life)


Notes on what was checked and found clean:

  • No .unwrap() in new production code paths
  • No fire-and-forget spawns; broadcast_change_interests is properly awaited
  • No new biased; select usage
  • Push-before-send invariant is preserved: op_ctx(attempt_tx) is called (registering the tx) before send_to_and_await
  • sqlite.rs cleanup (drop(...) replacing let _ =) is correct and satisfies let_underscore_must_use
  • Commit message follows conventional commits format and is under 72 chars in subject
  • OpExecutionPayload type alias is well-documented with the motivating issue reference
  • All test call sites updated correctly with None for the new third tuple field

Rule review against .claude/rules/. WARNING findings block merge.

@sanity sanity merged commit f89bb60 into main Apr 13, 2026
12 checks passed
@sanity sanity deleted the fix-3838 branch April 13, 2026 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

test: test_ping_blocked_peers flaky in merge queue (propagation timeout)

1 participant