refactor(ops): add OpCtx with send_and_await primitive (#1454 phase 2a)#3803
refactor(ops): add OpCtx with send_and_await primitive (#1454 phase 2a)#3803iduartgomez merged 3 commits intomainfrom
Conversation
Introduce the per-transaction `OpCtx` scaffolding described in #1454 §3 "Phase 2a minimum surface". This ships as dormant infrastructure only: the single production caller (SUBSCRIBE's client-initiated path) is deferred to Phase 2b, and no op in `operations/{connect,put,get, subscribe,update}.rs` is migrated or otherwise touched. What ships - New `crates/core/src/operations/op_ctx.rs` with: - `OpCtx { tx, op_execution_sender }` struct (`Send`, not `Clone`) - `OpCtx::new` (crate-private constructor) - `OpCtx::tx()` accessor - `OpCtx::send_and_await(msg)` round-trip primitive - Three unit tests (happy path, dropped reply_sender, closed executor channel) and a compile-time `Send` assertion - `OpManager::op_ctx(tx)` factory on `op_state_manager.rs`, cloning the event-loop `op_execution_sender` into a fresh context. - `pub(crate) mod op_ctx;` and `pub(crate) use op_ctx::OpCtx;` added to `operations.rs`, following the same re-export pattern as `visited_peers::VisitedPeers`. What this deletes - `OpManager::notify_op_execution` (previously at `op_state_manager.rs:701-745`). Its body moves verbatim into `OpCtx::send_and_await`; its "Scaffolding reach" and "Deadlock risk" doc sections are preserved and expanded there. - The unused `#[allow(dead_code)]`-gated accessor methods on `EventLoopNotificationsReceiver` (`notifications_receiver`, `op_execution_receiver`) and `EventLoopNotificationsSender` (`op_execution_sender`). A crate-wide search (`rg '\.op_execution_ sender\(\)|\.op_execution_receiver\(\)|\.notifications_receiver\(\)'`) confirmed zero callers; callers use the `pub(crate)` fields directly per the pattern at `op_state_manager.rs` tests. The fields themselves stay; only the method wrappers and their attributes are removed. `send_and_await` picks up four new doc sections beyond the preserved Phase 1 ones: - "Single-use per Transaction" — the Phase 1 callback fires at most once per tx (dedup via `completed` / `under_progress`). A second `send_and_await` on the same tx will hang. Phase 2b's planner must use a fresh `Transaction` per retry attempt. - "Where to call this" — must run from a spawned op task, not a `priority_select` arm; `.send().await` on a bounded channel blocks. - "Terminal reply, not success reply" — `Ok(reply)` may carry a non-success terminal (`SubscribeMsg::Response::NotFound`, etc.). - "Errors" — `NotificationError` on send failure or receiver hang-up. Because Phase 2a has no production caller, the `OpCtx` struct and its impl block carry `#[allow(dead_code)]` until Phase 2b wires the first consumer; same rationale for `OpManager::op_ctx`. The re-exports of `EventLoopNotificationsReceiver` / `event_loop_notification_channel` through `node` are `#[cfg(test)]`-gated because the only crate-wide consumer outside `node/` is the new test shim in `op_ctx.rs`. Verification (pre-push) - `cargo fmt --check` — clean - `cargo clippy -p freenet --all-targets --features bench -- -D warnings` (pre-commit hook invocation) — clean - `cargo clippy --locked -- -D warnings` (CI invocation) — clean - `cargo test -p freenet --lib op_ctx` — 3 passed - `cargo test -p freenet --lib --features bench` — 2125 passed, 16 ignored, 0 failures Design reference: issue #1454 body, §3 "Phase 2a minimum surface (what ships first)". Phase 1 landed in PR #3802 (commit e025ca6).
Rule Review: No blocking issues foundRules checked: WarningsNone. Info
Notes on what's clean
Rule review against |
Responds to the Claude Rule Review bot's Info findings on PR #3803. Two nits addressed, one explicitly declined. Addressed: 1. `send_and_await` doc now carries a "Push-before-send" section citing `.claude/rules/operations.md` and explaining how the invariant manifests in both the legacy path (push state to `op_manager` before calling) and the task-per-tx path (initialize task-local state before calling). Phase 2b authors reading this doc in isolation will now see the invariant without having to cross-reference `operations.md`. 2. New regression test `send_and_await_second_call_hangs_as_documented` encodes the "single-use per Transaction" behavior described in the doc. Drives a successful first round-trip, then calls `send_and_await` a second time against a fake executor that holds the second `reply_sender` alive without firing — which is what Phase 1's real dedup (`completed` / `under_progress` sets in `OpManager`) does at the pipeline level. Asserts the second call times out via a 200 ms `tokio::time::timeout`. The test guards against doc drift: a future refactor that makes the second call fail fast (via an internal timeout or runtime check) will break the `is_err()` assertion, forcing a doc update. Attempting the naive shape first showed a subtle interaction: if the fake executor drops `reply_sender_2` at the end of its function body, `send_and_await`'s `recv().await` observes `None` and returns `Err(NotificationError)` — not a hang. The test uses a `oneshot::channel` to keep the executor alive until the test has observed the timeout, only then releasing `reply_sender_2` for clean shutdown. Shutdown uses an explicit `match` to discard the `Result<(), ()>` return from `oneshot::Sender::send` because `let _ =` trips `clippy::let_underscore_must_use` (denied crate-wide) and `drop(...)` trips `clippy::dropping_copy_types` (since `Result<(), ()>` is `Copy`). Declined: 3. The bot also suggested a unit test for the `debug_assert_eq!` tx-mismatch path. Not done. The two `debug_assert_eq!` calls are defense-in-depth against caller bugs at development time — in release builds they compile out and the downstream behavior is "callback routed to the wrong tx's `pending_op_result` consumer", which is a correctness bug in the caller, not a silent pass-through in `send_and_await`. A test of the debug assertion would test the language feature, not a `send_and_await` invariant. If release-build protection is wanted later, the fix is upgrading the asserts to runtime errors, not adding a test of the current debug-only shape. No production code changes; no behavior changes; no new dependencies. Verification: - `cargo fmt --check` — clean - `cargo clippy -p freenet --all-targets --features bench -- -D warnings` — clean - `cargo test -p freenet --lib op_ctx` — 4/4 passed (3 existing + 1 new) - `cargo test -p freenet --lib --features bench` — 2126 passed, 0 failed, 16 ignored
Addresses nits surfaced by a /pr-review subagent pass on PR #3803. Three edits, all documentation or test-hygiene; no production behavior changes. 1. Rewrite stale `notify_op_execution` doc references. Phase 2a deleted `OpManager::notify_op_execution` but left ~14 textual references to the identifier in Phase 1's doc comments and test module headers in `node.rs`, plus two references in the simulation integration test `test_pending_op_results_bounded`. None were actual call sites — all were prose pointers that went stale the moment the function was deleted. Per `AGENTS.md`'s "WHEN you discover outdated or missing documentation → Fix it immediately" rule, update them in-place to point at `OpCtx::send_and_await` / `OpManager::op_ctx`, and rewrite the simulation test's docstring to reflect the Phase 2a scaffolding state (no production consumer until Phase 2b). Two references are kept intentionally as archaeology: - `node.rs:791` (new history sentence in the `forward_pending_op_result_if_completed` docblock: "the caller side used to live on `OpManager::notify_op_execution`, now deleted. Phase 2a moved the caller side into `OpCtx`..."). - `node.rs:2554` (same note in the test module header). These give future archaeologists (including git-blame-blind readers of the code) one anchor for where the symbol used to live and what replaced it. The rest of the references were neither anchors nor history — just stale. 2. Bump `send_and_await_second_call_hangs_as_documented` timeout from 200 ms to 500 ms. The subagent review flagged the 200 ms constant as a potential CI-flakiness source: the Simulation job runs four fdev sims in parallel on the same runner, and a heavily contended scheduler could plausibly delay a tokio task by more than 200 ms before the timeout fires. 500 ms is still ~5× the first call's resolution time under uncontended conditions and ~2.5× what the other three tests allow themselves. Comment updated to record the reasoning so a future maintainer doesn't shrink it back. 3. Clarifying doc additions (no code changes): - `send_and_await`'s release-build `debug_assert_eq!` on the reply tx now carries an inline comment noting that in release a mismatched reply tx silently returns — the assert is defense-in-depth against bugs elsewhere in the pipeline, not a runtime correctness guard inside `send_and_await` itself. The release-build behavior ("a mismatched reply is returned for a different transaction") is a correctness bug in whatever code mislabeled the message in `p2p_protoc::pending_op_results`. - `send_and_await_returns_reply_on_completion` test gets a new doc header explaining that "happy path" means "the round-trip mechanics work", NOT "the reply represents success". The test uses `NetMessageV1::Aborted` precisely because that variant is orthogonal to success — only carries a `Transaction` — so the assertion is purely on the tx-routing mechanics. This addresses the subagent review's observation that `send_and_await`'s doc explicitly decouples `Ok(reply)` from "protocol success" but no test made that decoupling visible. The subagent's declined nit — "add a test for the `debug_assert_eq!` tx-mismatch path" — remains declined with rationale in the prior commit. The reviewer agreed with the decline and characterized it as "testing the language feature". The inline comment added in edit 3 captures the release-build consequence in the code itself so future readers have the framing without needing to dig through commit history. Verification: - `cargo fmt --check` — clean - `cargo clippy -p freenet --all-targets --features bench -- -D warnings` (pre-commit hook invocation) — clean - `cargo test -p freenet --lib op_ctx` — 4/4 passed (0.51s total, 0.30s of which is the 500 ms timeout elapsing in the hangs test) - `cargo test -p freenet --lib --features bench` — 2126 passed, 0 failed, 16 ignored
Problem
Phase 1 of the async-transaction refactor (#1454, PR #3802) wired the
pending_op_resultcallback hook inhandle_pure_network_message_v1for every op variant, giving us a round-trip primitive that can send aNetMessageand await a reply keyed by the sameTransaction. But the only caller of that primitive wasOpManager::notify_op_execution, a private helper marked#[allow(dead_code)]with a FIXME, and there was no ergonomic path for a future op task to get at it. Phase 2b (SUBSCRIBE migration, separate PR) needs a per-transaction execution context to hand to the task function it spawns — henceOpCtx.Solution
Introduce
OpCtxas dormant scaffolding with exactly one method (send_and_await) built directly on top of Phase 1's round-trip. The full target API from the design doc (spawn_sub, per-tx inbox, WSnotifyfanout,OpRegistry, and a newOperationtrait) is deferred until a concrete consumer needs it — matches Phase 1's "ship exactly what's needed, document constraints loudly, wait for a real caller" shape.What ships
crates/core/src/operations/op_ctx.rswith:OpCtx { tx, op_execution_sender }—Send, notClone, one owner per transactionOpCtx::new(crate-private constructor)OpCtx::tx()accessorOpCtx::send_and_await(&mut self, msg)— the round-trip primitiveSendassertion#[tokio::test]tests (happy path, dropped reply sender, closed executor channel)OpManager::op_ctx(&self, tx: Transaction) -> OpCtxfactory inop_state_manager.rspub(crate) mod op_ctx;+ re-export inoperations.rs, matching the existingvisited_peers::VisitedPeerspatternWhat this deletes
OpManager::notify_op_execution(previously ~40 lines atop_state_manager.rs:701-745). Its body moves verbatim intoOpCtx::send_and_await; its "Scaffolding reach" and "Deadlock risk" doc sections are preserved there and expanded with four new sections (see below).#[allow(dead_code)]-gated accessor methods onEventLoopNotificationsReceiver/EventLoopNotificationsSender(notifications_receiver(),op_execution_receiver(),op_execution_sender()). A crate-wide grep confirmed zero callers — the fields arepub(crate)and callers destructure them directly per the pattern atop_state_manager.rs:2166-2170. The fields themselves stay; only the dead method wrappers are removed.Design decisions that deviate from a naive literal read of the design doc
&mut selfonsend_and_await, notselfby value. This leavesctx.tx()available after the call and matches the Rev 3 sketch. The single-use-per-tx constraint is documented in prose rather than compile-enforced.OpCtxispub(crate), notpub. Nothing outside this crate has a legitimate use today.OpCtxholds theop_execution_senderdirectly, notArc<OpManager>. This was the decisive call for testability:OpManager::newrequiresNodeConfig + ContractHandlerChannel + ConnectionManager + NetEventRegister + BackgroundTaskMonitor + result_router_txand there is no test constructor. The sender-only shape lets the tests wireevent_loop_notification_channel()directly, following the existing pattern atop_state_manager.rs:2155-2207.Doc comment expansions on
send_and_await(beyond the Phase 1 preserved sections)completed/under_progressdedup sets suppress subsequent dispatches. A secondsend_and_awaiton the same tx will hang onresponse_receiver.recv(). Callers needing retries must use a freshTransactionper attempt. This is a known constraint Phase 2b's planner must work around.GlobalExecutor::spawn), not from inside the main event loop or apriority_selectarm, because the internal.send().awaiton a bounded channel can block.Ok(reply)may carry a non-success terminal (e.g.,SubscribeMsg::Response::NotFound). Callers inspect the reply to decide what happened.OpError::NotificationErroron send failure or receiver hang-up.Dead code —
OpCtx, its impl block, andOpManager::op_ctxcarry#[allow(dead_code)]with a Phase 2b removal note until the first production caller lands. This mirrors how Phase 1 left the attribute onnotify_op_executionuntil Phase 2a became its first caller.Testing
cargo fmt --check— cleancargo clippy -p freenet --all-targets --features bench -- -D warnings(pre-commit hook invocation, stricter than CI) — cleancargo clippy --locked -- -D warnings(CI invocation) — cleancargo test -p freenet --lib op_ctx— 3/3 passedcargo test -p freenet --lib --features bench— 2125 passed, 0 failed, 16 ignored (full library suite)The three new unit tests follow the
op_state_manager.rs:2155-2207pattern — wireevent_loop_notification_channel()directly, destructureEventLoopNotificationsReceiver, spawn a synthetic executor that mirrors what the real pipeline would do. NoOpManageris constructed (there's no test constructor). Each test wrapssend_and_awaitin a 1-secondtokio::time::timeoutas a defensive guard so a future regression into a hang fails fast rather than stalling CI.Fixes
Partial progress on #1454. This is Phase 2a of the phased rollout in the design doc (§5). Does not close the tracking issue. Upcoming phases:
OpCtx(activates the first production caller, drops the#[allow(dead_code)]attrs)OpManagerDashMap removal, docs/rules sweepSee the design doc in the body of #1454 (Rev 3), specifically §3 "Phase 2a minimum surface" for the target API and the Errata "Rev 2 → Rev 3" section for the phasing rationale.