refactor(ops): add OpCtx with send_and_await primitive (#1454 phase 2a) by iduartgomez · Pull Request #3803 · freenet/freenet-core

iduartgomez · 2026-04-09T09:16:58Z

Problem

Phase 1 of the async-transaction refactor (#1454, PR #3802) wired the pending_op_result callback hook in handle_pure_network_message_v1 for every op variant, giving us a round-trip primitive that can send a NetMessage and await a reply keyed by the same Transaction. But the only caller of that primitive was OpManager::notify_op_execution, a private helper marked #[allow(dead_code)] with a FIXME, and there was no ergonomic path for a future op task to get at it. Phase 2b (SUBSCRIBE migration, separate PR) needs a per-transaction execution context to hand to the task function it spawns — hence OpCtx.

Solution

Introduce OpCtx as dormant scaffolding with exactly one method (send_and_await) built directly on top of Phase 1's round-trip. The full target API from the design doc (spawn_sub, per-tx inbox, WS notify fanout, OpRegistry, and a new Operation trait) is deferred until a concrete consumer needs it — matches Phase 1's "ship exactly what's needed, document constraints loudly, wait for a real caller" shape.

What ships

New file crates/core/src/operations/op_ctx.rs with:
- OpCtx { tx, op_execution_sender } — Send, not Clone, one owner per transaction
- OpCtx::new (crate-private constructor)
- OpCtx::tx() accessor
- OpCtx::send_and_await(&mut self, msg) — the round-trip primitive
- Compile-time Send assertion
- Three #[tokio::test] tests (happy path, dropped reply sender, closed executor channel)
OpManager::op_ctx(&self, tx: Transaction) -> OpCtx factory in op_state_manager.rs
pub(crate) mod op_ctx; + re-export in operations.rs, matching the existing visited_peers::VisitedPeers pattern

What this deletes

OpManager::notify_op_execution (previously ~40 lines at op_state_manager.rs:701-745). Its body moves verbatim into OpCtx::send_and_await; its "Scaffolding reach" and "Deadlock risk" doc sections are preserved there and expanded with four new sections (see below).
Three unused #[allow(dead_code)]-gated accessor methods on EventLoopNotificationsReceiver / EventLoopNotificationsSender (notifications_receiver(), op_execution_receiver(), op_execution_sender()). A crate-wide grep confirmed zero callers — the fields are pub(crate) and callers destructure them directly per the pattern at op_state_manager.rs:2166-2170. The fields themselves stay; only the dead method wrappers are removed.

Design decisions that deviate from a naive literal read of the design doc

&mut self on send_and_await, not self by value. This leaves ctx.tx() available after the call and matches the Rev 3 sketch. The single-use-per-tx constraint is documented in prose rather than compile-enforced.
OpCtx is pub(crate), not pub. Nothing outside this crate has a legitimate use today.
OpCtx holds the op_execution_sender directly, not Arc<OpManager>. This was the decisive call for testability: OpManager::new requires NodeConfig + ContractHandlerChannel + ConnectionManager + NetEventRegister + BackgroundTaskMonitor + result_router_tx and there is no test constructor. The sender-only shape lets the tests wire event_loop_notification_channel() directly, following the existing pattern at op_state_manager.rs:2155-2207.

Doc comment expansions on send_and_await (beyond the Phase 1 preserved sections)

Single-use per Transaction — Phase 1's callback fires at most once per tx because the completed / under_progress dedup sets suppress subsequent dispatches. A second send_and_await on the same tx will hang on response_receiver.recv(). Callers needing retries must use a fresh Transaction per attempt. This is a known constraint Phase 2b's planner must work around.
Where to call this — must run from an op task (e.g., GlobalExecutor::spawn), not from inside the main event loop or a priority_select arm, because the internal .send().await on a bounded channel can block.
Terminal reply, not success reply — Ok(reply) may carry a non-success terminal (e.g., SubscribeMsg::Response::NotFound). Callers inspect the reply to decide what happened.
Errors — OpError::NotificationError on send failure or receiver hang-up.

Dead code — OpCtx, its impl block, and OpManager::op_ctx carry #[allow(dead_code)] with a Phase 2b removal note until the first production caller lands. This mirrors how Phase 1 left the attribute on notify_op_execution until Phase 2a became its first caller.

Testing

cargo fmt --check — clean
cargo clippy -p freenet --all-targets --features bench -- -D warnings (pre-commit hook invocation, stricter than CI) — clean
cargo clippy --locked -- -D warnings (CI invocation) — clean
cargo test -p freenet --lib op_ctx — 3/3 passed
cargo test -p freenet --lib --features bench — 2125 passed, 0 failed, 16 ignored (full library suite)

The three new unit tests follow the op_state_manager.rs:2155-2207 pattern — wire event_loop_notification_channel() directly, destructure EventLoopNotificationsReceiver, spawn a synthetic executor that mirrors what the real pipeline would do. No OpManager is constructed (there's no test constructor). Each test wraps send_and_await in a 1-second tokio::time::timeout as a defensive guard so a future regression into a hang fails fast rather than stalling CI.

Fixes

Partial progress on #1454. This is Phase 2a of the phased rollout in the design doc (§5). Does not close the tracking issue. Upcoming phases:

Phase 2b: Migrate SUBSCRIBE's client-initiated path onto OpCtx (activates the first production caller, drops the #[allow(dead_code)] attrs)
Phase 2.5: Port PUT-completion → SUBSCRIBE sub-op hand-off
Phase 2c: Migrate CONNECT (schedule TBD)
Phases 3–6: GET/PUT streaming, UPDATE broadcast, OpManager DashMap removal, docs/rules sweep

See the design doc in the body of #1454 (Rev 3), specifically §3 "Phase 2a minimum surface" for the target API and the Errata "Rev 2 → Rev 3" section for the phasing rationale.

Introduce the per-transaction `OpCtx` scaffolding described in #1454 §3 "Phase 2a minimum surface". This ships as dormant infrastructure only: the single production caller (SUBSCRIBE's client-initiated path) is deferred to Phase 2b, and no op in `operations/{connect,put,get, subscribe,update}.rs` is migrated or otherwise touched. What ships - New `crates/core/src/operations/op_ctx.rs` with: - `OpCtx { tx, op_execution_sender }` struct (`Send`, not `Clone`) - `OpCtx::new` (crate-private constructor) - `OpCtx::tx()` accessor - `OpCtx::send_and_await(msg)` round-trip primitive - Three unit tests (happy path, dropped reply_sender, closed executor channel) and a compile-time `Send` assertion - `OpManager::op_ctx(tx)` factory on `op_state_manager.rs`, cloning the event-loop `op_execution_sender` into a fresh context. - `pub(crate) mod op_ctx;` and `pub(crate) use op_ctx::OpCtx;` added to `operations.rs`, following the same re-export pattern as `visited_peers::VisitedPeers`. What this deletes - `OpManager::notify_op_execution` (previously at `op_state_manager.rs:701-745`). Its body moves verbatim into `OpCtx::send_and_await`; its "Scaffolding reach" and "Deadlock risk" doc sections are preserved and expanded there. - The unused `#[allow(dead_code)]`-gated accessor methods on `EventLoopNotificationsReceiver` (`notifications_receiver`, `op_execution_receiver`) and `EventLoopNotificationsSender` (`op_execution_sender`). A crate-wide search (`rg '\.op_execution_ sender\(\)|\.op_execution_receiver\(\)|\.notifications_receiver\(\)'`) confirmed zero callers; callers use the `pub(crate)` fields directly per the pattern at `op_state_manager.rs` tests. The fields themselves stay; only the method wrappers and their attributes are removed. `send_and_await` picks up four new doc sections beyond the preserved Phase 1 ones: - "Single-use per Transaction" — the Phase 1 callback fires at most once per tx (dedup via `completed` / `under_progress`). A second `send_and_await` on the same tx will hang. Phase 2b's planner must use a fresh `Transaction` per retry attempt. - "Where to call this" — must run from a spawned op task, not a `priority_select` arm; `.send().await` on a bounded channel blocks. - "Terminal reply, not success reply" — `Ok(reply)` may carry a non-success terminal (`SubscribeMsg::Response::NotFound`, etc.). - "Errors" — `NotificationError` on send failure or receiver hang-up. Because Phase 2a has no production caller, the `OpCtx` struct and its impl block carry `#[allow(dead_code)]` until Phase 2b wires the first consumer; same rationale for `OpManager::op_ctx`. The re-exports of `EventLoopNotificationsReceiver` / `event_loop_notification_channel` through `node` are `#[cfg(test)]`-gated because the only crate-wide consumer outside `node/` is the new test shim in `op_ctx.rs`. Verification (pre-push) - `cargo fmt --check` — clean - `cargo clippy -p freenet --all-targets --features bench -- -D warnings` (pre-commit hook invocation) — clean - `cargo clippy --locked -- -D warnings` (CI invocation) — clean - `cargo test -p freenet --lib op_ctx` — 3 passed - `cargo test -p freenet --lib --features bench` — 2125 passed, 16 ignored, 0 failures Design reference: issue #1454 body, §3 "Phase 2a minimum surface (what ships first)". Phase 1 landed in PR #3802 (commit e025ca6).

github-actions · 2026-04-09T10:17:56Z

Rule Review: No blocking issues found

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 6 (node.rs, network_bridge.rs, op_state_manager.rs, operations.rs, op_ctx.rs [new], simulation_integration.rs)

Warnings

None.

Info

op_ctx.rs:342 — tx() is declared pub on a pub(crate) struct; pub(crate) would be consistent with the struct visibility and new(). (code-style.md)
op_ctx.rs:447 — response_receiver.recv().await has no timeout, which the # Deadlock risk doc section explicitly documents and defers to Phase 2b. Since there are no production callers yet this is a well-controlled deferral, but the first production caller in Phase 2b must add a tokio::time::timeout wrapper before using this path in a real op task. (code-style.md — "use tokio::time::timeout() instead")

Notes on what's clean

New op_ctx.rs module uses foo.rs layout (not mod.rs) ✓
No .unwrap() in production code; tests use .expect() with messages ✓
All four tests cover happy path, both error paths, and the single-use-per-tx constraint — solid edge-case coverage ✓
Push-before-send invariant is documented in the # Push-before-send section of send_and_await ✓
Dead executor methods removed cleanly from network_bridge.rs without leaving stale #[allow(dead_code)] ✓
Doc comments updated across all affected call sites to reference OpCtx::send_and_await instead of deleted notify_op_execution ✓
This is a refactor: PR (not fix:), so no regression test requirement ✓

Rule review against .claude/rules/. WARNING findings block merge.

Responds to the Claude Rule Review bot's Info findings on PR #3803. Two nits addressed, one explicitly declined. Addressed: 1. `send_and_await` doc now carries a "Push-before-send" section citing `.claude/rules/operations.md` and explaining how the invariant manifests in both the legacy path (push state to `op_manager` before calling) and the task-per-tx path (initialize task-local state before calling). Phase 2b authors reading this doc in isolation will now see the invariant without having to cross-reference `operations.md`. 2. New regression test `send_and_await_second_call_hangs_as_documented` encodes the "single-use per Transaction" behavior described in the doc. Drives a successful first round-trip, then calls `send_and_await` a second time against a fake executor that holds the second `reply_sender` alive without firing — which is what Phase 1's real dedup (`completed` / `under_progress` sets in `OpManager`) does at the pipeline level. Asserts the second call times out via a 200 ms `tokio::time::timeout`. The test guards against doc drift: a future refactor that makes the second call fail fast (via an internal timeout or runtime check) will break the `is_err()` assertion, forcing a doc update. Attempting the naive shape first showed a subtle interaction: if the fake executor drops `reply_sender_2` at the end of its function body, `send_and_await`'s `recv().await` observes `None` and returns `Err(NotificationError)` — not a hang. The test uses a `oneshot::channel` to keep the executor alive until the test has observed the timeout, only then releasing `reply_sender_2` for clean shutdown. Shutdown uses an explicit `match` to discard the `Result<(), ()>` return from `oneshot::Sender::send` because `let _ =` trips `clippy::let_underscore_must_use` (denied crate-wide) and `drop(...)` trips `clippy::dropping_copy_types` (since `Result<(), ()>` is `Copy`). Declined: 3. The bot also suggested a unit test for the `debug_assert_eq!` tx-mismatch path. Not done. The two `debug_assert_eq!` calls are defense-in-depth against caller bugs at development time — in release builds they compile out and the downstream behavior is "callback routed to the wrong tx's `pending_op_result` consumer", which is a correctness bug in the caller, not a silent pass-through in `send_and_await`. A test of the debug assertion would test the language feature, not a `send_and_await` invariant. If release-build protection is wanted later, the fix is upgrading the asserts to runtime errors, not adding a test of the current debug-only shape. No production code changes; no behavior changes; no new dependencies. Verification: - `cargo fmt --check` — clean - `cargo clippy -p freenet --all-targets --features bench -- -D warnings` — clean - `cargo test -p freenet --lib op_ctx` — 4/4 passed (3 existing + 1 new) - `cargo test -p freenet --lib --features bench` — 2126 passed, 0 failed, 16 ignored

Addresses nits surfaced by a /pr-review subagent pass on PR #3803. Three edits, all documentation or test-hygiene; no production behavior changes. 1. Rewrite stale `notify_op_execution` doc references. Phase 2a deleted `OpManager::notify_op_execution` but left ~14 textual references to the identifier in Phase 1's doc comments and test module headers in `node.rs`, plus two references in the simulation integration test `test_pending_op_results_bounded`. None were actual call sites — all were prose pointers that went stale the moment the function was deleted. Per `AGENTS.md`'s "WHEN you discover outdated or missing documentation → Fix it immediately" rule, update them in-place to point at `OpCtx::send_and_await` / `OpManager::op_ctx`, and rewrite the simulation test's docstring to reflect the Phase 2a scaffolding state (no production consumer until Phase 2b). Two references are kept intentionally as archaeology: - `node.rs:791` (new history sentence in the `forward_pending_op_result_if_completed` docblock: "the caller side used to live on `OpManager::notify_op_execution`, now deleted. Phase 2a moved the caller side into `OpCtx`..."). - `node.rs:2554` (same note in the test module header). These give future archaeologists (including git-blame-blind readers of the code) one anchor for where the symbol used to live and what replaced it. The rest of the references were neither anchors nor history — just stale. 2. Bump `send_and_await_second_call_hangs_as_documented` timeout from 200 ms to 500 ms. The subagent review flagged the 200 ms constant as a potential CI-flakiness source: the Simulation job runs four fdev sims in parallel on the same runner, and a heavily contended scheduler could plausibly delay a tokio task by more than 200 ms before the timeout fires. 500 ms is still ~5× the first call's resolution time under uncontended conditions and ~2.5× what the other three tests allow themselves. Comment updated to record the reasoning so a future maintainer doesn't shrink it back. 3. Clarifying doc additions (no code changes): - `send_and_await`'s release-build `debug_assert_eq!` on the reply tx now carries an inline comment noting that in release a mismatched reply tx silently returns — the assert is defense-in-depth against bugs elsewhere in the pipeline, not a runtime correctness guard inside `send_and_await` itself. The release-build behavior ("a mismatched reply is returned for a different transaction") is a correctness bug in whatever code mislabeled the message in `p2p_protoc::pending_op_results`. - `send_and_await_returns_reply_on_completion` test gets a new doc header explaining that "happy path" means "the round-trip mechanics work", NOT "the reply represents success". The test uses `NetMessageV1::Aborted` precisely because that variant is orthogonal to success — only carries a `Transaction` — so the assertion is purely on the tx-routing mechanics. This addresses the subagent review's observation that `send_and_await`'s doc explicitly decouples `Ok(reply)` from "protocol success" but no test made that decoupling visible. The subagent's declined nit — "add a test for the `debug_assert_eq!` tx-mismatch path" — remains declined with rationale in the prior commit. The reviewer agreed with the decline and characterized it as "testing the language feature". The inline comment added in edit 3 captures the release-build consequence in the code itself so future readers have the framing without needing to dig through commit history. Verification: - `cargo fmt --check` — clean - `cargo clippy -p freenet --all-targets --features bench -- -D warnings` (pre-commit hook invocation) — clean - `cargo test -p freenet --lib op_ctx` — 4/4 passed (0.51s total, 0.30s of which is the 500 ms timeout elapsing in the hangs test) - `cargo test -p freenet --lib --features bench` — 2126 passed, 0 failed, 16 ignored

iduartgomez marked this pull request as ready for review April 9, 2026 10:15

iduartgomez added 2 commits April 9, 2026 12:28

iduartgomez enabled auto-merge April 9, 2026 10:57

iduartgomez added this pull request to the merge queue Apr 9, 2026

Merged via the queue into main with commit 81396e6 Apr 9, 2026
12 checks passed

iduartgomez deleted the claude/1454-phase2a-op-ctx branch April 9, 2026 11:16

iduartgomez mentioned this pull request Apr 9, 2026

refactor(ops): migrate client-initiated SUBSCRIBE to OpCtx (#1454 phase 2b) #3806

Merged

5 tasks

sanity mentioned this pull request Apr 13, 2026

ci: isolate and contain fdev simulation hangs #3866

Merged

iduartgomez mentioned this pull request Apr 15, 2026

refactor(ops): migrate client-initiated GET to task-per-tx driver (#1454 phase 3b) #3884

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

refactor(ops): add OpCtx with send_and_await primitive (#1454 phase 2a)#3803

refactor(ops): add OpCtx with send_and_await primitive (#1454 phase 2a)#3803
iduartgomez merged 3 commits intomainfrom
claude/1454-phase2a-op-ctx

iduartgomez commented Apr 9, 2026

Uh oh!

github-actions Bot commented Apr 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

iduartgomez commented Apr 9, 2026

Problem

Solution

Testing

Fixes

Uh oh!

github-actions Bot commented Apr 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rule Review: No blocking issues found

Warnings

Info

Notes on what's clean

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

github-actions Bot commented Apr 9, 2026 •

edited

Loading