Skip to content

refactor(ops): add OpCtx with send_and_await primitive (#1454 phase 2a)#3803

Merged
iduartgomez merged 3 commits intomainfrom
claude/1454-phase2a-op-ctx
Apr 9, 2026
Merged

refactor(ops): add OpCtx with send_and_await primitive (#1454 phase 2a)#3803
iduartgomez merged 3 commits intomainfrom
claude/1454-phase2a-op-ctx

Conversation

@iduartgomez
Copy link
Copy Markdown
Collaborator

Problem

Phase 1 of the async-transaction refactor (#1454, PR #3802) wired the pending_op_result callback hook in handle_pure_network_message_v1 for every op variant, giving us a round-trip primitive that can send a NetMessage and await a reply keyed by the same Transaction. But the only caller of that primitive was OpManager::notify_op_execution, a private helper marked #[allow(dead_code)] with a FIXME, and there was no ergonomic path for a future op task to get at it. Phase 2b (SUBSCRIBE migration, separate PR) needs a per-transaction execution context to hand to the task function it spawns — hence OpCtx.

Solution

Introduce OpCtx as dormant scaffolding with exactly one method (send_and_await) built directly on top of Phase 1's round-trip. The full target API from the design doc (spawn_sub, per-tx inbox, WS notify fanout, OpRegistry, and a new Operation trait) is deferred until a concrete consumer needs it — matches Phase 1's "ship exactly what's needed, document constraints loudly, wait for a real caller" shape.

What ships

  • New file crates/core/src/operations/op_ctx.rs with:
    • OpCtx { tx, op_execution_sender }Send, not Clone, one owner per transaction
    • OpCtx::new (crate-private constructor)
    • OpCtx::tx() accessor
    • OpCtx::send_and_await(&mut self, msg) — the round-trip primitive
    • Compile-time Send assertion
    • Three #[tokio::test] tests (happy path, dropped reply sender, closed executor channel)
  • OpManager::op_ctx(&self, tx: Transaction) -> OpCtx factory in op_state_manager.rs
  • pub(crate) mod op_ctx; + re-export in operations.rs, matching the existing visited_peers::VisitedPeers pattern

What this deletes

  • OpManager::notify_op_execution (previously ~40 lines at op_state_manager.rs:701-745). Its body moves verbatim into OpCtx::send_and_await; its "Scaffolding reach" and "Deadlock risk" doc sections are preserved there and expanded with four new sections (see below).
  • Three unused #[allow(dead_code)]-gated accessor methods on EventLoopNotificationsReceiver / EventLoopNotificationsSender (notifications_receiver(), op_execution_receiver(), op_execution_sender()). A crate-wide grep confirmed zero callers — the fields are pub(crate) and callers destructure them directly per the pattern at op_state_manager.rs:2166-2170. The fields themselves stay; only the dead method wrappers are removed.

Design decisions that deviate from a naive literal read of the design doc

  • &mut self on send_and_await, not self by value. This leaves ctx.tx() available after the call and matches the Rev 3 sketch. The single-use-per-tx constraint is documented in prose rather than compile-enforced.
  • OpCtx is pub(crate), not pub. Nothing outside this crate has a legitimate use today.
  • OpCtx holds the op_execution_sender directly, not Arc<OpManager>. This was the decisive call for testability: OpManager::new requires NodeConfig + ContractHandlerChannel + ConnectionManager + NetEventRegister + BackgroundTaskMonitor + result_router_tx and there is no test constructor. The sender-only shape lets the tests wire event_loop_notification_channel() directly, following the existing pattern at op_state_manager.rs:2155-2207.

Doc comment expansions on send_and_await (beyond the Phase 1 preserved sections)

  • Single-use per Transaction — Phase 1's callback fires at most once per tx because the completed / under_progress dedup sets suppress subsequent dispatches. A second send_and_await on the same tx will hang on response_receiver.recv(). Callers needing retries must use a fresh Transaction per attempt. This is a known constraint Phase 2b's planner must work around.
  • Where to call this — must run from an op task (e.g., GlobalExecutor::spawn), not from inside the main event loop or a priority_select arm, because the internal .send().await on a bounded channel can block.
  • Terminal reply, not success replyOk(reply) may carry a non-success terminal (e.g., SubscribeMsg::Response::NotFound). Callers inspect the reply to decide what happened.
  • ErrorsOpError::NotificationError on send failure or receiver hang-up.

Dead codeOpCtx, its impl block, and OpManager::op_ctx carry #[allow(dead_code)] with a Phase 2b removal note until the first production caller lands. This mirrors how Phase 1 left the attribute on notify_op_execution until Phase 2a became its first caller.

Testing

  • cargo fmt --check — clean
  • cargo clippy -p freenet --all-targets --features bench -- -D warnings (pre-commit hook invocation, stricter than CI) — clean
  • cargo clippy --locked -- -D warnings (CI invocation) — clean
  • cargo test -p freenet --lib op_ctx3/3 passed
  • cargo test -p freenet --lib --features bench2125 passed, 0 failed, 16 ignored (full library suite)

The three new unit tests follow the op_state_manager.rs:2155-2207 pattern — wire event_loop_notification_channel() directly, destructure EventLoopNotificationsReceiver, spawn a synthetic executor that mirrors what the real pipeline would do. No OpManager is constructed (there's no test constructor). Each test wraps send_and_await in a 1-second tokio::time::timeout as a defensive guard so a future regression into a hang fails fast rather than stalling CI.

Fixes

Partial progress on #1454. This is Phase 2a of the phased rollout in the design doc (§5). Does not close the tracking issue. Upcoming phases:

  • Phase 2b: Migrate SUBSCRIBE's client-initiated path onto OpCtx (activates the first production caller, drops the #[allow(dead_code)] attrs)
  • Phase 2.5: Port PUT-completion → SUBSCRIBE sub-op hand-off
  • Phase 2c: Migrate CONNECT (schedule TBD)
  • Phases 3–6: GET/PUT streaming, UPDATE broadcast, OpManager DashMap removal, docs/rules sweep

See the design doc in the body of #1454 (Rev 3), specifically §3 "Phase 2a minimum surface" for the target API and the Errata "Rev 2 → Rev 3" section for the phasing rationale.

Introduce the per-transaction `OpCtx` scaffolding described in #1454 §3
"Phase 2a minimum surface". This ships as dormant infrastructure only:
the single production caller (SUBSCRIBE's client-initiated path) is
deferred to Phase 2b, and no op in `operations/{connect,put,get,
subscribe,update}.rs` is migrated or otherwise touched.

What ships
- New `crates/core/src/operations/op_ctx.rs` with:
  - `OpCtx { tx, op_execution_sender }` struct (`Send`, not `Clone`)
  - `OpCtx::new` (crate-private constructor)
  - `OpCtx::tx()` accessor
  - `OpCtx::send_and_await(msg)` round-trip primitive
  - Three unit tests (happy path, dropped reply_sender, closed executor
    channel) and a compile-time `Send` assertion
- `OpManager::op_ctx(tx)` factory on `op_state_manager.rs`, cloning the
  event-loop `op_execution_sender` into a fresh context.
- `pub(crate) mod op_ctx;` and `pub(crate) use op_ctx::OpCtx;` added to
  `operations.rs`, following the same re-export pattern as
  `visited_peers::VisitedPeers`.

What this deletes
- `OpManager::notify_op_execution` (previously at
  `op_state_manager.rs:701-745`). Its body moves verbatim into
  `OpCtx::send_and_await`; its "Scaffolding reach" and "Deadlock risk"
  doc sections are preserved and expanded there.
- The unused `#[allow(dead_code)]`-gated accessor methods on
  `EventLoopNotificationsReceiver` (`notifications_receiver`,
  `op_execution_receiver`) and `EventLoopNotificationsSender`
  (`op_execution_sender`). A crate-wide search (`rg '\.op_execution_
  sender\(\)|\.op_execution_receiver\(\)|\.notifications_receiver\(\)'`)
  confirmed zero callers; callers use the `pub(crate)` fields
  directly per the pattern at `op_state_manager.rs` tests. The fields
  themselves stay; only the method wrappers and their attributes are
  removed.

`send_and_await` picks up four new doc sections beyond the preserved
Phase 1 ones:
- "Single-use per Transaction" — the Phase 1 callback fires at most
  once per tx (dedup via `completed` / `under_progress`). A second
  `send_and_await` on the same tx will hang. Phase 2b's planner must
  use a fresh `Transaction` per retry attempt.
- "Where to call this" — must run from a spawned op task, not a
  `priority_select` arm; `.send().await` on a bounded channel blocks.
- "Terminal reply, not success reply" — `Ok(reply)` may carry a
  non-success terminal (`SubscribeMsg::Response::NotFound`, etc.).
- "Errors" — `NotificationError` on send failure or receiver hang-up.

Because Phase 2a has no production caller, the `OpCtx` struct and its
impl block carry `#[allow(dead_code)]` until Phase 2b wires the first
consumer; same rationale for `OpManager::op_ctx`. The re-exports of
`EventLoopNotificationsReceiver` / `event_loop_notification_channel`
through `node` are `#[cfg(test)]`-gated because the only crate-wide
consumer outside `node/` is the new test shim in `op_ctx.rs`.

Verification (pre-push)
- `cargo fmt --check` — clean
- `cargo clippy -p freenet --all-targets --features bench -- -D warnings`
  (pre-commit hook invocation) — clean
- `cargo clippy --locked -- -D warnings` (CI invocation) — clean
- `cargo test -p freenet --lib op_ctx` — 3 passed
- `cargo test -p freenet --lib --features bench` — 2125 passed,
  16 ignored, 0 failures

Design reference: issue #1454 body, §3 "Phase 2a minimum surface
(what ships first)". Phase 1 landed in PR #3802 (commit e025ca6).
@iduartgomez iduartgomez marked this pull request as ready for review April 9, 2026 10:15
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Apr 9, 2026

Rule Review: No blocking issues found

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 6 (node.rs, network_bridge.rs, op_state_manager.rs, operations.rs, op_ctx.rs [new], simulation_integration.rs)

Warnings

None.

Info

  • op_ctx.rs:342tx() is declared pub on a pub(crate) struct; pub(crate) would be consistent with the struct visibility and new(). (code-style.md)

  • op_ctx.rs:447response_receiver.recv().await has no timeout, which the # Deadlock risk doc section explicitly documents and defers to Phase 2b. Since there are no production callers yet this is a well-controlled deferral, but the first production caller in Phase 2b must add a tokio::time::timeout wrapper before using this path in a real op task. (code-style.md — "use tokio::time::timeout() instead")

Notes on what's clean

  • New op_ctx.rs module uses foo.rs layout (not mod.rs) ✓
  • No .unwrap() in production code; tests use .expect() with messages ✓
  • All four tests cover happy path, both error paths, and the single-use-per-tx constraint — solid edge-case coverage ✓
  • Push-before-send invariant is documented in the # Push-before-send section of send_and_await
  • Dead executor methods removed cleanly from network_bridge.rs without leaving stale #[allow(dead_code)]
  • Doc comments updated across all affected call sites to reference OpCtx::send_and_await instead of deleted notify_op_execution
  • This is a refactor: PR (not fix:), so no regression test requirement ✓

Rule review against .claude/rules/. WARNING findings block merge.

Responds to the Claude Rule Review bot's Info findings on PR #3803.
Two nits addressed, one explicitly declined.

Addressed:

1. `send_and_await` doc now carries a "Push-before-send" section
   citing `.claude/rules/operations.md` and explaining how the
   invariant manifests in both the legacy path (push state to
   `op_manager` before calling) and the task-per-tx path
   (initialize task-local state before calling). Phase 2b authors
   reading this doc in isolation will now see the invariant
   without having to cross-reference `operations.md`.

2. New regression test `send_and_await_second_call_hangs_as_documented`
   encodes the "single-use per Transaction" behavior described in
   the doc. Drives a successful first round-trip, then calls
   `send_and_await` a second time against a fake executor that
   holds the second `reply_sender` alive without firing — which
   is what Phase 1's real dedup (`completed` / `under_progress`
   sets in `OpManager`) does at the pipeline level. Asserts the
   second call times out via a 200 ms `tokio::time::timeout`.
   The test guards against doc drift: a future refactor that
   makes the second call fail fast (via an internal timeout or
   runtime check) will break the `is_err()` assertion, forcing
   a doc update.

   Attempting the naive shape first showed a subtle interaction:
   if the fake executor drops `reply_sender_2` at the end of its
   function body, `send_and_await`'s `recv().await` observes
   `None` and returns `Err(NotificationError)` — not a hang. The
   test uses a `oneshot::channel` to keep the executor alive
   until the test has observed the timeout, only then releasing
   `reply_sender_2` for clean shutdown. Shutdown uses an explicit
   `match` to discard the `Result<(), ()>` return from
   `oneshot::Sender::send` because `let _ =` trips
   `clippy::let_underscore_must_use` (denied crate-wide) and
   `drop(...)` trips `clippy::dropping_copy_types` (since
   `Result<(), ()>` is `Copy`).

Declined:

3. The bot also suggested a unit test for the `debug_assert_eq!`
   tx-mismatch path. Not done. The two `debug_assert_eq!` calls
   are defense-in-depth against caller bugs at development
   time — in release builds they compile out and the downstream
   behavior is "callback routed to the wrong tx's `pending_op_result`
   consumer", which is a correctness bug in the caller, not a
   silent pass-through in `send_and_await`. A test of the debug
   assertion would test the language feature, not a `send_and_await`
   invariant. If release-build protection is wanted later, the
   fix is upgrading the asserts to runtime errors, not adding a
   test of the current debug-only shape.

No production code changes; no behavior changes; no new dependencies.

Verification:
- `cargo fmt --check` — clean
- `cargo clippy -p freenet --all-targets --features bench -- -D warnings` — clean
- `cargo test -p freenet --lib op_ctx` — 4/4 passed (3 existing + 1 new)
- `cargo test -p freenet --lib --features bench` — 2126 passed, 0 failed, 16 ignored
Addresses nits surfaced by a /pr-review subagent pass on PR #3803.
Three edits, all documentation or test-hygiene; no production behavior
changes.

1. Rewrite stale `notify_op_execution` doc references.

   Phase 2a deleted `OpManager::notify_op_execution` but left ~14
   textual references to the identifier in Phase 1's doc comments
   and test module headers in `node.rs`, plus two references in the
   simulation integration test `test_pending_op_results_bounded`.
   None were actual call sites — all were prose pointers that went
   stale the moment the function was deleted. Per `AGENTS.md`'s
   "WHEN you discover outdated or missing documentation → Fix it
   immediately" rule, update them in-place to point at
   `OpCtx::send_and_await` / `OpManager::op_ctx`, and rewrite the
   simulation test's docstring to reflect the Phase 2a scaffolding
   state (no production consumer until Phase 2b).

   Two references are kept intentionally as archaeology:
   - `node.rs:791` (new history sentence in the
     `forward_pending_op_result_if_completed` docblock: "the caller
     side used to live on `OpManager::notify_op_execution`, now
     deleted. Phase 2a moved the caller side into `OpCtx`...").
   - `node.rs:2554` (same note in the test module header).
   These give future archaeologists (including git-blame-blind
   readers of the code) one anchor for where the symbol used to
   live and what replaced it. The rest of the references were
   neither anchors nor history — just stale.

2. Bump `send_and_await_second_call_hangs_as_documented` timeout
   from 200 ms to 500 ms.

   The subagent review flagged the 200 ms constant as a potential
   CI-flakiness source: the Simulation job runs four fdev sims in
   parallel on the same runner, and a heavily contended scheduler
   could plausibly delay a tokio task by more than 200 ms before
   the timeout fires. 500 ms is still ~5× the first call's
   resolution time under uncontended conditions and ~2.5× what the
   other three tests allow themselves. Comment updated to record
   the reasoning so a future maintainer doesn't shrink it back.

3. Clarifying doc additions (no code changes):

   - `send_and_await`'s release-build `debug_assert_eq!` on the
     reply tx now carries an inline comment noting that in release
     a mismatched reply tx silently returns — the assert is
     defense-in-depth against bugs elsewhere in the pipeline, not a
     runtime correctness guard inside `send_and_await` itself. The
     release-build behavior ("a mismatched reply is returned for a
     different transaction") is a correctness bug in whatever code
     mislabeled the message in `p2p_protoc::pending_op_results`.

   - `send_and_await_returns_reply_on_completion` test gets a new
     doc header explaining that "happy path" means "the round-trip
     mechanics work", NOT "the reply represents success". The test
     uses `NetMessageV1::Aborted` precisely because that variant is
     orthogonal to success — only carries a `Transaction` — so the
     assertion is purely on the tx-routing mechanics. This addresses
     the subagent review's observation that `send_and_await`'s doc
     explicitly decouples `Ok(reply)` from "protocol success" but
     no test made that decoupling visible.

The subagent's declined nit — "add a test for the `debug_assert_eq!`
tx-mismatch path" — remains declined with rationale in the prior
commit. The reviewer agreed with the decline and characterized it
as "testing the language feature". The inline comment added in
edit 3 captures the release-build consequence in the code itself
so future readers have the framing without needing to dig through
commit history.

Verification:
- `cargo fmt --check` — clean
- `cargo clippy -p freenet --all-targets --features bench -- -D warnings`
  (pre-commit hook invocation) — clean
- `cargo test -p freenet --lib op_ctx` — 4/4 passed (0.51s total,
  0.30s of which is the 500 ms timeout elapsing in the hangs test)
- `cargo test -p freenet --lib --features bench` — 2126 passed,
  0 failed, 16 ignored
@iduartgomez iduartgomez enabled auto-merge April 9, 2026 10:57
@iduartgomez iduartgomez added this pull request to the merge queue Apr 9, 2026
Merged via the queue into main with commit 81396e6 Apr 9, 2026
12 checks passed
@iduartgomez iduartgomez deleted the claude/1454-phase2a-op-ctx branch April 9, 2026 11:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant