Refactor core interfaces

Tracking issue for some internal refactoring we need to do. The goal is to:
- improve debugging capabilities over the kernel
- simplify and clean up some of the core code, cleaner abstractions
- improve certain error handling, ws api client/server comms, etc.
- add more testing around the boundaries of some components (like we did with handhsake handler, transport, etc.

Changes:
- move towards transactions executed end-to-end asynchronously instead of the current model of execution
 - wait for subtransactions results, communicate to ws client subscribers as network messages are inbound etc. and simplify overall the execution flow and reasoning (we essentially don't need to keep the state of the ops internally anymore in the OpManager)

---

# Design: end-to-end async transactions

> This section replaces the previous `[TODO: add more detail]` placeholder. Line numbers are verified against `main` at the time of writing. Follow-up PRs implementing each phase should re-verify before making changes.
>
> **Revision history:**
> - **Rev 1 (2026-04-08):** Initial draft. Over-promised the reach of existing scaffolding and mis-scoped Phase 1. See *Errata* at the end of this document.
> - **Rev 2 (2026-04-08):** Corrected Phase 1 after reading `request_subscribe` and the pure-network message dispatch path. Rev 1's Phase 1 has been split into a real prerequisite (Phase 1) and deferred work (Phase 2.5). Line numbers corrected.
> - **Rev 3 (2026-04-09):** After Phase 1 shipped (PR #3802), investigated SUBSCRIBE's full surface area before starting Phase 2 implementation. Found that Rev 2's Phase 2 bundles too much work in one slice. Split Phase 2 into **2a** (introduce `OpCtx` as dormant scaffolding — no op migrated), **2b** (migrate SUBSCRIBE client-initiated path only), and **2c** (migrate CONNECT — deferred, schedule TBD). Phases 2.5 and 3–6 are unchanged. See the *Errata (Rev 2 → Rev 3)* section at the end for the specific findings that motivated the split.
> - **Rev 4 (2026-04-11):** After Phase 2b shipped (PR #3806), drafted the Phase 2.5 implementation and verified it against HEAD. Found that the intended structural win is narrow (only the `track_parent=true` branch of `start_subscription_request_internal` moves; `SubOperationTracker`, `root_ops_awaiting_sub_ops`, and the async non-blocking subscribe path all stay on the legacy re-entry loop) and that the scaffolding it would introduce (a `DriverRole::SubOp` variant, a `register_local_subscription` helper to bypass `NodeEvent::LocalSubscribeComplete`) becomes dead weight the moment Phase 3's task-per-tx PUT/GET parent can `await` the child future inline. **Phase 2.5 is folded into Phase 3.** The blocking-subscription sub-op child migrates together with PUT/GET. The async non-blocking subscribe path (`track_parent=false`) also stays on the legacy loop until Phase 3. See the *Errata (Rev 3 → Rev 4)* section at the end.
> - **Rev 5 (2026-04-11):** Pre-implementation investigation of GET, PUT, and the subscribe sub-op hand-off (posted as issue comment) confirmed that PUT and GET are each large enough to merit their own slice, and that `SubOperationTracker` can only be deleted after BOTH callers (`operations.rs:755` blocking path, invoked from PUT's `start_subscription_after_put` AND GET's `auto_subscribe_on_get_response`) are gone. **Phase 3 is split into 3a / 3b / 3c**: 3a migrates PUT + PUT's blocking-subscribe child (tracker stays alive for GET); 3b migrates GET + GET's blocking-subscribe child + streaming assembly inside `send_and_await`; 3c is a mechanical deletion PR that retires `SubOperationTracker`, `root_ops_awaiting_sub_ops`, `parent_of`, `failed_parents`, `expected_sub_operations`, and their 10 unit tests once both callers are gone. PUT-first is chosen because PUT's state machine is simpler (no streaming on the originator completion path), and its duplicate-send collapse (`put.rs:575–594` + `:841–917`) is the single largest structural change in Phase 3 — validating it in isolation de-risks GET. The `attempt_tx` terminology referenced in earlier Rev 3/4 notes does not exist on HEAD and has been dropped. See the *Errata (Rev 4 → Rev 5)* section at the end.

## 1. Overview

### Problem

Today the core runs each transaction through a **centralized re-entry loop**:

1. A client request calls `op_manager.push(tx, OpEnum)` and returns.
2. A `NetMessage` for `tx` arrives at `p2p_protoc`.
3. `handle_op_result` pops the `OpEnum` out of an `OpManager` `DashMap`, calls `Operation::process_message`, and gets back an `OperationResult`.
4. Depending on the variant (`Completed` / `ContinueOp` / `SendAndContinue` / `SendAndComplete`), it either marks the op `completed`, re-inserts it into the same `DashMap` (push-before-send), or forwards a message and re-inserts.
5. Sub-transactions are tracked side-band in a separate `SubOperationTracker` with its own `DashMap`s (`sub_operations`, `root_ops_awaiting_sub_ops`, `parent_of`, `expected_sub_operations`, `failed_parents`). A parent op only proceeds when `all_sub_operations_completed` returns true.
6. Client notifications are fan-out post-hoc via a `ResultRouter` → `SessionActor`.

Pain points:

- Transaction state lives in a global `OpManager` rather than in the task that logically owns it. Stack traces are useless for reasoning about "what is this transaction doing right now?"
- Sub-operation orchestration is a state machine driven by DashMap writes and GC, not Rust's async runtime. There is no `await` between "parent wants a value from a child" and "child produces it."
- Bugs in this area tend to be of the form "the parent op's DashMap entry was cleaned up before the child's callback fired", or "the parent's `expected_sub_operations` count disagreed with reality." These are not expressible as compile errors.
- The client-notification path (`ResultRouter` → `SessionActor`) is decoupled from the operation that produced the result, so streaming updates (e.g. incremental `GET` responses) are harder to reason about than they should be.

### Goals (from the tracking issue)

- Improve debugging of the kernel / transaction flow
- Cleaner operation abstractions (operations own their own state)
- Better error handling and WS client/server communication
- Expand tests at component boundaries
- **End-to-end async transactions:** each transaction runs as a task that owns its state in locals, awaits sub-transaction futures, and streams updates to WS subscribers directly as network messages arrive

### Non-goals

- No wire-format changes. `NetMessage` variants stay the same.
- No contract/WASM API changes.
- No changes to the transport layer or handshake handler.
- No changes to the `SessionActor` semantics for 1→N client fanout; only the producer side changes.

## 2. Current architecture (baseline)

All line numbers verified against `main` at the time of writing.

| Concern | File | Lines |
|---|---|---|
| `OpManager.ops` — `DashMap`s for `connect`/`put`/`get`/`subscribe`/`update`, plus `completed` and `under_progress` `DashSet`s | `crates/core/src/node/op_state_manager.rs` | 204–213 |
| `OpManager` struct | `crates/core/src/node/op_state_manager.rs` | 217–281 |
| `SubOperationTracker` | `crates/core/src/node/op_state_manager.rs` | 73–202 |
| `all_sub_operations_completed` | `crates/core/src/node/op_state_manager.rs` | 161–176 |
| `Operation` trait | `crates/core/src/operations.rs` | 32–56 |
| `OperationResult` enum (`Completed`, `ContinueOp`, `SendAndContinue`, `SendAndComplete`) | `crates/core/src/operations.rs` | 69–99 |
| `handle_op_result` re-entry loop (push-before-send at line 238) | `crates/core/src/operations.rs` | 131–287 |
| `PrioritySelectStream` with `op_execution` receiver | `crates/core/src/node/network_bridge/priority_select.rs` | 27, 55–93 |
| `p2p_protoc::handle_op_execution` inserts into `pending_op_results` | `crates/core/src/node/network_bridge/p2p_protoc.rs` | 3861–3879 |
| `handle_pure_network_message_v1` entry | `crates/core/src/node.rs` | 775 |
| `handle_pure_network_message_v1` PUT branch (callback send at 871) | `crates/core/src/node.rs` | 849–906 |
| `handle_pure_network_message_v1` GET branch (callback send at 920) | `crates/core/src/node.rs` | 907–990 |
| `handle_pure_network_message_v1` SUBSCRIBE path (**no** `pending_op_result` callback) | `crates/core/src/node.rs` | 992–1027 |
| `start_subscription_request_internal` (sub-op launcher) | `crates/core/src/operations.rs` | 745–811 |
| `request_subscribe` (routing + op construction + push) | `crates/core/src/operations/subscribe.rs` | 303–487 |
| `ResultRouter` (network → `SessionActor` routing) | `crates/core/src/client_events/result_router.rs` | — |
| `SessionActor` (1→N client fanout, TTL pending cache) | `crates/core/src/client_events/session_actor.rs` | — |

### GET request flow (current)

```mermaid
sequenceDiagram
 participant C as WS Client
 participant SA as SessionActor
 participant OM as OpManager (DashMaps)
 participant EL as Event loop / priority_select
 participant PP as p2p_protoc
 participant N as Network
 participant RR as ResultRouter

 C->>SA: GET(tx)
 SA->>OM: push(tx, GetOp::init)
 OM-->>SA: ok
 SA-->>C: accepted
 Note over OM: GetOp lives in DashMap<Tx, GetOp>

 N-->>PP: NetMessage(tx)
 PP->>EL: dispatch
 EL->>OM: pop(tx) -> GetOp
 OM-->>EL: GetOp
 EL->>EL: process_message -> OperationResult
 alt SendAndContinue
 EL->>OM: push(tx, GetOp') (push-before-send)
 EL->>N: NetMessage
 else Completed
 EL->>OM: completed(tx)
 EL->>RR: HostResult(tx, value)
 RR->>SA: DeliverHostResponse
 SA->>C: GET result
 end
```

## 3. Target architecture

### Key idea

Each transaction runs as a **dedicated async task** whose locals hold the op state. The task drives itself by `await`ing on a small per-tx inbox and on child-transaction futures. The `OpManager` shrinks to a bookkeeping table mapping `Transaction → JoinHandle` (plus the existing `completed` / `under_progress` de-dup sets). The op-specific `DashMap`s go away.

A new primitive — `OpCtx::send_and_await(msg)` — lets an operation fire a `NetMessage` and `await` the reply that arrives keyed by the same `Transaction`. Sub-transactions become child futures: a parent op calls `ctx.spawn_sub(child_op).await` and gets back the child's result directly, with no side-band tracker.

### Phase 2a minimum surface (what ships first)

Phase 2a introduces `OpCtx` as scaffolding with exactly one method — `send_and_await` — built directly on top of Phase 1's `forward_pending_op_result_if_completed` round-trip. No ops are migrated in 2a; its only caller is a unit test shim. The wider API (sub-op spawning, WS fanout, per-tx inbox, a new `Operation` trait) is deferred until a concrete consumer needs it.

```rust
/// Per-transaction execution context for the task-per-tx model.
/// Phase 2a ships only `send_and_await`; the rest of the target API
/// below is deferred.
pub struct OpCtx {
 tx: Transaction,
 op_execution_sender: mpsc::Sender<(mpsc::Sender<NetMessage>, NetMessage)>,
}

impl OpCtx {
 /// Fire `msg` and await the terminal reply keyed by this ctx's tx.
 ///
 /// Single-use per `Transaction`: `forward_pending_op_result_if_completed`
 /// fires exactly once because the `completed` / `under_progress` dedup
 /// sets suppress subsequent dispatches. Callers that need to retry a
 /// multi-attempt protocol must use a fresh `Transaction` per attempt.
 ///
 /// Must be called from an op task, not from the event loop or a
 /// `priority_select` arm — the inner `.send().await` on
 /// `op_execution_sender` is allowed only because the caller is
 /// short-lived spawned task (see `.claude/rules/channel-safety.md`).
 pub async fn send_and_await(&mut self, msg: NetMessage) -> Result<NetMessage, OpError>;

 pub fn tx(&self) -> Transaction;
}
```

`OpManager` grows a thin `op_ctx(&self, tx: Transaction) -> OpCtx` factory that clones the `op_execution_sender` out of its inner `EventLoopNotificationsSender`. The old `OpManager::notify_op_execution` helper is **deleted** — its body becomes `OpCtx::send_and_await`. This keeps the primitive in exactly one place.

### Target architecture (eventual vision, for reference)

Phases 2b, 2.5, and 3–6 will grow `OpCtx` toward the full shape below. These fields and methods are **not** in Phase 2a; they are added as each phase's concrete consumer arrives, so the scaffolding stays minimal until it has a real caller.

```rust
// Eventual shape — NOT Phase 2a. Deferred until later phases need them.
#[async_trait]
pub trait Operation: Sized + Send + 'static {
 type Output: Send;
 async fn run(self, ctx: OpCtx) -> Result<Self::Output, OpError>;
}

pub struct OpCtx {
 tx: Transaction,
 op_execution_sender: /* … */, // Phase 2a
 inbox: mpsc::Receiver<NetMessage>, // deferred: non-terminal msgs
 outbound: mpsc::Sender<NetMessage>, // deferred: direct bridge send
 updates: broadcast::Sender<HostResponse>, // deferred: WS fanout
 op_registry: Arc<OpRegistry>, // Phase 5
}

impl OpCtx {
 pub async fn send_and_await(&mut self, msg: NetMessage) -> Result<NetMessage, OpError>; // Phase 2a
 pub async fn spawn_sub<O: Operation>(&self, child: O) -> Result<O::Output, OpError>; // Phase 2.5
 pub fn notify(&self, update: HostResponse); // Phase 3
}
```

### Contrast diagram

```mermaid
flowchart LR
 subgraph Current
 A1[op_manager.push] --> B1[DashMap&lt;Tx,OpEnum&gt;]
 B1 --> C1[handle_op_result re-entry loop]
 C1 -->|Completed| D1[ResultRouter]
 C1 -->|ContinueOp| B1
 C1 -->|SendAndContinue| B1
 E1[SubOperationTracker DashMaps] -.polls.-> C1
 end

 subgraph Target
 A2[spawn tx task] --> B2[task locals own OpState]
 B2 --> C2[loop: ctx.send_and_await ctx.spawn_sub]
 C2 -->|return| D2[Ok/Err -> notify]
 C2 -->|child future| C2
 end
```

### What goes away

- `OpManager.ops.{connect,put,get,subscribe,update}` — the five `DashMap<Transaction, *Op>` fields. Replaced by a single `DashMap<Transaction, JoinHandle<Result<(), OpError>>>` (or equivalent) used only for cancellation and cleanup.
- `SubOperationTracker` entirely. Parent/child relationships become parent-awaits-child futures.
- `handle_op_result` and `OperationResult`. The re-entry loop's state transitions become ordinary Rust control flow inside each op's `run`.
- The `Operation::process_message` seam. Ops express themselves as `async fn run`.

### What stays

- `Transaction` IDs and the wire format.
- `completed` / `under_progress` de-dup sets.
- `pending_op_results: HashMap<Transaction, Sender<NetMessage>>` inside `p2p_protoc` — this is already the "send message, await reply keyed by tx" primitive we need. The tx task registers its inbox sender here via `notify_op_execution`, then `await`s.
- `SessionActor` and its TTL pending cache. The producer side changes (tx task pushes to `broadcast::Sender`), but the 1→N fanout and the session lifecycle stay.
- `ResultRouter` in a reduced role: it still exists as the bridge between "network layer produced a final result" and "SessionActor delivers it", but in the target design the tx task itself publishes via `OpCtx::notify` and `ResultRouter` handles only legacy / non-per-tx paths during the transition.

## 4. Existing scaffolding to reuse

Some of the infrastructure already exists in the tree but is currently `#[allow(dead_code)]`, marked `FIXME: enable async sub-transactions`. **Important:** the reach of this scaffolding is narrower than the original draft of this design suggested. See *Scaffolding reach* below.

| Primitive | File | Lines | Status |
|---|---|---|---|
| `OpManager::notify_op_execution` — creates `(Sender<NetMessage>, Receiver<NetMessage>)`, sends the pair to `op_execution_sender`, awaits the callback reply | `crates/core/src/node/op_state_manager.rs` | 701–717 | **dead**, `FIXME: enable async sub-transactions` |
| `EventLoopNotificationsSender::op_execution_sender: Sender<(Sender<NetMessage>, NetMessage)>` + getter | `crates/core/src/node/network_bridge.rs` | 203, 218–221 | **dead**, `FIXME: enable async sub-transactions` |
| `PrioritySelectStream.op_execution` receiver wiring + `SelectResult::OpExecution` variant | `crates/core/src/node/network_bridge/priority_select.rs` | 27, 63–64 | **live** (drains into `handle_op_execution`) |
| `p2p_protoc::handle_op_execution` — inserts callback into `state.pending_op_results` **and** forwards the message into the op pipeline as `ConnEvent::InboundMessage` | `crates/core/src/node/network_bridge/p2p_protoc.rs` | 3861–3879 | **live** |
| PUT reply forwarding via `pending_op_result` callback on completion | `crates/core/src/node.rs` | 868–878 (send at 871) | **live** (PUT only) |
| GET reply forwarding via `pending_op_result` callback on completion | `crates/core/src/node.rs` | 917–927 (send at 920) | **live** (GET only) |
| SUBSCRIBE, CONNECT, UPDATE callback forwarding on completion | `crates/core/src/node.rs` | — | **missing** |

### Scaffolding reach

The end-to-end plumbing for "send a message tagged with a `Transaction` and await a single reply keyed by the same `Transaction`" is currently live **only for PUT and GET**. Specifically:

- `handle_op_execution` (`p2p_protoc.rs:3861-3879`) inserts a callback sender into `state.pending_op_results` keyed by tx, then forwards the wrapped message into the main dispatch path as a `ConnEvent::InboundMessage`. This is wired end to end.
- Inside the op pipeline, `handle_pure_network_message_v1` (`node.rs:775` onwards) calls `handle_op_request::<OpT, _>(...)` and then — **only for PUT and GET** — pulls the callback sender out of `pending_op_result` and calls `op_execution_callback.send(completed_msg).await` (node.rs:871, 920).
- The SUBSCRIBE, CONNECT, and UPDATE branches of `handle_pure_network_message_v1` do **not** have this callback block. If a caller today were to invoke `notify_op_execution(subscribe_msg).await`, the sender would be inserted into `pending_op_results` and the forwarded message would drive the SUBSCRIBE op, but `notify_op_execution` would then hang forever on `response_receiver.recv()` because no code path fires the reply.

This means the scaffolding is **not** "almost all wired up" for all ops. It is wired for PUT and GET because those ops have a wire-level terminal message that maps cleanly onto a single-reply model. For the other ops, the callback hook is missing and must be added before any of them can participate in the `notify_op_execution` round-trip.

### The SUBSCRIBE complication

Even once the SUBSCRIBE callback hook exists, SUBSCRIBE initiation is not "construct a `NetMessage` and send it". `subscribe::request_subscribe` (`operations/subscribe.rs:303-487`) does all of the following before any wire message exists:

1. Detects "standalone node with contract available locally" and routes to `complete_local_subscription` without any network round-trip.
2. Runs `k_closest_potentially_hosting` against the ring, then falls back to any connected peer if that returns empty.
3. Builds the `AwaitingResponse` routing state (`next_hop`, `tried_peers`, `visited`, `alternatives`, `speculative_paths`) that future retries consult.
4. Constructs the `SubscribeMsg::Request { id, instance_id, htl, visited, is_renewal }` wire message only after the target peer is chosen.
5. Calls `op_manager.notify_op_change(msg, OpEnum::Subscribe(op))` — a push-before-send primitive distinct from `notify_op_execution`.

None of this fits the `notify_op_execution(msg).await` shape, because:

- The local-completion branch has no network round-trip at all → nothing to reply with.
- The routing state needs to exist in the op's `DashMap<Tx, SubscribeOp>` entry **before** the wire message is sent, because retry/alt-peer logic reads it. `notify_op_execution`'s current contract is "send a `NetMessage`, get a reply"; it has no op-construction hook.
- Retries and alternative-peer selection live inside the re-entry loop, not inside the caller of `notify_op_execution`.

So a "first slice" that ports the client-initiated SUBSCRIBE path onto `notify_op_execution` is genuinely Phase-2-shaped work: it drags in task-per-tx ownership of the routing state, a `prepare_initial_request` helper that knows the local-completion fallback, and either a synthetic reply for local completion or a change to the `notify_op_execution` contract to allow "no network round-trip, completed immediately".

## 5. Phased rollout (revised)

Each phase should be a separate sub-issue / PR. Phase boundaries are chosen so that after each phase the tree compiles, CI passes, and the existing integration tests stay green.

| Phase | Scope | Entry tests | Exit criteria |
|---|---|---|---|
| **0** | This design (the issue body). Also: remove the `S-needs-design` label from #1454 once Rev 2 is published. | n/a | Design published here; #1454 unblocked. |
| **1** | **Extend the `pending_op_result` callback hook in `handle_pure_network_message_v1` to fire for SUBSCRIBE, CONNECT, and UPDATE completions.** This matches the existing PUT/GET pattern at `node.rs:869-877` / `918-926`. The `#[allow(dead_code)]` attributes on `notify_op_execution` and `op_execution_sender` stay for now — this phase does **not** add a caller. Adds a test that fires a round-trip through the scaffolding for each op (via a test-only caller or `#[cfg(test)]` shim). | existing integration suite | All five op kinds (CONNECT/PUT/GET/SUBSCRIBE/UPDATE) can terminate a `notify_op_execution` round-trip without hanging. No behavioral change to any op. |
| **2a** | **Introduce `OpCtx` as a new module with no production callers.** Define the `OpCtx` struct, the `send_and_await` primitive (built on the Phase 1 `notify_op_execution` round-trip), and the `tx → JoinHandle` bookkeeping that Phase 5 will eventually replace the op-specific DashMaps with. No ops migrated. The `#[allow(dead_code)]` attributes on `notify_op_execution` and `op_execution_sender` from Phase 1 are lifted here — Phase 2a becomes the first production caller, but only from a dormant test shim, not from any real op path. Pattern mirrors Phase 1's "widen the primitive without adding a live consumer". | new unit tests on `OpCtx::send_and_await` round-trip via each of the five op kinds | `OpCtx` exists, tests exercise the round-trip end-to-end, nothing in `operations/` changes. |
| **2b** | **Migrate SUBSCRIBE's client-initiated path only to `OpCtx`.** Scope excludes renewals (live on legacy path because their jitter/spam-prevention in `ring.rs:944-1051` is load-bearing) and PUT sub-ops (live on legacy path until Phase 2.5). Factor `prepare_initial_request` out of `request_subscribe` to handle the local-completion branch explicitly. `SubscribeOp` DashMap entry disappears for client-initiated subscribes; renewal and sub-op paths still populate it. Route via a per-tx switch (not a global feature flag). | existing subscribe unit + integration tests, new task-per-tx tests for the client path | Client-initiated SUBSCRIBE runs via `OpCtx`; renewal and sub-op SUBSCRIBE unchanged. No regressions in either path. |
| **2c** | **Migrate CONNECT to `OpCtx`.** Deferred from Rev 2's Phase 2 because CONNECT is 5,025 lines (2.3× SUBSCRIBE) with a genuinely different shape: no client-facing origin (joiner is the local node's `connection_maintenance` loop, not a WS client), no sub-op semantics, hole-punch decoupled from op lifetime (op `Completed` ≠ connection established), and most of the file's bulk is in `RelayContext` / `ConnectForwardEstimator` / visited-bloom-filter machinery rather than the state machine itself. Schedule TBD — not a strict blocker for Phases 3–6, which can proceed on the op kinds they target. | CONNECT unit + integration tests | Internal CONNECT transactions run via `OpCtx`. |
| ~~**2.5**~~ | **Folded into Phase 3 as of Rev 4.** Original scope: port the PUT-completion → SUBSCRIBE sub-op hand-off onto task-per-tx. Reason for fold: the scaffolding (`DriverRole::SubOp`, a registration helper bypassing `NodeEvent::LocalSubscribeComplete`, extra delivery plumbing through `op_manager.completed(child_tx)`) becomes redundant as soon as Phase 3's task-per-tx PUT/GET parent can `await` the child future directly. See Errata (Rev 3 → Rev 4). | — | — |
| **3a** | **Migrate PUT (client-initiated originator path) + PUT's blocking-subscribe child to task-per-tx.** Collapses the duplicate-send pattern in `put.rs:575–594` (early `result_router_tx.try_send`) + `:841–917` (response-leg second send) into a single client-result delivery from the driver task. The blocking-subscribe child (when `blocking_subscribe=true`) is awaited inline by the PUT task — PUT stops calling `expect_and_register_sub_operation`. The async non-blocking child path migrates with it. **`SubOperationTracker` stays alive** (GET still calls it in 3b). PUT relays stay on the legacy re-entry loop. Speculative paths stay GC-controlled on the DashMap. Install `try_forward_task_per_tx_reply` for PUT `Response` / `ResponseStreaming` variants at the TOP of the PUT branch in `handle_pure_network_message_v1`; filter `ForwardingAck`. Synthesize a `PutMsg::Response` where PUT currently completes locally without a network reply. | `test_put_with_blocking_subscribe` (`tests/operations.rs:1971`), `test_put_contract_three_hop_returns_response` (`:2551`), `test_put_then_immediate_subscribe_succeeds_locally_regression_2326` (`:3154`), `test_put_triggers_update_for_subscribers` (`:3407`), `test_put_merge_persists_state` (`:468`), `test_put_with_subscribe_flag` (`:1504`), all `test_put_retry_*` unit tests (`put.rs:2672+`), existing PUT unit tests (`put.rs:2102–2920`) | Client-initiated PUT runs via `OpCtx`; PUT originator stops touching `OpManager.ops.put`; blocking PUT parent awaits child subscribe future directly (no `SubOperationTracker` call from PUT path); client result delivered exactly once. Relay, speculative, and GET paths unchanged. |
| **3b** | **Migrate GET (client-initiated originator path) + GET's blocking-subscribe child to task-per-tx.** Reuses the patterns validated in 3a. Adds streaming assembly handling: `send_and_await` internally blocks on `stream_handle.assemble().await` for the `ResponseStreaming` reply variant, preserving the single-reply shape of `OpCtx`. Four local-completion paths to handle (originator has contract with no peers, relay has contract, originator receives network Response, originator auto-subscribe) — only the first needs synthesis; the network Response path reuses the arriving message directly. Same bypass + filter install as 3a for GET. GET relays stay on the legacy loop. | `test_get_with_blocking_subscribe` (`tests/operations.rs:2071`), `test_get_with_subscribe_flag` (`:1244`), `test_get_notfound_no_forwarding_targets` (`:3285`), `test_delegate_contract_get` (`:4009`), `test_repeated_get_with_unchanged_state_succeeds` (`:4380`), streaming integration tests, GET unit tests (`get.rs:3824–4980`) including `test_blocking_subscribe_propagation` and `test_retry_and_fallback` | Client-initiated GET runs via `OpCtx`; GET originator stops touching `OpManager.ops.get`; blocking GET parent awaits child subscribe future directly; streaming assembly flows through `send_and_await`. |
| **3c** | **Delete `SubOperationTracker` and its machinery.** Mechanical deletion PR once 3a and 3b have removed both callers. Delete: the struct (`op_state_manager.rs:74–86`), `sub_operations` / `root_ops_awaiting_sub_ops` / `parent_of` / `expected_sub_operations` / `failed_parents` fields, `expect_and_register_sub_operation`, `mark_sub_op_completed`, `remove_child_link`, `cleanup_parent_tracking`, `get_parent`, `is_sub_operation`, `all_sub_operations_completed`, `count_pending_sub_operations`, `sub_operation_failed`, and the finalized-parent-parking branch in `operations.rs:177–208`. Delete the 10 `sub_operation_tracker_*` unit tests (`op_state_manager.rs:2462–3090`). No Phase 5 dependency — this work slots in immediately after 3b. | existing PUT/GET regression tests (no new tests — deletion is mechanical) | `SubOperationTracker` and `root_ops_awaiting_sub_ops` gone; `OpManager.ops.put` and `.get` DashMaps remain but are no longer touched by originator paths (their deletion is still Phase 5). |
| **4** | Migrate UPDATE broadcast — the trickiest variant because it fans out to many peers. Verify broadcast semantics match the pre-refactor behaviour (see existing regression test at f28e8e60). | UPDATE integration + broadcast regression tests | UPDATE live phase stops touching `OpManager.ops`. |
| **5** | Delete `OpManager.ops.{connect,put,get,subscribe,update}` DashMaps. (`SubOperationTracker` was already deleted in Phase 3c.) `OpManager` keeps `completed` + `under_progress` + a new `tx → JoinHandle` table for cancellation. Delete `Operation::process_message` / `OperationResult` / `handle_op_result`. | full suite incl. DST replay | `Ops` struct reduced to the bookkeeping fields. |
| **6** | Boundary tests: component-level tests for `OpCtx`, `send_and_await`, `spawn_sub`. Documentation sweep: update `.claude/rules/operations.md` (the push-before-send invariant is now structurally enforced and the rule wording should reflect that). Update `docs/architecture/operations/README.md`. | new boundary tests | Rules / docs aligned with new code. |

### Phase 1 — concrete first slice (revised)

This is the smallest self-contained PR. Unlike Rev 1's Phase 1, this one does **not** touch any op's routing logic. It only extends an existing pattern to three more op variants.

1. In `crates/core/src/node.rs`, add a `pending_op_result` callback block after `handle_op_request::<subscribe::SubscribeOp, _>` (around line 999), mirroring the PUT pattern at `node.rs:868-878`. The "completed message" is the response `SubscribeMsg::ReturnSub` (or equivalent terminal variant — verify in `subscribe.rs`).
2. Repeat for CONNECT (`NetMessageV1::Connect`) and UPDATE (`NetMessageV1::Update`) branches of `handle_pure_network_message_v1`.
3. **Decide how to handle SUBSCRIBE's `complete_local_subscription` branch**: either (a) fire a synthetic "locally completed" reply on the callback so `notify_op_execution` returns, or (b) declare that `notify_op_execution` may also observe completion via the op's push-to-`completed`-set transition and add a fallback wakeup. Option (a) is simpler; option (b) is more general. Recommend (a) for Phase 1.
4. Add a test-only caller of `notify_op_execution` (e.g., a `#[cfg(test)]` module) that fires a round-trip through each of the five op kinds and asserts the call returns. Keep the `#[allow(dead_code)]` on the production getters — Phase 2 will activate them for real.
5. Run:
 ```
 cargo fmt && cargo clippy -- -D warnings
 cargo test -p freenet
 cargo test -p freenet --features "simulation_tests,testing" --test simulation_integration
 ```

No production caller is added; no behavior changes for any existing code path. Phase 1 is purely "widen the callback hook so the round-trip primitive can terminate for all op kinds."

## 6. Risks and mitigations

- **Push-before-send invariant** (`.claude/rules/operations.md`, currently enforced in `operations.rs:238`): preserved **by definition** in the target design — op state never leaves the owning task, so the concept of "forgetting to re-push" goes away. The rule text should be updated in Phase 6 to describe the new invariant (state lives in locals, `send_and_await` is atomic).

- **Bounded-channel rules** (`.claude/rules/channel-safety.md`): the `notify_op_execution` / `send_and_await` primitive already uses a bounded `Sender<(Sender<NetMessage>, NetMessage)>`. `notify_op_execution` calls `.send().await` (op_state_manager.rs:710), which **violates** the rule "never use `.send().await` in event loops" — however, the caller of `notify_op_execution` is *not* an event loop (it's an op task), so this is currently acceptable. Phase 2 must re-audit once there are production callers. The callback-reply side (`node.rs:871, 920`) also uses `.send().await` on the bounded `pending_op_result` channel (capacity 1), which blocks the pure-network-message handler; this is acceptable because the caller is `pending_op_results[tx]` consumed by the awaiting op task, but must be re-audited once every op kind uses it.

- **Determinism (DST):** task spawning order has to be stable under the deterministic runtime. `tokio::spawn` is not deterministic by itself; use `deterministic_select!` and the `TimeSource` / `GlobalRng` abstractions from `crates/core/src/simulation/`. Verify via the existing `test_deterministic_replay_*` tests at every phase. If task spawn order proves to be a blocker, Phase 2 may need to run ops cooperatively on a single-threaded runtime rather than via `tokio::spawn`.

- **Backpressure:** the per-tx inbox mpsc and the per-tx broadcast for WS updates both need explicit bounds. Drop-oldest vs fail-fast policy must be chosen and documented. For WS updates, dropping is likely acceptable (clients will reconcile); for the inbox, dropping is not (it would wedge the transaction).

- **Cancellation:** the `tx → JoinHandle` table gives us clean `abort()` on timeout, which today is handled implicitly by `under_progress` expiry. The timeout enforcement logic needs to move from the GC path to each tx task's own timeout arm.

- **Interaction with GC / cleanup exemptions** (see `AGENTS.md` on time-bounded exemptions): the new `tx → JoinHandle` table must not create unbounded GC blind spots. Any exemption must have a TTL or an absolute age threshold. Phase 5 should include a test that asserts orphaned tx tasks are torn down within a bounded interval.

- **Rollout safety:** phases 2–4 leave the tree in a mixed state (some ops via task-per-tx, others via the old re-entry loop). Both paths must coexist cleanly. The `OpRegistry` introduced in Phase 2 needs to be additive, not a replacement, until Phase 5.

## 7. Out of scope / future work

- WS API surface redesign (separate design issue).
- Error-handling surface redesign across the WS client/server boundary. This design assumes `OpError` stays roughly as-is; a follow-up can reshape it once the task-per-tx model is in place.
- Transport resiliency work that depends on this refactor (mentioned in the prioritizer comment) can begin once Phase 3 is merged.

## Errata (Rev 1 → Rev 2)

The initial revision of this design document (published 2026-04-08, earlier today) contained two material errors that were caught during Phase 1 implementation planning and are corrected in Rev 2:

1. **Phase 1 was mis-scoped.** Rev 1 claimed Phase 1 could be "activate `notify_op_execution` and convert `start_subscription_request_internal` to `await notify_op_execution(...)`" and described this as a drop-in change. In fact:
 - The `pending_op_result` callback that makes `notify_op_execution` return is only wired for PUT and GET in `handle_pure_network_message_v1` (`node.rs:871, 920`). SUBSCRIBE, CONNECT, and UPDATE branches have no such hook. A SUBSCRIBE-targeted `notify_op_execution` call would hang forever on `response_receiver.recv()`.
 - `start_subscription_request_internal` does not just send a wire message. It spawns a task that calls `subscribe::request_subscribe`, which runs ring routing, builds pre-populated `AwaitingResponse` state, has a local-completion fallback path that never touches the network, and uses `notify_op_change` (not `notify_op_execution`). Porting this path onto `notify_op_execution` requires pulling in task-per-tx ownership of the routing state — i.e., it is Phase-2-shaped work, not Phase-1-shaped work.

 Rev 2 reassigns the original Phase 1 description to a new **Phase 2.5** (the PUT-completion → SUBSCRIBE sub-op hand-off, done properly after task-per-tx exists) and redefines **Phase 1** as the genuine structural prerequisite: extending the callback hook to cover all five op variants.

2. **`start_subscription_request_internal` line numbers were wrong.** Rev 1 cited `operations.rs:699-811`. Verified against `main`, the function is at `operations.rs:745-811`; `699-742` covers the public wrappers (`start_subscription_request_async`, `start_subscription_request_blocking`, `start_subscription_request`).

Rev 2 also adds a "Scaffolding reach" subsection to Section 4 making the PUT/GET-only reality of the callback hook explicit, and clarifies in Section 2 that SUBSCRIBE / CONNECT / UPDATE branches of `handle_pure_network_message_v1` lack the callback block.

## Errata (Rev 2 → Rev 3)

After Phase 1 shipped (PR #3802), a pre-implementation investigation of SUBSCRIBE's surface area surfaced three material issues with Rev 2's Phase 2 scope:

1. **Phase 2 bundled two unrelated op migrations.** SUBSCRIBE is 2,225 lines; CONNECT is 5,025 lines (2.3× larger). Rev 2's parenthetical "(simplest ops)" is accurate relative to GET/PUT/UPDATE but not relative to each other. The two ops also have fundamentally different shapes: SUBSCRIBE has a client-facing origin, blocking sub-op semantics with PUT, and inline retry loops that serialize state through `notify_op_change()`; CONNECT has no client-facing origin at all (the "joiner" is the local node's `connection_maintenance` loop, not a WS client), no sub-op semantics, decouples hole-punch from op lifetime (op `Completed` does not mean the connection is established), and most of its 5,025 lines are in peer-selection machinery (`RelayContext`, `ConnectForwardEstimator`, visited-bloom-filter) rather than in the state machine itself. Rev 3 splits these into separate phases: **2b** for SUBSCRIBE, **2c** for CONNECT. Phase 2c's schedule is TBD and does not block Phases 3–6. (An earlier draft of this Errata claimed CONNECT ops have "hour-long connection attempts"; the subsequent CONNECT investigation [posted as an issue comment] refuted that — CONNECT ops time out at `OPERATION_TTL = 60s`; it's the resulting peer *connections* that live for hours, managed by the transport layer independently of the op.)

2. **SUBSCRIBE has three distinct entry points with different failure semantics**, not one:
 - **Client-initiated** (from a WS `Subscribe` request) — errors surface to the WS client.
 - **Renewal-initiated** (`ring.rs:944-1051`) — fire-and-forget, protected by `can_request_subscription()` spam prevention and jitter scheduling.
 - **Sub-op of PUT** (`operations.rs:745-811`, called from `put.rs:1589-1608`, blocking mode) — parent PUT blocks on subscribe completion via `expect_and_register_sub_operation` / `all_sub_operations_completed`.

 Rev 2's Phase 2 implied all three would migrate together, but doing so would require the new task-per-tx SUBSCRIBE to cooperate with the legacy re-entry-loop PUT via the `SubOperationTracker` machinery that Phase 2.5 is supposed to start retiring — circular. Rev 3 scopes Phase **2b to the client-initiated path only**. Renewals stay on the legacy path (their jitter/spam-prevention timing is load-bearing and decoupling it is a separate rabbit hole); PUT sub-ops stay on the legacy path until Phase 2.5 ports them properly.

3. **SUBSCRIBE's retry logic is not localized.** Rev 2's description "routing logic moves into the spawned task's locals" handwaves a significant rewrite. Today every retry path (NotFound → next alternative, alternatives exhausted → fresh `k_closest` round, peer timeout → abort-and-replace) reconstructs the full `SubscribeOp` and re-inserts it via `notify_op_change()`. Moving to task locals means rewriting all three retry paths, not just the happy path. This work belongs in Phase **2b**, but Rev 3 introduces **Phase 2a** as a strict prerequisite: land `OpCtx` itself (struct definition, `send_and_await` primitive on top of Phase 1's round-trip, `tx → JoinHandle` bookkeeping) as dormant scaffolding with no production callers, analogous to how Phase 1 landed the callback hook without activating `notify_op_execution`. Phase 2b then activates `OpCtx` for exactly one entry point (client-initiated SUBSCRIBE) and reuses 2a's primitives rather than inventing them mid-migration.

Rev 3 also lifts the `#[allow(dead_code)]` attributes on `notify_op_execution` and `op_execution_sender` in Phase 2a (via the test shim that exercises `OpCtx::send_and_await`), satisfying the TODO left in Phase 1's doc comments.

Pre-implementation investigation reports for each op surface will be posted as comments on this issue rather than folded into the design body, so the design stays the plan-of-record and the investigations stay the research trail.

## Errata (Rev 3 → Rev 4)

After Phase 2b shipped (PR #3806, 2026-04-10), the Phase 2.5 implementation was drafted and verified against HEAD. Rev 3's Phase 2.5 description suggested "retiring `SubOperationTracker` usage for subscribe children" and "awaiting the child future in the parent task" — the latter is incompatible with the current PUT/GET parent, which is not a task but a re-entry-loop DashMap entry. Two corrections followed from that verification:

1. **`SubOperationTracker` is not a polling mechanism; it is event-driven.** PUT/GET stash their terminal state in `root_ops_awaiting_sub_ops` (`operations.rs:205`); `op_state_manager.rs:926-973` finalizes the parent when the last child calls `op_manager.completed(child_tx)`. Replacing this with a oneshot while the parent still lives in a DashMap entry is a lateral move — it introduces new delivery plumbing (a `DriverRole::SubOp` variant of Phase 2b's `drive_subscribe_inner`, a `register_local_subscription` helper so the sub-op path skips the `NodeEvent::LocalSubscribeComplete` publish that the client path needs) without actually enabling the parent to `await` anything.

2. **`start_subscription_request_internal` has legitimate async-path users.** `track_parent=false` (non-blocking PUT/GET with subscribe flag) does not touch `expect_and_register_sub_operation` at all — it just spawns a best-effort subscribe. That path also must stay on the legacy re-entry loop until its parent migrates.

The honest structural win — the parent task owning its child future directly — only exists once PUT/GET themselves are task-per-tx, which is Phase 3. Doing Phase 2.5 first means landing `DriverRole::SubOp` + the registration helper + extra `completed(child_tx)` / `sub_operation_failed` wiring as scaffolding that gets immediately removed when Phase 3 rewrites the parent side.

**Rev 4 therefore folds Phase 2.5 into Phase 3.** The blocking-subscription sub-op child (`track_parent=true`) and the async non-blocking child (`track_parent=false`) both migrate as part of Phase 3's PUT/GET task-per-tx work, which is when the parent task can hold the child future as a local and await it without any of the `SubOperationTracker` round-trip. The existing blocking-subscribe regression tests (`tests/operations.rs:1971` and `:2071`) continue to guard the legacy path until Phase 3 lands.

Facts verified during the Rev 4 investigation that Phase 3 should reuse as inputs (the exploration itself is not preserved as a separate document — Phase 3 will re-verify against HEAD per the Phase N rule):

- The call-site inventory for every subscribe entry point (public wrappers, `start_subscription_after_put`, `auto_subscribe_on_get_response`, the renewal path in `ring.rs:1046`, and the Phase 2b SUBSCRIBE bypass in `node.rs:1161`) distinguishes which branches a Phase 3 PUT/GET migration must touch.
- The `complete_local_subscription` → `NodeEvent::LocalSubscribeComplete` publish is client-path-only; any sub-op pathway that calls into the local-completion helper must skip that NodeEvent because the parent PUT/GET owns the client reply.
- The `parent_tx` / `child_tx` / per-attempt `attempt_tx` bookkeeping from Phase 2b is already safe for sub-op use: `expect_and_register_sub_operation` is called only with `child_tx`, and the per-retry `attempt_tx` never enters `parent_of`.
- The three Phase 2b bugs (bypass gap in `forward_pending_op_result_if_completed`, ForwardingAck / non-terminal message filter, local-completion reply synthesis) must be re-checked for every op path Phase 3 adds.

## Errata (Rev 4 → Rev 5)

After Rev 4 folded Phase 2.5 into Phase 3, a pre-implementation investigation of GET, PUT, and the subscribe sub-op hand-off was run against HEAD (posted as an issue comment on 2026-04-11). Three findings motivated a further split:

1. **PUT and GET are independently large.** PUT's originator path has a duplicate-send pattern (`put.rs:575–594` early client result via `result_router_tx.try_send` + `:841–917` second send on the response leg, relying on client-side `SessionActor` dedup) that is the single biggest structural change in Phase 3 and is orthogonal to anything GET does. GET adds streaming assembly inside `send_and_await` and four local-completion paths to handle. Bundling them in one PR makes debugging either change harder. Rev 5 splits them: **3a is PUT, 3b is GET**.

2. **`SubOperationTracker` can only be deleted after BOTH callers are removed.** Grep of HEAD shows exactly one non-test caller of `expect_and_register_sub_operation` — `operations.rs:755` inside `start_subscription_request_internal`, which is invoked from PUT's `start_subscription_after_put` *and* GET's `auto_subscribe_on_get_response`. 3a can stop PUT from calling the tracker, but the tracker itself must stay alive until GET also stops calling it in 3b. The tracker deletion becomes a mechanical third PR — **3c** — which slots in immediately after 3b rather than waiting until Phase 5. Phase 5's scope is now just the `OpManager.ops` DashMap deletion (the tracker is gone by then).

3. **The `attempt_tx` terminology from Rev 3/4 notes does not exist on HEAD.** Grep finds zero occurrences. The actual bookkeeping distinguishes only `parent_tx` and `child_tx`; there is no per-retry intermediate transaction kind. The Rev 4 bullet that claimed "per-attempt `attempt_tx` never enters `parent_of`" was verifying something that isn't there. The bullet is preserved above as historical context but should not influence Phase 3 design — **drop the term**.

Additional facts verified during the Rev 5 investigation that 3a/3b/3c should reuse:

- **PUT's duplicate-send is intentional and load-bearing until the task-per-tx driver replaces it.** The early send at `put.rs:575–594` is what makes PUT feel fast from a client's perspective (local storage → immediate response), and the second send at `:841–917` is what guarantees delivery when the originator is not the final hop. The task-per-tx driver must collapse these to a single delivery by routing the synthesized local-completion `PutMsg::Response` through the `pending_op_result` callback instead of `result_router_tx.try_send`.
- **PUT relay ops with `state: None` are normal, not an error state.** `finalized()` at `put.rs:91–93` correctly treats `None` as done. A 3a driver must not assume `state: Some(_)` for every PUT in flight.
- **PUT speculative paths stay GC-controlled on the legacy DashMap path.** They are spawned from the garbage cleanup task (`put.rs:51–54`, `:251–264`), not from the originator driver. 3a must not replicate speculative spawning in the task-per-tx driver.
- **GET `fetch_contract` is always `true` post-#3757.** The client's `return_contract_code` flag only controls the client-facing payload, not the wire request. 3b must preserve that distinction.
- **GET streaming assembly blocks the originator's state transition** — `stream_handle.assemble().await` at `get.rs:2892` is the boundary between `AwaitingResponse` and `Finished`. 3b's `send_and_await` should block on assembly internally for the `ResponseStreaming` variant.
- **Relay stale-cache deferral logic** (`get.rs:1406–1431`): relays with only a stale LRU entry defer to the network and keep the stale copy as fallback. 3b must preserve this — relays stay on the legacy path, but any helper shared between originator and relay must not break the interest check.
- **`complete_local_subscription`'s `NodeEvent::LocalSubscribeComplete` publish stays.** It is still needed for the renewal and intermediate-peer paths. 3a/3b's inline-await pattern bypasses the event in the blocking-subscribe case (the parent task's own completion is the notification), but the helper itself is not deleted.

---

*Design written for #1454. Line numbers verified against `main` at the time of writing; re-verify before implementing each phase.*

Concern	File	Lines
`OpManager.ops` — `DashMap`s for `connect`/`put`/`get`/`subscribe`/`update`, plus `completed` and `under_progress` `DashSet`s	`crates/core/src/node/op_state_manager.rs`	204–213
`OpManager` struct	`crates/core/src/node/op_state_manager.rs`	217–281
`SubOperationTracker`	`crates/core/src/node/op_state_manager.rs`	73–202
`all_sub_operations_completed`	`crates/core/src/node/op_state_manager.rs`	161–176
`Operation` trait	`crates/core/src/operations.rs`	32–56
`OperationResult` enum (`Completed`, `ContinueOp`, `SendAndContinue`, `SendAndComplete`)	`crates/core/src/operations.rs`	69–99
`handle_op_result` re-entry loop (push-before-send at line 238)	`crates/core/src/operations.rs`	131–287
`PrioritySelectStream` with `op_execution` receiver	`crates/core/src/node/network_bridge/priority_select.rs`	27, 55–93
`p2p_protoc::handle_op_execution` inserts into `pending_op_results`	`crates/core/src/node/network_bridge/p2p_protoc.rs`	3861–3879
`handle_pure_network_message_v1` entry	`crates/core/src/node.rs`	775
`handle_pure_network_message_v1` PUT branch (callback send at 871)	`crates/core/src/node.rs`	849–906
`handle_pure_network_message_v1` GET branch (callback send at 920)	`crates/core/src/node.rs`	907–990
`handle_pure_network_message_v1` SUBSCRIBE path (no `pending_op_result` callback)	`crates/core/src/node.rs`	992–1027
`start_subscription_request_internal` (sub-op launcher)	`crates/core/src/operations.rs`	745–811
`request_subscribe` (routing + op construction + push)	`crates/core/src/operations/subscribe.rs`	303–487
`ResultRouter` (network → `SessionActor` routing)	`crates/core/src/client_events/result_router.rs`	—
`SessionActor` (1→N client fanout, TTL pending cache)	`crates/core/src/client_events/session_actor.rs`	—

Primitive	File	Lines	Status
`OpManager::notify_op_execution` — creates `(Sender<NetMessage>, Receiver<NetMessage>)`, sends the pair to `op_execution_sender`, awaits the callback reply	`crates/core/src/node/op_state_manager.rs`	701–717	dead, `FIXME: enable async sub-transactions`
`EventLoopNotificationsSender::op_execution_sender: Sender<(Sender<NetMessage>, NetMessage)>` + getter	`crates/core/src/node/network_bridge.rs`	203, 218–221	dead, `FIXME: enable async sub-transactions`
`PrioritySelectStream.op_execution` receiver wiring + `SelectResult::OpExecution` variant	`crates/core/src/node/network_bridge/priority_select.rs`	27, 63–64	live (drains into `handle_op_execution`)
`p2p_protoc::handle_op_execution` — inserts callback into `state.pending_op_results` and forwards the message into the op pipeline as `ConnEvent::InboundMessage`	`crates/core/src/node/network_bridge/p2p_protoc.rs`	3861–3879	live
PUT reply forwarding via `pending_op_result` callback on completion	`crates/core/src/node.rs`	868–878 (send at 871)	live (PUT only)
GET reply forwarding via `pending_op_result` callback on completion	`crates/core/src/node.rs`	917–927 (send at 920)	live (GET only)
SUBSCRIBE, CONNECT, UPDATE callback forwarding on completion	`crates/core/src/node.rs`	—	missing

Phase	Scope	Entry tests	Exit criteria
0	This design (the issue body). Also: remove the `S-needs-design` label from #1454 once Rev 2 is published.	n/a	Design published here; #1454 unblocked.
1	Extend the `pending_op_result` callback hook in `handle_pure_network_message_v1` to fire for SUBSCRIBE, CONNECT, and UPDATE completions. This matches the existing PUT/GET pattern at `node.rs:869-877` / `918-926`. The `#[allow(dead_code)]` attributes on `notify_op_execution` and `op_execution_sender` stay for now — this phase does not add a caller. Adds a test that fires a round-trip through the scaffolding for each op (via a test-only caller or `#[cfg(test)]` shim).	existing integration suite	All five op kinds (CONNECT/PUT/GET/SUBSCRIBE/UPDATE) can terminate a `notify_op_execution` round-trip without hanging. No behavioral change to any op.
2a	Introduce `OpCtx` as a new module with no production callers. Define the `OpCtx` struct, the `send_and_await` primitive (built on the Phase 1 `notify_op_execution` round-trip), and the `tx → JoinHandle` bookkeeping that Phase 5 will eventually replace the op-specific DashMaps with. No ops migrated. The `#[allow(dead_code)]` attributes on `notify_op_execution` and `op_execution_sender` from Phase 1 are lifted here — Phase 2a becomes the first production caller, but only from a dormant test shim, not from any real op path. Pattern mirrors Phase 1's "widen the primitive without adding a live consumer".	new unit tests on `OpCtx::send_and_await` round-trip via each of the five op kinds	`OpCtx` exists, tests exercise the round-trip end-to-end, nothing in `operations/` changes.
2b	Migrate SUBSCRIBE's client-initiated path only to `OpCtx`. Scope excludes renewals (live on legacy path because their jitter/spam-prevention in `ring.rs:944-1051` is load-bearing) and PUT sub-ops (live on legacy path until Phase 2.5). Factor `prepare_initial_request` out of `request_subscribe` to handle the local-completion branch explicitly. `SubscribeOp` DashMap entry disappears for client-initiated subscribes; renewal and sub-op paths still populate it. Route via a per-tx switch (not a global feature flag).	existing subscribe unit + integration tests, new task-per-tx tests for the client path	Client-initiated SUBSCRIBE runs via `OpCtx`; renewal and sub-op SUBSCRIBE unchanged. No regressions in either path.
2c	Migrate CONNECT to `OpCtx`. Deferred from Rev 2's Phase 2 because CONNECT is 5,025 lines (2.3× SUBSCRIBE) with a genuinely different shape: no client-facing origin (joiner is the local node's `connection_maintenance` loop, not a WS client), no sub-op semantics, hole-punch decoupled from op lifetime (op `Completed` ≠ connection established), and most of the file's bulk is in `RelayContext` / `ConnectForwardEstimator` / visited-bloom-filter machinery rather than the state machine itself. Schedule TBD — not a strict blocker for Phases 3–6, which can proceed on the op kinds they target.	CONNECT unit + integration tests	Internal CONNECT transactions run via `OpCtx`.
~~2.5~~	Folded into Phase 3 as of Rev 4. Original scope: port the PUT-completion → SUBSCRIBE sub-op hand-off onto task-per-tx. Reason for fold: the scaffolding (`DriverRole::SubOp`, a registration helper bypassing `NodeEvent::LocalSubscribeComplete`, extra delivery plumbing through `op_manager.completed(child_tx)`) becomes redundant as soon as Phase 3's task-per-tx PUT/GET parent can `await` the child future directly. See Errata (Rev 3 → Rev 4).	—	—
3a	Migrate PUT (client-initiated originator path) + PUT's blocking-subscribe child to task-per-tx. Collapses the duplicate-send pattern in `put.rs:575–594` (early `result_router_tx.try_send`) + `:841–917` (response-leg second send) into a single client-result delivery from the driver task. The blocking-subscribe child (when `blocking_subscribe=true`) is awaited inline by the PUT task — PUT stops calling `expect_and_register_sub_operation`. The async non-blocking child path migrates with it. `SubOperationTracker` stays alive (GET still calls it in 3b). PUT relays stay on the legacy re-entry loop. Speculative paths stay GC-controlled on the DashMap. Install `try_forward_task_per_tx_reply` for PUT `Response` / `ResponseStreaming` variants at the TOP of the PUT branch in `handle_pure_network_message_v1`; filter `ForwardingAck`. Synthesize a `PutMsg::Response` where PUT currently completes locally without a network reply.	`test_put_with_blocking_subscribe` (`tests/operations.rs:1971`), `test_put_contract_three_hop_returns_response` (`:2551`), `test_put_then_immediate_subscribe_succeeds_locally_regression_2326` (`:3154`), `test_put_triggers_update_for_subscribers` (`:3407`), `test_put_merge_persists_state` (`:468`), `test_put_with_subscribe_flag` (`:1504`), all `test_put_retry_*` unit tests (`put.rs:2672+`), existing PUT unit tests (`put.rs:2102–2920`)	Client-initiated PUT runs via `OpCtx`; PUT originator stops touching `OpManager.ops.put`; blocking PUT parent awaits child subscribe future directly (no `SubOperationTracker` call from PUT path); client result delivered exactly once. Relay, speculative, and GET paths unchanged.
3b	Migrate GET (client-initiated originator path) + GET's blocking-subscribe child to task-per-tx. Reuses the patterns validated in 3a. Adds streaming assembly handling: `send_and_await` internally blocks on `stream_handle.assemble().await` for the `ResponseStreaming` reply variant, preserving the single-reply shape of `OpCtx`. Four local-completion paths to handle (originator has contract with no peers, relay has contract, originator receives network Response, originator auto-subscribe) — only the first needs synthesis; the network Response path reuses the arriving message directly. Same bypass + filter install as 3a for GET. GET relays stay on the legacy loop.	`test_get_with_blocking_subscribe` (`tests/operations.rs:2071`), `test_get_with_subscribe_flag` (`:1244`), `test_get_notfound_no_forwarding_targets` (`:3285`), `test_delegate_contract_get` (`:4009`), `test_repeated_get_with_unchanged_state_succeeds` (`:4380`), streaming integration tests, GET unit tests (`get.rs:3824–4980`) including `test_blocking_subscribe_propagation` and `test_retry_and_fallback`	Client-initiated GET runs via `OpCtx`; GET originator stops touching `OpManager.ops.get`; blocking GET parent awaits child subscribe future directly; streaming assembly flows through `send_and_await`.
3c	Delete `SubOperationTracker` and its machinery. Mechanical deletion PR once 3a and 3b have removed both callers. Delete: the struct (`op_state_manager.rs:74–86`), `sub_operations` / `root_ops_awaiting_sub_ops` / `parent_of` / `expected_sub_operations` / `failed_parents` fields, `expect_and_register_sub_operation`, `mark_sub_op_completed`, `remove_child_link`, `cleanup_parent_tracking`, `get_parent`, `is_sub_operation`, `all_sub_operations_completed`, `count_pending_sub_operations`, `sub_operation_failed`, and the finalized-parent-parking branch in `operations.rs:177–208`. Delete the 10 `sub_operation_tracker_*` unit tests (`op_state_manager.rs:2462–3090`). No Phase 5 dependency — this work slots in immediately after 3b.	existing PUT/GET regression tests (no new tests — deletion is mechanical)	`SubOperationTracker` and `root_ops_awaiting_sub_ops` gone; `OpManager.ops.put` and `.get` DashMaps remain but are no longer touched by originator paths (their deletion is still Phase 5).
4	Migrate UPDATE broadcast — the trickiest variant because it fans out to many peers. Verify broadcast semantics match the pre-refactor behaviour (see existing regression test at `f28e8e6`).	UPDATE integration + broadcast regression tests	UPDATE live phase stops touching `OpManager.ops`.
5	Delete `OpManager.ops.{connect,put,get,subscribe,update}` DashMaps. (`SubOperationTracker` was already deleted in Phase 3c.) `OpManager` keeps `completed` + `under_progress` + a new `tx → JoinHandle` table for cancellation. Delete `Operation::process_message` / `OperationResult` / `handle_op_result`.	full suite incl. DST replay	`Ops` struct reduced to the bookkeeping fields.
6	Boundary tests: component-level tests for `OpCtx`, `send_and_await`, `spawn_sub`. Documentation sweep: update `.claude/rules/operations.md` (the push-before-send invariant is now structurally enforced and the rule wording should reflect that). Update `docs/architecture/operations/README.md`.	new boundary tests	Rules / docs aligned with new code.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor core interfaces #1454

Design: end-to-end async transactions

1. Overview

Problem

Goals (from the tracking issue)

Non-goals

2. Current architecture (baseline)

GET request flow (current)

3. Target architecture

Key idea

Phase 2a minimum surface (what ships first)

Target architecture (eventual vision, for reference)

Contrast diagram

What goes away

What stays

4. Existing scaffolding to reuse

Scaffolding reach

The SUBSCRIBE complication

5. Phased rollout (revised)

Phase 1 — concrete first slice (revised)

6. Risks and mitigations

7. Out of scope / future work

Errata (Rev 1 → Rev 2)

Errata (Rev 2 → Rev 3)

Errata (Rev 3 → Rev 4)

Errata (Rev 4 → Rev 5)

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Refactor core interfaces #1454

Description

Design: end-to-end async transactions

1. Overview

Problem

Goals (from the tracking issue)

Non-goals

2. Current architecture (baseline)

GET request flow (current)

3. Target architecture

Key idea

Phase 2a minimum surface (what ships first)

Target architecture (eventual vision, for reference)

Contrast diagram

What goes away

What stays

4. Existing scaffolding to reuse

Scaffolding reach

The SUBSCRIBE complication

5. Phased rollout (revised)

Phase 1 — concrete first slice (revised)

6. Risks and mitigations

7. Out of scope / future work

Errata (Rev 1 → Rev 2)

Errata (Rev 2 → Rev 3)

Errata (Rev 3 → Rev 4)

Errata (Rev 4 → Rev 5)

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions