Skip to content

fix: signal active containers to close when a scheduled task is enqueued#320

Open
aoberoi wants to merge 2 commits intoqwibitai:mainfrom
aoberoi:fix-293
Open

fix: signal active containers to close when a scheduled task is enqueued#320
aoberoi wants to merge 2 commits intoqwibitai:mainfrom
aoberoi:fix-293

Conversation

@aoberoi
Copy link
Copy Markdown

@aoberoi aoberoi commented Feb 20, 2026

Type of Change

  • Skill - adds a new skill in .claude/skills/
  • Fix - bug fix or security fix to source code
  • Simplification - reduces or simplifies source code

Description

Fixes #293. This is equivalent to Option A in that issue. I believe this is the cleanest way to fix the issue. Option B would screw up the workflow for scheduled tasks with context_mode: group, since a parallel container trying to resume a currently active session is going to lead to undefined outcomes. The single container per group invariant is worth preserving. Option C doesn't really provide any benefit if both idleWaiting and processing containers can correctly handle resuming sessions.

Also fixes a race condition (#320 (comment))

When a scheduled task was enqueued for a group with an active container, it would sit in pendingTasks until the container exited naturally — up to 30 minutes (idle timeout), or indefinitely if users kept resetting it.

Now writes the _close sentinel immediately when queuing a task against an active container. The container winds down after its current query, drainGroup fires, and the task runs in a new container within seconds.

I did some investigation to make sure the two kinds of scheduled tasks (context_mode: isolated and group) are considered. Here are all the scenarios:

Scenario 1: Container is idle, isolated task fires

  • Container already emitted the between-query session update marker before
    going idle → sessions[group.folder] = "session-A" is current
  • _close written, container exits from waitForIpcMessage immediately
  • drainGroup runs isolated task with sessionId = undefined → creates fresh
    "session-B" → discarded by scheduler
  • sessions[group.folder] is still "session-A"
  • User's next message: resumes "session-A" correctly ✓

Scenario 2: Container is busy (mid-query), isolated task fires

  • _close detected by pollIpcDuringQuery, stream.end() called
  • Current query runs to completion, emits OUTPUT_MARKER with newSessionId →
    wrappedOnOutput updates sessions[group.folder] = "session-A" (or a new turn
    ID)
  • Container exits (note: closedDuringQuery = true means it skips the
    between-query session update marker, but the result OUTPUT_MARKER already
    updated sessions)
  • drainGroup runs isolated task with sessionId = undefined → fresh session,
    discarded
  • User's next message: resumes "session-A" correctly ✓

Scenario 3: Container is idle, group task fires

  • Same session preservation as Scenario 1
  • Group task runs with sessionId = "session-A" — resumes and extends that
    session
  • Scheduler discards newSessionId returned from the task — but since resuming
    doesn't create a new session ID, this is fine
  • User's next message: resumes "session-A", which now includes the group
    task's turns ✓

Scenario 4: Container is busy, group task fires

  • Same as Scenario 2 — current query completes, sessions[group.folder]
    updated
  • Group task runs resuming that session, extends it
  • User's next message: resumes correctly ✓

For Skills

  • I have not made any changes to source code
  • My skill contains instructions for Claude to follow (not pre-built code)
  • I tested this skill on a fresh clone

@aoberoi aoberoi requested a review from gavrielc as a code owner February 20, 2026 08:29
@longregen
Copy link
Copy Markdown

Please clean the commit history before creating a PR: 8c1cad8 and dc4a443 don't seem relevant for the purpose of the PR.

@aoberoi
Copy link
Copy Markdown
Author

aoberoi commented Feb 20, 2026

@longregen whoops! good catch

When a scheduled task was enqueued for a group with an active container,
it would sit in pendingTasks until the container exited naturally — up to
30 minutes (idle timeout), or indefinitely if users kept resetting it.

Now writes the _close sentinel immediately when queuing a task against an
active container. The container winds down after its current query, drainGroup
fires, and the task runs in a new container within seconds.

Fixes qwibitai#293

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aoberoi
Copy link
Copy Markdown
Author

aoberoi commented Feb 20, 2026

cleaned the commit history

@aoberoi
Copy link
Copy Markdown
Author

aoberoi commented Feb 20, 2026

Found another related issue in my testing. Scheduled tasks can sometimes fire more than once because of a race condition. The race is between the SCHEDULER_POLL_INTERVAL (60s) and the execution time of a task (which can take longer than 60s).

The solution I'm working on is to add a new "running" state to a task, and make sure the scheduler doesn't pick up running tasks to run on the next scheduler poll interval. There was an alternative approach, to just mark the status = NULL, but I prefer this solution because its explicit (which can help for inspecting issues in the system) and we can actually implement scheduled task recovery if there's an unexpected crash.

The fix includes all the state transitions to make this work and some tests.

Introduces a status='running' transition set atomically at enqueue time,
preventing the scheduler's 60s poll from re-enqueueing a task that is
still executing. Also fixes updateTaskAfterRun to transition running→active
(not running→running) for recurring tasks, and adds startup crash recovery
via resetRunningTasks().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Andy-NanoClaw-AI Andy-NanoClaw-AI added PR: Fix Bug fix Status: Blocked Blocked by merge conflicts or dependencies labels Mar 5, 2026
@gavrielc gavrielc requested a review from gabi-simons as a code owner March 6, 2026 10:17
@Andy-NanoClaw-AI Andy-NanoClaw-AI added Status: Needs Review Ready for maintainer review Status: Blocked Blocked by merge conflicts or dependencies and removed Status: Blocked Blocked by merge conflicts or dependencies Status: Needs Review Ready for maintainer review labels Mar 9, 2026
aviadr1 added a commit to Garsson-io/nanoclaw that referenced this pull request Mar 21, 2026
…qwibitai#320)

Migrate the three highest-complexity hooks from bash to TypeScript:
- pr-review-loop.sh (452 lines) → src/hooks/pr-review-loop.ts
- pr-kaizen-clear.sh (290 lines) → src/hooks/pr-kaizen-clear.ts
- kaizen-reflect.sh (197 lines) → src/hooks/kaizen-reflect.ts

Shared infrastructure:
- src/hooks/hook-utils.ts — stdin JSON parsing, git helpers
- src/hooks/parse-command.ts — command parsing (port of lib/parse-command.sh)
- src/hooks/state-utils.ts — atomic state writes, typed objects, no stat portability
- src/hooks/telegram-ipc.ts — Telegram notification via IPC

Improvements over bash:
- Atomic state writes (temp file + rename) prevent race conditions
- Native JSON parsing (no jq pipelines or sed extraction)
- No pipe-splitting corruption (IFS='|' read bug)
- No stat portability issues (fs.statSync works everywhere)
- Proper typed validation with clear error messages
- 115 vitest tests covering all state machine paths

Also:
- Add priority:critical and priority:high labels to issue taxonomy
- Update /pick-work and /make-a-dent to prefer high-priority issues
- Old bash hooks deactivated with migration comments
- Filed qwibitai#331 (worktree-du migration), qwibitai#332 (CI smoke tests), qwibitai#333 (shared lib retirement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aviadr1 added a commit to Garsson-io/nanoclaw that referenced this pull request Mar 21, 2026
First hook migrated to TypeScript, establishing the pattern for all
future L3-L4 hook migrations per docs/hook-language-boundaries.md.

What changed:
- src/hooks/kaizen-reflect.ts: Full TS port of the PostToolUse hook
- src/hooks/hook-io.ts: Shared hook I/O (stdin JSON, stdout advisory)
- src/hooks/parse-command.ts: TS port of parse-command.sh library
- src/hooks/state-utils.ts: TS port of state-utils.sh library
- Thin bash wrapper (kaizen-reflect-ts.sh) delegates to npx tsx
- settings.json updated to use the TS wrapper
- 53 vitest tests covering all three modules
- Old kaizen-reflect.sh marked as deactivated (kept for reference)

Run tag: batch-260321-1108-3ef8/run-7

Closes Garsson-io/kaizen#320

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aviadr1 added a commit to Garsson-io/nanoclaw that referenced this pull request Mar 21, 2026
…261)

First hook migrated to TypeScript, establishing the pattern for all
future L3-L4 hook migrations per docs/hook-language-boundaries.md.

What changed:
- src/hooks/kaizen-reflect.ts: Full TS port of the PostToolUse hook
- src/hooks/hook-io.ts: Shared hook I/O (stdin JSON, stdout advisory)
- src/hooks/parse-command.ts: TS port of parse-command.sh library
- src/hooks/state-utils.ts: TS port of state-utils.sh library
- Thin bash wrapper (kaizen-reflect-ts.sh) delegates to npx tsx
- settings.json updated to use the TS wrapper
- 53 vitest tests covering all three modules
- Old kaizen-reflect.sh marked as deactivated (kept for reference)

Run tag: batch-260321-1108-3ef8/run-7

Closes Garsson-io/kaizen#320

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aviadr1 added a commit to Garsson-io/nanoclaw that referenced this pull request Mar 21, 2026
…ke tests & policy (qwibitai#320, qwibitai#332)

Builds on the kaizen-reflect migration (already in main) to complete Phase 3:
- pr-review-loop.ts: multi-round PR self-review state machine (452→240 lines)
- pr-kaizen-clear.ts: kaizen gate validation + clearing (290→370 lines, with kaizen #198 waiver elimination)
- wrapper-smoke.test.ts: 9 end-to-end smoke tests for bash→tsx→hook chain
- resolve-project-root.sh: DRY shared lib for TS wrapper path resolution

Both hooks use main's patterns: extracted processHookInput() for testability,
async readHookInput from hook-io.ts, import guard, explicit stateDir params.

Policy changes:
- Policy #18: "Smoke tests ship WITH the feature — never after"
- Priority labels (priority:critical, priority:high) in issue taxonomy
- /pick-work and /make-a-dent updated to prefer priority-labeled issues

129 tests across 6 files, all passing.

Closes Garsson-io/kaizen#320
Closes Garsson-io/kaizen#332

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aviadr1 added a commit to Garsson-io/nanoclaw that referenced this pull request Mar 21, 2026
…wibitai#320, qwibitai#332) (#268)

* feat: migrate L3-L4 bash hooks to TypeScript — Phase 3 of #223 (kaizen qwibitai#320)

Migrate the three highest-complexity hooks from bash to TypeScript:
- pr-review-loop.sh (452 lines) → src/hooks/pr-review-loop.ts
- pr-kaizen-clear.sh (290 lines) → src/hooks/pr-kaizen-clear.ts
- kaizen-reflect.sh (197 lines) → src/hooks/kaizen-reflect.ts

Shared infrastructure:
- src/hooks/hook-utils.ts — stdin JSON parsing, git helpers
- src/hooks/parse-command.ts — command parsing (port of lib/parse-command.sh)
- src/hooks/state-utils.ts — atomic state writes, typed objects, no stat portability
- src/hooks/telegram-ipc.ts — Telegram notification via IPC

Improvements over bash:
- Atomic state writes (temp file + rename) prevent race conditions
- Native JSON parsing (no jq pipelines or sed extraction)
- No pipe-splitting corruption (IFS='|' read bug)
- No stat portability issues (fs.statSync works everywhere)
- Proper typed validation with clear error messages
- 115 vitest tests covering all state machine paths

Also:
- Add priority:critical and priority:high labels to issue taxonomy
- Update /pick-work and /make-a-dent to prefer high-priority issues
- Old bash hooks deactivated with migration comments
- Filed qwibitai#331 (worktree-du migration), qwibitai#332 (CI smoke tests), qwibitai#333 (shared lib retirement)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: format hook files with Prettier

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* feat: smoke tests for TS hook wrappers + "tests ship with feature" policy (qwibitai#332)

- Add wrapper-smoke.test.ts: 9 tests verifying the full bash→tsx→hook chain
  for all 3 migrated hooks (pr-review-loop, pr-kaizen-clear, kaizen-reflect)
- Fix wrapper path resolution: use `git rev-parse --show-toplevel` instead of
  `git worktree list` — the old approach resolved to main checkout where the
  TS hooks don't exist yet in other worktrees
- Use randomized PR numbers + isolated STATE_DIR to prevent smoke test state
  from leaking into production state dir (incident: PR 99999 gate blocked session)
- Add policy #18: "Smoke tests ship WITH the feature — never after"
  to .claude/kaizen/policies.md, CLAUDE.md, and /review-pr skill

Closes Garsson-io/kaizen#332

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

PR: Fix Bug fix Status: Blocked Blocked by merge conflicts or dependencies

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Scheduled tasks deadlock behind idle containers

3 participants