Skip to content

enhancement: deterministic blocked-by resolution via cached dependency graph #17871

@robstiles

Description

@robstiles

Description

The pulse system currently relies on the LLM supervisor to transition status:blockedstatus:available when blockers resolve. The LLM supervisor is gated behind _should_run_llm_supervisor() which requires either a 1-hour backlog stall (PULSE_LLM_STALL_THRESHOLD=3600) or a 24-hour daily sweep. This creates a systemic delay in dependency chains: when a worker completes a task and closes its issue, downstream tasks that were blocked by it remain status:blocked for up to 1 hour — even though the resolution check is entirely deterministic.

Observed impact

A managed private repo has a 15-task dependency chain (sequential phases where each task is blocked-by its predecessor). When the pulse dispatches workers for the first available tasks and they complete successfully (PRs merged, issues closed), the downstream tasks should become available for dispatch immediately. Instead:

  1. Workers complete 3 tasks, closing issues and merging PRs (~20 min total)
  2. _should_run_llm_supervisor() sees the backlog decreased (3 fewer issues) → records "progress" → skips the LLM supervisor
  3. The backlog count stabilizes (remaining work is all status:blocked)
  4. The stall timer starts counting from the last snapshot update
  5. ~48 minutes later, the LLM supervisor finally runs, detects "status:blocked but blockers resolved", and transitions labels
  6. The next deterministic fill floor cycle (2 min later) dispatches workers
  7. Total delay: ~50 minutes per dependency layer

For a 15-task chain with 5 dependency layers, this means ~4 hours of idle time waiting for label transitions that are fundamentally a set membership check.

Why this is a deterministic operation

is_blocked_by_unresolved() (pulse-wrapper.sh:8630) already does the exact check: parse blocked-by:tNNN / blocked-by:#NNN from the issue body, check if the referenced issues are still open. This is a pure function — no judgment, no edge cases, no ambiguity. It belongs in the deterministic pass, not behind the LLM gate.

Proposed Solution: Cached Dependency Graph

A two-phase approach that separates the expensive graph construction (infrequent) from the cheap resolution check (every cycle).

Phase 1: Graph construction (infrequent, O(B) API calls)

Parse all status:blocked issue bodies for blocked-by references. Build a forward + reverse dependency map and cache it to disk:

{
  "built_at": 1775658447,
  "repo_slug": "owner/repo",
  "forward": {
    "105": [104],
    "106": [104, 105],
    "108": [101, 107],
    "109": [108]
  },
  "reverse": {
    "101": [108],
    "104": [105, 106],
    "105": [106],
    "107": [108],
    "108": [109]
  }
}

When to rebuild:

  • On LLM supervisor runs (daily sweep or stall — the supervisor already reads issue state)
  • On a standalone 1-hour cadence via file-based timestamp gate (decouples from LLM scheduling)
  • Incrementally: only read bodies for issues added to status:blocked since last build

Data source options (in order of preference):

  1. GitHub's native sub-issues/blocked-by GraphQL API — issue-sync-helper.sh already syncs addBlockedBy relationships (line 1094). Query these directly instead of parsing body text. Zero body reads needed.
  2. Prefetch body parsing — if the prefetch is extended to include issue bodies (currently fetches number, title, url, assignees, labels, updatedAt only).
  3. Dedicated body fetch — O(B) API calls where B = blocked issues. Acceptable at <500 issues.

Cache location: ${PULSE_DIR}/dependency-graph.json (per-repo, written atomically via temp+rename).

Phase 2: Graph resolution (every 2-min cycle, zero additional API calls)

The prefetch already fetches all open issues every cycle (for build_ranked_dispatch_candidates_json). That data contains the set of all open issue numbers. Checking "is blocker X closed?" = "is X absent from the open issues set?" — a pure set membership check.

resolve_blocked_by_graph():
  1. Read cached dependency graph from disk
  2. If cache is missing or stale (>2h), skip (LLM supervisor will rebuild)
  3. Build set of open issue numbers from prefetch data (already in memory)
  4. For each entry in forward map:
     - If ALL blockers are NOT in the open set → all resolved
     - gh issue edit --remove-label "status:blocked" --add-label "status:available"
     - Post comment: "Blockers resolved (#{X}, #{Y} closed). Unblocked for dispatch."
  5. Remove resolved entries from the cached graph (avoid re-checking)

Cost at scale:

Blocked issues Graph edges API calls per cycle Computation
50 ~80 0 (label swaps only when something changes) ~80 set lookups
500 ~800 0 ~800 set lookups
1,000 ~1,500 0 ~1,500 set lookups
5,000 ~8,000 0 ~8,000 set lookups

The only API calls are the label transition itself (one gh issue edit per newly-unblocked issue), and those only fire when something actually changes — typically 1-3 per cycle at most.

Simpler Variant: Event-Driven Post-Merge Forward-Unblock

Instead of scanning the full graph every cycle, trigger resolution only when the merge pass closes an issue:

After merge_ready_prs_all_repos() closes issue X:
  1. Look up X in the reverse map → get downstream issues [Y, Z]
  2. For each downstream issue:
     - Check if ALL its blockers (from forward map) are closed
     - If yes → swap labels
  3. Next fill floor cycle (2 min) dispatches worker for Y/Z

Cost per merge: O(D) where D = number of downstream issues of the closed blocker. Typically 1-3. Scales with merge volume, not total issue count.

Trade-off: More targeted (fewer checks) but only triggers on PR merges. Issues closed without a PR merge (manual close, duplicate, etc.) wouldn't trigger forward-unblock until the next graph scan. The full graph scan (Phase 2) could run at a lower frequency (every 10 min) as a catch-all.

Recommendation: Implement both — event-driven forward-unblock for the fast path, periodic graph scan as the safety net.

Integration point in main()

main()
├── _run_preflight_stages()           # existing — fetches open issues
├── merge_ready_prs_all_repos()       # existing — merges ready PRs
│   └── (event-driven forward-unblock after each merge)  # NEW
├── resolve_blocked_by_graph()        # NEW — periodic graph scan (~60 lines)
│   ├── read cached dependency graph
│   ├── build open-issue set from prefetch data
│   ├── for each blocked issue: check if all blockers ∉ open set
│   └── swap labels for newly-resolved issues
├── apply_deterministic_fill_floor()  # existing — dispatches available issues
└── (LLM supervisor, if triggered)
        └── rebuild_dependency_graph() as side effect  # NEW (~40 lines)

Cache Maintenance and Staleness

  • Rebuild triggers: LLM supervisor run, standalone 1-hour cadence, new status:blocked issue detected in prefetch
  • Incremental updates: Track last_build_epoch and only read bodies for issues with updatedAt > last_build_epoch
  • Staleness bound: Worst case, a newly-blocked issue waits one rebuild cycle (1 hour) before entering the graph. The LLM supervisor remains the safety net for anything the cache misses.
  • Invalidation: When resolve_blocked_by_graph() transitions labels, remove the entry from the forward map and write back. Prevents re-processing.

Risk Assessment

Low risk:

  • Resolution is a pure deterministic check — no judgment, no model calls, no new edge cases
  • Uses existing is_blocked_by_unresolved() parsing logic for graph construction
  • Label swaps are the same operation the LLM supervisor already performs
  • Cache staleness is bounded and the LLM supervisor remains the safety net
  • Zero impact on the LLM supervisor path — it can still handle blocked-by if it runs

Potential concerns and mitigations:

  • Stale cache serves wrong data: Bounded by rebuild cadence (1h). Worst case: an issue stays blocked one extra hour — same as current behavior.
  • Race condition on label swap: The dispatch dedup guards (7-layer) already handle this. A label swap during an active dispatch cycle is safe.
  • Graph construction cost at scale: With GitHub native sub-issues API as the data source, graph construction is a single GraphQL query — not O(B) REST calls. Fall back to body parsing only if native API is unavailable.

Files to Modify

  • EDIT: .agents/scripts/pulse-wrapper.sh — add resolve_blocked_by_graph(), integrate into main() between merge pass and fill floor
  • EDIT: .agents/scripts/pulse-wrapper.sh — add rebuild_dependency_graph(), call from LLM supervisor post-run and standalone cadence
  • EDIT: .agents/scripts/pulse-wrapper.sh — extend merge_ready_prs_all_repos() with event-driven forward-unblock hook
  • NEW: ${PULSE_DIR}/dependency-graph.json — cached graph (runtime artifact, not committed)

Verification

# After implementation, on a repo with blocked-by chains:
# 1. Create two issues: A (no blockers) and B (blocked-by A)
# 2. Let the pulse dispatch a worker for A
# 3. Worker completes A, PR merged, issue closed
# 4. Within 2-4 minutes (not 1 hour): B should have status:available
# 5. Next fill floor cycle dispatches worker for B

# Verify graph cache:
cat "${PULSE_DIR}/dependency-graph.json" | jq .

# Verify resolution log:
grep "resolve_blocked_by_graph\|forward-unblock" "$LOGFILE"

Environment

  • aidevops: 3.6.174
  • AI Assistant: Claude Code (claude-opus-4-6)
  • OS: Ubuntu 24.04.4 LTS
  • Shell: bash 5.2.21
  • gh CLI: 2.89.0

Related

  • pulse-wrapper.sh:8630is_blocked_by_unresolved() — existing blocked-by parser (reuse for graph construction)
  • pulse-wrapper.sh:9429_should_run_llm_supervisor() — the stall gate that causes the delay
  • pulse-wrapper.sh:10512-10526 — deterministic merge pass + fill floor — the integration point
  • issue-sync-helper.sh:1048-1102 — GitHub native sub-issues sync (addBlockedBy GraphQL) — potential data source
  • pulse.md:275 — "status:blocked but blockers resolved → remove label, add status:available" — the LLM instruction being moved to deterministic
  • GH#17779 — _is_task_committed_to_main() false positive fix — related blocked-by dispatch bug

Review Notes (Approved — tier:standard)

Reviewer: claude-opus-4-6 via /review-issue-pr

Validation

  • Reproducible: Yes — code confirms _should_run_llm_supervisor() (line 9598) gates blocked-by resolution behind a 1h stall threshold. is_blocked_by_unresolved() (line 8799) is only called defensively at dispatch time to skip blocked issues — it never transitions labels.
  • Not duplicate: Confirmed. No prior issues address deterministic blocked-by resolution. GH#17779 is related but distinct.
  • Classification: Enhancement (current behavior is by design; the issue correctly identifies it should be deterministic).

Design Corrections

  1. addBlockedBy GraphQL claim is inaccurate. The issue states issue-sync-helper.sh already syncs addBlockedBy relationships at line 1094. This doesn't exist in the current codebase (1462-line file, no blocked/dependency/addBlocked references). Drop data source option 1 from the plan. Body parsing (option 3) is the realistic data source.

  2. The "progress suppresses LLM" paradox is the real insight. Line 9654 shows that when total_now < total_before, the snapshot is updated and the LLM is skipped. Workers completing tasks = fewer open issues = "progress" = LLM suppressed. But the remaining issues are all status:blocked and can't progress without the LLM. This is the core bug — call it out in commit messages.

Implementation Guidance for Worker

Approach: self-maintaining dependency graph (per @robstiles addendum). The incremental maintenance makes the graph cheap enough that a cache-less intermediate step adds no value. Implement the full graph lifecycle directly:

  1. Cold start (no cache): Fetch ALL status:blocked issue bodies (one-time O(B) cost). Build forward + reverse maps. Write to ${PULSE_DIR}/dependency-graph.json.
  2. Steady state (every 2-min cycle, ~0 API calls): Read cached graph. Diff prefetch labels against graph — fetch body only for newly status:blocked issues (typically 0-1). Remove entries no longer status:blocked. Run resolution check against open-issue set from PULSE_PREFETCH_CACHE_FILE (already on disk, zero API calls). Swap labels for resolved issues.
  3. Supervisor backstop: Full graph rebuild on LLM supervisor runs. Catches body edits, manual label changes, drift. Overwrites cache — consistency check, not primary path.
  4. Event-driven forward-unblock: After merge_ready_prs_all_repos() closes an issue, look up its reverse map entries and check if downstream issues are now fully unblocked.

Integration point: Between merge_ready_prs_all_repos() and apply_deterministic_fill_floor() in main() (~line 10688).

Data source: PULSE_PREFETCH_CACHE_FILE (~/.aidevops/logs/pulse-prefetch-cache.json) for the open-issue set. Issue bodies via gh issue view for blocked-by parsing (reuse is_blocked_by_unresolved() logic).

Known limitations (acceptable):

  • Cross-repo blocked-by not supported (existing limitation, separate enhancement)
  • Body edits between graph rebuilds not detected until next supervisor run (rare in practice)
  • Stall detection unaffected — label swaps don't change open issue count

Tier: tier:standard — straightforward engineering using existing functions and data sources, no novel design needed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementAuto-created from TODO.md tagnot-plannedClosed without implementation — not plannedorigin:workerAuto-created by pulse labelless backfill (t2112)tier:standardAuto-created by pulse labelless backfill (t2112)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions