[Epic] Large Document Council Reliability + Cost Optimization #558

@simonholmes001

Outcome

Deliver a reliable and understandable large-document council experience with clear live progress, robust failure handling, and materially lower runtime/token cost.

Scope

  • Keep council watch alive until terminal completion.
  • Expose explicit run stages (queued/chunking/analysis/critique/synthesis/completed/failed).
  • Isolate per-agent failures so one failure does not abort the whole run.
  • Reduce large-document cost and duration via a map-reduce-style context pipeline.
  • Add SLOs/metrics and rollout guardrails.
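
A minimal sketch of two of the scope items above, the explicit run stages and per-agent failure isolation. Only the stage names come from this issue. `RunStage`, `TERMINAL_STAGES`, `run_agents_isolated`, and the shape of an "agent" (an async callable that takes a chunk) are hypothetical names chosen for illustration, not the project's actual API.

```python
import asyncio
from enum import Enum


class RunStage(str, Enum):
    """Stage names taken from the scope list; the enum itself is illustrative."""
    QUEUED = "queued"
    CHUNKING = "chunking"
    ANALYSIS = "analysis"
    CRITIQUE = "critique"
    SYNTHESIS = "synthesis"
    COMPLETED = "completed"
    FAILED = "failed"


# A watch should only stop once a run reaches one of these stages.
TERMINAL_STAGES = {RunStage.COMPLETED, RunStage.FAILED}


async def run_agents_isolated(agents, chunk):
    """Run every agent on one chunk and collect failures instead of raising.

    `agents` maps an agent name to an async callable (assumed shape).
    Returns (results, failures), both keyed by agent name.
    """
    names = list(agents)
    # return_exceptions=True makes gather hand back each exception as a value,
    # so one failing agent does not cancel or abort its siblings.
    outcomes = await asyncio.gather(
        *(agents[name](chunk) for name in names),
        return_exceptions=True,
    )
    results, failures = {}, {}
    for name, outcome in zip(names, outcomes):
        if isinstance(outcome, BaseException):
            failures[name] = repr(outcome)
        else:
            results[name] = outcome
    return results, failures
```

With this shape, the caller decides what a partial result means (for example, continuing to synthesis with the agents that succeeded); that policy is not specified in this issue.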

Acceptance Criteria

  • Users always see either live progress or a terminal state; the watch never stops ambiguously while a run is idle.
  • Council runs persist run metadata and failure reason codes.
  • The full-run failure rate is reduced through per-agent fault isolation.
  • Cost and runtime targets are defined and measured in production telemetry.
  • Rollout has feature flags and a rollback plan.
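
One way the first two criteria could fit together, sketched under stated assumptions: a persisted run record carries the current stage and per-agent failure reason codes, and the watch loop exits only when it observes a terminal stage. `RunRecord`, `watch`, `fetch_record`, and the reason code `AGENT_TIMEOUT` are hypothetical; the issue does not define a storage schema or a code vocabulary.

```python
import time
from dataclasses import dataclass, field

# Terminal stage names as listed in the scope of this issue.
TERMINAL = {"completed", "failed"}


@dataclass
class RunRecord:
    """Hypothetical persisted run metadata."""
    run_id: str
    stage: str = "queued"
    # agent name -> failure reason code (the code vocabulary is an assumption)
    failure_codes: dict = field(default_factory=dict)


def watch(fetch_record, run_id, poll_seconds=1.0, sleep=time.sleep):
    """Yield the record on every stage change until a terminal stage is seen.

    `fetch_record` is an assumed callable that loads the latest RunRecord.
    There is deliberately no idle timeout: the loop ends only on a terminal
    stage, so the caller never sees a silent stop.
    """
    last_stage = None
    while True:
        record = fetch_record(run_id)
        if record.stage != last_stage:
            last_stage = record.stage
            yield record
        if record.stage in TERMINAL:
            return
        sleep(poll_seconds)


# Usage example with an in-memory stand-in for the run store.
if __name__ == "__main__":
    stages = iter(["queued", "chunking", "analysis", "failed"])

    def fake_fetch(run_id):
        stage = next(stages)
        codes = {"critic": "AGENT_TIMEOUT"} if stage == "failed" else {}
        return RunRecord(run_id, stage, codes)

    for rec in watch(fake_fetch, "run-1", sleep=lambda _: None):
        print(rec.stage, rec.failure_codes)
```

A production watch would likely also need a heartbeat or a staleness check so that a crashed worker eventually moves the run to `failed`; that is outside this sketch.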
