[RFC] Native Agent Swarm Architecture - Multi-Process Parallel Execution #1495

## Summary

This is a follow-up to #1493 (software-layer Agent Swarm). While #1493 provides immediate value, this RFC proposes a **native architecture** for true multi-agent parallel execution using process isolation and message passing.

**Goal**: Transform nanobot from a single-agent system into a multi-agent orchestrator without breaking existing skills.

## Motivation

Current limitations of the single-agent architecture:
- **No true parallelism**: Tool calls run one at a time, so each blocks the next
- **Context pollution**: One complex task crowds out earlier context
- **No fault isolation**: A single failed tool can crash the entire session
- **Limited scalability**: Workloads can't be distributed across processes or machines

## Proposed Architecture

```
┌─────────────────────────────────────────────────────────────┐
│                    nanobot (Orchestrator)                   │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Planner    │  │  Message Bus │  │  Supervisor  │      │
│  └──────────────┘  └──────────────┘  └──────────────┘      │
└─────────────────────────────────────────────────────────────┘
                            │
        ┌───────────────────┼───────────────────┐
        ▼                   ▼                   ▼
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ Agent Worker │    │ Agent Worker │    │ Agent Worker │
│  (Research)  │    │  (Architect) │    │  (Analyst)   │
│  ┌──────────┐│    │  ┌──────────┐│    │  ┌──────────┐│
│  │Skill Env ││    │  │Skill Env ││    │  │Skill Env ││
│  │(isolated)││    │  │(isolated)││    │  │(isolated)││
│  └──────────┘│    │  └──────────┘│    │  └──────────┘│
└──────────────┘    └──────────────┘    └──────────────┘
```

## Core Components

### 1. Agent Worker Process (Docker/Subprocess)

```python
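from typing import List

# Each agent runs in an isolated environment.
# DockerContainer, AgentMemory, ToolRegistry, Task, and Result are
# placeholder names for types this RFC proposes; none exist yet.
class AgentWorker:
    def __init__(self, role: str, skills: List[str]):
        self.role = role
        self.container = DockerContainer()  # or a subprocess
        self.memory = AgentMemory()  # isolated from the orchestrator
        self.tool_registry = ToolRegistry(skills)

    async def execute(self, task: "Task") -> "Result":
        # Runs independently and reports via the message bus.
        raise NotImplementedError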
```

### 2. Message Bus (IPC)

```python
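from typing import AsyncIterator

# Async message passing between agents (interface only).
# Message is a placeholder type; its fields are not specified in this RFC.
class MessageBus:
    async def publish(self, channel: str, message: "Message") -> None: ...
    async def subscribe(self, channel: str) -> AsyncIterator["Message"]: ...

# Supported backends: Redis (production) / SQLite (local) / In-memory (dev)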
```
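To make the in-memory (dev) backend concrete, here is a minimal sketch built on `asyncio.Queue`. The `Message` fields, the `InMemoryMessageBus` and `Subscription` names, and the choice to fan out one queue per subscriber are all assumptions for illustration; the RFC does not fix any of them.

```python
import asyncio
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, AsyncIterator, DefaultDict, List


@dataclass
class Message:
    # Hypothetical fields; the RFC names the type but not its shape.
    sender: str
    payload: Any


class Subscription:
    """Async iterator over the messages delivered to one subscriber."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    def deliver(self, message: Message) -> None:
        self._queue.put_nowait(message)  # unbounded queue, never blocks

    def __aiter__(self) -> "Subscription":
        return self

    async def __anext__(self) -> Message:
        return await self._queue.get()


class InMemoryMessageBus:
    """Dev-only backend: every subscriber on a channel gets its own copy."""

    def __init__(self) -> None:
        self._channels: DefaultDict[str, List[Subscription]] = defaultdict(list)

    async def publish(self, channel: str, message: Message) -> None:
        for subscription in self._channels[channel]:
            subscription.deliver(message)

    async def subscribe(self, channel: str) -> AsyncIterator[Message]:
        # Registration happens here, so messages published after this
        # call returns are not missed.
        subscription = Subscription()
        self._channels[channel].append(subscription)
        return subscription
```

A caller would write `sub = await bus.subscribe("findings")` and then `async for message in sub: ...`. This sketch only works within one process; crossing process boundaries is what the SQLite and Redis backends would be for.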

### 3. Task Queue & Scheduler

```python
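from typing import List, Optional

# SQLite-based queue (no external deps for v1); interface only.
# Task and Result are placeholder types proposed by this RFC.
class TaskQueue:
    def enqueue(self, task: "Task", priority: int) -> None: ...
    def dequeue(self, agent_roles: List[str]) -> Optional["Task"]: ...
    def complete(self, task_id: str, result: "Result") -> None: ...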
```
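One way the queue interface above could be backed by the standard-library `sqlite3` module is sketched below. The table schema, the JSON payloads, the integer task IDs, and the rule that a higher `priority` number is served first are assumptions, not part of the proposal.

```python
import json
import sqlite3
from typing import List, Optional, Tuple


class SqliteTaskQueue:
    """Illustrative single-process queue; schema and names are hypothetical."""

    def __init__(self, path: str = ":memory:") -> None:
        self._db = sqlite3.connect(path)
        with self._db:
            self._db.execute(
                """CREATE TABLE IF NOT EXISTS tasks (
                       id INTEGER PRIMARY KEY AUTOINCREMENT,
                       role TEXT NOT NULL,
                       payload TEXT NOT NULL,
                       priority INTEGER NOT NULL,
                       status TEXT NOT NULL DEFAULT 'pending',
                       result TEXT
                   )"""
            )

    def enqueue(self, role: str, payload: dict, priority: int = 0) -> int:
        with self._db:  # commits on success, rolls back on error
            cursor = self._db.execute(
                "INSERT INTO tasks (role, payload, priority) VALUES (?, ?, ?)",
                (role, json.dumps(payload), priority),
            )
        return cursor.lastrowid

    def dequeue(self, agent_roles: List[str]) -> Optional[Tuple[int, dict]]:
        if not agent_roles:
            return None
        placeholders = ",".join("?" for _ in agent_roles)
        # Select the best pending task, then mark it running. With several
        # worker processes this pair would need explicit locking (for
        # example BEGIN IMMEDIATE), which this sketch does not show.
        with self._db:
            row = self._db.execute(
                "SELECT id, payload FROM tasks "
                f"WHERE status = 'pending' AND role IN ({placeholders}) "
                "ORDER BY priority DESC, id ASC LIMIT 1",
                list(agent_roles),
            ).fetchone()
            if row is None:
                return None
            self._db.execute(
                "UPDATE tasks SET status = 'running' WHERE id = ?", (row[0],)
            )
        return row[0], json.loads(row[1])

    def complete(self, task_id: int, result: dict) -> None:
        with self._db:
            self._db.execute(
                "UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
                (json.dumps(result), task_id),
            )
```

Matching on `role` lets `dequeue` hand a worker only the tasks it is able to run, which is how the `agent_roles` parameter in the interface is read here.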

### 4. Blackboard Pattern

Shared workspace for agent collaboration:
```
/blackboard/{session_id}/
  ├── context.md      # Shared context
  ├── findings/       # Researcher deposits here
  ├── designs/        # Architect deposits here
  ├── critiques/      # Critic deposits here
  └── final/          # Coordinator synthesizes here
```
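The layout above could be created and written to with a couple of small helpers. This is a sketch under assumptions: `init_blackboard` and `deposit` are hypothetical names, and writing through a temporary file followed by a rename is one possible way to keep readers from seeing half-written files, not something the RFC prescribes.

```python
from pathlib import Path

# Directory names taken from the layout above.
SECTIONS = ("findings", "designs", "critiques", "final")


def init_blackboard(root: Path, session_id: str) -> Path:
    """Create the per-session directory tree and an empty shared context."""
    board = root / session_id
    for section in SECTIONS:
        (board / section).mkdir(parents=True, exist_ok=True)
    context = board / "context.md"
    if not context.exists():
        context.write_text("# Shared context\n", encoding="utf-8")
    return board


def deposit(board: Path, section: str, name: str, text: str) -> Path:
    """Write one artifact into a section of the blackboard."""
    if section not in SECTIONS:
        raise ValueError(f"unknown blackboard section: {section}")
    target = board / section / name
    # Write to a sibling temp file, then rename it into place so a reader
    # sees either the old file or the complete new one.
    temp = target.with_name(target.name + ".tmp")
    temp.write_text(text, encoding="utf-8")
    temp.replace(target)
    return target
```

For example, a research agent might call `deposit(board, "findings", "sources.md", notes)` and the coordinator would later read everything under `findings/`.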

## Execution Flow

```
User Request
    ↓
[Orchestrator] Decompose into subtasks
    ↓
[Task Queue] Assign to agents
    ↓
[Agent A] ──┐
[Agent B] ──┼──► [Blackboard] ◄── [Orchestrator] ──► Final Response
[Agent C] ──┘         ↑
              (continuous updates)
```
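The flow above can be approximated in a few lines of `asyncio`. In this sketch, coroutines in a single process stand in for the isolated workers, a plain dict stands in for the blackboard, and the three agent functions, `orchestrate`, and `synthesize` are hypothetical names chosen for illustration.

```python
import asyncio
from typing import Awaitable, Callable, Dict

AgentFn = Callable[[str], Awaitable[str]]


async def researcher(subtask: str) -> str:
    await asyncio.sleep(0)  # yield control, as a real tool call would
    return f"findings for {subtask}"


async def architect(subtask: str) -> str:
    await asyncio.sleep(0)
    return f"design for {subtask}"


async def analyst(subtask: str) -> str:
    raise RuntimeError("analysis tool failed")  # simulated failure


AGENTS: Dict[str, AgentFn] = {
    "research": researcher,
    "architect": architect,
    "analyst": analyst,
}


async def orchestrate(request: str, agents: Dict[str, AgentFn]) -> Dict[str, str]:
    """Fan the request out to every agent and collect results per role."""
    roles = list(agents)
    # Placeholder decomposition: every role receives the full request.
    outcomes = await asyncio.gather(
        *(agents[role](request) for role in roles),
        return_exceptions=True,  # one failing agent must not cancel the rest
    )
    blackboard: Dict[str, str] = {}
    for role, outcome in zip(roles, outcomes):
        if isinstance(outcome, Exception):
            blackboard[role] = f"FAILED: {outcome}"
        else:
            blackboard[role] = outcome
    return blackboard


def synthesize(blackboard: Dict[str, str]) -> str:
    """Combine the successful entries into a final response."""
    usable = [text for text in blackboard.values() if not text.startswith("FAILED")]
    return "; ".join(usable)
```

Running `asyncio.run(orchestrate("design a cache", AGENTS))` yields one entry per role, with the analyst's failure recorded rather than propagated. The native architecture would replace the coroutines with worker processes and the dict with the shared blackboard directory.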

## Implementation Phases

### Phase 1: Subprocess Workers (MVP)
- No Docker dependency
- Python multiprocessing for isolation
- SQLite message queue
- **Timeline**: 2-3 weeks
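
To illustrate the fault isolation Phase 1 is after, the sketch below runs each task in a fresh child interpreter and turns a crash into an error result. Note that it uses the standard `subprocess` module rather than `multiprocessing`, purely so the example does not depend on the platform's process start method. The worker body, the JSON-over-stdin protocol, and the `run_worker` name are assumptions.

```python
import json
import subprocess
import sys

# Hypothetical worker body, executed in a separate interpreter via -c.
WORKER_SOURCE = """
import json, sys
task = json.loads(sys.stdin.read())
if task.get("crash"):
    raise RuntimeError("simulated tool failure")
print(json.dumps({"role": task["role"], "answer": task["x"] * 2}))
"""


def run_worker(task: dict, timeout: float = 30.0) -> dict:
    """Run one task in a child process; failures become error results."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=json.dumps(task),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "error": "timeout"}
    if proc.returncode != 0:
        # The last stderr line of an uncaught exception names the error.
        lines = proc.stderr.strip().splitlines()
        error = lines[-1] if lines else f"exit code {proc.returncode}"
        return {"ok": False, "error": error}
    return {"ok": True, "result": json.loads(proc.stdout)}
```

A crashing worker returns `{"ok": False, ...}` to the orchestrator instead of taking the session down, which is the property the single-agent design lacks.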

### Phase 2: Docker Isolation
- Each agent in container
- Shared volume for blackboard
- Redis message bus option
- **Timeline**: 4-6 weeks

### Phase 3: Distributed (Future)
- Multiple machines
- Kubernetes orchestration
- **Timeline**: Not in scope

## Backward Compatibility

```python
# Existing skills work unchanged
class LegacySkill:
    def execute(self, **kwargs):  # Synchronous signature still works
        pass

# New swarm-aware skills can opt in.
# SwarmContext is a placeholder type proposed by this RFC.
class SwarmSkill:
    async def execute(self, context: "SwarmContext"):
        # Can access other agents and publish to the blackboard
        pass
```
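One way a worker could support both skill styles is to inspect the `execute` method and dispatch accordingly. This is a sketch: `run_skill` is a hypothetical helper, the `SwarmContext` fields are invented for the example, and running legacy skills in the default thread pool is an assumption about how blocking calls would be handled.

```python
import asyncio
import inspect
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SwarmContext:
    # Hypothetical shape; the RFC names the type but not its fields.
    kwargs: Dict[str, Any] = field(default_factory=dict)


class LegacySkill:
    def execute(self, **kwargs):
        return f"legacy:{kwargs['query']}"


class SwarmSkill:
    async def execute(self, context: SwarmContext):
        return f"swarm:{context.kwargs['query']}"


async def run_skill(skill: Any, **kwargs: Any) -> Any:
    """Call either skill style through one async entry point."""
    if inspect.iscoroutinefunction(skill.execute):
        # Swarm-aware skill: wrap the arguments in a context object.
        return await skill.execute(SwarmContext(kwargs=kwargs))
    # Legacy skill: synchronous, so run it in the default thread pool
    # to avoid blocking the event loop.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, lambda: skill.execute(**kwargs))
```

With this shape, `await run_skill(LegacySkill(), query="x")` and `await run_skill(SwarmSkill(), query="x")` both work, and existing skills need no code changes.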

## Trade-offs

| Aspect | Software Layer (#1493) | Native Architecture (this) |
|--------|------------------------|---------------------------|
| Complexity | Low | High |
| Parallelism | Simulated | Real |
| Isolation | None | Process/Container |
| Latency | Single call | Coordination overhead |
| Reliability | Single point | Fault tolerant |
| Maintenance | Easy | Harder |

## Recommendation

1. **Merge #1493 first** (software layer) - gives immediate value
2. **Run beta for 1-2 months** - validate demand
3. **If high usage**: Implement Phase 1 (subprocess)
4. **If enterprise demand**: Phase 2 (Docker)

## Open Questions

1. Should we use existing solutions (Celery, Ray) or build minimal?
2. How to handle skill dependency isolation (different Python versions)?
3. What's the max agent count before coordination overhead dominates?

## Labels

rfc, architecture, enhancement, help wanted

---

**Related**: #1493 (software-layer alternative), YourBot multi-bot collaboration

**Credit**: Inspired by TinyClaw's TUI dashboard, NanoClaw's container security, and OpenAI's Swarm framework.
