Summary
Add a lightweight HTTP server exposing status and health endpoints. This is the foundation for all observability: once the system can be queried over HTTP, dashboards, uptime monitors, and alerting can all be built on top of these endpoints.
Motivation
Currently there's no way to check system health without shelling in and reading logs or querying SQLite directly. We need a programmatic interface to answer: Is it alive? What's running? What happened recently?
Endpoints
GET /health
Quick liveness/readiness check. Returns:
{
  "status": "ok",
  "uptime_seconds": 84321,
  "whatsapp_connected": true,
  "channels": ["whatsapp", "discord"],
  "db_ok": true,
  "last_message_at": "2026-03-06T10:23:00Z"
}
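A minimal sketch of how the /health payload above could be assembled, written as a pure function so it can be tested without a running server. The input names (`startedAtMs`, `whatsappConnected`, and so on), the `"degraded"` status value, the rule that only a failing DB check makes the system non-operational, and the 503 fallback code are all assumptions for illustration; the issue only specifies the response fields and that a healthy system returns 200.

```typescript
// Shape of the /health response, mirroring the example payload.
interface HealthReport {
  status: "ok" | "degraded"; // "degraded" is an assumed value, not from the spec
  uptime_seconds: number;
  whatsapp_connected: boolean;
  channels: string[];
  db_ok: boolean;
  last_message_at: string | null;
}

// Values the caller would collect from the running system (hypothetical names).
interface HealthInputs {
  startedAtMs: number;
  nowMs: number;
  channels: string[];
  whatsappConnected: boolean;
  dbOk: boolean;
  lastMessageAt: Date | null;
}

// Format a Date as ISO-8601 UTC without milliseconds, e.g. 2026-03-06T10:23:00Z.
function toIsoSeconds(d: Date): string {
  return d.toISOString().replace(/\.\d{3}Z$/, "Z");
}

function buildHealthReport(i: HealthInputs): HealthReport {
  return {
    // Assumption: the system counts as operational when the DB check passes.
    status: i.dbOk ? "ok" : "degraded",
    uptime_seconds: Math.floor((i.nowMs - i.startedAtMs) / 1000),
    whatsapp_connected: i.whatsappConnected,
    channels: i.channels,
    db_ok: i.dbOk,
    last_message_at: i.lastMessageAt ? toIsoSeconds(i.lastMessageAt) : null,
  };
}

// 200 when operational; 503 otherwise is an assumed convention.
function healthStatusCode(report: HealthReport): number {
  return report.status === "ok" ? 200 : 503;
}
```

Returning a non-200 code when a check fails lets simple uptime monitors detect problems without parsing the JSON body.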
GET /status
Active system state:
{
  "active_containers": [
    { "name": "nanoclaw-main-1709721600", "group": "main", "duration_s": 45, "type": "message" }
  ],
  "active_count": 1,
  "queue_depths": { "main": 2, "work": 0 },
  "registered_groups": 3,
  "pending_tasks": 1
}
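The queue-related fields above (active_containers, active_count, queue_depths) could come from a read-only snapshot method on the queue. The sketch below uses a stand-in class, `GroupQueueSketch`, because the real GroupQueue internals are not described in this issue; its fields and its `enqueue`/`start`/`finish` methods are hypothetical and exist only to make the example self-contained.

```typescript
interface ActiveContainer {
  name: string;
  group: string;
  duration_s: number;
  type: string;
}

interface QueueStatus {
  active_containers: ActiveContainer[];
  active_count: number;
  queue_depths: Record<string, number>;
}

// Stand-in for the real GroupQueue; internal state here is an assumption.
class GroupQueueSketch {
  private active = new Map<string, { group: string; type: string; startedAtMs: number }>();
  private pending = new Map<string, number>();

  // Record one more item waiting in the given group's queue.
  enqueue(group: string): void {
    this.pending.set(group, (this.pending.get(group) ?? 0) + 1);
  }

  // Move one waiting item of the group into the active set.
  start(name: string, group: string, type: string, startedAtMs: number): void {
    const waiting = this.pending.get(group) ?? 0;
    this.pending.set(group, Math.max(0, waiting - 1));
    this.active.set(name, { group, type, startedAtMs });
  }

  finish(name: string): void {
    this.active.delete(name);
  }

  // Read-only snapshot; returns fresh objects so callers cannot mutate queue state.
  getStatus(nowMs: number = Date.now()): QueueStatus {
    const active_containers: ActiveContainer[] = [];
    this.active.forEach((c, name) => {
      active_containers.push({
        name,
        group: c.group,
        duration_s: Math.floor((nowMs - c.startedAtMs) / 1000),
        type: c.type,
      });
    });
    const queue_depths: Record<string, number> = {};
    this.pending.forEach((depth, group) => {
      queue_depths[group] = depth;
    });
    return { active_containers, active_count: active_containers.length, queue_depths };
  }
}
```

The remaining fields, registered_groups and pending_tasks, would presumably be filled in from other sources and are left out of this sketch.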
GET /tasks
Scheduled tasks with recent run history:
{
  "tasks": [
    {
      "id": 1,
      "group": "main",
      "schedule": "0 9 * * *",
      "status": "active",
      "last_run": "2026-03-06T09:00:00Z",
      "last_result": "success",
      "last_duration_ms": 12340
    }
  ]
}
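One way to produce the last_run, last_result, and last_duration_ms fields is to pick the most recent run log per task. The sketch below does this in memory on already-fetched rows; the row shapes and column names (`task_id`, `run_at`, `result`) are guesses, since the actual schema of scheduled_tasks and task_run_logs is not given here, and the same result could equally be computed in SQL.

```typescript
// Hypothetical row shapes; the real table columns may differ.
interface TaskRow {
  id: number;
  group: string;
  schedule: string;
  status: string;
}

interface RunLogRow {
  task_id: number;
  run_at: string; // ISO-8601 UTC timestamp
  result: string;
  duration_ms: number;
}

interface TaskSummary extends TaskRow {
  last_run: string | null;
  last_result: string | null;
  last_duration_ms: number | null;
}

function summarizeTasks(tasks: TaskRow[], runs: RunLogRow[]): { tasks: TaskSummary[] } {
  // Keep only the latest run per task.
  const latest = new Map<number, RunLogRow>();
  for (const run of runs) {
    const seen = latest.get(run.task_id);
    // Same-format ISO-8601 UTC strings sort chronologically when compared as strings.
    if (!seen || run.run_at > seen.run_at) {
      latest.set(run.task_id, run);
    }
  }
  return {
    tasks: tasks.map((t) => {
      const last = latest.get(t.id);
      return {
        id: t.id,
        group: t.group,
        schedule: t.schedule,
        status: t.status,
        // Tasks that have never run report null for all three fields.
        last_run: last ? last.run_at : null,
        last_result: last ? last.result : null,
        last_duration_ms: last ? last.duration_ms : null,
      };
    }),
  };
}
```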
GET /audit
Recent agent activity (last N container runs):
{
  "recent_runs": [
    {
      "group": "main",
      "started_at": "2026-03-06T10:20:00Z",
      "duration_ms": 34000,
      "exit_code": 0,
      "type": "message",
      "trigger": "whatsapp"
    }
  ]
}
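If "last N" is meant to be caller-controlled, N could be read from the request URL and clamped. The sketch below assumes a `limit` query parameter, which is not part of this issue, and uses 50 as both the default and the maximum, taken from the acceptance criterion for /audit.

```typescript
const DEFAULT_AUDIT_LIMIT = 50;

// Parse ?limit= from a request URL such as "/audit?limit=10".
// Missing, non-numeric, or non-positive values fall back to the maximum.
function parseAuditLimit(rawUrl: string, max: number = DEFAULT_AUDIT_LIMIT): number {
  // A base is required because request URLs are relative; the host is a placeholder.
  const raw = new URL(rawUrl, "http://localhost").searchParams.get("limit");
  if (raw === null) return max;
  const n = Number.parseInt(raw, 10);
  if (Number.isNaN(n) || n < 1) return max;
  return Math.min(n, max);
}
```

Clamping keeps a single request from pulling an unbounded number of rows out of SQLite.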
Implementation Details
- New file: src/status-server.ts (~150 lines)
- No new dependencies; use node:http directly
- Port: configurable via STATUS_PORT env var, default 9100
- Bind: 127.0.0.1 by default (local only)
- Data sources: GroupQueue in-memory state, SQLite DB, channel connection status
- Needs read access to GroupQueue state (active containers, queue depths); may need to expose a getStatus() method
- Needs read access to channel connection status from index.ts
- Task/audit data queried from SQLite (scheduled_tasks, task_run_logs)
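The points above could fit together roughly as follows, using only node:http. This is a sketch under assumptions rather than the planned contents of src/status-server.ts: the `StatusProviders` interface, the `route` and `resolvePort` helpers, and the 404/405 responses are hypothetical, and the providers stand in for whatever reads GroupQueue, SQLite, and channel state.

```typescript
import { createServer, Server } from "node:http";

// Each provider returns the JSON body for one endpoint (hypothetical interface).
interface StatusProviders {
  health: () => unknown;
  status: () => unknown;
  tasks: () => unknown;
  audit: () => unknown;
}

// Pure routing step, kept separate from the socket so it can be unit-tested.
function route(method: string, path: string, p: StatusProviders): { code: number; body: unknown } {
  if (method !== "GET") return { code: 405, body: { error: "method not allowed" } };
  switch (path) {
    case "/health":
      return { code: 200, body: p.health() };
    case "/status":
      return { code: 200, body: p.status() };
    case "/tasks":
      return { code: 200, body: p.tasks() };
    case "/audit":
      return { code: 200, body: p.audit() };
    default:
      return { code: 404, body: { error: "not found" } };
  }
}

// STATUS_PORT if it is a valid port number, otherwise the default 9100.
function resolvePort(env: Record<string, string | undefined>): number {
  const port = Number(env.STATUS_PORT);
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : 9100;
}

function startStatusServer(p: StatusProviders, env: Record<string, string | undefined> = process.env): Server {
  const server = createServer((req, res) => {
    // Strip any query string before matching the path.
    const path = new URL(req.url ?? "/", "http://localhost").pathname;
    const { code, body } = route(req.method ?? "GET", path, p);
    res.writeHead(code, { "Content-Type": "application/json" });
    res.end(JSON.stringify(body));
  });
  // Bind to loopback only, so the endpoints are not reachable from other hosts.
  server.listen(resolvePort(env), "127.0.0.1");
  return server;
}
```

With a server started this way and the default port, `curl http://127.0.0.1:9100/health` would return whatever the `health` provider produces.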
Acceptance Criteria
- /health returns 200 when system is operational, includes channel connection status
- /status shows active containers with names, groups, and durations
- /tasks lists scheduled tasks with last run info
- /audit shows recent container runs (last 50)
Labels
observability, phase-1