Observability: Status API (health, active containers, tasks, audit) #773

@jjcaine

Summary

Add a lightweight HTTP server exposing status and health endpoints. This is the foundation for the rest of the observability work: once the system can be queried over HTTP, dashboards, uptime monitors, and alerting can be built on top of these endpoints.

Motivation

Currently there's no way to check system health without shelling in and reading logs or querying SQLite directly. We need a programmatic interface to answer: Is it alive? What's running? What happened recently?

Endpoints

GET /health

Quick liveness/readiness check. Returns:

{
  "status": "ok",
  "uptime_seconds": 84321,
  "whatsapp_connected": true,
  "channels": ["whatsapp", "discord"],
  "db_ok": true,
  "last_message_at": "2026-03-06T10:23:00Z"
}
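
To make the payload above concrete, here is a minimal sketch of how the `/health` body might be assembled. The `HealthInputs` shape, the `"degraded"` status value, the 503 code, and the rule for what counts as healthy are all assumptions for illustration; the issue only specifies the `"ok"` case and a 200 response. It also assumes `channels` lists the currently connected channels.

```typescript
// Shape of the /health payload, taken from the example response.
interface HealthResponse {
  status: "ok" | "degraded"; // "degraded" is an assumed value, not in the issue
  uptime_seconds: number;
  whatsapp_connected: boolean;
  channels: string[];
  db_ok: boolean;
  last_message_at: string | null; // null before the first message (assumption)
}

// Hypothetical inputs the caller would collect from the running process.
interface HealthInputs {
  startedAtMs: number;
  nowMs: number;
  connectedChannels: string[];
  dbOk: boolean;
  lastMessageAt: Date | null;
}

function buildHealth(i: HealthInputs): { code: number; body: HealthResponse } {
  // Assumed rule: healthy when the DB responds and at least one channel is up.
  const ok = i.dbOk && i.connectedChannels.length > 0;
  return {
    code: ok ? 200 : 503, // 503 for the unhealthy case is an assumption
    body: {
      status: ok ? "ok" : "degraded",
      uptime_seconds: Math.floor((i.nowMs - i.startedAtMs) / 1000),
      whatsapp_connected: i.connectedChannels.includes("whatsapp"),
      channels: [...i.connectedChannels],
      db_ok: i.dbOk,
      // Drop milliseconds so the timestamp matches the format in the example.
      last_message_at: i.lastMessageAt
        ? i.lastMessageAt.toISOString().replace(/\.\d{3}Z$/, "Z")
        : null,
    },
  };
}
```

Returning the status code alongside the body keeps the function pure, so it can be unit-tested without starting a server.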

GET /status

Active system state:

{
  "active_containers": [
    { "name": "nanoclaw-main-1709721600", "group": "main", "duration_s": 45, "type": "message" }
  ],
  "active_count": 1,
  "queue_depths": { "main": 2, "work": 0 },
  "registered_groups": 3,
  "pending_tasks": 1
}
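
One way the `/status` snapshot could be derived from in-memory state is sketched below. The `ActiveContainer` record and the `buildStatus` signature are hypothetical; the real `GroupQueue` may track this differently. The sketch assumes each container's start time is recorded in milliseconds so that `duration_s` can be computed at request time.

```typescript
// Hypothetical in-memory record for a running container.
interface ActiveContainer {
  name: string;
  group: string;
  startedAtMs: number;
  type: string;
}

// Shape of the /status payload, taken from the example response.
interface StatusResponse {
  active_containers: { name: string; group: string; duration_s: number; type: string }[];
  active_count: number;
  queue_depths: Record<string, number>;
  registered_groups: number;
  pending_tasks: number;
}

function buildStatus(
  active: ActiveContainer[],
  queueDepths: Record<string, number>,
  registeredGroups: number,
  pendingTasks: number,
  nowMs: number,
): StatusResponse {
  const containers = active.map((c) => ({
    name: c.name,
    group: c.group,
    // Whole seconds since start; clamped so clock skew never yields a negative value.
    duration_s: Math.max(0, Math.floor((nowMs - c.startedAtMs) / 1000)),
    type: c.type,
  }));
  return {
    active_containers: containers,
    active_count: containers.length,
    queue_depths: { ...queueDepths }, // copy so callers cannot mutate queue state
    registered_groups: registeredGroups,
    pending_tasks: pendingTasks,
  };
}
```

Deriving `active_count` from the array length keeps the two fields from drifting apart.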

GET /tasks

Scheduled tasks with recent run history:

{
  "tasks": [
    {
      "id": 1,
      "group": "main",
      "schedule": "0 9 * * *",
      "status": "active",
      "last_run": "2026-03-06T09:00:00Z",
      "last_result": "success",
      "last_duration_ms": 12340
    }
  ]
}

GET /audit

Recent agent activity (the last N container runs; the acceptance criteria set N to 50):

{
  "recent_runs": [
    {
      "group": "main",
      "started_at": "2026-03-06T10:20:00Z",
      "duration_ms": 34000,
      "exit_code": 0,
      "type": "message",
      "trigger": "whatsapp"
    }
  ]
}
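
The "last N" selection for `/audit` could look roughly like the sketch below. The optional `limit` query parameter and the helper names are assumptions and not part of the issue; the cap of 50 comes from the acceptance criteria. In practice the ordering and limit would more likely be pushed into the SQLite query, so treat this as an illustration of the intended behavior rather than the storage access itself.

```typescript
// Shape of one entry in recent_runs, taken from the example response.
interface AuditRun {
  group: string;
  started_at: string; // ISO 8601 UTC
  duration_ms: number;
  exit_code: number;
  type: string;
  trigger: string;
}

const MAX_RUNS = 50; // "last 50" from the acceptance criteria

// Parse a hypothetical ?limit= value; fall back to the cap on bad input.
function parseLimit(raw: string | null): number {
  const n = Number.parseInt(raw ?? "", 10);
  if (!Number.isFinite(n) || n < 1) return MAX_RUNS;
  return Math.min(n, MAX_RUNS);
}

// Return the newest runs first, truncated to the requested limit.
function latestRuns(runs: AuditRun[], limit: number): { recent_runs: AuditRun[] } {
  const sorted = [...runs].sort(
    (a, b) => Date.parse(b.started_at) - Date.parse(a.started_at),
  );
  return { recent_runs: sorted.slice(0, limit) };
}
```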

Implementation Details

  • New file: src/status-server.ts (~150 lines)
  • No new dependencies — use node:http directly
  • Port: configurable via STATUS_PORT env var, default 9100
  • Bind: 127.0.0.1 by default (local only)
  • Data sources: GroupQueue in-memory state, SQLite DB, channel connection status
  • Needs read access to GroupQueue state (active containers, queue depths) — may need to expose a getStatus() method
  • Needs read access to channel connection status from index.ts
  • Task/audit data queried from SQLite (scheduled_tasks, task_run_logs)
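
The points above could be sketched as follows, using only `node:http`. The names `resolvePort`, `route`, and `createStatusServer` are illustrative, and the handler map stands in for whatever accessors `GroupQueue`, `index.ts`, and the SQLite layer end up exposing. The 404, 405, and 500 responses are assumptions; the issue only describes the success path.

```typescript
import * as http from "node:http";

const DEFAULT_PORT = 9100; // default from the issue
const DEFAULT_HOST = "127.0.0.1"; // local only by default

// A handler returns a JSON-serializable body for one endpoint.
type Handler = () => unknown;

// Read STATUS_PORT, falling back to the default when unset or invalid.
function resolvePort(env: Record<string, string | undefined>): number {
  const n = Number(env.STATUS_PORT);
  return Number.isInteger(n) && n > 0 && n < 65536 ? n : DEFAULT_PORT;
}

// Pure routing step, kept separate from the server so it is easy to test.
function route(
  method: string | undefined,
  url: string | undefined,
  handlers: Record<string, Handler>,
): { code: number; body: unknown } {
  if (method !== "GET") return { code: 405, body: { error: "method not allowed" } };
  // The base URL is only needed so that a relative request path can be parsed.
  const path = new URL(url ?? "/", "http://localhost").pathname;
  const handler = handlers[path];
  if (!handler) return { code: 404, body: { error: "not found" } };
  try {
    return { code: 200, body: handler() };
  } catch {
    return { code: 500, body: { error: "internal error" } };
  }
}

function createStatusServer(handlers: Record<string, Handler>): http.Server {
  return http.createServer((req, res) => {
    const { code, body } = route(req.method, req.url, handlers);
    res.writeHead(code, { "Content-Type": "application/json" });
    res.end(JSON.stringify(body));
  });
}

// Example wiring (not executed here):
// createStatusServer({ "/health": getHealth, "/status": getStatus })
//   .listen(resolvePort(process.env), DEFAULT_HOST);
```

Passing `DEFAULT_HOST` to `listen` is what keeps the server off external interfaces; omitting the host argument would bind to all interfaces.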

Acceptance Criteria

  • HTTP server starts alongside main process
  • /health returns 200 when system is operational, includes channel connection status
  • /status shows active containers with names, groups, and durations
  • /tasks lists scheduled tasks with last run info
  • /audit shows recent container runs (last 50)
  • Server binds to localhost only by default
  • Port configurable via env var
  • No new npm dependencies
  • Graceful shutdown (server closes when process exits)
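
For the graceful shutdown criterion, a minimal sketch is shown below. `installGracefulShutdown` and the `Closable` interface are hypothetical names; `http.Server` satisfies `Closable` through its `close()` method. The sketch assumes the main process does not already install conflicting signal handlers. If it does, the returned function can be called from the existing shutdown path instead.

```typescript
// Anything with a close() method, e.g. an http.Server.
interface Closable {
  close(cb?: (err?: Error) => void): unknown;
}

// Register one-shot signal handlers that close the server exactly once.
// Returns the shutdown function so existing shutdown code can call it directly.
function installGracefulShutdown(
  server: Closable,
  signals: NodeJS.Signals[] = ["SIGINT", "SIGTERM"],
): () => void {
  let closing = false;
  const shutdown = (): void => {
    if (closing) return; // ignore repeated signals
    closing = true;
    // Stops accepting new connections; in-flight requests are allowed to finish.
    server.close();
  };
  for (const signal of signals) {
    // Note: adding a handler replaces Node's default exit-on-signal behavior,
    // so the process exits only once nothing else keeps the event loop alive.
    process.once(signal, shutdown);
  }
  return shutdown;
}
```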

Labels

observability, phase-1
