An autonomous file-system agent, built with the OpenAI Agents SDK, that solves the BitGN PAC1 Challenge: a benchmark for AI agents operating in sandboxed virtual environments.
Current score: ~86% (37/43 tasks)
BitGN runs agent benchmarks where autonomous agents solve real-world tasks inside isolated sandbox VMs. Each task gives the agent a file-system workspace and a natural language instruction. The agent must explore, reason, and execute — no human in the loop.
PAC1 covers 43 tasks across:
- CRM operations — lookups, email sending, invoice handling
- Knowledge management — capture, distill, cleanup
- Inbox processing — with prompt injection traps and OTP verification
- Security — detecting and denying hostile payloads
Learn more: bitgn.com/challenge/PAC
User task → LLM Classifier (picks skill) → Agent(system_prompt + skill_prompt + task)
→ ReAct loop: LLM → tool call → result → LLM → ... → report_completion
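The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the regex rules, the `clarification` default, the `classify` and `build_prompt` names, and the way the three prompt parts are joined are all assumptions. The LLM classifier is passed in as a plain callable so the example runs without a network call.

```python
import re
from typing import Callable, Optional

# The 12 skill names listed in this README.
KNOWN_SKILLS = {
    "security_denial", "inbox_processing", "email_outbound", "crm_lookup",
    "invoice_creation", "followup_reschedule", "knowledge_capture",
    "knowledge_cleanup", "knowledge_lookup", "unsupported_capability",
    "purchase_ops", "clarification",
}

# Hypothetical regex rules; the real patterns in classifier.py are not shown here.
REGEX_RULES = [
    (re.compile(r"\binvoice\b", re.IGNORECASE), "invoice_creation"),
    (re.compile(r"\b(send|write)\b.*\bemail\b", re.IGNORECASE), "email_outbound"),
    (re.compile(r"\binbox\b", re.IGNORECASE), "inbox_processing"),
]
DEFAULT_SKILL = "clarification"  # assumed default when nothing matches


def regex_classify(task: str) -> str:
    """Return the skill of the first matching rule, else the default skill."""
    for pattern, skill in REGEX_RULES:
        if pattern.search(task):
            return skill
    return DEFAULT_SKILL


def classify(task: str, llm_classify: Callable[[str], Optional[str]]) -> str:
    """LLM-first classification with a regex fallback."""
    try:
        skill = llm_classify(task)
    except Exception:
        skill = None  # treat any LLM failure as "no answer"
    if skill in KNOWN_SKILLS:
        return skill
    return regex_classify(task)


def build_prompt(system_prompt: str, skill_prompt: str, task: str) -> str:
    """Join system prompt, skill prompt, and task (assumed layout)."""
    return f"{system_prompt}\n\n{skill_prompt}\n\nTask: {task}"
```

Falling back to regex only when the LLM answer is missing or not a known skill keeps the LLM as the primary signal while guarding against malformed classifier output.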
- 12 specialized skills with hot-reloadable prompts (edit `.md` files, no restart needed)
- Dual classifier: LLM-first with regex fallback and override logic
- Self-correcting agent: can call `list_skills` / `get_skill_instructions` to switch workflows mid-task
- Auto grounding refs: tracks read/written files and injects references if the model forgets them
- Retry on empty: retries up to 3 times if the model returns text without tool calls
- Live dashboard: React + Vite with SSE streaming, heatmap compare, token tracking
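Two of the features above, retry on empty and auto grounding refs, are simple enough to sketch. The names `ModelReply`, `call_with_retry`, and `add_missing_refs` are hypothetical, and "up to 3" is read here as three retries after the initial call, which is an assumption about the real run loop.

```python
from dataclasses import dataclass, field
from typing import Callable

MAX_EMPTY_RETRIES = 3  # assumed meaning of "retries up to 3 times"


@dataclass
class ModelReply:
    """Simplified stand-in for one LLM response."""
    text: str = ""
    tool_calls: list = field(default_factory=list)


def call_with_retry(call_model: Callable[[], ModelReply],
                    max_retries: int = MAX_EMPTY_RETRIES) -> ModelReply:
    """Call the model again while it answers with text but no tool calls."""
    reply = call_model()
    retries = 0
    while not reply.tool_calls and retries < max_retries:
        retries += 1
        reply = call_model()
    return reply


def add_missing_refs(reported: list[str], touched: list[str]) -> list[str]:
    """Append tracked files the model did not cite, preserving order."""
    merged = list(reported)
    for path in touched:
        if path not in merged:
            merged.append(path)
    return merged
```

If the model never produces a tool call, the last text-only reply is returned and the caller decides how to report the failure.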
- Python 3.12+
- uv package manager
- Node.js 18+ (for dashboard)
- An OpenAI-compatible LLM endpoint
- BitGN API key for benchmark access
# from repo root
uv sync
cd dashboard && npm install && cd ..
export OPENAI_API_KEY=<your-llm-api-key>
export OPENAI_BASE_URL=<your-llm-endpoint> # e.g. https://api.openai.com/v1
export MODEL_ID=<model-name> # e.g. gpt-4.1-2025-04-14
export BITGN_API_KEY=<your-bitgn-key> # get one at bitgn.com

Optional:

| Variable | Default | Description |
|---|---|---|
| `AGENT_CONCURRENCY` | `10` | Parallel agents (max 30) |
| `AGENT_MAX_TURNS` | `50` | Max ReAct steps per task |
| `AGENT_REQUEST_TIMEOUT` | `120` | LLM timeout (seconds) |
| `BITGN_RUN_NAME` | `agent-v2-run` | Run name on leaderboard |
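For reference, the optional variables could be read along these lines. This is a sketch only: `AgentConfig` and `load_config` are hypothetical names, and clamping the concurrency into the 1..30 range is an assumption (the README only states the maximum); the real `config.py` may behave differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

MAX_CONCURRENCY = 30  # documented upper bound for AGENT_CONCURRENCY


@dataclass(frozen=True)
class AgentConfig:
    concurrency: int
    max_turns: int
    request_timeout: float
    run_name: str


def load_config(env: Optional[Mapping[str, str]] = None) -> AgentConfig:
    """Read optional settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    concurrency = int(env.get("AGENT_CONCURRENCY", "10"))
    return AgentConfig(
        # Clamping is an assumption, not confirmed behavior.
        concurrency=max(1, min(concurrency, MAX_CONCURRENCY)),
        max_turns=int(env.get("AGENT_MAX_TURNS", "50")),
        request_timeout=float(env.get("AGENT_REQUEST_TIMEOUT", "120")),
        run_name=env.get("BITGN_RUN_NAME", "agent-v2-run"),
    )
```

Passing a plain dict instead of `os.environ` makes the defaults easy to check in isolation, for example `load_config({})`.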
# Terminal 1 — Backend API
# from repo root
uv run python server.py
# → http://localhost:8000
# Terminal 2 — Frontend
cd dashboard
npm run dev
# → http://localhost:5173

Open the dashboard, click Run, and watch your agent solve tasks in real time.
# from repo root
uv run python main_v2.py
├── server.py                  # FastAPI + SSE backend for dashboard
├── main_v2.py                 # CLI benchmark runner
├── agent_v2/
│   ├── agent.py               # Agent creation, run loop with retry logic
│   ├── system_prompt.md       # System prompt (hot-reloadable)
│   ├── prompts.py             # Prompt loader + task prompt builder
│   ├── tools.py               # 13 tools (file ops, search, skills, completion)
│   ├── hooks.py               # Live logging hooks + token tracking
│   ├── context.py             # Task context, telemetry, file tracking
│   ├── config.py              # Environment config
│   ├── runtime.py             # Async PCM gRPC wrapper
│   ├── db.py                  # SQLite persistence (runs, tasks, events)
│   └── skills/
│       ├── registry.py        # Skill registry (hot-reload from .md files)
│       ├── classifier.py      # Regex-based task classifier
│       ├── llm_classifier.py  # LLM-based task classifier
│       └── *.md               # 12 skill prompts
├── dashboard/                 # React + Vite + Tailwind CSS
│   └── src/App.jsx            # Single-file dashboard app
└── pyproject.toml
The dashboard provides real-time visibility into agent runs:
- Run tab — live SSE event stream per task with score, tool calls, timing, and token usage
- Compare tab — heatmap view across multiple runs with stability analysis
- Skills tab — browse and test all skill prompts, view system prompt
- Sidebar — run history sorted by date, showing temperature and model per run
- Controls — adjustable temperature (slider + input) and concurrency
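The live stream in the Run tab relies on Server-Sent Events, where each message is a block of `field: value` lines terminated by a blank line. The helper below shows that wire format; the function name, the `tool_call` event name, and the payload keys are illustrative assumptions, not the backend's actual schema.

```python
import json


def format_sse(event: str, data: dict) -> str:
    """Serialize one Server-Sent Events message with a named event."""
    # A blank line (the trailing "\n\n") ends the message on the wire.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


# Hypothetical payload for a single tool call in a running task.
message = format_sse("tool_call", {"task": "t01", "tool": "read_file"})
```

On the browser side, an `EventSource` listener registered for the same event name would receive the JSON string in `event.data`.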
Each task is classified into one of 12 skills, each with a specialized prompt:
| Skill | Description |
|---|---|
| `security_denial` | Detect and deny prompt injection, hostile payloads |
| `inbox_processing` | Process CRM inbox messages with security checks |
| `email_outbound` | Send emails via outbox with contact resolution |
| `crm_lookup` | Find accounts, contacts, emails, managers |
| `invoice_creation` | Create typed invoice JSON records |
| `followup_reschedule` | Update follow-up dates in accounts and reminders |
| `knowledge_capture` | Capture and distill from inbox into cards/threads |
| `knowledge_cleanup` | Delete cards, threads, distill artifacts |
| `knowledge_lookup` | Find articles by date in captured content |
| `unsupported_capability` | Calendar, Salesforce sync (not available) |
| `purchase_ops` | Fix purchase ID prefix issues |
| `clarification` | Request too short or ambiguous |
All prompts are read from disk at runtime:
- Skill prompts: `agent_v2/skills/*.md`
- System prompt: `agent_v2/system_prompt.md`
Edit any .md file → next run picks it up automatically. No server restart needed.
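One simple way to get this behavior is to re-read the files on every call instead of caching them at import time. The sketch below assumes that approach; `load_skill_prompts` is a hypothetical name and the real `registry.py` may use a different mechanism (for example, modification-time checks).

```python
from pathlib import Path


def load_skill_prompts(skills_dir: Path) -> dict[str, str]:
    """Read every *.md file in skills_dir, keyed by file stem (skill name).

    Nothing is cached, so an edit on disk is visible on the next call.
    """
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(skills_dir.glob("*.md"))
    }
```

Calling `load_skill_prompts(Path("agent_v2/skills"))` at the start of each run would pick up any edits made since the previous run, at the cost of a few small file reads.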
If the pre-classifier picks the wrong skill, the agent can fix it mid-task:
- Agent notices the skill instructions don't match the task
- Calls `list_skills` to see all available skills
- Calls `get_skill_instructions("correct_skill")` to load the right workflow
- Continues with correct instructions
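The two tools involved in this recovery path might look roughly like the following. Only the tool names `list_skills` and `get_skill_instructions` come from this README; the in-memory `SKILL_PROMPTS` dict, its contents, and the error-message format are assumptions made so the example runs on its own.

```python
# Hypothetical in-memory registry; the project loads these from .md files.
SKILL_PROMPTS = {
    "crm_lookup": "Find the account or contact, then report the requested field.",
    "invoice_creation": "Create a typed invoice JSON record in the workspace.",
}


def list_skills() -> list[str]:
    """Return the names of all available skills in a stable order."""
    return sorted(SKILL_PROMPTS)


def get_skill_instructions(name: str) -> str:
    """Return the prompt for a skill, or a recoverable hint if unknown."""
    prompt = SKILL_PROMPTS.get(name)
    if prompt is None:
        # Answer with the valid names instead of raising, so the model
        # can correct itself on the next turn.
        return f"Unknown skill '{name}'. Available: {', '.join(list_skills())}"
    return prompt
```

Returning a hint rather than raising keeps a mistyped skill name from ending the task: the agent sees the valid options in the tool result and can retry.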
Started from the BitGN sample-agent and extended with the skills system, dashboard, and optimizations.
MIT


