🌐 Live at brennerbot.org
Harness the scientific methods of Sydney Brenner using AI Agents
Brenner Bot is a research "seed crystal": a curated primary-source corpus (Sydney Brenner transcripts) plus multi-model syntheses, powering collaborative scientific research conversations that follow the "Brenner approach."
📊 End-to-End Test Report: Bio-Inspired Nanochat Session — A complete walkthrough demonstrating the Brenner method in action on a real research question about biological vs. synthetic nanoparticle communication.
This repository integrates with Agent Mail (coordination + memory + workflow glue) so multiple coding agents can collaborate as a research group:
- Claude Code running Opus 4.5
- Codex CLI running GPT‑5.2 (extra-high reasoning)
- Gemini CLI running Gemini 3
Critical constraint (non-negotiable): We do not call vendor AI APIs from code. Instead, we coordinate CLI tools via their subscription tiers (Claude Max / GPT Pro / Gemini Ultra) running in terminal sessions. Orchestration is message passing + compilation, not remote inference.
The agents run in parallel via ntm (Named Tmux Manager), coordinating through Agent Mail threads, producing structured deltas that get compiled into durable artifacts.
The system includes:
- A Next.js web app at brennerbot.org — human interface for corpus browsing + session viewing (not agent execution)
- A Bun CLI (`brenner`) — terminal-first workflows for power users
- A cockpit runtime — ntm-based multi-agent sessions with Agent Mail coordination
Deployed on Vercel with Cloudflare DNS at brennerbot.org.
- Why this repo is interesting
- The Core Insight: Why Brenner?
- What's here
- Quick Install
- What this is ultimately for
- How the system works
- How to use this repo
- Repository map
- The three distillations
- Working vocabulary
- The Operator Algebra
- The Implicit Bayesianism
- The Brenner Method: Ten Principles
- The Required Contradictions
- Why This Matters for AI-Assisted Research
- Provenance, attribution, and epistemic hygiene
- System Architecture
- Design Principles
- The Artifact Merge Algorithm
- The Linting System
- Performance Characteristics
- Testing Infrastructure
- Development Workflow
- Releases
- Research Artifact Lifecycle Management
- Scoring & Evaluation System
- Research Program Orchestration
- Experiment Capture & Encoding
- Cockpit Runtime
- Web Application Pages
- Specification Reference
- Storage & Schema Architecture
- JSON Output Mode
- Session Replay Infrastructure
- Coach Mode: Guided Learning System
- Domain-Aware Confound Detection
- Hypothesis Similarity Search
- Agent Debate Mode
- What-If Scenario Exploration
- Session State Machine
- Undo/Redo System
- Prediction Lock & Calibration
- The Operator Framework
- Offline Resilience
- Citation System
- Demo Mode for Public Website
- Server-Side Analytics
- API Security Architecture
- Parser Robustness
- Storage Performance Optimizations
The goal is to operationalize a scientific method and make it runnable as a collaboration protocol between AI agents and human researchers.
What you get:
- Primary sources with stable anchors: `complete_brenner_transcript.md` is the canonical text, organized into numbered sections (§n) so claims can be cited precisely.
- Verbatim primitive extraction: `quote_bank_restored_primitives.md` is a growing bank of high-signal verbatim quotes keyed by §n and intended to be tagged to operators/motifs.
- Three incompatible distillation styles: Opus 4.5, GPT‑5.2, and Gemini 3 saw the same transcript corpus and produced different “coordinate systems” for the Brenner method. Comparing them is itself a Brenner move: a representation change that reveals invariants and failure modes.
- Artifacts, not chat logs: Sessions produce lab-like outputs (hypothesis slates, discriminative tests, assumption ledgers, anomaly registers, adversarial critiques) that can be audited and iterated.
- Protocol + orchestration substrate: Agent Mail provides durable threads and coordination primitives; Bun provides a path to a single self-contained CLI binary; Beads provides a dependency-aware roadmap in-repo.
Sydney Brenner (1927–2019) was one of the most successful experimental biologists in history: co-discoverer of messenger RNA, architect of the genetic code experiments, founder of C. elegans as a model organism, and Nobel laureate. But his method is more valuable than any single discovery.
Brenner's "superpower" was repeatedly redesigning the world so that updates become easy. He changed organisms to change costs. He changed readouts to change likelihood sharpness. He changed question forms to turn mush into discrete constraints. He changed abstraction levels to avoid misspecified model classes.
This repository attempts to reverse-engineer that cognitive architecture and render it reusable for AI-assisted scientific research.
After extensive analysis, we distilled Brenner's approach to two fundamental commitments from which everything else derives:
Axiom 1: Reality Has a Generative Grammar
The world is not merely patterns and correlations. It is produced by causal machinery that operates according to discoverable rules. Biology is computation, not metaphorically, but literally.
Axiom 2: To Understand Is to Be Able to Reconstruct
You have not explained a phenomenon until you can specify, in principle, how to build it from primitives. Description is not understanding. Prediction is not understanding. Only reconstruction is understanding.
From these axioms flow all of Brenner's operational moves: finding the "machine language" of each system, separating program from interpreter, hunting forbidden patterns, choosing organisms strategically, and designing experiments with extreme likelihood ratios.
A taste of Brenner's voice (all from the transcripts):
"Exclusion is always a tremendously good thing in science."
"We proposed three models... 'You've forgotten there's a third alternative.' 'What's that?' 'Both could be wrong.'"
"I had invented something called HAL biology. HAL, that's H-A-L, it stood for Have A Look biology. I mean, what's the use of doing a lot of biochemistry when you can just see what happened?"
"The best thing in science is to work out of phase. That is, either half a wavelength ahead or half a wavelength behind. It doesn't matter. But if you're out of phase with the fashion you can do new things."
"One should not fall in love with one's theories. They should be treated as mistresses to be discarded once the pleasure is over."
"A proper simulation must be done in the machine language of the object being simulated... you need to be able to say: there are no more wires—we know all the wires."
"The choice of the experimental object remains one of the most important things to do in biology."
"I'm a great believer in the power of ignorance... when you know too much you're dangerous in the subject because you will deter originality."
"The best people to push a science forward are in fact those who come from outside it... the émigrés are always the best people to make the new discoveries."
This repository provides everything needed to run "Brenner-style" research workflows: the primary source corpus, multi-model syntheses, a searchable quote bank, and the tooling to orchestrate multi-agent research sessions.
- Corpus search + excerpt builder: Full-text search across the 236 transcript segments. Build cited excerpt blocks for session kickoffs with stable §n anchors.
- Multi-agent orchestration: Kick off Brenner Loop sessions with Claude, GPT, and Gemini via Agent Mail. Each model produces structured deltas (not essays) that get compiled into durable artifacts.
- Artifact compiler + linter: Parse agent responses, merge deterministically, and validate against 50+ Brenner-style rules (third alternative check, potency controls, citation anchors, provenance verification, scale constraints). Human-readable and JSON output formats.
- Web app (brennerbot.org): Browse the corpus, compose excerpts, start sessions, and review compiled artifacts.
- CLI (brenner): Terminal-first workflow for power users. Compiles to a single self-contained binary via `bun build --compile`.
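The artifact linter's real rules live in this repo; purely as an illustrative sketch (the function name, rule ID, and regexes below are hypothetical, not the actual implementation), a "third alternative" check might look like:

```typescript
// Hypothetical sketch of one Brenner-style lint rule: every hypothesis slate
// must include an explicit "both could be wrong" escape hatch.
interface LintFinding {
  rule: string;
  message: string;
}

function lintThirdAlternative(artifact: string): LintFinding[] {
  const findings: LintFinding[] = [];
  // Does the compiled markdown artifact contain a hypothesis slate section?
  const hasSlate = /hypothesis slate/i.test(artifact);
  // Does the slate name a misspecification escape hatch explicitly?
  const hasThirdAlternative =
    /third alternative|both could be wrong|misspecification/i.test(artifact);
  if (hasSlate && !hasThirdAlternative) {
    findings.push({
      rule: "third-alternative",
      message: "Hypothesis slate has no explicit 'both could be wrong' entry.",
    });
  }
  return findings;
}
```

The real linter applies 50+ rules of this shape (potency controls, citation anchors, scale constraints) and emits both human-readable and JSON output.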
```bash
curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/brenner_bot/main/install.sh?$(date +%s)" | bash
```

Options:

- `--easy-mode` — Minimal prompts, sensible defaults
- `--verify` — Verify checksum after download
- `--system` — Install to `/usr/local/bin` (requires sudo)

```bash
curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/brenner_bot/main/install.sh?$(date +%s)" | bash -s -- --easy-mode --verify
```

Windows (PowerShell):

```powershell
irm https://raw.githubusercontent.com/Dicklesworthstone/brenner_bot/main/install.ps1 | iex
```

From source:

```bash
git clone https://github.com/Dicklesworthstone/brenner_bot.git
cd brenner_bot
bun build --compile ./brenner.ts --outfile brenner
./brenner --help
```

After installation, here's the minimal workflow to run a Brenner session:
1. Verify installation:

```bash
./brenner doctor --skip-ntm --skip-cass --skip-cm
```

2. Search the corpus:

```bash
./brenner corpus search "reduction to one dimension"
```

3. Build an excerpt from transcript sections:

```bash
./brenner excerpt build --sections 58,78,161 > excerpt.md
```

4. Start a session (requires Agent Mail running):
```bash
# Start Agent Mail first: cd /path/to/mcp_agent_mail && bash scripts/run_server_with_token.sh
# Then start a session (all flags are required):
./brenner session start \
  --project-key "$PWD" \
  --sender GreenCastle \
  --to BlueLake \
  --thread-id RS-$(date +%Y%m%d)-test \
  --excerpt-file excerpt.md \
  --question "How do cells determine their position in a developing embryo?"
```

Note on agent names: Agent Mail requires adjective+noun combinations (e.g., GreenCastle, BlueLake, RedForest). If you use a different format, the system auto-assigns a random valid name.
Common issues:
- "Missing --question" → The `--question` flag is required for `session start`
- "Missing --sender" → Add `--sender GreenCastle` or set `AGENT_NAME=GreenCastle`
- "Agent Mail not available" → Ensure the Agent Mail server is running on localhost:8765
The project aims to operationalize Brenner's approach as a set of reusable collaboration patterns:
- How to pick problems (and when to walk away)
- How to formulate discriminative questions
- How to choose experiments/observations that collapse hypothesis space fast
- How to design “decision procedures” rather than accumulate “interesting data”
- How to reason with constraints, paradoxes, and representation changes
The idea is to turn those into prompt templates + structured research protocols that a multi-agent team can repeatedly execute (and audit).
Key insight: This is a CLI-based architecture. We do NOT call AI APIs. Instead, CLI tools (claude code, codex-cli, gemini-cli) run in terminal sessions via ntm (Named Tmux Manager), coordinating through Agent Mail. The web app is a human interface for browsing—not agent execution.
```mermaid
flowchart TB
    subgraph HUMAN[" 👤 HUMAN INTERFACES "]
        WEB["🌐 Web App"]
        BCLI["⌨️ brenner CLI"]
    end
    subgraph SOURCES[" 📚 PRIMARY SOURCES "]
        T["Transcripts · 236 sections"]
        Q["Quote Bank · primitives"]
    end
    subgraph KERNEL[" ⚙️ PROTOCOL KERNEL "]
        S["Schema + Guardrails"]
        P["Role Prompts"]
    end
    subgraph BUS[" 📬 COORDINATION BUS "]
        AM["Agent Mail · threads · acks"]
    end
    subgraph COCKPIT[" 🖥️ COCKPIT RUNTIME "]
        NTM["ntm · tmux sessions"]
        C["🟣 claude code"]
        G["🟢 codex-cli"]
        M["🔵 gemini-cli"]
        AC["🔧 Artifact Compiler"]
        NTM --> C & G & M --> AC
    end
    subgraph ARTIFACTS[" 📋 DURABLE ARTIFACTS "]
        H["Hypothesis Slates"]
        D["Discriminative Tests"]
        A["Assumption Ledgers"]
        X["Adversarial Critiques"]
    end
    subgraph MEMORY[" 🧠 MEMORY · optional "]
        CASS["cass · session search"]
        CM["cm · rules + patterns"]
    end
    HUMAN --> SOURCES & BUS
    SOURCES --> KERNEL --> BUS
    BUS <--> COCKPIT --> ARTIFACTS
    ARTIFACTS -.->|feedback| SOURCES
    MEMORY -.->|augment| KERNEL
    style HUMAN fill:#e8eaf6,stroke:#5c6bc0
    style SOURCES fill:#e3f2fd,stroke:#42a5f5
    style KERNEL fill:#e8f5e9,stroke:#66bb6a
    style BUS fill:#fff3e0,stroke:#ffa726
    style COCKPIT fill:#f3e5f5,stroke:#ab47bc
    style ARTIFACTS fill:#fff8e1,stroke:#ffca28
    style MEMORY fill:#eceff1,stroke:#78909c
```
Thread ID is the global join key that ties everything together:
- Agent Mail thread → where messages live
- ntm session name → where agents run
- Artifact file path → where outputs are persisted
- Beads issue ID → what work this relates to
Thread ID formats (see specs/thread_subject_conventions_v0.1.md):
- Engineering work: Use the bead ID directly (e.g., `brenner_bot-5so.3.4.2`)
- Research sessions: Use the `RS-{YYYYMMDD}-{slug}` format (e.g., `RS-20251230-cell-fate`)
Example mappings:
| Work type | Thread ID | ntm session | Artifact path |
|---|---|---|---|
| Engineering | `brenner_bot-5so.3.4.2` | `brenner_bot-5so.3.4.2` | `artifacts/brenner_bot-5so.3.4.2.md` |
| Research | `RS-20251230-cell-fate` | `RS-20251230-cell-fate` | `artifacts/RS-20251230-cell-fate.md` |
This means: given a thread ID, you can find the conversation, the tmux session, and the compiled artifacts without guessing.
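The mapping is deterministic, so it can be sketched in a few lines. The helper names below are illustrative only (not the repo's actual API); the point is that everything derives from the thread ID:

```typescript
// Illustrative join-key sketch: one thread ID determines the Agent Mail
// thread, the ntm session name, and the artifact path.
type WorkType = "research" | "engineering";

function classifyThreadId(threadId: string): WorkType {
  // Research sessions use RS-{YYYYMMDD}-{slug}; anything else is a bead ID.
  return /^RS-\d{8}-[a-z0-9-]+$/.test(threadId) ? "research" : "engineering";
}

function joinKeyFor(threadId: string) {
  return {
    workType: classifyThreadId(threadId),
    ntmSession: threadId,                     // ntm session name = thread ID
    artifactPath: `artifacts/${threadId}.md`, // compiled artifact location
  };
}
```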
Agent Mail is the coordination bus that makes "a research group of agents" viable:
- Durable threads with inbox/outbox per agent
- Acknowledgement tracking (who responded, what's pending)
- File reservations to avoid clobbering
- Persistent audit trail in git
Key insight: Agent Mail provides message passing, not inference. The agents (claude code, codex-cli, gemini-cli) run in terminal sessions and post their responses to Agent Mail threads. The artifact compiler then merges those responses.
See: Dicklesworthstone/mcp_agent_mail
The cockpit is where agents actually run. We recommend ntm (Named Tmux Manager):
- Spawn multiple agent panes in parallel
- Broadcast prompts to all agents at once
- Capture structured output from each
- Route responses to Agent Mail threads
This is humans-in-the-loop: operators manage the tmux sessions, review agent outputs, and decide when to compile artifacts. The web app and CLI are for viewing and composing—not for running agents.
Each research session produces artifacts that look like what a serious lab would create:
- Research thread: a single problem statement that stays stable
- Hypothesis slate: 2–5 candidate explanations, always including the “third alternative” (both wrong / misspecification)
- Predictions table: discriminative predictions per hypothesis (in chosen representation / machine language)
- Discriminative tests: ranked “decision experiments”, each stating which hypotheses it separates
- Potency checks: “chastity vs impotence” controls so negative results are interpretable
- Assumption ledger: load-bearing assumptions + at least one explicit scale/physics check
- Anomaly register: exceptions quarantined explicitly (or “none”)
- Adversarial critique: what would make the whole framing wrong? what’s the real third alternative?
- Understand the source material: `complete_brenner_transcript.md` (scan headings, then deep-read clusters)
- Understand the prompting intent: `initial_metaprompt.md`, `metaprompt_by_gpt_52.md`
- Compare syntheses across models: read batch 1 across GPT Pro / Opus / Gemini and diff what they emphasize
- Find specific Brenner moves: search the transcript for phrases like “Occam’s broom”, “Have A Look (HAL)”, “out of phase”, “choice of the experimental object”
- Pick a narrow theme (e.g., “discriminative experiments”, “problem choice”, “inversion”, “digital handles”).
- Pull quotes from `complete_brenner_transcript.md` (treat headings as anchors).
- Read the three model writeups on that theme (at least one batch per model).
- Write down the intersection:
- What appears in all syntheses and is strongly supported by quotes?
- What appears in one synthesis but isn’t supported by quotes?
- Generate a new synthesis with your own prompt variant and a fresh excerpt to test if the idea generalizes.
Why triangulation matters
If you only read an LLM synthesis, you tend to inherit its narrative biases. If you only read raw transcripts, you’ll drown in volume. Triangulation keeps you grounded while still compressing the search space.
```bash
cd apps/web
bun install
bun run dev
```

Key routes:

- `/corpus`: browse primary docs (read server-side from repo root)
- `/sessions/new`: compose a kickoff prompt and send it via Agent Mail (requires local Agent Mail + lab gating). Supports per-recipient role assignment via a dropdown UI and a "Default 3-Agent" quick-assign button.
The CLI is the terminal equivalent of the web “lab” flow. It is Bun-only and runs as:
- `./brenner.ts ...` (script)
- `bun build --compile --outfile brenner ./brenner.ts` (single executable)
To embed build metadata (so `brenner --version` works outside the git repo), set the `BRENNER_*` variables at build time and pass `--env=BRENNER_*`:
```bash
mkdir -p dist
BRENNER_VERSION="0.0.0-dev" \
BRENNER_GIT_SHA="$(git rev-parse HEAD)" \
BRENNER_BUILD_DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
BRENNER_TARGET="linux-x64" \
bun build --compile --minify --env=BRENNER_* \
  --target=bun-linux-x64-baseline --outfile dist/brenner ./brenner.ts
./dist/brenner --version
./dist/brenner doctor --json --skip-ntm --skip-cass --skip-cm
```

```bash
curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/brenner_bot/main/install.sh?$(date +%s)" \
  | bash -s -- --easy-mode --verify
```

For a safer, reproducible install, pin to a tag (avoid installing from main):
```bash
export VERSION="0.1.0"  # example
curl -fsSL "https://raw.githubusercontent.com/Dicklesworthstone/brenner_bot/v${VERSION}/install.sh?$(date +%s)" \
  | bash -s -- --version "${VERSION}" --easy-mode --verify
```

Health checks after install:

```bash
brenner doctor
ntm deps -v
cass health
cm onboard status
```

Troubleshooting + upgrades: specs/bootstrap_troubleshooting_v0.1.md
Status legend:
- ✅ Implemented now
- 🧭 Planned (tracked in Beads; don’t assume it exists yet)
| Command | Purpose | Status |
|---|---|---|
| `--version` / `version` | Print brenner version + build metadata | ✅ |
| `doctor [--json]` | Verify local toolchain health (for installers/CI) | ✅ |
| `upgrade [--version <ver>]` | Print canonical installer commands (re-run installer) | ✅ |
| `memory context "<task>"` | Fetch cass-memory context JSON (debug tool) | ✅ |
| `excerpt build [--sections <A,B>] [--tags <A,B>] ...` | Build a cited excerpt block (from transcript sections or quote-bank tags) | ✅ |
| `mail health` | Check Agent Mail readiness | ✅ |
| `mail tools` | List Agent Mail MCP tools | ✅ |
| `mail agents --project-key <abs-path>` | List known agents for a project | ✅ |
| `mail send --project-key <abs-path> ...` | Send a message to agents (optionally in a `--thread-id`) | ✅ |
| `prompt compose --template <path> --excerpt-file <path> ...` | Render a kickoff prompt (template + excerpt injection) | ✅ |
| `session start --project-key <abs-path> ...` | Compose + send a “kickoff” message via Agent Mail (alias: `orchestrate start`) | ✅ |
| `session status --thread-id <id> [--watch]` | Show per-role session status (and optionally wait until complete) | ✅ |
| `mail inbox` / `mail ack` / `mail thread` | Inbox + acknowledgement + thread tooling | ✅ |
| `session compile` / `session write` / `session publish` | Compile agent deltas into a canonical artifact, optionally write to disk, and publish back to thread | ✅ |
| `corpus search <query>` | Corpus search (ranked hits + anchors + snippets) | ✅ |
| `evidence init --thread-id <id>` | Create a new evidence pack for a session | ✅ |
| `evidence add --thread-id <id> ...` | Add an evidence record (paper, dataset, prior session, etc.) | ✅ |
| `evidence add-excerpt --thread-id <id> ...` | Add an excerpt to an evidence record | ✅ |
| `evidence list --thread-id <id>` | List evidence records in a pack | ✅ |
| `evidence render --thread-id <id>` | Render evidence pack to markdown | ✅ |
| `evidence post --thread-id <id> ...` | Post evidence summary to Agent Mail thread | ✅ |
| `evidence verify --thread-id <id> ...` | Mark an evidence record as verified | ✅ |
When the same setting is provided in multiple places, precedence is:
- Flags (per-command)
- Environment
- Config file
- Defaults
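This precedence chain amounts to a nullish-coalescing fold. A minimal sketch (hypothetical helper, not the CLI's actual internals):

```typescript
// Sketch of flags > environment > config file > defaults. The first value
// that is actually set wins; undefined means "not provided at this layer".
function resolveSetting(
  flag: string | undefined,    // per-command flag, e.g. --project-key
  env: string | undefined,     // environment variable, e.g. AGENT_MAIL_BASE_URL
  config: string | undefined,  // value from the JSON config file
  fallback: string,            // built-in default
): string {
  return flag ?? env ?? config ?? fallback;
}

// Example: resolving the Agent Mail base URL (values here are illustrative).
const baseUrl = resolveSetting(
  undefined,                          // no --base-url flag given
  process.env.AGENT_MAIL_BASE_URL,    // environment layer
  undefined,                          // no config file entry
  "http://127.0.0.1:8765",            // documented default
);
```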
Environment variables (current):

- `AGENT_MAIL_BASE_URL` (default `http://127.0.0.1:8765`)
- `AGENT_MAIL_PATH` (default `/mcp/`)
- `AGENT_MAIL_BEARER_TOKEN` (optional; required if Agent Mail auth is enabled)
- `AGENT_NAME` (optional default for `--sender`)

Config file (optional, JSON):

- Override path with `--config <path>` or `BRENNER_CONFIG_PATH=<path>`
- Default path (POSIX): `~/.config/brenner/config.json` (or `$XDG_CONFIG_HOME/brenner/config.json`)
- Default path (Windows): `%APPDATA%\brenner\config.json`
Example:

```json
{
  "agentMail": {
    "baseUrl": "http://127.0.0.1:8765",
    "path": "/mcp/",
    "bearerToken": "optional"
  },
  "defaults": {
    "projectKey": "/abs/path/to/your/repo",
    "template": "metaprompt_by_gpt_52.md"
  }
}
```

Required flags (today’s implementation):

- `mail agents`: `--project-key` optional (default: config `defaults.projectKey`, else `"$PWD"`)
- `mail send`: `--project-key` optional (default: config `defaults.projectKey`, else `"$PWD"`), `--sender` (or `AGENT_NAME`), `--to`, `--subject`, `--body-file`
- `prompt compose`: `--template` optional (default: config `defaults.template`, else `metaprompt_by_gpt_52.md`), `--excerpt-file`
- `session start`: `--project-key` optional (default: config `defaults.projectKey`, else `"$PWD"`), `--sender` (or `AGENT_NAME`), `--to`, `--thread-id`, `--excerpt-file`, `--question` (research question)
- `session status`: `--project-key` optional (default: config `defaults.projectKey`, else `"$PWD"`), `--thread-id` (use `--watch` to poll; `--timeout` optional)
```bash
./brenner.ts mail tools
./brenner.ts prompt compose --template metaprompt_by_gpt_52.md --excerpt-file excerpt.md

# Engineering work: use bead ID as thread-id
./brenner.ts session start --project-key "$PWD" --sender GreenCastle --to BlueMountain,RedForest \
  --thread-id brenner_bot-5so.3.4.2 --excerpt-file excerpt.md --question "What is the core problem?"

# Research session: use RS-{YYYYMMDD}-{slug} format
./brenner.ts session start --project-key "$PWD" --sender GreenCastle --to BlueMountain,RedForest \
  --thread-id RS-20251230-cell-fate --excerpt-file excerpt.md --question "How do cells determine position?"
```

This is the primary workflow for running Brenner Loop sessions with multiple agents:
Prerequisites:
- Agent Mail running locally (`cd mcp_agent_mail && bash scripts/run_server_with_token.sh`)
- ntm installed (Dicklesworthstone/ntm)
- CLI agents available: `claude` (Claude Max), `codex` (GPT Pro), `gemini` (Gemini Ultra)
Session workflow:
```bash
# 1. Pick a thread ID (this is your join-key)
export THREAD_ID="RS-20251230-cell-fate"

# 2. List available agents in this project
./brenner.ts mail agents --project-key "$PWD"
# Example output: BlueLake, PurpleMountain, GreenValley

# 3. Create an ntm session with agent panes
ntm new $THREAD_ID --layout=3-agent

# 4. Compose kickoff prompt with excerpt
./brenner.ts prompt compose \
  --template metaprompt_by_gpt_52.md \
  --excerpt-file excerpt.md \
  > kickoff.md

# 5. Send role-separated kickoff (recommended for multi-agent sessions)
# Use --role-map to assign roles to real Agent Mail identities:
./brenner.ts session start \
  --project-key "$PWD" \
  --thread-id $THREAD_ID \
  --sender GreenCastle \
  --to BlueLake,PurpleMountain,GreenValley \
  --role-map "BlueLake=hypothesis_generator,PurpleMountain=test_designer,GreenValley=adversarial_critic" \
  --excerpt-file excerpt.md \
  --question "How do cells determine their position in a developing embryo?"

# Alternative: unified mode (all agents get the same prompt)
./brenner.ts session start \
  --project-key "$PWD" \
  --thread-id $THREAD_ID \
  --sender GreenCastle \
  --to BlueLake,PurpleMountain,GreenValley \
  --unified \
  --excerpt-file excerpt.md \
  --question "How do cells determine their position in a developing embryo?"

# 6. Run agents in ntm panes (they post responses to Agent Mail)
ntm broadcast $THREAD_ID "Please check your Agent Mail inbox"

# 7. Compile and publish the artifact
./brenner.ts session compile --project-key "$PWD" --thread-id $THREAD_ID > artifact.md
./brenner.ts session publish --project-key "$PWD" --thread-id $THREAD_ID \
  --sender GreenCastle --to BlueLake,PurpleMountain,GreenValley
```

Roster roles (for `--role-map`):
| Role | Primary Model | Responsibility |
|---|---|---|
| `hypothesis_generator` | Codex/GPT | Hunt paradoxes, propose hypotheses (H1-H3) |
| `test_designer` | Claude/Opus | Design discriminative tests + potency controls |
| `adversarial_critic` | Gemini | Attack framing, check scale constraints |
Key insight: Agents run in your terminal (via ntm), not in the cloud. You manage the sessions, review outputs, and decide when to compile. This is humans-in-the-loop orchestration.
Evidence packs let you import external sources (papers, datasets, prior session results) into a Brenner Loop session with stable IDs that can be cited in artifacts. This enables research on topics beyond just the Brenner transcripts.
Why evidence packs?
- Avoid model-memory hallucination by importing auditable evidence
- Stable anchors (`EV-001`, `EV-001#E1`) for citation in artifacts
- Excerpt-first: store only the snippets you actually use
- Local-first: never ship copyrighted content to production
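The stable-anchor scheme can be sketched in a few lines. This is an illustration of the numbering described in specs/evidence_pack_v0.1.md, not the repo's actual implementation; the helper names are hypothetical:

```typescript
// Illustrative sketch of stable evidence IDs: records get EV-001, EV-002, ...
// and excerpts within a record get EV-001#E1, EV-001#E2, ...
function nextEvidenceId(existing: string[]): string {
  // Zero-padded, sequential: the Nth record is EV-00N.
  return `EV-${String(existing.length + 1).padStart(3, "0")}`;
}

function nextExcerptId(evidenceId: string, excerptCount: number): string {
  // Excerpts are numbered within their parent record.
  return `${evidenceId}#E${excerptCount + 1}`;
}
```

Because the IDs are append-only and never renumbered, artifacts can cite them indefinitely.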
Evidence pack workflow:
```bash
# 1. Pick a thread ID for your session
export THREAD_ID="RS-20251231-cell-fate"

# 2. Initialize an evidence pack
./brenner.ts evidence init --thread-id $THREAD_ID

# 3. Add a paper (auto-assigns EV-001)
./brenner.ts evidence add \
  --thread-id $THREAD_ID \
  --type paper \
  --title "Synaptic vesicle depletion dynamics" \
  --source "doi:10.1234/neuro.2024.001" \
  --relevance "Provides timescale data for H1" \
  --supports H1

# 4. Add a key excerpt from the paper (auto-assigns EV-001#E1)
./brenner.ts evidence add-excerpt \
  --thread-id $THREAD_ID \
  --evidence-id EV-001 \
  --text "Recovery time constant was 487 +/- 32 ms" \
  --verbatim \
  --location "p. 4, Results"

# 5. Add a dataset
./brenner.ts evidence add \
  --thread-id $THREAD_ID \
  --type dataset \
  --title "Synthetic repetition benchmark v2" \
  --source "file://benchmarks/synth_v2.json" \
  --relevance "Test stimuli for T5 potency check" \
  --informs T5

# 6. Add prior session results
./brenner.ts evidence add \
  --thread-id $THREAD_ID \
  --type prior_session \
  --title "Initial hypothesis exploration" \
  --source "session://RS-20251228-initial" \
  --relevance "H2 was killed; avoid re-investigating" \
  --refutes H2

# 7. Mark evidence as verified
./brenner.ts evidence verify \
  --thread-id $THREAD_ID \
  --evidence-id EV-001 \
  --notes "Peer-reviewed in Nature Neuroscience"

# 8. List and render the pack
./brenner.ts evidence list --thread-id $THREAD_ID --json
./brenner.ts evidence render --thread-id $THREAD_ID

# 9. Post evidence summary to the Agent Mail thread
./brenner.ts evidence post \
  --thread-id $THREAD_ID \
  --sender GreenCastle \
  --to BlueLake,PurpleMountain,GreenValley \
  --subject "Evidence pack for $THREAD_ID"
```

Evidence types:
| Type | Use case |
|---|---|
| `paper` | Published research paper |
| `preprint` | Unpublished manuscript |
| `dataset` | Benchmark data, corpus, test stimuli |
| `experiment` | Results from an experiment |
| `observation` | Empirical observation |
| `prior_session` | Results from another Brenner Loop session |
| `expert_opinion` | Human expert statement |
| `code_artifact` | Existing code as evidence |
Citing evidence in artifacts:
```markdown
**Anchors**: §58, EV-001#E1 [inference]
**Claim**: RRP depletion follows exponential decay (EV-001#E1, EV-002).
| P1 | RRP decay rate | ~500ms (EV-001#E2) | ~200ms | indeterminate |
```

File layout:

```
artifacts/
└── <thread_id>/
    ├── artifact.md     # Compiled artifact
    ├── evidence.json   # Evidence pack (structured)
    └── evidence.md     # Evidence pack (human-readable)
```
See: specs/evidence_pack_v0.1.md for the full specification.
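Artifacts mix two anchor kinds: transcript sections (`§58`) and evidence anchors (`EV-001`, `EV-001#E1`). Purely as an illustration (not the repo's parser), extracting both from a line of artifact markdown can be sketched as:

```typescript
// Illustrative anchor extractor for the two citation forms used in artifacts:
// transcript sections like §58, and evidence anchors like EV-001 or EV-001#E1.
function extractAnchors(line: string): string[] {
  const matches = line.match(/§\d+|EV-\d{3}(#E\d+)?/g);
  return matches ?? []; // no anchors → empty list, never null
}
```

A linter could use something like this to verify that every claim line carries at least one anchor.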
Bun can compile the CLI into one portable executable (the output is a single native binary that bundles your code + dependencies + the Bun runtime):
```bash
bun build --compile --outfile brenner ./brenner.ts
```

The CLI source does not need to be a single .ts file. Bun follows the import graph and bundles everything into one executable.
- `complete_brenner_transcript.md` — A single consolidated document containing 236 transcript segments (as stated in-file), organized into numbered sections with headings and quoted transcript text. Treat this as the canonical text you search/cite from.
- `initial_metaprompt.md` — The starter prompt used to elicit the “inner threads / symmetries / heuristics” analysis. Designed to be paired with transcript excerpts.
- `artifact_schema_v0.1.md` — Canonical markdown schema for session artifacts (7 required sections, stable IDs, validation rules).
- `artifact_delta_spec_v0.1.md` — Deterministic delta/merge rules for multi-agent updates (ADD/EDIT/KILL, conflict policy, ordering).
These are long-form writeups produced from transcript excerpts. They're useful as candidate lenses, not truth.
- `opus_45_responses/` (Claude Opus 4.5): coherent “mental architecture” narratives; strong at structural synthesis.
- `gpt_pro_extended_reasoning_responses/` (GPT‑5.2 Pro): explicit decision-theory / Bayesian framing; strong at operational rubrics.
- `gemini_3_deep_think_responses/` (Gemini 3): alternate clustering and computational metaphors; strong at reframing.
These are the final synthesis documents, triangulated across all three models and grounded in direct transcript quotes:
- `final_distillation_of_brenner_method_by_opus45.md` (Opus 4.5): “Two Axioms” framing + operator algebra + worksheet.
- `final_distillation_of_brenner_method_by_gpt_52_extra_high_reasoning.md` (GPT‑5.2 Pro): formal operators + experiment scoring rubric + guardrails.
- `final_distillation_of_brenner_method_by_gemini3.md` (Gemini 3): “Brenner Kernel” metaphor + instruction set + debugging protocols.
- `apps/web/` — Next.js App Router UI for browsing the corpus, composing excerpts, orchestrating sessions, and reviewing compiled artifacts. Mobile-first responsive design with optimized touch targets (44px minimum) and viewport handling. Deployed at brennerbot.org.
- `brenner.ts` — Bun CLI for corpus search, session orchestration, and artifact management. Compiles to a standalone portable executable via `bun build --compile --outfile brenner ./brenner.ts`. The resulting binary bundles the Bun runtime, all dependencies, and your code into a single executable that runs without installing Node/Bun separately.
- `.beads/` — repo-native issue tracking (dependencies, epics, and a roadmap graph). Use `bd` and `bv --robot-triage`.
All three distillation documents draw on the same 236 transcript segments, but each model compresses the material through a different lens. The result is a form of triangulation: three incompatible representations of the same method.
This divergence is itself informative. The concepts that survive translation across all three are likely "real" primitives; the disagreements reveal where representation choices are doing work (or where a model drifted into confabulation).
Each model brought a different abstraction style to the same raw material:
| Dimension | Opus 4.5 | GPT-5.2 Pro | Gemini 3 |
|---|---|---|---|
| Metaphor | Philosophy of science | Decision theory | Operating system |
| Core question | "What are the axioms?" | "What's the objective function?" | "How would I install this?" |
| Structure | Hierarchical derivation | Loop + rubric + guardrails | Kernel modules + drivers |
| Voice | Academic, systematic | Engineering, procedural | Hacker, irreverent |
| Output format | Theory of the method | Executable protocol | Instruction set |
Consider how each model handles the idea of choosing the right experimental system:
Opus frames it philosophically:
"A generative grammar is abstract. It can be implemented in different physical systems. This means you can choose your substrate strategically... He surveyed the entire animal kingdom, reading textbooks of zoology and botany."
GPT frames it operationally:
"⟂ Object transpose: Swap organism/system until the decisive experiment becomes cheap, fast, and unambiguous."
Gemini frames it as a system requirement:
"He didn't 'pick' C. elegans. He specified it like a hardware requisition... C. elegans was the unique solution to this system of linear inequalities. He treated the Tree of Life as a component library to be raided."
All three capture the same insight, but through different lenses: philosophical justification, operational instruction, and computational metaphor.
Concepts that appear in all three distillations with strong transcript grounding:
- Dimensional reduction: 3D → 1D as a core move
- Digital handles: Prefer yes/no over quantitative measurement
- Forbidden patterns: Exclusion beats accumulation
- Third alternative: "Both could be wrong"
- Productive ignorance: Fresh eyes as strategic asset
- Don't Worry hypothesis: Defer secondary mechanisms
- Seven-cycle log paper: Design for visible differences
- Organism choice: The experimental object as a design variable
- Opus only: "Gedanken organism" standard, explicit failure modes, conversation as distributed cognition
- GPT only: "Evidence per week" objective function, 0-3 scoring rubric, 12 guardrail rules
- Gemini only: GAN metaphor for Brenner-Crick, "Integer Biology" framing, "Monopoly Market of Ideas"
Primary file: final_distillation_of_brenner_method_by_opus45.md
- Abstraction style: Coherent mental architecture (axioms → derived moves → social technology → failure modes).
- Best at: A readable theory of the method; the "why" and the inner structure.
- Unique contributions: The "Two Axioms" framing; an operator algebra with compositions; an actionable worksheet; explicit failure modes.
- Watch-outs: Narrative coherence can feel stronger than the evidence; treat it as a map that requires §-anchored grounding.
Primary file: final_distillation_of_brenner_method_by_gpt_52_extra_high_reasoning.md
- Abstraction style: Operationalization-first (define primitives precisely; define a loop; define a scoring rubric).
- Best at: Making the method executable (scoring experiments, structuring artifacts, defining guardrails).
- Unique contributions: "Evidence per week" objective function; next-experiment scoring rubric (0-3); explicit protocol artifacts (slates, tests, ledgers); hygiene rules suitable for a linter.
- Watch-outs: The method can become over-formalized; treat the rubric as a decision aid, not a substitute for taste.
Primary file: final_distillation_of_brenner_method_by_gemini3.md
- Abstraction style: Computational metaphor + systems decomposition (root access, scheduler, drivers, debugging protocol).
- Best at: Reframing and memorability; "how would I implement this as an OS?" thinking useful for UI and orchestration design.
- Unique contributions: The Kernel / instruction-set framing; explicit "distributed cognition" motifs (Brenner-Crick as GAN); a debugging-oriented lens.
- Watch-outs: Metaphors can drift; keep the mapping anchored to verbatim primitives.
| Concept | Opus | GPT | Gemini |
|---|---|---|---|
| Foundation | Two Axioms | One sentence + objective function | Root Access (ontological stance) |
| Operators | Operator algebra + compositions | Operator basis + loop + rubric | Instruction set |
| Execution | Brenner Loop | 9-step loop + worksheet | Debug protocol + scheduler |
| Quality | Failure modes section | 12 guardrails | Error handling (Occam's Broom, etc.) |
| Social | Conversation as technology | Conversation as hypothesis search | Brenner-Crick GAN |
- Start with Opus for coherence and the "shape" of the method
- Use GPT to turn the shape into executable protocol (artifacts + scoring + guardrails)
- Use Gemini when you need reframing, alternate clustering, or systems metaphors for architecture
- Ground in transcripts: When any claim matters, walk back to `complete_brenner_transcript.md` and cite `§n` anchors
This repo defines a "Brenner approach" playbook. These terms are the vocabulary used in prompt templates and structured artifacts:
- Brenner move: a recurring reasoning pattern (e.g., hunt paradoxes, invert the problem, pick the experimental object).
- Decision experiment: an observation designed to eliminate whole families of explanations at once.
- Digital handle: a readout that is effectively yes/no (robust to noise, high leverage).
- Representation change: restating the problem in a domain where constraints are clearer (e.g., logic/topology vs chemistry).
- Assumption ledger: explicit list of load-bearing assumptions + tests that would break them.
- Third alternative: the "both models are wrong" option; systematic guard against false dichotomies.
- Abundance trick: Bypassing purification by choosing systems where the target dominates the signal (50-70% of synthesis).
- Dimensional reduction: Collapsing 3D physical problems into 1D informational problems (DNA reduces biology from spatial nightmare to algebra).
- Don't Worry hypothesis: Assume required mechanisms exist; proceed with theory development ("Don't worry about unwinding; assume an enzyme exists").
- Forbidden pattern: An observation that cannot occur if a hypothesis is true (e.g., adjacent amino acid pairs forbidden under overlapping code).
- Gedanken organism: The reconstruction standard; could you compute the animal from DNA sequences alone?
- Generative grammar: The production rules that generate phenomena (biology is computation).
- House of cards: Theory with interlocking mutual constraints; if N independent predictions each hold with probability p, all N holding together has probability p^N.
- Imprisoned imagination: Staying within physical/scale constraints ("DNA is 1mm long in a 1μm bacterium, folded 1000×").
- Machine language: The operational vocabulary the system actually uses (for development: cells, divisions, recognition proteins, not gradients or differential equations).
- Materialization: Translating theory to "what would I see if this were true?"
- Occam's broom: The junk swept under the carpet to keep a theory tidy (count this, not entities).
- Out of phase: Misaligned with (or deliberately avoiding) scientific fashion; "half a wavelength ahead or behind."
- Productive ignorance: Fresh eyes unconstrained by expert priors (experts have overly tight probability mass on known solutions).
- Seven-cycle log paper: Test for qualitative, visible differences ("hold at one end of room, stand at other; if you can see the difference, it's significant").
- Topological proof: Deducing structure from invariants rather than molecular details (the triplet code from frameshift algebra).
- Chastity vs impotence: Same outcome, fundamentally different reasons. A diagnostic for causal typing.
The distillations formalize Brenner's moves into a compact algebra of cognitive operators. These can be composed and applied systematically:
- ⊘ Level‑split: Separate program from interpreter; message from machine; “chastity vs impotence” control typing.
- 𝓛 Recode: Change representation / coordinates; reduce dimensionality; choose the machine language.
- ≡ Invariant‑extract: Find what survives; use physics/scale to kill impossible cartoons.
- ✂ Exclusion‑test: Derive forbidden patterns; design model-killing experiments.
- ⟂ Object‑transpose: Change organism/system until the decisive test becomes cheap.
- ↑ Amplify: Use selection, dominance, regime switches; get “across the room” differences.
- ⊕ Cross‑domain: Import tools/encodings; use pattern transfer to break monopolies.
- ◊ Paradox‑hunt: Use contradictions as beacons; start where the model can’t be true.
- ΔE Exception‑quarantine: Isolate anomalies explicitly without hiding them or nuking the coherent core.
- ∿ Dephase: Work out of phase with fashion; stay in the opening game.
- † Theory‑kill: Drop hypotheses aggressively when the world says no.
- ⌂ Materialize: Compile stories into a decision procedure (“what would I see?”).
- 🔧 DIY: Build what you need; don’t wait for infrastructure.
- ⊞ Scale‑check: Calculate; stay imprisoned in physics.
The signature "Brenner move" can be expressed as:
(⌂ ∘ ✂ ∘ ≡ ∘ ⊘) powered by (↑ ∘ ⟂ ∘ 🔧) seeded by (◊ ∘ ⊕) constrained by (⊞) kept honest by (ΔE ∘ †)
In English: Starting from a paradox noticed through cross-domain vision, split levels and reduce dimensions to extract invariants, then materialize as an exclusion test. Power this by amplification in a well-chosen system you can build yourself. Constrain by physical reality. Keep honest with exception handling and willingness to kill failing theories.
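As a sketch of what "composable operators" could mean in code, here is the core of the signature move expressed as function composition. The `State` shape and the operator bodies are hypothetical illustrations, not anything shipped in this repo:

```typescript
// Illustrative sketch only: Brenner operators as composable transforms over a
// problem state. The State shape and operator bodies are hypothetical.
type State = { representation: string; candidateModels: string[]; notes: string[] };
type Operator = (s: State) => State;

// compose(f, g, h) applies h first, then g, then f - matching the ∘ notation.
const compose = (...ops: Operator[]): Operator =>
  (s) => ops.reduceRight((acc, op) => op(acc), s);

const levelSplit: Operator = (s) =>       // ⊘ separate program from interpreter
  ({ ...s, notes: [...s.notes, "split: program vs interpreter"] });
const invariantExtract: Operator = (s) => // ≡ keep what survives re-representation
  ({ ...s, notes: [...s.notes, "extracted invariants"] });
const exclusionTest: Operator = (s) =>    // ✂ derive a forbidden pattern, kill a model family
  ({ ...s, candidateModels: s.candidateModels.slice(0, -1), notes: [...s.notes, "forbidden pattern derived"] });
const materialize: Operator = (s) =>      // ⌂ compile into "what would I see?"
  ({ ...s, notes: [...s.notes, "decision procedure written"] });

// The core of the signature move: ⌂ ∘ ✂ ∘ ≡ ∘ ⊘
const signatureMove = compose(materialize, exclusionTest, invariantExtract, levelSplit);

const result = signatureMove({
  representation: "3D chemistry",
  candidateModels: ["overlapping code", "non-overlapping code", "both wrong"],
  notes: [],
});
```

Reading the composition right-to-left matches the algebra: `levelSplit` runs first, `materialize` last.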
Brenner never used formal probability, but his reasoning maps precisely onto Bayesian concepts:
| Brenner Move | Bayesian Operation |
|---|---|
| Enumerate 3+ models before experimenting | Maintain explicit prior distribution |
| Hunt paradoxes | Find high-probability contradictions in posterior |
| "Third alternative: both wrong" | Reserve probability mass for misspecification |
| Design forbidden patterns | Maximize expected information gain (KL divergence) |
| Seven-cycle log paper | Choose experiments with extreme likelihood ratios |
| Choose organism for decisive test | Modify data-generating process to separate likelihoods |
| "House of cards" theories | Interlocking constraints (posterior ≈ product of likelihoods) |
| Exception quarantine | Model anomalies as typed mixture components |
| "Don't Worry" hypothesis | Marginalize over latent mechanisms (explicitly labeled) |
| Kill theories early | Update aggressively; avoid sunk-cost fallacy |
| Scale/physics constraints | Use strong physical priors to prune before experimenting |
| Productive ignorance | Avoid over-tight priors that collapse hypothesis space |
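The table above can be made concrete with two standard identities. In Bayesian terms, a "seven-cycle log paper" experiment is one with an extreme likelihood ratio, and a "forbidden pattern" is the limiting case where one likelihood is zero:

```latex
% Posterior odds = likelihood ratio × prior odds (Bayes' rule in odds form):
\frac{P(H_1 \mid x)}{P(H_2 \mid x)}
  = \frac{P(x \mid H_1)}{P(x \mid H_2)} \cdot \frac{P(H_1)}{P(H_2)}

% Expected information gain of experiment E: the expected KL divergence
% from prior to posterior over hypotheses H.
\mathrm{EIG}(E) = \mathbb{E}_{x \sim P(\cdot \mid E)}
  \left[ D_{\mathrm{KL}}\big( P(H \mid x, E) \,\|\, P(H) \big) \right]

% A forbidden pattern is the limit P(x \mid H) = 0: observing x drives
% P(H \mid x) to zero in a single update.
```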
The objective function Brenner was implicitly maximizing:
Expected Information Gain × Downstream Leverage
Score(E) = ─────────────────────────────────────────────────────────
Time × Cost × Ambiguity × Infrastructure-Dependence
His genius was in making all the denominator terms small (DIY, clever design, digital handles) while keeping the numerator large (exclusion tests, paradox resolution). He did this by changing the problem rather than brute-forcing the experiment.
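A minimal sketch of that informal objective, with invented field names and unit scales (nothing here is repo API):

```typescript
// Hypothetical sketch of the informal objective above. The field names and
// scales are invented for illustration.
interface ExperimentPlan {
  expectedInfoGain: number;   // e.g. how many model families the result can exclude
  downstreamLeverage: number; // how much future work a decisive answer unblocks
  timeWeeks: number;
  cost: number;               // arbitrary units
  ambiguity: number;          // 1 = digital handle, larger = statistical mush
  infraDependence: number;    // 1 = DIY bench rig, larger = big shared facility
}

const score = (e: ExperimentPlan): number =>
  (e.expectedInfoGain * e.downstreamLeverage) /
  (e.timeWeeks * e.cost * e.ambiguity * e.infraDependence);

// Brenner's move: keep the numerator, shrink every denominator term.
const bruteForce: ExperimentPlan = {
  expectedInfoGain: 4, downstreamLeverage: 3,
  timeWeeks: 52, cost: 10, ambiguity: 3, infraDependence: 4,
};
const redesigned: ExperimentPlan = {
  expectedInfoGain: 4, downstreamLeverage: 3,
  timeWeeks: 2, cost: 1, ambiguity: 1, infraDependence: 1,
};
```

Same numerator, roughly 3,000× the score: the leverage comes from redesigning the experiment, not from working harder.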
A compressed summary of the method, suitable for quick reference:
- Enter problems as an outsider: Embrace productive ignorance; émigrés make the best discoveries
- Reduce dimensionality: Find the representation that transforms the problem into algebra
- Go digital: Choose systems with qualitative differences; avoid statistics where possible
- Defer secondary problems: "Don't Worry" about mechanisms you can't yet see; assume they exist
- Materialize immediately: Ask "what experiment would test this?" before theorizing further
- Build what you need: Crude apparatus that works beats elegant apparatus you're waiting for
- Think out loud: Ideas are 50% wrong the first time; conversation is a thinking technology
- Stay imprisoned in physics: Calculate scale; respect mechanism; filter impossible cartoons
- Distinguish information from implementation: Separate the program from the interpreter (von Neumann's insight)
- Play with words and inversions: Puns and inversions train mental flexibility ("what if the obvious interpretation is wrong?")
Brenner was explicit that science demands contradictory traits held in tension: oscillations, not a single personality setting.
- Imagination ↔ Focus
- Passion ↔ Ruthlessness
- Ignorance ↔ Learning
- Attachment ↔ Detachment
- Conversation ↔ Solitude
- Theory ↔ Experiment
"There are brilliant people that can never accomplish anything. And there are people that have no ideas but do things. And if only one could chimerise them—join them into one person—one would have a good scientist."
The method requires oscillating between these modes, not choosing one.
Large language models are powerful pattern-matchers, but they lack the meta-cognitive architecture that made Brenner effective:
- They don't spontaneously ask "what organism would make this test easy?"
- They don't naturally hunt for forbidden patterns
- They don't instinctively separate program from interpreter
- They don't automatically calculate scale constraints
- They don't maintain assumption ledgers or exception quarantines
By encoding Brenner's operators, vocabulary, and protocols as explicit prompts and workflows, we can scaffold this meta-cognition onto LLMs. The goal is not to make LLMs "think like Brenner" (they can't), but to make them follow Brenner-style protocols that a human researcher can audit and steer.
Different models have different strengths:
- Claude (Opus): Strong at coherent narrative synthesis, maintaining context, and identifying structural relationships
- GPT-5.2 Pro: Strong at formal reasoning, decision-theoretic framing, and explicit calculation
- Gemini 3: Strong at alternative clustering, novelty search, and computational metaphors
By having these models collaborate via Agent Mail using shared Brenner protocols, we get triangulation at the workflow level. This reduces the risk that any single model's biases dominate the research direction.
- Transcript source: `complete_brenner_transcript.md` states it is “a collection of 236 video transcripts from Web of Stories.” If you publish derived work, verify applicable rights/terms and attribute appropriately.
- Treat syntheses as hypotheses: the model writeups can be brilliant and wrong.
- Prefer quotes over vibes: if a claim matters, ground it in the transcripts.
- Separate “Brenner said” from “we infer”: label interpretation explicitly.
The system is organized into eight components:
The ground truth that everything references:
- `complete_brenner_transcript.md`: 236 transcript segments with stable `§n` anchors
- Quote bank: curated primitives tagged by operator/motif
- Transcript parser with structured index
The Brenner method encoded as executable primitives:
- Artifact schema (`artifact_schema_v0.1.md`): 7 required sections, stable IDs, validation rules
- Delta spec (`artifact_delta_spec_v0.1.md`): ADD/EDIT/KILL operations, merge rules, conflict policy
- Operator library: definitions, triggers, failure modes, anchored quotes
- Role prompts: Claude/GPT/Gemini-specific templates that output structured deltas
- Guardrails + linter: 50+ validation rules covering structural integrity, hypothesis hygiene, third-alternative requirements, potency controls, citation anchors (`§n`), provenance verification, and scale constraints. Outputs in human-readable text or machine-parseable JSON.
The message-passing substrate for multi-agent work:
- Thread protocol contract (kickoff, delta response, compiled artifact, critique, admin notes)
- Acknowledgement tracking (who responded, what's pending)
- File reservations to prevent clobbering
- Persistent audit trail in git
Key constraint: Agent Mail does coordination, NOT inference. No AI APIs are called.
Where agents actually run—not in the web app:
- ntm (Named Tmux Manager): parallel tmux panes, prompt broadcast, output capture
- CLI agents: claude code (Claude Max), codex-cli (GPT Pro), gemini-cli (Gemini Ultra)
- Artifact compiler: parse structured deltas → merge → lint → render canonical markdown
- Join-key contract: thread_id ↔ ntm session ↔ artifact path ↔ beads ID
This is humans-in-the-loop: operators run agents in terminal sessions, review outputs, trigger compilation.
Human interface for browsing and viewing—not agent execution:
- Public mode: corpus reader, distillations, method docs (no orchestration side-effects)
- Lab mode (gated): session viewer, artifact panel, kickoff composer
- Cloudflare Access + app-layer gating for protected actions
Terminal-first workflow for power users:
- Command surface: `brenner corpus`, `brenner session`, `brenner mail`
- Inbox/thread tooling for Agent Mail
- Session compose/send/fetch/compile/publish
- Single self-contained binary via `bun build --compile`
Context augmentation via cass-memory (local-first, no AI APIs):
- cass: episodic search across prior agent sessions
- cm (cass-memory): procedural rules + anti-patterns with confidence/decay
- `cm context --json` to augment kickoffs with relevant prior work
- Feedback loop from session artifacts back to durable memory
cass indexes local CLI-agent session logs (Codex CLI / Claude Code / Gemini CLI) on your machine so you can search for prior work by keyword, thread ID, or file paths.
What gets indexed (default connectors; run cass diag --json to confirm on your machine):
- Codex CLI sessions: `~/.codex/sessions`
- Claude Code sessions: `~/.claude/projects`
- Gemini CLI sessions: `~/.gemini/tmp`

What does not get indexed by default:

- Agent Mail’s git mailbox archive (use Agent Mail search tools or `rg` on the mailbox repo instead)
One-time setup (build the index):

cass index --full

Keep the index current:

# Option A: run continuously in a background terminal
cass index --watch

# Option B: re-run periodically
cass index

Search examples:
# Recommended: search by the join-key thread id (make sure your prompts include THREAD_ID)
cass search "$THREAD_ID" --workspace "$PWD" --robot --limit 10
# Search by keyword (optionally time-box it)
cass search "forbidden pattern" --workspace "$PWD" --week --robot --limit 10
# Quick health check (if stale, run cass index)
cass status --json

Production infrastructure:
- Vercel deployment for `apps/web`
- Cloudflare DNS for `brennerbot.org`
- Cloudflare Access for lab mode protection
- Content policy enforcement (public doc allowlist vs gated content)
The system embodies several deliberate architectural choices that prioritize auditability, correctness, and practical multi-agent coordination.
Most AI orchestration systems expose HTTP APIs and expect you to call vendor inference endpoints from your code. Brenner Bot inverts this:
What we do:
- CLI tools (`claude`, `codex`, `gemini`) run in terminal sessions under your subscription
- Coordination happens via message passing (Agent Mail), not remote inference
- The web app is for viewing artifacts and composing prompts—not executing agents
Why this matters:
- No API keys in code: You authenticate via your CLI tool's existing auth (Claude Max, GPT Pro, Gemini Ultra)
- No rate limits to manage: Your subscription handles throttling
- Full session context: CLI agents maintain conversation state naturally
- Human-in-the-loop by default: You see what agents are doing in real-time (tmux panes)
When multiple agents produce structured deltas concurrently, the artifact compiler must merge them deterministically. Two runs with the same inputs must produce identical outputs.
The merge algorithm:
- Timestamp ordering: Deltas are sorted by creation timestamp (Agent Mail provides this)
- Section-wise application: Each delta specifies target sections (hypothesis_slate, tests, etc.)
- Operation semantics:
  - `ADD`: Append a new item with an auto-generated ID (H4, T7, etc.)
  - `EDIT`: Modify an existing item by ID (must exist, must not be killed)
  - `KILL`: Mark an item as killed with a reason (idempotent; killing twice is a no-op)
- Conflict resolution: Last-write-wins within the same item ID; conflicts are logged as warnings
- Post-merge validation: The linter runs after merge to catch constraint violations
Invariants guaranteed:
- Merge order is stable (same inputs → same output, regardless of filesystem ordering)
- Killed items are preserved in history (audit trail), not deleted
- ID sequences never regress (if H3 exists, you can't add H2 later)
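The merge semantics above can be sketched in a few lines of TypeScript. This is a simplified illustration, not the repo's actual compiler: real deltas also carry sections, authors, and schemas, and the linter runs after the merge.

```typescript
// Simplified sketch of the delta-merge semantics: timestamp ordering,
// ADD/EDIT/KILL, last-write-wins, killed items preserved. Types are
// deliberately minimal; the real compiler validates much more.
type Delta =
  | { op: "ADD"; ts: number; payload: Record<string, unknown> }
  | { op: "EDIT"; ts: number; targetId: string; payload: Record<string, unknown> }
  | { op: "KILL"; ts: number; targetId: string; reason: string };

interface Item { id: string; killed?: string; [key: string]: unknown }

function mergeDeltas(items: Item[], deltas: Delta[], prefix = "H"): Item[] {
  // 1. Timestamp ordering makes the merge deterministic.
  const sorted = [...deltas].sort((a, b) => a.ts - b.ts);
  const byId = new Map<string, Item>(items.map((i): [string, Item] => [i.id, i]));
  let next = items.length + 1; // ID sequences never regress
  for (const d of sorted) {
    if (d.op === "ADD") {
      const id = `${prefix}${next++}`;
      byId.set(id, { ...d.payload, id }); // auto-assign next ID
    } else if (d.op === "EDIT") {
      const item = byId.get(d.targetId);
      if (item && !item.killed) Object.assign(item, d.payload); // last-write-wins
    } else { // KILL
      const item = byId.get(d.targetId);
      if (item) item.killed ??= d.reason; // idempotent; item preserved, not deleted
    }
  }
  return [...byId.values()];
}
```

Because deltas are sorted before application, two runs over the same inputs produce identical output regardless of the order the files arrived in.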
The orchestration layer (session kickoffs, Agent Mail integration) is protected by a fail-closed security model:
The gating logic:
- Lab mode check: `BRENNER_LAB_MODE=1` must be explicitly set
- Authentication: Either Cloudflare Access headers (when trusted) OR a valid lab secret
- Timing-safe comparison: Secret comparison uses constant-time algorithms to prevent timing attacks
- 404 on failure: Unauthorized requests get a generic 404, not a 401/403 (no information leakage)
What this means in practice:
- Default deployment is read-only (corpus browsing works, orchestration doesn't)
- Lab features require explicit opt-in at both infrastructure and application layers
- Secret validation resists timing attacks even in the Edge runtime (where `crypto.timingSafeEqual` isn't available)
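The standard pattern for a constant-time comparison in runtimes without `crypto.timingSafeEqual` looks roughly like this. This is a sketch of the general technique, not the repo's actual implementation:

```typescript
// Sketch of the standard constant-time comparison pattern for runtimes where
// crypto.timingSafeEqual is unavailable (e.g. Edge). Not the repo's code.
function timingSafeEqualStr(a: string, b: string): boolean {
  const enc = new TextEncoder();
  const ab = enc.encode(a);
  const bb = enc.encode(b);
  // Always scan the longer length and fold the length difference into `diff`
  // instead of returning early, so timing does not leak the mismatch position.
  const len = Math.max(ab.length, bb.length);
  let diff = ab.length ^ bb.length;
  for (let i = 0; i < len; i++) {
    diff |= (ab[i] ?? 0) ^ (bb[i] ?? 0);
  }
  return diff === 0;
}
```

The XOR accumulator means every call does the same amount of work whether the secrets match at byte 0 or byte 30.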
The test suite (2,700+ tests) follows a strict philosophy: test real implementations with real data fixtures, not mocked abstractions.
What we mock (infrastructure only):
- `next/headers` and `next/cookies` (Next.js request context doesn't exist outside the request lifecycle)
- `framer-motion` (animation timing is non-deterministic in tests)
- External network calls (for offline test reliability)
What we don't mock:
- Business logic (artifact merge, delta parsing, linting)
- Storage layer (real file system operations with temp directories)
- Agent Mail client (uses real test server with in-memory state)
Why this matters:
- Tests catch real integration bugs, not mock configuration errors
- Refactoring doesn't require updating mock implementations
- Test failures correspond to actual production issues
Coverage thresholds:
- Overall: 80% lines, 75% branches (enforced in CI)
- Critical modules (artifact-merge, delta-parser): 85%+ branches
- New code: must not regress coverage
The artifact compiler is the core of the "structured output" philosophy. Instead of free-form responses, agents produce deltas that specify operations on a shared artifact. The compiler merges these deterministically.
Each agent response contains a structured delta block:
:::delta
{
"operation": "ADD",
"section": "hypothesis_slate",
"target_id": null,
"payload": {
"statement": "The observed depletion follows exponential decay",
"anchors": ["§58", "EV-001#E1"]
},
"rationale": "Addressing paradox from excerpt"
}
:::
:::delta
{
"operation": "EDIT",
"section": "hypothesis_slate",
"target_id": "H2",
"payload": {
"confidence": "high"
},
"rationale": "Updated based on test results"
}
:::
:::delta
{
"operation": "KILL",
"section": "hypothesis_slate",
"target_id": "H1",
"payload": {
"reason": "Contradicted by EV-002#E3"
},
"rationale": "Test T1 ruled out this hypothesis"
}
:::

ADD operations:
- Auto-assign next ID in sequence (H1 → H2 → H3)
- Validate required fields per section schema
- Check for duplicate content (warning if semantically similar to existing item)
EDIT operations:
- Target item must exist and not be killed
- Merge payload fields (deep merge for nested objects)
- Preserve fields not mentioned in payload
- Track edit history with timestamp and author
KILL operations:
- Mark item as killed (not deleted)
- Reason is required and preserved
- Killed items appear in artifact with strikethrough and reason
- Killing an already-killed item is idempotent (no error)
When multiple agents edit the same item concurrently:
- Operations are ordered by timestamp
- Last-write-wins for conflicting fields
- Non-conflicting fields are merged
- Conflict is logged with both values for audit
Example:
- Agent A edits H2.predictions.T1 = "< 500ms" at t=100
- Agent B edits H2.predictions.T1 = "< 600ms" at t=101
- Agent B edits H2.predictions.T2 = "> 1000ms" at t=101
Result: H2.predictions = { T1: "< 600ms" (B wins), T2: "> 1000ms" (no conflict) }
The artifact linter enforces 50+ validation rules that encode Brenner-style research hygiene. These are not style preferences—they're constraints designed to catch common failure modes in hypothesis-driven research.
Structural integrity:
- All 7 required sections present (research_thread, hypothesis_slate, predictions_table, discriminative_tests, assumption_ledger, anomaly_register, adversarial_critique)
- IDs follow correct format (H1-H9, T1-T99, A1-A99, X1-X99)
- Cross-references resolve (predictions reference valid hypothesis IDs)
Hypothesis hygiene:
- Minimum 2 hypotheses (avoids false dichotomy)
- Maximum 5 hypotheses (forces prioritization)
- Third alternative slot present (explicit "both wrong" option)
- No duplicate hypothesis content
Test design:
- Each test specifies which hypotheses it discriminates
- Potency controls present (chastity vs impotence check)
- At least one test per active hypothesis
- Tests reference valid anchors (§n or EV-xxx)
Assumption tracking:
- Load-bearing assumptions explicitly listed
- At least one scale/physics constraint check
- Assumptions linked to hypotheses they support
Citation hygiene:
- All claims have anchors (§n for transcripts, EV-xxx for evidence)
- Anchors resolve to valid sources
- Inference vs verbatim explicitly marked
Anomaly handling:
- Anomalies quarantined with explicit status
- Promoted anomalies have resolution plan
- Dismissed anomalies have reason
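A few of the hypothesis-hygiene rules above could be expressed as pure checks like the following sketch. The `Violation` shape mirrors the linter's JSON output; the specific rule IDs and field names here are illustrative assumptions, not the repo's exact schema:

```typescript
// Illustrative sketch of hypothesis-hygiene checks as pure functions.
// Rule IDs and field names are assumptions for illustration.
interface Hypothesis { id: string; statement: string; third_alternative?: boolean }
interface Violation { id: string; severity: "error" | "warning" | "info"; message: string }

function lintHypothesisSlate(slate: Hypothesis[]): Violation[] {
  const violations: Violation[] = [];
  if (slate.length < 2)
    violations.push({ id: "EH-001", severity: "error", message: "Fewer than 2 hypotheses (false-dichotomy risk)" });
  if (slate.length > 5)
    violations.push({ id: "EH-002", severity: "error", message: "More than 5 hypotheses (prioritization required)" });
  if (!slate.some((h) => h.third_alternative))
    violations.push({ id: "EH-003", severity: "error", message: "Third alternative not explicitly labeled" });
  return violations;
}
```

Because each rule is a pure function over the artifact, the same checks can run in CI, in the CLI, and in the web UI without divergence.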
The linter outputs structured violations:
{
"artifact": "RS-20251230-example",
"valid": false,
"summary": {
"errors": 1,
"warnings": 1,
"info": 0
},
"violations": [
{
"id": "EH-003",
"severity": "error",
"message": "Third alternative not explicitly labeled",
"fix": "Add 'third_alternative: true' to one hypothesis"
},
{
"id": "WP-001",
"severity": "warning",
"message": "P4 does not discriminate (all hypothesis outcomes identical or missing)",
"fix": "Adjust prediction so at least two hypotheses differ in expected outcome"
}
]
}

- error: Artifact is structurally invalid; must fix before publishing
- warning: Potential issue that should be reviewed; may publish with acknowledgment
- info: Suggestion for improvement; purely advisory
The system is designed for responsive local development and efficient CI runs.
The `brenner` binary starts in < 50ms (measured on an M2 MacBook Air):
- Bun's compiled binaries bundle the runtime
- No JIT warmup required
- Lazy loading for heavy modules (Agent Mail client only initialized when needed)
Full test suite (2,700+ tests) completes in < 30 seconds on CI:
- Vitest's parallel execution across CPU cores
- In-memory Agent Mail test server (no real network I/O)
- Shared test fixtures loaded once per file
Typical session compilation (3 agents, ~50 operations each):
- Parse deltas: < 10ms
- Merge operations: < 5ms
- Lint validation: < 20ms
- Render markdown: < 5ms
Total: < 50ms for a complete compile-lint-render cycle.
- Corpus pages: Static generation at build time (instant load)
- Session pages: Server components with streaming (Time to First Byte < 100ms)
- Search: Client-side with debounced queries (< 50ms perceived latency)
The test suite is structured for comprehensive coverage with fast feedback loops.
apps/web/
├── src/
│ ├── lib/
│ │ ├── artifact-merge.ts # 2,800 lines of merge logic
│ │ ├── artifact-merge.test.ts # 108 tests, 85%+ branch coverage
│ │ ├── delta-parser.ts # 438 lines of parsing
│ │ └── delta-parser.test.ts # 19 tests
│ ├── components/
│ │ └── *.test.tsx # Component tests with Testing Library
│ └── app/
│ └── api/
│ └── */route.test.ts # API route tests
├── e2e/
│ ├── *.spec.ts # Playwright E2E tests
│ └── utils/
│ ├── agent-mail-test-server.ts # In-memory Agent Mail for E2E
│ └── test-fixtures.ts # Shared setup utilities
└── src/test-utils/
├── index.ts # Test utility exports
├── logging.ts # Structured test logging
├── fixtures.ts # Data fixtures
├── assertions.ts # Custom assertion helpers
└── agent-mail-test-server.ts # Unit test Agent Mail server
For testing Agent Mail integration without a real server:
import { AgentMailTestServer } from "@/test-utils";
let server: AgentMailTestServer;
beforeAll(async () => {
server = new AgentMailTestServer();
await server.start(18765);
process.env.AGENT_MAIL_BASE_URL = `http://localhost:18765`;
});
afterAll(async () => {
await server.stop();
});
beforeEach(() => {
server.reset(); // Clear state between tests
});
it("seeds a thread for testing", () => {
// seedThread creates projects and agents as needed
server.seedThread({
projectKey: "/test/project",
threadId: "TEST-001",
messages: [
{ from: "TestAgent", to: ["Recipient"], subject: "Test", body_md: "Hello" },
],
});
// Inspection methods for assertions
const messages = server.getMessagesTo("Recipient");
expect(messages).toHaveLength(1);
expect(server.getMessagesInThread("TEST-001")).toHaveLength(1);
});

E2E tests run against the real web app with visual regression support:
import { test, expect } from "./utils/test-fixtures";
test("corpus search returns results", async ({ page }) => {
await page.goto("/corpus");
await page.fill('[data-testid="search-input"]', "exclusion");
await page.waitForSelector('[data-testid="search-results"]');
const results = await page.locator('[data-testid="search-result"]').count();
expect(results).toBeGreaterThan(0);
});
test("session page loads", async ({ page, testSession }) => {
// testSession.seed() creates a session with Agent Mail test server
const threadId = `E2E-TEST-${Date.now()}`;
await testSession.seed({
threadId,
messages: [{
from: "operator",
subject: "Research Session",
body: "Kickoff content here",
type: "KICKOFF",
}],
});
await page.goto(`/sessions/${threadId}`);
await expect(page.locator("body")).toContainText(threadId);
});

# Unit tests (fast, run frequently)
cd apps/web
bun run test
# With coverage
bun run test:coverage
# Watch mode for development
bun run test:watch
# E2E tests (slower, run before commit)
bun run test:e2e
# E2E with UI (for debugging)
bun run test:e2e:ui
# Full CI suite
bun run test:coverage && bun run test:e2e

- Bun 1.1.38+ (`curl -fsSL https://bun.sh/install | bash`)
- Node.js 20+ (for some tooling compatibility)
- Git with SSH configured for GitHub
# Clone the repository
git clone git@github.com:Dicklesworthstone/brenner_bot.git
cd brenner_bot
# Install dependencies
cd apps/web && bun install && cd ../..
# Verify toolchain
./brenner.ts doctor --skip-ntm --skip-cass --skip-cm
# Run tests to confirm everything works
cd apps/web && bun run test

cd apps/web
bun run dev
# Open http://localhost:3000

# Create a feature branch
git checkout -b feature/your-feature
# Make changes, run tests frequently
bun run test:watch
# Before committing, run full suite
bun run test:coverage
bun run lint
# Commit with conventional format
git commit -m "feat(component): add new feature"

- TypeScript: Strict mode, no `any` in production code (allowed in tests for fixtures)
- Formatting: Prettier with default config
- Linting: ESLint + oxlint for fast local checks
- Imports: Absolute paths via the `@/` alias
type(scope): description
feat: New feature
fix: Bug fix
docs: Documentation only
test: Test changes
refactor: Code change that neither fixes a bug nor adds a feature
chore: Build process or auxiliary tool changes
Releases are automated via GitHub Actions. Pushing a version tag triggers the release workflow.
# Update version in package.json (if applicable)
# Then tag and push
git tag v0.1.0
git push origin v0.1.0

- Checkout repository with full history
- Install dependencies
- Build binaries for all platforms:
  - `brenner-linux-x64` (baseline for older CPUs)
  - `brenner-linux-arm64`
  - `brenner-darwin-arm64` (Apple Silicon)
  - `brenner-darwin-x64` (Intel Mac)
  - `brenner-win-x64.exe` (Windows)
- Generate SHA256 checksums for each binary
- Publish GitHub Release with auto-generated notes
- Upload all binaries and checksums as release assets
To test the release workflow without publishing:
- Go to Actions → Release → Run workflow
- Select the branch to build from
- Artifacts are uploaded but not published as a release
The CLI embeds build metadata at compile time:
./brenner --version
# brenner v0.1.0 (abc1234, 2025-01-02T12:00:00Z, linux-x64)

This is set via environment variables during build:

- `BRENNER_VERSION`: Semantic version from tag
- `BRENNER_GIT_SHA`: Full commit SHA
- `BRENNER_BUILD_DATE`: ISO 8601 timestamp
- `BRENNER_TARGET`: Platform identifier
The CLI provides comprehensive commands for managing research artifacts as first-class entities. Each artifact type follows a defined lifecycle with state transitions that enforce the Brenner method's epistemic hygiene.
Hypotheses are the core currency of scientific inquiry. The CLI tracks them through a rigorous lifecycle:
States: proposed → active → under_attack / assumption_undermined / refined → killed / validated / dormant
# List all hypotheses in a session
brenner hypothesis list --session-id RS20251230
# Show detailed hypothesis information
brenner hypothesis show H-RS20251230-001
# Create a new hypothesis
brenner hypothesis create \
--session-id RS20251230 \
--statement "Positional information is encoded in morphogen gradients" \
--mechanism "Cells read concentration thresholds to determine fate" \
--category mechanistic \
--confidence medium
# Activate a proposed hypothesis for testing
brenner hypothesis activate H-RS20251230-001
# Kill a hypothesis with discriminative test evidence
brenner hypothesis kill H-RS20251230-001 \
--test T-RS20251230-002 \
--reason "Test showed no gradient correlation"
# Validate a hypothesis with supporting test
brenner hypothesis validate H-RS20251230-001 \
--test T-RS20251230-003
# Create a third alternative (the "both could be wrong" move)
brenner hypothesis create \
--session-id RS20251230 \
--statement "Neither gradient nor timing — mechanical forces drive fate" \
--category third_alternative \
--origin third_alternative

Categories: `mechanistic`, `phenomenological`, `boundary`, `auxiliary`, `third_alternative`
Origins: `proposed`, `third_alternative`, `refinement`, `anomaly_spawned`
Discriminative tests are designed to eliminate hypotheses, not confirm them. "Exclusion is always a tremendously good thing." (§147)
States: designed → pending / in_progress → completed / blocked
# List tests for a session
brenner test list --session-id RS20251230
# Show test details with expected outcomes
brenner test show T-RS20251230-001
# Execute a test and record results
brenner test execute T-RS20251230-001 \
--result "Random fate assignment observed" \
--potency-pass \
--confidence high \
--by GreenCastle \
--notes "n=15 embryos, p<0.001"
# Bind test result to hypothesis outcomes
brenner test bind T-RS20251230-001 H-RS20251230-002 --matched \
--reason "Result consistent with prediction" \
--by GreenCastle
brenner test bind T-RS20251230-001 H-RS20251230-001 --violated \
--reason "Gradient model predicted no fate change" \
--by GreenCastle
# Suggest which hypotheses this test could kill
brenner test suggest-kills T-RS20251230-001 --confidence high

Potency checks are mandatory — they distinguish "no effect" from "assay failed." Use `--potency-pass` or `--potency-fail` to record the potency check result.
Note: Tests are typically created as part of session artifacts. The CLI focuses on execution, binding, and kill suggestion rather than test creation.
Assumptions are load-bearing beliefs that underpin hypotheses and tests. The Brenner method requires explicit tracking because falsifying an assumption invalidates everything that depends on it.
Types: background, methodological, boundary, scale_physics
States: unchecked → challenged → verified / falsified
# List assumptions
brenner assumption list --session-id RS20251230
# Create a scale/physics assumption (mandatory for every research program)
brenner assumption create \
--session-id RS20251230 \
--statement "Morphogen diffusion is fast enough for pattern formation" \
--type scale_physics \
--load-description "Underpins gradient-based models" \
--affects-hypotheses H-RS20251230-001,H-RS20251230-002 \
--affects-tests T-RS20251230-001 \
--calculation '{"quantities":"D ≈ 10 μm²/s, L ≈ 100 μm","result":"τ ≈ L²/D ≈ 1000s ≈ 17 min"}'
# Challenge an assumption
brenner assumption challenge A-RS20251230-001 \
--reason "New evidence suggests D may be 10x lower in tissue context" \
--by GreenCastle
# Verify an assumption
brenner assumption verify A-RS20251230-001 \
--evidence "FRAP measurements confirm D = 8-12 μm²/s in vivo" \
--by GreenCastle
# Falsify an assumption (triggers propagation to linked hypotheses/tests)
brenner assumption falsify A-RS20251230-001 \
--evidence "D measured at 0.5 μm²/s — gradient takes hours, not minutes" \
--by GreenCastle

The `scale_physics` type is special — it represents "the imprisoned imagination" constraint from Brenner. Every research program must have at least one.
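The back-of-envelope calculation recorded in the `--calculation` payload above can be sketched as a tiny function (an illustrative sketch; the function name is hypothetical, the numbers come from the example):

```typescript
// Characteristic diffusion timescale check: τ ≈ L² / D.
// Units: D in μm²/s, L in μm, result in seconds.
function diffusionTime(D_um2_per_s: number, L_um: number): number {
  return (L_um * L_um) / D_um2_per_s;
}

// The verify example: D ≈ 10 μm²/s over L ≈ 100 μm
const tauFast = diffusionTime(10, 100); // → 1000 s ≈ 17 min — gradient forms in time

// The falsify example: D measured at 0.5 μm²/s
const tauSlow = diffusionTime(0.5, 100); // → 20000 s ≈ 5.6 h — "hours, not minutes"
```

Running the numbers like this is exactly what the `scale_physics` type exists to force: the assumption survives or dies on arithmetic, not intuition.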
Anomalies are surprising observations that don't fit the current framework. They reveal when framings are inadequate and can spawn new hypotheses through the "paradox grounding" mechanism.
Quarantine States: active → resolved / deferred / paradigm_shifting
# List anomalies
brenner anomaly list --session-id RS20251230
# Create an anomaly from experimental observation
brenner anomaly create \
--session-id RS20251230 \
--observation "Cells at boundary show oscillating fate markers" \
--source-type experiment \
--source-ref T-RS20251230-003 \
--conflicts-with H-RS20251230-001,H-RS20251230-002 \
--conflict-description "Neither gradient nor timing models predict oscillation"
# Resolve an anomaly with a hypothesis
brenner anomaly resolve X-RS20251230-001 \
--by H-RS20251230-004 \
--notes "Third alternative explains oscillation as bistable switch"
# Defer an anomaly (must provide reason — prevents Occam's broom)
brenner anomaly defer X-RS20251230-001 \
--reason "Requires live imaging to characterize; park until microscope available"
# Reactivate a deferred anomaly
brenner anomaly reactivate X-RS20251230-001
# Spawn a hypothesis from an anomaly (paradox grounding)
brenner hypothesis create \
--session-id RS20251230 \
--category third_alternative \
--statement "Fate oscillation reflects bistable genetic circuit" \
--origin anomaly_spawned \
--notes "Spawned from X-RS20251230-001"

Key insight: "We didn't conceal them; we put them in an appendix." (§110) — Anomalies are quarantined, not hidden or allowed to destroy coherent frameworks prematurely.
Critiques are adversarial attacks on hypotheses, tests, assumptions, framing, or methodology. They enforce the "when they go ugly, kill them" discipline while requiring explicit justification.
Status: active → addressed / dismissed / accepted
Severity: minor, moderate, serious, critical
# List critiques
brenner critique list --session-id RS20251230
# Create a critique targeting a hypothesis
brenner critique create \
--session-id RS20251230 \
--target H-RS20251230-001 \
--attack "Gradient model assumes linear readout, but evidence suggests threshold switching" \
--evidence-to-confirm "Test non-linear response curves in dose-response assays" \
--severity serious
# Create a framing critique (attacks the research question itself)
brenner critique create \
--session-id RS20251230 \
--target framing \
--attack "Wrong level of description — should be asking about information flow, not substance" \
--evidence-to-confirm "Show equivalent patterning with different morphogens" \
--severity critical
# Respond to a critique
brenner critique respond C-RS20251230-001 \
--response "Added non-linear model variant; does not change discriminative power" \
--action modified \
--by GreenCastle
# Dismiss a critique (must provide reason)
brenner critique dismiss C-RS20251230-001 \
--reason "Evidence cited is from non-comparable system (Drosophila vs. vertebrate)" \
--by GreenCastle
# Accept a critique and take action
brenner critique accept C-RS20251230-001 \
--action killed \
--response "Critique was correct; hypothesis killed in favor of threshold model" \
--by GreenCastle

Note: The scoring CLI commands (`brenner score`, `brenner feedback`, `brenner leaderboard`) are planned but not yet implemented. The rubric below documents the intended evaluation framework. Currently, scoring happens during session compilation and is embedded in compiled artifacts.
The project implements a 14-criterion evaluation rubric for scoring Brenner method adherence. Scores are computed per-contribution and aggregated at the session level.
Sessions are scored across seven dimensions that capture the essence of rigorous scientific inquiry:
| Dimension | Max Points | What It Measures |
|---|---|---|
| Paradox Grounding | 20 | Does the session start from a genuine puzzle? |
| Hypothesis Kill Rate | 20 | Are hypotheses being eliminated, not just accumulated? |
| Test Discriminability | 20 | Do tests actually distinguish between hypotheses? |
| Assumption Tracking | 15 | Are load-bearing assumptions explicit and tested? |
| Third Alternative Discovery | 15 | Are "both could be wrong" alternatives explored? |
| Experimental Feasibility | 10 | Are tests actually executable? |
| Adversarial Pressure | 20 | Has adversarial critique been applied? |
Grades: A (90%+), B (80%+), C (70%+), D (60%+), F (<60%)
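The grade thresholds above amount to a simple lookup (an illustrative sketch, not the repo's actual scoring code; `gradeFor` is a hypothetical name):

```typescript
// Map a session score percentage to a letter grade,
// per the thresholds: A (90%+), B (80%+), C (70%+), D (60%+), F (<60%).
function gradeFor(percent: number): string {
  if (percent >= 90) return "A";
  if (percent >= 80) return "B";
  if (percent >= 70) return "C";
  if (percent >= 60) return "D";
  return "F";
}
```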
Each agent role is scored on criteria relevant to their function:
Hypothesis Generator (Codex) — 19 points max
- Structural Correctness (×1.0)
- Citation Compliance (×1.0)
- Rationale Quality (×0.5)
- Level Separation (×1.5) — "Programs don't have wants"
- Third Alternative Presence (×2.0) — "Both could be wrong"
- Paradox Exploitation (×0.5)
Test Designer (Opus) — 21.5 points max
- Discriminative Power (×2.0) — "Exclusion is always good"
- Potency Check Sufficiency (×2.0)
- Object Transposition Considered (×0.5)
- Score Calibration Honesty (×0.5)
Adversarial Critic (Gemini) — 25.5 points max (with KILL)
- Scale Check Rigor (×1.5) — "The imprisoned imagination"
- Anomaly Quarantine Discipline (×1.5)
- Theory Kill Justification (×1.5) — "When they go ugly, kill them"
- Real Third Alternative (×1.5)
Certain failures are disqualifying regardless of other scores:
- Invalid JSON in delta block
- Fake anchor detected (§n that doesn't exist)
- Missing potency check in test design
- KILL without rationale in critique
Research Programs aggregate multiple sessions into a coherent multi-session research effort. They provide dashboard views of hypothesis funnels, registry health, and timeline events.
States: active → paused → completed / abandoned
# Create a new research program
brenner program create \
--name "Cell Fate Determination in Vertebrate Embryos" \
--description "Investigating the computational basis of positional information encoding"
# List programs
brenner program list
brenner program list --status active
# Show program dashboard
brenner program show RP-CELL-FATE-001
brenner program dashboard RP-CELL-FATE-001
# Add sessions to a program
brenner program add-session RP-CELL-FATE-001 --session RS20251230
brenner program add-session RP-CELL-FATE-001 --session RS20251231
# Remove a session
brenner program remove-session RP-CELL-FATE-001 --session RS20251231
# Pause a program
brenner program pause RP-CELL-FATE-001 \
--reason "Waiting for CRISPR reagents"
# Resume a program
brenner program resume RP-CELL-FATE-001
# Complete a program
brenner program complete RP-CELL-FATE-001 \
--summary "Validated threshold model; gradient hypothesis killed"
# Abandon a program (requires explanation)
brenner program abandon RP-CELL-FATE-001 \
--reason "Funding ended; see RP-NEURAL-CREST-001 for continuation"

The program dashboard shows:
Hypothesis Funnel
Proposed → Active → Under Attack → Killed/Validated
12 5 2 7 / 0
Registry Health
- Assumptions: 8 total (5 verified, 2 challenged, 1 unchecked)
- Anomalies: 3 total (1 resolved, 1 deferred, 1 active)
- Critiques: 5 total (4 addressed, 1 active)
Timeline Events
2025-12-30 09:00 [hypothesis_proposed] H-RS20251230-001 created
2025-12-30 11:30 [test_executed] T-RS20251230-001 completed
2025-12-30 14:00 [hypothesis_killed] H-RS20251230-001 refuted by T-RS20251230-001
The CLI provides commands for running experiments, capturing results, and sharing them via Agent Mail.
# Run an experiment and capture output (wraps any shell command)
brenner experiment run \
--thread-id RS-20251230-example \
--test-id T-RS20251230-001 \
--timeout 60 \
--out-file results/experiment_001.json \
-- bash -lc "python run_assay.py --marker GFP"
# Record results from an already-executed experiment
brenner experiment record \
--thread-id RS-20251230-example \
--test-id T-RS20251230-001 \
--exit-code 0 \
--stdout-file results/stdout.txt \
--stderr-file results/stderr.txt \
--out-file results/experiment_001.json
# Encode captured results for sharing
brenner experiment encode \
--result-file results/experiment_001.json \
--out-file results/experiment_001_encoded.json
# Post encoded results to session participants via Agent Mail
brenner experiment post \
--result-file results/experiment_001_encoded.json \
--sender GreenCastle \
--to BlueLake,PurpleMountain \
--project-key "$PWD"

See `specs/experiment_result_encoding_v0.1.md` and `specs/experiment_capture_protocol_v0.1.md` for the encoding specification.
The cockpit provides an ntm-based multi-agent runtime for running collaborative research sessions. It coordinates multiple AI agents through Agent Mail, manages the session lifecycle, and produces the research artifact.
# Start a new research session with the cockpit
# This spawns ntm panes, sends kickoff messages, and broadcasts to agents
brenner cockpit start \
--project-key "$PWD" \
--thread-id RS-20251230-cell-fate \
--sender GreenCastle \
--to BlueLake,PurpleMountain,GreenValley \
--role-map "BlueLake=hypothesis_generator,PurpleMountain=test_designer,GreenValley=adversarial_critic" \
--excerpt-file excerpt.md \
--question "How do cells determine their position in a developing embryo?"
# Check session status (uses session status command)
brenner session status --project-key "$PWD" --thread-id RS-20251230-cell-fate
# Watch for completion (polls until all roles respond)
brenner session status --project-key "$PWD" --thread-id RS-20251230-cell-fate --watch --timeout 3600
# Compile the artifact once agents have responded
brenner session compile --project-key "$PWD" --thread-id RS-20251230-cell-fate > artifact.md

The cockpit:
- Provisions ntm panes for each agent
- Sends kickoff messages via Agent Mail
- Monitors for deltas and validates them
- Compiles approved deltas into the session artifact
- Manages round transitions and convergence
See specs/cockpit_start_command_v0.1.md and specs/cockpit_runbook_v0.1.md for detailed documentation.
The Next.js web app provides several views for browsing and analyzing research sessions:
| Route | Description |
|---|---|
| `/sessions` | List of all research sessions |
| `/sessions/[threadId]` | Session detail view with artifact |
| `/sessions/[threadId]/evidence` | Evidence pack view — consolidated experimental data |

| Route | Description |
|---|---|
| `/operators` | The Operator Algebra reference — all cognitive moves |
| `/method` | The Brenner Method guide — principles and practices |

| Route | Description |
|---|---|
| `/` | Home page with search |
| `/corpus` | Corpus document listing |
| `/corpus/[doc]` | Document viewer with anchor navigation |
| `/distillations` | Model distillation summaries |
| `/glossary` | Term definitions and Brenner vocabulary |
The specs/ directory contains detailed specifications for all protocols and formats:
| Spec | Description |
|---|---|
| `artifact_schema_v0.1.md` | The 8-section research artifact structure |
| `artifact_delta_spec_v0.1.md` | Delta format for incremental updates |
| `artifact_linter_spec_v0.1.md` | 50+ validation rules for artifact hygiene |
| `artifact_publish_spec_v0.1.md` | Publication and export formats |
| `evaluation_rubric_v0.1.md` | The 14-criterion scoring rubric |
| `operator_library_v0.1.md` | Complete operator algebra reference |
| `role_prompts_v0.1.md` | Agent role system prompts |
| `agent_mail_contracts_v0.1.md` | Message formats and threading |
| `agent_roster_schema_v0.1.md` | Agent configuration format |
| `message_body_schema_v0.1.md` | Message body structure |
| `thread_subject_conventions_v0.1.md` | Thread naming conventions |
| `excerpt_format_v0.1.md` | Transcript excerpt format |
| `delta_output_format_v0.1.md` | Delta output formatting |
| `experiment_result_encoding_v0.1.md` | Experiment data encoding |
| `experiment_capture_protocol_v0.1.md` | Capture workflow |
| `evidence_pack_v0.1.md` | Evidence consolidation format |
| `toolchain_manifest_v0.1.md` | Toolchain configuration |
| `session_replay_spec_v0.1.md` | Session replay format |
| `cockpit_start_command_v0.1.md` | Cockpit CLI reference |
| `cockpit_runbook_v0.1.md` | Cockpit operational guide |
| `deployment_runbook_v0.1.md` | Deployment procedures |
| `bootstrap_troubleshooting_v0.1.md` | Setup troubleshooting |
| `cross_workspace_binding_v0.1.md` | Multi-workspace coordination |
| `release_artifact_matrix_v0.1.md` | Release binary matrix |
The system uses a layered storage architecture for research artifacts:
apps/web/src/lib/storage/
├── hypothesis-storage.ts # Hypothesis CRUD with lifecycle
├── test-storage.ts # Test design and execution
├── assumption-storage.ts # Assumption tracking with load graphs
├── anomaly-storage.ts # Anomaly quarantine management
├── critique-storage.ts # Adversarial critique tracking
├── program-storage.ts # Research program aggregation
└── program-dashboard.ts # Dashboard metric computation
Each storage module provides:
- Create: Generate IDs, validate schemas, persist to store
- Read: Lookup by ID, list with filters, query by status
- Update: State transitions with timestamp tracking
- Lifecycle: State machine enforcement with transition validation
apps/web/src/lib/schemas/
├── hypothesis.ts # Hypothesis schema and factory
├── hypothesis-lifecycle.ts # State machine transitions
├── test-record.ts # Test design schema
├── test-binding.ts # Test execution binding
├── prediction.ts # Expected outcome predictions
├── assumption.ts # Assumption schema with load tracking
├── assumption-lifecycle.ts # Assumption state transitions
├── anomaly.ts # Anomaly schema with quarantine
├── critique.ts # Critique schema with responses
├── research-program.ts # Program aggregation schema
├── scorecard.ts # 14-criterion scoring schema
├── session.ts # Session metadata
├── session-replay.ts # Replay format schema
└── operator-intervention.ts # Human operator actions
All artifacts use stable, session-scoped IDs:
| Artifact | Pattern | Example |
|---|---|---|
| Hypothesis | `H-{session}-{seq}` | `H-RS20251230-001` |
| Test | `T-{session}-{seq}` | `T-RS20251230-001` |
| Assumption | `A-{session}-{seq}` | `A-RS20251230-001` |
| Anomaly | `X-{session}-{seq}` | `X-RS20251230-001` |
| Critique | `C-{session}-{seq}` | `C-RS20251230-001` |
| Program | `RP-{slug}-{seq}` | `RP-CELL-FATE-001` |
| Session ID | `RS{date}` | `RS20251230` |
Note on Session ID vs Thread ID:
- Session ID (`RS20251230`): Short identifier embedded in artifact item IDs, used with `--session-id` for filtering
- Thread ID (`RS-20251230-cell-fate`): Full identifier for Agent Mail threads, used with `--thread-id` for orchestration

The session ID is typically the date portion extracted from the thread ID. When using CLI commands:
- Use `--session-id RS20251230` for filtering (`hypothesis list`, `test list`, etc.)
- Use `--thread-id RS-20251230-cell-fate` for orchestration (`session start`, `status`, `compile`)
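The "date portion extracted from the thread ID" step can be sketched as a one-liner (an illustrative helper, not a function the repo exports; it assumes thread IDs follow the `RS-YYYYMMDD-slug` convention shown above):

```typescript
// Derive the short session ID ("RS20251230") from a full
// Agent Mail thread ID ("RS-20251230-cell-fate").
function sessionIdFromThreadId(threadId: string): string | null {
  const m = threadId.match(/^RS-(\d{8})\b/);
  return m ? `RS${m[1]}` : null; // null if the thread ID doesn't follow the convention
}
```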
All CLI commands support structured JSON output for programmatic integration:
# Get hypothesis as JSON
brenner hypothesis show H-RS20251230-001 --json
# List with JSON output
brenner test list --session-id RS20251230 --json
# Pipe to jq for processing
brenner hypothesis list --session-id RS20251230 --json | jq '.[] | select(.state == "active")'

The JSON output matches the TypeScript schema types, enabling type-safe integration with other tools.
The thread status system tracks session lifecycle phases and role contributions in real-time. It's the backbone of session orchestration, enabling the CLI and web app to show progress and determine when compilation is ready.
Sessions progress through a defined lifecycle:
| Phase | Description |
|---|---|
| `not_started` | Thread exists but no kickoff sent |
| `awaiting_responses` | Kickoff sent, waiting for agent deltas |
| `partially_complete` | Some roles have contributed, others pending |
| `awaiting_compilation` | All roles complete, ready for compile |
| `compiled` | Artifact compiled, no critique yet |
| `in_critique` | Critique phase active |
| `closed` | Session complete |
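The early phases are a function of kickoff state and role completion counts; a simplified sketch of that progression (under assumed semantics, not the actual `computeThreadStatus` logic, which also handles critique and close):

```typescript
type EarlyPhase =
  | "not_started"
  | "awaiting_responses"
  | "partially_complete"
  | "awaiting_compilation";

// Derive the pre-compilation phase from kickoff + per-role completion.
function derivePhase(
  kickoffSent: boolean,
  rolesComplete: number,
  rolesTotal: number,
): EarlyPhase {
  if (!kickoffSent) return "not_started";
  if (rolesComplete === 0) return "awaiting_responses";
  if (rolesComplete < rolesTotal) return "partially_complete";
  return "awaiting_compilation";
}
```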
Each session tracks three Brenner roles:
| Role | Primary Responsibility | Default Model |
|---|---|---|
| `hypothesis_generator` | Hunt paradoxes, propose H1-H3 | Codex/GPT |
| `test_designer` | Design discriminative tests + potency controls | Claude/Opus |
| `adversarial_critic` | Attack framing, check scale constraints | Gemini |
import { computeThreadStatus, formatThreadStatusSummary } from "./lib/threadStatus";
const status = computeThreadStatus(messages); // threadId is derived from message.thread_id
console.log(formatThreadStatusSummary(status));
// → "Phase: awaiting_compilation | 3/3 roles complete | 0 pending acks"

# Show session status
brenner session status --thread-id RS-20251230-cell-fate

# Watch for completion
brenner session status --thread-id RS-20251230-cell-fate --watch --timeout 3600

The delta parser extracts structured contributions from agent message bodies. Agents produce deltas (not essays) that specify operations on the shared artifact.
| Operation | Semantics |
|---|---|
| `ADD` | Append new item (auto-assigns ID like H4, T7) |
| `EDIT` | Modify existing item by ID |
| `KILL` | Mark item as killed with reason |
Deltas target one of seven artifact sections:
| Section | ID Prefix | Content |
|---|---|---|
| `hypothesis_slate` | H | Candidate explanations |
| `predictions_table` | P | Per-hypothesis predictions |
| `discriminative_tests` | T | Tests that separate hypotheses |
| `assumption_ledger` | A | Load-bearing assumptions |
| `anomaly_register` | X | Quarantined exceptions |
| `adversarial_critique` | C | Framing attacks |
| `research_thread` | RT | Problem statement (singleton) |
Agents embed deltas in fenced code blocks:
:::delta
{
"operation": "ADD",
"section": "hypothesis_slate",
"target_id": null,
"payload": {
"name": "Gradient Model",
"claim": "Positional information is encoded in morphogen gradients",
"mechanism": "Cells read concentration thresholds",
"anchors": ["§58", "EV-001"]
},
"rationale": "Builds on established developmental biology (§58)"
}
:::

import { parseDeltaMessage, extractValidDeltas } from "./lib/delta-parser";
const result = parseDeltaMessage(messageBody);
console.log(`Found ${result.validCount} valid deltas, ${result.invalidCount} invalid`);
// Get only valid deltas
const deltas = extractValidDeltas(messageBody);
for (const delta of deltas) {
console.log(`${delta.operation} on ${delta.section}`);
}

The tribunal system uses four specialized AI agents, each with a comprehensive persona definition that governs their behavior, tone, and activation patterns.
| Agent | Role | Tagline | Core Purpose |
|---|---|---|---|
| Devil's Advocate | `devils_advocate` | "Challenge everything. Trust nothing without evidence." | Attack hypotheses, expose assumptions, prevent confirmation bias |
| Experiment Designer | `experiment_designer` | "Design tests that give clean answers." | Translate hypotheses into discriminative tests, ensure methodological rigor |
| Brenner Channeler | `brenner_channeler` | "You've got to really find out." | Channel Sydney Brenner's voice, push for exclusion tests, demand experiments |
| Synthesis | `synthesis` | "Distill clarity from complexity." | Integrate tribunal outputs, identify consensus, prioritize next steps |
Each persona includes:
interface AgentPersona {
role: TribunalAgentRole;
displayName: string;
tagline: string;
corePurpose: string;
behaviors: AgentBehavior[]; // Prioritized behavioral patterns
tone: ToneCalibration; // Assertiveness, constructiveness, Socratic level
modelConfig: ModelConfig; // Temperature, tokens, preferred tier
invocationTriggers: InvocationTrigger[]; // Events that activate this agent
activePhases: PersonaPhaseGroup[]; // Session phases where active
interactionPatterns: InteractionPattern[]; // Input→output examples
synergizesWith: TribunalAgentRole[]; // Complementary agents
systemPromptFragments: string[]; // Prompt building blocks
}

Each agent's voice is tuned across four dimensions (0-1 scale):
| Agent | Assertiveness | Constructiveness | Socratic Level | Formality |
|---|---|---|---|---|
| Devil's Advocate | 0.8 | 0.7 | 0.6 | 0.5 |
| Experiment Designer | 0.6 | 0.9 | 0.7 | 0.6 |
| Brenner Channeler | 0.9 | 0.6 | 0.5 | 0.3 |
| Synthesis | 0.5 | 0.95 | 0.2 | 0.7 |
Agents activate on specific events:
| Trigger | Description | Agents Activated |
|---|---|---|
| `hypothesis_submitted` | User submits initial hypothesis | Devil's Advocate, Experiment Designer, Brenner Channeler |
| `hypothesis_refined` | Hypothesis is modified | Devil's Advocate, Brenner Channeler |
| `prediction_locked` | Prediction committed (pre-registration) | Devil's Advocate |
| `evidence_supports` | Evidence supports hypothesis | Devil's Advocate |
| `test_designed` | New test proposed | Experiment Designer, Brenner Channeler |
| `tribunal_requested` | Full tribunal session | All agents |
| `phase_transition` | Moving between phases | Brenner Channeler, Synthesis |
Agents are active during specific session phase groups:
| Phase Group | Detailed Phases | Active Agents |
|---|---|---|
| `intake` | intake | Devil's Advocate |
| `hypothesis` | sharpening | All agents |
| `operators` | level_split, exclusion_test, object_transpose, scale_check | Devil's Advocate, Experiment Designer, Brenner Channeler |
| `agents` | agent_dispatch | All agents |
| `evidence` | evidence_gathering | Devil's Advocate, Experiment Designer, Brenner Channeler |
| `synthesis` | synthesis, revision | Brenner Channeler, Synthesis |
Devil's Advocate (priority behaviors):
- Identify Unstated Assumptions: "You're assuming the correlation reflects causation, but what if both variables are caused by a third factor you haven't measured?"
- Find Alternative Explanations: "This pattern is also consistent with reverse causation, measurement artifact, or selection bias. How would you distinguish these?"
Experiment Designer (priority behaviors):
- Ask Probing Questions About Measurements: "When you say you'll measure 'improvement', what specific metric are you using? How will you operationalize that?"
- Identify Confounds: "If you compare treated vs untreated groups, how will you control for the placebo effect and experimenter bias?"
Brenner Channeler (priority behaviors):
- Demand the Experiment: "That's all very well, but what's the experiment? How would you actually test this?"
- Seek Exclusion Over Confirmation: "Exclusion is always a tremendously good thing in science. What observation would kill your hypothesis?"
import {
getPersona,
getActivePersonasForPhase,
getPersonasForTrigger,
buildSystemPromptContext,
} from "@/lib/brenner-loop";
// Get all personas active during the operators phase
const operatorAgents = getActivePersonasForPhase("level_split");
// → [Devil's Advocate, Experiment Designer, Brenner Channeler]
// Get personas triggered by hypothesis submission
const triggered = getPersonasForTrigger("hypothesis_submitted");
// → [Devil's Advocate, Experiment Designer, Brenner Channeler]
// Build system prompt for an agent
const prompt = buildSystemPromptContext("devils_advocate");

The prediction lock system provides cryptographic pre-registration for scientific predictions. Predictions are hashed before evidence is collected, ensuring that claimed predictions were actually made in advance.
| State | Symbol | Description |
|---|---|---|
| `draft` | — | Freely editable, not yet committed |
| `locked` | 🔒 | SHA-256 sealed, waiting for evidence |
| `revealed` | 🔓 | Evidence collected, prediction compared to outcome |
| `amended` |  | Modified after evidence (flagged for integrity) |
Draft → Lock (SHA-256 hash) → Evidence Collection → Reveal → Compare
↓
[Amendment] (if changed post-hoc)
| Type | Description |
|---|---|
| `qualitative` | "X will increase" |
| `quantitative` | "X will be > 5.0" |
| `comparative` | "X > Y" |
| `temporal` | "X before Y" |
| `null` | "No effect" |
When a prediction is locked:
- SHA-256 hash computed: `hash(prediction_text + timestamp)`
- Original text becomes immutable
- Hash stored for later verification
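The lock step above can be sketched in a few lines (a minimal sketch; the exact concatenation format is an assumption, and `lockHash` is a hypothetical name rather than the repo's implementation):

```typescript
import { createHash } from "node:crypto";

// Seal a prediction: SHA-256 over the prediction text plus the lock timestamp.
// Any post-hoc edit to the text produces a different hash, making
// unlogged amendments detectable at reveal time.
function lockHash(predictionText: string, lockedAt: string): string {
  return createHash("sha256")
    .update(predictionText + lockedAt)
    .digest("hex");
}

const h = lockHash(
  "Recovery time constant will be < 500ms",
  "2025-12-30T09:00:00Z",
); // 64-character hex digest, stored for later verification
```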
If interpretations change post-evidence:
- Each amendment logged with type: `clarification`, `reinterpretation`, `scope_change`, `retraction`
- Amendments penalize integrity score
- Visual warnings displayed in UI
integrityScore = (1 - amendmentPenalty) × 100
High integrity = predictions were locked before evidence and not modified after.
Predictions with higher integrity get weighted more heavily in confidence updates:
| Integrity | Multiplier |
|---|---|
| 100% (no amendments) | 1.0× |
| 75-99% | 0.8× |
| 50-74% | 0.5× |
| < 50% | 0.2× |
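The integrity formula and the multiplier table above combine into a small lookup (an illustrative sketch under the stated thresholds; the function names are hypothetical):

```typescript
// integrityScore = (1 - amendmentPenalty) × 100
function integrityScore(amendmentPenalty: number): number {
  return (1 - amendmentPenalty) * 100;
}

// Weight applied to a prediction in confidence updates, by integrity band.
function integrityMultiplier(integrityPercent: number): number {
  if (integrityPercent >= 100) return 1.0; // no amendments
  if (integrityPercent >= 75) return 0.8;
  if (integrityPercent >= 50) return 0.5;
  return 0.2;
}
```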
import {
lockPrediction,
revealPrediction,
verifyPrediction,
calculatePredictionLockStats,
} from "@/lib/brenner-loop";
// Lock a prediction before evidence
const lockResult = await lockPrediction({
hypothesisId: "H-RS20251230-001",
predictionText: "Recovery time constant will be < 500ms",
predictionType: "quantitative",
});
// → { lockedAt, hash, state: "locked" }
// After evidence, reveal and compare
const revealResult = await revealPrediction({
predictionId: lockResult.id,
observedOutcome: "Recovery time was 487 ± 32ms",
result: "confirmed", // confirmed | refuted | inconclusive
});
// Verify a prediction's hash
const valid = await verifyPrediction(lockResult.id, lockResult.hash);

The hypothesis arena provides head-to-head competitive testing between multiple hypotheses. Instead of evaluating hypotheses in isolation, the arena tracks how they perform relative to each other on shared discriminative tests.
| Concept | Description |
|---|---|
| Arena | A competitive space where multiple hypotheses face the same tests |
| Competitor | A hypothesis entered into the arena |
| Shared Test | A test that applies to multiple hypotheses |
| Elimination | When a test definitively rules out a hypothesis |
| Champion | The hypothesis that survives with highest score |
| Status | Description |
|---|---|
| `active` | Still competing |
| `eliminated` | Definitively ruled out by a test |
| `suspended` | Temporarily set aside |
| `champion` | Won the arena |
Predictions are scored by specificity and risk:
| Boldness | Description | Multiplier |
|---|---|---|
| `vague` | "Things will improve" | 1.0× |
| `specific` | "Score increases 5-10%" | 1.5× |
| `precise` | "Score will be exactly 7.3" | 2.0× |
| `surprising` | "Contrary to consensus, X will occur" | 3.0× |
Scoring formula:
- Confirmed bold prediction: `+base_score × boldness_multiplier`
- Refuted bold prediction: `-base_score × boldness_multiplier`
Bold predictions that succeed earn more; bold predictions that fail cost more. This incentivizes making specific, risky predictions.
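The symmetric rule above reduces to one signed multiplication (a sketch using the multipliers from the boldness table; `scoreDelta` and the base score are illustrative, not the repo's API):

```typescript
// Boldness multipliers from the table above.
const BOLDNESS: Record<string, number> = {
  vague: 1.0,
  specific: 1.5,
  precise: 2.0,
  surprising: 3.0,
};

// Signed score change: confirmed bold predictions gain more,
// refuted bold predictions lose more.
function scoreDelta(
  baseScore: number,
  boldness: string,
  confirmed: boolean,
): number {
  const m = BOLDNESS[boldness] ?? 1.0;
  return confirmed ? baseScore * m : -baseScore * m;
}
```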
The arena generates a comparison matrix showing:
| Hypothesis | Test T1 | Test T2 | Test T3 | Total Score |
|---|---|---|---|---|
| H1 | +3 ✓ | -2 ✗ | +4 ✓ | 5 |
| H2 | +1 ✓ | +2 ✓ | ELIM | — |
| H3 | 0 | +2 ✓ | +1 ✓ | 3 |
Tests are evaluated for how well they distinguish hypotheses:
discriminativePower = variance(predictions across hypotheses)
A test where all hypotheses predict the same outcome has zero discriminative power and is flagged.
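For quantitative predictions, that variance can be computed directly (a sketch of the idea, not the repo's implementation; qualitative predictions would need an encoding first):

```typescript
// Population variance of the per-hypothesis predicted values for one test.
// Zero variance means every hypothesis predicts the same outcome,
// so the test cannot discriminate and should be flagged.
function discriminativePower(predictions: number[]): number {
  const n = predictions.length;
  if (n === 0) return 0;
  const mean = predictions.reduce((a, b) => a + b, 0) / n;
  return predictions.reduce((a, x) => a + (x - mean) ** 2, 0) / n;
}

const flagged = discriminativePower([500, 500, 500]) === 0; // flag: no discrimination
```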
import {
createArena,
addCompetitor,
createArenaTest,
recordTestResult,
buildComparisonMatrix,
getLeader,
} from "@/lib/brenner-loop";
// Create an arena for competing hypotheses
const arena = createArena("Mechanism of X");
// Add competitors
const h1 = addCompetitor(arena, hypothesis1);
const h2 = addCompetitor(arena, hypothesis2);
const h3 = addCompetitor(arena, hypothesis3);
// Create a shared test
const test = createArenaTest(arena, {
description: "Measure response time under condition Y",
predictions: {
[h1.id]: { outcome: "< 500ms", boldness: "specific" },
[h2.id]: { outcome: "> 1000ms", boldness: "specific" },
[h3.id]: { outcome: "500-1000ms", boldness: "vague" },
},
});
// Record result and update scores
recordTestResult(arena, test.id, {
observed: "487ms",
hypothesisResults: {
[h1.id]: "confirmed",
[h2.id]: "refuted",
[h3.id]: "refuted",
},
});
// Get rankings
const matrix = buildComparisonMatrix(arena);
const leader = getLeader(arena);Individual hypotheses progress through a defined lifecycle managed by a state machine. This ensures proper tracking of hypothesis status and enforces valid transitions.
| State | Symbol | Description |
|---|---|---|
| `draft` | ○ | Initial formulation, freely editable |
| `proposed` | ◐ | Submitted for evaluation |
| `active` | ● | Under active investigation |
| `under_attack` | ⚔ | Facing serious challenges |
| `assumption_undermined` | ⚠ | Key assumption falsified |
| `refined` | ↻ | Evolved based on feedback |
| `dormant` | ◇ | Parked for later |
| `killed` | ✗ | Definitively falsified |
| `validated` | ✓ | Survived rigorous testing |
draft → proposed → active ─┬→ under_attack → killed
├→ assumption_undermined → killed
├→ refined → active (cycle)
├→ dormant → active (reactivation)
└→ validated
| Event | From States | To State |
|---|---|---|
| `submit` | draft | proposed |
| `activate` | proposed | active |
| `challenge` | active | under_attack |
| `undermine_assumption` | active, under_attack | assumption_undermined |
| `refine` | active, under_attack | refined |
| `park` | active | dormant |
| `reactivate` | dormant | active |
| `kill` | under_attack, assumption_undermined | killed |
| `validate` | active | validated |
Once a hypothesis reaches killed or validated, no further transitions are possible. These are terminal states that represent the end of the hypothesis lifecycle.
Each state has associated configuration:
interface HypothesisStateConfig {
label: string; // Display name
description: string; // What this state means
icon: string; // Visual indicator
colors: {
bg: string; // Background color class
text: string; // Text color class
border: string; // Border color class
};
isEditable: boolean; // Can hypothesis be modified?
isDeletable: boolean; // Can hypothesis be deleted?
isTerminal: boolean; // End of lifecycle?
}

import {
transitionHypothesis,
getAvailableTransitions,
canTransition,
isTerminalState,
createHypothesisWithLifecycle,
} from "@/lib/brenner-loop";
// Create a hypothesis with lifecycle tracking
const hypothesis = createHypothesisWithLifecycle({
statement: "X causes Y through mechanism Z",
mechanism: "Z enables X to produce Y",
predictionsIfTrue: ["Blocking Z prevents effect"],
impossibleIfTrue: ["Effect without Z present"],
});
// → state: "draft"
// Check available transitions
const events = getAvailableTransitions(hypothesis);
// → ["submit"]
// Transition to next state
const result = transitionHypothesis(hypothesis, "submit");
if (result.success) {
// hypothesis.state is now "proposed"
}
// Check if we can transition
if (canTransition(hypothesis, "activate")) {
transitionHypothesis(hypothesis, "activate");
}

Certain transitions trigger side effects:
| Transition | Side Effect |
|---|---|
| `kill` | Records kill reason, updates arena if applicable |
| `validate` | Marks as champion in arena if applicable |
| `refine` | Creates new version, preserves history link |
| `undermine_assumption` | Propagates to dependent tests |
The session kickoff system composes role-specific prompts for multi-agent sessions. It supports both unified mode (all agents get the same prompt) and roster mode (each agent gets a role-tailored prompt).
Unified mode (`--unified`): All agents receive identical kickoff messages containing the full research question and excerpt.
Role-separated mode (`--role-map`): Each agent receives a role-specific prompt emphasizing their responsibilities:
| Role | Operators | Focus |
|---|---|---|
| `hypothesis_generator` | ⊘ Level-Split, ⊕ Cross-Domain, ◊ Paradox-Hunt | Generate 3+ competing hypotheses |
| `test_designer` | ✂ Exclusion-Test, ⌂ Materialize, ⟂ Object-Transpose, 🎭 Potency-Check | Design discriminative experiments |
| `adversarial_critic` | ΔE Exception-Quarantine, † Theory-Kill, ⊞ Scale-Check | Attack framings, enforce constraints |
# Role-separated kickoff (recommended)
brenner session start \
--project-key "$PWD" \
--thread-id RS20251230 \
--sender GreenCastle \
--to BlueLake,PurpleMountain,GreenValley \
--role-map "BlueLake=hypothesis_generator,PurpleMountain=test_designer,GreenValley=adversarial_critic" \
--excerpt-file excerpt.md \
--question "How do cells determine their position in a developing embryo?"
# Unified kickoff
brenner session start \
--project-key "$PWD" \
--thread-id RS20251230 \
--sender GreenCastle \
--to BlueLake,PurpleMountain,GreenValley \
--unified \
--excerpt-file excerpt.md \
--question "How do cells determine their position in a developing embryo?"

import { composeKickoffMessages, type KickoffConfig } from "./lib/session-kickoff";
const config: KickoffConfig = {
threadId: "RS20251230",
researchQuestion: "How do cells determine their position?",
context: "Investigating positional information encoding",
excerpt: excerptMarkdown,
recipients: ["BlueLake", "PurpleMountain", "GreenValley"],
recipientRoles: {
BlueLake: "hypothesis_generator",
PurpleMountain: "test_designer",
GreenValley: "adversarial_critic",
},
};
const messages = composeKickoffMessages(config);
// → Array of role-specific kickoff messages

The global search system provides full-text search across the entire corpus: transcripts (236 segments), quote bank, distillations, metaprompts, and raw model responses.
| Category | Content |
|---|---|
| `transcript` | Complete Brenner transcript (236 segments) |
| `quote-bank` | Curated primitives tagged by operator |
| `distillation` | Final synthesis documents (Opus, GPT, Gemini) |
| `metaprompt` | Prompt templates |
| `raw-response` | Model response batches |
| `all` | Search everything (default) |
- In-memory caching: Corpus loaded once, cached for fast queries
- Relevance scoring: BM25-based ranking with title/content weighting
- Match highlighting: Snippets with matched terms emphasized
- Model filtering: Filter raw responses by model (gpt, opus, gemini)
- Anchor extraction: Returns `§n` anchors for citation
# Basic search
brenner corpus search "forbidden pattern"
# Filter by category
brenner corpus search "exclusion" --category transcript
# Filter by model for raw responses
brenner corpus search "dimensional reduction" --category raw-response --model opus
# JSON output for programmatic use
brenner corpus search "digital handle" --json --limit 50

import { globalSearch, type SearchCategory } from "./lib/globalSearch";
const result = await globalSearch("discriminative experiment", {
limit: 20,
category: "transcript",
});
console.log(`Found ${result.totalMatches} matches in ${result.searchTimeMs}ms`);
for (const hit of result.hits) {
console.log(`${hit.anchor}: ${hit.title}`);
console.log(` ${hit.snippet}`);
}

The jargon dictionary provides a comprehensive glossary of 100+ terms covering Brenner operators, scientific methodology, biology, Bayesian reasoning, and project-specific terminology.
| Category | Content |
|---|---|
| `operators` | Brenner operators (⊘ Level-Split, 𝓛 Recode, ✂ Exclusion-Test, etc.) |
| `brenner` | Core Brenner concepts (third alternative, forbidden pattern, etc.) |
| `biology` | Scientific/biology terms (C. elegans, morphogen, etc.) |
| `bayesian` | Statistical/probabilistic terms (likelihood ratio, prior, etc.) |
| `method` | Scientific method terms (hypothesis, falsification, etc.) |
| `project` | BrennerBot-specific terms (delta, artifact, session, etc.) |
Each term has multiple levels of explanation:
interface JargonTerm {
term: string; // Display name (e.g., "Level-split")
short: string; // ~100 char tooltip definition
long: string; // 2-4 sentence explanation
analogy?: string; // "Think of it like..." for non-experts
why?: string; // Why this matters in Brenner context
related?: string[]; // Related term keys for discovery
category: JargonCategory;
}

import { getJargon, jargonDictionary, searchJargon } from "./lib/jargon";
// Get a specific term
const term = getJargon("level-split");
if (term) {
console.log(term.short); // For tooltips
console.log(term.long); // For full explanation
console.log(term.analogy); // For non-experts
}
// Search across all terms
const matches = searchJargon("exclusion");
// → Returns terms containing "exclusion" in term, short, or long

The web app uses the jargon dictionary for:
- Hover tooltips: Show `short` definition on hover
- Glossary page: Full dictionary with category filtering
- Progressive disclosure: Click for `long` → click again for `analogy` + `why`
The lab mode auth system implements defense-in-depth gating for orchestration features. Public deployments cannot trigger Agent Mail operations without explicit enablement.
- Environment gate: `BRENNER_LAB_MODE=1` must be set (fail-closed)
- Authentication: Either Cloudflare Access headers OR shared secret
- Timing-safe comparison: Secrets compared in constant time
- Information hiding: Failed auth returns 404 (not 401/403)
| Variable | Purpose |
|---|---|
| `BRENNER_LAB_MODE` | Enable lab mode (`1` or `true`) |
| `BRENNER_LAB_SECRET` | Shared secret for local auth |
| `BRENNER_TRUST_CF_ACCESS_HEADERS` | Trust Cloudflare Access JWT headers |
| `BRENNER_PROJECT_KEY` | Default project key for Agent Mail (absolute path) |
| `BRENNER_AGENT_NAME` | Default agent name for session pages |
| `BRENNER_PUBLIC_BASE_URL` | Public base URL for fetching corpus/assets |
Method 1: Cloudflare Access (recommended for production)
- Deploy behind Cloudflare Access
- Set `BRENNER_TRUST_CF_ACCESS_HEADERS=1`
- Auth is handled by Cloudflare JWT validation

Method 2: Shared Secret (for local development)
- Set `BRENNER_LAB_SECRET=your-secret-here`
- Pass secret via header or cookie:
  - Header: `x-brenner-lab-secret: your-secret-here`
  - Cookie: `brenner_lab_secret=your-secret-here`
import { checkOrchestrationAuth, assertOrchestrationAuth } from "./lib/auth";
// Check auth (returns status object)
const { authorized, reason } = checkOrchestrationAuth(headers, cookies);
if (!authorized) {
console.log(`Denied: ${reason}`);
}
// Assert auth (throws on failure)
assertOrchestrationAuth(headers, cookies);
// Proceeds if authorized, throws Error if not

The operator intervention system tracks human overrides during Brenner Loop sessions. This enables reproducibility, trust verification, and learning from operator decisions.
| Type | Description | Typical Severity |
|---|---|---|
| `artifact_edit` | Direct edit to compiled artifact | moderate |
| `delta_exclusion` | Excluded a delta from compilation | moderate |
| `delta_injection` | Added a delta not from an agent | major |
| `decision_override` | Overrode a protocol decision | major |
| `session_control` | Terminated, forked, or reset session | critical |
| `role_reassignment` | Changed agent-role mappings mid-session | major |
| Severity | Examples |
|---|---|
| `minor` | Typo fixes, formatting adjustments |
| `moderate` | Delta exclusion, small edits |
| `major` | Killing hypotheses, adding tests, role changes |
| `critical` | Session termination, protocol bypass |
interface OperatorIntervention {
id: string; // INT-RS20251230-001
session_id: string; // RS20251230
timestamp: string; // ISO 8601
operator_id: string; // "human" or agent name
type: InterventionType;
severity: InterventionSeverity;
target: {
message_id?: number;
artifact_version?: number;
item_id?: string; // H-RS20251230-001
item_type?: string; // hypothesis, test, etc.
};
state_change?: {
before: string;
after: string;
before_hash?: string;
after_hash?: string;
};
rationale: string; // Required, min 10 chars
reversible: boolean;
reversed_at?: string;
reversed_by?: string;
}

Compiled artifacts include an intervention summary:
{
"total_count": 3,
"by_severity": { "minor": 1, "moderate": 2, "major": 0, "critical": 0 },
"by_type": { "artifact_edit": 2, "delta_exclusion": 1, ... },
"has_major_interventions": false,
"operators": ["human"],
"first_intervention_at": "2025-12-30T10:00:00Z",
"last_intervention_at": "2025-12-30T14:30:00Z"
}

The session replay system records complete session traces for reproducibility, debugging, and training. It captures inputs, execution traces, and outputs with verification hashes.
Inputs (deterministic):
- Kickoff configuration (thread_id, question, excerpt, operators)
- External evidence summaries
- Agent roster with role assignments
- Protocol versions
Trace (execution):
- Rounds with message traces
- Content hashes (SHA256) for verification
- Operator interventions
- Timing information
Outputs:
- Final artifact hash
- Lint results
- Artifact counts (hypotheses, tests, assumptions, etc.)
- Scorecard results
interface SessionRecord {
id: string; // REC-RS20251230-1704067200
session_id: string; // RS20251230
created_at: string;
inputs: SessionInputs;
trace: SessionTrace;
outputs: SessionOutputs;
schema_version: string;
}

| Mode | Purpose |
|---|---|
| `verification` | Re-run with same agents to verify outputs match |
| `comparison` | Re-run with different agents to compare |
| `trace` | Step through recorded messages without re-running |
When replaying, the system detects divergences:
| Severity | Meaning |
|---|---|
| `none` | Identical or semantically equivalent |
| `minor` | Slight wording differences, same meaning |
| `moderate` | Different approach, similar conclusions |
| `major` | Fundamentally different conclusions |
import {
createEmptySessionRecord,
createTraceMessage,
validateSessionRecord,
isReplayable
} from "./lib/schemas/session-replay";
// Create a session record
const record = createEmptySessionRecord("RS20251230");
// Add a trace message
const message = await createTraceMessage(
"BlueLake",
"DELTA",
deltaContent,
{ message_id: 42, subject: "H1 proposal" }
);
record.trace.rounds[0].messages.push(message);
// Validate the record
const result = validateSessionRecord(record);
if (result.valid) {
console.log("Record is valid");
}
// Check if replayable
if (isReplayable(record)) {
console.log("Session can be replayed");
}

The Coach Mode provides progressive scaffolding for researchers new to the Brenner method. Instead of throwing users into a complex methodology, it introduces concepts gradually, provides inline explanations, and catches common mistakes before they become problematic.
Traditional scientific method tutorials are passive—you read, then try to apply. Coach Mode inverts this: learn by doing with guardrails. The system watches what you're doing, explains concepts when they become relevant, and gently corrects methodological errors in real-time.
This mirrors how Brenner himself taught—through conversation and working examples rather than lectures. The coach doesn't tell you "what is a discriminative test"; it waits until you're designing a test and then explains why your current approach may or may not have discriminative power.
| Level | Explanation Verbosity | Confirmations | Auto-Pause |
|---|---|---|---|
| `beginner` | Full explanations with examples and Brenner quotes | Required for major actions | Yes, at each phase |
| `intermediate` | Brief explanations, examples on request | Optional | Only at decision points |
| `advanced` | Tooltips only, no interruptions | Rare | Never |
The system auto-promotes users based on their progress: sessions completed, hypotheses formulated, operators used, and quality checkpoints passed.
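As a sketch of what auto-promotion could look like, the function below checks progress counters against thresholds. The thresholds and the `promotedLevel` helper are hypothetical illustrations, not the shipped criteria; only the `LearningProgress` fields come from the interface documented below.

```typescript
type CoachLevel = "beginner" | "intermediate" | "advanced";

// Field names follow the documented LearningProgress interface.
interface Progress {
  sessionsCompleted: number;
  hypothesesFormulated: number;
  operatorsUsed: Set<string>;
  checkpointsPassed: number;
}

// Hypothetical thresholds: promote once enough sessions, operators,
// and quality checkpoints have accumulated.
function promotedLevel(p: Progress): CoachLevel {
  if (p.sessionsCompleted >= 10 && p.operatorsUsed.size >= 6 && p.checkpointsPassed >= 20) {
    return "advanced";
  }
  if (p.sessionsCompleted >= 3 && p.hypothesesFormulated >= 5) {
    return "intermediate";
  }
  return "beginner";
}
```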
Coach Mode tracks learning progress across sessions:
interface LearningProgress {
seenConcepts: Set<ConceptId>; // Concepts with explanations viewed
sessionsCompleted: number; // Total sessions finished
hypothesesFormulated: number; // Hypotheses created
operatorsUsed: Set<string>; // Brenner operators applied
mistakesCaught: number; // Quality checkpoint failures
checkpointsPassed: number; // Quality checkpoint successes
firstSessionDate?: string;
lastSessionDate?: string;
}

At critical moments (hypothesis formulation, test design, assumption logging), Coach Mode validates user input against Brenner-style quality criteria:
Hypothesis Quality Checks:
- Statement length (too short = too vague)
- Vague causal language without mechanism
- Missing mechanism specification
- Missing predictions
- Missing falsification conditions
- Unfalsifiable hedging language ("might", "could possibly")
Each check returns a severity (error, warning, info), an explanation of why it matters, and a specific suggestion for improvement.
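A sketch of how such a checkpoint might work, assuming simple heuristics; the thresholds, wording, and the `checkHypothesis` name are illustrative, not the actual API:

```typescript
interface QualityIssue {
  severity: "error" | "warning" | "info";
  why: string; // Why this check matters
  suggestion: string; // Specific improvement
}

// Illustrative quality checks mirroring the list above.
function checkHypothesis(h: {
  statement: string;
  mechanism?: string;
  impossibleIfTrue?: string[];
}): QualityIssue[] {
  const issues: QualityIssue[] = [];
  if (h.statement.trim().length < 20) {
    issues.push({
      severity: "warning",
      why: "Very short statements are usually too vague to test.",
      suggestion: "State the specific cause, effect, and conditions.",
    });
  }
  if (/\b(might|could possibly)\b/i.test(h.statement)) {
    issues.push({
      severity: "error",
      why: "Hedging language makes the hypothesis unfalsifiable.",
      suggestion: 'Replace "might" with a definite, testable claim.',
    });
  }
  if (!h.mechanism) {
    issues.push({
      severity: "error",
      why: "A causal claim without a mechanism cannot be level-split.",
      suggestion: "Specify how the cause produces the effect.",
    });
  }
  if (!h.impossibleIfTrue?.length) {
    issues.push({
      severity: "error",
      why: "Without forbidden patterns there is no exclusion test.",
      suggestion: "List at least one observation the hypothesis forbids.",
    });
  }
  return issues;
}
```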
Explanations are keyed to specific concepts and phases:
type ConceptId =
// Phases
| "phase_intake"
| "phase_sharpening"
| "phase_level_split"
| "phase_exclusion_test"
  // ... operators, agents, methodology concepts

Each explanation includes:
- Brief: Always-visible short explanation
- Full: Detailed explanation for beginners
- Key Points: Bulleted takeaways
- Brenner Quote: Relevant quote from the transcripts
- Example: Concrete worked example
import { CoachProvider, useCoach } from "@/lib/brenner-loop/coach-context";
function MyComponent() {
const {
isCoachActive,
effectiveLevel,
shouldShowExplanation,
markConceptSeen,
recordCheckpointPassed,
recordMistakeCaught,
} = useCoach();
if (isCoachActive && shouldShowExplanation("phase_level_split")) {
// Show level-split explanation
}
}

The Confound Detection system automatically identifies likely confounds based on the research domain of a hypothesis. This addresses a key weakness in hypothesis-driven research: confounds are easier to see when you know where to look.
Brenner was explicit about the importance of identifying confounds before experimenting:
"What's the third alternative? Both could be wrong."
But humans are bad at generating confounds spontaneously—we tend to see what we expect to see. The confound detector provides domain-specific libraries of common threats to validity, prompting researchers to consider issues they might otherwise miss.
| Domain | Example Confounds |
|---|---|
| `psychology` | Selection bias, demand characteristics, social desirability, reverse causation, maturation effects, regression to the mean |
| `epidemiology` | Healthy user bias, confounding by indication, temporal ambiguity, surveillance bias, immortal time bias, recall bias |
| `economics` | Endogeneity, omitted variable bias, survivorship bias, simultaneity, measurement error |
| `biology` | Batch effects, genetic background confounding, environmental variation, off-target effects |
| `sociology` | Ecological fallacy, period effects, social network confounding (homophily vs influence) |
| `computer_science` | Data leakage, benchmark overfitting, training data selection bias |
| `neuroscience` | Reverse inference, motion artifacts |
| `general` | Publication bias, multiple comparisons, Hawthorne effect, third variable problem |
- Domain Classification: The system analyzes hypothesis text (statement, mechanism, predictions, domains) using keyword matching to classify the research domain
- Library Selection: Domain-specific confounds plus general confounds are loaded
- Pattern Matching: Each confound template is matched against hypothesis text using keywords and regex patterns
- Likelihood Scoring: Matches are scored based on keyword frequency and pattern matches
- Result Ranking: Confounds are ranked by likelihood and returned with prompting questions
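The domain-classification step can be sketched as keyword counting. The keyword tables and `classifyDomainSketch` below are toy stand-ins: the real libraries are far larger and also use regex patterns.

```typescript
// Toy keyword tables for the sketch; illustrative only.
const domainKeywords: Record<string, string[]> = {
  psychology: ["participant", "survey", "behavior"],
  epidemiology: ["cohort", "incidence", "exposure"],
  biology: ["gene", "cell", "protein"],
};

// Pick the domain whose keywords appear most often in the hypothesis text;
// confidence is the fraction of that domain's keywords that matched.
function classifyDomainSketch(text: string): { domain: string; confidence: number } {
  const lower = text.toLowerCase();
  let best = { domain: "general", hits: 0, total: 1 };
  for (const [domain, words] of Object.entries(domainKeywords)) {
    const hits = words.filter((w) => lower.includes(w)).length;
    if (hits > best.hits) best = { domain, hits, total: words.length };
  }
  return { domain: best.domain, confidence: best.hits / best.total };
}
```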
Each confound in the library includes:
interface ConfoundTemplate {
id: string; // Unique identifier
name: string; // Human-readable name
description: string; // Explanation of the confounding mechanism
domain: ResearchDomain; // Primary domain
keywords: string[]; // Terms suggesting this confound applies
patterns?: RegExp[]; // Structural patterns in hypothesis text
promptQuestions: string[]; // Questions to prompt user consideration
baseLikelihood: number; // Default likelihood when detected (0-1)
}

import { detectConfounds, classifyDomain } from "@/lib/brenner-loop/confound-detection";
// Detect confounds for a hypothesis
const result = detectConfounds(hypothesis, {
threshold: 0.3, // Minimum likelihood to include
maxConfounds: 10, // Max results
forceDomain: undefined, // Auto-detect domain
});
// Result includes:
// - confounds: IdentifiedConfound[]
// - detectedDomain: ResearchDomain
// - domainConfidence: number
// - summary: string
// Get prompting questions for a specific confound
const questions = getConfoundQuestions("selection_bias", "psychology");

The Similarity Search system finds related hypotheses across sessions using semantic embeddings. This surfaces prior work, identifies potential duplicates, and reveals clusters of related research questions.
Research sessions generate hypotheses that may overlap with previous work—sometimes intentionally (refinement), sometimes accidentally (duplication). Without similarity search:
- Researchers waste time re-investigating killed hypotheses
- Related work in other sessions goes undiscovered
- Duplicate effort fragments institutional memory
The similarity system uses hash-based embeddings for client-side computation without external API calls. This approach:
- Works offline (no network required)
- Has no usage limits or costs
- Is deterministic (same input → same embedding)
- Is fast (< 1ms per embedding)
The trade-off: hash-based embeddings capture lexical similarity rather than deep semantic meaning. For research hypotheses with technical vocabulary, this is often sufficient.
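A minimal version of such a hash-based embedding is feature hashing: tokens are hashed into a fixed number of buckets, giving a deterministic lexical vector with no model or network call. The dimension and hash function here are illustrative, not the shipped implementation.

```typescript
const DIM = 64; // Illustrative embedding dimension

// Simple deterministic string hash mapped to a bucket index.
function hashToken(token: string): number {
  let h = 0;
  for (let i = 0; i < token.length; i++) {
    h = (h * 31 + token.charCodeAt(i)) >>> 0; // unsigned 32-bit arithmetic
  }
  return h % DIM;
}

// Token counts accumulated into hashed buckets: same input → same vector.
function embed(text: string): number[] {
  const v = new Array(DIM).fill(0);
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    v[hashToken(token)] += 1;
  }
  return v;
}

// Cosine similarity between two embeddings (0 when either is empty).
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < DIM; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}
```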
Similarity between hypotheses is computed across three dimensions:
| Component | Weight | What It Measures |
|---|---|---|
| Statement | 0.5 | Core hypothesis claim similarity |
| Mechanism | 0.3 | Proposed causal pathway similarity |
| Domain | 0.2 | Research domain overlap (Jaccard) |
The combined score uses cosine similarity for text components and Jaccard similarity for domain overlap.
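The weighting can be sketched directly from the table, assuming the statement and mechanism cosine similarities are already computed; `combinedScore` is an illustration, not the library's function.

```typescript
// Jaccard similarity over domain tag sets (0 when both are empty).
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 0;
  let inter = 0;
  for (const x of a) if (b.has(x)) inter++;
  return inter / (a.size + b.size - inter);
}

// Weighted combination per the documented weights: 0.5 statement,
// 0.3 mechanism, 0.2 domain overlap.
function combinedScore(
  statementSim: number,
  mechanismSim: number,
  domainsA: Set<string>,
  domainsB: Set<string>,
): number {
  return 0.5 * statementSim + 0.3 * mechanismSim + 0.2 * jaccard(domainsA, domainsB);
}
```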
import {
findSimilarHypotheses,
searchHypothesesByText,
clusterSimilarHypotheses,
findDuplicates,
getSimilarityStats,
} from "@/lib/brenner-loop/search/hypothesis-similarity";
// Find hypotheses similar to a query hypothesis
const matches = findSimilarHypotheses(query, candidates, {
minScore: 0.3,
maxResults: 10,
excludeQuery: true,
sessionFilter: ["RS-20251230", "RS-20251231"],
});
// Search by free-form text
const results = searchHypothesesByText(
"morphogen gradient signaling",
candidates
);
// Cluster similar hypotheses (e.g., for deduplication)
const clusters = clusterSimilarHypotheses(hypotheses, 0.5);
// Find potential duplicates (high threshold)
const duplicates = findDuplicates(hypotheses, 0.8);

Each match includes a breakdown of similarity components:
interface SimilarityMatch {
hypothesis: IndexedHypothesis;
score: number; // Overall similarity (0-1)
breakdown: {
statement: number; // Statement similarity
mechanism: number; // Mechanism similarity
domain: number; // Domain overlap
content: number; // Combined content similarity
};
reason: string; // Human-readable explanation
}

Agent Debate Mode enables multi-round adversarial dialogue between tribunal agents. Instead of single-round responses, agents engage in structured debates that sharpen arguments through opposition.
Single-round agent responses are shallow. The agent states a position and stops. Debate forces:
- Clarification: Vague claims get challenged
- Steel-manning: Each side must represent opponents fairly
- Convergence: Points of genuine agreement emerge
- Sharpening: Arguments get refined through opposition
This mirrors Brenner's own practice—he credited conversation with Crick as essential to his thinking:
"We didn't work together on experiments... But we had lunch together every day for thirty years. And that was where we talked."
| Format | Structure | Best For |
|---|---|---|
| `oxford_style` | Proposition vs Opposition with Judge | Testing hypothesis strength |
| `socratic` | Probing questions reveal weaknesses | Finding hidden assumptions |
| `steelman_contest` | Each agent builds and attacks strongest version | Exploring hypothesis space |
interface AgentDebate {
id: string; // DEB-RS20251230-1704067200
sessionId: string; // Parent session
hypothesisId: string; // Hypothesis under debate
format: DebateFormat;
status: DebateStatus;
config: DebateConfig; // Max rounds, timeouts, etc.
participants: DebateParticipant[];
rounds: DebateRound[];
userInjections: UserInjection[]; // Human questions during debate
conclusion?: DebateConclusion;
}

Each debate round is analyzed for:
- New points made: Arguments introduced this round
- Objections raised: Challenges to previous statements
- Concessions given: Agreements with opponents
- Key quotes: Extractable insights
- Setup: Create debate with hypothesis, format, and participants
- Opening: Each participant states initial position
- Rounds: Agents respond to each other (max N rounds)
- Injections: Users can inject questions at any point
- Conclusion: System synthesizes consensus, unresolved points, and key insights
import {
createDebate,
addRound,
addUserInjection,
generateConclusion,
} from "@/lib/brenner-loop/agents/debate";
// Create a debate
const debate = createDebate(
sessionId,
hypothesisId,
"oxford_style",
[
{ role: "devils_advocate", position: "opposition" },
{ role: "hypothesis_generator", position: "proposition" },
{ role: "test_designer", position: "judge" },
]
);
// Add rounds as agents respond
const round = addRound(debate, "devils_advocate", responseContent);
// User can inject questions
addUserInjection(debate, {
content: "What if the gradient is non-linear?",
targetAgent: "proposition",
injectedAt: new Date().toISOString(),
});
// Generate conclusion when debate ends
const conclusion = generateConclusion(debate);

The What-If system enables simulation of evidence impact before running tests. Users can explore how different test outcomes would affect hypothesis confidence, helping prioritize which tests to pursue first.
Researchers often have multiple possible tests they could run. Without simulation:
- They pick tests arbitrarily or based on convenience
- High-impact tests may be deferred in favor of easier ones
- The test that would most discriminate between hypotheses isn't identified
A scenario is a collection of assumed test results with their projected impact:
interface WhatIfScenario {
id: string;
name: string; // "Best case for H1"
sessionId: string;
hypothesisId: string;
startingConfidence: number; // Current confidence
assumedTests: AssumedTestResult[]; // List of assumed outcomes
projectedConfidence: number; // Confidence after all tests
confidenceDelta: number; // Total change
}

The system computes expected information gain for each test:
interface TestComparison {
testId: string;
testName: string;
discriminativePower: DiscriminativePower;
analysis: WhatIfAnalysis;
maxImpact: number; // Maximum potential confidence change
expectedInformationGain: number;
rank: number;
}

Tests are ranked by expected information gain—how much the test reduces uncertainty about the hypothesis regardless of outcome.
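One illustrative way to compute such a score is to weight each assumed outcome's confidence shift by its estimated probability. This is a sketch of the idea, not necessarily the shipped formula; `OutcomeProjection` and `expectedInformationGain` are hypothetical names.

```typescript
interface OutcomeProjection {
  probability: number; // Estimated chance of observing this outcome
  projectedConfidence: number; // Confidence in H if this outcome is observed
}

// Expected absolute confidence change across assumed outcomes. A test whose
// outcomes would all leave confidence near its current value scores ~0; a
// test that swings confidence far in either direction scores high.
function expectedInformationGain(
  currentConfidence: number,
  outcomes: OutcomeProjection[],
): number {
  return outcomes.reduce(
    (sum, o) => sum + o.probability * Math.abs(o.projectedConfidence - currentConfidence),
    0,
  );
}
```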
import {
createScenario,
addTestToScenario,
compareTests,
recommendNextTest,
} from "@/lib/brenner-loop/what-if";
// Create a scenario
const scenario = createScenario(
sessionId,
hypothesisId,
"Best case scenario",
currentConfidence
);
// Add assumed test results
addTestToScenario(scenario, {
testId: "T-RS20251230-001",
testName: "Gradient perturbation",
assumedResult: "supports",
discriminativePower: 4,
});
// Compare multiple tests
const comparisons = compareTests(hypothesisId, testQueue, currentConfidence);
// Get recommendation for next test
const recommendation = recommendNextTest(comparisons);
// Returns: { testId, reason, expectedGain }

The Session State Machine orchestrates Brenner Loop sessions through a deterministic finite state machine. Built on XState, it ensures sessions follow the proper methodology and can be replayed, debugged, and audited.
Research sessions are complex workflows with many possible paths. Without formal state management:
- Sessions can skip required steps
- Transitions happen in invalid orders
- State becomes inconsistent after errors
- Replay and debugging are difficult
The state machine makes session flow explicit, enforceable, and debuggable.
| Phase | Purpose | Entry Conditions |
|---|---|---|
| `idle` | Initial state | Session created |
| `intake` | Research question formulation | User starts session |
| `sharpening` | Hypothesis refinement | Intake complete |
| `level_split` | Program/interpreter separation | Hypothesis formulated |
| `exclusion_test` | Discriminative test design | Level-split applied |
| `object_transpose` | Organism/system selection | Tests designed |
| `scale_check` | Physics/scale constraint validation | System selected |
| `agent_dispatch` | Multi-agent tribunal convened | Scale check passed |
| `synthesis` | Agent outputs merged | Agents responded |
| `evidence_gathering` | External evidence collected | Synthesis complete |
| `revision` | Hypothesis updated based on evidence | Evidence gathered |
| `complete` | Session finished | Revision complete |
| `error` | Error state | Any unrecoverable error |
const sessionMachine = createMachine({
id: "brennerSession",
initial: "idle",
  context: { // Schematic: types shown in place of initial values
session: Session,
hypothesis: HypothesisCard | null,
operatorResults: OperatorResults,
agentResponses: AgentResponse[],
evidence: EvidenceEntry[],
errors: ErrorEntry[],
},
states: {
idle: { on: { START: "intake" } },
intake: { on: { SUBMIT_QUESTION: "sharpening" } },
// ... full state definition
},
});

Each transition can trigger actions:
// Example: transitioning from sharpening to level_split
sharpening: {
on: {
SUBMIT_HYPOTHESIS: {
target: "level_split",
actions: [
"saveHypothesis",
"recordTimestamp",
"notifyTransition",
],
guard: "hypothesisValid",
},
},
}

import { useSessionMachine } from "@/lib/brenner-loop/use-session-machine";
function SessionComponent() {
const {
state, // Current state name
context, // Session context (hypothesis, results, etc.)
send, // Dispatch events
canTransition, // Check if transition is valid
getAvailableTransitions, // List valid next events
} = useSessionMachine(sessionId);
// Check current phase
if (state === "level_split") {
// Show level-split UI
}
// Dispatch transition
const handleSubmit = () => {
send({ type: "SUBMIT_OPERATOR_RESULT", result: levelSplitResult });
};
}

The Undo/Redo system implements a command pattern for reversible operations within sessions. Every significant action can be undone, supporting exploratory research without fear of losing work.
Operations are encapsulated as commands with execute and undo methods:
interface Command<TState, TResult = void> {
id: string;
type: string;
description: string;
execute: (state: TState) => TResult;
undo: (state: TState) => void;
canUndo: () => boolean;
}| Operation | Undoable? | Notes |
|---|---|---|
| Add hypothesis | ✅ | Removes hypothesis |
| Edit hypothesis | ✅ | Restores previous state |
| Kill hypothesis | ✅ | Resurrects hypothesis |
| Add evidence | ✅ | Removes evidence |
| Change confidence | ✅ | Restores previous confidence |
| Apply operator | ✅ | Removes operator results |
| External API calls | ❌ | Cannot undo side effects |
interface UndoManagerState<T> {
history: Command<T>[]; // Executed commands
redoStack: Command<T>[]; // Undone commands available for redo
currentState: T; // Current session state
maxHistory: number; // Maximum undo depth (default: 50)
}import { createUndoManager, executeCommand, undo, redo } from "@/lib/brenner-loop/undoManager";
// Create manager for session
const manager = createUndoManager<SessionState>(initialState, { maxHistory: 100 });
// Execute a command
const editCommand = createEditHypothesisCommand(hypothesisId, newData);
const newState = executeCommand(manager, editCommand);
// Undo last operation
const previousState = undo(manager);
// Redo if available
if (canRedo(manager)) {
const restoredState = redo(manager);
}
// Get history for display
const history = getHistory(manager);
// Returns: [{ description: "Edit H1", timestamp: "...", canUndo: true }, ...]

The Prediction Lock system prevents hindsight bias by locking predictions before test results are known. The Calibration system tracks researcher confidence accuracy over time.
When a test is designed, predictions for each hypothesis must be locked before the test is run. This prevents the common failure mode of "predicting" results that were already known.
interface PredictionLock {
testId: string;
lockedAt: string; // Timestamp when locked
predictions: {
hypothesisId: string;
predictedOutcome: string; // What would happen if H is true
confidence: number; // How confident in this prediction
}[];
lockedBy: string; // Who locked the predictions
unlockable: boolean; // Can predictions be changed?
}Lock lifecycle:
- Test designed → predictions unlocked
- Predictions entered → user locks predictions
- Lock active → predictions cannot be changed
- Test executed → results compared to locked predictions
- Calibration updated based on accuracy
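The enforcement step in this lifecycle can be sketched as follows; the `Lock` shape borrows field names from the `PredictionLock` interface above, and `setPrediction` is a hypothetical helper, not the actual API.

```typescript
interface Lock {
  lockedAt?: string; // Set once predictions are locked
  unlockable: boolean;
  predictions: { hypothesisId: string; predictedOutcome: string }[];
}

// Upsert a prediction, but refuse any change after the lock is active:
// this is what blocks "predicting" results that are already known.
function setPrediction(lock: Lock, hypothesisId: string, outcome: string): boolean {
  if (lock.lockedAt && !lock.unlockable) return false; // hindsight edit rejected
  const existing = lock.predictions.find((p) => p.hypothesisId === hypothesisId);
  if (existing) existing.predictedOutcome = outcome;
  else lock.predictions.push({ hypothesisId, predictedOutcome: outcome });
  return true;
}
```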
The calibration system compares predicted vs actual outcomes over time:
interface CalibrationRecord {
userId: string;
domain: ResearchDomain;
predictions: CalibrationDataPoint[];
calibrationScore: number; // How well-calibrated (0-1)
overconfidenceBias: number; // Positive = overconfident
brier: number; // Brier score for probabilistic accuracy
}import { lockPredictions, checkCalibration } from "@/lib/brenner-loop/prediction-lock";
import { updateCalibration, getCalibrationSummary } from "@/lib/brenner-loop/calibration";
// Lock predictions before running test
const lock = lockPredictions(testId, predictions, userId);
// After test completes, update calibration
updateCalibration(userId, testId, actualResult, lock);
// Get calibration summary
const summary = getCalibrationSummary(userId);
// Returns: { calibrationScore, overconfidenceBias, totalPredictions, accuracy }

The Operator Framework implements Brenner's cognitive operators as composable, reusable functions. Each operator transforms the research state in a specific way, and operators can be composed into pipelines.
| Operator | Symbol | Function |
|---|---|---|
| Level-Split | ⊘ | Separate program from interpreter; message from machine |
| Exclusion-Test | ✂ | Design tests that eliminate hypotheses via forbidden patterns |
| Object-Transpose | ⟂ | Change organism/system until decisive test becomes cheap |
| Scale-Check | ⊞ | Validate against physical/scale constraints |
```typescript
interface Operator<TInput, TOutput> {
  id: string;                    // Unique operator identifier
  name: string;                  // Human-readable name
  symbol: string;                // Mathematical symbol (⊘, ✂, etc.)
  description: string;           // What this operator does
  brennerQuote?: string;         // Relevant Brenner quote
  apply: (input: TInput, context: OperatorContext) => TOutput;
  validate: (input: TInput) => ValidationResult;
  templates: OperatorTemplate[]; // Starter templates for this operator
}
```

The Level-Split operator (⊘) separates levels of description to avoid confusing program with interpreter:
```typescript
interface LevelSplitResult {
  programLevel: {
    description: string;          // What the program/message specifies
    variables: string[];          // Information-bearing variables
  };
  interpreterLevel: {
    description: string;          // How the program is executed
    mechanisms: string[];         // Physical implementation mechanisms
  };
  chastityVsImpotence?: {
    scenario: string;
    chastityExplanation: string;  // No signal was sent
    impotenceExplanation: string; // Signal sent but not received
    discriminatingTest?: string;
  };
}
```

The Exclusion-Test operator (✂) designs tests based on forbidden patterns:
```typescript
interface ExclusionTestResult {
  forbiddenPatterns: {
    pattern: string;              // What cannot occur if H is true
    hypothesesRuledOut: string[]; // Which hypotheses this eliminates
    observationType: string;      // How to observe this pattern
  }[];
  discriminativePower: DiscriminativePower;
  testDesign: {
    procedure: string;
    expectedOutcomes: {
      hypothesisId: string;
      prediction: string;
    }[];
  };
}
```

Operators can be composed into pipelines:
```typescript
import { compose, pipe } from "@/lib/brenner-loop/operators/framework";

// The signature Brenner move:
// (⌂ ∘ ✂ ∘ ≡ ∘ ⊘) powered by (↑ ∘ ⟂ ∘ 🔧) constrained by (⊞)
const brennerPipeline = pipe(
  levelSplit,       // Separate levels
  invariantExtract, // Find what survives
  exclusionTest,    // Design killing experiments
  materialize,      // Compile to decision procedure
);
const result = brennerPipeline(hypothesis, context);
```

```typescript
import { applyLevelSplit } from "@/lib/brenner-loop/operators/level-split";
import { applyExclusionTest } from "@/lib/brenner-loop/operators/exclusion-test";
import { applyScaleCheck } from "@/lib/brenner-loop/operators/scale-check";

// Apply level-split to a hypothesis
const splitResult = applyLevelSplit(hypothesis, {
  sessionId,
  previousResults: [],
});

// Design exclusion tests
const testResult = applyExclusionTest(hypothesis, {
  sessionId,
  levelSplitResult: splitResult,
});

// Validate against scale constraints
const scaleResult = applyScaleCheck(hypothesis, {
  sessionId,
  domain: "biology",
  constraints: [
    { type: "spatial", min: "1nm", max: "1mm" },
    { type: "temporal", min: "1ms", max: "1hr" },
  ],
});
```

The system is designed for offline-first operation with network resilience built into the storage layer.
Operations that require network access are queued when offline and replayed when connectivity returns:
```typescript
interface OfflineQueue {
  operations: QueuedOperation[];
  status: "idle" | "flushing" | "offline";
  lastFlushAttempt?: string;
  failedOperations: FailedOperation[];
}
```

Local-first storage with sync:
- Primary: IndexedDB for structured data (hypotheses, tests, evidence)
- Fallback: localStorage for small data and queue state
- Sync: Background sync when connectivity restored
- Conflict Resolution: Last-write-wins with audit trail
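The queue-and-replay pattern described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the repo's actual module; `ReplayQueue`, `enqueueOrSend`, and the constructor signature are hypothetical names.

```typescript
type QueuedOperation = { id: string; kind: string; payload: unknown };

// Hypothetical sketch of queue-and-replay: send immediately when online,
// otherwise hold operations and replay them in order once connectivity returns.
class ReplayQueue {
  private operations: QueuedOperation[] = [];
  status: "idle" | "flushing" | "offline" = "idle";

  constructor(
    private send: (op: QueuedOperation) => Promise<void>,
    private isOnline: () => boolean,
  ) {}

  // Send immediately when online; otherwise queue for later replay.
  async enqueueOrSend(op: QueuedOperation): Promise<void> {
    if (!this.isOnline()) {
      this.status = "offline";
      this.operations.push(op);
      return;
    }
    await this.send(op);
  }

  // Replay queued operations in FIFO order once connectivity returns.
  async flush(): Promise<number> {
    if (!this.isOnline()) return 0;
    this.status = "flushing";
    let flushed = 0;
    while (this.operations.length > 0) {
      await this.send(this.operations[0]);
      this.operations.shift(); // only drop the op after a successful send
      flushed++;
    }
    this.status = "idle";
    return flushed;
  }

  get pending(): number {
    return this.operations.length;
  }
}
```

Removing each operation only after its send succeeds means a failure mid-flush leaves the remaining operations queued for the next attempt.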
For filesystem operations, the system uses advisory file locks to prevent concurrent modification:
```typescript
import { acquireLock, releaseLock, isLocked } from "@/lib/storage/file-lock";

// Acquire exclusive lock
const lock = await acquireLock(filePath, {
  ttl: 30000,       // Lock expires after 30 seconds
  retries: 3,       // Retry 3 times if locked
  retryDelay: 1000, // Wait 1 second between retries
});

try {
  // Perform file operations
  await writeFile(filePath, data);
} finally {
  await releaseLock(lock);
}
```

The Citation System provides parsing and formatting for Brenner transcript references, enabling precise anchoring of claims to primary sources.
Brenner transcript sections are referenced as §n where n is the section number (1-236):
```typescript
// Parse section IDs from text
const ids = parseBrennerSectionIds("§58, §78-82, §161");
// Returns: [58, 78, 79, 80, 81, 82, 161]

// Extract from free-form text
const extracted = extractBrennerSectionIdsFromText(
  "As Brenner noted in §58 and later expanded in §78..."
);
// Returns: [58, 78]
```

```typescript
import { formatCitation, formatCitationRange } from "@/lib/brenner-loop/artifacts/citations";

// Format single citation
formatCitation(58); // "§58"

// Format range
formatCitationRange([58, 59, 60, 78, 79]); // "§58-60, §78-79"

// Format with verbatim/inference marker
formatCitation(58, { verbatim: true });  // "§58 [verbatim]"
formatCitation(58, { verbatim: false }); // "§58 [inference]"
```

Citations are validated against the actual transcript:
```typescript
import { validateAnchors } from "@/lib/brenner-loop/artifacts/citations";

const result = validateAnchors(["§58", "§300", "§invalid"]);
// Returns: {
//   valid: ["§58"],
//   invalid: ["§300", "§invalid"],
//   errors: ["§300 exceeds transcript length (236)", "Invalid format: §invalid"]
// }
```

The public website at brennerbot.org serves visitors who want to explore Brenner Loop sessions without running local infrastructure. Demo mode provides static fixture data that showcases the full workflow.
When Lab Mode is disabled (the default for public deployments), session pages automatically detect the public context and display demo content:
- Public host detection: pages check `window.location.hostname` against known public domains
- Demo thread routing: thread IDs starting with `demo-` serve fixture data instead of Agent Mail queries
- Feature previews: locked features display explanatory overlays with "Coming Soon" messaging
Demo sessions are pre-built examples that demonstrate the complete Brenner Loop workflow:
| Demo Session | Phase | Description |
|---|---|---|
| `demo-bio-nanochat-001` | `compiled` | Bio-Inspired Nanochat research: vesicle depletion vs frequency penalty |
Each demo session includes:
- KICKOFF message: Research question with working hypotheses and Brenner anchors
- Agent DELTAs: Structured contributions from hypothesis generator, test designer, and adversarial critic
- COMPILED artifact: Final merged artifact with all sections complete
```typescript
import { isDemoThreadId } from "@/lib/demo-mode";
import { getDemoSession, getDemoThreads } from "@/lib/fixtures/demo-sessions";

// Check if viewing demo content
if (isDemoThreadId(threadId)) {
  const session = getDemoSession(threadId);
  // Render with fixture data
} else {
  // Fetch from Agent Mail
}

// List all demo sessions
const demoThreads = getDemoThreads();
```

```
apps/web/src/
├── lib/
│   ├── demo-mode.ts            # Demo detection utilities
│   └── fixtures/
│       └── demo-sessions.ts    # Static session fixtures
└── components/sessions/
    ├── DemoSessionsView.tsx    # Demo-aware session list
    └── DemoFeaturePreview.tsx  # Feature preview overlays
```
The system includes server-side event tracking via the GA4 Measurement Protocol, enabling reliable conversion tracking that bypasses client-side ad blockers.
```
Client Event → POST /api/track → Server Validation → GA4 Measurement Protocol
```
The tracking API provides:
- Rate limiting: 60 requests/minute per IP with automatic cleanup
- Payload validation: Schema enforcement with size limits
- Sanitization: Input cleaning for GA4 compliance
- Timeout handling: 3-second abort for external calls
```typescript
// Rate limit configuration
const RATE_LIMIT_WINDOW_MS = 60 * 1000; // 1 minute window
const RATE_LIMIT_MAX_REQUESTS = 60;     // 60 requests per window
const MAX_MAP_SIZE = 10000;             // Max tracked IPs
```

Rate limiting uses the `X-Real-IP` header (set by the Vercel edge, not spoofable by clients) rather than `X-Forwarded-For`, which can be manipulated.
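A minimal fixed-window limiter consistent with these constants might look like the sketch below. It is illustrative only; the actual `/api/track` handler may differ in details such as its eviction policy.

```typescript
const RATE_LIMIT_WINDOW_MS = 60 * 1000; // 1 minute window
const RATE_LIMIT_MAX_REQUESTS = 60;     // 60 requests per window
const MAX_MAP_SIZE = 10000;             // Max tracked IPs

const hits = new Map<string, { count: number; windowStart: number }>();

// Returns true when the caller has exceeded the per-IP budget for the window.
// `now` is injectable for testing; it defaults to the real clock.
function isRateLimited(ip: string, now: number = Date.now()): boolean {
  // Automatic cleanup: evict expired entries once the map grows too large.
  if (hits.size >= MAX_MAP_SIZE) {
    for (const [key, entry] of hits) {
      if (now - entry.windowStart >= RATE_LIMIT_WINDOW_MS) hits.delete(key);
    }
  }
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart >= RATE_LIMIT_WINDOW_MS) {
    hits.set(ip, { count: 1, windowStart: now }); // start a fresh window
    return false;
  }
  entry.count++;
  return entry.count > RATE_LIMIT_MAX_REQUESTS;
}
```

A fixed window is the simplest scheme that fits the stated constants; a sliding window would smooth out the burst allowed at window boundaries.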
```typescript
// POST /api/track
interface TrackRequest {
  client_id: string; // GA client ID (max 100 chars)
  events: Array<{
    name: string;    // Event name (alphanumeric, max 40 chars)
    params?: Record<string, string | number | boolean>;
  }>;
  user_id?: string;
  user_properties?: Record<string, string | number | boolean>;
}
```

| Measure | Implementation |
|---|---|
| Payload size | Max 32KB |
| Events per request | Max 10 |
| Parameter count | Max 25 per event |
| String truncation | Max 100 chars |
| Prototype pollution | Blocked (`__proto__`, `constructor`, `prototype`) |
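The prototype-pollution and truncation measures from the table can be illustrated with a small sanitizer. `sanitizeParams` is a hypothetical helper name; the real implementation may differ.

```typescript
const BLOCKED_KEYS = new Set(["__proto__", "constructor", "prototype"]);
const MAX_STRING_LENGTH = 100;

// Drop dangerous keys and truncate long strings before forwarding params.
function sanitizeParams(
  params: Record<string, string | number | boolean>,
): Record<string, string | number | boolean> {
  const clean: Record<string, string | number | boolean> = {};
  for (const [key, value] of Object.entries(params)) {
    if (BLOCKED_KEYS.has(key)) continue; // prototype-pollution guard
    clean[key] =
      typeof value === "string" ? value.slice(0, MAX_STRING_LENGTH) : value;
  }
  return clean;
}
```

Skipping the blocked keys (rather than copying them) matters because assigning to `clean["__proto__"]` would mutate the object's prototype instead of adding a data property.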
The system implements defense-in-depth security across all API endpoints.
The experiments endpoint (/api/experiments) executes commands for test runs. A strict whitelist prevents arbitrary code execution:
```typescript
const ALLOWED_COMMANDS = new Set([
  // Package managers / runners
  "bun", "bunx", "npm", "npx", "yarn", "pnpm", "node", "deno",
  // Python
  "python", "python3", "pip", "pip3", "poetry", "uv",
  // Testing frameworks
  "pytest", "vitest", "jest", "mocha",
  // Build tools
  "make", "cargo", "go", "rustc",
  // Version control
  "git",
  // Shell (requires lab mode auth)
  "bash", "sh",
  // Safe utilities
  "echo", "cat", "ls", "pwd", "which", "env", "printenv",
  "date", "wc", "head", "tail", "grep", "find", "diff", "sort", "uniq",
]);
```

Path injection prevention: commands containing `/` or `\` are rejected, preventing bypass via `./malicious` or `/path/to/evil`.
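The whitelist check and path rejection combine into a single guard along these lines. `isCommandAllowed` is an illustrative name and the command list is abridged; the endpoint's actual validation may do more.

```typescript
const ALLOWED_COMMANDS = new Set([
  "bun", "node", "git", "echo", "pytest", // abridged for the example
]);

// Reject path-like inputs (e.g. "./malicious" or "/usr/bin/evil") that would
// bypass name-based matching, then require an exact whitelist hit.
function isCommandAllowed(command: string): boolean {
  if (command.includes("/") || command.includes("\\")) return false;
  return ALLOWED_COMMANDS.has(command);
}
```

Checking for path separators before the set lookup means even a whitelisted name can never be smuggled in as part of a path.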
Lab secrets are compared using HMAC-based constant-time comparison to prevent timing attacks:
```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

function safeEquals(a: string, b: string): boolean {
  // HMAC normalizes both inputs to fixed-length buffers,
  // eliminating timing leaks from length differences
  const hmacKey = "brenner-auth-compare";
  const hmacA = createHmac("sha256", hmacKey).update(a).digest();
  const hmacB = createHmac("sha256", hmacKey).update(b).digest();
  return timingSafeEqual(hmacA, hmacB);
}
```

Failed authentication returns HTTP 404 (not 401/403) to prevent endpoint enumeration and reduce information leakage about protected resources.
The delta parser and related utilities are designed to be tolerant of format variations while maintaining correctness.
The delta parser accepts common agent "hallucinations" rather than rejecting entire contributions:
| Scenario | Handling |
|---|---|
| `target_id` in ADD operation | Silently ignored (normalized to null) |
| Extra whitespace in anchors | Trimmed (`§ 42` → `§42`) |
| Missing optional fields | Defaults applied |
```typescript
// ADD operations: target_id is normalized to null regardless of input
// This tolerates agents that hallucinate IDs for new entities
target_id: operation === "ADD" ? null : (typeof target_id === "string" ? target_id : null)
```

The operator library parser handles markdown format variations:
- Section boundaries: lookahead patterns instead of exact `\n\n`
- Case insensitivity: `**Definition**` and `**definition**` both work
- Optional backticks: canonical tags are accepted with or without backticks
Transcript anchors support optional whitespace:
```typescript
// Matches: §42, § 42, §42-45, § 42 - 45
const anchorPattern = /§\s*(\d+)(?:\s*-\s*(\d+))?/g;
```

The storage layer implements several optimizations for large session histories.
Instead of rebuilding the entire cross-session index on every mutation, storage modules perform targeted updates:
```typescript
// When saving to a specific session, only that session's entries are refreshed
async updateIndexForSessionUnlocked(sessionId: string, items: T[]): Promise<void> {
  // 1. Read existing index
  const index = await this.loadIndex();

  // 2. Filter out entries for this session
  const otherEntries = index.entries.filter(e => e.sessionId !== sessionId);

  // 3. Create new entries from saved items
  const newEntries = items.map(item => this.toIndexEntry(item));

  // 4. Merge and write
  index.entries = [...otherEntries, ...newEntries];
  await this.writeIndex(index);
}
```

Falls back to a full rebuild if the index is missing or corrupt.
All storage modules support both compound and simple ID formats:
| Format | Pattern | Example | Use Case |
|---|---|---|---|
| Compound | `{prefix}-{session}-{seq}` | `H-RS20251230-001` | Cross-session uniqueness |
| Simple | `{prefix}{n}` | `H1`, `T2` | Artifact-merge generation, quick references |
The compound format includes the session ID, enabling fast lookups without scanning all files. The simple format supports backwards compatibility and quick artifact references.
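A helper that distinguishes the two formats might look like this. It is a hypothetical sketch; the repo's actual parsing may differ, for example in how prefixes are constrained.

```typescript
type ParsedId =
  | { format: "compound"; prefix: string; sessionId: string; seq: number }
  | { format: "simple"; prefix: string; n: number };

// Classify an ID as compound ({prefix}-{session}-{seq}) or simple ({prefix}{n}).
function parseId(id: string): ParsedId | null {
  const compound = id.match(/^([A-Z]+)-(.+)-(\d+)$/);
  if (compound) {
    return {
      format: "compound",
      prefix: compound[1],
      sessionId: compound[2], // greedy match still stops before the final -seq
      seq: Number(compound[3]),
    };
  }
  const simple = id.match(/^([A-Z]+)(\d+)$/);
  if (simple) {
    return { format: "simple", prefix: simple[1], n: Number(simple[2]) };
  }
  return null;
}
```

Trying the compound pattern first matters: a simple-looking ID can never contain `-`, so the two patterns are disjoint and the order only affects failure speed.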
Delete operations extract session IDs from compound IDs when possible:
```typescript
async deleteHypothesis(id: string): Promise<boolean> {
  // Fast path: extract session from compound ID
  const match = id.match(/^H-(.+)-\d+$/);
  if (match) {
    const sessionId = match[1];
    // Load only the relevant session file
    const hypotheses = await this.loadSessionHypotheses(sessionId);
    // ...
  }

  // Slow path: scan all sessions
  const hypothesis = await this.getHypothesisById(id);
  // ...
}
```

For filesystem operations, advisory file locks prevent concurrent modification:
```typescript
import { withFileLock } from "@/lib/storage/file-lock";

await withFileLock(baseDir, "hypotheses", async () => {
  // Safe to read-modify-write
  const data = await loadFile();
  data.items.push(newItem);
  await saveFile(data);
});
```

The lock implementation uses atomic file operations with TTL-based expiry for crash recovery.
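One way to get atomic acquisition with TTL-based crash recovery is an exclusive file create (the `wx` flag fails if the file already exists) plus a timestamp check. This is a simplified sketch of the pattern, not the repo's implementation; the reclaim path has a small race window that a production version would close.

```typescript
import { readFile, rm, writeFile } from "node:fs/promises";

// Try to take the lock: atomically create the lock file, or reclaim it if
// its recorded timestamp shows the previous holder's TTL has expired.
async function tryAcquire(lockPath: string, ttlMs: number): Promise<boolean> {
  const now = Date.now();
  try {
    await writeFile(lockPath, String(now), { flag: "wx" }); // atomic create
    return true;
  } catch {
    // Lock file exists: reclaim only if the holder's TTL has expired.
    try {
      const acquiredAt = Number(await readFile(lockPath, "utf8"));
      if (now - acquiredAt > ttlMs) {
        await rm(lockPath); // crash recovery: stale lock
        await writeFile(lockPath, String(now), { flag: "wx" });
        return true;
      }
    } catch {
      // Raced with another process between stat/read and create: give up.
    }
    return false;
  }
}

async function release(lockPath: string): Promise<void> {
  await rm(lockPath, { force: true }); // force: no error if already gone
}
```

Because the lock is advisory, correctness depends on every writer going through `tryAcquire`/`release`; the filesystem does not enforce it.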