Skip to content

feat(openclaw): native OpenClaw integration with domain routing and decision tracing#116

Merged
saschabuehrle merged 54 commits intomainfrom
codex/feature-openclaw-native
Feb 12, 2026
Merged

feat(openclaw): native OpenClaw integration with domain routing and decision tracing#116
saschabuehrle merged 54 commits intomainfrom
codex/feature-openclaw-native

Conversation

@saschabuehrle
Copy link
Copy Markdown
Collaborator

Summary

  • Native OpenClaw OpenAI-compatible server (/v1/chat/completions) with streaming + non-streaming support
  • Hybrid domain detection (ML semantic + rule-based) for intelligent cascade routing
  • Decision trace JSONL logging for audit and observability
  • Pre-router classifier with category → domain mapping
  • Sentinel stripping (NO_REPLY) to prevent cascade artifacts leaking to clients
  • Per-channel routing strategies and OpenClaw cron/config-based routing
  • Wired use_hybrid through all 5 preset functions so domain detection activates for OpenClaw

Test plan

  • pytest tests/ -x -q — 878 passed, 0 failures
  • black --check and ruff check clean
  • CI green on GitHub Actions
  • Manual: verify domain detection activates with use_hybrid=True (agent logs show "Domain detection: SEMANTIC")
  • Manual: verify NO_REPLY content returns empty string to client

saschabuehrle and others added 30 commits February 5, 2026 12:19
Copy tool_calls from metadata to message.tool_calls for standard OpenAI
format compatibility. Also set finish_reason to "tool_calls" when tools
are present. Add /health endpoint returning {"status": "ok"}.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- 10 tool call scenarios (single, multiple, nested, error handling)
- 10 Q&A scenarios (factual, code, math, creative)
- 4x10 multi-turn conversations (project planning, debug, travel, language)
- 2 multi-agent tool calls (parallel, sequential)
- 2 agent loop tests
- 2 streaming/webhook tests
- Stats validation after each batch
- Response metadata validation (cascadeflow fields)

Generated by Codex task_e_6984e6bd31008333944422b002b2cc78
…assertion

- cascade_overhead_ms → cascade_overhead
- avg_cascade_overhead_ms → avg_cascade_overhead
- Increase tolerance for acceptance rate from 0.02 to 0.05
- Remove tool_calls >= tool_queries assertion (invalid for Q&A tests)
Problem: When tool results are provided (role='tool' in messages), the drafter
correctly responds with TEXT summary. But cascade quality check rejected it with
'no_tool_calls_generated' because has_tools=true.

Solution:
- Add _has_tool_result_in_messages() helper function
- Modify _should_accept_tool_draft() to check for tool results in messages
- If tool results provided and draft has text content (no tool_calls), accept it

This fixes the agent loop flow:
1. User: 'Get weather in London'
2. Assistant: tool_call(weather, London)
3. Tool: 'Sunny, 22°C'
4. Assistant: 'The weather in London is sunny at 22°C' ← NOW ACCEPTED

Also fixes test field names and simplifies provider configs.
Two fixes for agent loop support:

1. normalize_messages() now preserves tool_calls, tool_call_id, name fields
   - Previously stripped all fields except role/content
   - OpenAI tool calling requires these fields

2. Anthropic provider: _convert_messages_to_anthropic()
   - Converts role="tool" → role="user" with tool_result content blocks
   - Converts assistant tool_calls → tool_use content blocks
   - Extracts system messages (Anthropic wants them separate)
   - Merges consecutive same-role messages (Anthropic requirement)

All 3 configs now pass 9/9 tests:
- openai-only: 100% acceptance, 752ms latency
- anthropic-only: 100% acceptance, 867ms latency
- mixed: 100% acceptance, 732ms latency
Three provider preset configurations for OpenClaw integration:

1. mixed-domains.yaml - Full 12-domain routing
   - medical/legal/finance → opus verifier (strict)
   - code/technical/math → gpt-4o verifier
   - creative/conversation → claude models

2. mixed-domains-v2.yaml - Tuned thresholds (0.70 for medical/legal)

3. mixed-domains-fixed.yaml - Experimental with tool routing

Tested: 62.5% acceptance, 0.88 quality mean, proper domain detection
…l bypass, forced escalation

- Added COMPLEXITY_SIGNALS in complexity.py for proof/math/implementation detection
- Removed trivial-mode auto-pass bypass in quality.py (now validates hedging/hallucination)
- Added forced escalation in cascade.py (expert always, hard when quality<0.85)
- Added comprehensive tests (25 tests, all passing)

Fixes: Hard/Expert queries now properly escalate instead of auto-accepting
…thresholds

- Add 7 exemplars each for CONVERSATION, FACTUAL, FINANCIAL domains
- Add DOMAIN_THRESHOLDS dict with domain-specific confidence levels
- Enable hybrid mode by default (rule-based + semantic combined)
- Add 16 comprehensive tests for new functionality

Fixes domain detection gaps where finance/conversation/factual queries
were incorrectly routed to GENERAL domain.
- pip install cascadeflow[openclaw] auto-enables FastEmbed v0.7+
- Added conversation domain keywords for rule-based fallback
- Covers small talk, greetings, casual chat patterns
- Ensures good routing when FastEmbed unavailable
The _keyword_matches method used rf"\\b..." which in a raw string
produces literal \\b (escaped backslash + b) instead of the regex
word boundary \b. This caused ALL single-word keyword matches to
fail silently, dropping domain detection accuracy to ~16%.

Fixed by using r"\b" + re.escape(keyword) + r"\b" which correctly
produces word boundary assertions. Also added missing keywords for
math (prove, irrational, eigenvalue, etc.), medical (symptoms,
diabetes, hypertension, etc.), finance (mortgage, tax, inflation,
etc.), factual (when was, who was, signed, etc.), and reasoning
domains. Detection accuracy now 100% across 50 test queries.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Domain Detection Fixes:
- Enhanced CONVERSATION with greetings, small talk, casual chat patterns
- Enhanced FACTUAL with knowledge-seeking patterns (capital, invented, when was)
- Enhanced FINANCIAL with retirement, savings, pension, ETF, hedge fund
- Enhanced MEDICAL with anatomy, physiology, organ-specific terms
- Reduced GENERAL domain to true fallback (removed knowledge patterns)

Results (60+ queries tested):
- Overall: 90.2% accuracy (was 70.5%)
- Conversation: 100% (was 12%)
- Factual: 88% (was 50%)
- Medical: 100%
- Legal: 100%
- Financial: 90%

Remaining edge cases are acceptable domain overlaps:
- 'How many planets' → math (counting)
- 'Compare ETF vs mutual fund' → comparison (legitimate)
- 'Python vs JavaScript' → code (language keywords)
- test_detect_general_domain: Use truly generic query (not knowledge questions)
- test_rag_vs_general_separation: Expect FACTUAL for knowledge questions
- test_low_confidence_fallback: Use non-greeting query (greetings → CONVERSATION now)

These tests now reflect the improved domain detection behavior where:
- Knowledge questions (capital of, population of) → FACTUAL
- Greetings (hello, hi) → CONVERSATION
- Only truly ambiguous queries → GENERAL
- Add FASTEMBED_DOMAIN_MODELS registry (e5-large, bge-large, MiniLM)
- Add model_name param to SemanticDomainDetector (default: e5-large-v2)
- Implement adaptive hybrid blending:
  - Rule lock for high-confidence decisions
  - Semantic override when confident + separated
  - Adaptive weights based on agreement
- Add tuning constants (HYBRID_RULE_LOCK_*, HYBRID_SEMANTIC_*)
- Add tests for hybrid behavior with mocked embedder
- Create fastembed-investigation-report.md
P0 Measurement Correctness:
- Add cascadeflow/pricing/pricebook.py with PriceBook and PricingResolver
- Add cascadeflow/schema/usage.py with canonical Usage schema
- Add tests/test_pricing_resolver.py

P0 Tool Loop Engine:
- Update cascadeflow/agent.py with tool loop closure support
- Add tests/test_agent_p0_tool_loop.py

Audit Reports:
- dx-plan-v2-audit-report.md (P0 status audit)
- fastembed-audit-report.md (plan vs implementation)

Tests:
- Add semantic domain/complexity detector tests
- TypeScript usage type tests

Co-authored-by: Codex <codex@openai.com>
saschabuehrle and others added 21 commits February 6, 2026 09:27
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix baseline_cost calculation when cost_saved is None
- Use bigonly_cost from metadata when available
- Implement fallback estimation (2x for cascade, 1x for direct)
- Resolves issue where savings_percent showed 0% despite cascade working
- Now properly calculates baseline as cost if ALL queries used verifier model
…e-en-v1.5

- intfloat/e5-large-v2 not supported in FastEmbed 0.7.4
- BAAI/bge-base-en-v1.5 is supported and tested working
- Updated SemanticDomainDetector default model
- Updated FASTEMBED_DOMAIN_MODELS list
- Updated tests to use new model

Tested: SemanticDomainDetector.is_available=True, detection works
All extras now require FastEmbed 0.7.0+ which supports:
- BAAI/bge-base-en-v1.5 (default for SemanticDomainDetector)
- BAAI/bge-large-en-v1.5
- sentence-transformers/all-MiniLM-L6-v2

Prevents cryptic model-not-found errors with older versions.
Allows agents to bypass plan review for evaluation-only tasks
(no code modifications, just running tests/benchmarks).
- Default use_hybrid=False in CascadeAgent constructor
- Add use_hybrid parameter to preset functions and auto_agent
- OpenClaw integration passes use_hybrid=True when creating agent
- Standard CascadeFlow usage now uses pure semantic/rule-based detection
- Hybrid mode (ML + rules combined) reserved for OpenClaw integration only
- domain_config.py: Remove require_verifier=True for medical/legal domains
- domain_config.py: Lower thresholds from 0.95/0.90 to 0.70
- openai-only.yaml: Add medical/legal domains, pure OpenAI stack
- anthropic-only.yaml: Pure Anthropic stack (haiku→sonnet)
- mixed-anthropic-openai.yaml: gpt-5-mini drafter, sonnet verifier

All three configs tested successfully:
- Medical queries now use cascade (not direct routing)
- Draft acceptance working for all providers
The fallback savings calculation was using 2x multiplier, but the actual
cost difference between gpt-5-mini ($0.00025) and gpt-5 ($0.015) is 60x.

This fix corrects the savings percentage calculation when bigonly_cost
metadata is not available, properly reflecting the significant cost
savings from cascade acceptance.
The telemetry.record() was being called with the raw SpeculativeResult
before _build_cascade_result() ran the CostCalculator to fix cost_saved.

This caused stats to show tiny negative savings even when drafts were
accepted, because the raw cascade module uses different cost logic.

Now the record() call happens after the CostCalculator corrects the
cost values, giving accurate savings stats (98% instead of -1.7%).
- Add decisions.jsonl writer (thread-safe, 50MB rotation, env-configurable)
- Log one-line DECISION summary per request via cascadeflow.openclaw logger
- Enrich DRAFT_DECISION events with alignment_score, threshold, complexity
- Add domain parameter to telemetry.record() with per-domain stats tracking
- Add by_domain section to /stats export with acceptance rates and averages
Incorporates gateway consolidation, production readiness improvements,
PriceBook expansion, tool executor implementation, and OpenAI server
hardening from main. Resolves all merge conflicts, keeping decision
trace and domain stats additions from feature branch.

All 878 tests pass.
…ntinels

Forward use_hybrid → enable_domain_detection + use_hybrid in all 5
preset functions so organic OpenClaw chat messages get semantic domain
classification instead of domain=None.

Add _strip_sentinel() to openai_server.py and apply it in both
streaming and non-streaming paths, _has_content(), and the fallback
loop in _build_openai_response(). Prevents "NO_REPLYNO_REPLY" from
being returned to clients when system prompts use the NO_REPLY
convention.
Auto-format 12 files with black, remove duplicate imports in agent.py,
remove unused import in gateway.py, fix import ordering in context.py,
remove unused variable tokens_used, split compound assertion.
@saschabuehrle saschabuehrle merged commit 0ee7da7 into main Feb 12, 2026
19 checks passed
@saschabuehrle saschabuehrle deleted the codex/feature-openclaw-native branch February 12, 2026 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant