A regression guard for AI agents.
Wrap any LangGraph / Hermes / custom agent node. Record traces. Detect success-rate or latency regression. Roll back bad config changes. Preserve event evidence.
Latest verified release: v0.1.1 · Agent data strategy: AGENT_DATA_STRATEGY.md · Eval labels: docs/ANNOTATION_GUIDELINE.md · External repro: EXTERNAL_REPRO.md · Hermes-style guard: docs/HERMES_SKILL_GUARD.md
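Positioning (translated from Chinese): self-improving-loop is a regression-protection layer for AI agents. It wraps LangGraph / Hermes / custom agent nodes, records traces, detects success-rate or latency regressions, rolls back bad config, and preserves reviewable event evidence.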
Most "self-improving agent" projects stop at "log the failures, let the next run read the log". That's a methodology, not a loop. This package is the loop, implemented as a compact pure-stdlib Python runtime — no framework lock-in, no LLM dependency, no cloud.
The optional Yijing strategy is an internal policy router: runtime signals are mapped into six engineering lines, recognized as a hexagram state, converted into a bounded policy patch, then verified through the same rollback guard as any other strategy. It is a state machine, not fortune telling.
Overhead is negligible for normal LLM/HTTP agent calls; for sub-10ms in-memory functions, measure before wrapping.
Wrap any function, get:
- 📊 Automatic execution tracking (success rate, latency, rolling window)
- 🗄 Trace storage choice: readable JSONL by default, SQLite/WAL for multi-worker deployments
- 🧠 Adaptive thresholds per agent profile (high-freq / mid-freq / low-freq / critical)
- ☯️ Optional hexagram state strategy: six runtime dimensions → policy patch
- 🛠 Strategy hook for proposing improvement configs when failure patterns are detected
- 🧩 ConfigAdapter contract for real config backup / patch / restore
- 🛡 Rollback trigger when the new config regresses (>10% success drop, >20% latency increase, or 5 consecutive failures)
- 📬 Pluggable notifier (stub by default — swap in Telegram / Slack / whatever)
Extracted from TaijiOS, where the same six-line state model is used for production-scale agent workloads.
Demo artifacts: terminal transcript · asciinema cast
Latest verified GitHub release:
```bash
pip install https://github.com/yangfei222666-9/self-improving-loop/releases/download/v0.1.1/self_improving_loop-0.1.1-py3-none-any.whl
```

From source:

```bash
pip install git+https://github.com/yangfei222666-9/self-improving-loop.git@v0.1.1
```

Zero required dependencies. Everything is Python stdlib, including optional SQLite trace storage via sqlite3.
What this is not:
- ...methodology doc. Many "self-improving agent" repos are markdown templates that ask you to log learnings to CLAUDE.md. This is the runtime loop that does it for you.
- ...heavyweight framework. Compact stdlib code. Drop it next to your existing code. No decorators forced on you. No background process.
- ...LLM-dependent. The analysis is statistical, not LLM-based. If you want LLM-authored config tweaks, pass an improvement_strategy object and ask your favorite LLM inside its analyze() method.
Stable:
- Execution tracking
- Adaptive failure thresholds
- JSONL trace storage with a cross-process lock
- Optional SQLite/WAL trace storage
- Strategy-triggered config patching
- ConfigAdapter-backed rollback when a patch regresses
- Optional Yijing hexagram strategy as a deterministic state router: runtime traces -> six engineering lines -> hexagram -> bounded policy patch
Experimental:
- Choosing the best config patch automatically. The loop calls your improvement_strategy; it does not pretend to know your agent better than your production tests.
- Full 64-hexagram policy coverage. The first Yijing strategy supports only eight core states and should be treated as a bounded policy router.
```python
from self_improving_loop import SelfImprovingLoop

loop = SelfImprovingLoop()

def my_agent_work():
    # Your actual agent call / LLM chain / tool invocation
    return {"status": "ok", "data": ...}

result = loop.execute_with_improvement(
    agent_id="my-agent",
    task="handle user query",
    execute_fn=my_agent_work,
)

if result["improvement_triggered"]:
    print(f"Strategy applied {result['improvement_applied']} config tweaks")
if result["rollback_executed"]:
    print(f"Rolled back because: {result['rollback_executed']['reason']}")
```

That's it. The loop watches every execution and decides when to trigger tuning.
To mutate and restore real agent config, provide a strategy hook plus either a
ConfigAdapter or the legacy strategy current_config/apply/rollback methods.
From a repo checkout, start here:
```bash
python3 examples/01_basic_tracking.py
python3 examples/02_config_rollback.py
python3 examples/03_langgraph_adapter.py
python3 examples/04_yijing_strategy.py
python3 examples/05_langgraph_regression_guard.py
python3 examples/06_hermes_skill_regression_guard.py
```

They prove the six important contracts:
- 01_basic_tracking.py: wrapper records traces and exposes stats.
- 02_config_rollback.py: a bad patch is applied, regression is detected, and ConfigAdapter.rollback_config() restores the previous config.
- 03_langgraph_adapter.py: a LangGraph-style node can be wrapped without adopting a new framework.
- 04_yijing_strategy.py: traces become six runtime lines, a hexagram state, and a bounded policy patch.
- 05_langgraph_regression_guard.py: a LangGraph-style node regresses, traces are recorded, rollback runs, and an event trail survives.
- 06_hermes_skill_regression_guard.py: a Hermes-style skill call regresses, rollback restores the skill config, and an event trail survives.
For the verbose rollback event trail, run:
```bash
python3 examples/regression_rollback_demo.py --data-dir .repro-demo
python3 examples/verify_regression_rollback_event_trail.py .repro-demo/regression_rollback_event_trail.jsonl
```

For the bundled agent-failure eval packet, run:

```bash
python3 examples/verify_agent_eval_cases.py examples/agent_eval_cases.jsonl
```

The packet contains 30 non-authorizing cases for silent failure, stale artifacts, provider drift, missing event trails, rollback gaps, and unsafe action escalation. It is eval data only: no judgment, paper-buy, trade, or promote.
This package is not trying to replace LangGraph, CrewAI, AutoGen, OpenAI Agents, or your own internal runner. It wraps the callable you already trust:
```python
result = loop.execute_with_improvement(
    agent_id="support-agent",
    task="answer ticket",
    execute_fn=lambda: existing_agent.run(ticket),
    context={"framework": "your-current-stack"},
)
```

loop.track(...) is also available as a shorter alias for the same API.
Dependency-free examples show the integration seam:
```bash
python3 examples/03_langgraph_adapter.py
python3 examples/05_langgraph_regression_guard.py
python3 examples/06_hermes_skill_regression_guard.py
python3 examples/wrap_existing_agent.py
```

The goal is narrow: traces, thresholds, guarded strategy application, and rollback evidence around an agent you already have.
By default, traces are written to a readable traces.jsonl file with a
cross-platform sidecar lock. For multi-worker deployments, switch to SQLite:
```python
from self_improving_loop import SelfImprovingLoop

loop = SelfImprovingLoop(storage="sqlite")
```

This writes traces.sqlite3 with WAL mode enabled. The public API is unchanged: execute_with_improvement() records traces, and the loop reads them back for thresholds, metrics, and rollback checks.
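For readers unfamiliar with WAL, the sketch below shows how write-ahead logging is typically switched on with the stdlib sqlite3 module. It is not the package's storage code: the `open_trace_db` helper and the `traces` table schema are hypothetical names chosen for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_trace_db(path):
    """Open (or create) a trace database with WAL journaling enabled.

    Hypothetical helper; the real package manages its own connection.
    """
    conn = sqlite3.connect(path)
    # The PRAGMA returns the journal mode that is actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Illustrative schema only, not the package's schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS traces ("
        "id INTEGER PRIMARY KEY, agent_id TEXT, success INTEGER, duration_ms REAL)"
    )
    conn.commit()
    return conn, mode


def demo_wal():
    """Create a throwaway database, insert one trace, and report the mode."""
    with tempfile.TemporaryDirectory() as tmp:
        conn, mode = open_trace_db(str(Path(tmp) / "traces.sqlite3"))
        conn.execute(
            "INSERT INTO traces (agent_id, success, duration_ms) VALUES (?, ?, ?)",
            ("my-agent", 1, 12.5),
        )
        conn.commit()
        count = conn.execute("SELECT COUNT(*) FROM traces").fetchone()[0]
        conn.close()
        return mode, count
```

In WAL mode readers do not block the writer and the writer does not block readers, which is why it suits several workers sharing one trace file.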
For long-running JSONL deployments, enable size-based rotation and call compaction from cron or your scheduler:
```python
loop = SelfImprovingLoop(
    storage="jsonl",
    jsonl_max_bytes=50 * 1024 * 1024,
    jsonl_max_archives=7,
)

# Keep the latest 100k valid active traces.
loop.trace_store.compact(max_entries=100_000)
```

Rotated JSONL files are gzipped under trace_archives/ by default. Compaction drops corrupt rows and keeps the latest valid entries in the active trace file.
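To make the compaction behaviour concrete, here is a minimal stdlib sketch of the same idea: drop rows that fail to parse, keep the newest N, and swap the file in atomically. The function name `compact_jsonl` is hypothetical, and the package's own `compact()` may differ in locking and archive handling.

```python
import json
import os
import tempfile


def compact_jsonl(path, max_entries):
    """Rewrite `path`, keeping only the latest `max_entries` valid JSON rows.

    Illustrative sketch; it does not take the sidecar lock the package uses.
    """
    valid = []
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                json.loads(line)
            except json.JSONDecodeError:
                continue  # corrupt row: drop it
            valid.append(line)

    # Guard against max_entries == 0, where valid[-0:] would keep everything.
    kept = valid[-max_entries:] if max_entries > 0 else []

    # Write to a temp file in the same directory, then replace atomically.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for line in kept:
            out.write(line + "\n")
    os.replace(tmp_path, path)
    return len(kept)
```

Writing to a temporary file first means a crash mid-compaction leaves the original trace file untouched.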
The Yijing layer is implemented as a deterministic state machine, not as a fortune-telling layer:
runtime traces -> six engineering lines -> hexagram state -> policy patch
The six lines are:
- stability
- efficiency
- learning activity
- routing accuracy
- collaboration
- governance
Use it as the strategy:
```python
from self_improving_loop import SelfImprovingLoop, YijingEvolutionStrategy

loop = SelfImprovingLoop(
    strategy=YijingEvolutionStrategy(),
    config_adapter=my_config_adapter,
)
```

improvement_strategy= remains supported for backward compatibility.
The engineering mapping is explicit:
| Line | Dimension | Yang means | Yin means |
|---|---|---|---|
| 1 | stability | dependencies are healthy | API/network/dependency failure |
| 2 | efficiency | high success, low latency | low success or high latency |
| 3 | learning activity | feedback / recovery signal exists | repeated failure without learning |
| 4 | routing accuracy | model/tool choice looks correct | wrong model/tool/schema drift |
| 5 | collaboration | tools / agents hand off cleanly | conflicts or context breaks |
| 6 | governance | cost and rollout are bounded | quota, cost, or policy drift |
The first version supports eight core policy states: Qian, Kun, Zhen, Kan, Bo, Fu, Ji Ji, and Wei Ji. It returns a bounded config patch and relies on the same canary/rollback path as any other strategy.
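The "lines -> hexagram -> patch" step can be pictured as a plain lookup table. The sketch below uses the traditional line patterns of the eight named hexagrams (line 1 is the bottom line). The patch contents, the key names inside them, and the "no match means no patch" fallback are assumptions made for illustration, not the behaviour of YijingEvolutionStrategy.

```python
from typing import Dict, Optional, Tuple

# Line 1 (bottom) .. line 6 (top); 1 = yang, 0 = yin.
Lines = Tuple[int, int, int, int, int, int]

# Traditional line patterns for the eight states named above.
HEXAGRAMS: Dict[Lines, str] = {
    (1, 1, 1, 1, 1, 1): "Qian",
    (0, 0, 0, 0, 0, 0): "Kun",
    (1, 0, 0, 1, 0, 0): "Zhen",
    (0, 1, 0, 0, 1, 0): "Kan",
    (0, 0, 0, 0, 0, 1): "Bo",
    (1, 0, 0, 0, 0, 0): "Fu",
    (1, 0, 1, 0, 1, 0): "Ji Ji",
    (0, 1, 0, 1, 0, 1): "Wei Ji",
}

# Hypothetical bounded patches; keys and values are illustrative only.
POLICY_PATCHES: Dict[str, dict] = {
    "Qian": {},                                   # all healthy: change nothing
    "Kun": {"max_retries": 1, "timeout_s": 30},   # all degraded: be conservative
    "Zhen": {"max_retries": 2},
    "Kan": {"timeout_s": 45},
    "Bo": {"max_retries": 1},
    "Fu": {"max_retries": 3},
    "Ji Ji": {},
    "Wei Ji": {"timeout_s": 20},
}


def recognize(lines: Lines) -> Optional[str]:
    """Return the hexagram name for an exact line pattern, else None."""
    return HEXAGRAMS.get(tuple(lines))


def policy_patch(lines: Lines) -> dict:
    """Return a copy of the patch for the recognized state, or {} if none."""
    name = recognize(lines)
    return dict(POLICY_PATCHES.get(name, {})) if name else {}
```

Because the lookup is a pure function of six bits, the same traces always produce the same patch, which is what makes the router deterministic and testable.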
Most agents have this failure mode:
1. You ship an agent.
2. It works for a week.
3. Something upstream changes (rate limits, schema drift, a new edge case).
4. Your agent starts failing.
5. You find out three days later from angry users.
6. You tweak a config, hope for the best, ship it.
7. The tweak makes another scenario worse.
8. You roll it back manually, losing the original learning.
self-improving-loop turns steps 3–8 into a tight feedback loop that runs inside your process, without needing observability infra, Kubernetes, or a dedicated ML team.
Different agents have different "pulse rates". A critical alerting agent should reconsider after 1 failure; a batch classifier can tolerate 5 before triggering analysis. The library classifies agents by execution frequency and adjusts the failure trigger, analysis window, and cooldown to match.
The automatic profile is based on exec_count_24h; override it with set_manual_threshold() when production semantics matter more than raw frequency. The profiles are:
| Agent profile | Failure trigger | Analysis window | Cooldown |
|---|---|---|---|
| High-frequency (>10/day) | 5 failures | 48h | 3h |
| Medium-frequency (3-10/day) | 3 failures | 24h | 6h |
| Low-frequency (<3/day) | 2 failures | 72h | 12h |
| Critical (user-marked) | 1 failure | 24h | 6h |
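The table can be read as a small decision function. The sketch below encodes it directly. The `Profile` dataclass and `classify` function are hypothetical names, and treating exactly 3 and exactly 10 executions per day as medium-frequency is an assumption drawn from the "3-10/day" label, not from the library's source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    name: str
    failure_threshold: int
    analysis_window_hours: int
    cooldown_hours: int


# Values copied from the table above.
CRITICAL = Profile("critical", 1, 24, 6)
HIGH = Profile("high-frequency", 5, 48, 3)
MEDIUM = Profile("medium-frequency", 3, 24, 6)
LOW = Profile("low-frequency", 2, 72, 12)


def classify(exec_count_24h: int, is_critical: bool = False) -> Profile:
    """Pick a profile from the 24-hour execution count (illustrative only)."""
    if is_critical:
        return CRITICAL  # user-marked agents ignore frequency
    if exec_count_24h > 10:
        return HIGH
    if exec_count_24h >= 3:
        return MEDIUM
    return LOW
```

For example, `classify(50)` yields the high-frequency profile with a 5-failure trigger, while `classify(1)` yields the low-frequency profile with a 2-failure trigger.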
Or bypass the classifier and set manually:
```python
from self_improving_loop import AdaptiveThreshold

adaptive = AdaptiveThreshold()
adaptive.set_manual_threshold(
    "critical-agent",
    failure_threshold=1,
    analysis_window_hours=12,
    cooldown_hours=1,
    is_critical=True,
)
```

When a config change ships, the loop keeps watching. It rolls back if any of these become true:
- Success rate drops >10%
- Average latency increases >20%
- ≥5 consecutive failures after the change
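The three conditions can be sketched as one pure check. This is not the library's rollback code: the function name and reason strings are hypothetical. It also assumes the success-rate drop is measured in absolute percentage points and the latency increase is relative to the pre-change average, since the text above does not say which.

```python
from typing import Optional


def should_rollback(
    before_success_rate: float,
    after_success_rate: float,
    before_avg_latency_ms: float,
    after_avg_latency_ms: float,
    consecutive_failures: int,
) -> Optional[str]:
    """Return a reason string if any rollback condition holds, else None."""
    # Assumption: ">10%" means more than 10 percentage points (0.95 -> 0.80).
    if before_success_rate - after_success_rate > 0.10:
        return "success_rate_drop"
    # Assumption: ">20%" is relative to the pre-change average latency.
    if before_avg_latency_ms > 0 and after_avg_latency_ms > before_avg_latency_ms * 1.20:
        return "latency_increase"
    if consecutive_failures >= 5:
        return "consecutive_failures"
    return None
```

Keeping the check free of I/O makes it easy to unit-test against recorded before/after metrics.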
Real rollback requires a config hook. Prefer an explicit ConfigAdapter:
```python
from self_improving_loop import SelfImprovingLoop

# load_agent_config / save_agent_config stand in for your own
# config persistence helpers; they are not part of this package.
class MyConfigAdapter:
    def get_config(self, agent_id):
        return load_agent_config(agent_id)

    def apply_config(self, agent_id, config_patch):
        # Shallow-merge the patch over the current config.
        save_agent_config(agent_id, {**load_agent_config(agent_id), **config_patch})
        return True

    def rollback_config(self, agent_id, backup_config):
        save_agent_config(agent_id, backup_config)

loop = SelfImprovingLoop(
    improvement_strategy=my_strategy,
    config_adapter=MyConfigAdapter(),
)
```

Without a config adapter or strategy rollback hook, the loop will record the rollback decision but will not claim that your external agent config was restored.
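For local experiments, an in-memory adapter is enough to exercise the backup / patch / restore cycle. The sketch below follows the three method names used above. The class name `InMemoryConfigAdapter` is hypothetical, and the deep copies are a design choice of this sketch rather than a documented requirement of the contract.

```python
import copy


class InMemoryConfigAdapter:
    """Keeps agent configs in a dict; useful for tests and demos only."""

    def __init__(self, initial=None):
        self._configs = copy.deepcopy(initial or {})

    def get_config(self, agent_id):
        # Return a copy so the caller's backup cannot be mutated later.
        return copy.deepcopy(self._configs.get(agent_id, {}))

    def apply_config(self, agent_id, config_patch):
        # Shallow-merge the patch over the current config.
        self._configs[agent_id] = {**self._configs.get(agent_id, {}), **config_patch}
        return True

    def rollback_config(self, agent_id, backup_config):
        self._configs[agent_id] = copy.deepcopy(backup_config)


adapter = InMemoryConfigAdapter({"my-agent": {"timeout_s": 30, "max_retries": 3}})
backup = adapter.get_config("my-agent")           # 1. back up
adapter.apply_config("my-agent", {"timeout_s": 5})  # 2. patch
patched = adapter.get_config("my-agent")
adapter.rollback_config("my-agent", backup)       # 3. restore
restored = adapter.get_config("my-agent")
```

After the last line, `restored` equals the original config, while `patched` shows the merged state that was live between steps 2 and 3.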
```python
# See recent rollbacks
rollback_history = loop.auto_rollback.get_rollback_history("my-agent")
for event in rollback_history:
    print(event["reason"], event["timestamp"])
```

The built-in TelegramNotifier is a stub: it logs to stdout. Override _send_message() to hook any channel:
```python
import requests  # third-party; only needed for this Slack example

from self_improving_loop import SelfImprovingLoop, TelegramNotifier

class MySlackNotifier(TelegramNotifier):
    def __init__(self, webhook_url, **kw):
        super().__init__(**kw)
        self.webhook_url = webhook_url

    def _send_message(self, message, priority="normal"):
        # Post the message text to the configured webhook.
        requests.post(self.webhook_url, json={"text": f"[{priority}] {message}"})

loop = SelfImprovingLoop(notifier=MySlackNotifier(webhook_url="https://hooks..."))
```

Measured locally with benchmarks/overhead.py (200 iterations per workload, Python 3.12, Windows):
| Workload profile | Absolute overhead | Relative overhead |
|---|---|---|
| ~100 ms agent call (typical LLM) | +0.27 ms | +0.3% |
| ~10 ms agent call (tool call) | +0.31 ms | +3.0% |
| sub-millisecond call | +0.08 ms | >>% (don't wrap these) |
The wrapper adds a stable ~300 μs of fixed cost per call (trace append + threshold check). Whether that's negligible depends on your workload:
- LLM calls (>500 ms): overhead is ≤0.06% — invisible
- HTTP / DB calls (~30-100 ms): ≤1%
- Fast in-memory work (<10 ms): 3%+ — reconsider whether you need this for those
Rerun the benchmark on your own machine with python benchmarks/overhead.py.
Restart / recovery startup cost can be checked with:
```bash
python benchmarks/startup_recovery.py --traces 1000 10000 100000
```

SelfImprovingLoop.__init__ only loads loop_state.json; trace history is loaded on demand for stats, thresholds, and rollback checks.
Separate operation costs (triggered occasionally, not per-call):
| Operation | Cost |
|---|---|
| Failure analysis (only when threshold crossed) | ~100 ms |
| Applying improvement config | ~200 ms |
| Rollback execution | ~10 ms |
Extracted from TaijiOS — a self-learning AI operating system with 5 I Ching–bound engines and a 346-heartbeat Ising physics experiment. The parent project has 14 modules; this one is the most generally reusable, so it lives as a standalone package.
TaijiOS started on Chinese New Year 2026-02-17 and has been built through multi-AI collaboration since then.
MIT. Ship it wherever.
This is a very early release. Every bug report, every "didn't work for me", every "I wish it did X" is read.
"Safety first, then automation."