feat: add self-evolving background reviewer to auto-extract skills from conversations #2406
jiuyueQ wants to merge 3 commits into bytedance:main from
Conversation
@jiuyueQ Please create a GitHub issue to discuss this new feature first.

@WillemJiang Thanks for the suggestion!
Pull request overview
Adds an automated, background “skill review” mechanism that periodically analyzes tool-heavy conversations and writes/updates reusable skills via the existing skill_manage tool, gated by skill_evolution.enabled.
Changes:
- Introduces `SkillReviewMiddleware` to track per-thread tool-call rounds and trigger background reviews at a configurable interval.
- Adds `SkillReviewer` + prompts to run a minimal, anti-recursive agent that can only call `skill_manage`, with existing-skill context for deduplication.
- Extends skill evolution configuration (`creation_nudge_interval`) and adds unit tests for the new behavior.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `config.example.yaml` | Adds `skill_evolution.creation_nudge_interval` to document/configure review cadence. |
| `backend/packages/harness/deerflow/config/skill_evolution_config.py` | Adds `creation_nudge_interval` field to the `SkillEvolutionConfig` model. |
| `backend/packages/harness/deerflow/agents/middlewares/skill_review_middleware.py` | New middleware that counts tool-call rounds per thread, triggers background reviews, and resets counters on `skill_manage`. |
| `backend/packages/harness/deerflow/agents/skill_review/reviewer.py` | New background reviewer that runs a minimal agent to create/patch skills and logs outcomes. |
| `backend/packages/harness/deerflow/agents/skill_review/prompt.py` | Adds the system/user prompts guiding skill extraction and deduplication behavior. |
| `backend/packages/harness/deerflow/agents/skill_review/__init__.py` | Initializes the new `skill_review` package. |
| `backend/packages/harness/deerflow/agents/lead_agent/agent.py` | Wires the middleware into the lead agent middleware chain behind the config gate. |
| `backend/packages/harness/deerflow/agents/lead_agent/prompt.py` | Adds prompt guidance noting background skill review and encouraging patching outdated skills. |
| `backend/tests/test_skill_review.py` | Adds unit tests covering middleware counters/threshold logic and reviewer best-effort behavior. |
```python
thread_id = self._get_thread_id(request.runtime) if request.runtime else None
if thread_id:
    with self._lock:
        self._iters.pop(thread_id, None)
```
`awrap_tool_call` resets `_iters` on `skill_manage`, but leaves the corresponding entry in `_last_seen`. Over time (especially if many threads call `skill_manage` and then stop), `_last_seen` can grow without bound and also bypass the LRU eviction logic (which only runs based on `_iters`). Consider popping/resetting `_last_seen[thread_id]` alongside `_iters` when handling `skill_manage`.
```diff
 self._iters.pop(thread_id, None)
+self._last_seen.pop(thread_id, None)
```
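The suggested fix can be sketched in context as a single reset helper. This is a minimal illustration, not the PR's actual class: only the `_iters`/`_last_seen`/`_lock` names come from the review comment, everything else is assumed.

```python
import threading


class SkillReviewMiddleware:
    """Sketch of the per-thread counter state discussed above.

    Only the reset path is shown; the real middleware has hooks,
    config, and LRU eviction around this.
    """

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._iters: dict[str, int] = {}      # tool-call rounds per thread
        self._last_seen: dict[str, int] = {}  # last processed message index

    def _reset_thread(self, thread_id: str) -> None:
        # Pop both dicts together so _last_seen cannot outlive _iters
        # and grow without bound (silently dodging the LRU eviction).
        with self._lock:
            self._iters.pop(thread_id, None)
            self._last_seen.pop(thread_id, None)
```

Routing every reset through one helper also keeps future per-thread state from reintroducing the same leak.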
```python
logger.info("SkillReviewMiddleware.aafter_agent CALLED — entry point reached")
self._ensure_config()

if not getattr(self._config, "skill_evolution", None) or not self._config.skill_evolution.enabled:
    logger.info("SkillReviewMiddleware: skill_evolution not enabled, skipping")
    return None

thread_id = self._get_thread_id(runtime)
logger.info("SkillReviewMiddleware: thread_id=%s", thread_id)
if not thread_id:
    logger.info("No thread_id in context, skipping skill review check")
    return None

total_rounds = self._count_tool_call_rounds(state)
logger.info("SkillReviewMiddleware: total_rounds=%d", total_rounds)
if total_rounds == 0:
```
This middleware logs at INFO on every aafter_agent call (including skip paths and per-turn counters). Other middlewares (e.g., MemoryMiddleware) use DEBUG for frequent-path diagnostics; keeping these at INFO will likely spam logs in normal operation. Consider downgrading these frequent-path logs to DEBUG and keeping INFO only for the actual trigger event / failures.
```python
if messages:
    logger.info("Skill review triggered for thread %s (%d tool-call iterations)", thread_id, current)
    try:
        thread = threading.Thread(
            target=lambda: asyncio.run(self._reviewer.review(thread_id, messages)),
```
The background thread is launched with the messages list taken directly from state. Since the main agent may continue to append/modify state["messages"], the reviewer can observe concurrent mutations. Consider passing an immutable snapshot (e.g., messages_copy = list(messages) / tuple(messages)) into the thread target, and also capturing reviewer = self._reviewer locally before starting the thread to avoid races if _reviewer changes.
```diff
-if messages:
-    logger.info("Skill review triggered for thread %s (%d tool-call iterations)", thread_id, current)
-    try:
-        thread = threading.Thread(
-            target=lambda: asyncio.run(self._reviewer.review(thread_id, messages)),
+reviewer = self._reviewer
+messages_snapshot = tuple(messages)
+if reviewer is not None and messages_snapshot:
+    logger.info("Skill review triggered for thread %s (%d tool-call iterations)", thread_id, current)
+    try:
+        thread = threading.Thread(
+            target=lambda: asyncio.run(reviewer.review(thread_id, messages_snapshot)),
```
```python
logger.info("SkillReviewMiddleware check: skill_evolution_config=%s, enabled=%s", skill_evolution_config, skill_evolution_enabled)
if skill_evolution_enabled:
    mw = SkillReviewMiddleware(config=app_config)
    middlewares.append(mw)
    logger.info("SkillReviewMiddleware appended to middleware chain (total: %d)", len(middlewares))
```
These new INFO logs in _build_middlewares include the full skill_evolution_config object. This method runs on agent creation and may be called frequently in some deployments; logging full config objects at INFO can be noisy and may leak configuration details. Consider removing these logs or lowering to DEBUG and only logging the boolean skill_evolution_enabled if needed.
```diff
-logger.info("SkillReviewMiddleware check: skill_evolution_config=%s, enabled=%s", skill_evolution_config, skill_evolution_enabled)
+logger.debug(
+    "SkillReviewMiddleware check: enabled=%s",
+    skill_evolution_enabled,
+)
 if skill_evolution_enabled:
     mw = SkillReviewMiddleware(config=app_config)
     middlewares.append(mw)
-    logger.info("SkillReviewMiddleware appended to middleware chain (total: %d)", len(middlewares))
+    logger.debug("SkillReviewMiddleware appended to middleware chain (total: %d)", len(middlewares))
```
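With the frequent-path logs at DEBUG, they can still be surfaced on demand by raising the module logger's level during troubleshooting. A minimal sketch; the logger name here is hypothetical and should match whatever `__name__` resolves to in the real middleware module:

```python
import logging

# Hypothetical logger name; in the PR the module lives under
# backend/packages/harness/deerflow/agents/middlewares/, so adjust
# the dotted path to the actual package layout.
logging.getLogger(
    "deerflow.agents.middlewares.skill_review_middleware"
).setLevel(logging.DEBUG)
```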
```python
def test_threshold_triggers_review(self):
    """Review is triggered when counter reaches the threshold."""
    config = _make_config(enabled=True, interval=3)
    mw = SkillReviewMiddleware(config=config)
    runtime = _make_runtime("thread-B")

    mock_reviewer = AsyncMock()
    mw._reviewer = mock_reviewer

    # First call: 2 tool-call rounds → counter = 2 (below threshold)
    state1 = _make_state(tool_call_rounds=2)
    anyio.run(_run_middleware, mw, state1, runtime)
    assert mw._iters["thread-B"] == 2

    # Second call: 1 more round → counter would be 3 ≥ threshold → triggers review, resets
    from langchain_core.messages import AIMessage

    msgs = state1["messages"] + [AIMessage(content="", tool_calls=[{"name": "bash", "id": "call_new", "args": {"command": "ls"}}])]
    state2 = {"messages": msgs}
    anyio.run(_run_middleware, mw, state2, runtime)
    assert mw._iters["thread-B"] == 0  # Reset after trigger
    mock_reviewer.review.assert_called_once()
```
test_threshold_triggers_review asserts mock_reviewer.review was called, but the middleware invokes review() inside a newly spawned daemon thread. That makes this test inherently racey/flaky (the assertion can run before the background thread executes). Consider patching threading.Thread to execute the target synchronously in tests, or add a synchronization mechanism (event/join with timeout) before asserting the call.
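One way to de-flake the test, as the comment suggests, is to patch `threading.Thread` with a stand-in whose `start()` runs the target inline. A sketch under the assumption that the middleware resolves `threading.Thread` at call time (which `mock.patch` relies on):

```python
import threading
from unittest import mock


class _SyncThread(threading.Thread):
    """Thread stand-in that runs its target inline on start(), so the
    background review completes before the test's assertions execute."""

    def start(self) -> None:
        self.run()  # execute synchronously instead of spawning

# Demonstration that the patched class removes the race: the target has
# already run by the time start() returns.
results: list[str] = []
with mock.patch("threading.Thread", _SyncThread):
    t = threading.Thread(target=lambda: results.append("ran"))
    t.start()
```

In the actual test, wrapping the second `anyio.run(_run_middleware, ...)` call in the same `mock.patch` context would make `mock_reviewer.review.assert_called_once()` deterministic.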
…g.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@WillemJiang I saw you requested a Copilot review — could you let me know which suggestions you'd like me to follow? Happy to make the changes.
The self-evolving skill extraction design is solid. A few observations:

The daemon thread with its own event loop is the right call — without it, trying to launch a background agent from within an existing event loop context would hit `RuntimeError`.

The LRU eviction at 100 tracked threads is reasonable for most deployments, but worth noting it as a configurable parameter — high-concurrency deployments with thousands of simultaneous threads would need to tune this upward or accept that overflow threads simply won't get reviewed.

One potential gap: if a thread dies abnormally (process kill, segfault, OOM), the in-memory counter state for that thread is lost. The next new conversation in that thread slot would start fresh with counter=0, which is fine functionally, but there's no way to reconstruct whether a review was interrupted mid-extraction. Not a bug, just a recovery edge case.

The integration test (verifying skill files appear in `skills/custom/`) …
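The daemon-thread-plus-own-event-loop pattern praised above can be sketched as follows. This is an illustrative stand-in, not the PR's code: `_review` and `launch_background_review` are hypothetical names.

```python
import asyncio
import threading


async def _review(thread_id: str, messages: tuple) -> None:
    # Stand-in for the reviewer coroutine; the real one drives an agent.
    await asyncio.sleep(0)


def launch_background_review(thread_id: str, messages: tuple) -> threading.Thread:
    # asyncio.run() builds a fresh event loop inside the new thread, so
    # the review is isolated from the caller's loop lifecycle (no
    # RuntimeError when the main loop closes mid-review), and daemon=True
    # keeps the thread from blocking process shutdown.
    t = threading.Thread(
        target=lambda: asyncio.run(_review(thread_id, messages)),
        daemon=True,
    )
    t.start()
    return t
```

Calling `asyncio.run()` on the main thread while its loop is still running would raise; moving it into a dedicated thread sidesteps that entirely.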
Summary
Add a self-evolving mechanism that automatically extracts reusable skills from conversations, inspired by the Hermes agent architecture. After sufficient tool-call iterations, a lightweight background reviewer agent analyzes the conversation and creates/updates skills via the existing `skill_manage` tool — without interfering with the main dialogue.

What's New
SkillReviewMiddleware — After-agent middleware that counts tool-call rounds per thread using delta tracking. When the accumulated count reaches `creation_nudge_interval`, it launches a background `SkillReviewer` in a daemon thread (with its own event loop to avoid `RuntimeError` on main loop closure). Includes LRU eviction (max 100 tracked threads) and counter reset on `skill_manage` tool calls (via the `awrap_tool_call` hook) to avoid redundant reviews.
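The LRU-evicted counter described above can be sketched with an `OrderedDict`; the 100-thread cap comes from the PR, while the class and method names here are hypothetical:

```python
from collections import OrderedDict

MAX_TRACKED_THREADS = 100  # cap stated in the PR; eviction details assumed


class ThreadRoundCounter:
    """Sketch of delta-tracked per-thread counters with LRU eviction."""

    def __init__(self, max_threads: int = MAX_TRACKED_THREADS) -> None:
        self._max = max_threads
        self._iters: OrderedDict[str, int] = OrderedDict()

    def add_rounds(self, thread_id: str, delta: int) -> int:
        # Re-inserting moves the thread to the most-recently-used end.
        current = self._iters.pop(thread_id, 0) + delta
        self._iters[thread_id] = current
        # Drop the least-recently-updated thread once over capacity.
        while len(self._iters) > self._max:
            self._iters.popitem(last=False)
        return current
```

Evicted threads simply lose their count and start from zero if they reappear, matching the "overflow threads won't get reviewed" trade-off noted in review.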
skill_managetool and no middlewares (anti-recursion guarantee). Injects a list of existing skills into the review prompt to prevent duplicate creation. All exceptions are caught and logged — never propagated to the main flow.Skill review prompts — System prompt guides the review agent to identify non-trivial patterns, trial-and-error workflows, and error-resolution strategies. User prompt includes existing skills context for deduplication.
Config — New `creation_nudge_interval` field in `SkillEvolutionConfig` (default=10). Feature is gated behind `skill_evolution.enabled`.

Files Changed
- `agents/middlewares/skill_review_middleware.py`: `aafter_agent` + `awrap_tool_call` hooks
- `agents/skill_review/reviewer.py`
- `agents/skill_review/prompt.py`
- `agents/skill_review/__init__.py`
- `agents/lead_agent/agent.py`
- `agents/lead_agent/prompt.py`
- `config/skill_evolution_config.py`: `creation_nudge_interval` field
- `config.example.yaml`: `creation_nudge_interval: 10`
- `tests/test_skill_review.py`

Test Plan
- Enable `skill_evolution`, run a conversation exceeding `creation_nudge_interval` tool-call rounds, verify skill files appear in `skills/custom/`
- Set `creation_nudge_interval: 3`, check logs for "Skill review triggered" messages
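For quick local verification, the relevant `config.example.yaml` excerpt might look like this (the key names follow the PR; the surrounding structure is an assumption):

```yaml
# Hypothetical excerpt of config.example.yaml
skill_evolution:
  enabled: true               # feature gate for background skill review
  creation_nudge_interval: 3  # low threshold so the trigger fires quickly
```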