
feat: add self-evolving background reviewer to auto-extract skills from conversations#2406

Open
jiuyueQ wants to merge 3 commits into bytedance:main from jiuyueQ:feature/skill-reviewer

Conversation


@jiuyueQ jiuyueQ commented Apr 21, 2026

Summary

Add a self-evolving mechanism that automatically extracts reusable skills from conversations, inspired by the Hermes agent architecture. After sufficient tool-call iterations, a lightweight background reviewer agent analyzes the conversation and creates/updates skills via the existing skill_manage tool — without interfering with the main dialogue.

What's New

  • SkillReviewMiddleware — After-agent middleware that counts tool-call rounds per thread using delta tracking. When the accumulated count reaches creation_nudge_interval, it launches a background SkillReviewer in a daemon thread (with its own event loop to avoid RuntimeError on main loop closure). Includes LRU eviction (max 100 tracked threads) and counter reset on skill_manage tool calls (via awrap_tool_call hook) to avoid redundant reviews.

  • SkillReviewer — Creates a minimal LangGraph agent with only the skill_manage tool and no middlewares (anti-recursion guarantee). Injects a list of existing skills into the review prompt to prevent duplicate creation. All exceptions are caught and logged — never propagated to the main flow.

  • Skill review prompts — System prompt guides the review agent to identify non-trivial patterns, trial-and-error workflows, and error-resolution strategies. User prompt includes existing skills context for deduplication.

  • Config — New creation_nudge_interval field in SkillEvolutionConfig (default=10). Feature is gated behind skill_evolution.enabled.
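The counting-and-trigger mechanics described above can be sketched roughly as follows. This is an illustrative stand-alone model, not the actual middleware: all names (`ReviewCounter`, `record`) are hypothetical, and the real implementation hangs off LangChain middleware hooks and launches the reviewer in a daemon thread.

```python
import threading
from collections import OrderedDict

class ReviewCounter:
    """Sketch: per-thread tool-call counting with delta tracking and LRU eviction."""

    def __init__(self, nudge_interval=10, max_threads=100):
        self.nudge_interval = nudge_interval
        self.max_threads = max_threads
        self._iters = OrderedDict()  # thread_id -> accumulated tool-call rounds
        self._last_seen = {}         # thread_id -> total rounds observed last call
        self._lock = threading.Lock()

    def record(self, thread_id, total_rounds):
        """Accumulate the delta since the last call; return True when a review should fire."""
        with self._lock:
            delta = max(0, total_rounds - self._last_seen.get(thread_id, 0))
            self._last_seen[thread_id] = total_rounds
            current = self._iters.get(thread_id, 0) + delta
            self._iters[thread_id] = current
            self._iters.move_to_end(thread_id)
            # LRU eviction: drop the least recently updated threads past the cap.
            while len(self._iters) > self.max_threads:
                evicted, _ = self._iters.popitem(last=False)
                self._last_seen.pop(evicted, None)
            if current >= self.nudge_interval:
                self._iters[thread_id] = 0  # reset after trigger
                return True
            return False
```

In this sketch the `skill_manage` reset from the `awrap_tool_call` hook would simply set `self._iters[thread_id] = 0` under the same lock.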

Files Changed

| File | Change |
| --- | --- |
| agents/middlewares/skill_review_middleware.py | New — middleware with aafter_agent + awrap_tool_call hooks |
| agents/skill_review/reviewer.py | New — background review agent |
| agents/skill_review/prompt.py | New — system & user prompts |
| agents/skill_review/__init__.py | New — package init |
| agents/lead_agent/agent.py | Wire middleware into chain (gated by config) |
| agents/lead_agent/prompt.py | Add guidance about proactive skill maintenance |
| config/skill_evolution_config.py | Add creation_nudge_interval field |
| config.example.yaml | Add creation_nudge_interval: 10 |
| tests/test_skill_review.py | New — 21 unit tests |

Test Plan

  • Unit tests pass (21 tests covering counter logic, delta counting, threshold trigger, reset, per-thread isolation, LRU eviction, anti-recursion, best-effort error handling)
  • Integration test: enable skill_evolution, run a conversation exceeding creation_nudge_interval tool-call rounds, verify skill files appear in skills/custom/
  • Manual smoke test: set creation_nudge_interval: 3, check logs for "Skill review triggered" messages
  • Verify main conversation is unaffected by review failures
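That last guarantee reduces to a catch-all around the review coroutine. A minimal sketch of the contract (names are hypothetical; the real SkillReviewer builds a LangGraph agent, elided here behind a placeholder):

```python
import logging

logger = logging.getLogger(__name__)

class BestEffortReviewer:
    """Sketch of the best-effort contract: failures are logged, never raised."""

    async def review(self, thread_id, messages):
        try:
            await self._run_review_agent(messages)  # stand-in for the minimal agent
            logger.info("Skill review completed for thread %s", thread_id)
        except Exception:
            # Any failure is swallowed so the main conversation is never affected.
            logger.exception("Skill review failed for thread %s", thread_id)

    async def _run_review_agent(self, messages):
        raise NotImplementedError  # placeholder for the real LangGraph invocation
```

Because the middleware runs this in a background thread, an exception escaping `review()` would have nowhere useful to go anyway; catching everything and logging is the only sensible policy.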


CLAassistant commented Apr 21, 2026

CLA assistant check
All committers have signed the CLA.

@WillemJiang (Collaborator)

@jiuyueQ Please create a GitHub issue to discuss this new feature first.


jiuyueQ commented Apr 22, 2026

> @jiuyueQ Please create a GitHub issue to discuss this new feature first.

@WillemJiang Thanks for the suggestion!
I've created issue #2437 to discuss this feature first.
Please take a look when you have time. We can continue the discussion there.
Thanks!

Copilot AI (Contributor) left a comment

Pull request overview

Adds an automated, background “skill review” mechanism that periodically analyzes tool-heavy conversations and writes/updates reusable skills via the existing skill_manage tool, gated by skill_evolution.enabled.

Changes:

  • Introduces SkillReviewMiddleware to track per-thread tool-call rounds and trigger background reviews at a configurable interval.
  • Adds SkillReviewer + prompts to run a minimal, anti-recursive agent that can only call skill_manage, with existing-skill context for deduplication.
  • Extends skill evolution configuration (creation_nudge_interval) and adds unit tests for the new behavior.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| config.example.yaml | Adds skill_evolution.creation_nudge_interval to document/configure review cadence. |
| backend/packages/harness/deerflow/config/skill_evolution_config.py | Adds creation_nudge_interval field to the SkillEvolutionConfig model. |
| backend/packages/harness/deerflow/agents/middlewares/skill_review_middleware.py | New middleware that counts tool-call rounds per thread, triggers background reviews, and resets counters on skill_manage. |
| backend/packages/harness/deerflow/agents/skill_review/reviewer.py | New background reviewer that runs a minimal agent to create/patch skills and logs outcomes. |
| backend/packages/harness/deerflow/agents/skill_review/prompt.py | Adds the system/user prompts guiding skill extraction and deduplication behavior. |
| backend/packages/harness/deerflow/agents/skill_review/__init__.py | Initializes the new skill_review package. |
| backend/packages/harness/deerflow/agents/lead_agent/agent.py | Wires the middleware into the lead agent middleware chain behind the config gate. |
| backend/packages/harness/deerflow/agents/lead_agent/prompt.py | Adds prompt guidance noting background skill review and encouraging patching outdated skills. |
| backend/tests/test_skill_review.py | Adds unit tests covering middleware counters/threshold logic and reviewer best-effort behavior. |

```python
thread_id = self._get_thread_id(request.runtime) if request.runtime else None
if thread_id:
    with self._lock:
        self._iters.pop(thread_id, None)
```

Copilot AI Apr 23, 2026

awrap_tool_call resets _iters on skill_manage, but leaves the corresponding entry in _last_seen. Over time (especially if many threads call skill_manage and then stop), _last_seen can grow without bound and also bypass the LRU eviction logic (which only runs based on _iters). Consider popping/resetting _last_seen[thread_id] alongside _iters when handling skill_manage.

Suggested change:

```python
self._iters.pop(thread_id, None)
self._last_seen.pop(thread_id, None)
```
Comment on lines +92 to +107:

```python
logger.info("SkillReviewMiddleware.aafter_agent CALLED — entry point reached")
self._ensure_config()

if not getattr(self._config, "skill_evolution", None) or not self._config.skill_evolution.enabled:
    logger.info("SkillReviewMiddleware: skill_evolution not enabled, skipping")
    return None

thread_id = self._get_thread_id(runtime)
logger.info("SkillReviewMiddleware: thread_id=%s", thread_id)
if not thread_id:
    logger.info("No thread_id in context, skipping skill review check")
    return None

total_rounds = self._count_tool_call_rounds(state)
logger.info("SkillReviewMiddleware: total_rounds=%d", total_rounds)
if total_rounds == 0:
```

Copilot AI Apr 23, 2026

This middleware logs at INFO on every aafter_agent call (including skip paths and per-turn counters). Other middlewares (e.g., MemoryMiddleware) use DEBUG for frequent-path diagnostics; keeping these at INFO will likely spam logs in normal operation. Consider downgrading these frequent-path logs to DEBUG and keeping INFO only for the actual trigger event / failures.
Comment on lines +139 to +143:

```python
if messages:
    logger.info("Skill review triggered for thread %s (%d tool-call iterations)", thread_id, current)
    try:
        thread = threading.Thread(
            target=lambda: asyncio.run(self._reviewer.review(thread_id, messages)),
```

Copilot AI Apr 23, 2026

The background thread is launched with the messages list taken directly from state. Since the main agent may continue to append/modify state["messages"], the reviewer can observe concurrent mutations. Consider passing an immutable snapshot (e.g., messages_copy = list(messages) / tuple(messages)) into the thread target, and also capturing reviewer = self._reviewer locally before starting the thread to avoid races if _reviewer changes.

Suggested change:

```python
reviewer = self._reviewer
messages_snapshot = tuple(messages)
if reviewer is not None and messages_snapshot:
    logger.info("Skill review triggered for thread %s (%d tool-call iterations)", thread_id, current)
    try:
        thread = threading.Thread(
            target=lambda: asyncio.run(reviewer.review(thread_id, messages_snapshot)),
```
Comment thread: backend/packages/harness/deerflow/config/skill_evolution_config.py

Comment on lines +256 to +260:

```python
logger.info("SkillReviewMiddleware check: skill_evolution_config=%s, enabled=%s", skill_evolution_config, skill_evolution_enabled)
if skill_evolution_enabled:
    mw = SkillReviewMiddleware(config=app_config)
    middlewares.append(mw)
    logger.info("SkillReviewMiddleware appended to middleware chain (total: %d)", len(middlewares))
```

Copilot AI Apr 23, 2026

These new INFO logs in _build_middlewares include the full skill_evolution_config object. This method runs on agent creation and may be called frequently in some deployments; logging full config objects at INFO can be noisy and may leak configuration details. Consider removing these logs or lowering to DEBUG and only logging the boolean skill_evolution_enabled if needed.

Suggested change:

```python
logger.debug(
    "SkillReviewMiddleware check: enabled=%s",
    skill_evolution_enabled,
)
if skill_evolution_enabled:
    mw = SkillReviewMiddleware(config=app_config)
    middlewares.append(mw)
    logger.debug("SkillReviewMiddleware appended to middleware chain (total: %d)", len(middlewares))
```
Comment on lines +203 to +225:

```python
def test_threshold_triggers_review(self):
    """Review is triggered when counter reaches the threshold."""
    config = _make_config(enabled=True, interval=3)
    mw = SkillReviewMiddleware(config=config)
    runtime = _make_runtime("thread-B")

    mock_reviewer = AsyncMock()
    mw._reviewer = mock_reviewer

    # First call: 2 tool-call rounds → counter = 2 (below threshold)
    state1 = _make_state(tool_call_rounds=2)
    anyio.run(_run_middleware, mw, state1, runtime)
    assert mw._iters["thread-B"] == 2

    # Second call: 1 more round → counter would be 3 ≥ threshold → triggers review, resets
    from langchain_core.messages import AIMessage

    msgs = state1["messages"] + [AIMessage(content="", tool_calls=[{"name": "bash", "id": "call_new", "args": {"command": "ls"}}])]
    state2 = {"messages": msgs}
    anyio.run(_run_middleware, mw, state2, runtime)
    assert mw._iters["thread-B"] == 0  # Reset after trigger
    mock_reviewer.review.assert_called_once()
```

Copilot AI Apr 23, 2026

test_threshold_triggers_review asserts mock_reviewer.review was called, but the middleware invokes review() inside a newly spawned daemon thread. That makes this test inherently racey/flaky (the assertion can run before the background thread executes). Consider patching threading.Thread to execute the target synchronously in tests, or add a synchronization mechanism (event/join with timeout) before asserting the call.
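One concrete way to act on that suggestion is a Thread stand-in that runs its target inline on start(), making the background review synchronous under test. A sketch (the class name and the patch target path are illustrative, not from the PR):

```python
class InlineThread:
    """Drop-in replacement for threading.Thread that runs its target on start().

    Patch it over the middleware module's Thread in tests so the background
    review executes before any assertion runs.
    """

    def __init__(self, target=None, args=(), kwargs=None, daemon=None):
        self._target = target
        self._args = args
        self._kwargs = kwargs or {}

    def start(self):
        if self._target is not None:
            self._target(*self._args, **self._kwargs)

    def join(self, timeout=None):
        pass  # target already ran inline, nothing to wait for

# Hypothetical usage in the test (module path is an assumption):
# with unittest.mock.patch(
#     "agents.middlewares.skill_review_middleware.threading.Thread", InlineThread
# ):
#     anyio.run(_run_middleware, mw, state2, runtime)
#     mock_reviewer.review.assert_called_once()
```

The alternative of joining the real thread with a timeout also works, but the inline variant keeps the test single-threaded and fully deterministic.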
jiuyueQ and others added 2 commits April 23, 2026 20:50

jiuyueQ commented Apr 25, 2026

@WillemJiang I saw you requested a Copilot review — could you let me know which suggestions you'd like me to follow? Happy to make the changes.

@dengluozhang

The self-evolving skill extraction design is solid. A few observations:

The daemon thread with its own event loop is the right call — without it, trying to launch a background agent from within an existing event loop context would hit RuntimeError: asyncio.run() cannot be called from a running event loop. Making it a daemon thread sidesteps that entirely, and the anti-recursion guarantee (reviewer has only skill_manage tool, no middlewares) means the background agent can't trigger another review cycle.
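The constraint described above is easy to demonstrate in isolation. A minimal sketch (the helper name is illustrative) of handing a coroutine to a fresh daemon thread, where asyncio.run() sees no running loop and is therefore legal:

```python
import asyncio
import threading

def launch_in_background(coro_factory):
    """Run a coroutine on its own event loop in a daemon thread.

    Calling asyncio.run() from code already executing inside a running loop
    raises RuntimeError; a fresh thread has no running loop, so asyncio.run()
    can create and own one there without touching the main loop.
    """
    t = threading.Thread(target=lambda: asyncio.run(coro_factory()), daemon=True)
    t.start()
    return t
```

The daemon flag means the thread never blocks process shutdown; the returned handle is only useful for tests that want to join with a timeout.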

The LRU eviction cap of 100 tracked threads is reasonable for most deployments, but it may be worth exposing as a configurable parameter — high-concurrency deployments with thousands of simultaneous threads would need to tune it upward or accept that evicted threads simply won't get reviewed.

The awrap_tool_call hook resetting the counter on skill_manage calls is clever — it means if a skill gets created during the main conversation, the review trigger gets pushed back, avoiding the awkward case of a review running immediately after a skill is created.

One potential gap: if a thread dies abnormally (process kill, segfault, OOM), the in-memory counter state for that thread is lost. The next new conversation in that thread slot would start fresh with counter=0, which is fine functionally, but there's no way to reconstruct whether a review was interrupted mid-extraction. Not a bug, just a recovery edge case.

The integration test (verifying skill files appear in skills/custom/) is the right validation. Without it, the reviewer could silently fail to write files and nobody would notice.
