
feat: add self-evolving background reviewer to auto-extract skills from conversations#2406

Open
jiuyueQ wants to merge 3 commits into bytedance:main from jiuyueQ:feature/skill-reviewer

Conversation


@jiuyueQ jiuyueQ commented Apr 21, 2026

Summary

Add a self-evolving mechanism that automatically extracts reusable skills from conversations, inspired by the Hermes agent architecture. After sufficient tool-call iterations, a lightweight background reviewer agent analyzes the conversation and creates/updates skills via the existing skill_manage tool — without interfering with the main dialogue.

What's New

  • SkillReviewMiddleware — After-agent middleware that counts tool-call rounds per thread using delta tracking. When the accumulated count reaches creation_nudge_interval, it launches a background SkillReviewer in a daemon thread (with its own event loop to avoid RuntimeError on main loop closure). Includes LRU eviction (max 100 tracked threads) and counter reset on skill_manage tool calls (via awrap_tool_call hook) to avoid redundant reviews.

  • SkillReviewer — Creates a minimal LangGraph agent with only the skill_manage tool and no middlewares (anti-recursion guarantee). Injects a list of existing skills into the review prompt to prevent duplicate creation. All exceptions are caught and logged — never propagated to the main flow.

  • Skill review prompts — System prompt guides the review agent to identify non-trivial patterns, trial-and-error workflows, and error-resolution strategies. User prompt includes existing skills context for deduplication.

  • Config — New creation_nudge_interval field in SkillEvolutionConfig (default=10). Feature is gated behind skill_evolution.enabled.
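The counting-and-trigger mechanics described above can be sketched roughly as follows. This is an illustrative stand-alone model, not the actual middleware: all names (`ReviewCounter`, `record`) are hypothetical, and the real implementation hangs off LangChain middleware hooks and launches the reviewer in a daemon thread.

```python
import threading
from collections import OrderedDict

class ReviewCounter:
    """Sketch: per-thread tool-call counting with delta tracking and LRU eviction."""

    def __init__(self, nudge_interval=10, max_threads=100):
        self.nudge_interval = nudge_interval
        self.max_threads = max_threads
        self._iters = OrderedDict()  # thread_id -> accumulated tool-call rounds
        self._last_seen = {}         # thread_id -> total rounds observed last call
        self._lock = threading.Lock()

    def record(self, thread_id, total_rounds):
        """Accumulate the delta since the last call; return True when a review should fire."""
        with self._lock:
            delta = max(0, total_rounds - self._last_seen.get(thread_id, 0))
            self._last_seen[thread_id] = total_rounds
            current = self._iters.get(thread_id, 0) + delta
            self._iters[thread_id] = current
            self._iters.move_to_end(thread_id)
            # LRU eviction: drop the least recently updated threads past the cap.
            while len(self._iters) > self.max_threads:
                evicted, _ = self._iters.popitem(last=False)
                self._last_seen.pop(evicted, None)
            if current >= self.nudge_interval:
                self._iters[thread_id] = 0  # reset after trigger
                return True
            return False
```

In this sketch the `skill_manage` reset from the `awrap_tool_call` hook would simply set `self._iters[thread_id] = 0` under the same lock.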

Files Changed

| File | Change |
| --- | --- |
| agents/middlewares/skill_review_middleware.py | New — middleware with aafter_agent + awrap_tool_call hooks |
| agents/skill_review/reviewer.py | New — background review agent |
| agents/skill_review/prompt.py | New — system & user prompts |
| agents/skill_review/__init__.py | New — package init |
| agents/lead_agent/agent.py | Wire middleware into chain (gated by config) |
| agents/lead_agent/prompt.py | Add guidance about proactive skill maintenance |
| config/skill_evolution_config.py | Add creation_nudge_interval field |
| config.example.yaml | Add creation_nudge_interval: 10 |
| tests/test_skill_review.py | New — 21 unit tests |

Test Plan

  • Unit tests pass (21 tests covering counter logic, delta counting, threshold trigger, reset, per-thread isolation, LRU eviction, anti-recursion, best-effort error handling)
  • Integration test: enable skill_evolution, run a conversation exceeding creation_nudge_interval tool-call rounds, verify skill files appear in skills/custom/
  • Manual smoke test: set creation_nudge_interval: 3, check logs for "Skill review triggered" messages
  • Verify main conversation is unaffected by review failures
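That last guarantee reduces to a catch-all around the review coroutine. A minimal sketch of the contract (names are hypothetical; the real SkillReviewer builds a LangGraph agent, elided here behind a placeholder):

```python
import logging

logger = logging.getLogger(__name__)

class BestEffortReviewer:
    """Sketch of the best-effort contract: failures are logged, never raised."""

    async def review(self, thread_id, messages):
        try:
            await self._run_review_agent(messages)  # stand-in for the minimal agent
            logger.info("Skill review completed for thread %s", thread_id)
        except Exception:
            # Any failure is swallowed so the main conversation is never affected.
            logger.exception("Skill review failed for thread %s", thread_id)

    async def _run_review_agent(self, messages):
        raise NotImplementedError  # placeholder for the real LangGraph invocation
```

Because the middleware runs this in a background thread, an exception escaping `review()` would have nowhere useful to go anyway; catching everything and logging is the only sensible policy.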


CLAassistant commented Apr 21, 2026

CLA assistant check
All committers have signed the CLA.

@WillemJiang (Collaborator)

@jiuyueQ Please create a GitHub issue to discuss this new feature first.


jiuyueQ commented Apr 22, 2026

> @jiuyueQ Please create a GitHub issue to discuss this new feature first.

@WillemJiang Thanks for the suggestion!
I've created issue #2437 to discuss this feature first.
Please take a look when you have time. We can continue the discussion there.
Thanks!

Copilot AI (Contributor) left a comment

Pull request overview

Adds an automated, background “skill review” mechanism that periodically analyzes tool-heavy conversations and writes/updates reusable skills via the existing skill_manage tool, gated by skill_evolution.enabled.

Changes:

  • Introduces SkillReviewMiddleware to track per-thread tool-call rounds and trigger background reviews at a configurable interval.
  • Adds SkillReviewer + prompts to run a minimal, anti-recursive agent that can only call skill_manage, with existing-skill context for deduplication.
  • Extends skill evolution configuration (creation_nudge_interval) and adds unit tests for the new behavior.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| config.example.yaml | Adds skill_evolution.creation_nudge_interval to document/configure review cadence. |
| backend/packages/harness/deerflow/config/skill_evolution_config.py | Adds creation_nudge_interval field to the SkillEvolutionConfig model. |
| backend/packages/harness/deerflow/agents/middlewares/skill_review_middleware.py | New middleware that counts tool-call rounds per thread, triggers background reviews, and resets counters on skill_manage. |
| backend/packages/harness/deerflow/agents/skill_review/reviewer.py | New background reviewer that runs a minimal agent to create/patch skills and logs outcomes. |
| backend/packages/harness/deerflow/agents/skill_review/prompt.py | Adds the system/user prompts guiding skill extraction and deduplication behavior. |
| backend/packages/harness/deerflow/agents/skill_review/__init__.py | Initializes the new skill_review package. |
| backend/packages/harness/deerflow/agents/lead_agent/agent.py | Wires the middleware into the lead agent middleware chain behind the config gate. |
| backend/packages/harness/deerflow/agents/lead_agent/prompt.py | Adds prompt guidance noting background skill review and encouraging patching outdated skills. |
| backend/tests/test_skill_review.py | Adds unit tests covering middleware counters/threshold logic and reviewer best-effort behavior. |

```python
thread_id = self._get_thread_id(request.runtime) if request.runtime else None
if thread_id:
    with self._lock:
        self._iters.pop(thread_id, None)
```

Copilot AI Apr 23, 2026

awrap_tool_call resets _iters on skill_manage, but leaves the corresponding entry in _last_seen. Over time (especially if many threads call skill_manage and then stop), _last_seen can grow without bound and also bypass the LRU eviction logic (which only runs based on _iters). Consider popping/resetting _last_seen[thread_id] alongside _iters when handling skill_manage.

Suggested change:

```python
self._iters.pop(thread_id, None)
self._last_seen.pop(thread_id, None)
```
Comment on lines +92 to +107:

```python
logger.info("SkillReviewMiddleware.aafter_agent CALLED — entry point reached")
self._ensure_config()

if not getattr(self._config, "skill_evolution", None) or not self._config.skill_evolution.enabled:
    logger.info("SkillReviewMiddleware: skill_evolution not enabled, skipping")
    return None

thread_id = self._get_thread_id(runtime)
logger.info("SkillReviewMiddleware: thread_id=%s", thread_id)
if not thread_id:
    logger.info("No thread_id in context, skipping skill review check")
    return None

total_rounds = self._count_tool_call_rounds(state)
logger.info("SkillReviewMiddleware: total_rounds=%d", total_rounds)
if total_rounds == 0:
```

Copilot AI Apr 23, 2026

This middleware logs at INFO on every aafter_agent call (including skip paths and per-turn counters). Other middlewares (e.g., MemoryMiddleware) use DEBUG for frequent-path diagnostics; keeping these at INFO will likely spam logs in normal operation. Consider downgrading these frequent-path logs to DEBUG and keeping INFO only for the actual trigger event / failures.
Comment on lines +139 to +143:

```python
if messages:
    logger.info("Skill review triggered for thread %s (%d tool-call iterations)", thread_id, current)
    try:
        thread = threading.Thread(
            target=lambda: asyncio.run(self._reviewer.review(thread_id, messages)),
```

Copilot AI Apr 23, 2026

The background thread is launched with the messages list taken directly from state. Since the main agent may continue to append/modify state["messages"], the reviewer can observe concurrent mutations. Consider passing an immutable snapshot (e.g., messages_copy = list(messages) / tuple(messages)) into the thread target, and also capturing reviewer = self._reviewer locally before starting the thread to avoid races if _reviewer changes.

Suggested change:

```python
reviewer = self._reviewer
messages_snapshot = tuple(messages)
if reviewer is not None and messages_snapshot:
    logger.info("Skill review triggered for thread %s (%d tool-call iterations)", thread_id, current)
    try:
        thread = threading.Thread(
            target=lambda: asyncio.run(reviewer.review(thread_id, messages_snapshot)),
```
Comment thread: backend/packages/harness/deerflow/config/skill_evolution_config.py

Comment on lines +256 to +260:

```python
logger.info("SkillReviewMiddleware check: skill_evolution_config=%s, enabled=%s", skill_evolution_config, skill_evolution_enabled)
if skill_evolution_enabled:
    mw = SkillReviewMiddleware(config=app_config)
    middlewares.append(mw)
    logger.info("SkillReviewMiddleware appended to middleware chain (total: %d)", len(middlewares))
```

Copilot AI Apr 23, 2026

These new INFO logs in _build_middlewares include the full skill_evolution_config object. This method runs on agent creation and may be called frequently in some deployments; logging full config objects at INFO can be noisy and may leak configuration details. Consider removing these logs or lowering to DEBUG and only logging the boolean skill_evolution_enabled if needed.

Suggested change:

```python
logger.debug(
    "SkillReviewMiddleware check: enabled=%s",
    skill_evolution_enabled,
)
if skill_evolution_enabled:
    mw = SkillReviewMiddleware(config=app_config)
    middlewares.append(mw)
    logger.debug("SkillReviewMiddleware appended to middleware chain (total: %d)", len(middlewares))
```
Comment on lines +203 to +225:

```python
def test_threshold_triggers_review(self):
    """Review is triggered when counter reaches the threshold."""
    config = _make_config(enabled=True, interval=3)
    mw = SkillReviewMiddleware(config=config)
    runtime = _make_runtime("thread-B")

    mock_reviewer = AsyncMock()
    mw._reviewer = mock_reviewer

    # First call: 2 tool-call rounds → counter = 2 (below threshold)
    state1 = _make_state(tool_call_rounds=2)
    anyio.run(_run_middleware, mw, state1, runtime)
    assert mw._iters["thread-B"] == 2

    # Second call: 1 more round → counter would be 3 ≥ threshold → triggers review, resets
    from langchain_core.messages import AIMessage

    msgs = state1["messages"] + [AIMessage(content="", tool_calls=[{"name": "bash", "id": "call_new", "args": {"command": "ls"}}])]
    state2 = {"messages": msgs}
    anyio.run(_run_middleware, mw, state2, runtime)
    assert mw._iters["thread-B"] == 0  # Reset after trigger
    mock_reviewer.review.assert_called_once()
```

Copilot AI Apr 23, 2026

test_threshold_triggers_review asserts mock_reviewer.review was called, but the middleware invokes review() inside a newly spawned daemon thread. That makes this test inherently racey/flaky (the assertion can run before the background thread executes). Consider patching threading.Thread to execute the target synchronously in tests, or add a synchronization mechanism (event/join with timeout) before asserting the call.
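One concrete way to act on that suggestion is a Thread stand-in that runs its target inline on start(), making the background review synchronous under test. A sketch (the class name and the patch target path are illustrative, not from the PR):

```python
class InlineThread:
    """Drop-in replacement for threading.Thread that runs its target on start().

    Patch it over the middleware module's Thread in tests so the background
    review executes before any assertion runs.
    """

    def __init__(self, target=None, args=(), kwargs=None, daemon=None):
        self._target = target
        self._args = args
        self._kwargs = kwargs or {}

    def start(self):
        if self._target is not None:
            self._target(*self._args, **self._kwargs)

    def join(self, timeout=None):
        pass  # target already ran inline, nothing to wait for

# Hypothetical usage in the test (module path is an assumption):
# with unittest.mock.patch(
#     "agents.middlewares.skill_review_middleware.threading.Thread", InlineThread
# ):
#     anyio.run(_run_middleware, mw, state2, runtime)
#     mock_reviewer.review.assert_called_once()
```

The alternative of joining the real thread with a timeout also works, but the inline variant keeps the test single-threaded and fully deterministic.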
jiuyueQ and others added 2 commits April 23, 2026 20:50

jiuyueQ commented Apr 25, 2026

@WillemJiang I saw you requested a Copilot review — could you let me know which suggestions you'd like me to follow? Happy to make the changes.

@dengluozhang

The self-evolving skill extraction design is solid. A few observations:

The daemon thread with its own event loop is the right call — without it, trying to launch a background agent from within an existing event loop context would hit RuntimeError: asyncio.run() cannot be called from a running event loop. Making it a daemon thread sidesteps that entirely, and the anti-recursion guarantee (reviewer has only skill_manage tool, no middlewares) means the background agent can't trigger another review cycle.
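The constraint described above is easy to demonstrate in isolation. A minimal sketch (the helper name is illustrative) of handing a coroutine to a fresh daemon thread, where asyncio.run() sees no running loop and is therefore legal:

```python
import asyncio
import threading

def launch_in_background(coro_factory):
    """Run a coroutine on its own event loop in a daemon thread.

    Calling asyncio.run() from code already executing inside a running loop
    raises RuntimeError; a fresh thread has no running loop, so asyncio.run()
    can create and own one there without touching the main loop.
    """
    t = threading.Thread(target=lambda: asyncio.run(coro_factory()), daemon=True)
    t.start()
    return t
```

The daemon flag means the thread never blocks process shutdown; the returned handle is only useful for tests that want to join with a timeout.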

The LRU eviction cap of 100 tracked threads is reasonable for most deployments, but it may be worth exposing as a configurable parameter — high-concurrency deployments with thousands of simultaneous threads would need to tune it upward or accept that evicted threads simply won't get reviewed.

The awrap_tool_call hook resetting the counter on skill_manage calls is clever — it means if a skill gets created during the main conversation, the review trigger gets pushed back, avoiding the awkward case of a review running immediately after a skill is created.

One potential gap: if a thread dies abnormally (process kill, segfault, OOM), the in-memory counter state for that thread is lost. The next new conversation in that thread slot would start fresh with counter=0, which is fine functionally, but there's no way to reconstruct whether a review was interrupted mid-extraction. Not a bug, just a recovery edge case.

The integration test (verifying skill files appear in skills/custom/) is the right validation. Without it, the reviewer could silently fail to write files and nobody would notice.
