feat(core): thinking block cross-turn retention with idle cleanup #2897

tanzhenxin merged 5 commits into main
Conversation
📋 Review Summary

This PR implements thinking block cross-turn retention with idle cleanup, addressing a key usability issue where thinking blocks were unconditionally stripped on every new user query. The implementation introduces an idle-aware mechanism that preserves thinking during active sessions (<1h idle) and only clears old thinking after cache TTL expiration, with a sticky latch to protect the newly-warmed cache prefix. The code is well-structured with comprehensive test coverage.

🔍 General Feedback

🎯 Specific Feedback

🟡 High

🟢 Medium

🔵 Low

✅ Highlights
Force-pushed bb23b70 to 723b7ed

Force-pushed 723b7ed to 42b0335
Thanks for the review. Addressing the feedback:

- 🟡 High —
- 🟢 Medium — Duplicated thought identification logic:
- 🟢 Medium — JSDoc referencing PR:
- 🔵 Low — Move constant to config / Add resetChat comment / Extract mock helper:
Pull request overview
This PR updates the core chat/client flow to retain model “thinking” blocks across turns during active sessions, while performing time-based cleanup after long idle periods to reduce context/token bloat and stabilize any provider-side prompt caching behavior.
Changes:
- Add `GeminiChat.stripThoughtsFromHistoryKeepRecent(keepTurns)` to strip older thought parts while keeping the most recent N thought-containing model turns intact.
- Introduce idle/latch state in `GeminiClient` to conditionally strip thoughts only after >1h idle, then keep that behavior "latched" until `/clear` (`resetChat()`).
- Add unit tests for the new history-stripping helper and update client test mocks to include the new chat method.
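The keep-recent stripping behavior can be sketched as follows. This is a simplified illustration, not the repo's actual implementation: the `Part`/`Content` shapes are cut down to the fields relevant here, and the assumption that thought parts carry a `thought` flag plus an optional `thoughtSignature` mirrors the Gemini SDK convention but is hypothetical for this sketch.

```typescript
// Simplified stand-ins for the SDK's Content/Part types (assumption:
// thought parts are marked with a `thought` flag, as in @google/genai).
interface Part { text?: string; thought?: boolean; thoughtSignature?: string; }
interface Content { role: 'user' | 'model'; parts: Part[]; }

// Keep thought parts in the most recent `keepTurns` thought-containing
// model turns; strip them (and their signatures) from all earlier turns.
function stripThoughtsKeepRecent(history: Content[], keepTurns: number): Content[] {
  // Walk backwards, marking the model turns inside the keep window.
  const keep = new Set<number>();
  let seen = 0;
  for (let i = history.length - 1; i >= 0 && seen < keepTurns; i--) {
    const c = history[i];
    if (c.role === 'model' && c.parts.some((p) => p.thought)) {
      keep.add(i);
      seen++;
    }
  }
  return history
    .map((c, i) => {
      if (c.role !== 'model' || keep.has(i)) return c;
      // Drop thought parts and any dangling thought signatures.
      const parts = c.parts
        .filter((p) => !p.thought)
        .map(({ thoughtSignature, ...rest }) => rest);
      return { ...c, parts };
    })
    // Remove model turns left with no parts (pure-thought turns).
    .filter((c) => c.parts.length > 0);
}
```

Counting only thought-containing model turns (rather than all model turns) matches the later commit's clarification: the most recent reasoning chain survives even if the last few turns carried no thinking.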
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/core/src/core/geminiChat.ts | Adds stripThoughtsFromHistoryKeepRecent() to selectively strip thought parts from older model turns. |
| packages/core/src/core/geminiChat.test.ts | Adds focused unit tests validating keep-recent behavior and edge cases. |
| packages/core/src/core/client.ts | Implements idle-aware thought retention and sticky latch, plus completion timestamp tracking/reset. |
| packages/core/src/core/client.test.ts | Updates mocked GeminiChat objects to include the new method used by the client. |
Force-pushed 42b0335 to 5f48fb7
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Previously, all thinking blocks were unconditionally stripped from history on every new user query. This caused loss of reasoning coherence in active multi-turn sessions where thinking context was still valuable.

Now thinking blocks are preserved during active sessions and only cleaned up after >1h idle (cache TTL expired), keeping the most recent 1 turn. A sticky-on latch prevents the cleanup from reverting, protecting the newly-warmed cache prefix from invalidation.

- Add `stripThoughtsFromHistoryKeepRecent(keepTurns)` to GeminiChat
- Add `lastApiCompletionTimestamp` and `thinkingClearLatched` to GeminiClient
- Replace unconditional strip with idle-aware logic in sendMessageStream
- Track API completion timestamp on all exit paths (success/error/loop)
- Reset latch and timestamp on resetChat()
- Add 5 unit tests for the new method, update 18 mock objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
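The idle check and sticky latch described in this commit message can be sketched in isolation. This is a hypothetical, simplified stand-in for the client-side state; the real `GeminiClient` carries far more state and the field names are taken from the commit message, not verified signatures.

```typescript
// Hypothetical, simplified view of the idle-aware decision made in
// sendMessageStream; threshold is injected rather than read from config.
class ThinkingRetention {
  private lastApiCompletionTimestamp: number | null = null;
  private thinkingClearLatched = false;

  constructor(private idleThresholdMs: number) {}

  // Called before sending a new user query. Returns true when old
  // thinking blocks should be stripped (keeping the most recent turn).
  shouldStripOldThinking(now: number): boolean {
    const idle =
      this.lastApiCompletionTimestamp !== null &&
      now - this.lastApiCompletionTimestamp > this.idleThresholdMs;
    // Sticky-on: once set, stay set until resetChat(), so the freshly
    // warmed cache prefix is not invalidated by re-adding thoughts.
    if (idle) this.thinkingClearLatched = true;
    return this.thinkingClearLatched;
  }

  // Called on every API exit path (success / error / loop end).
  markApiCompletion(now: number): void {
    this.lastApiCompletionTimestamp = now;
  }

  // Called from resetChat() (/clear).
  reset(): void {
    this.lastApiCompletionTimestamp = null;
    this.thinkingClearLatched = false;
  }
}
```

The latch is what makes the behavior stable: without it, the turn after a post-idle cleanup would re-expose older thoughts and break the prefix that the cleanup just re-warmed.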
Force-pushed 5f48fb7 to 03d91f1
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Code Coverage Summary

CLI Package - Full Text Report
Core Package - Full Text Report

For detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run.
Explicitly document that keepTurns selects from thought-containing model turns (not all model turns) to ensure the most recent reasoning chain is preserved even if later turns have no thinking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
…ecent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.
Log when the latch triggers and when old thinking blocks are stripped, making the behavior observable via QWEN_CODE_DEBUG=CLIENT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
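The gating for those logs might look like the following. Only the env var name `QWEN_CODE_DEBUG=CLIENT` comes from the commit message; the parsing (comma-separated, case-insensitive) and the `debugLog` helper are assumptions for this sketch, not the repo's actual logging utility.

```typescript
// Hypothetical helper: checks whether QWEN_CODE_DEBUG enables the
// CLIENT channel. Assumes a comma-separated, case-insensitive list.
function clientDebugEnabled(
  env: Record<string, string | undefined> = process.env,
): boolean {
  const v = env['QWEN_CODE_DEBUG'] ?? '';
  return v
    .split(',')
    .map((s) => s.trim().toLowerCase())
    .includes('client');
}

function debugLog(message: string): void {
  if (clientDebugEnabled()) {
    console.debug(`[CLIENT] ${message}`);
  }
}

// Usage at the two observable points named in the commit:
// debugLog('thinking-clear latch set (idle > threshold)');
// debugLog('stripped old thinking blocks, kept most recent turn');
```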
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
Prompt Cache Hit Analysis: Thinking Block Retention

Observed Pattern

Test scenario: three consecutive user tasks in a single session, with

Analysis

Phase 1: Agentic loop within a task (reqs 0–5)

Cache hit rate is consistently high (90–99%). Each request incrementally extends the previous one by appending tool results and model responses. The provider caches the growing prefix effectively.

Phase 2: First request of a new user task (req 6)

Cache drops to 17,965 — roughly the system prompt size — despite the message prefix being byte-for-byte identical to the previous request (verified by per-message MD5 hash comparison). The thinking/reasoning blocks are retained in history (confirmed by log inspection). This cache break at the user-task boundary is still under investigation.

Phase 3: Subsequent user tasks (req 7)

Cache recovers to 99.9%. Since the cache break already happened on req 6, the prefix stabilizes and subsequent requests match.

Phase 4: Idle timeout (req 8)

After ~8 minutes of inactivity, the provider cache expires entirely (cached = 0). This establishes the provider's cache TTL at approximately 5 minutes.

Action Items

1. Align idle threshold with provider cache TTL (this PR)

Since the provider cache expires after ~5 minutes of inactivity, retaining thinking blocks beyond that point provides no cache benefit — the provider will miss regardless. Meanwhile, the retained thinking blocks consume context window tokens. Lower:

```json
{
  "context.gapThresholdMinutes": 5
}
```

2. Investigate task-boundary cache break (follow-up PR)

The cache drop at req 6 (first request of a new user task) remains unexplained. The message prefix is byte-for-byte identical to the previous request, thinking blocks are retained, and all request metadata (model, tools) matches. Yet the provider returns a cache miss down to system-prompt level. This will be tracked and resolved in a separate PR.
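The per-message MD5 comparison used in Phase 2 can be reproduced with a small diagnostic helper. The message shape is deliberately left opaque here; hashing `JSON.stringify` of each message is an assumption about the comparison method, adequate for spotting the first point of prefix divergence between consecutive requests.

```typescript
import { createHash } from 'node:crypto';

// Hash each serialized message. Comparing the hash arrays of two
// consecutive requests pinpoints the first message where their
// prefixes diverge (-1 means the shorter array is a pure prefix).
function messageHashes(messages: unknown[]): string[] {
  return messages.map((m) =>
    createHash('md5').update(JSON.stringify(m)).digest('hex'),
  );
}

function firstDivergence(a: string[], b: string[]): number {
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    if (a[i] !== b[i]) return i;
  }
  return -1; // identical up to the shorter length
}
```

A `-1` on the req 5 → req 6 pair is exactly what makes the Phase 2 cache break surprising: the request prefix is provably unchanged, so the miss must come from something outside the message bytes.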
tanzhenxin left a comment
Lower THINKING_IDLE_THRESHOLD_MS from 1 hour to 5 minutes and make it configurable via context.gapThresholdMinutes — provider cache TTL is ~5 min, so retaining beyond that wastes context with no cache benefit.
…ault to 5min

Align with observed provider prompt-cache TTL (~5 min). Add `context.gapThresholdMinutes` setting so users can tune the threshold for providers with different cache TTLs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tanzhenxin Changes implemented per your review. Summary:

1. Default threshold lowered from 1h to 5min
2. New setting:

```json
{
  "context": {
    "gapThresholdMinutes": 5
  }
}
```

Configurable for providers with different cache TTLs.

Files changed:

Task-boundary cache break (req 6) to be tracked in a follow-up PR as discussed.
Summary
Thinking blocks (model's internal reasoning, 10–60K tokens each) were unconditionally stripped from history on every new user query. This PR changes the behavior to be idle-aware: preserve thinking during active sessions, only clean up after prolonged idle.
Before

- On every new user query, the client called `stripThoughtsFromHistory()`, removing all thinking blocks from prior turns

After

- Thinking blocks are preserved while the session is active; post-idle cleanup stays latched until `/clear`
- New `context.gapThresholdMinutes` setting (default 5) allows tuning the idle threshold to match your provider's prompt-cache TTL

Behavior matrix

| Session state | Thinking blocks |
|---|---|
| Active (idle < threshold) | Preserved across turns |
| Idle > threshold | Old thinking stripped, most recent thought-containing turn kept; strip stays latched |
| `/clear` or reset | Latch and completion timestamp reset |

Key changes
- `geminiChat.ts`: `stripThoughtsFromHistoryKeepRecent(keepTurns)` strips old thinking, keeps last N thought-containing model turns
- `client.ts`: adds `lastApiCompletionTimestamp` and `thinkingClearLatched`, replaces unconditional strip with idle-aware logic; reads threshold from config instead of a hardcoded constant
- `config.ts` (core): adds `thinkingIdleThresholdMinutes` to `ConfigParameters`, `getThinkingIdleThresholdMs()` getter
- `settingsSchema.ts`: `context.gapThresholdMinutes` setting (number, default 5)
- `config.ts` (cli): passes the `gapThresholdMinutes` setting through to core config
- `settings.md`: documents the `context.gapThresholdMinutes` setting
- `geminiChat.test.ts`: tests for `stripThoughtsFromHistoryKeepRecent`
- `client.test.ts`: updates mocks to include the new method

Design rationale (cache TTL alignment)
Per reviewer analysis, the provider prompt-cache expires after ~5 minutes of inactivity. Retaining thinking blocks beyond that point provides no cache benefit while consuming context tokens. The default threshold is aligned accordingly, and made configurable via `context.gapThresholdMinutes` for providers with different TTLs.

Test plan

- `stripThoughtsFromHistoryKeepRecent`: keep recent, no-op when enough turns, empty content removal, signature stripping, keepTurns=0
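The config plumbing named in the key changes reduces to a minutes-to-milliseconds conversion with a 5-minute default. This sketch reuses the names from the change list (`thinkingIdleThresholdMinutes`, `getThinkingIdleThresholdMs()`), but the surrounding `ConfigParameters`/`Config` shapes are assumed simplifications, not the repo's actual types.

```typescript
const DEFAULT_THINKING_IDLE_THRESHOLD_MINUTES = 5;

// Assumed shape: the real ConfigParameters carries many more fields.
interface ConfigParameters {
  thinkingIdleThresholdMinutes?: number;
}

class Config {
  constructor(private params: ConfigParameters) {}

  // Falls back to the 5-minute default aligned with the observed
  // provider prompt-cache TTL; callers get milliseconds directly.
  getThinkingIdleThresholdMs(): number {
    const minutes =
      this.params.thinkingIdleThresholdMinutes ??
      DEFAULT_THINKING_IDLE_THRESHOLD_MINUTES;
    return minutes * 60_000;
  }
}
```

Exposing the setting in minutes while consuming it in milliseconds keeps the user-facing `context.gapThresholdMinutes` value readable without littering the client with unit conversions.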