
Feature suggestion: Incremental session memory for zero-cost context compression #1691

@ascl1u

Description


What feature would you like to see?

Problem
Currently, kimi-cli's /compact requires a full LLM call to summarize the conversation when it is triggered. In long sessions that call is itself expensive, and sometimes the context is so large that the summarization call fails outright.

Proposal
During the session, periodically extract key information in the background into a structured markdown file (session title, current state, important files, workflow, error log, etc.). When context pressure triggers compaction, use this already-built memory file directly as the summary, trim the old messages, and keep the most recent tail of the conversation.
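Illustratively, the memory file might look like the following; the section names come from the list above, while the contents are invented for the example:

```markdown
# Session: refactor compaction pipeline

## Current state
Message trimming implemented; pairing edge case still failing in tests.

## Important files
- src/compaction.py
- tests/test_compaction.py

## Workflow
1. Reproduce the failure with a long-session fixture
2. Fix the pairing logic, then re-run the test suite

## Error log
- KeyError: 'tool_use_id' raised when a pair is split at the cut point
```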

LLM cost at compaction time: zero, because the summary has already been built incrementally over the course of the session.
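As a rough illustration of the "built incrementally" part, the extraction trigger can be a simple counter check run after each sampling step. Everything below (the class name, the threshold values, and combining the two conditions with AND) is an assumption for illustration, not kimi-cli's actual API:

```python
from dataclasses import dataclass


@dataclass
class ExtractionTrigger:
    """Decides when the background memory extraction should run.

    Threshold values are illustrative; whether the two conditions are
    combined with AND or OR is a design choice.
    """
    token_growth_threshold: int = 8000   # tokens added since last extraction
    tool_call_threshold: int = 10        # tool calls since last extraction
    tokens_at_last_extraction: int = 0
    tool_calls_since_extraction: int = 0

    def record_tool_call(self) -> None:
        self.tool_calls_since_extraction += 1

    def should_extract(self, current_tokens: int) -> bool:
        grew = (current_tokens - self.tokens_at_last_extraction
                >= self.token_growth_threshold)
        tooled = self.tool_calls_since_extraction >= self.tool_call_threshold
        return grew and tooled

    def mark_extracted(self, current_tokens: int) -> None:
        """Reset the counters after the memory file has been refreshed."""
        self.tokens_at_last_extraction = current_tokens
        self.tool_calls_since_extraction = 0
```

Because `mark_extracted` resets both counters, extraction only fires again after another burst of growth, which keeps the background LLM calls infrequent.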

Implementation sketch

- Register a post-sampling hook that triggers extraction when threshold conditions are met (token growth plus number of tool calls)
- Update the memory file with a lightweight forked LLM call so the main session is not blocked
- At compaction time, read the memory file, compute which messages to keep, and insert the memory content as the summary
- The trimming algorithm must preserve the integrity of tool_use/tool_result pairs
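The last two steps can be sketched as follows, assuming Anthropic-style messages whose content is a list of typed blocks; the function names and the keep_tail parameter are hypothetical:

```python
def has_tool_result(msg: dict) -> bool:
    """True if any content block of this message is a tool_result."""
    return any(isinstance(block, dict) and block.get("type") == "tool_result"
               for block in msg.get("content", []))


def trim_messages(messages: list[dict], keep_tail: int = 6) -> list[dict]:
    """Keep the last keep_tail messages, moving the cut point forward past
    any tool_result so no result survives without its matching tool_use."""
    if len(messages) <= keep_tail:
        return messages
    cut = len(messages) - keep_tail
    while cut < len(messages) and has_tool_result(messages[cut]):
        cut += 1
    return messages[cut:]


def compact(messages: list[dict], memory_markdown: str,
            keep_tail: int = 6) -> list[dict]:
    """Zero-LLM-cost compaction: prepend the prebuilt memory file as the
    summary, then keep only the recent message tail."""
    summary = {"role": "user",
               "content": [{"type": "text",
                            "text": "[Session memory]\n" + memory_markdown}]}
    return [summary] + trim_messages(messages, keep_tail)
```

The while loop handles the case where the cut lands on a message carrying a tool_result whose tool_use was trimmed away: advancing the cut drops the orphaned result instead of keeping it unpaired. Since results always follow their tool_use, the opposite orphan (a kept tool_use with a trimmed result) cannot arise from cutting the head of the list.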

Reference
I recently published a detailed analysis of Claude Code's context compaction system; this proposal is based on the Session Memory Compact design described there: [article link]

I am willing to implement this feature
Per CONTRIBUTING.md, I am opening an issue for discussion first. If the direction looks right, I can submit a PR.

Additional information

No response



Labels: enhancement (New feature or request)