feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization#2942
feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization#2942Apophis3158 wants to merge 1 commit intoQwenLM:mainfrom
Conversation
| return b.end; | ||
| } | ||
| if (col < b.start) { | ||
| return b.end; |
There was a problem hiding this comment.
[Critical] findNextCjkWordEnd returns b.end when col < b.start, causing Ctrl+Right to skip over non-CJK text and jump directly to the end of a CJK word.
For example, in "hello 你好 world", if the cursor is inside "hello", pressing Ctrl+Right would jump to position 8, skipping "llo " and "你好" entirely. This is asymmetric with findPrevCjkWordStart which correctly returns null in the analogous case.
| return b.end; | |
| } | |
| if (col < b.start) { | |
| return b.end; | |
| if (col < b.start) { | |
| return null; | |
| } |
— glm-5.1 via Qwen Code /review
| ); | ||
| segmentitInstance = null; | ||
| return; | ||
| } | ||
| segmentitInstance = initSegment(new Segment()); | ||
| debugLogger.info('segmentit: loaded successfully'); | ||
| } catch (err) { | ||
| debugLogger.warn('segmentit: failed to load', err); | ||
| segmentitInstance = null; | ||
| } |
There was a problem hiding this comment.
[Suggestion] ensureSegmentitLoaded sets segmentitInstance = null on failure, causing it to retry createRequire on every keypress. Use a sentinel value to distinguish "not yet attempted" from "attempted and failed".
Three changes needed:
- Declaration (line ~114):
let segmentitInstance: { doSegment: (text: string) => Array<{w: string}> } | null | false = null;- Catch block (line ~122): change
segmentitInstance = nullto:
segmentitInstance = false;- Guard (line ~116): change
if (segmentitInstance !== null) return;— this already works sincefalse !== nullistrue, so it will skip retrying.
— glm-5.1 via Qwen Code /review
| debugLogger.warn('getCjkWordBoundaries: error, using char fallback', err); | ||
| // On error, fall back to char-by-char boundaries (cached) | ||
| const fallback = charByCharCjkFallback(line); | ||
| cjkBoundariesCache.set(line, fallback); |
There was a problem hiding this comment.
[Suggestion] The catch block inserts into the cache without calling evictCacheIfNeeded() first. All other insertion paths call it. If doSegment errors on many distinct lines, the cache can grow beyond the 500-entry CJK_BOUNDARIES_CACHE_MAX limit.
| cjkBoundariesCache.set(line, fallback); | |
| evictCacheIfNeeded(); | |
| cjkBoundariesCache.set(line, fallback); |
— glm-5.1 via Qwen Code /review
packages/cli/package.json
Outdated
| "prompts": "^2.4.2", | ||
| "react": "^19.1.0", | ||
| "read-package-up": "^11.0.0", | ||
| "segmentit": "^2.0.3", |
There was a problem hiding this comment.
[Suggestion] segmentit adds ~15MB to disk footprint (embedded dictionary data) as a mandatory dependency for all CLI users. Since the project requires Node.js 20+, the built-in Intl.Segmenter supports CJK word segmentation with zero extra weight:
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
const segments = [...segmenter.segment(line)];Note: Intl.Segmenter uses ICU data which may produce different word boundaries than segmentit's dictionary-based approach. Recommend testing with representative CJK text samples before switching.
— glm-5.1 via Qwen Code /review
There was a problem hiding this comment.
These findings could not be posted as inline comments (lines not in diff):
- AppContainer.tsx —
midTurnDrainRefreads from React state mirror instead of synchronous ref. Fix: usedrainQueue()fromuseMessageQueuedirectly. - prompts.ts —
getActionsSection()says "ask for confirmation" but existing rule says "do not ask permission to use the tool". Contradictory instructions may cause inconsistent model behavior. - text-buffer.ts —
delete_word_left/delete_word_rightstill use Latin-only word boundary logic whilemove_wordnow uses CJK segmentation. Inconsistent UX for CJK users.
— glm-5.1 via Qwen Code /review
There was a problem hiding this comment.
Review — feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization
Files changed: 3 (+465 / -31)
The feature is valuable — CJK users currently have no word-boundary navigation. Two issues to address:
1. segmentit should be replaced with Intl.Segmenter
The segmentit package adds ~15.8 MB to node_modules, bundles full Chinese dictionaries, hasn't been updated since 2022, and its license field says "Proprietary" — a red flag for an open-source project. It also requires a CJS interop hack (createRequire), lazy-loading machinery, and a setTimeout pre-warm.
Node.js >=16 ships Intl.Segmenter natively, which provides word-level segmentation for CJK with zero dependencies: new Intl.Segmenter('zh', { granularity: 'word' }). This project requires Node >=20, so it's fully available. Intl.Segmenter also handles Japanese and Korean properly, unlike segmentit which is Chinese-only.
Suggestion: Replace segmentit with Intl.Segmenter. This eliminates the dependency, the CJS interop, the lazy-loading, the licensing concern, and broadens language coverage.
2. Zero test coverage for new functionality
~240 lines of new segmentation code and modified input navigation logic with no tests. getCjkWordBoundaries, findPrevCjkWordStart, findNextCjkWordEnd, and the modified reducer branches are all untested.
Suggestion: Add unit tests covering at minimum: pure CJK navigation, mixed CJK/Latin text, fallback behavior, Latin-only regression, and edge cases (empty line, single char, cursor at boundaries).
bc68132 to
a8255dc
Compare
a8255dc to
cba7104
Compare
|
All review suggestions have been addressed. Branch has been force-pushed with the updated implementation. @tanzhenxin (CHANGES_REQUESTED) 1. Replace
|
tanzhenxin
left a comment
There was a problem hiding this comment.
Review
This PR adds CJK word segmentation using Intl.Segmenter for Ctrl+arrow/Ctrl+Backspace navigation. The switch from segmentit to the native API is a great call — zero dependencies, better language coverage. CJK navigation works well in testing: pure Chinese text segments correctly (你好世界测试 → 你好|世界|测试), mixed CJK/Latin handles script boundaries properly, and spaces between CJK words are handled cleanly.
Issues
1. Underscore no longer treated as a word boundary — regression for code editing
variable_name is now deleted as a single word by Ctrl+W/Ctrl+Backspace. The old behavior stopped at the underscore (_name first, then variable). This is because Intl.Segmenter and the new isWordCharStrict (/[\w\p{L}\p{N}]/u) both treat underscores as word characters.
Every major code editor (VS Code, JetBrains, terminals) treats underscores as word separators for Ctrl+arrow navigation. Since this is a code-oriented CLI, snake_case identifiers (my_variable, get_user_name) should navigate part-by-part.
Suggestion: post-process segmenter results to split on underscores, or restore underscore-as-separator behavior for non-CJK segments.
Verdict
REQUEST_CHANGES — The CJK feature works well, but the underscore regression affects all users typing snake_case code (Python, Rust, C, etc.). Should be a straightforward fix since the segmenter results can be post-processed.
According to my tests, Windows Terminal, JetBrains Rider 2025.3, and VS Code treat |
|
I also tested Claude Code and confirmed that |
TLDR
This PR adds intelligent CJK (Chinese/Japanese/Korean) word segmentation to the CLI text input, enabling proper Ctrl+Left/Right word-by-word navigation for CJK text.
Problem: Without this change, pressing Ctrl+Left/Right on CJK text jumps over the entire contiguous block of CJK characters until the next whitespace, treating phrases like "你好世界" as a single word. This makes precise cursor positioning in mixed Latin-CJK text nearly impossible.
Solution: Integrates the
segmentitlibrary for Chinese word segmentation, with character-by-character fallback for long lines and caching for performance. The implementation:segmentitfor CJK word boundary detectionisDifferentScriptfallbackScreenshots / Video Demo
Dive Deeper
Implementation Details
Word Navigation (
wordLeft/wordRight):getCjkWordBoundaries()for lines containing CJK charactersfindPrevCjkWordStart()/findNextCjkWordEnd()for precise cursor positioningisDifferentScript) for mixed text (e.g., Latin + CJK)Performance Optimizations:
segmentitis loaded on-demand viacreateRequire()for ESM/CJS interopDependencies:
segmentit@^2.0.3for Chinese word segmentationReviewer Test Plan
hello 你好 world 世界segmentit)你好,世界!你好hello世界arabicالعربيةnpm run test -- packages/cli/src/ui/components/shared/text-buffer.test.tsTesting Matrix
Linked issues / bugs
#2941
🤖 Generated with Qwen Code