feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization by Apophis3158 · Pull Request #2942 · QwenLM/qwen-code

Apophis3158 · 2026-04-07T05:41:01Z

TLDR

This PR adds intelligent CJK (Chinese/Japanese/Korean) word segmentation to the CLI text input, enabling proper Ctrl+Left/Right word-by-word navigation for CJK text.

Problem: Without this change, pressing Ctrl+Left/Right on CJK text jumps over the entire contiguous block of CJK characters until the next whitespace, treating phrases like "你好世界" as a single word. This makes precise cursor positioning in mixed Latin-CJK text nearly impossible.

Solution: Integrates the segmentit library for Chinese word segmentation, with character-by-character fallback for long lines and caching for performance. The implementation:

- Adds lazy-loaded segmentit for CJK word boundary detection
- Implements caching (up to 500 entries) to avoid repeated segmentation overhead
- Falls back to character-by-character navigation for lines exceeding 1500 characters to prevent UI freezing
- Extends CJK regex coverage to include CJK Extension A and Compatibility Ideographs
- Preserves existing Latin/multi-script word boundary logic via isDifferentScript fallback

Screenshots / Video Demo

Dive Deeper

Implementation Details

Word Navigation (wordLeft / wordRight):

- First attempts CJK segmentation via getCjkWordBoundaries() for lines containing CJK characters
- Uses findPrevCjkWordStart() / findNextCjkWordEnd() for precise cursor positioning
- Falls back to script-boundary detection (isDifferentScript) for mixed text (e.g., Latin + CJK)
- Handles edge cases: whitespace-only prefixes, punctuation skipping, and cross-line navigation

Performance Optimizations:

- Lazy loading: segmentit is loaded on-demand via createRequire() for ESM/CJS interop
- Pre-warming: Background initialization on first call to minimize latency
- Caching: Line content → boundary mapping cache with LRU-style eviction
- Length limit: Lines >1500 chars skip segmentation to avoid UI freeze
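
The length limit and the character-by-character fallback described above could look roughly like the sketch below. This is an illustration, not the PR's code: the `WordBoundary` shape, `MAX_SEGMENTABLE_LENGTH`, `isCjkCodePoint`, `charByCharFallback`, and `getBoundaries` are hypothetical names, and offsets are assumed to be measured in code points.

```typescript
// Hypothetical boundary shape: [start, end) offsets in code points.
interface WordBoundary {
  start: number;
  end: number;
}

// The 1500 limit comes from the PR description; the constant name is assumed.
const MAX_SEGMENTABLE_LENGTH = 1500;

// CJK Unified Ideographs, Extension A, and Compatibility Ideographs.
function isCjkCodePoint(cp: number): boolean {
  return (
    (cp >= 0x4e00 && cp <= 0x9fff) ||
    (cp >= 0x3400 && cp <= 0x4dbf) ||
    (cp >= 0xf900 && cp <= 0xfaff)
  );
}

// Fallback: every CJK character becomes its own one-character word.
function charByCharFallback(line: string): WordBoundary[] {
  const boundaries: WordBoundary[] = [];
  Array.from(line).forEach((ch, index) => {
    const cp = ch.codePointAt(0);
    if (cp !== undefined && isCjkCodePoint(cp)) {
      boundaries.push({ start: index, end: index + 1 });
    }
  });
  return boundaries;
}

// Long lines skip the (potentially slow) segmenter entirely.
function getBoundaries(
  line: string,
  segment: (line: string) => WordBoundary[],
): WordBoundary[] {
  if (Array.from(line).length > MAX_SEGMENTABLE_LENGTH) {
    return charByCharFallback(line);
  }
  return segment(line);
}
```

Passing the segmenter in as a function keeps the sketch independent of whichever segmentation backend is used.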

Dependencies:

Added segmentit@^2.0.3 for Chinese word segmentation

Reviewer Test Plan

1. Open the CLI and type mixed Latin-CJK text, e.g., hello 你好 world 世界
2. Use Ctrl+Left/Right to navigate word-by-word:
   - Verify cursor stops at CJK word boundaries (as determined by segmentit)
   - Verify Latin words are still treated as single units
   - Verify cross-line navigation works correctly
3. Test edge cases:
   - Long lines (>1500 chars) with CJK text
   - CJK text with punctuation: 你好，世界！
   - Mixed scripts: 你好hello世界arabicالعربية
4. Run unit tests: npm run test -- packages/cli/src/ui/components/shared/text-buffer.test.ts

Testing Matrix

| | 🍏 | 🪟 | 🐧 |
| --- | --- | --- | --- |
| npm run | ❓ | ✅ | ❓ |
| npx | ❓ | ❓ | ❓ |
| Docker | ❓ | ❓ | ❓ |
| Podman | ❓ | - | - |
| Seatbelt | ❓ | - | - |

Linked issues / bugs

#2941

🤖 Generated with Qwen Code


wenshao · 2026-04-07T20:05:38Z

packages/cli/src/ui/components/shared/text-buffer.ts

+      return b.end;
+    }
+    if (col < b.start) {
+      return b.end;


[Critical] findNextCjkWordEnd returns b.end when col < b.start, causing Ctrl+Right to skip over non-CJK text and jump directly to the end of a CJK word.

For example, in "hello 你好 world", if the cursor is inside "hello", pressing Ctrl+Right would jump to position 8, skipping "llo " and "你好" entirely. This is asymmetric with findPrevCjkWordStart which correctly returns null in the analogous case.

Suggested change

       return b.end;
     }
-    if (col < b.start) {
-      return b.end;
+    if (col < b.start) {
+      return null;
+    }

— glm-5.1 via Qwen Code /review
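
To make the asymmetry concrete, here is a minimal sketch of the two lookups with the suggested fix applied. The real findPrevCjkWordStart / findNextCjkWordEnd bodies are not shown in full in this thread, so `findPrevWordStart`, `findNextWordEnd`, and the `WordBoundary` shape are hypothetical stand-ins. The sketch assumes boundaries are sorted by `start` and that a `null` result tells the caller to fall back to the script-boundary logic.

```typescript
// Hypothetical boundary shape: [start, end) offsets within the line.
interface WordBoundary {
  start: number;
  end: number;
}

// Ctrl+Right: end of the CJK word containing `col`, or null when the cursor
// is not inside any CJK word (so the caller can use its non-CJK logic).
function findNextWordEnd(
  boundaries: WordBoundary[],
  col: number,
): number | null {
  for (const b of boundaries) {
    if (col >= b.start && col < b.end) {
      return b.end;
    }
    if (col < b.start) {
      // Cursor sits in non-CJK text before this word: do not jump over it.
      return null;
    }
  }
  return null;
}

// Ctrl+Left: start of the CJK word the cursor is inside or directly after.
function findPrevWordStart(
  boundaries: WordBoundary[],
  col: number,
): number | null {
  for (const b of boundaries.slice().reverse()) {
    if (col > b.start && col <= b.end) {
      return b.start;
    }
    if (col > b.end) {
      // Cursor sits in non-CJK text after this word: defer to the caller.
      return null;
    }
  }
  return null;
}
```

With the single CJK word of "hello 你好 world" at offsets 6 to 8, a cursor inside "hello" now yields `null` from `findNextWordEnd` instead of 8, mirroring what `findPrevWordStart` does for a cursor inside "world".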

wenshao · 2026-04-07T20:06:00Z

packages/cli/src/ui/components/shared/text-buffer.ts

+      );
+      segmentitInstance = null;
+      return;
+    }
+    segmentitInstance = initSegment(new Segment());
+    debugLogger.info('segmentit: loaded successfully');
+  } catch (err) {
+    debugLogger.warn('segmentit: failed to load', err);
+    segmentitInstance = null;
+  }


[Suggestion] ensureSegmentitLoaded sets segmentitInstance = null on failure, causing it to retry createRequire on every keypress. Use a sentinel value to distinguish "not yet attempted" from "attempted and failed".

Two changes are needed; the guard already works as written:

Declaration (line ~114):

let segmentitInstance: { doSegment: (text: string) => Array<{w: string}> } | null | false = null;

Catch block (line ~122): change segmentitInstance = null to:

segmentitInstance = false;

Guard (line ~116): no change needed. if (segmentitInstance !== null) return; already skips retrying, since false !== null is true.

— glm-5.1 via Qwen Code /review
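
The sentinel pattern from this comment can be sketched in isolation as follows. `createLazyLoader` is a hypothetical helper written for illustration; the PR keeps the state in a module-level variable rather than a closure.

```typescript
// Wraps a loader so that it runs at most once, even when it fails.
// State: null = not attempted yet, false = attempted and failed.
function createLazyLoader<T extends object>(load: () => T): () => T | null {
  let instance: T | null | false = null;

  return (): T | null => {
    if (instance === null) {
      try {
        instance = load();
      } catch {
        // Remember the failure so later calls do not retry the load.
        instance = false;
      }
    }
    return instance === false ? null : instance;
  };
}
```

Callers see `null` both before a successful load and after a failed one, but only the first call pays the cost of attempting the load.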

wenshao · 2026-04-07T20:06:19Z

packages/cli/src/ui/components/shared/text-buffer.ts

+    debugLogger.warn('getCjkWordBoundaries: error, using char fallback', err);
+    // On error, fall back to char-by-char boundaries (cached)
+    const fallback = charByCharCjkFallback(line);
+    cjkBoundariesCache.set(line, fallback);


[Suggestion] The catch block inserts into the cache without calling evictCacheIfNeeded() first. All other insertion paths call it. If doSegment errors on many distinct lines, the cache can grow beyond the 500-entry CJK_BOUNDARIES_CACHE_MAX limit.

Suggested change

-    cjkBoundariesCache.set(line, fallback);
+    evictCacheIfNeeded();
+    cjkBoundariesCache.set(line, fallback);

— glm-5.1 via Qwen Code /review

wenshao · 2026-04-07T20:06:34Z

packages/cli/package.json

    "prompts": "^2.4.2",
    "react": "^19.1.0",
    "read-package-up": "^11.0.0",
+    "segmentit": "^2.0.3",


[Suggestion] segmentit adds ~15MB to disk footprint (embedded dictionary data) as a mandatory dependency for all CLI users. Since the project requires Node.js 20+, the built-in Intl.Segmenter supports CJK word segmentation with zero extra weight:

const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
const segments = [...segmenter.segment(line)];

Note: Intl.Segmenter uses ICU data which may produce different word boundaries than segmentit's dictionary-based approach. Recommend testing with representative CJK text samples before switching.

— glm-5.1 via Qwen Code /review
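
As a rough sketch of how Intl.Segmenter output could be turned into word boundaries: `segmentWords` and the `WordBoundary` shape below are hypothetical, and the offsets are UTF-16 code unit indices as reported by the segmenter's `index` field, not code points. Where the CJK boundaries fall depends on the ICU data shipped with the runtime.

```typescript
// Hypothetical boundary shape: [start, end) offsets in UTF-16 code units.
interface WordBoundary {
  start: number;
  end: number;
}

// Collects the word-like segments of a line, skipping whitespace and
// punctuation segments (those have isWordLike === false).
function segmentWords(line: string, locale: string = 'zh'): WordBoundary[] {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const boundaries: WordBoundary[] = [];
  for (const seg of segmenter.segment(line)) {
    if (seg.isWordLike) {
      boundaries.push({
        start: seg.index,
        end: seg.index + seg.segment.length,
      });
    }
  }
  return boundaries;
}

// Example: print the word-like pieces of a mixed Latin-CJK line.
const sample = 'hello 你好 world';
for (const b of segmentWords(sample)) {
  console.log(b.start, b.end, sample.slice(b.start, b.end));
}
```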

wenshao

These findings could not be posted as inline comments (lines not in diff):

- AppContainer.tsx — midTurnDrainRef reads from React state mirror instead of synchronous ref. Fix: use drainQueue() from useMessageQueue directly.
- prompts.ts — getActionsSection() says "ask for confirmation" but existing rule says "do not ask permission to use the tool". Contradictory instructions may cause inconsistent model behavior.
- text-buffer.ts — delete_word_left/delete_word_right still use Latin-only word boundary logic while move_word now uses CJK segmentation. Inconsistent UX for CJK users.

— glm-5.1 via Qwen Code /review

tanzhenxin

Review — feat(cli): CJK word segmentation and Ctrl+arrow navigation optimization

Files changed: 3 (+465 / -31)

The feature is valuable — CJK users currently have no word-boundary navigation. Two issues to address:

1. `segmentit` should be replaced with `Intl.Segmenter`

The segmentit package adds ~15.8 MB to node_modules, bundles full Chinese dictionaries, hasn't been updated since 2022, and its license field says "Proprietary" — a red flag for an open-source project. It also requires a CJS interop hack (createRequire), lazy-loading machinery, and a setTimeout pre-warm.

Node.js >=16 ships Intl.Segmenter natively, which provides word-level segmentation for CJK with zero dependencies: new Intl.Segmenter('zh', { granularity: 'word' }). This project requires Node >=20, so it's fully available. Intl.Segmenter also handles Japanese and Korean properly, unlike segmentit which is Chinese-only.

Suggestion: Replace segmentit with Intl.Segmenter. This eliminates the dependency, the CJS interop, the lazy-loading, the licensing concern, and broadens language coverage.

2. Zero test coverage for new functionality

~240 lines of new segmentation code and modified input navigation logic with no tests. getCjkWordBoundaries, findPrevCjkWordStart, findNextCjkWordEnd, and the modified reducer branches are all untested.

Suggestion: Add unit tests covering at minimum: pure CJK navigation, mixed CJK/Latin text, fallback behavior, Latin-only regression, and edge cases (empty line, single char, cursor at boundaries).

Apophis3158 · 2026-04-09T18:59:12Z

All review suggestions have been addressed. Branch has been force-pushed with the updated implementation.

@tanzhenxin (CHANGES_REQUESTED)

1. Replace `segmentit` with `Intl.Segmenter` — ✅ Done

Removed the segmentit dependency entirely and switched to Node.js's built-in Intl.Segmenter:

const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
const segments = segmenter.segment(line);

- Zero dependency — previously segmentit added ~15.8 MB
- No licensing concern — segmentit had "license": "Proprietary"
- CJK coverage — Intl.Segmenter handles Chinese, Japanese, and Korean; segmentit was Chinese-only
- No machinery — eliminated createRequire CJS interop, lazy-loading, and setTimeout pre-warm

2. Add unit tests — ✅ Done

Added ~390 lines of tests covering:

- Pure CJK word navigation (你好世界)
- Mixed CJK/Latin navigation (hello你好world)
- delete_word_left / delete_word_right with CJK text
- Long-line fallback (>1500 code points)
- Edge cases: empty line, single char, document boundaries, cross-line
- Dotted identifiers (Intl.Segmenter)
- Repeated identical words (variable_name variable_name)
- Test isolation via __resetWordSegmenter() in beforeEach

@wenshao (inline comments)

3. Sentinel value for `ensureSegmenterLoaded` — ✅ Done

segmenter is typed as Intl.Segmenter | null | false. On failure it's set to false, so the guard if (segmenter !== null) return skips re-attempts on every keypress.

4. Missing `evictCacheIfNeeded()` in catch block — ✅ Done

All cache insertion paths, including the error catch branch, now call evictCacheIfNeeded() before set(). Additionally, eviction was improved from clear() to single-entry LRU eviction to preserve hot data.
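
Single-entry LRU eviction of this kind is commonly built on the insertion order of a `Map`. The sketch below shows that general technique and is not the PR's code; the `BoundedLruCache` class and its method names are hypothetical.

```typescript
// A small LRU cache: the first key in Map iteration order is always the
// least recently used entry, because reads and writes re-insert their key.
class BoundedLruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    const value = this.entries.get(key);
    if (value !== undefined) {
      // Move the key to the end so it counts as most recently used.
      this.entries.delete(key);
      this.entries.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) {
      this.entries.delete(key);
    } else if (this.entries.size >= this.maxEntries) {
      // Evict exactly one entry (the oldest) instead of clearing everything.
      const oldest = this.entries.keys().next();
      if (!oldest.done) {
        this.entries.delete(oldest.value);
      }
    }
    this.entries.set(key, value);
  }

  has(key: string): boolean {
    return this.entries.has(key);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because eviction happens inside `set`, every insertion path, including an error fallback, respects the size limit without a separate call.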

5. Suggest `Intl.Segmenter` over `segmentit` — ✅ Done

Same as item 1 above.

Summary review (glm-5.1 via /review)

6. `delete_word_left` / `delete_word_right` should use CJK segmentation — ✅ Done

Both operations now call getWordBoundaries(lineContent, arr), consistent with wordLeft / wordRight.

Additional optimizations beyond the review:

- Eliminated redundant toCodePoints() calls: getWordBoundaries accepts an optional pre-computed codePoints parameter
- Exported __resetWordSegmenter() for test isolation, called in beforeEach of all relevant test blocks
- Replaced cache clear() with per-entry LRU eviction at 500 entries

tanzhenxin

Review

This PR adds CJK word segmentation using Intl.Segmenter for Ctrl+arrow/Ctrl+Backspace navigation. The switch from segmentit to the native API is a great call — zero dependencies, better language coverage. CJK navigation works well in testing: pure Chinese text segments correctly (你好世界测试 → 你好|世界|测试), mixed CJK/Latin handles script boundaries properly, and spaces between CJK words are handled cleanly.

Issues

1. Underscore no longer treated as a word boundary — regression for code editing

variable_name is now deleted as a single word by Ctrl+W/Ctrl+Backspace. The old behavior stopped at the underscore (_name first, then variable). This is because Intl.Segmenter and the new isWordCharStrict (/[\w\p{L}\p{N}]/u) both treat underscores as word characters.

Every major code editor (VS Code, JetBrains, terminals) treats underscores as word separators for Ctrl+arrow navigation. Since this is a code-oriented CLI, snake_case identifiers (my_variable, get_user_name) should navigate part-by-part.

Suggestion: post-process segmenter results to split on underscores, or restore underscore-as-separator behavior for non-CJK segments.

Verdict

REQUEST_CHANGES — The CJK feature works well, but the underscore regression affects all users typing snake_case code (Python, Rust, C, etc.). Should be a straightforward fix since the segmenter results can be post-processed.
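
One possible shape for the post-processing suggested in this review is sketched below. `splitOnUnderscores` and the `WordBoundary` shape are hypothetical, and offsets are assumed to be UTF-16 code unit indices into the line.

```typescript
// Hypothetical boundary shape: [start, end) offsets in UTF-16 code units.
interface WordBoundary {
  start: number;
  end: number;
}

// Splits each segmenter-produced word at underscores, so that
// "get_user_name" becomes the three words "get", "user", and "name".
// The underscores themselves belong to no word.
function splitOnUnderscores(
  line: string,
  boundaries: WordBoundary[],
): WordBoundary[] {
  const result: WordBoundary[] = [];
  for (const b of boundaries) {
    let partStart = b.start;
    for (let i = b.start; i < b.end; i++) {
      if (line[i] === '_') {
        if (i > partStart) {
          result.push({ start: partStart, end: i });
        }
        partStart = i + 1;
      }
    }
    if (b.end > partStart) {
      result.push({ start: partStart, end: b.end });
    }
  }
  return result;
}
```

CJK words pass through unchanged because they contain no underscores, so the split only affects identifier-like segments.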

Apophis3158 · 2026-04-10T09:20:44Z

> Issues
>
> 1. Underscore no longer treated as a word boundary — regression for code editing
>
> variable_name is now deleted as a single word by Ctrl+W/Ctrl+Backspace. The old behavior stopped at the underscore (_name first, then variable). This is because Intl.Segmenter and the new isWordCharStrict (/[\w\p{L}\p{N}]/u) both treat underscores as word characters.
>
> Every major code editor (VS Code, JetBrains, terminals) treats underscores as word separators for Ctrl+arrow navigation. Since this is a code-oriented CLI, snake_case identifiers (my_variable, get_user_name) should navigate part-by-part.
>
> Suggestion: post-process segmenter results to split on underscores, or restore underscore-as-separator behavior for non-CJK segments.

According to my tests, Windows Terminal, JetBrains Rider 2025.3, and VS Code each treat my_variable and get_user_name as a single word for navigation. Other modern text inputs behave the same way (for example https://chat.qwen.ai/, the GitHub text editor in this PR, and Telegram).

Apophis3158 · 2026-04-10T09:36:55Z

I also tested Claude Code and confirmed that it treats my_variable and get_user_name each as a single word.

wenshao reviewed Apr 7, 2026

tanzhenxin requested changes Apr 9, 2026

tanzhenxin added the type/feature-request label (New feature or enhancement request) Apr 9, 2026

Apophis3158 force-pushed the feat/cjk-navigation branch from bc68132 to a8255dc on April 9, 2026 18:50

feat(cli): add CJK word segmentation with Intl.Segmenter

cba7104

Apophis3158 force-pushed the feat/cjk-navigation branch from a8255dc to cba7104 on April 9, 2026 18:56

tanzhenxin requested changes Apr 10, 2026

Apophis3158 wants to merge 1 commit into QwenLM:main from Apophis3158:feat/cjk-navigation
