Support Unicode 15.1 new GB9c break rule #1718

@DonKult

ycmd embeds its Unicode support files and tests (currently for version 13) and provides a script, update_unicode.py, to update them to the latest Unicode version. This worked for the upgrade to version 14, but no longer works with 15. For example, the tests fail with:

[ RUN      ] UnicodeTest/WordTest.BreakIntoCharacters/1186
./cpp/ycm/tests/Word_test.cpp:60: Failure
Value of: Word( word_.text_ ).Characters()
Expected: { *{ "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", false, true, false, false } }
  Actual: { *{ "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क", "\xE0\xA4\x95"
    As Text: "क", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क", "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D"
    As Text: "क्, false, true, false, false }, *{ "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", "\xE0\xA4\xA4"
    As Text: "त", true, true, false, false } }

[  FAILED  ] UnicodeTest/WordTest.BreakIntoCharacters/1186, where GetParam() = { "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत", { "\xE0\xA4\x95\xE0\xA5\x8D\xE0\xA5\x8D\xE0\xA4\xA4"
    As Text: "कत" } (0 ms)

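The reason is that Unicode 15.1 introduces a new rule for (not) breaking, GB9c, and of course the new tests exercising this rule now fail. In the failure above, the sequence U+0915 U+094D U+094D U+0924 (KA, VIRAMA, VIRAMA, TA) is expected to stay a single character, but ycmd breaks it before the final TA.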
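
For illustration, here is a minimal Python sketch of the GB9c check as I read it in UAX #29: no break before an InCB=Consonant when it is preceded by an InCB=Consonant followed by a run of InCB=Extend/Linker characters containing at least one Linker. This is not ycmd's code. The `incb` lookup and the `gb9c_forbids_break` helper are hypothetical names, and the tiny hard-coded table covers only a few Devanagari code points; a real implementation would take the InCB property from DerivedCoreProperties.txt.

```python
# Sketch only: hand-written excerpt of the Indic_Conjunct_Break (InCB)
# property for a few Devanagari code points. Real data comes from
# DerivedCoreProperties.txt (Unicode 15.1+).
CONSONANT, LINKER, EXTEND = "Consonant", "Linker", "Extend"


def incb(ch):
    """Return the InCB value of ch, or None (hypothetical helper)."""
    cp = ord(ch)
    if 0x0915 <= cp <= 0x0939:      # DEVANAGARI LETTER KA..HA
        return CONSONANT
    if cp == 0x094D:                # DEVANAGARI SIGN VIRAMA
        return LINKER
    if cp in (0x093C, 0x200D):      # DEVANAGARI SIGN NUKTA, ZERO WIDTH JOINER
        return EXTEND
    return None


def gb9c_forbids_break(text, i):
    """True if GB9c forbids a break between text[i-1] and text[i].

    GB9c: Consonant [Extend Linker]* Linker [Extend Linker]* x Consonant
    """
    if i <= 0 or i >= len(text) or incb(text[i]) != CONSONANT:
        return False
    seen_linker = False
    j = i - 1
    # Walk back over the run of Extend/Linker characters before text[i].
    while j >= 0 and incb(text[j]) in (LINKER, EXTEND):
        if incb(text[j]) == LINKER:
            seen_linker = True
        j -= 1
    # The run must contain a Linker and be preceded by a Consonant.
    return seen_linker and j >= 0 and incb(text[j]) == CONSONANT


# The sequence from the failing test: KA, VIRAMA, VIRAMA, TA.
failing_case = "\u0915\u094D\u094D\u0924"
print(gb9c_forbids_break(failing_case, 3))
```

With this check in place, the break before the final TA is suppressed, which matches the single-character result the new test expects. The breaks before the two viramas are already suppressed by the existing GB9 rule (no break before Extend).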

Prior art implementing this rule elsewhere: JuliaStrings/utf8proc#253

It would be nice if support for newer Unicode standards could be added to ycmd.
