uppercase/lowercase functions are not portable? #11471

@stevengj

Since @ScottPJones was mentioning upper/lowercase functions recently, I took a quick look at them and noticed that we are calling towupper and towlower, which are C99 functions that take a single wide character (a wint_t holding a wchar_t value) as their argument.

Unfortunately, this means that they are broken on Windows (where wchar_t is 16 bits) for any character outside the BMP, since such a character does not fit in a single wchar_t. Even on other platforms with a 32-bit wchar_t, they will return different results on different systems, and many systems will have out-of-date Unicode tables. They are also locale-dependent; I'm not sure whether this is desirable for us.
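
To make the 16-bit problem concrete, here is a rough sketch in Python (used only for illustration; the helper names utf16_units and upper_per_unit are made up). It splits a code point into UTF-16 code units, which is how it would be stored in 16-bit wchar_t, and then case-maps each unit on its own, which is all a one-wchar_t-at-a-time function can do. Python's built-in Unicode tables stand in here for whatever the C library provides.

```python
def utf16_units(cp):
    """Return the UTF-16 code units for code point cp (hypothetical helper)."""
    if cp < 0x10000:
        return [cp]
    v = cp - 0x10000
    # Non-BMP code points need a high/low surrogate pair.
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def upper_per_unit(cp):
    """Uppercase each 16-bit unit separately, as a per-wchar_t call must."""
    out = []
    for u in utf16_units(cp):
        mapped = chr(u).upper()
        # Keep the unit unchanged if there is no single-character mapping.
        out.append(ord(mapped) if len(mapped) == 1 else u)
    return out


# U+10428 (a lowercase Deseret letter) is stored as two surrogates.
units = utf16_units(0x10428)
# Mapping the surrogates one at a time leaves them unchanged ...
per_unit = upper_per_unit(0x10428)
# ... while mapping the whole code point yields the uppercase letter.
whole = ord(chr(0x10428).upper())
```

For U+10428 the per-unit result is the same surrogate pair that went in, while mapping the whole code point gives U+10400, so the uppercase form is unreachable through a 16-bit interface.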

utf8proc has up-to-date upper/lower/titlecase mapping data already in its "database" (generated from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), so maybe we should just add a utf8proc_toupper function (etc.) to utf8proc to make this accessible. Then we could call that (probably plus a check for the common case of ASCII codepoints).
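
The shape I have in mind is roughly the following, sketched in Python rather than C just to show the control flow. The name toupper_codepoint is hypothetical and is not an existing utf8proc function; Python's own Unicode database plays the role of the utf8proc table, and the sketch assumes we only want one-to-one mappings (code points whose uppercase form expands to several characters are returned unchanged).

```python
def toupper_codepoint(cp):
    """Locale-independent uppercase of a single code point (sketch only)."""
    # Fast path for the common case of ASCII code points.
    if 0 <= cp < 0x80:
        return cp - 0x20 if 0x61 <= cp <= 0x7A else cp
    # Anything outside the Unicode range is passed through untouched.
    if cp < 0 or cp > 0x10FFFF:
        return cp
    # Table lookup; Python's Unicode data stands in for utf8proc's database.
    mapped = chr(cp).upper()
    # Assumption: keep only one-to-one mappings (e.g. U+00DF stays as is).
    return ord(mapped) if len(mapped) == 1 else cp
```

Because the lookup takes a full code point rather than a wchar_t, the same call works for non-BMP characters such as U+10428, and the result does not depend on the platform's C library or the current locale.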

Labels: unicode (related to Unicode characters and encodings)