Since @ScottPJones was mentioning upper/lowercase functions recently, I took a quick look at them and noticed that we are calling `towupper` and `towlower`, which are C99 functions that accept `wchar_t` arguments.
Unfortunately, this means that they are broken on Windows (where `wchar_t` is 16 bits) for any character outside the BMP. Even on other platforms with a 32-bit `wchar_t`, they are going to return different results on different systems, and many systems will have out-of-date Unicode tables. They are also locale-dependent; I'm not sure if this is desirable for us.
utf8proc already has up-to-date upper/lower/titlecase mapping data in its "database" (generated from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), so maybe we should just add a `utf8proc_toupper` function (etc.) to utf8proc to make this accessible. Then we could call that (probably plus a check for the common case of ASCII codepoints).
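To make the idea concrete, here is a rough sketch of the shape such a function could take: an ASCII fast path, then a table lookup on full 32-bit code points so that nothing depends on `wchar_t` or the locale. The name `sketch_toupper` and the four-entry table are hypothetical stand-ins for illustration only. A real implementation would consult utf8proc's generated database rather than a hand-written table.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for utf8proc's case-mapping data: a tiny table of
   (lowercase, uppercase) code point pairs, sorted by the lowercase value.
   The real data would be generated from UnicodeData.txt. */
typedef struct {
    int32_t lower;
    int32_t upper;
} case_pair;

static const case_pair sketch_upper_table[] = {
    { 0x00E9,  0x00C9  },  /* LATIN SMALL LETTER E WITH ACUTE */
    { 0x03B1,  0x0391  },  /* GREEK SMALL LETTER ALPHA */
    { 0x0430,  0x0410  },  /* CYRILLIC SMALL LETTER A */
    { 0x10428, 0x10400 },  /* DESERET SMALL LETTER LONG I (outside the BMP) */
};

/* Uppercase a single code point, independent of locale and wchar_t width. */
int32_t sketch_toupper(int32_t c)
{
    /* Fast path for the common case of ASCII code points. */
    if (c < 0x80)
        return (c >= 'a' && c <= 'z') ? c - 0x20 : c;

    /* Binary search over the sorted table. */
    size_t lo = 0;
    size_t hi = sizeof(sketch_upper_table) / sizeof(sketch_upper_table[0]);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sketch_upper_table[mid].lower == c)
            return sketch_upper_table[mid].upper;
        if (sketch_upper_table[mid].lower < c)
            lo = mid + 1;
        else
            hi = mid;
    }

    /* No mapping in the table: return the code point unchanged. */
    return c;
}
```

Because the argument is a 32-bit code point, a non-BMP character such as U+10428 goes through the same path as everything else, which is exactly the case a 16-bit `wchar_t` cannot represent.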