cfile / ddio: use most of functions to std::filesystem::path#475
cfile / ddio: use most of functions to std::filesystem::path#475Lgt2x merged 23 commits intoDescentDevelopers:mainfrom
Conversation
5eb204f to
946affc
Compare
Make cfopen(), open_file_in_directory() and cfexist() use std::filesystem::path arguments.
Simplify related code in `cf_SetSearchPath()`.
Refactor paths and extensions to use std::map. Rewriting related code in manage.cpp.
Convert cf_IsFileInHog(), cf_OpenFileInLibrary() and cf_GetfileCRC() functions to use std::fs::path.
Refactor paths and extensions to use std::map. Rewriting related code in manage.cpp.
Convert cf_CopyFile() cf_CopyFileTime() and underlying functions to use std::filesystem::path
Convert cf_Diff() and underlying functions to use std::filesystem::path
Use std::fs::path on path construction, simplify code.
Simplifies code related to UpdatePrimitive().
Remove ddio_CreateDir(), ddio_RemoveDir(), ddio_SaveWorkingDir(), ddio_RestoreWorkingDir(), ddio_DirExists(). Replace them in code with std::fs alternatives.
946affc to
3fc6dd5
Compare
|
You can do something like that to fix mac os CI (but cleaner). |
504dd2d to
82e6c2f
Compare
|
After some refactoring cf_FindRealFileNameCaseInsensitive() now works much faster and more reliable. Here some benchmarks: Release: Debug: |
| tmode[1] = (mode[0] == 'w') ? mode[1] : 'b'; | ||
| // try to open file | ||
| fp = fopen(path, tmode); | ||
| fp = fopen(using_filename.u8string().c_str(), tmode); |
There was a problem hiding this comment.
How do you know that UTF-8 is the correct encoding to use for fopen()’s filename parameter?
There was a problem hiding this comment.
This way how std::filesystem::path represent native encoding on systems. POSIX filesystem is native char representation, but Windows' native representation is wchar. If you want use paths with C functions, you need convert them with c_str() using u8string().
There was a problem hiding this comment.
This way how std::filesystem::path represent native encoding on systems. POSIX filesystem is native
charrepresentation, but Windows' native representation iswchar. If you want use paths with C functions, you need convert them withc_str()usingu8string().
I’m a little bit confused. According to cppreference.com, std::filesystem::path::u8string() returns a std::u8string. std::u8string is the same as std::basic_string<char8_t>. std::basic_string::<whatever>::c_str() returns a whatever * which means that my_u8string.c_str() should return a char8_t *, right? It seems to me like using_filename.u8string().c_str() is always going to evaluate to a char8_t *. I just don’t see how it could evaluate to a wchar_t *.
(The std::u8string type was added in C++20. In C++17, a similar argument can be made, but you have to replace std::u8string with std::string and char8_t with char.)
Passing a UTF-8 encoded string to fopen() is potentially a problem. According to Microsoft’s documentation,
The
fopenfunction opens the file specified byfilename. By default, a narrowfilenamestring is interpreted using the ANSI codepage (CP_ACP). In Windows Desktop applications, it can be changed to the OEM codepage (CP_OEMCP) by using theSetFileApisToOEMfunction. You can use theAreFileApisANSIfunction to determine whetherfilenameis interpreted using the ANSI or the system default OEM codepage.
Is there anything that’s going to guarantee that we’re using a UTF-8 codepage? Windows won’t use a UTF-8 codepage by default unless the user has “Beta: Use Unicode UTF-8 for worldwide language support” enabled, or the application explicitly specifies a UTF-8 codepage.
Additionally, if using_filename.u8string().c_str() sometimes does evaluate to a wchar_t *, then I think that we have a problem. According to Microsoft’s documentation, you’re supposed to pass a char * to fopen(), not a wchar_t *. If you want to use a wchar_t * for the filename parameter, then you’re supposed to call _wfopen().
There was a problem hiding this comment.
We don't even interested here what encoding is actually in filesystem. We effectively receive C-string (and don't bother in what encoding is it, it's just sequence of chars) and we ensuring that C-string is actually valid path in native filesystem. We do not using UTF-8 here at all. It's just way to convert native encoding to C-string.
There was a problem hiding this comment.
We don't even interested here what encoding is actually in filesystem.
If we’re going to call fopen() on Windows, then I think that we need to be interested in the filename parameter’s encoding because Windows is interested in the filename parameter’s encoding.
We effectively receive C-string (and don't bother in what encoding is it, it's just sequence of chars) and we ensuring that C-string is actually valid path in native filesystem. We do not using UTF-8 here at all.
That last sentence is very surprising to me. I thought that std::filesystem::path::u8string was supposed to return UTF-8, that std::filesystem::path::u16string was supposed to return UTF-16 and that std::filesystem::path::u32string was supposed to return UTF-32. cppreference.com says the following about those functions:
Returns the internal pathname in native pathname format, converted to specific string type. Conversion, if any, is performed as follows:
[…]
3) The result encoding in the case of
u8string()is always UTF-8.
Additionally, Microsoft’s documentation for std::filesystem::path::u8string says:
Converts the sequence stored in
mypathto UTF-8 and returns it stored in an object of typeu8string.
What makes you say “We do not using UTF-8 here at all”?
It's just way to convert native encoding to C-string.
I agree. u8string().c_str() does indeed convert a path (which uses its native encoding internally) into a C-string. My concern is about the encoding of the resulting C-string.
Lgt2x
left a comment
There was a problem hiding this comment.
Really good job, tested successfully on Linux. Will also test Windows before merging. If someone can do OSX don't hesitate
Refactoring platform-dependent parts of code into common part, greatly reduce LOC.
Convert some of the paths to use std::fs::path.
Moved directory parameter to 3rd position and make it optional.
Implementing cf_FindRealFileNameCaseInsensitive_new() based on std::fs::directory_iterator, which runs faster and more reliable. cf_FindRealFileNameCaseInsensitive_new() works exactly as cf_FindRealFileNameCaseInsensitive(). Benchmark on Release build: ``` ------------------------------------------------------------------- Benchmark Time CPU Iterations ------------------------------------------------------------------- cfile_no_exist_old 11607 ns 11399 ns 63035 cfile_no_exist_new 9302 ns 9196 ns 70041 cfile_exist_relative_old 2701 ns 2480 ns 282695 cfile_exist_relative_new 867 ns 866 ns 798478 cfile_exist_absolute_old 3415 ns 3167 ns 228323 cfile_exist_absolute_new 1182 ns 1181 ns 574782 ```
A little speed-up for our slowpoke.
Surprisingly on macOS FS is case-insensitive.
82e6c2f to
2fce643
Compare
|
No problems in game in either Linux or Windows, we're good to merge. I hope we did not break anything for MacOS, but I don't have the hardware to test it. If UTF-8 proves to be a problem in some situations we can fix it later. |
Consider this program:
#include <cstdio>
int main(void) {
const char *filename = u8"ディセント3.txt";
auto fp = std::fopen(filename, "r");
if (fp) {
std::fclose(fp);
return 0;
} else {
return 1;
};
}
If a file named ディセント3.txt exists, then will that program
successfully open it? The answer is: it depends.
filename is going to point to these bytes:
Raw bytes: e3 83 87 e3 82 a3 e3 82 bb e3 83 b3 e3 83 88 33 2e 74 78 74 00
Characters: ディセント3.txt␀
Internally, Windows uses UTF-16. When you call fopen(), Windows will
convert the filename parameter into UTF-16 [1]. If the program is run
with a UTF-8 Windows code page, then the above bytes will be correctly
interpreted as UTF-8 when being converted into UTF-16 [2]. The final
UTF-16 string will be this*:
Raw bytes: ff fe c7 30 a3 30 bb 30 f3 30 c8 30 33 00 2e 00 74 00 78 00 74 00
Characters: ディセント3.txt
On the other hand, if the program is run with code page 932, then the
original bytes will be incorrectly interpreted as code page 932 when
being converted into UTF-16. The final UTF-16 string will be this*:
Raw bytes: ff fe 5d 7e fd ff 67 7e 63 ff 67 7e 7b ff 5d 7e 73 ff 5d 7e fd ff 33 00 2e 00 74 00 78 00 74 00
Characters: 繝�繧」繧サ繝ウ繝�3.txt
In other words, if that program gets compiled on Windows with a UTF-8
execution character set, then it needs to be run with a UTF-8 Windows
code page. Otherwise, mojibake might happen.
*Unlike the first string, this one does not have a null terminator. This
is because the Windows kernel doesn’t use null terminated strings for
paths [3][4].
---
Before this commit, Descent 3 would pass UTF-8 to fopen(), even if
Descent 3 is run with a non-UTF-8 Windows code page [5]. This commit
makes sure that Descent 3 gets run with a UTF-8 Windows code page.
The Windows code page isn’t just used by fopen(). It also gets used by
many other functions in the Windows API [6]. I don’t know if Descent 3
uses any of those other functions, but if it does, then this commit will
also help make sure that those functions receive strings with the
correct character encoding. Descent 3 uses UTF-8 for strings by default
[7]. Making sure that Descent 3 uses UTF-8 everywhere will make
encoding-related mistakes less likely in the future.
Fixes DescentDevelopers#483.
[1]: <https://stackoverflow.com/a/7950569/7593853>
[2]: <https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170#remarks>
[3]: <https://stackoverflow.com/a/52372115/7593853>
[4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
[5]: <DescentDevelopers#475 (comment)>
[6]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis>
[7]: adf58ec (Explicitly declare execution character set, 2024-07-07)
Pull Request Type
Description
This PR updates cfile and partially ddio modules to use C++17's
std::filesystem::pathfeature, which simplifies overall filesystem operations.Some parts of code was migrated, and there some performance improvements was achieved. Internally cfile maintains some sort search paths. Old code was excessively added some non-existent paths, and on opening files these paths was pointlessly used. Also there some code optimization in
cf_FindRealFileNameCaseInsenstive()was performed.Added unittest framework for cfile.
Related Issues
Screenshots (if applicable)
Checklist
Additional Comments