Skip to content

cfile / ddio: use most of functions to std::filesystem::path#475

Merged
Lgt2x merged 23 commits intoDescentDevelopers:mainfrom
winterheart:cfile-std-filesystem
Jul 6, 2024
Merged

cfile / ddio: use most of functions to std::filesystem::path#475
Lgt2x merged 23 commits intoDescentDevelopers:mainfrom
winterheart:cfile-std-filesystem

Conversation

@winterheart
Copy link
Collaborator

Pull Request Type

  • GitHub Workflow changes
  • Documentation or Wiki changes
  • Build and Dependency changes
  • Runtime changes
    • Render changes
    • Audio changes
    • Input changes
    • Network changes
    • Other changes

Description

This PR updates cfile and partially ddio modules to use C++17's std::filesystem::path feature, which simplifies overall filesystem operations.

Some parts of code was migrated, and there some performance improvements was achieved. Internally cfile maintains some sort search paths. Old code was excessively added some non-existent paths, and on opening files these paths was pointlessly used. Also there some code optimization in cf_FindRealFileNameCaseInsenstive() was performed.

Added unittest framework for cfile.

Related Issues

Screenshots (if applicable)

Checklist

  • I have tested my changes locally and verified that they work as intended.
  • I have documented any new or modified functionality.
  • I have reviewed the changes to ensure they do not introduce any unnecessary complexity or duplicate code.
  • I understand that by submitting this pull request, I am agreeing to license my contributions under the project's license.

Additional Comments

@winterheart winterheart force-pushed the cfile-std-filesystem branch from 5eb204f to 946affc Compare June 29, 2024 09:36
Make cfopen(), open_file_in_directory() and cfexist() use std::filesystem::path arguments.
Simplify related code in `cf_SetSearchPath()`.
Refactor paths and extensions to use std::map. Rewriting related code in manage.cpp.
Convert cf_IsFileInHog(), cf_OpenFileInLibrary() and cf_GetfileCRC() functions to use std::fs::path.
Refactor paths and extensions to use std::map. Rewriting related code in manage.cpp.
Convert cf_CopyFile() cf_CopyFileTime() and underlying functions to use std::filesystem::path
Convert cf_Diff() and underlying functions to use std::filesystem::path
Use std::fs::path on path construction, simplify code.
Simplifies code related to UpdatePrimitive().
Remove ddio_CreateDir(), ddio_RemoveDir(), ddio_SaveWorkingDir(), ddio_RestoreWorkingDir(), ddio_DirExists(). Replace them in code with std::fs alternatives.
@winterheart winterheart force-pushed the cfile-std-filesystem branch from 946affc to 3fc6dd5 Compare July 1, 2024 09:51
@Lgt2x
Copy link
Member

Lgt2x commented Jul 1, 2024

You can do something like that to fix mac os CI (but cleaner). strncpy is problematic Lgt2x@15a50ca#diff-be34739991a103c73033e6358b31b2a8d7859b7a5c84616dcf1409176168f55d

@winterheart winterheart force-pushed the cfile-std-filesystem branch 6 times, most recently from 504dd2d to 82e6c2f Compare July 2, 2024 16:27
@winterheart
Copy link
Collaborator Author

After some refactoring cf_FindRealFileNameCaseInsensitive() now works much faster and more reliable. Here some benchmarks:

Release:

2024-07-02T20:08:56+03:00
Running ./cfile_bench
Run on (12 X 4300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 2.82, 1.81, 1.42
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
cfile_no_exist_old            11384 ns        11065 ns        63154
cfile_no_exist_new             9812 ns         9613 ns        74990
cfile_exist_relative_old       2682 ns         2452 ns       258570
cfile_exist_relative_new        926 ns          925 ns       784549
cfile_exist_absolute_old       3422 ns         3132 ns       222331
cfile_exist_absolute_new       1222 ns         1220 ns       585813

Debug:

2024-07-02T20:09:47+03:00
Running ./cfile_bench
Run on (12 X 4300 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 12288 KiB (x1)
Load Average: 2.21, 1.81, 1.45
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
cfile_no_exist_old            11768 ns        11426 ns        61555
cfile_no_exist_new            13682 ns        13423 ns        53839
cfile_exist_relative_old       3147 ns         2907 ns       224424
cfile_exist_relative_new       1591 ns         1588 ns       427936
cfile_exist_absolute_old       4179 ns         3893 ns       183101
cfile_exist_absolute_new       2132 ns         2129 ns       319614

@Lgt2x Lgt2x self-assigned this Jul 3, 2024
tmode[1] = (mode[0] == 'w') ? mode[1] : 'b';
// try to open file
fp = fopen(path, tmode);
fp = fopen(using_filename.u8string().c_str(), tmode);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you know that UTF-8 is the correct encoding to use for fopen()’s filename parameter?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way how std::filesystem::path represent native encoding on systems. POSIX filesystem is native char representation, but Windows' native representation is wchar. If you want use paths with C functions, you need convert them with c_str() using u8string().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way how std::filesystem::path represent native encoding on systems. POSIX filesystem is native char representation, but Windows' native representation is wchar. If you want use paths with C functions, you need convert them with c_str() using u8string().

I’m a little bit confused. According to cppreference.com, std::filesystem::path::u8string() returns a std::u8string. std::u8string is the same as std::basic_string<char8_t>. std::basic_string::<whatever>::c_str() returns a whatever * which means that my_u8string.c_str() should return a char8_t *, right? It seems to me like using_filename.u8string().c_str() is always going to evaluate to a char8_t *. I just don’t see how it could evaluate to a wchar_t *.

(The std::u8string type was added in C++20. In C++17, a similar argument can be made, but you have to replace std::u8string with std::string and char8_t with char.)

Passing a UTF-8 encoded string to fopen() is potentially a problem. According to Microsoft’s documentation,

The fopen function opens the file specified by filename. By default, a narrow filename string is interpreted using the ANSI codepage (CP_ACP). In Windows Desktop applications, it can be changed to the OEM codepage (CP_OEMCP) by using the SetFileApisToOEM function. You can use the AreFileApisANSI function to determine whether filename is interpreted using the ANSI or the system default OEM codepage.

Is there anything that’s going to guarantee that we’re using a UTF-8 codepage? Windows won’t use a UTF-8 codepage by default unless the user has “Beta: Use Unicode UTF-8 for worldwide language support” enabled, or the application explicitly specifies a UTF-8 codepage.

Additionally, if using_filename.u8string().c_str() sometimes does evaluate to a wchar_t *, then I think that we have a problem. According to Microsoft’s documentation, you’re supposed to pass a char * to fopen(), not a wchar_t *. If you want to use a wchar_t * for the filename parameter, then you’re supposed to call _wfopen().

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't even interested here what encoding is actually in filesystem. We effectively receive C-string (and don't bother in what encoding is it, it's just sequence of chars) and we ensuring that C-string is actually valid path in native filesystem. We do not using UTF-8 here at all. It's just way to convert native encoding to C-string.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't even interested here what encoding is actually in filesystem.

If we’re going to call fopen() on Windows, then I think that we need to be interested in the filename parameter’s encoding because Windows is interested in the filename parameter’s encoding.


We effectively receive C-string (and don't bother in what encoding is it, it's just sequence of chars) and we ensuring that C-string is actually valid path in native filesystem. We do not using UTF-8 here at all.

That last sentence is very surprising to me. I thought that std::filesystem::path::u8string was supposed to return UTF-8, that std::filesystem::path::u16string was supposed to return UTF-16 and that std::filesystem::path::u32string was supposed to return UTF-32. cppreference.com says the following about those functions:

Returns the internal pathname in native pathname format, converted to specific string type. Conversion, if any, is performed as follows:

[…]

3) The result encoding in the case of u8string() is always UTF-8.

Additionally, Microsoft’s documentation for std::filesystem::path::u8string says:

Converts the sequence stored in mypath to UTF-8 and returns it stored in an object of type u8string.

What makes you say “We do not using UTF-8 here at all”?


It's just way to convert native encoding to C-string.

I agree. u8string().c_str() does indeed convert a path (which uses its native encoding internally) into a C-string. My concern is about the encoding of the resulting C-string.

Copy link
Member

@Lgt2x Lgt2x left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good job, tested successfully on Linux. Will also test Windows before merging. If someone can do OSX don't hesitate

Convert some of the paths to use std::fs::path.
Moved directory parameter to 3rd position and make it optional.
Implementing cf_FindRealFileNameCaseInsensitive_new() based on std::fs::directory_iterator, which runs faster and more reliable.

cf_FindRealFileNameCaseInsensitive_new() works exactly as cf_FindRealFileNameCaseInsensitive().

Benchmark on Release build:

```
-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
cfile_no_exist_old            11607 ns        11399 ns        63035
cfile_no_exist_new             9302 ns         9196 ns        70041
cfile_exist_relative_old       2701 ns         2480 ns       282695
cfile_exist_relative_new        867 ns          866 ns       798478
cfile_exist_absolute_old       3415 ns         3167 ns       228323
cfile_exist_absolute_new       1182 ns         1181 ns       574782
```
Surprisingly on macOS FS is case-insensitive.
@winterheart winterheart force-pushed the cfile-std-filesystem branch from 82e6c2f to 2fce643 Compare July 4, 2024 22:51
@Lgt2x
Copy link
Member

Lgt2x commented Jul 6, 2024

No problems in game in either Linux or Windows, we're good to merge. I hope we did not break anything for MacOS, but I don't have the hardware to test it. If UTF-8 proves to be a problem in some situations we can fix it later.

@Lgt2x Lgt2x merged commit de75a80 into DescentDevelopers:main Jul 6, 2024
Jayman2000 added a commit to Jayman2000/Descent3-pr that referenced this pull request Jul 17, 2024
Consider this program:

  #include <cstdio>

  int main(void) {
    const char *filename = u8"ディセント3.txt";
    auto fp = std::fopen(filename, "r");
    if (fp) {
      std::fclose(fp);
      return 0;
    } else {
      return 1;
    };
  }

If a file named ディセント3.txt exists, then will that program
successfully open it? The answer is: it depends.

filename is going to point to these bytes:

  Raw bytes: e3 83 87 e3 82 a3 e3 82 bb e3 83 b3 e3 83 88 33 2e 74 78 74 00
  Characters: ディセント3.txt␀

Internally, Windows uses UTF-16. When you call fopen(), Windows will
convert the filename parameter into UTF-16 [1]. If the program is run
with a UTF-8 Windows code page, then the above bytes will be correctly
interpreted as UTF-8 when being converted into UTF-16 [2]. The final
UTF-16 string will be this*:

  Raw bytes: ff fe c7 30 a3 30 bb 30 f3 30 c8 30 33 00 2e 00 74 00 78 00 74 00
  Characters: ディセント3.txt

On the other hand, if the program is run with code page 932, then the
original bytes will be incorrectly interpreted as code page 932 when
being converted into UTF-16. The final UTF-16 string will be this*:

  Raw bytes: ff fe 5d 7e fd ff 67 7e 63 ff 67 7e 7b ff 5d 7e 73 ff 5d 7e fd ff 33 00 2e 00 74 00 78 00 74 00
  Characters: 繝�繧」繧サ繝ウ繝�3.txt

In other words, if that program gets compiled on Windows with a UTF-8
execution character set, then it needs to be run with a UTF-8 Windows
code page. Otherwise, mojibake might happen.

*Unlike the first string, this one does not have a null terminator. This
is because the Windows kernel doesn’t use null terminated strings for
paths [3][4].

---

Before this commit, Descent 3 would pass UTF-8 to fopen(), even if
Descent 3 is run with a non-UTF-8 Windows code page [5]. This commit
makes sure that Descent 3 gets run with a UTF-8 Windows code page.

The Windows code page isn’t just used by fopen(). It also gets used by
many other functions in the Windows API [6]. I don’t know if Descent 3
uses any of those other functions, but if it does, then this commit will
also help make sure that those functions receive strings with the
correct character encoding. Descent 3 uses UTF-8 for strings by default
[7]. Making sure that Descent 3 uses UTF-8 everywhere will make
encoding-related mistakes less likely in the future.

Fixes DescentDevelopers#483.

[1]: <https://stackoverflow.com/a/7950569/7593853>
[2]: <https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170#remarks>
[3]: <https://stackoverflow.com/a/52372115/7593853>
[4]: <https://googleprojectzero.blogspot.com/2016/02/the-definitive-guide-on-win32-to-nt.html>
[5]: <DescentDevelopers#475 (comment)>
[6]: <https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page#-a-vs--w-apis>
[7]: adf58ec (Explicitly declare execution character set, 2024-07-07)
@winterheart winterheart deleted the cfile-std-filesystem branch August 12, 2024 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants