
join: consider locale collation in field comparison #9982

Merged
sylvestre merged 2 commits into uutils:main from WaterWhisperer:join-locale-collation
Jan 20, 2026

Conversation

@WaterWhisperer
Contributor

Fixes: #9971

GNU join uses LC_COLLATE for field comparison. This PR (ref expr) implements locale-aware string comparison using uucore's i18n::collator module.

Reproduce:

coreutils$ export LC_ALL=en_US.UTF-8
coreutils$ cat > f1 <<'EOF'
ab:d  1
abc:d 2
EOF
coreutils$ cat > f2 <<'EOF'
ab:d  x
abc:d y
EOF
coreutils$ sort -k1,1 f1 > f1.sorted
coreutils$ sort -k1,1 f2 > f2.sorted
coreutils$ /usr/bin/join --check-order f1.sorted f2.sorted
abc:d 2 y
ab:d 1 x
coreutils$ ./target/release/join --check-order f1.sorted f2.sorted 
abc:d 2 y
ab:d 1 x
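
Both outputs list `abc:d` before `ab:d`, which is consistent with a collation in which punctuation such as `:` carries little or no weight at the first comparison level. A plain byte comparison orders them the other way, since `:` (0x3A) sorts before `c` (0x63). The sketch below only illustrates that difference: `compare_toy_collation` and `primary_key` are hypothetical helpers that skip non-alphanumeric characters. They are neither the real collation algorithm nor uucore's API.

```rust
use std::cmp::Ordering;

/// Byte-wise comparison: the ordering a C/POSIX locale implies.
fn compare_bytes(a: &str, b: &str) -> Ordering {
    a.as_bytes().cmp(b.as_bytes())
}

/// Hypothetical helper: keep only alphanumeric characters, lowercased.
fn primary_key(s: &str) -> Vec<char> {
    s.chars()
        .filter(|c| c.is_alphanumeric())
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// Toy stand-in for locale collation (NOT the real algorithm):
/// compare the alphanumeric-only keys first, then break ties byte-wise.
fn compare_toy_collation(a: &str, b: &str) -> Ordering {
    primary_key(a)
        .cmp(&primary_key(b))
        .then_with(|| a.cmp(b))
}

fn main() {
    let (x, y) = ("ab:d", "abc:d");
    // Byte order: ':' < 'c', so "ab:d" sorts first.
    println!("bytes:         {:?}", compare_bytes(x, y));
    // Toy collation: "abd" vs "abcd", so "abc:d" sorts first.
    println!("toy collation: {:?}", compare_toy_collation(x, y));
}
```

With byte comparison, `join --check-order` would see input sorted under en_US.UTF-8 as out of order, which is the mismatch this PR addresses.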

@sylvestre
Contributor

did you run some benchmarks ?

@WaterWhisperer
Contributor Author

> did you run some benchmarks ?

Thanks for reminding me. Here are the benchmark results.
[Screenshot 2026-01-02 18-06-01]
[Screenshot 2026-01-02 18-06-59]
It seems there's a significant performance drop :( So I'm trying to find ways to improve it.

@WaterWhisperer
Contributor Author

I have a few optimization ideas:

  • Cache the locale check at startup: The locale is currently checked on every comparison. We could check once and store a flag in the Input struct.
  • Fast path for C locale: Detect C/POSIX locale early and use direct byte comparison.
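
A minimal sketch of the two ideas combined, under stated assumptions: `CompareMode`, `detect_mode`, the `mode` field and `placeholder_locale_compare` are hypothetical names for illustration, not the code in this PR, and the placeholder stands in for the real collator call. Treating only `C` and `POSIX` as the byte-comparison case is a simplification.

```rust
use std::cmp::Ordering;

#[derive(Clone, Copy, Debug, PartialEq)]
enum CompareMode {
    Bytes,  // C/POSIX locale: plain byte comparison
    Locale, // anything else: go through the collator
}

/// Pick the mode once, using the POSIX precedence LC_ALL > LC_COLLATE > LANG.
/// Empty values count as unset; no value at all means the "C" locale.
fn detect_mode(lc_all: Option<&str>, lc_collate: Option<&str>, lang: Option<&str>) -> CompareMode {
    let name = [lc_all, lc_collate, lang]
        .iter()
        .copied()
        .flatten()
        .find(|v| !v.is_empty())
        .unwrap_or("C");
    if name == "C" || name == "POSIX" {
        CompareMode::Bytes
    } else {
        CompareMode::Locale
    }
}

/// Placeholder for the real locale-aware comparison (assumption: the actual
/// code would call the collator here). Compares ASCII-case-insensitively.
fn placeholder_locale_compare(a: &[u8], b: &[u8]) -> Ordering {
    a.to_ascii_lowercase()
        .cmp(&b.to_ascii_lowercase())
        .then_with(|| a.cmp(b))
}

/// Hypothetical, reduced version of an input struct holding the cached flag.
struct Input {
    mode: CompareMode,
}

impl Input {
    fn compare(&self, a: &[u8], b: &[u8]) -> Ordering {
        match self.mode {
            // Fast path: no locale lookup, no collator call.
            CompareMode::Bytes => a.cmp(b),
            CompareMode::Locale => placeholder_locale_compare(a, b),
        }
    }
}

fn main() {
    let var = |key: &str| std::env::var(key).ok();
    let (all, coll, lang) = (var("LC_ALL"), var("LC_COLLATE"), var("LANG"));
    // The environment is read once at startup, not once per comparison.
    let input = Input {
        mode: detect_mode(all.as_deref(), coll.as_deref(), lang.as_deref()),
    };
    println!("{:?}: {:?}", input.mode, input.compare(b"ab:d", b"abc:d"));
}
```

The per-comparison cost in the C locale is then a single `match` on a cached enum.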

Initial testing shows this could bring the C locale overhead from ~16% down to ~1%.
Is this performance trade-off acceptable for the project? Would you prefer a different approach?

Happy to iterate on this based on your guidance! @sylvestre

@sylvestre
Contributor

Why not both :)

@WaterWhisperer
Contributor Author

> Why not both :)

Yeah, that's exactly what I did.

@sylvestre
Contributor

I don't see the change :)

Note that we will need a benchmark for join
In a separate PR, if you are interested in doing it

@WaterWhisperer
Contributor Author

WaterWhisperer commented Jan 2, 2026

> I don't see the change :)

Sorry, I just pushed.

> Note that we will need a benchmark for join
> In a separate PR, if you are interested in doing it

Sure!

@sylvestre
Contributor

sylvestre commented Jan 3, 2026

in the future, please avoid screenshots: they are terrible for accessibility and search :) thanks

and please compare with gnu too

@WaterWhisperer
Contributor Author

100_000 lines file

coreutils$ LC_ALL=C hyperfine --warmup 5 --runs 10 -n "GNU join" '/usr/bin/join file1.txt file2.txt' -n "uutils join" 'target/release/join file1.txt file2.txt'
Benchmark 1: GNU join
  Time (mean ± σ):      20.8 ms ±   1.8 ms    [User: 18.2 ms, System: 2.5 ms]
  Range (min … max):    18.8 ms …  24.2 ms    10 runs
 
Benchmark 2: uutils join
  Time (mean ± σ):      22.9 ms ±   2.2 ms    [User: 19.8 ms, System: 3.0 ms]
  Range (min … max):    20.4 ms …  27.4 ms    10 runs
 
Summary
  GNU join ran
    1.10 ± 0.14 times faster than uutils join
coreutils$ LC_ALL=en_US.UTF-8 hyperfine --warmup 5 --runs 10 -n "GNU join" '/usr/bin/join file1.txt file2.txt' -n "uutils join" 'target/release/join file1.txt file2.txt'
Benchmark 1: GNU join
  Time (mean ± σ):      31.0 ms ±   3.5 ms    [User: 27.1 ms, System: 3.7 ms]
  Range (min … max):    27.6 ms …  38.6 ms    10 runs
 
Benchmark 2: uutils join
  Time (mean ± σ):      63.1 ms ±   3.3 ms    [User: 59.8 ms, System: 3.1 ms]
  Range (min … max):    60.1 ms …  70.4 ms    10 runs
 
Summary
  GNU join ran
    2.04 ± 0.26 times faster than uutils join
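
For reading the "times faster" lines: a small sketch that recomputes the ratio and its uncertainty from the reported means and standard deviations. It assumes the summary uses the usual first-order error propagation for a quotient, which is not confirmed in this thread. `ratio_with_uncertainty` is a hypothetical helper, and since its inputs are the rounded values printed above, the last digit can differ from hyperfine's own figure.

```rust
/// Ratio of two means with first-order propagated uncertainty.
/// Each argument is (mean, standard deviation) in the same unit.
fn ratio_with_uncertainty(slow: (f64, f64), fast: (f64, f64)) -> (f64, f64) {
    let (slow_mean, slow_sd) = slow;
    let (fast_mean, fast_sd) = fast;
    let ratio = slow_mean / fast_mean;
    // Relative errors add in quadrature for a quotient.
    let rel = ((slow_sd / slow_mean).powi(2) + (fast_sd / fast_mean).powi(2)).sqrt();
    (ratio, ratio * rel)
}

fn main() {
    // LC_ALL=C: uutils 22.9 ± 2.2 ms vs GNU 20.8 ± 1.8 ms
    let (r, u) = ratio_with_uncertainty((22.9, 2.2), (20.8, 1.8));
    println!("LC_ALL=C:           {:.2} ± {:.2}", r, u);

    // LC_ALL=en_US.UTF-8: uutils 63.1 ± 3.3 ms vs GNU 31.0 ± 3.5 ms
    let (r, u) = ratio_with_uncertainty((63.1, 3.3), (31.0, 3.5));
    println!("LC_ALL=en_US.UTF-8: {:.2} ± {:.2}", r, u);
}
```

Read this way, the C-locale gap (about 1.10 ± 0.14) is within noise, while the en_US.UTF-8 gap (about 2x) is not.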

@sylvestre sylvestre force-pushed the join-locale-collation branch from 41bce04 to 8bcc415 Compare January 3, 2026 13:45
@github-actions

github-actions bot commented Jan 3, 2026

GNU testsuite comparison:

GNU test failed: tests/tty/tty-eof. tests/tty/tty-eof is passing on 'main'. Maybe you have to rebase?
Congrats! The gnu test tests/tail/follow-name is no longer failing!

@sylvestre
Contributor

some benchmarks (the way I expect to see them)

  Locale-specific files:
  Benchmark 1: target/release/coreutils.ref join test_data/locale_file1.txt test_data/locale_file2.txt
    Time (mean ± σ):       3.8 ms ±   0.9 ms    [User: 1.6 ms, System: 2.1 ms]
    Range (min … max):     1.2 ms …   6.1 ms    688 runs

  Benchmark 2: target/release/coreutils join test_data/locale_file1.txt test_data/locale_file2.txt
    Time (mean ± σ):       4.7 ms ±   0.9 ms    [User: 1.9 ms, System: 2.7 ms]
    Range (min … max):     2.7 ms …   8.9 ms    593 runs

  Benchmark 3: /usr/bin/join test_data/locale_file1.txt test_data/locale_file2.txt
    Time (mean ± σ):       1.9 ms ±   0.7 ms    [User: 1.2 ms, System: 0.6 ms]
    Range (min … max):     0.2 ms …   5.5 ms    1147 runs

  Summary
    /usr/bin/join test_data/locale_file1.txt test_data/locale_file2.txt ran
      1.96 ± 0.85 times faster than target/release/coreutils.ref join test_data/locale_file1.txt test_data/locale_file2.txt
      2.41 ± 1.01 times faster than target/release/coreutils join test_data/locale_file1.txt test_data/locale_file2.txt
  Small files (10K lines):
  Benchmark 1: target/release/coreutils.ref join test_data/file1.txt test_data/file2.txt
    Time (mean ± σ):       7.0 ms ±   2.0 ms    [User: 4.7 ms, System: 2.3 ms]
    Range (min … max):     2.6 ms …  14.4 ms    344 runs

  Benchmark 2: target/release/coreutils join test_data/file1.txt test_data/file2.txt
    Time (mean ± σ):      12.6 ms ±   4.0 ms    [User: 10.4 ms, System: 2.1 ms]
    Range (min … max):     6.5 ms …  29.5 ms    198 runs

  Benchmark 3: /usr/bin/join test_data/file1.txt test_data/file2.txt
    Time (mean ± σ):       6.6 ms ±   2.4 ms    [User: 5.8 ms, System: 0.8 ms]
    Range (min … max):     2.9 ms …  15.3 ms    366 runs

  Summary
    /usr/bin/join test_data/file1.txt test_data/file2.txt ran
      1.05 ± 0.48 times faster than target/release/coreutils.ref join test_data/file1.txt test_data/file2.txt
      1.89 ± 0.90 times faster than target/release/coreutils join test_data/file1.txt test_data/file2.txt
  Large files (100K lines):
  Benchmark 1: target/release/coreutils.ref join test_data/large_file1.txt test_data/large_file2.txt
    Time (mean ± σ):      21.5 ms ±   4.1 ms    [User: 19.8 ms, System: 1.7 ms]
    Range (min … max):    16.8 ms …  34.6 ms    112 runs

  Benchmark 2: target/release/coreutils join test_data/large_file1.txt test_data/large_file2.txt
    Time (mean ± σ):      60.5 ms ±   6.2 ms    [User: 58.3 ms, System: 2.2 ms]
    Range (min … max):    54.5 ms …  77.6 ms    53 runs

  Benchmark 3: /usr/bin/join test_data/large_file1.txt test_data/large_file2.txt
    Time (mean ± σ):      42.8 ms ±  13.8 ms    [User: 41.3 ms, System: 1.5 ms]
    Range (min … max):    30.5 ms …  90.7 ms    78 runs

  Summary
    target/release/coreutils.ref join test_data/large_file1.txt test_data/large_file2.txt ran
      2.00 ± 0.75 times faster than /usr/bin/join test_data/large_file1.txt test_data/large_file2.txt
      2.82 ± 0.61 times faster than target/release/coreutils join test_data/large_file1.txt test_data/large_file2.txt

the performance regressed a bit too much :/

@codspeed-hq

codspeed-hq bot commented Jan 19, 2026

CodSpeed Performance Report

Merging this PR will degrade performance by 4.2%

Comparing WaterWhisperer:join-locale-collation (81e2f00) with main (2fb45c1)

Summary

⚡ 2 improved benchmarks
❌ 2 regressed benchmarks
✅ 278 untouched benchmarks
⏩ 38 skipped benchmarks [1]

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Memory | sort_key_field[500000] | 32.8 MB | 31.4 MB | +4.65% |
| Memory | sort_numeric[500000] | 48.7 MB | 47.3 MB | +3.07% |
| Memory | du_all_wide_tree[(5000, 500)] | 1.3 MB | 1.4 MB | -3.41% |
| Memory | du_wide_tree[(5000, 500)] | 1.2 MB | 1.3 MB | -4.2% |

Footnotes

  1. 38 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@sylvestre sylvestre merged commit 98d3dba into uutils:main Jan 20, 2026
154 of 157 checks passed
@WaterWhisperer WaterWhisperer deleted the join-locale-collation branch January 20, 2026 14:49
mattsu2020 pushed a commit to mattsu2020/coreutils that referenced this pull request Jan 23, 2026
Co-authored-by: Sylvestre Ledru <sylvestre@debian.org>
Development

Successfully merging this pull request may close these issues.

join: locale collation should be considered

2 participants