Refactor hash join with multiset #18021

Merged
rapids-bot[bot] merged 26 commits into rapidsai:branch-25.08 from PointKernel:refactor-hash-join
Jul 9, 2025

PointKernel commented Feb 15, 2025

Description

Part of #12261

This PR refactors hash join to use cuco::static_multiset in place of cuco::static_multimap, eliminating the awkward use of cuco pair_* APIs and improving overall readability.
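
As a rough illustration of the build and probe phases that such a hash join relies on, here is a minimal CPU-side sketch in Python. It is not the libcudf or cuco implementation: `hash_inner_join` is a hypothetical name, and a dictionary of lists stands in for a multiset that keeps every duplicate row index.

```python
from collections import defaultdict


def hash_inner_join(left_keys, right_keys):
    """Return (left_index, right_index) pairs whose keys compare equal."""
    # Build phase: insert every right-row index into a multiset-like container.
    # Rows are bucketed by the hash of their key, and duplicates are all kept.
    buckets = defaultdict(list)
    for right_idx, key in enumerate(right_keys):
        buckets[hash(key)].append(right_idx)

    # Probe phase: look up each left row and emit one pair per matching row.
    matches = []
    for left_idx, key in enumerate(left_keys):
        for right_idx in buckets.get(hash(key), ()):
            # Equal hashes do not imply equal keys, so compare the keys too.
            if right_keys[right_idx] == key:
                matches.append((left_idx, right_idx))
    return matches


pairs = hash_inner_join([1, 2, 2, 5], [2, 2, 3, 1])
print(pairs)
```

Only row indices are stored here and the keys are looked up through them at probe time, which is one way to avoid carrying explicit key/value pairs through the container.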

In terms of performance, it introduces up to a 20% slowdown for small datasets (e.g., ~1000 elements), but the impact diminishes as data size increases. While we should be mindful of this regression, there is no need for alarm: the performance difference becomes negligible for larger workloads.

The slowdown has been traced to an increase in branching instructions (~15%), and addressing these extra branches on the cuco side should resolve the issue.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

PointKernel added the 2 - In Progress, libcudf, non-breaking, and cuco labels Feb 15, 2025
PointKernel self-assigned this Feb 15, 2025
PointKernel added the improvement label Feb 15, 2025
PointKernel added a commit to NVIDIA/cuCollections that referenced this pull request Feb 27, 2025
…ance (#681)

This PR enhances the CG-based device insertion by introducing an
additional `SupportsErase` build-time check. When erasure is not
required, the new implementation leverages this flag to select a more
efficient code path, ensuring comparisons are made against an empty
sentinel without loading the target slot's content into the CAS
operation. This optimization gets rid of excessive local memory
transactions, resulting in a 10~30% improvement in multimap performance.
This PR also updates the multimap insert and count benchmarks to run
with the new implementations.

Unblocking rapidsai/cudf#18021
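
The commit message above describes choosing an insertion path depending on whether erasure is needed. The following is a single-threaded Python sketch of that idea, not the cuCollections implementation: `EMPTY`, `ERASED`, `compare_and_swap`, and `try_insert` are hypothetical names, and the runtime `supports_erase` argument stands in for the build-time `SupportsErase` check.

```python
EMPTY = -1   # hypothetical sentinel marking a never-used slot
ERASED = -2  # hypothetical sentinel marking a slot whose key was erased


def compare_and_swap(slots, index, expected, desired):
    """Single-threaded stand-in for an atomic CAS; returns the old value."""
    old = slots[index]
    if old == expected:
        slots[index] = desired
    return old


def try_insert(slots, index, key, supports_erase):
    """Try to place `key` in `slots[index]`; return True on success."""
    if supports_erase:
        # A slot is reusable when it is empty or erased, so the slot is
        # loaded first and the observed value becomes the CAS expectation.
        observed = slots[index]
        if observed not in (EMPTY, ERASED):
            return False
        return compare_and_swap(slots, index, observed, key) == observed
    # Without erasure the only reusable state is EMPTY, so the CAS can
    # compare against the constant sentinel and skip the separate load.
    return compare_and_swap(slots, index, EMPTY, key) == EMPTY


slots = [EMPTY, ERASED, 7]
print(try_insert(slots, 0, 5, supports_erase=False), slots)
print(try_insert(slots, 1, 9, supports_erase=True), slots)
```

The point of the second branch is only that the expected value is known in advance; how much this saves on a GPU depends on details the commit message does not spell out.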
@PointKernel
Member Author

With NVIDIA/cuCollections#694, we can get about the same performance with the new cuco data structure.

# inner_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|-------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  78.456 us |      11.69% |  80.592 us |      12.34% |    2.136 us |   2.72% |   SAME   |
|  I32  |     0      |   100000    |     1000     | 107.511 us |       2.74% | 113.968 us |       2.25% |    6.457 us |   6.01% |   SLOW   |
|  I32  |     0      |  10000000   |     1000     |   3.945 ms |       0.87% |   3.878 ms |       1.31% |  -66.990 us |  -1.70% |   FAST   |
|  I32  |     0      |   100000    |    100000    | 127.654 us |       3.95% | 123.229 us |       1.64% |   -4.425 us |  -3.47% |   FAST   |
|  I32  |     0      |  10000000   |    100000    |   4.387 ms |       0.57% |   4.211 ms |       0.74% | -176.781 us |  -4.03% |   FAST   |
|  I32  |     0      |  10000000   |   10000000   |  17.019 ms |       0.37% |  16.814 ms |       0.05% | -205.263 us |  -1.21% |   FAST   |
|  I32  |     1      |    1000     |     1000     |  89.228 us |       4.57% |  88.800 us |       3.53% |   -0.427 us |  -0.48% |   SAME   |
|  I32  |     1      |   100000    |     1000     | 104.325 us |       2.21% | 104.431 us |       3.82% |    0.106 us |   0.10% |   SAME   |
|  I32  |     1      |  10000000   |     1000     |   1.970 ms |       0.26% |   1.737 ms |       0.63% | -232.933 us | -11.82% |   FAST   |
|  I32  |     1      |   100000    |    100000    | 112.281 us |       2.61% | 114.439 us |       2.99% |    2.159 us |   1.92% |   SAME   |
|  I32  |     1      |  10000000   |    100000    |   2.076 ms |       0.13% |   1.848 ms |       0.61% | -228.025 us | -10.98% |   FAST   |
|  I32  |     1      |  10000000   |   10000000   |   4.686 ms |       0.14% |   4.386 ms |       0.24% | -299.448 us |  -6.39% |   FAST   |
|  I64  |     0      |    1000     |     1000     |  74.534 us |       3.93% |  78.344 us |       5.10% |    3.811 us |   5.11% |   SLOW   |
|  I64  |     0      |   100000    |     1000     | 117.017 us |       3.23% | 114.288 us |       2.29% |   -2.730 us |  -2.33% |   FAST   |
|  I64  |     0      |  10000000   |     1000     |   4.140 ms |       0.87% |   4.083 ms |       1.01% |  -57.024 us |  -1.38% |   FAST   |
|  I64  |     0      |   100000    |    100000    | 130.924 us |      10.18% | 126.372 us |       4.12% |   -4.551 us |  -3.48% |   SAME   |
|  I64  |     0      |  10000000   |    100000    |   4.544 ms |       0.59% |   4.437 ms |       0.79% | -107.459 us |  -2.36% |   FAST   |
|  I64  |     0      |  10000000   |   10000000   |  17.132 ms |       0.08% |  16.946 ms |       0.07% | -186.493 us |  -1.09% |   FAST   |
|  I64  |     1      |    1000     |     1000     |  90.450 us |       2.62% |  87.540 us |       5.42% |   -2.910 us |  -3.22% |   FAST   |
|  I64  |     1      |   100000    |     1000     | 106.508 us |       7.31% | 105.284 us |       5.58% |   -1.224 us |  -1.15% |   SAME   |
|  I64  |     1      |  10000000   |     1000     |   2.040 ms |       0.72% |   1.800 ms |       0.71% | -239.451 us | -11.74% |   FAST   |
|  I64  |     1      |   100000    |    100000    | 115.019 us |       2.89% | 111.492 us |       1.99% |   -3.527 us |  -3.07% |   FAST   |
|  I64  |     1      |  10000000   |    100000    |   2.179 ms |       0.42% |   1.945 ms |       0.85% | -233.751 us | -10.73% |   FAST   |
|  I64  |     1      |  10000000   |   10000000   |   4.757 ms |       0.24% |   4.453 ms |       0.26% | -304.215 us |  -6.40% |   FAST   |

# left_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|-------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  71.950 us |       3.31% |  76.459 us |       6.33% |    4.509 us |   6.27% |   SLOW   |
|  I32  |     0      |   100000    |     1000     | 114.625 us |       4.09% | 114.861 us |       3.04% |    0.236 us |   0.21% |   SAME   |
|  I32  |     0      |  10000000   |     1000     |   4.372 ms |       0.88% |   4.510 ms |       1.03% |  137.763 us |   3.15% |   SLOW   |
|  I32  |     0      |   100000    |    100000    | 131.739 us |       1.72% | 126.395 us |       3.63% |   -5.344 us |  -4.06% |   FAST   |
|  I32  |     0      |  10000000   |    100000    |   4.799 ms |       0.68% |   4.872 ms |       0.75% |   73.783 us |   1.54% |   SLOW   |
|  I32  |     0      |  10000000   |   10000000   |  17.339 ms |       0.07% |  17.231 ms |       0.07% | -107.189 us |  -0.62% |   FAST   |
|  I32  |     1      |    1000     |     1000     |  87.459 us |       3.06% |  88.084 us |       6.17% |    0.625 us |   0.71% |   SAME   |
|  I32  |     1      |   100000    |     1000     | 105.215 us |       1.96% | 104.527 us |       3.78% |   -0.689 us |  -0.65% |   SAME   |
|  I32  |     1      |  10000000   |     1000     |   2.219 ms |       0.70% |   2.097 ms |       0.78% | -122.460 us |  -5.52% |   FAST   |
|  I32  |     1      |   100000    |    100000    | 112.265 us |       1.65% | 112.164 us |       2.49% |   -0.101 us |  -0.09% |   SAME   |
|  I32  |     1      |  10000000   |    100000    |   2.351 ms |       0.66% |   2.240 ms |       0.74% | -110.418 us |  -4.70% |   FAST   |
|  I32  |     1      |  10000000   |   10000000   |   5.002 ms |       0.21% |   4.826 ms |       0.34% | -175.772 us |  -3.51% |   FAST   |
|  I64  |     0      |    1000     |     1000     |  76.228 us |       6.67% |  79.417 us |       3.10% |    3.189 us |   4.18% |   SLOW   |
|  I64  |     0      |   100000    |     1000     | 122.671 us |       1.51% | 114.218 us |       2.25% |   -8.453 us |  -6.89% |   FAST   |
|  I64  |     0      |  10000000   |     1000     |   4.600 ms |       0.79% |   4.694 ms |       1.08% |   94.411 us |   2.05% |   SLOW   |
|  I64  |     0      |   100000    |    100000    | 132.822 us |       1.86% | 134.545 us |       4.39% |    1.723 us |   1.30% |   SAME   |
|  I64  |     0      |  10000000   |    100000    |   5.008 ms |       0.69% |   5.082 ms |       0.70% |   74.154 us |   1.48% |   SLOW   |
|  I64  |     0      |  10000000   |   10000000   |  17.449 ms |       0.08% |  17.354 ms |       0.10% |  -94.794 us |  -0.54% |   FAST   |
|  I64  |     1      |    1000     |     1000     |  87.880 us |       3.04% |  87.532 us |       2.61% |   -0.348 us |  -0.40% |   SAME   |
|  I64  |     1      |   100000    |     1000     | 106.245 us |      10.02% | 105.442 us |       2.46% |   -0.803 us |  -0.76% |   SAME   |
|  I64  |     1      |  10000000   |     1000     |   2.288 ms |       0.72% |   2.155 ms |       0.81% | -132.818 us |  -5.81% |   FAST   |
|  I64  |     1      |   100000    |    100000    | 113.596 us |       2.52% | 115.153 us |       2.38% |    1.557 us |   1.37% |   SAME   |
|  I64  |     1      |  10000000   |    100000    |   2.477 ms |       0.70% |   2.356 ms |       0.79% | -120.468 us |  -4.86% |   FAST   |
|  I64  |     1      |  10000000   |   10000000   |   5.075 ms |       0.23% |   4.901 ms |       0.32% | -174.314 us |  -3.43% |   FAST   |

# full_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|-------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 127.564 us |       9.98% | 133.040 us |       4.42% |    5.476 us |   4.29% |   SAME   |
|  I32  |     0      |   100000    |     1000     | 151.233 us |       6.99% | 147.676 us |       2.58% |   -3.557 us |  -2.35% |   SAME   |
|  I32  |     0      |  10000000   |     1000     |   4.761 ms |       0.88% |   4.899 ms |       0.99% |  137.435 us |   2.89% |   SLOW   |
|  I32  |     0      |   100000    |    100000    | 191.089 us |       2.14% | 185.174 us |       2.33% |   -5.915 us |  -3.10% |   FAST   |
|  I32  |     0      |  10000000   |    100000    |   5.083 ms |       0.67% |   5.131 ms |       0.71% |   47.539 us |   0.94% |   SLOW   |
|  I32  |     0      |  10000000   |   10000000   |  19.172 ms |       0.37% |  19.054 ms |       0.06% | -118.309 us |  -0.62% |   FAST   |
|  I32  |     1      |    1000     |     1000     | 143.036 us |       2.44% | 146.865 us |       2.57% |    3.829 us |   2.68% |   SLOW   |
|  I32  |     1      |   100000    |     1000     | 161.708 us |       2.31% | 166.360 us |       3.12% |    4.652 us |   2.88% |   SLOW   |
|  I32  |     1      |  10000000   |     1000     |   2.631 ms |       0.50% |   2.502 ms |       0.73% | -129.342 us |  -4.92% |   FAST   |
|  I32  |     1      |   100000    |    100000    | 172.457 us |       1.87% | 177.505 us |       1.71% |    5.048 us |   2.93% |   SLOW   |
|  I32  |     1      |  10000000   |    100000    |   2.766 ms |       0.50% |   2.647 ms |       0.60% | -119.402 us |  -4.32% |   FAST   |
|  I32  |     1      |  10000000   |   10000000   |   5.984 ms |       0.17% |   5.798 ms |       0.20% | -186.533 us |  -3.12% |   FAST   |
|  I64  |     0      |    1000     |     1000     | 129.754 us |       3.55% | 132.887 us |       2.01% |    3.134 us |   2.41% |   SLOW   |
|  I64  |     0      |   100000    |     1000     | 154.894 us |       1.52% | 145.470 us |       1.98% |   -9.424 us |  -6.08% |   FAST   |
|  I64  |     0      |  10000000   |     1000     |   4.933 ms |       1.04% |   5.018 ms |       0.95% |   85.311 us |   1.73% |   SLOW   |
|  I64  |     0      |   100000    |    100000    | 192.005 us |       1.44% | 195.108 us |       3.05% |    3.103 us |   1.62% |   SLOW   |
|  I64  |     0      |  10000000   |    100000    |   5.229 ms |       0.76% |   5.281 ms |       0.76% |   52.703 us |   1.01% |   SLOW   |
|  I64  |     0      |  10000000   |   10000000   |  19.239 ms |       0.09% |  19.127 ms |       0.10% | -112.068 us |  -0.58% |   FAST   |
|  I64  |     1      |    1000     |     1000     | 142.785 us |       3.40% | 138.843 us |       3.23% |   -3.942 us |  -2.76% |   SAME   |
|  I64  |     1      |   100000    |     1000     | 160.193 us |       1.64% | 160.043 us |       2.91% |   -0.150 us |  -0.09% |   SAME   |
|  I64  |     1      |  10000000   |     1000     |   2.682 ms |       0.63% |   2.531 ms |       0.59% | -151.217 us |  -5.64% |   FAST   |
|  I64  |     1      |   100000    |    100000    | 177.095 us |       3.22% | 171.989 us |       5.19% |   -5.107 us |  -2.88% |   SAME   |
|  I64  |     1      |  10000000   |    100000    |   2.863 ms |       0.50% |   2.735 ms |       0.59% | -127.161 us |  -4.44% |   FAST   |
|  I64  |     1      |  10000000   |   10000000   |   6.038 ms |       0.25% |   5.861 ms |       0.35% | -177.526 us |  -2.94% |   FAST   |
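
For readers checking the Diff, %Diff, and Status columns, the rows above appear consistent with the rule sketched below. This is an assumption inferred from the table values, not a statement of how the comparison tool is implemented; `compare_timings` is a hypothetical helper.

```python
def compare_timings(ref_time, cmp_time, ref_noise, cmp_noise):
    """Return (diff, percent_diff, status) for one benchmark row.

    Both times use the same unit; noise values are fractions,
    so 0.0225 means 2.25%.
    """
    diff = cmp_time - ref_time
    frac_diff = diff / ref_time
    # Assumed rule: a change no larger than the smaller noise level is SAME.
    if abs(frac_diff) <= min(ref_noise, cmp_noise):
        status = "SAME"
    elif diff < 0:
        status = "FAST"
    else:
        status = "SLOW"
    return diff, 100.0 * frac_diff, status


# Three inner_join rows from the tables above (times in microseconds).
rows = [
    (78.456, 80.592, 0.1169, 0.1234),
    (107.511, 113.968, 0.0274, 0.0225),
    (130.924, 126.372, 0.1018, 0.0412),
]
for ref_time, cmp_time, ref_noise, cmp_noise in rows:
    diff, pct, status = compare_timings(ref_time, cmp_time, ref_noise, cmp_noise)
    print(f"{diff:8.3f} us  {pct:6.2f}%  {status}")
```

Under this reading, a row such as the I64 100000 x 100000 inner join can show a -3.48% difference and still be reported as SAME, because the difference is smaller than the lower of the two noise figures.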

github-actions bot added the Python, CMake, Java, cudf.pandas, cudf-polars, and pylibcudf labels Apr 21, 2025
@PointKernel
Member Author

/ok to test 3d6b0a9

@PointKernel
Member Author

/ok to test cd19e35

@PointKernel
Member Author

/ok to test fcfee9f

github-actions bot added the Python label Jun 30, 2025
@PointKernel
Member Author

/ok to test 2d30df7

PointKernel added the 3 - Ready for Review label and removed the 2 - In Progress label Jun 30, 2025
PointKernel marked this pull request as ready for review June 30, 2025 23:18
PointKernel requested review from a team as code owners June 30, 2025 23:18
@PointKernel
Member Author

['hash-join-old.json', 'hash-join-final.json']
# inner_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  75.333 us |      10.33% |  78.362 us |      11.93% |   3.029 us |   4.02% |   SAME   |
|  I32  |     0      |   100000    |     1000     |  96.842 us |       2.41% | 100.821 us |       6.85% |   3.979 us |   4.11% |   SLOW   |
|  I32  |     0      |  10000000   |     1000     |   2.960 ms |       1.34% |   3.497 ms |       1.05% | 536.590 us |  18.13% |   SLOW   |
|  I32  |     0      |   100000    |    100000    | 108.570 us |       3.97% | 114.756 us |       5.58% |   6.186 us |   5.70% |   SLOW   |
|  I32  |     0      |  10000000   |    100000    |   3.320 ms |       0.79% |   3.665 ms |       0.78% | 344.707 us |  10.38% |   SLOW   |
|  I32  |     0      |  10000000   |   10000000   |  16.600 ms |       0.17% |  17.356 ms |       0.08% | 756.481 us |   4.56% |   SLOW   |
|  I32  |     1      |    1000     |     1000     |  89.266 us |       7.59% | 102.355 us |       2.77% |  13.089 us |  14.66% |   SLOW   |
|  I32  |     1      |   100000    |     1000     |  94.785 us |       5.96% | 107.420 us |       4.99% |  12.635 us |  13.33% |   SLOW   |
|  I32  |     1      |  10000000   |     1000     |   1.508 ms |       1.05% |   1.600 ms |       1.08% |  92.336 us |   6.12% |   SLOW   |
|  I32  |     1      |   100000    |    100000    | 101.215 us |       4.74% | 121.739 us |       3.66% |  20.525 us |  20.28% |   SLOW   |
|  I32  |     1      |  10000000   |    100000    |   1.604 ms |       0.82% |   1.718 ms |       0.99% | 114.054 us |   7.11% |   SLOW   |
|  I32  |     1      |  10000000   |   10000000   |   4.188 ms |       0.50% |   4.252 ms |       0.45% |  63.316 us |   1.51% |   SLOW   |
|  I64  |     0      |    1000     |     1000     |  72.684 us |       4.49% |  75.785 us |       5.80% |   3.101 us |   4.27% |   SAME   |
|  I64  |     0      |   100000    |     1000     |  98.183 us |       3.54% |  99.667 us |       3.44% |   1.484 us |   1.51% |   SAME   |
|  I64  |     0      |  10000000   |     1000     |   3.240 ms |       1.09% |   3.496 ms |       1.03% | 256.005 us |   7.90% |   SLOW   |
|  I64  |     0      |   100000    |    100000    | 109.402 us |       3.96% | 115.474 us |       4.42% |   6.072 us |   5.55% |   SLOW   |
|  I64  |     0      |  10000000   |    100000    |   3.474 ms |       0.78% |   3.821 ms |       0.83% | 347.034 us |   9.99% |   SLOW   |
|  I64  |     0      |  10000000   |   10000000   |  16.732 ms |       0.43% |  17.505 ms |       0.13% | 772.757 us |   4.62% |   SLOW   |
|  I64  |     1      |    1000     |     1000     |  89.247 us |       5.25% | 100.834 us |       7.18% |  11.587 us |  12.98% |   SLOW   |
|  I64  |     1      |   100000    |     1000     |  94.459 us |       6.08% | 112.031 us |       4.69% |  17.572 us |  18.60% |   SLOW   |
|  I64  |     1      |  10000000   |     1000     |   1.584 ms |       1.10% |   1.668 ms |       0.84% |  84.316 us |   5.32% |   SLOW   |
|  I64  |     1      |   100000    |    100000    | 103.774 us |       4.54% | 123.716 us |       3.45% |  19.943 us |  19.22% |   SLOW   |
|  I64  |     1      |  10000000   |    100000    |   1.697 ms |       0.89% |   1.791 ms |       0.98% |  93.272 us |   5.49% |   SLOW   |
|  I64  |     1      |  10000000   |   10000000   |   4.241 ms |       0.38% |   4.319 ms |       0.40% |  78.010 us |   1.84% |   SLOW   |

# left_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     0      |    1000     |     1000     |  72.134 us |       4.53% |  72.983 us |       5.24% |   0.848 us |   1.18% |   SAME   |
|  I32  |     0      |   100000    |     1000     |  97.509 us |       2.68% | 108.040 us |       3.20% |  10.531 us |  10.80% |   SLOW   |
|  I32  |     0      |  10000000   |     1000     |   3.290 ms |       1.11% |   3.943 ms |       1.00% | 653.243 us |  19.86% |   SLOW   |
|  I32  |     0      |   100000    |    100000    | 115.427 us |       2.79% | 121.910 us |       5.39% |   6.483 us |   5.62% |   SLOW   |
|  I32  |     0      |  10000000   |    100000    |   3.669 ms |       0.83% |   4.117 ms |       0.96% | 447.442 us |  12.19% |   SLOW   |
|  I32  |     0      |  10000000   |   10000000   |  16.909 ms |       0.18% |  17.591 ms |       0.09% | 681.575 us |   4.03% |   SLOW   |
|  I32  |     1      |    1000     |     1000     |  88.251 us |       3.26% | 101.621 us |       3.30% |  13.370 us |  15.15% |   SLOW   |
|  I32  |     1      |   100000    |     1000     |  99.235 us |       5.51% | 115.453 us |       4.35% |  16.218 us |  16.34% |   SLOW   |
|  I32  |     1      |  10000000   |     1000     |   1.741 ms |       1.02% |   1.855 ms |       0.96% | 113.350 us |   6.51% |   SLOW   |
|  I32  |     1      |   100000    |    100000    | 108.730 us |       7.87% | 121.619 us |       3.20% |  12.890 us |  11.85% |   SLOW   |
|  I32  |     1      |  10000000   |    100000    |   1.854 ms |       0.93% |   1.986 ms |       0.93% | 131.590 us |   7.10% |   SLOW   |
|  I32  |     1      |  10000000   |   10000000   |   4.483 ms |       0.40% |   4.553 ms |       0.37% |  69.817 us |   1.56% |   SLOW   |
|  I64  |     0      |    1000     |     1000     |  72.796 us |       4.17% |  75.477 us |       9.25% |   2.680 us |   3.68% |   SAME   |
|  I64  |     0      |   100000    |     1000     | 106.061 us |       6.68% | 107.234 us |       5.86% |   1.173 us |   1.11% |   SAME   |
|  I64  |     0      |  10000000   |     1000     |   3.613 ms |       1.09% |   3.938 ms |       1.16% | 324.991 us |   9.00% |   SLOW   |
|  I64  |     0      |   100000    |    100000    | 116.774 us |       4.06% | 122.621 us |       4.36% |   5.848 us |   5.01% |   SLOW   |
|  I64  |     0      |  10000000   |    100000    |   3.838 ms |       0.85% |   4.259 ms |       0.90% | 420.755 us |  10.96% |   SLOW   |
|  I64  |     0      |  10000000   |   10000000   |  16.999 ms |       0.18% |  17.691 ms |       0.09% | 692.378 us |   4.07% |   SLOW   |
|  I64  |     1      |    1000     |     1000     |  89.852 us |       5.02% | 100.094 us |       8.34% |  10.243 us |  11.40% |   SLOW   |
|  I64  |     1      |   100000    |     1000     | 101.441 us |       4.20% | 113.896 us |       4.26% |  12.456 us |  12.28% |   SLOW   |
|  I64  |     1      |  10000000   |     1000     |   1.828 ms |       1.02% |   1.931 ms |       0.98% | 103.235 us |   5.65% |   SLOW   |
|  I64  |     1      |   100000    |    100000    | 107.553 us |       4.07% | 125.033 us |       3.70% |  17.480 us |  16.25% |   SLOW   |
|  I64  |     1      |  10000000   |    100000    |   1.961 ms |       0.89% |   2.076 ms |       0.97% | 114.771 us |   5.85% |   SLOW   |
|  I64  |     1      |  10000000   |   10000000   |   4.537 ms |       0.50% |   4.630 ms |       0.50% |  92.937 us |   2.05% |   SLOW   |

# full_join

## [0] Quadro RTX 8000

|  Key  |  Nullable  |  left_size  |  right_size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|-------|------------|-------------|--------------|------------|-------------|------------|-------------|------------|---------|----------|
|  I32  |     0      |    1000     |     1000     | 125.959 us |       3.30% | 128.893 us |       3.90% |   2.935 us |   2.33% |   SAME   |
|  I32  |     0      |   100000    |     1000     | 131.026 us |       2.27% | 141.799 us |       3.13% |  10.773 us |   8.22% |   SLOW   |
|  I32  |     0      |  10000000   |     1000     |   3.626 ms |       1.06% |   4.356 ms |       1.10% | 730.118 us |  20.14% |   SLOW   |
|  I32  |     0      |   100000    |    100000    | 174.369 us |       2.95% | 184.768 us |       4.70% |  10.399 us |   5.96% |   SLOW   |
|  I32  |     0      |  10000000   |    100000    |   3.892 ms |       0.94% |   4.375 ms |       1.01% | 483.145 us |  12.41% |   SLOW   |
|  I32  |     0      |  10000000   |   10000000   |  18.721 ms |       0.35% |  19.386 ms |       0.14% | 665.449 us |   3.55% |   SLOW   |
|  I32  |     1      |    1000     |     1000     | 144.531 us |       5.26% | 155.570 us |       3.31% |  11.040 us |   7.64% |   SLOW   |
|  I32  |     1      |   100000    |     1000     | 156.871 us |       6.34% | 175.824 us |       5.25% |  18.952 us |  12.08% |   SLOW   |
|  I32  |     1      |  10000000   |     1000     |   2.124 ms |       0.76% |   2.255 ms |       1.11% | 131.106 us |   6.17% |   SLOW   |
|  I32  |     1      |   100000    |    100000    | 171.653 us |       4.96% | 188.635 us |       8.25% |  16.982 us |   9.89% |   SLOW   |
|  I32  |     1      |  10000000   |    100000    |   2.251 ms |       1.20% |   2.387 ms |       0.87% | 136.220 us |   6.05% |   SLOW   |
|  I32  |     1      |  10000000   |   10000000   |   5.447 ms |       0.50% |   5.525 ms |       0.35% |  77.629 us |   1.43% |   SLOW   |
|  I64  |     0      |    1000     |     1000     | 127.972 us |       3.50% | 130.212 us |       6.03% |   2.239 us |   1.75% |   SAME   |
|  I64  |     0      |   100000    |     1000     | 138.442 us |       2.92% | 138.842 us |       3.27% |   0.400 us |   0.29% |   SAME   |
|  I64  |     0      |  10000000   |     1000     |   3.931 ms |       0.96% |   4.254 ms |       1.03% | 323.053 us |   8.22% |   SLOW   |
|  I64  |     0      |   100000    |    100000    | 175.311 us |       2.22% | 184.183 us |       3.33% |   8.872 us |   5.06% |   SLOW   |
|  I64  |     0      |  10000000   |    100000    |   4.025 ms |       0.80% |   4.483 ms |       0.85% | 458.109 us |  11.38% |   SLOW   |
|  I64  |     0      |  10000000   |   10000000   |  18.772 ms |       0.15% |  19.486 ms |       0.15% | 714.129 us |   3.80% |   SLOW   |
|  I64  |     1      |    1000     |     1000     | 145.035 us |       3.96% | 160.302 us |       8.85% |  15.267 us |  10.53% |   SLOW   |
|  I64  |     1      |   100000    |     1000     | 157.814 us |       7.78% | 167.754 us |       3.87% |   9.940 us |   6.30% |   SLOW   |
|  I64  |     1      |  10000000   |     1000     |   2.191 ms |       0.75% |   2.296 ms |       0.81% | 105.106 us |   4.80% |   SLOW   |
|  I64  |     1      |   100000    |    100000    | 169.990 us |       6.53% | 186.843 us |       4.04% |  16.853 us |   9.91% |   SLOW   |
|  I64  |     1      |  10000000   |    100000    |   2.329 ms |       0.79% |   2.448 ms |       0.91% | 118.875 us |   5.10% |   SLOW   |
|  I64  |     1      |  10000000   |   10000000   |   5.488 ms |       0.28% |   5.581 ms |       0.50% |  93.317 us |   1.70% |   SLOW   |

Performance on a Quadro RTX 8000, for reference.

Contributor

mroeschke left a comment

Python changes LGTM

@PointKernel
Member Author

/merge

rapids-bot bot merged commit 7c58187 into rapidsai:branch-25.08 Jul 9, 2025
163 of 174 checks passed
github-project-automation bot moved this from In Progress to Done in cuDF Python Jul 9, 2025
PointKernel deleted the refactor-hash-join branch July 9, 2025 17:40
Labels

3 - Ready for Review, cuco, improvement, libcudf, non-breaking, Python
