Occupancy improvement for Hash table build by tgujar · Pull Request #15700 · rapidsai/cudf

tgujar · 2024-05-08T01:43:42Z

Description

Implements specialized template dispatch for hash joins and mixed semi joins to fix the issue described in #15502.

At a high level, this PR aliases some types to void, depending on the column types in the rows, to avoid the high register usage of the comparator and hasher operations for more involved types (lists, structs, strings, ...). This is done by dynamic dispatch on the CPU side using std::variant and std::visit, which then dispatches to a specialized template.

This pattern can later be extended to other joins and also to the groupby operation. Any operator using the row hasher and row comparator should see an improvement in occupancy for the hash table build/probe operation.
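As a rough illustration of the host-side dispatch described above, here is a minimal host-only sketch. Every name in it (`type_id` as a local enum, the three tag structs, `classify`, `launch_build`, `build_hash_table`) is hypothetical and is not taken from the PR; the sketch only shows how std::variant and std::visit can select one template instantiation at run time so that the heavier code paths are never compiled into the lighter instantiation.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-in for cudf::type_id; only the categories matter here.
enum class type_id { INT32, FLOAT64, STRING, LIST, STRUCT };

// Tags chosen at run time on the host; each selects one template instantiation.
struct primitive_only {};  // every column is a fixed-width type
struct with_strings {};    // strings present, no nested types
struct with_nested {};     // lists or structs present

using dispatch_tag = std::variant<primitive_only, with_strings, with_nested>;

// Inspect the column types once on the CPU and pick the narrowest tag.
inline dispatch_tag classify(std::vector<type_id> const& column_types)
{
  bool has_string = false;
  for (auto const t : column_types) {
    if (t == type_id::LIST || t == type_id::STRUCT) { return with_nested{}; }
    if (t == type_id::STRING) { has_string = true; }
  }
  if (has_string) { return with_strings{}; }
  return primitive_only{};
}

// Stand-in for a templated kernel launch. In a real kernel, the branches that
// are discarded here would be the comparator/hasher paths that cost registers.
template <typename Tag>
std::string launch_build()
{
  if constexpr (std::is_same_v<Tag, primitive_only>) {
    return "primitive";
  } else if constexpr (std::is_same_v<Tag, with_strings>) {
    return "strings";
  } else {
    return "nested";
  }
}

// std::visit turns the run-time tag back into a compile-time type.
inline std::string build_hash_table(std::vector<type_id> const& column_types)
{
  return std::visit([](auto tag) { return launch_build<decltype(tag)>(); },
                    classify(column_types));
}

// Usage example: call from main() to exercise each dispatch path.
inline void dispatch_example()
{
  assert(build_hash_table({type_id::INT32, type_id::FLOAT64}) == "primitive");
  assert(build_hash_table({type_id::INT32, type_id::STRING}) == "strings");
}
```

The trade-off is that each tag adds one more instantiation to compile, which is relevant to the build time concerns raised later in this thread.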

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-05-08T01:43:47Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

tgujar · 2024-05-08T01:55:21Z

I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code duplication. Currently, I have the conditional dispatch working for device_row_hasher, but I am unsure if there is a better way to implement this. We could introduce a macro here to generate the code. What do you think?

PointKernel · 2024-05-08T19:04:21Z

/ok to test

PointKernel · 2024-05-14T19:45:57Z

/ok to test

PointKernel · 2024-05-14T19:49:36Z

@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems that this improves performance for mixed joins, but performance drops significantly in other cases that use the row hasher.

ttnghia · 2024-05-14T20:23:44Z

cpp/src/join/mixed_join_common_utils.cuh

+                                                          id_to_type<type_id::DECIMAL128>,
+                                                          id_to_type<type_id::DECIMAL64>,
+                                                          id_to_type<type_id::DECIMAL32>,


I don't think decimal types are complex types. They are just a wrapper around some integer type.

Equality operator for Decimal will perform scaling which uses exponentiation.

cudf/cpp/include/cudf/fixed_point/fixed_point.hpp

Line 735 in 888e9d5

CUDF_HOST_DEVICE inline bool operator==(fixed_point<Rep1, Rad1> const& lhs,

I see a reduction in register usage if I comment out decimal types in #15502. I think we can still decide later which types to exclude in the branches.

Let me know if we could resolve this. I have addressed this here #15700 (comment)
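To make the point about decimal equality concrete, here is a simplified sketch of why comparing fixed-point values involves more than an integer compare. The `decimal64` struct and `ipow10` helper below are hypothetical and much simpler than the `fixed_point` class referenced above; the sketch assumes a base-10 representation `value * 10^scale` and ignores overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplified decimal: represents value * 10^scale.
struct decimal64 {
  std::int64_t value;
  std::int32_t scale;
};

// Integer power of ten for a non-negative exponent (overflow is not handled).
constexpr std::int64_t ipow10(std::int32_t exponent)
{
  std::int64_t result = 1;
  for (std::int32_t i = 0; i < exponent; ++i) { result *= 10; }
  return result;
}

// Equality first brings both operands to the smaller scale, which needs an
// exponentiation and a multiply whenever the scales differ.
constexpr bool operator==(decimal64 lhs, decimal64 rhs)
{
  if (lhs.scale == rhs.scale) { return lhs.value == rhs.value; }
  if (lhs.scale > rhs.scale) {
    return lhs.value * ipow10(lhs.scale - rhs.scale) == rhs.value;
  }
  return lhs.value == rhs.value * ipow10(rhs.scale - lhs.scale);
}

// Usage example: 1.5 stored with scale -1 equals 1.50 stored with scale -2.
inline void decimal_example()
{
  assert((decimal64{15, -1} == decimal64{150, -2}));
  assert(!(decimal64{15, -1} == decimal64{15, -2}));
}
```

The extra arithmetic in the mismatched-scale branches is the kind of code that a plain integer comparison does not carry.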

PointKernel · 2024-05-16T02:56:52Z

/ok to test

PointKernel · 2024-05-16T14:55:44Z

@tgujar Could you take a look at the failing tests?

PointKernel · 2024-05-17T17:57:22Z

/ok to test

PointKernel · 2024-05-21T16:02:15Z

/ok to test

cpp/include/cudf/table/experimental/row_operators.cuh

davidwendt · 2024-05-30T14:35:12Z

This PR needs to be rebased on branch-24.08.

tgujar · 2024-05-30T14:36:00Z

Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. I am investigating why the register pressure differs from what I see when commenting out the code paths.
The current plan is to avoid using a macro (as mentioned here) and instead do dynamic dispatch on the CPU side using std::variant and std::visit.

bdice

Comments attached. Thanks for this! There's a lot of heavy templating but it's fairly readable in spite of that.

I am also interested in build time comparisons to the previous code.

bdice · 2024-08-26T21:17:16Z

cpp/CMakeLists.txt

  src/join/mixed_join_kernel_nulls.cu
-  src/join/mixed_join_kernels_semi.cu
  src/join/mixed_join_semi.cu
+  src/join/mixed_join_kernels_semi.cu


Keep these filenames alphabetized. If you like, you could rename this to mixed_join_semi_kernels.cu.

bdice · 2024-08-26T21:19:36Z

cpp/include/cudf/detail/distinct_hash_join.cuh

  std::shared_ptr<cudf::experimental::row::equality::preprocessed_table>
-    _preprocessed_probe;        ///< input table preprocssed for row operators
-  hash_table_type _hash_table;  ///< hash table built on `_build`
+    _preprocessed_probe;                         ///< input table preprocssed for row operators


Suggested change

_preprocessed_probe; ///< input table preprocssed for row operators

_preprocessed_probe; ///< input table preprocessed for row operators

cpp/include/cudf/table/experimental/row_operators.cuh

bdice · 2024-08-26T21:22:03Z

cpp/include/cudf/table/experimental/row_operators.cuh

+struct dispatch_void_conditional_generator {
+  /// The underlying type
+  template <typename T>
+  using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;


Suggested change

using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;

using type = dispatch_void_conditional_t<std::disjunction_v<std::is_same<T, Types>...>, T>;
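For readers unfamiliar with the pattern in this suggestion, the sketch below shows a type-list membership test built from std::disjunction_v. The semantics of `dispatch_void_conditional_t` are not spelled out in this thread, so the sketch uses its own hypothetical names (`void_unless_t`, `keep_only`, `is_hidden`) and assumes "keep T when it is in the list, otherwise map it to void".

```cpp
#include <cassert>
#include <type_traits>

// Assumed semantics: keep T when Keep is true, otherwise map it to void.
template <bool Keep, typename T>
using void_unless_t = std::conditional_t<Keep, T, void>;

// Hypothetical generator: T survives only if it appears in Types...
template <typename... Types>
struct keep_only {
  template <typename T>
  using type = void_unless_t<std::disjunction_v<std::is_same<T, Types>...>, T>;
};

struct string_view_stub {};  // placeholder for a "heavy" element type

using primitives = keep_only<int, double>;

static_assert(std::is_same_v<primitives::type<int>, int>);
static_assert(std::is_void_v<primitives::type<string_view_stub>>);
// With an empty pack, std::disjunction_v<> is false, so every type maps to void.
static_assert(std::is_void_v<keep_only<>::type<int>>);
// The _v variable template is shorthand for the ::value member.
static_assert(std::disjunction<std::is_same<int, int>>::value ==
              std::disjunction_v<std::is_same<int, int>>);

// Helper for checking the mapping in ordinary code.
template <typename Generator, typename T>
constexpr bool is_hidden()
{
  return std::is_void_v<typename Generator::template type<T>>;
}

// Usage example: only the listed primitive types remain visible.
inline void generator_example()
{
  assert((is_hidden<primitives, string_view_stub>()));
  assert(!(is_hidden<primitives, double>()));
}
```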

bdice · 2024-08-26T21:28:21Z

cpp/include/cudf/table/experimental/row_operators.cuh

-  /// The type to dispatch to if the type is nested
-  using type = std::conditional_t<t == type_id::STRUCT or t == type_id::LIST, void, id_to_type<t>>;
+  /// The underlying type
+  using type = dispatch_void_if_nested_t<id_to_type<t>>;


Typically we define things the other way -- define the dispatch_void_if_nested struct, then define using dispatch_void_if_nested_t in terms of the ::type member of that struct.

Okay yep, but I think here I need dispatch_void_if_nested to be templated on cudf::type_id but I need dispatch_void_if_nested_t to be templated on some type T. Maybe they should be named differently?
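One possible way to reconcile the two points in this exchange is sketched below: define each trait struct first, define its `_t` alias in terms of the `::type` member, and give the type-keyed and type_id-keyed traits different names. All names here (`void_if_nested`, `id_to_void_if_nested`, the stub view types, and the local `type_id` and `id_to_type`) are hypothetical stand-ins, not the names used in the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical stand-ins for cudf::type_id and cudf::id_to_type.
enum class type_id { INT32, FLOAT64, LIST, STRUCT };
struct list_view_stub {};
struct struct_view_stub {};

template <type_id Id> struct id_to_type_impl;
template <> struct id_to_type_impl<type_id::INT32> { using type = std::int32_t; };
template <> struct id_to_type_impl<type_id::FLOAT64> { using type = double; };
template <> struct id_to_type_impl<type_id::LIST> { using type = list_view_stub; };
template <> struct id_to_type_impl<type_id::STRUCT> { using type = struct_view_stub; };
template <type_id Id>
using id_to_type = typename id_to_type_impl<Id>::type;

// Trait keyed on a type T: the struct is defined first...
template <typename T>
struct void_if_nested {
  using type = std::conditional_t<
    std::is_same_v<T, list_view_stub> || std::is_same_v<T, struct_view_stub>,
    void,
    T>;
};
// ...and the _t alias is defined in terms of its ::type member.
template <typename T>
using void_if_nested_t = typename void_if_nested<T>::type;

// Separately named trait keyed on type_id, so the two kinds of template
// parameter never share a name.
template <type_id Id>
struct id_to_void_if_nested {
  using type = void_if_nested_t<id_to_type<Id>>;
};
template <type_id Id>
using id_to_void_if_nested_t = typename id_to_void_if_nested<Id>::type;

// Usage example: nested ids map to void, primitive ids map to their type.
inline void trait_example()
{
  assert((std::is_void_v<id_to_void_if_nested_t<type_id::STRUCT>>));
  assert((std::is_same_v<id_to_void_if_nested_t<type_id::FLOAT64>, double>));
}
```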

bdice · 2024-08-26T22:29:39Z

cpp/src/join/distinct_hash_join.cu

+
+        auto const output_begin =
+          thrust::make_transform_output_iterator(build_indices->begin(), output_fn{});
+        // TODO conditional find for nulls once `cuco::static_set::find_if` is added


This feature now exists in cuCollections, I think. Let's refactor if we can.

I think we could address this in a separate MR, since the change would not match this MR's description. What do you think?

cpp/src/join/mixed_join_kernel_semi_impl.cuh

cpp/src/join/mixed_join_kernels_semi.cuh

cpp/src/join/mixed_join_kernels_semi_compound.cu

tgujar · 2024-08-27T17:33:22Z

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

I am unsure how to handle this. #16603 says that the kernel launch and compilation should happen in the same TU for CUDA whole compilation mode. In this PR's case, that means all kernel instantiations would have to live in the same TU. But this PR splits the instantiations across TUs to reduce compilation time for the mixed semi join kernels. I don't think multiple launch functions would be good design.

robertmaynard

This PR introduces ODR violations of the kind corrected in #16603.

It needs to be refactored so that all kernels are only launched from the TU that holds the implementation.

robertmaynard · 2024-08-27T18:49:42Z

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

I am unsure how to handle this. #16603 says that the kernel launch and compilation should happen in the same TU for CUDA whole compilation mode. In this PR's case, that means all kernel instantiations would have to live in the same TU. But this PR splits the instantiations across TUs to reduce compilation time for the mixed semi join kernels. I don't think multiple launch functions would be good design.

You should be able to follow the updated pattern seen in cpp/src/join/mixed_join_kernel_nulls.cu, cpp/src/join/mixed_join_kernel.cu, cpp/src/join/mixed_join_kernel.cuh, and cpp/src/join/mixed_join_kernel.hpp.

That restructuring gives us separate TUs for the mixed join kernel based on the nullability of the input. This was done by giving the intermediate host launch code a specialization in each TU.
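The layout described here can be approximated with the host-only sketch below. It is a single-file stand-in with no CUDA code, and the file names in the comments, along with `launch_join`, `join_kernel_impl`, and `run_join`, are hypothetical; the actual layout is in the files listed above. The idea shown is that callers see only a declaration, while each translation unit defines one explicit specialization of the launch function next to the kernel instantiation it launches.

```cpp
#include <cassert>
#include <string>

// --- launch.hpp (hypothetical): declaration only, visible to callers ---
template <bool has_nulls>
std::string launch_join(int num_rows);

// --- kernel.cuh (hypothetical): shared implementation template ---
// Stand-in for the kernel body; a real version would be a __global__ function.
template <bool has_nulls>
std::string join_kernel_impl(int num_rows)
{
  return std::string(has_nulls ? "nulls:" : "no_nulls:") + std::to_string(num_rows);
}

// --- kernel.cu (hypothetical TU 1): instantiates and launches <false> ---
template <>
std::string launch_join<false>(int num_rows)
{
  return join_kernel_impl<false>(num_rows);
}

// --- kernel_nulls.cu (hypothetical TU 2): instantiates and launches <true> ---
template <>
std::string launch_join<true>(int num_rows)
{
  return join_kernel_impl<true>(num_rows);
}

// --- caller TU: picks a specialization at run time, never touches the kernel ---
inline std::string run_join(bool has_nulls, int num_rows)
{
  return has_nulls ? launch_join<true>(num_rows) : launch_join<false>(num_rows);
}

// Usage example: each path reaches the launch code that owns its instantiation.
inline void tu_layout_example()
{
  assert(run_join(true, 3) == "nulls:3");
  assert(run_join(false, 5) == "no_nulls:5");
}
```

With this split, each kernel instantiation still compiles in its own TU, so the compile-time benefit of splitting is kept while the launch stays next to the implementation.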

tgujar · 2024-11-09T03:07:31Z

Splitting this MR so it's easier to review and merge.

…7726) This PR introduces primitive row hashers and equality comparators and applies them to distinct hash joins to reduce register pressure and enhance runtime performance. It's an alternative to the 3-way dispatching row operators proposed in #15700, avoiding the build time issues associated with the original proposal. Testing shows that the new primitive row operators improve runtime performance in most scenarios, with architecture-dependent gains of up to 30%. Authors: - Yunsong Wang (https://github.com/PointKernel) - Tanmay Gujar (https://github.com/tgujar) Approvers: - David Wendt (https://github.com/davidwendt) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #17726

vyasr · 2025-05-16T22:54:27Z

@PointKernel with #17726 merged, what is next for this PR? Do we need to apply similar approaches to also speed up the (non-distinct) hash join?

PointKernel · 2025-05-16T23:55:24Z

@PointKernel with #17726 merged, what is next for this PR? Do we need to apply similar approaches to also speed up the (non-distinct) hash join?

Yes, I'm working on this and likely a separate PR to fully address the issue. I'll close this PR once the work is complete.

vyasr · 2025-05-20T01:46:09Z

Sounds great, thanks @PointKernel. When you do, can you also update #16484 accordingly?

PointKernel · 2025-05-20T01:48:26Z

Sounds great, thanks @PointKernel. When you do, can you also update #16484 accordingly?

Oh, thanks for the reminder! I just assigned the PR to myself to make sure I don't forget about it.

Supersedes #15700 This PR updates hash join to leverage primitive row operators where applicable, resulting in performance improvements of 10% to 30%. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Nghia Truong (https://github.com/ttnghia) - Basit Ayantunde (https://github.com/lamarrr) - Bradley Dice (https://github.com/bdice) - Muhammad Haseeb (https://github.com/mhaseeb123) - David Wendt (https://github.com/davidwendt) URL: #18896

…ins (#19361) Add primitive row operator for left semi/anti joins. This improves occupancy for join operations as detailed in #15700 Authors: - Tanmay Gujar (https://github.com/tgujar) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Muhammad Haseeb (https://github.com/mhaseeb123) - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #19361

tgujar · 2025-09-23T01:43:58Z

Closing as perf improvements from this investigation are merged as part of other MRs

github-actions bot added the libcudf label May 8, 2024

PointKernel added the non-breaking, 3 - Ready for Review, improvement, and Performance labels May 8, 2024

ttnghia reviewed May 14, 2024


davidwendt reviewed May 30, 2024


tgujar and others added 11 commits June 3, 2024 07:57

nested template instantiation for hiding types

3df7dce

hasher conditional type dispatch works

088042c

delete dead comment block

34bad1e

Fix docs

40f291e

fix type logic, minor refactor

b56bf75

refactor

078f53d

added template specialization for equality comparator

14785e4

added template specialized calls to comparator

b88b60b

fix for register usage discrepancy

1b89198

fix for register usage discrepancy

ca63201

revert edited comment blocks

ff5e0d4

tgujar force-pushed the hash-occupancy branch from 347fb02 to ff5e0d4 Compare June 3, 2024 16:13

bdice reviewed Aug 26, 2024


robertmaynard requested changes Aug 27, 2024


tgujar added 11 commits September 4, 2024 00:27

address review comments

57c9b5e

refactor find_any

8e6e6e5

Merge branch 'branch-24.10' into hash-occupancy

9cd1bda

Merge branch 'branch-24.10' into hash-occupancy

7d48f52

remove redundant SFINAE check

728478f

use distance from thrust namespace

926342f

update docs

d82754d

fix spelling

c78da6b

merge branch-24.10, needs fixes

daa2b40

fail with instantiating correct type

30ab4e3

fix issue with constness

f1db848

GregoryKimball mentioned this pull request Oct 29, 2024

[FEA] Investigate fast-path for hash joins that bypasses row operators #16026

Closed

tgujar mentioned this pull request Nov 9, 2024

Occupancy improvement for distinct hash join with specialized dispatch #17290

Closed

3 tasks

PointKernel mentioned this pull request May 8, 2025

Refactor distinct join to use primitive row operators when proper #17726

Merged

3 tasks

PointKernel mentioned this pull request May 20, 2025

Apply primitive row operators into hash join #18896

Merged

3 tasks

tgujar mentioned this pull request Jul 12, 2025

Add primitive row dispatch support for semi/anti join and cudf::contains #19361

Merged

3 tasks

tgujar closed this Sep 23, 2025
