Skip to content

Occupancy improvement for Hash table build#15700

Closed
tgujar wants to merge 50 commits intorapidsai:branch-24.10from
tgujar:hash-occupancy
Closed

Occupancy improvement for Hash table build#15700
tgujar wants to merge 50 commits intorapidsai:branch-24.10from
tgujar:hash-occupancy

Conversation

@tgujar
Copy link
Copy Markdown
Contributor

@tgujar tgujar commented May 8, 2024

Description

Implements specialized template dispatch for hash joins and mixed semi joins to fix issue describes in #15502.

At a high level, this PR typedef's some types to void depending on the column types in the row's to avoid high register usage for comparator and hasher operations associated with more involved types (lists, structs, string, ...). This is done by dynamic dispatch on CPU side using std::variant+std::visit and dispatching with a specialized template.

This pattern can later be extended to other joins and also to groupby operation. Any operator using row hasher and row comparator should be able to see and improvement in occupancy for hash table build/probe operation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot bot commented May 8, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 8, 2024
@tgujar
Copy link
Copy Markdown
Contributor Author

tgujar commented May 8, 2024

I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code replication. Currently, I have the conditional dispatch working for device_row_hasher but I am unsure if there is a better way to implement this. We could introduce a macro here to generate the code, what do you think?

@PointKernel PointKernel added non-breaking Non-breaking change 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function Performance Performance related issue labels May 8, 2024
@PointKernel
Copy link
Copy Markdown
Member

/ok to test

1 similar comment
@PointKernel
Copy link
Copy Markdown
Member

/ok to test

@PointKernel
Copy link
Copy Markdown
Member

PointKernel commented May 14, 2024

@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems that it improves the performance for mixed join but the performance drops significantly in other cases using row hasher.

Comment on lines +48 to +50
id_to_type<type_id::DECIMAL128>,
id_to_type<type_id::DECIMAL64>,
id_to_type<type_id::DECIMAL32>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think decimal types are complex type. They are just a wrapper around some integer type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Equality operator for Decimal will perform scaling which uses exponentiation.

CUDF_HOST_DEVICE inline bool operator==(fixed_point<Rep1, Rad1> const& lhs,

I see a reduction in register usage if I comment out decimal types in #15502. I think we can still decide on the types excluded in the branches later on

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if we could resolve this. I have addressed this here #15700 (comment)

@PointKernel
Copy link
Copy Markdown
Member

/ok to test

@PointKernel
Copy link
Copy Markdown
Member

@tgujar Could you take a look at the failing tests?

@PointKernel
Copy link
Copy Markdown
Member

/ok to test

1 similar comment
@PointKernel
Copy link
Copy Markdown
Member

/ok to test

@davidwendt
Copy link
Copy Markdown
Contributor

This PR needs to be rebased on branch-24.08.

@tgujar
Copy link
Copy Markdown
Contributor Author

tgujar commented May 30, 2024

Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. Investigating why the register pressure is different from commenting out the code paths.
The current plan is to avoid using a macro(as mentioned here) and instead do dynamic dispatch on CPU side using std::variant and std::visit

Copy link
Copy Markdown
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments attached. Thanks for this! There's a lot of heavy templating but it's fairly readable in spite of that.

I am also interested in build time comparisons to the previous code.

src/join/mixed_join_kernel_nulls.cu
src/join/mixed_join_kernels_semi.cu
src/join/mixed_join_semi.cu
src/join/mixed_join_kernels_semi.cu
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep these filenames alphabetized. If you like, you could rename this to mixed_join_semi_kernels.cu.

std::shared_ptr<cudf::experimental::row::equality::preprocessed_table>
_preprocessed_probe; ///< input table preprocssed for row operators
hash_table_type _hash_table; ///< hash table built on `_build`
_preprocessed_probe; ///< input table preprocssed for row operators
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_preprocessed_probe; ///< input table preprocssed for row operators
_preprocessed_probe; ///< input table preprocessed for row operators

struct dispatch_void_conditional_generator {
/// The underlying type
template <typename T>
using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;
using type = dispatch_void_conditional_t<std::disjunction_v<std::is_same<T, Types>...>, T>;

/// The type to dispatch to if the type is nested
using type = std::conditional_t<t == type_id::STRUCT or t == type_id::LIST, void, id_to_type<t>>;
/// The underlying type
using type = dispatch_void_if_nested_t<id_to_type<t>>;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically we define things the other way -- define the dispatch_void_if_nested struct, then define using dispatch_void_if_nested_t in terms of the ::type member of that struct.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay yep, but I think here I need dispatch_void_if_nested to be templated on cudf::type_id but I need dispatch_void_if_nested_t to be templated on some type T. Maybe they should be named differently?


auto const output_begin =
thrust::make_transform_output_iterator(build_indices->begin(), output_fn{});
// TODO conditional find for nulls once `cuco::static_set::find_if` is added
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature now exists in cuCollections, I think. Let's refactor if we can.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think maybe we could address this in a separate MR since the change wouldn't reflect this MR description. What do you think?

@tgujar
Copy link
Copy Markdown
Contributor Author

tgujar commented Aug 27, 2024

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole compilation mode. In this PR case, it means that all the instantiation of the kernels happen in same TU. But we split the instantiation in this PR to reduce compilation time for mixed semi join kernels. I think multiple launch functions wouldn't be good design.

Copy link
Copy Markdown
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR breaks ODR violations as corrected in #16603.

It needs to be refactored so that all kernels are only launched from the TU that holds the implemenation.

@robertmaynard
Copy link
Copy Markdown
Contributor

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole compilation mode. In this PR case, it means that all the instantiation of the kernels happen in same TU. But we split the instantiation in this PR to reduce compilation time for mixed semi join kernels. I think multiple launch functions wouldn't be good design.

You should be able to follow the updated pattern seen in cpp/src/join/mixed_join_kernel_nulls.cu, cpp/src/join/mixed_join_kernel.cu, cpp/src/join/mixed_join_kernel.cuh, and cpp/src/join/mixed_join_kernel.hpp.

That restructing has us separate TU's for the mixed join kernel based on the nullability of the input. This was done by having the intermidate host launch code have a specilization in each TU.

@tgujar
Copy link
Copy Markdown
Contributor Author

tgujar commented Nov 9, 2024

Splitting this MR so its easier to review and merge.

rapids-bot bot pushed a commit that referenced this pull request May 13, 2025
…7726)

This PR introduces primitive row hashers and equality comparators and apply them into distinct hash joins to reduce register pressure and enhance runtime performance. It's an alternative to the 3-way dispatching row operators proposed in #15700, avoiding the build time issues associated with the original proposal. Testing shows that the new primitive row operators improve runtime performance in most scenarios, with architecture-dependent gains of up to 30%.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)
  - Tanmay Gujar (https://github.com/tgujar)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - Muhammad Haseeb (https://github.com/mhaseeb123)

URL: #17726
@vyasr
Copy link
Copy Markdown
Contributor

vyasr commented May 16, 2025

@PointKernel with #17726 merged, what is next for this PR? Do we need to apply similar approaches to also speed up the (non-distinct) hash join?

@PointKernel
Copy link
Copy Markdown
Member

@PointKernel with #17726 merged, what is next for this PR? Do we need to apply similar approaches to also speed up the (non-distinct) hash join?

Yes, I'm working on this and likely a separate PR to fully address the issue. I'll close this PR once the work is complete.

@vyasr
Copy link
Copy Markdown
Contributor

vyasr commented May 20, 2025

Sounds great, thanks @PointKernel. When you do, can you also update #16484 accordingly?

@PointKernel
Copy link
Copy Markdown
Member

Sounds great, thanks @PointKernel. When you do, can you also update #16484 accordingly?

Oh, thanks for the reminder! I just assigned the PR to myself to make sure I don't forget about it.

rapids-bot bot pushed a commit that referenced this pull request Jun 18, 2025
Supersedes #15700

This PR updates hash join to leverage primitive row operators where applicable, resulting in performance improvements of 10% to 30%.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Basit Ayantunde (https://github.com/lamarrr)
  - Bradley Dice (https://github.com/bdice)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - David Wendt (https://github.com/davidwendt)

URL: #18896
rapids-bot bot pushed a commit that referenced this pull request Jul 22, 2025
…ins (#19361)

Add primitive row operator for left semi/anti joins. This improves occupancy for join operations as detailed in #15700

Authors:
  - Tanmay Gujar (https://github.com/tgujar)
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Nghia Truong (https://github.com/ttnghia)

URL: #19361
@tgujar
Copy link
Copy Markdown
Contributor Author

tgujar commented Sep 23, 2025

Closing as perf improvements from this investigation are merged as part of other MRs

@tgujar tgujar closed this Sep 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

8 participants