Occupancy improvement for Hash table build#15700
Occupancy improvement for Hash table build#15700tgujar wants to merge 50 commits intorapidsai:branch-24.10from
Conversation
|
I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code replication. Currently, I have the conditional dispatch working for |
|
/ok to test |
1 similar comment
|
/ok to test |
|
@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems that it improves the performance for mixed join but the performance drops significantly in other cases using row hasher. |
| id_to_type<type_id::DECIMAL128>, | ||
| id_to_type<type_id::DECIMAL64>, | ||
| id_to_type<type_id::DECIMAL32>, |
There was a problem hiding this comment.
I don't think decimal types are complex type. They are just a wrapper around some integer type.
There was a problem hiding this comment.
Equality operator for Decimal will perform scaling which uses exponentiation.
I see a reduction in register usage if I comment out decimal types in #15502. I think we can still decide on the types excluded in the branches later on
There was a problem hiding this comment.
Let me know if we could resolve this. I have addressed this here #15700 (comment)
|
/ok to test |
|
@tgujar Could you take a look at the failing tests? |
|
/ok to test |
1 similar comment
|
/ok to test |
|
This PR needs to be rebased on branch-24.08. |
|
Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. Investigating why the register pressure is different from commenting out the code paths. |
cpp/CMakeLists.txt
Outdated
| src/join/mixed_join_kernel_nulls.cu | ||
| src/join/mixed_join_kernels_semi.cu | ||
| src/join/mixed_join_semi.cu | ||
| src/join/mixed_join_kernels_semi.cu |
There was a problem hiding this comment.
Keep these filenames alphabetized. If you like, you could rename this to mixed_join_semi_kernels.cu.
| std::shared_ptr<cudf::experimental::row::equality::preprocessed_table> | ||
| _preprocessed_probe; ///< input table preprocssed for row operators | ||
| hash_table_type _hash_table; ///< hash table built on `_build` | ||
| _preprocessed_probe; ///< input table preprocssed for row operators |
There was a problem hiding this comment.
| _preprocessed_probe; ///< input table preprocssed for row operators | |
| _preprocessed_probe; ///< input table preprocessed for row operators |
| struct dispatch_void_conditional_generator { | ||
| /// The underlying type | ||
| template <typename T> | ||
| using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>; |
There was a problem hiding this comment.
| using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>; | |
| using type = dispatch_void_conditional_t<std::disjunction_v<std::is_same<T, Types>...>, T>; |
| /// The type to dispatch to if the type is nested | ||
| using type = std::conditional_t<t == type_id::STRUCT or t == type_id::LIST, void, id_to_type<t>>; | ||
| /// The underlying type | ||
| using type = dispatch_void_if_nested_t<id_to_type<t>>; |
There was a problem hiding this comment.
Typically we define things the other way -- define the dispatch_void_if_nested struct, then define using dispatch_void_if_nested_t in terms of the ::type member of that struct.
There was a problem hiding this comment.
Okay yep, but I think here I need dispatch_void_if_nested to be templated on cudf::type_id but I need dispatch_void_if_nested_t to be templated on some type T. Maybe they should be named differently?
|
|
||
| auto const output_begin = | ||
| thrust::make_transform_output_iterator(build_indices->begin(), output_fn{}); | ||
| // TODO conditional find for nulls once `cuco::static_set::find_if` is added |
There was a problem hiding this comment.
This feature now exists in cuCollections, I think. Let's refactor if we can.
There was a problem hiding this comment.
I think maybe we could address this in a separate MR since the change wouldn't reflect this MR description. What do you think?
Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole compilation mode. In this PR case, it means that all the instantiation of the kernels happen in same TU. But we split the instantiation in this PR to reduce compilation time for mixed semi join kernels. I think multiple launch functions wouldn't be good design. |
robertmaynard
left a comment
There was a problem hiding this comment.
This PR breaks ODR violations as corrected in #16603.
It needs to be refactored so that all kernels are only launched from the TU that holds the implemenation.
You should be able to follow the updated pattern seen in That restructing has us separate TU's for the mixed join kernel based on the nullability of the input. This was done by having the intermidate host launch code have a specilization in each TU. |
|
Splitting this MR so its easier to review and merge. |
…7726) This PR introduces primitive row hashers and equality comparators and apply them into distinct hash joins to reduce register pressure and enhance runtime performance. It's an alternative to the 3-way dispatching row operators proposed in #15700, avoiding the build time issues associated with the original proposal. Testing shows that the new primitive row operators improve runtime performance in most scenarios, with architecture-dependent gains of up to 30%. Authors: - Yunsong Wang (https://github.com/PointKernel) - Tanmay Gujar (https://github.com/tgujar) Approvers: - David Wendt (https://github.com/davidwendt) - Muhammad Haseeb (https://github.com/mhaseeb123) URL: #17726
|
@PointKernel with #17726 merged, what is next for this PR? Do we need to apply similar approaches to also speed up the (non-distinct) hash join? |
Yes, I'm working on this and likely a separate PR to fully address the issue. I'll close this PR once the work is complete. |
|
Sounds great, thanks @PointKernel. When you do, can you also update #16484 accordingly? |
Oh, thanks for the reminder! I just assigned the PR to myself to make sure I don't forget about it. |
Supersedes #15700 This PR updates hash join to leverage primitive row operators where applicable, resulting in performance improvements of 10% to 30%. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Nghia Truong (https://github.com/ttnghia) - Basit Ayantunde (https://github.com/lamarrr) - Bradley Dice (https://github.com/bdice) - Muhammad Haseeb (https://github.com/mhaseeb123) - David Wendt (https://github.com/davidwendt) URL: #18896
…ins (#19361) Add primitive row operator for left semi/anti joins. This improves occupancy for join operations as detailed in #15700 Authors: - Tanmay Gujar (https://github.com/tgujar) - Yunsong Wang (https://github.com/PointKernel) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Muhammad Haseeb (https://github.com/mhaseeb123) - Vyas Ramasubramani (https://github.com/vyasr) - Nghia Truong (https://github.com/ttnghia) URL: #19361
|
Closing as perf improvements from this investigation are merged as part of other MRs |
Description
Implements specialized template dispatch for hash joins and mixed semi joins to fix issue describes in #15502.
At a high level, this PR typedef's some types to void depending on the column types in the row's to avoid high register usage for comparator and hasher operations associated with more involved types (lists, structs, string, ...). This is done by dynamic dispatch on CPU side using std::variant+std::visit and dispatching with a specialized template.
This pattern can later be extended to other joins and also to groupby operation. Any operator using row hasher and row comparator should be able to see and improvement in occupancy for hash table build/probe operation.
Checklist