Is your feature request related to a problem? Please describe.
The cuco insert kernel has poor occupancy because of high register usage during the hash table build operation executed by cuDF. If I disable some of the code paths for complex types (by commenting out dict, string, list, struct, and decimal) in the type dispatcher,
CUDF_HOST_DEVICE __forceinline__ constexpr decltype(auto) type_dispatcher(cudf::data_type dtype,
then the register usage per thread drops from 75 to 46, which leads to a significant occupancy bump. It seems that the insert kernel pays the cost of high register usage even for simpler types, because the compiler has to account for all code paths.
I ran some experiments disabling different subsets of types. Each entry below reads: types disabled -> register count for the insert kernel.
- decimal -> 72
- struct -> 73
- list -> 73
- string -> 73
- dict -> 68
- struct, list -> 64
- list, decimal, struct -> 63
- dict, string, list, struct -> 58
- string, dict, struct, list, decimal -> 46
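To make the occupancy impact of these register counts concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the hardware limits (65536 registers and 64 resident warps per SM, per-thread registers rounded up to a multiple of 8) are assumptions that hold for some data-center GPUs but not all, and it ignores block size and shared memory limits. The helper name `max_resident_warps` is hypothetical.

```cpp
#include <cassert>

// Assumed hardware limits; check the values for your target architecture.
constexpr int kRegistersPerSM      = 65536;  // 32-bit registers per SM (assumption)
constexpr int kMaxWarpsPerSM       = 64;     // resident warp limit per SM (assumption)
constexpr int kThreadsPerWarp      = 32;
constexpr int kRegAllocGranularity = 8;      // per-thread rounding (assumption)

// Upper bound on resident warps per SM when registers are the only limiter.
constexpr int max_resident_warps(int regs_per_thread)
{
  // Round the per-thread register count up to the allocation granularity.
  int rounded = ((regs_per_thread + kRegAllocGranularity - 1) / kRegAllocGranularity) *
                kRegAllocGranularity;
  int warps = kRegistersPerSM / (rounded * kThreadsPerWarp);
  return warps < kMaxWarpsPerSM ? warps : kMaxWarpsPerSM;
}

// 75 registers round up to 80: 65536 / (80 * 32) = 25 warps.
static_assert(max_resident_warps(75) == 25, "all types enabled");
// 46 registers round up to 48: 65536 / (48 * 32) = 42 warps.
static_assert(max_resident_warps(46) == 42, "complex types disabled");
```

Under these assumptions the register-limited occupancy goes from 25/64 (about 39%) to 42/64 (about 66%). The real figure depends on the GPU and launch configuration.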
Here is the speedup I see on the mixed semi join kernel for int32 keys when occupancy is improved by disabling complex types:

Describe the solution you'd like
Improve occupancy by excluding the code paths for complex types when the keys contain only simpler types.
Describe alternatives you've considered
- Add more template params to the hasher/comparator which allow us to separate codepaths for complex types and simpler types, or
- Add JIT compilation to only consider the types necessary for hasher/comparator for a row
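The first alternative could look roughly like the sketch below. It is not cuDF's actual API: `type_id`, `dispatch`, the `AllowComplex` parameter, and `size_of_fn` are hypothetical stand-ins, and the example is host-only C++17. The idea it illustrates is that a compile-time flag removes the complex-type branches with `if constexpr`, so the simple-type instantiation never instantiates (or budgets registers for) the complex paths.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical, tiny stand-in for a type id enum.
enum class type_id { INT32, FLOAT64, STRING };

// Hypothetical dispatcher: AllowComplex selects at compile time whether the
// complex-type branches exist in this instantiation at all.
template <bool AllowComplex, typename Functor>
auto dispatch(type_id id, Functor f)
{
  switch (id) {
    case type_id::INT32: return f.template operator()<std::int32_t>();
    case type_id::FLOAT64: return f.template operator()<double>();
    case type_id::STRING:
      // Discarded when AllowComplex is false, so operator()<std::string>
      // is never instantiated for the simple-types dispatcher.
      if constexpr (AllowComplex) { return f.template operator()<std::string>(); }
      break;
  }
  throw std::invalid_argument("type not supported by this dispatcher instantiation");
}

// Example functor: returns the size in bytes of the dispatched type.
struct size_of_fn {
  template <typename T>
  int operator()() const
  {
    return static_cast<int>(sizeof(T));
  }
};
```

A caller that knows the keys are all simple types would use `dispatch<false>(...)`, and fall back to `dispatch<true>(...)` otherwise. The cost is one extra kernel instantiation per flag value.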
Additional context
The type dispatcher referenced above is defined in cudf/cpp/include/cudf/utilities/type_dispatcher.hpp, line 456 in 434df44.