-
Notifications
You must be signed in to change notification settings - Fork 1k
[FEA] Add distinct-key joins to libcudf #14948
Description
Is your feature request related to a problem? Please describe.
For equality joins in which the keys of one of the tables do not contain any duplicates, then we can provide a more efficient implementation based on cuco::static_set. Distinct-key joins also have more predictable output sizes and most join types can be implemented with single-pass kernels. The join APIs currently in libcudf's hash_join class use the cuco::static_multimap data structure to support duplicates.
Describe the solution you'd like
We should provide a new distinct_hash_join class that uses the cuco::static_set data structure and does not support duplicate keys in the build table. This class would have member functions for inner_join and left_join join types.
Staging the work
- Update RAPIDS to use cuco version
56c53beb(Fetch the latest cuco and remove outdated patches rapids-cmake#526) - Merge distinct-key inner join based on a
static_setdata structure (Add distinct key inner join #14990) - Add left and full join types (Add distinct left join #15149)
- Explore a fast path with a single
int32orint64keying column. To be revisited after Add set retrieve NVIDIA/cuCollections#442 - refactor the custom device code from the first distinct key join implementation Improve distinct join with set
retrieve#15636 - Explore a shared memory hash table if cardinality is below the size threshold to fit in shared memory. We can estimate if the hash table will fit in shared memory based on the size of the build table. Also see [BUG] low cardinality joins on the GPU are really bad. NVIDIA/spark-rapids#7529
Additional context
See also #12261, which includes refactoring hash_join from using cuco::static_multimap to cuco::static_multiset. If we add the simpler and more efficient distinct-key joins, it will make it easier to experiment with join implementations using set-like data structures.
Distinct-key joins are common in "primary key / foreign key" joins because the primary key in a table is required to never have duplicates.