[FEA] Add distinct-key joins to libcudf #14948

**Is your feature request related to a problem? Please describe.**
For equality joins in which the keys of one table contain no duplicates, we can provide a more efficient implementation based on `cuco::static_set`. Because each probe row matches at most one build row, distinct-key joins also have more predictable output sizes, and most join types can be implemented with single-pass kernels. The join APIs currently in libcudf's [hash_join](https://github.com/rapidsai/cudf/blob/767dde16e413f34cac16cb0b96b7eca18d71b7e9/cpp/include/cudf/join.hpp#L273) class use the `cuco::static_multimap` data structure to support duplicates.
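To make the output-size argument concrete, here is a minimal host-side sketch of a distinct-key inner join. It is not libcudf code: `std::unordered_map` stands in for `cuco::static_set`, the function name `distinct_inner_join_sketch` and the (build indices, probe indices) return convention are assumptions made for illustration, and keys are limited to a single `int64_t` column. The point is that the output can be allocated up front at the probe-table size and filled in one pass, with no separate counting pass.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical host-side illustration; a GPU implementation would use a
// device hash set and a kernel instead of these loops.
// Returns (build row indices, probe row indices) for matching rows.
// Assumes `build_keys` contains no duplicates.
std::pair<std::vector<int32_t>, std::vector<int32_t>> distinct_inner_join_sketch(
  std::vector<int64_t> const& build_keys, std::vector<int64_t> const& probe_keys)
{
  // Build phase: map each distinct key to its row index.
  std::unordered_map<int64_t, int32_t> key_to_row;
  key_to_row.reserve(build_keys.size());
  for (std::size_t i = 0; i < build_keys.size(); ++i) {
    key_to_row.emplace(build_keys[i], static_cast<int32_t>(i));
  }

  // Each probe row matches at most one build row, so the probe size is an
  // upper bound on the output size and can be allocated before probing.
  std::vector<int32_t> build_idx(probe_keys.size());
  std::vector<int32_t> probe_idx(probe_keys.size());
  std::size_t num_matches = 0;

  // Probe phase: a single pass writes matches directly into the output.
  for (std::size_t j = 0; j < probe_keys.size(); ++j) {
    auto const it = key_to_row.find(probe_keys[j]);
    if (it != key_to_row.end()) {
      build_idx[num_matches] = it->second;
      probe_idx[num_matches] = static_cast<int32_t>(j);
      ++num_matches;
    }
  }

  // Trim the unused tail of the preallocated output.
  build_idx.resize(num_matches);
  probe_idx.resize(num_matches);
  return {std::move(build_idx), std::move(probe_idx)};
}
```

With duplicate build keys, a probe row could match many build rows, so the output size would not be bounded by the probe size and would typically need to be computed before the results are written.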

**Describe the solution you'd like**
We should provide a new `distinct_hash_join` class that uses the `cuco::static_set` data structure and does not support duplicate keys in the build table. This class would have member functions for `inner_join` and `left_join` join types.
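As a rough illustration of the proposed shape, the sketch below builds the lookup structure once in the constructor and then serves `inner_join` and `left_join` probes. Everything here is hypothetical: the class name `distinct_hash_join_sketch`, the `no_match` sentinel, the return types, and the single `int64_t` key column are assumptions for the example, and `std::unordered_map` stands in for `cuco::static_set`. It is not the libcudf API.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical sentinel marking a probe row with no matching build row.
constexpr int32_t no_match = -1;

// Hypothetical host-side sketch of a build-once, probe-many join object.
class distinct_hash_join_sketch {
 public:
  // Assumes `build_keys` contains no duplicates.
  explicit distinct_hash_join_sketch(std::vector<int64_t> const& build_keys)
  {
    key_to_row_.reserve(build_keys.size());
    for (std::size_t i = 0; i < build_keys.size(); ++i) {
      key_to_row_.emplace(build_keys[i], static_cast<int32_t>(i));
    }
  }

  // Returns (build row indices, probe row indices) for matching rows only.
  std::pair<std::vector<int32_t>, std::vector<int32_t>> inner_join(
    std::vector<int64_t> const& probe_keys) const
  {
    std::vector<int32_t> build_idx;
    std::vector<int32_t> probe_idx;
    for (std::size_t j = 0; j < probe_keys.size(); ++j) {
      int32_t const match = lookup(probe_keys[j]);
      if (match != no_match) {
        build_idx.push_back(match);
        probe_idx.push_back(static_cast<int32_t>(j));
      }
    }
    return {std::move(build_idx), std::move(probe_idx)};
  }

  // Returns one build row index per probe row, or `no_match`.
  // The output size always equals the probe size.
  std::vector<int32_t> left_join(std::vector<int64_t> const& probe_keys) const
  {
    std::vector<int32_t> build_idx(probe_keys.size());
    for (std::size_t j = 0; j < probe_keys.size(); ++j) {
      build_idx[j] = lookup(probe_keys[j]);
    }
    return build_idx;
  }

 private:
  int32_t lookup(int64_t key) const
  {
    auto const it = key_to_row_.find(key);
    return it == key_to_row_.end() ? no_match : it->second;
  }

  std::unordered_map<int64_t, int32_t> key_to_row_;
};
```

In this sketch the left join needs no probe-side index vector at all, since output row `j` always corresponds to probe row `j`.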

**Staging the work**
- [x] Update RAPIDS to use cuco version `56c53beb` (https://github.com/rapidsai/rapids-cmake/pull/526)
- [x] Merge distinct-key inner join based on a `static_set` data structure (https://github.com/rapidsai/cudf/pull/14990)
- [x] Add left and full join types (#15149)
- [ ] Explore a fast path for a single `int32` or `int64` key column. To be revisited after https://github.com/NVIDIA/cuCollections/pull/442
- [x] Refactor the custom device code from the first distinct-key join implementation (#15636)
- [ ] Explore a shared memory hash table for build tables whose cardinality is small enough for the hash table to fit in shared memory. We can estimate whether the hash table will fit based on the size of the build table. See also https://github.com/NVIDIA/spark-rapids/issues/7529
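The shared memory estimate in the last item could look something like the following sketch. The helper name `fits_in_shared_memory`, its parameters, and the sizing model (slots = rows / load factor, one fixed-size slot per entry) are all assumptions for illustration; the real slot layout and load factor would depend on the hash table implementation, and the shared memory budget would come from the device (for example, `cudaDeviceProp::sharedMemPerBlock`).

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical estimate: does a hash table for `build_rows` distinct keys fit
// in `shared_mem_bytes`, given `slot_bytes` per slot and a target occupancy
// of `load_factor` (0 < load_factor <= 1)?
inline bool fits_in_shared_memory(std::size_t build_rows,
                                  std::size_t slot_bytes,
                                  double load_factor,
                                  std::size_t shared_mem_bytes)
{
  if (load_factor <= 0.0 || load_factor > 1.0) { return false; }
  // Number of slots needed so the table is at most `load_factor` full.
  auto const slots =
    static_cast<std::size_t>(std::ceil(static_cast<double>(build_rows) / load_factor));
  return slots * slot_bytes <= shared_mem_bytes;
}
```

For example, with an assumed 48 KiB budget, 8-byte slots, and a 0.5 load factor, 1,000 build rows need 2,000 slots (16,000 bytes) and fit, while 10,000 build rows need 160,000 bytes and do not.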

**Additional context**
See also #12261, which includes refactoring `hash_join` from using `cuco::static_multimap` to `cuco::static_multiset`. Adding the simpler and more efficient distinct-key joins will make it easier to experiment with join implementations using set-like data structures.

Distinct-key joins are common in "primary key / foreign key" joins because the primary key of a table must never contain duplicates.
