[FEA] Add distinct-key joins to libcudf #14948

@GregoryKimball

Is your feature request related to a problem? Please describe.
For equality joins in which the keys of one table contain no duplicates, we can provide a more efficient implementation based on cuco::static_set. Distinct-key joins also have more predictable output sizes, because each row of the other table matches at most one row of the distinct-key table, and most join types can be implemented with single-pass kernels. The join APIs currently in libcudf's hash_join class use the cuco::static_multimap data structure to support duplicates.

Describe the solution you'd like
We should provide a new distinct_hash_join class that uses the cuco::static_set data structure and does not support duplicate keys in the build table. This class would have member functions for the inner_join and left_join join types.
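To make the idea concrete, here is a minimal CPU-only sketch of a distinct-key join. It is not the proposed libcudf API: the class name distinct_join_sketch, the row_index type, the NO_MATCH sentinel, and the use of std::unordered_map in place of cuco::static_set are all assumptions made for illustration. It shows why a single pass over the probe keys is enough when the build keys are distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical names for illustration only; not part of libcudf.
using row_index = std::int32_t;
constexpr row_index NO_MATCH = -1;  // assumed sentinel for "no matching build row"

class distinct_join_sketch {
 public:
  // Build phase: map each build key to its row index.
  // Duplicate build keys are rejected, mirroring the "no duplicates" requirement.
  explicit distinct_join_sketch(std::vector<int> const& build_keys)
  {
    index_.reserve(build_keys.size());
    for (std::size_t i = 0; i < build_keys.size(); ++i) {
      bool const inserted = index_.emplace(build_keys[i], static_cast<row_index>(i)).second;
      if (!inserted) { throw std::invalid_argument("duplicate key in build table"); }
    }
  }

  // Inner join: returns (build row indices, probe row indices) for matching rows.
  // Each probe row matches at most one build row, so one pass suffices.
  std::pair<std::vector<row_index>, std::vector<row_index>> inner_join(
    std::vector<int> const& probe_keys) const
  {
    std::vector<row_index> build_idx;
    std::vector<row_index> probe_idx;
    for (std::size_t i = 0; i < probe_keys.size(); ++i) {
      auto const it = index_.find(probe_keys[i]);
      if (it != index_.end()) {
        build_idx.push_back(it->second);
        probe_idx.push_back(static_cast<row_index>(i));
      }
    }
    // The output size is bounded by the probe table size.
    assert(build_idx.size() <= probe_keys.size());
    return {std::move(build_idx), std::move(probe_idx)};
  }

  // Left join: exactly one output entry per probe row, holding the matching
  // build row index or NO_MATCH.
  std::vector<row_index> left_join(std::vector<int> const& probe_keys) const
  {
    std::vector<row_index> build_idx(probe_keys.size(), NO_MATCH);
    for (std::size_t i = 0; i < probe_keys.size(); ++i) {
      auto const it = index_.find(probe_keys[i]);
      if (it != index_.end()) { build_idx[i] = it->second; }
    }
    return build_idx;
  }

 private:
  std::unordered_map<int, row_index> index_;
};
```

In this sketch the left join output always has exactly as many rows as the probe table, and the inner join output has at most that many, so both results can be sized up front. A multimap-based join cannot make that guarantee, because a probe row may match any number of build rows.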

Staging the work

Additional context
See also #12261, which covers refactoring hash_join from cuco::static_multimap to cuco::static_multiset. Adding the simpler and more efficient distinct-key joins will make it easier to experiment with join implementations that use set-like data structures.

Distinct-key joins commonly arise as "primary key / foreign key" joins, because a table's primary key is required to be free of duplicates.

Labels

2 - In Progress: Currently a work in progress
Performance: Performance related issue
Spark: Functionality that helps Spark RAPIDS
feature request: New feature or request
libcudf: Affects libcudf (C++/CUDA) code.