[FEA] Improve hash join with the new data structure #19270
Description
Is your feature request related to a problem? Please describe.
#18021 migrates the legacy hash join implementation to use the new cuco::static_multiset data structure. This change eliminates the awkward use of hash tables that required checking both keys and payloads for equality via the unconventional pair_* APIs. However, the new implementation introduces performance issues, with regressions of up to 20% observed in some cases.
Describe the solution you'd like
Once #18021 is merged, there are several directions we can explore to further improve hash join runtime performance. Historically, our hash join implementation has favored using cuco host APIs to minimize maintenance overhead on the cudf side. This approach has served us well over the past few years, enabling efficient development and simplifying long-term maintenance.
However, recent performance challenges suggest that this strategy is no longer optimal. As we invest more effort into performance tuning, such as with shared memory groupby, we've come to realize that highly optimized kernels often require special handling tailored to specific data sizes or distributions. Previously, we avoided this level of tuning because cuco's general-purpose host APIs made fine-grained control impractical. Additionally, we operated under the assumption that we wouldn't introspect data characteristics at runtime.
With that in mind, here are a few ideas I’d like to explore after #18021 is in:
- Refactor the join utilities, as several are currently misplaced: e.g., mixed join utilities reside in `join_common_utils.cuh`, while common functions for mixed semi/anti joins are placed within the standard mixed join implementations. Additionally, many utilities were duplicated to support fast prototyping or to avoid breaking changes during the extended migration process; these duplicates should be removed and everything properly organized.
- Chunked Probing API Support: [FEA] Add support for chunked probe in hash joins #18677 proposes exposing new APIs to support chunked probing in hash join. This would unblock Spark’s efforts to implement more effective workload chunking and better balance processing. This is a foundational step toward more scalable join execution.
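To make the chunked probing idea concrete, here is a minimal host-side sketch of what the driving loop might look like. The name `probe_in_chunks` and its signature are assumptions for illustration only; the actual API shape is being designed in #18677.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical shape of a chunked probing loop (names and signatures are
// assumptions, not the proposed cudf API). Instead of probing the whole
// table at once, the caller walks the probe table in fixed-size chunks,
// which lets engines like Spark bound per-step memory and balance work.
template <typename ProbeChunkFn>
void probe_in_chunks(std::size_t num_probe_rows,
                     std::size_t chunk_size,
                     ProbeChunkFn&& probe_chunk)
{
  for (std::size_t begin = 0; begin < num_probe_rows; begin += chunk_size) {
    std::size_t end = std::min(begin + chunk_size, num_probe_rows);
    probe_chunk(begin, end);  // probe rows [begin, end) against the built table
  }
}
```

Each invocation of `probe_chunk` would run a count/retrieve pass over just that slice of the probe table, producing per-chunk results the caller can consume incrementally.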
- Replacing cuco Host APIs with cudf Custom Kernels: We should consider replacing cuco’s host APIs, specifically `count`/`count_outer` and `retrieve`/`retrieve_outer`, with cudf-specific custom kernels. This would allow us to better support cudf-specific semantics such as outer joins, and eventually phase out the unused `*_outer` APIs in cuco. As part of this effort, we should also resolve [FEA] Filter join probe table rows that contain nulls when nulls are not equal #9151 by skipping nulls during probing, similar to how we currently skip them during insertion. Additionally, the existing `retrieve` kernel is designed for general-purpose use, writing both contained and matched key-value pairs to the output. In cudf, we often discard the unnecessary results using transform or discard iterators, but these still introduce register pressure and prevent the compiler from fully optimizing the code. In a custom kernel, we can simplify the output to include only the right table indices, reducing overhead and improving performance.
- Branching Instruction Analysis: While not strictly dependent on the custom-kernel work above, using custom kernels gives us an opportunity to investigate the source of the ~15% increase in branching instructions observed after migrating to the new multiset structure. Understanding this can help us optimize and port any improvements back to cuco, potentially addressing the performance regressions.
- Improved Count and Retrieve with Chunked Probing: With chunked probing, we’ll obtain an array of match counts rather than a global sum. This design eliminates the need for costly global atomics during counting. Once `count` is done, we can perform a scan and pass the scanned offsets to a custom `retrieve` kernel. This avoids global reductions during retrieval, potentially improving performance. Additionally, since each thread will know its exact output location, shared memory buffering may no longer be necessary.
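The count-then-scan step can be illustrated on the host; the function name `counts_to_offsets` is a hypothetical stand-in for whatever the eventual kernel plumbing looks like.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch of the count -> scan -> retrieve pipeline described above, shown on
// the host for clarity. Given per-row match counts, an exclusive scan yields
// each row's write offset, and the final entry is the total output size --
// both obtained without any global atomic.
std::vector<std::size_t> counts_to_offsets(std::vector<std::size_t> const& match_counts)
{
  std::vector<std::size_t> offsets(match_counts.size() + 1, 0);
  // offsets[i] is where row i starts writing; offsets.back() is the total.
  std::inclusive_scan(match_counts.begin(), match_counts.end(), offsets.begin() + 1);
  return offsets;
}
```

On device this would be a `thrust`/`cub` scan between the count and retrieve kernels, with each thread indexing directly into the output via its precomputed offset.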
- The current hash join implementation uses a tagging strategy where the row's hash value is used as a tag and combined with the row index to form the key inserted into the hash table. This approach is beneficial for multi-column or complex data types. However, for single-column inputs with primitive types, it may be excessive. In such cases, we could evaluate the performance of using the row content directly as the key, which could reduce memory usage and potentially improve runtime performance.
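One possible encoding of the tagging strategy is packing a 32-bit row hash and a 32-bit row index into a single 64-bit key; this is an illustration of the idea, not cudf's actual layout.

```cpp
#include <cstdint>

// Illustration (not cudf's actual layout) of the tagging strategy described
// above: a 32-bit hash of the row acts as a tag and is combined with the
// 32-bit row index to form the 64-bit key stored in the table. Equality can
// compare tags first and only fall back to comparing full rows on a tag match.
constexpr uint64_t make_tagged_key(uint32_t hash_tag, uint32_t row_index)
{
  return (static_cast<uint64_t>(hash_tag) << 32) | row_index;
}

constexpr uint32_t tag_of(uint64_t key) { return static_cast<uint32_t>(key >> 32); }
constexpr uint32_t index_of(uint64_t key) { return static_cast<uint32_t>(key); }

// For a single primitive column (e.g. int32), the row value itself could
// serve as the key, avoiding the tag indirection and the extra row gather
// on each probe -- the trade-off this item proposes to benchmark.
```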