[FEA] Improve hash join with the new data structure #19270
Description
Is your feature request related to a problem? Please describe.
#18021 migrates the legacy hash join implementation to use the new cuco::static_multiset data structure. This change eliminates the awkward use of hash tables that required checking both keys and payloads for equality via the unconventional pair_* APIs. However, the new implementation introduces performance issues, with regressions of up to 20% observed in some cases.
Describe the solution you'd like
Once #18021 is merged, there are several directions we can explore to further improve hash join runtime performance. Historically, our hash join implementation has favored using cuco host APIs to minimize maintenance overhead on the cudf side. This approach has served us well over the past few years, enabling efficient development and simplifying long-term maintenance.
However, recent performance challenges suggest that this strategy is no longer optimal. As we invest more effort into performance tuning, such as with shared memory groupby, we've come to realize that highly optimized kernels often require special handling tailored to specific data sizes or distributions. Previously, we avoided this level of tuning because cuco's general-purpose host APIs made fine-grained control impractical. Additionally, we operated under the assumption that we wouldn't introspect data characteristics at runtime.
With that in mind, here are a few ideas I’d like to explore after #18021 is in:
- Refactor the join utilities, as several are currently misplaced: e.g., mixed join utilities reside in `join_common_utils.cuh`, while common functions for mixed semi/anti joins are placed within the standard mixed join implementations. Additionally, many utilities were duplicated to support fast prototyping or to avoid breaking changes during the extended migration process; these duplicates should be removed and everything properly organized.
- Chunked Probing API Support: [FEA] Add support for chunked probe in hash joins #18677 proposes exposing new APIs to support chunked probing in hash join. This would unblock Spark’s efforts to implement more effective workload chunking and better balance processing. This is a foundational step toward more scalable join execution.
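To make the chunked probing idea concrete, here is a minimal host-side sketch of what the driving loop might look like. The name `probe_in_chunks` and its signature are assumptions for illustration only; the actual API shape is being designed in #18677.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical shape of a chunked probing loop (names and signatures are
// assumptions, not the proposed cudf API). Instead of probing the whole
// table at once, the caller walks the probe table in fixed-size chunks,
// which lets engines like Spark bound per-step memory and balance work.
template <typename ProbeChunkFn>
void probe_in_chunks(std::size_t num_probe_rows,
                     std::size_t chunk_size,
                     ProbeChunkFn&& probe_chunk)
{
  for (std::size_t begin = 0; begin < num_probe_rows; begin += chunk_size) {
    std::size_t end = std::min(begin + chunk_size, num_probe_rows);
    probe_chunk(begin, end);  // probe rows [begin, end) against the built table
  }
}
```

Each invocation of `probe_chunk` would run a count/retrieve pass over just that slice of the probe table, producing per-chunk results the caller can consume incrementally.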
- Replacing cuco Host APIs with cudf Custom Kernels: We should consider replacing cuco’s host APIs, specifically `count`/`count_outer` and `retrieve`/`retrieve_outer`, with cudf-specific custom kernels. This would allow us to better support cudf-specific semantics such as outer joins, and eventually phase out the unused `*_outer` APIs in cuco. As part of this effort, we should also resolve [FEA] Filter join probe table rows that contain nulls when nulls are not equal #9151 by skipping nulls during probing, similar to how we currently skip them during insertion. Additionally, the existing `retrieve` kernel is designed for general-purpose use, writing both contained and matched key-value pairs to the output. In cudf, we often discard the unnecessary results using transform or discard iterators, but these still introduce register pressure and prevent the compiler from fully optimizing the code. In a custom kernel, we can simplify the output to include only the right table indices, reducing overhead and improving performance.
- Branching Instruction Analysis: While not strictly dependent on the custom-kernel work above, using custom kernels gives us an opportunity to investigate the source of the ~15% increase in branching instructions observed after migrating to the new multiset structure. Understanding this can help us optimize and port any improvements back to cuco, potentially addressing the performance regressions.
- Improved Count and Retrieve with Chunked Probing: With chunked probing, we’ll obtain an array of match counts rather than a global sum. This design eliminates the need for costly global atomics during counting. Once `count` is done, we can perform a scan and pass the scanned offsets to a custom `retrieve` kernel. This avoids global reductions during retrieval, potentially improving performance. Additionally, since each thread will know its exact output location, shared memory buffering may no longer be necessary.
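The count-then-scan step can be illustrated on the host; the function name `counts_to_offsets` is a hypothetical stand-in for whatever the eventual kernel plumbing looks like.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Sketch of the count -> scan -> retrieve pipeline described above, shown on
// the host for clarity. Given per-row match counts, an exclusive scan yields
// each row's write offset, and the final entry is the total output size --
// both obtained without any global atomic.
std::vector<std::size_t> counts_to_offsets(std::vector<std::size_t> const& match_counts)
{
  std::vector<std::size_t> offsets(match_counts.size() + 1, 0);
  // offsets[i] is where row i starts writing; offsets.back() is the total.
  std::inclusive_scan(match_counts.begin(), match_counts.end(), offsets.begin() + 1);
  return offsets;
}
```

On device this would be a `thrust`/`cub` scan between the count and retrieve kernels, with each thread indexing directly into the output via its precomputed offset.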
- The current hash join implementation uses a tagging strategy where the row's hash value is used as a tag and combined with the row index to form the key inserted into the hash table. This approach is beneficial for multi-column or complex data types. However, for single-column inputs with primitive types, it may be excessive. In such cases, we could evaluate the performance of using the row content directly as the key, which could reduce memory usage and potentially improve runtime performance.
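One possible encoding of the tagging strategy is packing a 32-bit row hash and a 32-bit row index into a single 64-bit key; this is an illustration of the idea, not cudf's actual layout.

```cpp
#include <cstdint>

// Illustration (not cudf's actual layout) of the tagging strategy described
// above: a 32-bit hash of the row acts as a tag and is combined with the
// 32-bit row index to form the 64-bit key stored in the table. Equality can
// compare tags first and only fall back to comparing full rows on a tag match.
constexpr uint64_t make_tagged_key(uint32_t hash_tag, uint32_t row_index)
{
  return (static_cast<uint64_t>(hash_tag) << 32) | row_index;
}

constexpr uint32_t tag_of(uint64_t key) { return static_cast<uint32_t>(key >> 32); }
constexpr uint32_t index_of(uint64_t key) { return static_cast<uint32_t>(key); }

// For a single primitive column (e.g. int32), the row value itself could
// serve as the key, avoiding the tag indirection and the extra row gather
// on each probe -- the trade-off this item proposes to benchmark.
```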