[FEA] Add prefetching to join build object's internal map/set storage #20073

@devavret

Is your feature request related to a problem? Please describe.

When evaluating Velox-cuDF's performance on larger-than-memory joins, we observed significant page faults in inner_join. Prefetching the hash_join object's internal map/set storage mitigated these to a fair degree.

Before
Image

After
Image

Describe the solution you'd like

cuDF should prefetch the storage used internally by cuco just before a join, either by calling prefetch on cuco's internal storage through its public APIs, like so:

--- a/cpp/src/join/hash_join.cu
+++ b/cpp/src/join/hash_join.cu
@@ -278,6 +278,8 @@ probe_join_hash_table(
   auto right_indices = std::make_unique<rmm::device_uvector<size_type>>(join_size, stream, mr);
   cudf::prefetch::detail::prefetch(*left_indices, stream);
   cudf::prefetch::detail::prefetch(*right_indices, stream);
+  cudf::experimental::prefetch::detail::prefetch(
+    hash_table.data(), hash_table.capacity() * sizeof(*hash_table.data()), stream);
 
   auto const probe_table_num_rows = probe_table.num_rows();
   auto const out_probe_begin =

Or by having libcudf own the map/set's underlying storage directly and use cuco's non-owning types on top of it, so that it can confidently prefetch that storage.
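
To make the second option concrete, here is a minimal host-only sketch of the ownership split, under stated assumptions. Every name in it (`prefetch_log`, `slot_set_view`, `join_build_storage`, `build_then_prefetch`) is hypothetical and is not cuco's or libcudf's API. `prefetch_log::prefetch` only records the requested range and stands in for a real prefetch call on managed memory. The point is only that an owner holding the buffer always knows the exact pointer and byte count to prefetch, while the hashing logic works through a non-owning view.

```cpp
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <utility>
#include <vector>

// Sentinel marking an unused slot (assumption: -1 is never a valid key here).
constexpr std::int32_t empty_slot = -1;

// Hypothetical stand-in for a real prefetch call: it only records the range
// it was asked to prefetch so the behavior can be inspected on the host.
struct prefetch_log {
  std::vector<std::pair<void const*, std::size_t>> calls;
  void prefetch(void const* ptr, std::size_t bytes) { calls.emplace_back(ptr, bytes); }
};

// Hypothetical non-owning view over slot storage. It performs linear-probing
// insert/lookup but never allocates or frees the slots it points to.
struct slot_set_view {
  std::int32_t* slots;
  std::size_t capacity;

  static std::size_t hash(std::int32_t key) {
    return static_cast<std::size_t>(static_cast<std::uint32_t>(key)) * 2654435761u;
  }

  // Returns true if the key was newly inserted, false if present or table full.
  bool insert(std::int32_t key) {
    for (std::size_t i = 0; i < capacity; ++i) {
      std::int32_t& slot = slots[(hash(key) + i) % capacity];
      if (slot == empty_slot) { slot = key; return true; }
      if (slot == key) { return false; }
    }
    return false;
  }

  bool contains(std::int32_t key) const {
    for (std::size_t i = 0; i < capacity; ++i) {
      std::int32_t const slot = slots[(hash(key) + i) % capacity];
      if (slot == key) { return true; }
      if (slot == empty_slot) { return false; }
    }
    return false;
  }
};

// Hypothetical owner of the storage (the role libcudf would play). Because it
// holds the buffer, it can hand the exact pointer and size to a prefetch call.
class join_build_storage {
 public:
  explicit join_build_storage(std::size_t capacity) : slots_(capacity, empty_slot) {}

  slot_set_view view() { return slot_set_view{slots_.data(), slots_.size()}; }
  std::size_t size_bytes() const { return slots_.size() * sizeof(std::int32_t); }
  void prefetch(prefetch_log& log) const { log.prefetch(slots_.data(), size_bytes()); }

 private:
  std::vector<std::int32_t> slots_;
};

// Example flow: build through the view, then prefetch the whole storage range
// just before probing. Returns the number of bytes prefetched.
inline std::size_t build_then_prefetch(prefetch_log& log) {
  join_build_storage storage(8);
  slot_set_view set = storage.view();
  for (std::int32_t key : {3, 11, 42}) { set.insert(key); }
  storage.prefetch(log);
  return set.contains(42) ? storage.size_bytes() : 0;
}
```

In a real implementation the buffer would be device-accessible memory and the recorded call would be an actual prefetch on the join's stream. The sketch leaves both out so that it runs anywhere.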

Additionally, prefetching the build table would further reduce page faults. That could also be done in the application layer, since the hash_join object does not take ownership of the build table, but it would be convenient if cuDF did it automatically.

Labels: Performance, Velox, feature request, libcudf