Refactor distinct using static_map insert_or_apply #16484

srinivasyadav18 wants to merge 10 commits into rapidsai:branch-24.10 from

Conversation
auto keys = rmm::device_uvector<size_type>(map_size, stream, mr);
auto values = rmm::device_uvector<size_type>(map_size, stream, mr);
map.retrieve_all(keys.begin(), values.begin(), stream);
Can we use a discard iterator for the keys? I don't think we need to materialize them since they're not returned.
auto plusop = plus_op{};
map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
Suggested change:
- auto plusop = plus_op{};
- map.insert_or_apply(pairs, pairs + num_rows, plusop, stream.value());
+ map.insert_or_apply(pairs, pairs + num_rows, plus_op{}, stream.value());
[values = values.begin(),
 keys = keys.begin(),
Feels more natural to order these as keys, then values.
Suggested change:
- [values = values.begin(),
-  keys = keys.begin(),
+ [keys = keys.begin(),
+  values = values.begin(),
  }
});
auto const map_end = thrust::copy_if(
The for_each above is unnecessary afaict. I think we can use a single copy_if with a counting transform iterator that checks the value for a given index, and copies the key for that index if so. That avoids materializing the intermediate output_indices.
This header is now empty. Can we delete it and remove it from #includes?
Likewise, this file is now empty and should be removed and deleted from CMakeLists.txt.
@bdice @PointKernel, I pushed all the cleanups and minor optimizations. The PR is ready for review. Also, I don't think we need extra optimization for keep_none. As @bdice mentioned, a transform counting iterator with
Reverting this commit, as the compilation time of distinct.cu is almost 6 minutes. This reverts commit 5b66240.
@srinivasyadav18 can you please share the performance comparisons before and after this PR?
Based on this gist, here are the most important performance signals: I'm comparing the 1B row case, removing

Overall, I believe this performance signature is a wash. Perhaps the shifting throughputs are due to cache-locality differences. It's worth profiling the
Thanks @GregoryKimball for providing the plot and the important performance signals. I have done some profiling on 1 billion keys with 1 billion cardinality, and this is what I found. Overall, the distinct algorithm has two major steps.
Time taken for each step with base branch-24.10:
Below is the profiling of base branch-24.10 (the selected region in green shows the gather runtime). Time taken for each step with insert_or_apply:
Below is the profiling with insert_or_apply (the selected region in green shows the gather runtime). In summary, the gather algorithm is the main bottleneck. I tested the insert_or_apply implementation with sorting the array before returning from the distinct algorithm, and performance improved significantly, as gather now takes much less time. Below is the profiling with insert_or_apply and sorting the distinct indices (the selected region in green shows the gather runtime; red shows the sorting overhead).
In summary, the branch-24.10 implementation is already efficient, as it performs the reduction and also maintains the ordering of the reduction values (since they are row indices).
Sorting helps in large-input cases, but causes a regression in small-input cases (see the numbers here in the gist).
Thank you @srinivasyadav18 for studying the 1B cardinality case. The gist you posted shows the benefit. It would be worth profiling and checking the 100K cardinality case, but I don't think the moderate-cardinality performance is a blocker.
Instead of sorting, how about using a scatter-gather approach?
I'm starting to review this PR. Sorry for jumping in late.
auto const map_end = thrust::copy_if(
  rmm::exec_policy_nosync(stream),
  output_iter,
  output_iter + num_distinct_keys,
  output_indices.begin(),
  cuda::proclaim_return_type<bool>([] __device__(auto const idx) { return idx != -1; }));
Please add back all the explanation comments (and new comments for new code). Although I wrote the original code, now I hardly understand what it is doing without the comments.
struct plus_op {
  template <cuda::thread_scope Scope>
  __device__ void operator()(cuda::atomic_ref<size_type, Scope> ref, size_type const val)
Please do not put __device__ code in an hpp file.
#include <rmm/device_uvector.hpp>
#include <rmm/resource_ref.hpp>
#include <cuco/static_map.cuh>
Now that this file includes the cuco header, it should be renamed to _helper.cuh. Otherwise, please move these headers and their relevant code into _helper.cuh and keep this file host-only.
This PR adopts some work from @srinivasyadav18 with additional modifications. This is meant to complement #16484. Authors: - Bradley Dice (https://github.com/bdice) - Srinivas Yadav (https://github.com/srinivasyadav18) Approvers: - Yunsong Wang (https://github.com/PointKernel) - Srinivas Yadav (https://github.com/srinivasyadav18) URL: #16497
We hope that #15700 will improve the overall performance outlook of
Update: waiting for #15700 to merge to determine whether the current PR can improve performance after resolving the register issues.
We could re-benchmark this feature using a primitive row operator as in #18896






Description

This PR refactors `distinct` using `static_map::insert_or_apply` for the keep_first, keep_last and keep_none options.

Checklist