-
Notifications
You must be signed in to change notification settings - Fork 1k
[FEA] Refactor hash-based algorithms with new cuco data structures #12261
Description
Is your feature request related to a problem? Please describe.
cuco::static_map and cuco::static_multimap are used to perform hash-based operations in libcudf. Depending on NVIDIA/cuCollections#110, a lot of existing use cases could be replaced with cuco::static_set or cuco:static_multiset since the payload part in the hash map is not used.
Moreover, some libcudf implementations are still using concurrent_unordered_map or unordered_multiset and they should be refactored with cuco::static_set/multiset as well.
Describe the solution you'd like
Replace existing implementations with new data structures like cuco::static_set, cuco::static_multiset or cuco::static_map.
There should also be benchmarks showing the performance changes after replacement for most works listed below.
Note: under "Related APIs", the ref:: prefix refers to the cuco device-side API.
| Algorithm | Desired Data Structure | Related APIs | PRs | Notes |
|---|---|---|---|---|
| distinct_count | cuco::static_set |
insert, insert_if |
✅#13343 | |
| orc dictionary encoding (issue #10495) | cuco::static_map (legacy) |
ref::insert, ref::find |
✅#13580 | Similar to parquet dictionary encoding but the current implementation is still using a custom dictionary |
| byte_pair_encoding | cuco::static_map |
insert_async, ref::find |
✅#13807 | uses heterogeneous lookup |
| json_tree | cuco::static_set |
insert_if_async, ref::find |
✅#13928 | |
| search/contains_table | cuco::static_set |
insert_async, insert_if_async, contains_async, contains_if_async |
✅#14064 | Needs migration from cudf::detail::unordered_multiset |
| search/contains_column | cuco::static_set |
insert_async, insert_if_async, contains_async |
✅#14238 | Needs migration from cudf::detail::unordered_multiset. unordered_multiset can be removed once this is done. |
| hash-based groupby (issue #10401) | cuco::static_set |
ref::insert_and_find, retrieve_all |
✅#14813 | Needs migration from concurrent_unordered_map |
| legacy json reader | cuco::static_map (legacy) |
insert_async, ref::find |
✅#15813 | concurrent_unordered_map. Looks like a rational use of hash map. To be replaced by cuco::static_map |
| distinct | cuco::static_set |
insert_async retrieve_all, ref::insert_and_find |
✅#15984 | We have work in flight to move this to cuco::static_map, but it is blocked on performance concerns |
| parquet dictionary encoding | cuco::static_map (experimental) |
ref::insert, ref::find |
✅#16541 | Needs migration from cuco::static_map (legacy) |
| mixed_semi_joins | cuco::static_set |
insert, insert_if, ref::contains |
✅#16230 | Needs migration from cuco::static_map (legacy) |
| semi/anti joins | ✅ | Refactoring not needed anymore, as they use contains |
||
| histogram | cuco::static_set |
insert_async retrieve_all, ref::insert_and_find |
✅#16485 | |
| orc dictionary encoding | cuco::static_map (experimental) |
ref::insert, ref::find |
✅#17049 | Needs migration from cuco::static_map (legacy) |
| joins | cuco::static_multiset |
count, count_outer, retrieve, retrieve_outer |
✅#18021 | Needs migration from cuco::static_multimap (legacy) Related issues: #11176, #13116, |
| mixed joins | cuco::static_multiset |
✅#19660 | Needs migration from cuco::static_multimap (legacy) |
Metadata
Metadata
Assignees
Labels
Type
Projects
Status