-
Notifications
You must be signed in to change notification settings - Fork 1k
[FEA] Implement full support for nested types #11844
Copy link
Copy link
Closed
Labels
PerformancePerformance related issuePerformance related issueSparkFunctionality that helps Spark RAPIDSFunctionality that helps Spark RAPIDSfeature requestNew feature or requestNew feature or requestlibcudfAffects libcudf (C++/CUDA) code.Affects libcudf (C++/CUDA) code.
Description
After the introduction of new row operators for nested types (#10186), it's time to build on that success with a new issue to focus on the outstanding items. The row operators for equality = (#10289) and hashing # (#10641) included support for arbitrarily-nested List and Struct types. List inequality < (#11129) added support for arbitrarily-nested List<List> and Struct<List>. Please note that the Spark-RAPIDS plugin has a companion issue (NVIDIA/spark-rapids#8550) to this story issue.
Part 1: Transition algorithms to new row operators
| Algorithm | Status | Operation | File | Notes |
|---|---|---|---|---|
| scan_rank | ✅ ️ ️ | =️ |
rank_scan.cu | #10289 |
| distinct | ✅ ️ | # |
distinct.cu | #10641 |
| distinct_count | ✅ #12776 | #, = |
||
| unique_count | ✅ #12776 | #, = |
||
| hash-groupby | ✅ ️ | #, = |
groupby.cu | #10770 |
| sort | ✅ ️ | < |
sort_impl.cuh | #11129, using legacy relational_compare in simple_comparator as part of two-path solution in sorted_order |
| list contains | ✅ | = |
contains.cu | #10548, using legacy equality_compare as part of two-path solution in search_list_non_nested_types_fn, see also #11330 for the two-path solution for equality |
| struct binops | ✅ | =, < |
#11153 | |
| murmur hash | ✅ | # |
plus spark_murmur_hash | |
| merge | ✅ #14250 | < |
merge.cuh | merge creates a row_lexicographic_tagged_comparator that uses legacy element_relational_comparator as part of internal operations. Also see discussion in #13514 and #8050 |
| sort-groupby nunique | ✅ #11792 | = |
group_nunique.cu | uses legacy element_equality_comparator in is_unique_iterator_fn |
| sort-groupby rank_scan | ✅ #11792 | = |
group_rank_scan.cu | uses legacy row_equality_comparator in permuted_comparator |
| sort-groupby | ✅ #11792 | = |
sort_helper.cu | uses legacy row_equality_comparator in permuted_row_equality_comparator, linked issue #8039 |
| hash map in parquet | ✅ #12918 | = |
chunk_dict.cu | uses legacy equality_compare in find/insert. #8476 convert to cuco and #10635 optimize |
| unused includes | ✅ | round_robin.cu search_ordered.cu group_single_pass... | #11857 | |
| partition | ✅ #12761 | # |
partitioning.cu | uses legacy row_hasher in hash_partition_table |
| struct min/max | ✅ #13069 | < |
struct_minmax_util.cuh | uses legacy row_lexicographic_comparator in row_arg_minmax_fn #8974, see also #8964. see draft work in #10811 and #13069 |
| is_sorted | ✅ #12752 | < |
is_sorted.cu | uses legacy row_lexicographic_comparator in is_sorted |
| rank | ✅ #12481 | = |
rank.cu | uses legacy row_equality_comparator in unique_comparator |
| unique | ✅ | = |
unique.cu | uses legacy row_equality_comparator in unique |
| stream compaction | ✅ #12776 | # |
stream_co... | uses legacy row_hasher renamed as row_hash |
| one-hot encode | ✅ | = |
one_hot_encode.cu | uses legacy element_equality_comparator in one_hot_encode_functor |
| join | ✅ #12787 | #, = |
join_common_utils.hpp | uses legacy row_hasher and row_equality_comparator |
| mixed join | ✅ #13028 | #, = |
uses struct flattening | |
| test util | ✅ #12777 | = |
column_utilities.cu | uses legacy row_equality_comparator in corresponding_rows_unequal |
| contains_table | ✅ #13119 | #, = |
See #13032 and previous work in #11330 | |
| list contains | ✅ #13810 | = |
search_list | uses legacy cudf::equality_compare also see #13672 |
| list min/max | ✅ #13676 | < |
list_minmax_util.cuh |
See #13667, We need to introduce a utility similar to struct_minmax_util.cuh to handle top-level lists in aggregation values |
Part 2: Improving performance and adding functionality
| Name | Status | Notes |
|---|---|---|
| Make experimental comparators default | see #12593, depends on Part 1 | |
Improve two-path < |
⏸️ (paused) | See #11667. #11129 added a two-path implementation for sorted_order based on templating experimental <, with one instance for simple types and one for nested. |
Improve two-path = |
✅ #12676 | See #12676 #11016, #11667, #11330 added a two-path implementation for lists::contains, with one path for simple types using legacy = and one path for nested using experimental =. |
Enable < operator on Struct<List> types |
✅ #13005 | see #11672, we expected that these types would work as part of #11129, but we encountered failures and disabled it |
Support < operator for List<Struct> types |
✅ #12953 | Proposal and initial design in #11222, impacts #10408, to be solved by #12953 |
Support self-comparison in two-table < |
#10730 added strong right- and left-table index types for the two table comparator. For refactoring "merge" (see Part 1) we need to allow right-right and left-left indices in the two table comparator | |
Support self-comparison in two-table = |
see #13371 | |
Add strong index types to two-table # |
See #13116 for more information, may require upstream changes in cuco |
Part 3: Expand support for nested types in cuDF-python
Update: Part 3 will be made into a separate story issue once this issue is closed
| Name | Issue | Notes |
|---|---|---|
drop_duplicates() on nested types |
#6784 | doesn't drop duplicates correctly |
groupby.apply on nested types |
#8039 | throws exception from index with List type |
| binop comparison | TBD | exception handling in #11609 |
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
PerformancePerformance related issuePerformance related issueSparkFunctionality that helps Spark RAPIDSFunctionality that helps Spark RAPIDSfeature requestNew feature or requestNew feature or requestlibcudfAffects libcudf (C++/CUDA) code.Affects libcudf (C++/CUDA) code.
Type
Projects
Status
Story Issue