Skip to content

[FEA] Implement full support for nested types #11844

@GregoryKimball

Description

@GregoryKimball

After the introduction of new row operators for nested types (#10186), it's time to build on that success with a new issue to focus on the outstanding items. The row operators for equality = (#10289) and hashing # (#10641) included support for arbitrarily-nested List and Struct types. List inequality < (#11129) added support for arbitrarily-nested List<List> and Struct<List>. Please note that the Spark-RAPIDS plugin has a companion issue (NVIDIA/spark-rapids#8550) to this story issue.

Part 1: Transition algorithms to new row operators

Algorithm Status Operation File Notes
scan_rank ✅ ️ ️ = rank_scan.cu #10289
distinct ✅ ️ # distinct.cu #10641
distinct_count #12776 #, =
unique_count #12776 #, =
hash-groupby ✅ ️ #, = groupby.cu #10770
sort ✅ ️ < sort_impl.cuh #11129, using legacy relational_compare in simple_comparator as part of two-path solution in sorted_order
list contains = contains.cu #10548, using legacy equality_compare as part of two-path solution in search_list_non_nested_types_fn, see also #11330 for the two-path solution for equality
struct binops =, < #11153
murmur hash # plus spark_murmur_hash
merge #14250 < merge.cuh merge creates a row_lexicographic_tagged_comparator that uses legacy element_relational_comparator as part of internal operations. Also see discussion in #13514 and #8050
sort-groupby nunique #11792 = group_nunique.cu uses legacy element_equality_comparator in is_unique_iterator_fn
sort-groupby rank_scan #11792 = group_rank_scan.cu uses legacy row_equality_comparator in permuted_comparator
sort-groupby #11792 = sort_helper.cu uses legacy row_equality_comparator in permuted_row_equality_comparator, linked issue #8039
hash map in parquet #12918 = chunk_dict.cu uses legacy equality_compare in find/insert. #8476 convert to cuco and #10635 optimize
unused includes round_robin.cu search_ordered.cu group_single_pass... #11857
partition #12761 # partitioning.cu uses legacy row_hasher in hash_partition_table
struct min/max #13069 < struct_minmax_util.cuh uses legacy row_lexicographic_comparator in row_arg_minmax_fn #8974, see also #8964. see draft work in #10811 and #13069
is_sorted #12752 < is_sorted.cu uses legacy row_lexicographic_comparator in is_sorted
rank #12481 = rank.cu uses legacy row_equality_comparator in unique_comparator
unique = unique.cu uses legacy row_equality_comparator in unique
stream compaction #12776 # stream_co... uses legacy row_hasher renamed as row_hash
one-hot encode = one_hot_encode.cu uses legacy element_equality_comparator in one_hot_encode_functor
join #12787 #, = join_common_utils.hpp uses legacy row_hasher and row_equality_comparator
mixed join #13028 #, = uses struct flattening
test util #12777 = column_utilities.cu uses legacy row_equality_comparator in corresponding_rows_unequal
contains_table #13119 #, = See #13032 and previous work in #11330
list contains #13810 = search_list uses legacy cudf::equality_compare also see #13672
list min/max #13676 < list_minmax_util.cuh See #13667, We need to introduce a utility similar to struct_minmax_util.cuh to handle top-level lists in aggregation values

Part 2: Improving performance and adding functionality

Name Status Notes
Make experimental comparators default see #12593, depends on Part 1
Improve two-path < ⏸️ (paused) See #11667. #11129 added a two-path implementation for sorted_order based on templating experimental <, with one instance for simple types and one for nested.
Improve two-path = #12676 See #12676 #11016, #11667, #11330 added a two-path implementation for lists::contains, with one path for simple types using legacy = and one path for nested using experimental =.
Enable < operator on Struct<List> types #13005 see #11672, we expected that these types would work as part of #11129, but we encountered failures and disabled it
Support < operator for List<Struct> types #12953 Proposal and initial design in #11222, impacts #10408, to be solved by #12953
Support self-comparison in two-table < #10730 added strong right- and left-table index types for the two table comparator. For refactoring "merge" (see Part 1) we need to allow right-right and left-left indices in the two table comparator
Support self-comparison in two-table = see #13371
Add strong index types to two-table # See #13116 for more information, may require upstream changes in cuco

Part 3: Expand support for nested types in cuDF-python

Update: Part 3 will be made into a separate story issue once this issue is closed

Name Issue Notes
drop_duplicates() on nested types #6784 doesn't drop duplicates correctly
groupby.apply on nested types #8039 throws exception from index with List type
binop comparison TBD exception handling in #11609

Metadata

Metadata

Assignees

Labels

PerformancePerformance related issueSparkFunctionality that helps Spark RAPIDSfeature requestNew feature or requestlibcudfAffects libcudf (C++/CUDA) code.

Type

No type

Projects

Status

Story Issue

Relationships

None yet

Development

No branches or pull requests

Issue actions