[FEA] Implement full support for nested types

After the introduction of new row operators for nested types (#10186), it's time to build on that success with a new issue to focus on the outstanding items. The row operators for equality `=` (#10289) and hashing `#` (#10641) included support for arbitrarily-nested `List` and `Struct` types. List inequality `<` (#11129) added support for arbitrarily-nested `List<List>` and `Struct<List>`. Please note that the Spark-RAPIDS plugin has a companion issue (https://github.com/NVIDIA/spark-rapids/issues/8550) to this story issue.

### Part 1: Transition algorithms to new row operators

|Algorithm|Status|Operation|File|Notes|
|---|---|---|---|---|
|scan_rank| ✅ ️ ️|`=`️|[rank_scan.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/reductions/scan/rank_scan.cu)|#10289|
|distinct| ✅ ️ |`#`|[distinct.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/stream_compaction/distinct.cu)|#10641|
| distinct_count | ✅ #12776 | `#`, `=` | |
| unique_count | ✅ #12776 | `#`, `=` | |
|hash-groupby| ✅ ️|`#`, `=`|[groupby.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/groupby/hash/groupby.cu)|#10770|
|sort|✅ ️|`<`|[sort_impl.cuh](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/sort/sort_impl.cuh)|#11129, using legacy `relational_compare` in `simple_comparator` as part of two-path solution in `sorted_order`|
|list contains|✅|`=`| [contains.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/lists/contains.cu)| #10548, using legacy `equality_compare` as part of two-path solution in `search_list_non_nested_types_fn`, see also #11330 for the two-path solution for equality|
|struct binops|✅|`=`, `<` | |#11153 | 
|murmur hash |✅|`#`| | plus spark_murmur_hash |
|merge| ✅ #14250 |`<`|[merge.cuh](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/include/cudf/detail/merge.cuh)| merge creates a `row_lexicographic_tagged_comparator` that uses legacy `element_relational_comparator` as part of internal operations. Also see discussion in #13514 and #8050 |
|sort-groupby nunique| ✅ #11792 |`=`| [group_nunique.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/groupby/sort/group_nunique.cu)| uses legacy `element_equality_comparator` in `is_unique_iterator_fn`|
|sort-groupby rank_scan| ✅ #11792 |`=`| [group_rank_scan.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/groupby/sort/group_rank_scan.cu)| uses legacy `row_equality_comparator` in `permuted_comparator` |
|sort-groupby | ✅ #11792 | `=` | [sort_helper.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/groupby/sort/sort_helper.cu) | uses legacy `row_equality_comparator` in `permuted_row_equality_comparator`, linked issue #8039|
|hash map in parquet | ✅ #12918 | `=` |  [chunk_dict.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/io/parquet/chunk_dict.cu)| uses legacy `equality_compare` in find/insert. #8476 convert to cuco and #10635 optimize |
|unused includes| ✅ | | [round_robin.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/partitioning/round_robin.cu) [search_ordered.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/search/search_ordered.cu) [group_single_pass...](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/groupby/sort/group_single_pass_reduction_util.cuh) | #11857 |
|partition | ✅ #12761 | `#` | [partitioning.cu](https://github.com/rapidsai/cudf/blob/branch-22.12/cpp/src/partitioning/partitioning.cu)| uses legacy `row_hasher` in `hash_partition_table`|
|struct min/max | ✅ #13069 | `<` | [struct_minmax_util.cuh](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/reductions/struct_minmax_util.cuh) | uses legacy `row_lexicographic_comparator` in `row_arg_minmax_fn` #8974, see also #8964. see draft work in #10811 and #13069 | 
|is_sorted | ✅ #12752 | `<` | [is_sorted.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/sort/is_sorted.cu) | uses legacy `row_lexicographic_comparator` in `is_sorted`|
|rank | ✅ #12481 | `=` | [rank.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/sort/rank.cu) | uses legacy `row_equality_comparator` in `unique_comparator`|
|unique |✅| `=` | [unique.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/stream_compaction/unique.cu) | uses legacy `row_equality_comparator` in `unique`|
|stream compaction | ✅ #12776 | `#` | [stream_co...](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/stream_compaction/stream_compaction_common.hpp) | uses legacy `row_hasher` renamed as `row_hash`|
|one-hot encode |✅| `=` | [one_hot_encode.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/transform/one_hot_encode.cu) | uses legacy `element_equality_comparator` in `one_hot_encode_functor`|
|join | ✅ #12787 | `#`, `=` | [join_common_utils.hpp](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/src/join/join_common_utils.hpp) | uses legacy `row_hasher` and `row_equality_comparator`|
|mixed join |  ✅ #13028 | `#`, `=`  |  | uses struct flattening |
|test util | ✅ #12777 | `=` | [column_utilities.cu](https://github.com/rapidsai/cudf/blob/branch-22.10/cpp/tests/utilities/column_utilities.cu) | uses legacy `row_equality_comparator` in `corresponding_rows_unequal`|
| contains_table | ✅ #13119 | `#`, `=` | | See #13032 and previous work in #11330 |
| list contains | ✅ #13810  | `=` |  [search_list](https://github.com/rapidsai/cudf/blame/aed7174eae6c6eb38fbf186938df44f88787cf29/cpp/src/lists/contains.cu#L141) | uses legacy `cudf::equality_compare` also see #13672 |
| list min/max | ✅ #13676 | `<` | `list_minmax_util.cuh` |  See #13667, We need to introduce a utility similar to `struct_minmax_util.cuh` to handle top-level lists in aggregation values |

### Part 2: Improving performance and adding functionality 

|Name|Status|Notes|
|---|---|---|
|Make experimental comparators default |  | see #12593, depends on Part 1|
|Improve two-path `<`| ⏸️ (paused) | See #11667.   #11129 added a two-path implementation for `sorted_order` based on templating experimental `<`, with one instance for simple types and one for nested. |
|Improve two-path `=`| ✅ #12676  |See #12676 #11016, #11667, #11330 added a two-path implementation for `lists::contains`, with one path for simple types using legacy `=` and one path for nested using experimental `=`. |
| Enable `<` operator on `Struct<List>` types | ✅ #13005 | see #11672, we expected that these types would work as part of #11129, but we encountered  failures and disabled it | 
| Support `<` operator for `List<Struct>` types  | ✅ #12953  | Proposal and initial design in #11222, impacts #10408, to be solved by #12953 |
| Support self-comparison in two-table `<` | | #10730 added strong right- and left-table index types for the two table comparator. For refactoring "merge" (see Part 1) we need to allow right-right and left-left indices in the two table comparator | 
| Support self-comparison in two-table `=` | | see #13371 |
| Add strong index types to two-table `#` | | See #13116 for more information, may require upstream changes in cuco |



### Part 3: Expand support for nested types in cuDF-python 
**Update: Part 3 will be made into a separate story issue once this issue is closed**
|Name|Issue|Notes|
|---|---|---|
| `drop_duplicates()` on nested types| #6784 | doesn't drop duplicates correctly |
| `groupby.apply` on nested types | #8039 | throws exception from index with List type |
| binop comparison | TBD  | exception handling in #11609 |



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Implement full support for nested types #11844

Part 1: Transition algorithms to new row operators

Part 2: Improving performance and adding functionality

Part 3: Expand support for nested types in cuDF-python

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Algorithm	Status	Operation	File	Notes
scan_rank	✅ ️ ️	`=`️	rank_scan.cu	#10289
distinct	✅ ️	`#`	distinct.cu	#10641
distinct_count	✅ #12776	`#`, `=`
unique_count	✅ #12776	`#`, `=`
hash-groupby	✅ ️	`#`, `=`	groupby.cu	#10770
sort	✅ ️	`<`	sort_impl.cuh	#11129, using legacy `relational_compare` in `simple_comparator` as part of two-path solution in `sorted_order`
list contains	✅	`=`	contains.cu	#10548, using legacy `equality_compare` as part of two-path solution in `search_list_non_nested_types_fn`, see also #11330 for the two-path solution for equality
struct binops	✅	`=`, `<`		#11153
murmur hash	✅	`#`		plus spark_murmur_hash
merge	✅ #14250	`<`	merge.cuh	merge creates a `row_lexicographic_tagged_comparator` that uses legacy `element_relational_comparator` as part of internal operations. Also see discussion in #13514 and #8050
sort-groupby nunique	✅ #11792	`=`	group_nunique.cu	uses legacy `element_equality_comparator` in `is_unique_iterator_fn`
sort-groupby rank_scan	✅ #11792	`=`	group_rank_scan.cu	uses legacy `row_equality_comparator` in `permuted_comparator`
sort-groupby	✅ #11792	`=`	sort_helper.cu	uses legacy `row_equality_comparator` in `permuted_row_equality_comparator`, linked issue #8039
hash map in parquet	✅ #12918	`=`	chunk_dict.cu	uses legacy `equality_compare` in find/insert. #8476 convert to cuco and #10635 optimize
unused includes	✅		round_robin.cu search_ordered.cu group_single_pass...	#11857
partition	✅ #12761	`#`	partitioning.cu	uses legacy `row_hasher` in `hash_partition_table`
struct min/max	✅ #13069	`<`	struct_minmax_util.cuh	uses legacy `row_lexicographic_comparator` in `row_arg_minmax_fn` #8974, see also #8964. see draft work in #10811 and #13069
is_sorted	✅ #12752	`<`	is_sorted.cu	uses legacy `row_lexicographic_comparator` in `is_sorted`
rank	✅ #12481	`=`	rank.cu	uses legacy `row_equality_comparator` in `unique_comparator`
unique	✅	`=`	unique.cu	uses legacy `row_equality_comparator` in `unique`
stream compaction	✅ #12776	`#`	stream_co...	uses legacy `row_hasher` renamed as `row_hash`
one-hot encode	✅	`=`	one_hot_encode.cu	uses legacy `element_equality_comparator` in `one_hot_encode_functor`
join	✅ #12787	`#`, `=`	join_common_utils.hpp	uses legacy `row_hasher` and `row_equality_comparator`
mixed join	✅ #13028	`#`, `=`		uses struct flattening
test util	✅ #12777	`=`	column_utilities.cu	uses legacy `row_equality_comparator` in `corresponding_rows_unequal`
contains_table	✅ #13119	`#`, `=`		See #13032 and previous work in #11330
list contains	✅ #13810	`=`	search_list	uses legacy `cudf::equality_compare` also see #13672
list min/max	✅ #13676	`<`	`list_minmax_util.cuh`	See #13667, We need to introduce a utility similar to `struct_minmax_util.cuh` to handle top-level lists in aggregation values

Name	Status	Notes
Make experimental comparators default		see #12593, depends on Part 1
Improve two-path `<`	⏸️ (paused)	See #11667. #11129 added a two-path implementation for `sorted_order` based on templating experimental `<`, with one instance for simple types and one for nested.
Improve two-path `=`	✅ #12676	See #12676 #11016, #11667, #11330 added a two-path implementation for `lists::contains`, with one path for simple types using legacy `=` and one path for nested using experimental `=`.
Enable `<` operator on `Struct<List>` types	✅ #13005	see #11672, we expected that these types would work as part of #11129, but we encountered failures and disabled it
Support `<` operator for `List<Struct>` types	✅ #12953	Proposal and initial design in #11222, impacts #10408, to be solved by #12953
Support self-comparison in two-table `<`		#10730 added strong right- and left-table index types for the two table comparator. For refactoring "merge" (see Part 1) we need to allow right-right and left-left indices in the two table comparator
Support self-comparison in two-table `=`		see #13371
Add strong index types to two-table `#`		See #13116 for more information, may require upstream changes in cuco

Name	Issue	Notes
`drop_duplicates()` on nested types	#6784	doesn't drop duplicates correctly
`groupby.apply` on nested types	#8039	throws exception from index with List type
binop comparison	TBD	exception handling in #11609

[FEA] Implement full support for nested types #11844

Description

Part 1: Transition algorithms to new row operators

Part 2: Improving performance and adding functionality

Part 3: Expand support for nested types in cuDF-python

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions