Skip to content

Update groupby::hash to use new row operators for keys#10770

Merged
rapids-bot[bot] merged 33 commits intorapidsai:branch-22.06from
PointKernel:groupby-new-row-operator
May 25, 2022
Merged

Update groupby::hash to use new row operators for keys#10770
rapids-bot[bot] merged 33 commits intorapidsai:branch-22.06from
PointKernel:groupby-new-row-operator

Conversation

@PointKernel
Copy link
Copy Markdown
Member

@PointKernel PointKernel commented May 2, 2022

Related to #8039 and #10181

Contributes to #10186

This PR updates groupby::hash to use new row operators. It gets rid of the current "flattened nested column" logic and allows groupby::hash to handle LIST and STRUCT keys. The work also involves small cleanups like getting rid of unnecessary template parameters and removing unused arguments.

It becomes a breaking PR since the updated groupby::hash will treat inner nulls as equal when top-level nulls are excluded
while the current behavior treats inner nulls as unequal.

@PointKernel PointKernel added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels May 2, 2022
@PointKernel PointKernel self-assigned this May 2, 2022
@codecov
Copy link
Copy Markdown

codecov bot commented May 9, 2022

Codecov Report

Merging #10770 (efd497e) into branch-22.06 (e0d94f3) will increase coverage by 0.03%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10770      +/-   ##
================================================
+ Coverage         86.28%   86.32%   +0.03%     
================================================
  Files               144      144              
  Lines             22654    22668      +14     
================================================
+ Hits              19548    19569      +21     
+ Misses             3106     3099       -7     
Impacted Files Coverage Δ
python/cudf/cudf/utils/ioutils.py 79.47% <0.00%> (-0.13%) ⬇️
python/cudf/cudf/io/json.py 97.56% <0.00%> (ø)
python/dask_cudf/dask_cudf/tests/test_groupby.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/groupby.py 97.42% <0.00%> (+0.02%) ⬆️
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 88.78% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/dask_cudf/dask_cudf/core.py 73.62% <0.00%> (+0.26%) ⬆️
python/cudf/cudf/core/column/numerical.py 96.17% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
... and 1 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fe9aaeb...efd497e. Read the comment docs.

@github-actions github-actions bot added the CMake CMake build issue label May 10, 2022
@PointKernel PointKernel marked this pull request as ready for review May 11, 2022 01:08
@PointKernel PointKernel requested a review from a team as a code owner May 11, 2022 01:08
@PointKernel
Copy link
Copy Markdown
Member Author

@mythrocks requesting your review as well since you authored the flattened nested column work.

@PointKernel PointKernel added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels May 11, 2022
@PointKernel PointKernel requested a review from devavret May 24, 2022 19:21
@PointKernel PointKernel added breaking Breaking change and removed non-breaking Non-breaking change labels May 24, 2022
@PointKernel
Copy link
Copy Markdown
Member Author

Setting the current work as breaking since the behavior is changed when nulls are excluded. See #10770 (comment)

@PointKernel PointKernel changed the title Update groupby::hash to use new row operators Update groupby::hash to use new row operators for keys May 24, 2022
@PointKernel
Copy link
Copy Markdown
Member Author

rerun tests

@jrhemstad
Copy link
Copy Markdown
Contributor

It becomes a breaking PR due to two reasons:

  • The updated groupby::hash matches pandas' behavior, i.e. when nulls are excluded, the output will drop top-level nulls for struct/list but include struct/list containing null elements.
  • Nulls are always treated as equal.

Wait, what? Why do we need to make these changes? I don't see these reflected in the top-level groupby API either.

@PointKernel
Copy link
Copy Markdown
Member Author

PR description updated to provide clearer breaking information.

@PointKernel
Copy link
Copy Markdown
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit dbd2b08 into rapidsai:branch-22.06 May 25, 2022
rapids-bot bot pushed a commit that referenced this pull request Aug 3, 2022
Closes #10952 

After #10770 was merged there are no more uses of `unflatten_nested_columns`. This pr removes `unflatten_nested_columns` and adjusts the tests accordingly.

Authors:
  - Srikar Vanavasam (https://github.com/SrikarVanavasam)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #11421
@PointKernel PointKernel deleted the groupby-new-row-operator branch November 16, 2022 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team breaking Breaking change CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants