Skip to content

Improve contains_column by invoking contains_table#14238

Merged
rapids-bot[bot] merged 3 commits intorapidsai:branch-23.12from
PointKernel:cuco-contains-column
Oct 4, 2023
Merged

Improve contains_column by invoking contains_table#14238
rapids-bot[bot] merged 3 commits intorapidsai:branch-23.12from
PointKernel:cuco-contains-column

Conversation

@PointKernel
Copy link
Copy Markdown
Member

@PointKernel PointKernel commented Sep 29, 2023

Description

Part of ##12261

This PR simplifies the contains_column implementation by invoking contains_table and gets rid of the use of the cudf unordered_multiset. It also removes the unordered_multiset header file from libcudf.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 29, 2023
@PointKernel PointKernel self-assigned this Sep 29, 2023
@PointKernel PointKernel added tech debt improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 29, 2023
@PointKernel
Copy link
Copy Markdown
Member Author

PointKernel commented Sep 29, 2023

Performance comparison between before and after this PR shows that the new implementation can bring up to 2.2x speedups over the old for large data and can be 15% slower than the current code in the worst scenario:

Benchmark                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
--------------------------------------------------------------------------------------------------------------------------------------------------
Search/ColumnContains_AllValid/1024/manual_time                    -0.0201         -0.0147             0             0             0             0
Search/ColumnContains_AllValid/4096/manual_time                    +0.1479         +0.1027             0             0             0             0
Search/ColumnContains_AllValid/32768/manual_time                   +0.3774         +0.2879             0             0             0             0
Search/ColumnContains_AllValid/262144/manual_time                  +0.1825         +0.1631             0             0             0             0
Search/ColumnContains_AllValid/2097152/manual_time                 -0.4995         -0.4977             4             2             4             2
Search/ColumnContains_AllValid/16777216/manual_time                -0.4629         -0.4627            42            23            42            23
Search/ColumnContains_AllValid/67108864/manual_time                -0.5701         -0.5700            82            35            82            35
Search/ColumnContains_Nulls/1024/manual_time                       +0.0138         +0.0081             0             0             0             0
Search/ColumnContains_Nulls/4096/manual_time                       +0.0807         +0.0570             0             0             0             0
Search/ColumnContains_Nulls/32768/manual_time                      +0.4188         +0.3202             0             0             0             0
Search/ColumnContains_Nulls/262144/manual_time                     +0.4772         +0.4089             0             0             0             0
Search/ColumnContains_Nulls/2097152/manual_time                    -0.4655         -0.4635             4             2             4             2
Search/ColumnContains_Nulls/16777216/manual_time                   -0.4141         -0.4139            38            22            38            22
Search/ColumnContains_Nulls/67108864/manual_time                   -0.5455         -0.5453            77            35            77            35
OVERALL_GEOMEAN                                                    -0.1737         -0.1887             0             0             0             0

@PointKernel PointKernel added the Performance Performance related issue label Sep 29, 2023
@PointKernel PointKernel marked this pull request as ready for review September 30, 2023 00:21
@PointKernel PointKernel requested a review from a team as a code owner September 30, 2023 00:21
Copy link
Copy Markdown
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've wanted this change for a long time! Thanks @PointKernel. The benchmarks look good for large sizes, despite the minor slowdown for medium sizes.

@PointKernel
Copy link
Copy Markdown
Member Author

/merge

@rapids-bot rapids-bot bot merged commit b120f7e into rapidsai:branch-23.12 Oct 4, 2023
@PointKernel PointKernel deleted the cuco-contains-column branch October 4, 2023 21:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants