
[FEA] Support Multi-Output JIT Transforms#21704

Open
lamarrr wants to merge 38 commits into rapidsai:main from lamarrr:multi-output-transform-support

Conversation

@lamarrr
Contributor

@lamarrr lamarrr commented Mar 7, 2026

Description

This pull request adds support for multi-output JIT transforms (0-N outputs).
It also makes string output creation more efficient by supporting pre-allocated string offsets, which reduces memory usage and improves performance when the sizes of the output strings can be cheaply pre-computed or are known ahead of time.
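The pre-allocated-offsets idea can be sketched in isolation (this is plain C++, not the libcudf API; the function name is illustrative): when every row's output size is known up front, the offsets column is just an exclusive prefix sum of the sizes, and the character buffer can be allocated exactly once.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch only (not libcudf code): build a string column's
// offsets and character buffer in one pass when per-row sizes are known,
// avoiding intermediate string materialization and buffer growth.
std::pair<std::vector<std::size_t>, std::string> make_string_column(
    std::vector<std::string> const& rows)
{
  std::vector<std::size_t> offsets(rows.size() + 1, 0);
  for (std::size_t i = 0; i < rows.size(); ++i) {
    // exclusive prefix sum of the per-row sizes
    offsets[i + 1] = offsets[i] + rows[i].size();
  }
  std::string chars;
  chars.reserve(offsets.back());  // single allocation for all characters
  for (auto const& r : rows) { chars += r; }
  return {offsets, chars};
}
```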

Summary

  • Support output of multiple columns and allow batching of pre/post-transform computations such as null-counting and stencil creation.
  • Remove intermediate-nullmask boolean creation in null-aware transforms
  • Make the Transform kernel streaming-oriented (allowing for non-coherent memory load/stores via --restrict)
  • Ban memory-space annotations in the PTX UDF. The previous assumption of a specific memory space prevented us from passing stack-variable pointers to UDFs, since the UDF would attempt to load from an incorrect memory space. Numba does not produce memory-space-specific UDFs when compiled without array arguments.
  • Made column_device_view types bytewise-compatible to allow aliasing in the kernel (via wrappers)
  • Merged all transform kernels into a single kernel
  • Refactored and simplified the Transforms code
  • Added more assertions and support checks to the Transforms code
  • Added and updated tests for these changes
  • Validated performance impact of these changes
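
The multi-output calling convention can be illustrated with a host-side sketch (the real libcudf kernel is JIT-compiled CUDA; all names below are hypothetical): a single pass hands the UDF one pointer per output column for each row, so one launch can fill any number of outputs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of the multi-output transform shape.
// Each iteration gives the "UDF" body pointers into every output column,
// mirroring a kernel that writes N outputs per row in one pass.
void transform_rows(std::vector<int> const& in,
                    std::vector<int>& out_square,
                    std::vector<int>& out_negate)
{
  for (std::size_t i = 0; i < in.size(); ++i) {
    int* o0 = &out_square[i];  // output pointer for column 0
    int* o1 = &out_negate[i];  // output pointer for column 1
    *o0 = in[i] * in[i];
    *o1 = -in[i];
  }
}
```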

Closes #20155

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Mar 7, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 7, 2026
lamarrr added 3 commits March 8, 2026 16:14
… and improve column handling

- Updated `transform_input` to remove `mutable_column_view` and streamline input handling.
- Introduced `transform_output` struct to specify output types and nullability policies.
- Modified `transform_extended2` to return a `table` instead of a vector of columns, enhancing output management.
- Refactored `column_accessor` to utilize `detail::column_device_view_base` for better abstraction.
- Removed deprecated reflection functions for input accessors and adjusted related helper functions.
- Enhanced `transform_udf` to work with the new column view structure, ensuring compatibility with the updated input/output specifications.
- Updated kernel launch and argument handling to accommodate changes in input and output types.
- Improved nullability checks and handling in the transform execution flow.
- Added support for pre-allocated string offsets in output columns to optimize memory usage.
@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

@lamarrr lamarrr added feature request New feature or request breaking Breaking change labels Mar 12, 2026
@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

lamarrr added 3 commits March 13, 2026 05:02
- Updated column_accessor to use __restrict__ qualifiers for improved performance.
- Renamed vector_column_device_view to mut_vector_device_view for clarity.
- Removed the transform_udf.cuh file and integrated its functionality directly into the transform kernel.
- Modified the transform_kernel to handle stencil nullability and user data more efficiently.
- Enhanced the launch function to accommodate stencil null checks.
- Updated tests to include multi-output and offset string operations.
- Cleaned up code and comments for better readability and maintainability.
@lamarrr lamarrr marked this pull request as ready for review March 13, 2026 05:07
@lamarrr lamarrr requested a review from a team as a code owner March 13, 2026 05:07
@lamarrr lamarrr requested a review from a team as a code owner April 1, 2026 20:08
@github-actions github-actions bot added the Java Affects Java cuDF API. label Apr 1, 2026
Member

@mhaseeb123 mhaseeb123 left a comment


Some non-blocking comments

{
// inputs to JITIFY kernels have to be either sized-integral types or pointers. Structs or
// references can't be passed directly/correctly as they will be crossing an ABI boundary
// TODO: ensure block size is a multiple of warp size for correct warp-synchronous behavior
Member


Non-blocking, but should we just handle this here by adding a __launch_bounds__(block_size) and a static_assert(block_size % cudf::detail::warp_size == 0, "Transform kernel block size must be a multiple of the warp size")?

Contributor Author


The block size can only be retrieved at runtime, so a static_assert would not work.
The correct approach would be to handle it in the transform code, but JITIFY doesn't expose the block and grid dimensions it obtains from the CUDA occupancy configuration.
In my tests, the block size returned by cudaOccupancyMaxPotentialBlockSize is always a multiple of 32.
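
Since the block size is only known at runtime, the closest equivalent to the suggested static_assert would be a launch-time check (an illustrative sketch, not the actual libcudf code; the function name is hypothetical):

```cpp
#include <stdexcept>

// Runtime stand-in for the proposed static_assert: the block size comes from
// cudaOccupancyMaxPotentialBlockSize at runtime, so it cannot be checked at
// compile time. Validate it just before the kernel launch instead.
constexpr int warp_size = 32;

void validate_block_size(int block_size)
{
  if (block_size <= 0 || block_size % warp_size != 0) {
    throw std::invalid_argument(
      "Transform kernel block size must be a positive multiple of the warp size");
  }
}
```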

@lamarrr
Contributor Author

lamarrr commented Apr 2, 2026

RTX A6000

| Benchmark | Config | GPU Time Before | GPU Time After | Delta % | Result |
|---|---|---|---|---|---|
| transform_polynomials_float32 | num_rows=100000, order=1, null_probability=0.01 | 41.470 us | 16.807 us | -59.47% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=1, null_probability=0.01 | 53.630 us | 30.166 us | -43.75% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=1, null_probability=0.01 | 167.302 us | 165.701 us | -0.96% | ~SAME |
| transform_polynomials_float32 | num_rows=100000000, order=1, null_probability=0.01 | 1.346 ms | 1.523 ms | +13.15% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=2, null_probability=0.01 | 42.838 us | 17.668 us | -58.76% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=2, null_probability=0.01 | 53.125 us | 30.211 us | -43.13% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=2, null_probability=0.01 | 167.168 us | 166.130 us | -0.62% | ~SAME |
| transform_polynomials_float32 | num_rows=100000000, order=2, null_probability=0.01 | 1.345 ms | 1.531 ms | +13.83% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=4, null_probability=0.01 | 44.109 us | 19.323 us | -56.19% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=4, null_probability=0.01 | 54.438 us | 31.296 us | -42.51% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=4, null_probability=0.01 | 173.097 us | 167.155 us | -3.43% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=4, null_probability=0.01 | 1.406 ms | 1.543 ms | +9.74% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=8, null_probability=0.01 | 44.480 us | 22.623 us | -49.14% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=8, null_probability=0.01 | 58.464 us | 37.166 us | -36.43% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=8, null_probability=0.01 | 213.449 us | 205.792 us | -3.59% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=8, null_probability=0.01 | 1.802 ms | 1.932 ms | +7.21% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=16, null_probability=0.01 | 48.877 us | 29.999 us | -38.62% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=16, null_probability=0.01 | 74.472 us | 51.678 us | -30.61% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=16, null_probability=0.01 | 313.152 us | 301.362 us | -3.76% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=16, null_probability=0.01 | 2.846 ms | 2.950 ms | +3.65% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=32, null_probability=0.01 | 57.339 us | 42.460 us | -25.95% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=32, null_probability=0.01 | 97.980 us | 84.466 us | -13.79% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=32, null_probability=0.01 | 524.325 us | 505.890 us | -3.52% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=32, null_probability=0.01 | 4.974 ms | 5.067 ms | +1.87% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=1, null_probability=0.01 | 43.076 us | 18.201 us | -57.75% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=1, null_probability=0.01 | 64.691 us | 42.334 us | -34.56% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=1, null_probability=0.01 | 284.249 us | 285.990 us | +0.61% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=1, null_probability=0.01 | 2.595 ms | 2.765 ms | +6.55% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=2, null_probability=0.01 | 43.429 us | 19.149 us | -55.91% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=2, null_probability=0.01 | 66.332 us | 42.437 us | -36.02% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=2, null_probability=0.01 | 286.099 us | 285.210 us | -0.31% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=2, null_probability=0.01 | 2.590 ms | 2.785 ms | +7.53% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=4, null_probability=0.01 | 44.368 us | 20.740 us | -53.25% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=4, null_probability=0.01 | 66.123 us | 42.389 us | -35.89% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=4, null_probability=0.01 | 285.817 us | 287.712 us | +0.66% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=4, null_probability=0.01 | 2.592 ms | 2.809 ms | +8.37% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=8, null_probability=0.01 | 47.611 us | 24.925 us | -47.65% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=8, null_probability=0.01 | 74.593 us | 51.691 us | -30.70% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=8, null_probability=0.01 | 346.669 us | 339.849 us | -1.97% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=8, null_probability=0.01 | 3.238 ms | 3.402 ms | +5.06% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=16, null_probability=0.01 | 53.675 us | 38.465 us | -28.34% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=16, null_probability=0.01 | 106.014 us | 89.011 us | -16.04% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=16, null_probability=0.01 | 631.290 us | 639.224 us | +1.26% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=16, null_probability=0.01 | 6.093 ms | 6.377 ms | +4.66% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=32, null_probability=0.01 | 66.632 us | 52.364 us | -21.41% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=32, null_probability=0.01 | 170.483 us | 154.110 us | -9.60% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=32, null_probability=0.01 | 1.211 ms | 1.202 ms | -0.74% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=32, null_probability=0.01 | 11.798 ms | 12.021 ms | +1.89% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=100000 | 15.577 us | 15.631 us | +0.35% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=100000 | 18.663 us | 20.764 us | +11.26% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=100000 | 23.510 us | 28.261 us | +20.21% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=1000000 | 31.713 us | 32.160 us | +1.41% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=1000000 | 56.758 us | 58.755 us | +3.52% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=1000000 | 87.866 us | 92.079 us | +4.79% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=10000000 | 195.201 us | 192.497 us | -1.39% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=10000000 | 422.163 us | 425.243 us | +0.73% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=10000000 | 714.254 us | 716.396 us | +0.30% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=100000000 | 1.812 ms | 1.806 ms | -0.33% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=100000000 | 4.133 ms | 4.121 ms | -0.29% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=100000000 | 7.190 ms | 7.183 ms | -0.10% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=100000 | 14.495 us | 13.689 us | -5.56% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=100000 | 14.273 us | 13.742 us | -3.72% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=100000 | 14.370 us | 13.680 us | -4.80% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=1000000 | 26.092 us | 25.471 us | -2.38% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=1000000 | 25.678 us | 25.258 us | -1.64% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=1000000 | 25.766 us | 24.912 us | -3.31% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=10000000 | 133.830 us | 133.677 us | -0.11% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=10000000 | 134.389 us | 133.481 us | -0.68% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=10000000 | 133.972 us | 133.765 us | -0.15% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=100000000 | 1.222 ms | 1.211 ms | -0.90% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=100000000 | 1.215 ms | 1.212 ms | -0.25% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=100000000 | 1.221 ms | 1.219 ms | -0.16% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=100000 | 17.056 us | 17.265 us | +1.23% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=100000 | 23.137 us | 25.668 us | +10.94% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=100000 | 31.570 us | 36.949 us | +17.04% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=1000000 | 49.098 us | 49.796 us | +1.42% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=1000000 | 96.348 us | 99.185 us | +2.94% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=1000000 | 158.684 us | 161.322 us | +1.66% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=10000000 | 376.476 us | 374.832 us | -0.44% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=10000000 | 827.134 us | 829.747 us | +0.32% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=10000000 | 1.406 ms | 1.414 ms | +0.57% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=100000000 | 3.766 ms | 3.686 ms | -2.12% | FASTER |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=100000000 | 8.379 ms | 8.370 ms | -0.11% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=100000000 | 14.315 ms | 14.827 ms | +3.58% | ~SAME |

Comment on lines -54 to -70
static String ptx = "***(" +
" .func _Z1fPii(" +
" .param .b64 _Z1fPii_param_0," +
" .param .b32 _Z1fPii_param_1" +
" )" +
" {" +
" .reg .b32 %r<4>;" +
" .reg .b64 %rd<3>;" +
" ld.param.u64 %rd1, [_Z1fPii_param_0];" +
" ld.param.u32 %r1, [_Z1fPii_param_1];" +
" cvta.to.global.u64 %rd2, %rd1;" +
" mul.lo.s32 %r2, %r1, %r1;" +
" sub.s32 %r3, %r2, %r1;" +
" st.global.u32 [%rd2], %r3;" +
" ret;" +
" }" +
")***";
Contributor


How about this?

Contributor Author

@lamarrr lamarrr Apr 4, 2026


What do you mean?
The previous PTX is restrictive: it assumes the memory space of the output pointer, which prevented us from passing shared-memory or stack addresses to the UDF.
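
For illustration only (a hand-written fragment, not the actual replacement in this PR): dropping the cvta.to.global.u64 conversion and using a generic store lets the same UDF body write through a pointer in any address space.

```ptx
// Illustrative generic-address-space variant of the quoted UDF body:
ld.param.u64 %rd1, [_Z1fPii_param_0];
ld.param.u32 %r1, [_Z1fPii_param_1];
// no cvta.to.global.u64: %rd1 stays a generic address
mul.lo.s32 %r2, %r1, %r1;
sub.s32 %r3, %r2, %r1;
st.u32 [%rd1], %r3;  // generic store works for global, shared, or local memory
ret;
```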


Labels

breaking Breaking change feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Multi-output jit-transform

5 participants