
[FEA] Support Multi-Output JIT Transforms#21704

Open
lamarrr wants to merge 38 commits into rapidsai:main from lamarrr:multi-output-transform-support

Conversation

@lamarrr
Contributor

@lamarrr lamarrr commented Mar 7, 2026

Description

This pull request adds support for multi-output JIT transforms (0-N outputs).
It also makes string output creation more efficient by supporting pre-allocated string offsets, which reduces memory usage and improves performance when the sizes of the output strings can be cheaply pre-computed or are known ahead of time.
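The pre-allocated-offsets idea can be sketched in isolation (this is plain C++, not the libcudf API; the function name is illustrative): when every row's output size is known up front, the offsets column is just an exclusive prefix sum of the sizes, and the character buffer can be allocated exactly once.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch only (not libcudf code): build a string column's
// offsets and character buffer in one pass when per-row sizes are known,
// avoiding intermediate string materialization and buffer growth.
std::pair<std::vector<std::size_t>, std::string> make_string_column(
    std::vector<std::string> const& rows)
{
  std::vector<std::size_t> offsets(rows.size() + 1, 0);
  for (std::size_t i = 0; i < rows.size(); ++i) {
    // exclusive prefix sum of the per-row sizes
    offsets[i + 1] = offsets[i] + rows[i].size();
  }
  std::string chars;
  chars.reserve(offsets.back());  // single allocation for all characters
  for (auto const& r : rows) { chars += r; }
  return {offsets, chars};
}
```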

Summary

  • Support output of multiple columns and allow batching of pre/post-transform computations such as null-counting and stencil creation.
  • Remove intermediate-nullmask boolean creation in null-aware transforms
  • Make the Transform kernel streaming-oriented (allowing for non-coherent memory load/stores via --restrict)
  • Ban memory-space annotations in the PTX UDF. The previous assumption of a specific memory space prevented us from passing stack-variable pointers to UDFs, since the UDF would attempt to load from an incorrect memory space. Numba does not produce memory-space-specific UDFs when compiled without array arguments.
  • Made column_device_view types bytewise-compatible to allow aliasing in the kernel (via wrappers)
  • Merged all transform kernels into a single kernel
  • Refactored and simplified the Transforms code
  • Added more assertions and support checks to the Transforms code
  • Added and updated tests for these changes
  • Validated performance impact of these changes
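
The multi-output calling convention can be illustrated with a host-side sketch (the real libcudf kernel is JIT-compiled CUDA; all names below are hypothetical): a single pass hands the UDF one pointer per output column for each row, so one launch can fill any number of outputs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of the multi-output transform shape.
// Each iteration gives the "UDF" body pointers into every output column,
// mirroring a kernel that writes N outputs per row in one pass.
void transform_rows(std::vector<int> const& in,
                    std::vector<int>& out_square,
                    std::vector<int>& out_negate)
{
  for (std::size_t i = 0; i < in.size(); ++i) {
    int* o0 = &out_square[i];  // output pointer for column 0
    int* o1 = &out_negate[i];  // output pointer for column 1
    *o0 = in[i] * in[i];
    *o1 = -in[i];
  }
}
```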

Closes #20155

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot

copy-pr-bot bot commented Mar 7, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Mar 7, 2026
lamarrr added 3 commits March 8, 2026 16:14
… and improve column handling

- Updated `transform_input` to remove `mutable_column_view` and streamline input handling.
- Introduced `transform_output` struct to specify output types and nullability policies.
- Modified `transform_extended2` to return a `table` instead of a vector of columns, enhancing output management.
- Refactored `column_accessor` to utilize `detail::column_device_view_base` for better abstraction.
- Removed deprecated reflection functions for input accessors and adjusted related helper functions.
- Enhanced `transform_udf` to work with the new column view structure, ensuring compatibility with the updated input/output specifications.
- Updated kernel launch and argument handling to accommodate changes in input and output types.
- Improved nullability checks and handling in the transform execution flow.
- Added support for pre-allocated string offsets in output columns to optimize memory usage.
@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

@lamarrr lamarrr added feature request New feature or request breaking Breaking change labels Mar 12, 2026
@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

@lamarrr
Contributor Author

lamarrr commented Mar 12, 2026

/ok to test

lamarrr added 3 commits March 13, 2026 05:02
- Updated column_accessor to use __restrict__ qualifiers for improved performance.
- Renamed vector_column_device_view to mut_vector_device_view for clarity.
- Removed the transform_udf.cuh file and integrated its functionality directly into the transform kernel.
- Modified the transform_kernel to handle stencil nullability and user data more efficiently.
- Enhanced the launch function to accommodate stencil null checks.
- Updated tests to include multi-output and offset string operations.
- Cleaned up code and comments for better readability and maintainability.
@lamarrr lamarrr marked this pull request as ready for review March 13, 2026 05:07
@lamarrr lamarrr requested a review from a team as a code owner March 13, 2026 05:07
@lamarrr lamarrr requested a review from a team as a code owner April 1, 2026 20:08
@github-actions github-actions bot added the Java Affects Java cuDF API. label Apr 1, 2026
Member

@mhaseeb123 mhaseeb123 left a comment


Some non-blocking comments

{
// inputs to JITIFY kernels have to be either sized-integral types or pointers. Structs or
// references can't be passed directly/correctly as they will be crossing an ABI boundary
// TODO: ensure block size is a multiple of warp size for correct warp-synchronous behavior
Member


Non-blocking, but should we just handle this here by adding a __launch_bounds__(block_size) and a static_assert(block_size % cudf::detail::warp_size == 0, "Transform kernel block size must be a multiple of the warp size")?

Contributor Author


The block size can only be retrieved at runtime, so a static_assert would not work.
The correct approach would be to handle it in the transform code, but JITIFY doesn't expose the block and grid dimensions it obtains from the CUDA occupancy configuration.
In my tests, the block size returned by cudaOccupancyMaxPotentialBlockSize is always a multiple of 32.
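
Since the block size is only known at runtime, the closest equivalent to the suggested static_assert would be a launch-time check (an illustrative sketch, not the actual libcudf code; the function name is hypothetical):

```cpp
#include <stdexcept>

// Runtime stand-in for the proposed static_assert: the block size comes from
// cudaOccupancyMaxPotentialBlockSize at runtime, so it cannot be checked at
// compile time. Validate it just before the kernel launch instead.
constexpr int warp_size = 32;

void validate_block_size(int block_size)
{
  if (block_size <= 0 || block_size % warp_size != 0) {
    throw std::invalid_argument(
      "Transform kernel block size must be a positive multiple of the warp size");
  }
}
```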

@lamarrr
Contributor Author

lamarrr commented Apr 2, 2026

RTX A6000

| Benchmark | Config | GPU Time Before | GPU Time After | Delta % | Result |
|---|---|---|---|---|---|
| transform_polynomials_float32 | num_rows=100000, order=1, null_probability=0.01 | 41.470 us | 16.807 us | -59.47% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=1, null_probability=0.01 | 53.630 us | 30.166 us | -43.75% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=1, null_probability=0.01 | 167.302 us | 165.701 us | -0.96% | ~SAME |
| transform_polynomials_float32 | num_rows=100000000, order=1, null_probability=0.01 | 1.346 ms | 1.523 ms | +13.15% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=2, null_probability=0.01 | 42.838 us | 17.668 us | -58.76% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=2, null_probability=0.01 | 53.125 us | 30.211 us | -43.13% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=2, null_probability=0.01 | 167.168 us | 166.130 us | -0.62% | ~SAME |
| transform_polynomials_float32 | num_rows=100000000, order=2, null_probability=0.01 | 1.345 ms | 1.531 ms | +13.83% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=4, null_probability=0.01 | 44.109 us | 19.323 us | -56.19% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=4, null_probability=0.01 | 54.438 us | 31.296 us | -42.51% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=4, null_probability=0.01 | 173.097 us | 167.155 us | -3.43% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=4, null_probability=0.01 | 1.406 ms | 1.543 ms | +9.74% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=8, null_probability=0.01 | 44.480 us | 22.623 us | -49.14% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=8, null_probability=0.01 | 58.464 us | 37.166 us | -36.43% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=8, null_probability=0.01 | 213.449 us | 205.792 us | -3.59% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=8, null_probability=0.01 | 1.802 ms | 1.932 ms | +7.21% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=16, null_probability=0.01 | 48.877 us | 29.999 us | -38.62% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=16, null_probability=0.01 | 74.472 us | 51.678 us | -30.61% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=16, null_probability=0.01 | 313.152 us | 301.362 us | -3.76% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=16, null_probability=0.01 | 2.846 ms | 2.950 ms | +3.65% | ~SAME |
| transform_polynomials_float32 | num_rows=100000, order=32, null_probability=0.01 | 57.339 us | 42.460 us | -25.95% | FASTER |
| transform_polynomials_float32 | num_rows=1000000, order=32, null_probability=0.01 | 97.980 us | 84.466 us | -13.79% | FASTER |
| transform_polynomials_float32 | num_rows=10000000, order=32, null_probability=0.01 | 524.325 us | 505.890 us | -3.52% | FASTER |
| transform_polynomials_float32 | num_rows=100000000, order=32, null_probability=0.01 | 4.974 ms | 5.067 ms | +1.87% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=1, null_probability=0.01 | 43.076 us | 18.201 us | -57.75% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=1, null_probability=0.01 | 64.691 us | 42.334 us | -34.56% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=1, null_probability=0.01 | 284.249 us | 285.990 us | +0.61% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=1, null_probability=0.01 | 2.595 ms | 2.765 ms | +6.55% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=2, null_probability=0.01 | 43.429 us | 19.149 us | -55.91% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=2, null_probability=0.01 | 66.332 us | 42.437 us | -36.02% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=2, null_probability=0.01 | 286.099 us | 285.210 us | -0.31% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=2, null_probability=0.01 | 2.590 ms | 2.785 ms | +7.53% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=4, null_probability=0.01 | 44.368 us | 20.740 us | -53.25% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=4, null_probability=0.01 | 66.123 us | 42.389 us | -35.89% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=4, null_probability=0.01 | 285.817 us | 287.712 us | +0.66% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=4, null_probability=0.01 | 2.592 ms | 2.809 ms | +8.37% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=8, null_probability=0.01 | 47.611 us | 24.925 us | -47.65% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=8, null_probability=0.01 | 74.593 us | 51.691 us | -30.70% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=8, null_probability=0.01 | 346.669 us | 339.849 us | -1.97% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=8, null_probability=0.01 | 3.238 ms | 3.402 ms | +5.06% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=16, null_probability=0.01 | 53.675 us | 38.465 us | -28.34% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=16, null_probability=0.01 | 106.014 us | 89.011 us | -16.04% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=16, null_probability=0.01 | 631.290 us | 639.224 us | +1.26% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=16, null_probability=0.01 | 6.093 ms | 6.377 ms | +4.66% | ~SAME |
| transform_polynomials_float64 | num_rows=100000, order=32, null_probability=0.01 | 66.632 us | 52.364 us | -21.41% | FASTER |
| transform_polynomials_float64 | num_rows=1000000, order=32, null_probability=0.01 | 170.483 us | 154.110 us | -9.60% | FASTER |
| transform_polynomials_float64 | num_rows=10000000, order=32, null_probability=0.01 | 1.211 ms | 1.202 ms | -0.74% | ~SAME |
| transform_polynomials_float64 | num_rows=100000000, order=32, null_probability=0.01 | 11.798 ms | 12.021 ms | +1.89% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=100000 | 15.577 us | 15.631 us | +0.35% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=100000 | 18.663 us | 20.764 us | +11.26% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=100000 | 23.510 us | 28.261 us | +20.21% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=1000000 | 31.713 us | 32.160 us | +1.41% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=1000000 | 56.758 us | 58.755 us | +3.52% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=1000000 | 87.866 us | 92.079 us | +4.79% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=10000000 | 195.201 us | 192.497 us | -1.39% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=10000000 | 422.163 us | 425.243 us | +0.73% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=10000000 | 714.254 us | 716.396 us | +0.30% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=1, num_rows=100000000 | 1.812 ms | 1.806 ms | -0.33% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=5, num_rows=100000000 | 4.133 ms | 4.121 ms | -0.29% | ~SAME |
| transform_int32_imbalanced_unique | tree_levels=10, num_rows=100000000 | 7.190 ms | 7.183 ms | -0.10% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=100000 | 14.495 us | 13.689 us | -5.56% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=100000 | 14.273 us | 13.742 us | -3.72% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=100000 | 14.370 us | 13.680 us | -4.80% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=1000000 | 26.092 us | 25.471 us | -2.38% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=1000000 | 25.678 us | 25.258 us | -1.64% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=1000000 | 25.766 us | 24.912 us | -3.31% | FASTER |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=10000000 | 133.830 us | 133.677 us | -0.11% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=10000000 | 134.389 us | 133.481 us | -0.68% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=10000000 | 133.972 us | 133.765 us | -0.15% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=1, num_rows=100000000 | 1.222 ms | 1.211 ms | -0.90% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=5, num_rows=100000000 | 1.215 ms | 1.212 ms | -0.25% | ~SAME |
| transform_int32_imbalanced_reuse | tree_levels=10, num_rows=100000000 | 1.221 ms | 1.219 ms | -0.16% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=100000 | 17.056 us | 17.265 us | +1.23% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=100000 | 23.137 us | 25.668 us | +10.94% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=100000 | 31.570 us | 36.949 us | +17.04% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=1000000 | 49.098 us | 49.796 us | +1.42% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=1000000 | 96.348 us | 99.185 us | +2.94% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=1000000 | 158.684 us | 161.322 us | +1.66% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=10000000 | 376.476 us | 374.832 us | -0.44% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=10000000 | 827.134 us | 829.747 us | +0.32% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=10000000 | 1.406 ms | 1.414 ms | +0.57% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=1, num_rows=100000000 | 3.766 ms | 3.686 ms | -2.12% | FASTER |
| transform_double_imbalanced_unique | tree_levels=5, num_rows=100000000 | 8.379 ms | 8.370 ms | -0.11% | ~SAME |
| transform_double_imbalanced_unique | tree_levels=10, num_rows=100000000 | 14.315 ms | 14.827 ms | +3.58% | ~SAME |

Comment on lines -54 to -70
static String ptx = "***(" +
" .func _Z1fPii(" +
" .param .b64 _Z1fPii_param_0," +
" .param .b32 _Z1fPii_param_1" +
" )" +
" {" +
" .reg .b32 %r<4>;" +
" .reg .b64 %rd<3>;" +
" ld.param.u64 %rd1, [_Z1fPii_param_0];" +
" ld.param.u32 %r1, [_Z1fPii_param_1];" +
" cvta.to.global.u64 %rd2, %rd1;" +
" mul.lo.s32 %r2, %r1, %r1;" +
" sub.s32 %r3, %r2, %r1;" +
" st.global.u32 [%rd2], %r3;" +
" ret;" +
" }" +
")***";
Contributor


How about this?

Contributor Author

@lamarrr lamarrr Apr 4, 2026


What do you mean?
The previous PTX is restrictive: it assumes the memory space of the output pointer, which prevented us from passing shared-memory or stack addresses to the UDF.
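
For illustration only (a hand-written fragment, not the actual replacement in this PR): dropping the cvta.to.global.u64 conversion and using a generic store lets the same UDF body write through a pointer in any address space.

```ptx
// Illustrative generic-address-space variant of the quoted UDF body:
ld.param.u64 %rd1, [_Z1fPii_param_0];
ld.param.u32 %r1, [_Z1fPii_param_1];
// no cvta.to.global.u64: %rd1 stays a generic address
mul.lo.s32 %r2, %r1, %r1;
sub.s32 %r3, %r2, %r1;
st.u32 [%rd1], %r3;  // generic store works for global, shared, or local memory
ret;
```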


Labels

breaking Breaking change feature request New feature or request Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEA] Multi-output jit-transform

5 participants