[FEA] Support Multi-Output JIT Transforms #21704
lamarrr wants to merge 38 commits into rapidsai:main
… and improve column handling
- Updated `transform_input` to remove `mutable_column_view` and streamline input handling.
- Introduced `transform_output` struct to specify output types and nullability policies.
- Modified `transform_extended2` to return a `table` instead of a vector of columns, enhancing output management.
- Refactored `column_accessor` to utilize `detail::column_device_view_base` for better abstraction.
- Removed deprecated reflection functions for input accessors and adjusted related helper functions.
- Enhanced `transform_udf` to work with the new column view structure, ensuring compatibility with the updated input/output specifications.
- Updated kernel launch and argument handling to accommodate changes in input and output types.
- Improved nullability checks and handling in the transform execution flow.
- Added support for pre-allocated string offsets in output columns to optimize memory usage.
/ok to test
…olumn_device_view_core
- Updated column_accessor to use __restrict__ qualifiers for improved performance.
- Renamed vector_column_device_view to mut_vector_device_view for clarity.
- Removed the transform_udf.cuh file and integrated its functionality directly into the transform kernel.
- Modified the transform_kernel to handle stencil nullability and user data more efficiently.
- Enhanced the launch function to accommodate stencil null checks.
- Updated tests to include multi-output and offset string operations.
- Cleaned up code and comments for better readability and maintainability.
…amarrr/cudf into multi-output-transform-support
mhaseeb123 left a comment
Some non-blocking comments
{
  // inputs to JITIFY kernels have to be either sized-integral types or pointers. Structs or
  // references can't be passed directly/correctly as they will be crossing an ABI boundary
  // TODO: ensure block size is a multiple of warp size for correct warp-synchronous behavior
Non-blocking, but should we just handle this here by adding `__launch_bounds__(block_size)` and a `static_assert(block_size % cudf::detail::warp_size == 0, "Transform kernel block size must be a multiple of the warp size")`?
The block size can only be retrieved from runtime values, so a `static_assert` would not work.
The correct approach would be to handle it in the transform code, but JITIFY doesn't let us get the block and grid dimensions it obtained from the CUDA occupancy configuration.
In my tests, the block size returned by `cudaOccupancyMaxPotentialBlockSize` has always been a multiple of 32.
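The runtime check discussed above could be sketched on the host side roughly as follows. This is only an illustration, not cudf code: `warp_size`, `is_warp_multiple`, and `round_down_to_warp_multiple` are hypothetical names, and the block size is assumed to arrive as a plain runtime integer (for example, from an occupancy query).

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical constant for this sketch; the warp size is 32 on current NVIDIA GPUs.
constexpr int warp_size = 32;

// Returns true when block_size is a positive multiple of the warp size.
constexpr bool is_warp_multiple(int block_size)
{
  return block_size > 0 && block_size % warp_size == 0;
}

// Rounds a runtime block size down to the nearest warp multiple.
// Throws if fewer than one full warp is available.
inline int round_down_to_warp_multiple(int block_size)
{
  if (block_size < warp_size) {
    throw std::invalid_argument("block size must be at least one warp");
  }
  return (block_size / warp_size) * warp_size;
}
```

A check like this has to run at launch time because, as noted, the block size is not a compile-time constant.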
…amarrr/cudf into multi-output-transform-support
Co-authored-by: Muhammad Haseeb <14217455+mhaseeb123@users.noreply.github.com>
RTX A6000
static String ptx = "***(" +
    " .func _Z1fPii(" +
    " .param .b64 _Z1fPii_param_0," +
    " .param .b32 _Z1fPii_param_1" +
    " )" +
    " {" +
    " .reg .b32 %r<4>;" +
    " .reg .b64 %rd<3>;" +
    " ld.param.u64 %rd1, [_Z1fPii_param_0];" +
    " ld.param.u32 %r1, [_Z1fPii_param_1];" +
    " cvta.to.global.u64 %rd2, %rd1;" +
    " mul.lo.s32 %r2, %r1, %r1;" +
    " sub.s32 %r3, %r2, %r1;" +
    " st.global.u32 [%rd2], %r3;" +
    " ret;" +
    " }" +
    ")***";
what?
The previous PTX was restrictive: it assumed the memory space of the output pointers, which prevented us from passing shared-memory or stack addresses to the UDF.
Description
This pull request adds support for multi-output JIT transforms (0 to N outputs).
It also makes string output creation more efficient by supporting pre-allocated string offsets, which reduces memory usage in cases where the sizes of the output strings can be cheaply pre-computed or are known ahead of time.
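To illustrate why known sizes help: in an Arrow-style strings column, the offsets buffer is the exclusive prefix sum of the per-row byte sizes, plus one trailing entry holding the total. Below is a minimal host-side sketch under that assumption; `make_offsets` is a hypothetical helper and not part of the cudf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: builds n + 1 offsets from n per-row byte sizes.
// offsets[i] is where row i starts; offsets[n] is the total byte count.
inline std::vector<std::int64_t> make_offsets(std::vector<std::int32_t> const& sizes)
{
  std::vector<std::int64_t> offsets(sizes.size() + 1, 0);
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    offsets[i + 1] = offsets[i] + sizes[i];
  }
  return offsets;
}
```

With the offsets available up front, the character buffer can in principle be allocated once at `offsets.back()` bytes, and each row can be written directly into its own slot.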
Summary
- … `--restrict`)
- … `column_device_view` types bytewise-compatible to allow aliasing in the kernel (via wrappers)
- … `transform` kernels into a single kernel

Closes #20155
Checklist