Commit cc88902

Cythonize _module.py (#1520)
* Cythonize _module.py (Phase 2a: cdef classes)

  Convert Kernel, ObjectCode, and KernelOccupancy to cdef classes with proper
  .pxd declarations. This phase establishes the Cython structure while
  maintaining Python driver module usage.

  Changes:
  - Rename _module.py to _module.pyx
  - Create _module.pxd with cdef class declarations
  - Convert Kernel, ObjectCode, KernelOccupancy to cdef class
  - Remove _backend dict in favor of direct driver calls
  - Add _init_py() Python-accessible factory for ObjectCode
  - Update _program.py and _linker.py to use _init_py()
  - Fix test to handle cdef class property descriptors

  Phase 2b will convert driver calls to cydriver with nogil blocks.
  Phase 2c will add RAII handles to resource_handles.

* Phase 2a refinements: hide private attrs, add public properties

  - Use strong types in .pxd (ObjectCode, KernelOccupancy)
  - Remove cdef public (attributes now private to C level)
  - Add Kernel.handle property for external access
  - Add ObjectCode.symbol_mapping property (symmetric with input)
  - Update _launcher.pyx, _linker.py, tests to use public APIs

* Convert module-level functions and Kernel._get_arguments_info to cdef

  - Module globals: _inited, _py_major_ver, _py_minor_ver, _driver_ver,
    _kernel_ctypes, _paraminfo_supported -> cdef typed
  - Module functions: _lazy_init, _get_py_major_ver, _get_py_minor_ver,
    _get_driver_ver, _get_kernel_ctypes, _is_paraminfo_supported,
    _make_dummy_library_handle -> cdef inline with exception specs
  - Module constant: _supported_code_type -> cdef tuple
  - Kernel._get_arguments_info -> cdef tuple

  Note: KernelAttributes remains a regular Python class due to segfaults
  when converted to cdef class (likely due to weakref interaction with
  cdef class properties).

* Convert KernelAttributes to cdef class

  Follow the _MemPoolAttributes pattern:
  - cdef class with inline cdef attributes (_kernel_weakref, _cache)
  - _init as @classmethod (not @staticmethod cdef)
  - _get_cached_attribute and _resolve_device_id use except? -1
  - Explicit cast when dereferencing weakref

* Add LibraryHandle and KernelHandle to resource_handles infrastructure

  Extends the RAII handle system to support CUlibrary and CUkernel driver
  objects used in _module.pyx. This provides automatic lifetime management
  and proper cleanup for library and kernel handles.

  Changes:
  - Add LibraryHandle/KernelHandle types with factory functions
  - Update Kernel, ObjectCode, KernelOccupancy to use typed handles
  - Move KernelAttributes cdef block to .pxd for strong typing
  - Update _launcher.pyx to access kernel handle directly via cdef

* Convert _module.pyx driver calls to cydriver with nogil

  Replaces Python-level driver API calls with low-level cydriver calls
  wrapped in nogil blocks for improved performance. This allows the GIL to
  be released during CUDA driver operations.

  Changes:
  - cuDriverGetVersion, cuKernelGetAttribute, cuKernelGetParamInfo
  - cuOccupancy* functions (with appropriate GIL handling for callbacks)
  - cuKernelGetLibrary
  - Update KernelAttributes._get_cached_attribute to use cydriver types

* Fix SEGV in Kernel.from_handle with non-int types

  Remove type annotation from handle parameter to prevent Cython's
  automatic float-to-int coercion, which caused a segmentation fault. The
  manual isinstance check properly validates all non-int types.

* Refactor ObjectCode._init and add kernel lifetime test

  - Change ObjectCode._init from cdef to @classmethod def, matching the
    pattern used by Buffer, Stream, Graph, and other classes
  - Remove _init_py wrapper (no longer needed)
  - Update callers in _program.py and _linker.py
  - Add test_kernel_keeps_library_alive to verify that a Kernel keeps its
    underlying library alive after ObjectCode goes out of scope

* Simplify resource handle patterns and clean up tests

  - Remove Kernel._module (ObjectCode reference no longer needed since
    KernelHandle keeps library alive via LibraryHandle dependency)
  - Simplify Kernel._from_obj signature (remove unused ObjectCode param)
  - KernelAttributes: store KernelHandle instead of weakref to Kernel
  - Rename get_kernel_from_library to create_kernel_handle for consistency
  - Remove fragile annotation introspection from test_saxpy_arguments

* Simplify _MemPoolAttributes to use direct MemoryPoolHandle

  Replace weakref pattern with direct MemoryPoolHandle storage in
  _MemPoolAttributes. The handle's shared_ptr keeps the underlying pool
  alive, so attributes remain accessible after the MR is deleted.

  Note: _MemPool retains __weakref__ because the IPC subsystem uses
  WeakValueDictionary to track memory resources across processes.

* Fix access violation in occupancy queries with uninitialized hStream

  Zero-initialize CUlaunchConfig struct to prevent garbage values in
  hStream field when no stream is provided. The driver dereferences
  hStream even when querying occupancy, causing access violations on some
  platforms (observed on Windows with RTX Pro 6000).
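
The "Fix SEGV in Kernel.from_handle" item relies on an explicit type check instead of a typed parameter. A minimal pure-Python sketch of that idea follows; `validate_handle` is a hypothetical helper name used only for illustration, not a function from this commit, and the real check lives in Cython code that is not shown here.

```python
def validate_handle(handle):
    """Return `handle` unchanged if it is an int, otherwise raise TypeError.

    Hypothetical helper: it stands in for the manual isinstance check that the
    commit message describes. Leaving the parameter untyped means a float is
    rejected here instead of being silently coerced to an integer.
    """
    if not isinstance(handle, int):
        raise TypeError(f"handle must be an int, got {type(handle).__name__}")
    return handle
```

With this shape, `validate_handle(1.5)` raises `TypeError` rather than passing a truncated value on as if it were a driver handle.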
1 parent f4d3207 commit cc88902

14 files changed: +616 -233 lines changed
cuda_core/cuda/core/_cpp/resource_handles.cpp

Lines changed: 82 additions & 0 deletions
@@ -51,6 +51,11 @@ decltype(&cuMemFreeHost) p_cuMemFreeHost = nullptr;

 decltype(&cuMemPoolImportPointer) p_cuMemPoolImportPointer = nullptr;

+decltype(&cuLibraryLoadFromFile) p_cuLibraryLoadFromFile = nullptr;
+decltype(&cuLibraryLoadData) p_cuLibraryLoadData = nullptr;
+decltype(&cuLibraryUnload) p_cuLibraryUnload = nullptr;
+decltype(&cuLibraryGetKernel) p_cuLibraryGetKernel = nullptr;
+
 // ============================================================================
 // GIL management helpers
 // ============================================================================
@@ -682,4 +687,81 @@ DevicePtrHandle deviceptr_import_ipc(const MemoryPoolHandle& h_pool, const void*
     }
 }

+// ============================================================================
+// Library Handles
+// ============================================================================
+
+namespace {
+struct LibraryBox {
+    CUlibrary resource;
+};
+} // namespace
+
+LibraryHandle create_library_handle_from_file(const char* path) {
+    GILReleaseGuard gil;
+    CUlibrary library;
+    if (CUDA_SUCCESS != (err = p_cuLibraryLoadFromFile(&library, path, nullptr, nullptr, 0, nullptr, nullptr, 0))) {
+        return {};
+    }
+
+    auto box = std::shared_ptr<const LibraryBox>(
+        new LibraryBox{library},
+        [](const LibraryBox* b) {
+            GILReleaseGuard gil;
+            p_cuLibraryUnload(b->resource);
+            delete b;
+        }
+    );
+    return LibraryHandle(box, &box->resource);
+}
+
+LibraryHandle create_library_handle_from_data(const void* data) {
+    GILReleaseGuard gil;
+    CUlibrary library;
+    if (CUDA_SUCCESS != (err = p_cuLibraryLoadData(&library, data, nullptr, nullptr, 0, nullptr, nullptr, 0))) {
+        return {};
+    }
+
+    auto box = std::shared_ptr<const LibraryBox>(
+        new LibraryBox{library},
+        [](const LibraryBox* b) {
+            GILReleaseGuard gil;
+            p_cuLibraryUnload(b->resource);
+            delete b;
+        }
+    );
+    return LibraryHandle(box, &box->resource);
+}
+
+LibraryHandle create_library_handle_ref(CUlibrary library) {
+    auto box = std::make_shared<const LibraryBox>(LibraryBox{library});
+    return LibraryHandle(box, &box->resource);
+}
+
+// ============================================================================
+// Kernel Handles
+// ============================================================================
+
+namespace {
+struct KernelBox {
+    CUkernel resource;
+    LibraryHandle h_library; // Keeps library alive
+};
+} // namespace
+
+KernelHandle create_kernel_handle(const LibraryHandle& h_library, const char* name) {
+    GILReleaseGuard gil;
+    CUkernel kernel;
+    if (CUDA_SUCCESS != (err = p_cuLibraryGetKernel(&kernel, *h_library, name))) {
+        return {};
+    }
+
+    return create_kernel_handle_ref(kernel, h_library);
+}
+
+KernelHandle create_kernel_handle_ref(CUkernel kernel, const LibraryHandle& h_library) {
+    auto box = std::make_shared<const KernelBox>(KernelBox{kernel, h_library});
+    return KernelHandle(box, &box->resource);
+}
+
 } // namespace cuda_core
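
The ownership chain in this file (a KernelBox holds a LibraryHandle, and the library's deleter unloads it once the last reference goes away) can be modeled in plain Python to see the ordering. This is only an analogy under stated assumptions: `Library`, `Kernel`, and `demo` are hypothetical names, and `weakref.finalize` stands in for the shared_ptr deleter that calls cuLibraryUnload.

```python
import gc
import weakref


class Library:
    """Hypothetical stand-in for an owning library handle."""

    def __init__(self, name, log):
        self.name = name
        # Runs once when the last reference to this object disappears,
        # playing the role of the shared_ptr deleter in the C++ code.
        weakref.finalize(self, log.append, f"unload {name}")


class Kernel:
    """Hypothetical stand-in for a kernel handle that depends on a library."""

    def __init__(self, library, symbol):
        self.symbol = symbol
        self._library = library  # strong reference keeps the library alive


def demo():
    log = []
    lib = Library("saxpy.cubin", log)
    ker = Kernel(lib, "saxpy")
    del lib                      # the kernel still holds the library
    still_loaded = (log == [])
    del ker                      # last reference gone, finalizer runs
    gc.collect()                 # not needed on CPython, harmless elsewhere
    return still_loaded, log
```

Calling `demo()` shows that dropping the library object alone does not trigger the unload; it happens only after the dependent kernel is released as well.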

cuda_core/cuda/core/_cpp/resource_handles.hpp

Lines changed: 66 additions & 0 deletions
@@ -61,6 +61,12 @@ extern decltype(&cuMemFreeHost) p_cuMemFreeHost;

 extern decltype(&cuMemPoolImportPointer) p_cuMemPoolImportPointer;

+// Library
+extern decltype(&cuLibraryLoadFromFile) p_cuLibraryLoadFromFile;
+extern decltype(&cuLibraryLoadData) p_cuLibraryLoadData;
+extern decltype(&cuLibraryUnload) p_cuLibraryUnload;
+extern decltype(&cuLibraryGetKernel) p_cuLibraryGetKernel;
+
 // ============================================================================
 // Handle type aliases - expose only the raw CUDA resource
 // ============================================================================
@@ -69,6 +75,8 @@ using ContextHandle = std::shared_ptr<const CUcontext>;
 using StreamHandle = std::shared_ptr<const CUstream>;
 using EventHandle = std::shared_ptr<const CUevent>;
 using MemoryPoolHandle = std::shared_ptr<const CUmemoryPool>;
+using LibraryHandle = std::shared_ptr<const CUlibrary>;
+using KernelHandle = std::shared_ptr<const CUkernel>;

 // ============================================================================
 // Context handle functions
@@ -218,6 +226,40 @@ StreamHandle deallocation_stream(const DevicePtrHandle& h) noexcept;
 // Set the deallocation stream for a device pointer handle.
 void set_deallocation_stream(const DevicePtrHandle& h, const StreamHandle& h_stream) noexcept;

+// ============================================================================
+// Library handle functions
+// ============================================================================
+
+// Create an owning library handle by loading from a file path.
+// When the last reference is released, cuLibraryUnload is called automatically.
+// Returns empty handle on error (caller must check).
+LibraryHandle create_library_handle_from_file(const char* path);
+
+// Create an owning library handle by loading from memory data.
+// The driver makes an internal copy of the data; caller can free it after return.
+// When the last reference is released, cuLibraryUnload is called automatically.
+// Returns empty handle on error (caller must check).
+LibraryHandle create_library_handle_from_data(const void* data);
+
+// Create a non-owning library handle (references existing library).
+// Use for borrowed libraries (e.g., from foreign code).
+// The library will NOT be unloaded when the handle is released.
+LibraryHandle create_library_handle_ref(CUlibrary library);
+
+// ============================================================================
+// Kernel handle functions
+// ============================================================================
+
+// Get a kernel from a library by name.
+// The kernel structurally depends on the provided library handle.
+// Kernels have no explicit destroy - their lifetime is tied to the library.
+// Returns empty handle on error (caller must check).
+KernelHandle create_kernel_handle(const LibraryHandle& h_library, const char* name);
+
+// Create a non-owning kernel handle with library dependency.
+// Use for borrowed kernels. The library handle keeps the library alive.
+KernelHandle create_kernel_handle_ref(CUkernel kernel, const LibraryHandle& h_library);
+
 // ============================================================================
 // Overloaded helper functions to extract raw resources from handles
 // ============================================================================
@@ -243,6 +285,14 @@ inline CUdeviceptr as_cu(const DevicePtrHandle& h) noexcept {
     return h ? *h : 0;
 }

+inline CUlibrary as_cu(const LibraryHandle& h) noexcept {
+    return h ? *h : nullptr;
+}
+
+inline CUkernel as_cu(const KernelHandle& h) noexcept {
+    return h ? *h : nullptr;
+}
+
 // as_intptr() - extract handle as intptr_t for Python interop
 // Using signed intptr_t per C standard convention and issue #1342
 inline std::intptr_t as_intptr(const ContextHandle& h) noexcept {
@@ -265,6 +315,14 @@ inline std::intptr_t as_intptr(const DevicePtrHandle& h) noexcept {
     return static_cast<std::intptr_t>(as_cu(h));
 }

+inline std::intptr_t as_intptr(const LibraryHandle& h) noexcept {
+    return reinterpret_cast<std::intptr_t>(as_cu(h));
+}
+
+inline std::intptr_t as_intptr(const KernelHandle& h) noexcept {
+    return reinterpret_cast<std::intptr_t>(as_cu(h));
+}
+
 // as_py() - convert handle to Python driver wrapper object (returns new reference)
 namespace detail {
 // n.b. class lookup is not cached to avoid deadlock hazard, see DESIGN.md
@@ -300,4 +358,12 @@ inline PyObject* as_py(const DevicePtrHandle& h) noexcept {
     return detail::make_py("CUdeviceptr", as_intptr(h));
 }

+inline PyObject* as_py(const LibraryHandle& h) noexcept {
+    return detail::make_py("CUlibrary", as_intptr(h));
+}
+
+inline PyObject* as_py(const KernelHandle& h) noexcept {
+    return detail::make_py("CUkernel", as_intptr(h));
+}
+
 } // namespace cuda_core
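
The as_intptr() overloads hand pointer-sized handles to Python as a signed integer. As a rough illustration of what that reinterpretation means on the Python side, the sketch below maps an unsigned address onto the signed range of the same width. The function name mirrors the header only for readability; this Python version and `POINTER_BITS` are illustrative assumptions, not part of cuda.core.

```python
import ctypes

# Width of a pointer on the current platform (64 on typical desktop systems).
POINTER_BITS = 8 * ctypes.sizeof(ctypes.c_void_p)


def as_intptr(address):
    """Reinterpret an unsigned pointer-sized address as a signed integer.

    Illustrative only: mimics the two's-complement view that a cast to
    intptr_t gives in C++. A null handle (0) stays 0.
    """
    if not 0 <= address < (1 << POINTER_BITS):
        raise ValueError("address does not fit in a pointer")
    if address >= 1 << (POINTER_BITS - 1):
        return address - (1 << POINTER_BITS)
    return address
```

Addresses in the lower half of the address space pass through unchanged, so the signed and unsigned views agree for the values handles usually take.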

cuda_core/cuda/core/_launch_config.pyx

Lines changed: 3 additions & 1 deletion
@@ -2,6 +2,8 @@
 #
 # SPDX-License-Identifier: Apache-2.0

+from libc.string cimport memset
+
 from cuda.core._utils.cuda_utils cimport (
     HANDLE_RETURN,
 )
@@ -152,9 +154,9 @@ cdef class LaunchConfig:

     cdef cydriver.CUlaunchConfig _to_native_launch_config(self):
         _lazy_init()
-        # TODO: memset to zero?
         cdef cydriver.CUlaunchConfig drv_cfg
         cdef cydriver.CUlaunchAttribute attr
+        memset(&drv_cfg, 0, sizeof(drv_cfg))
         self._attrs.resize(0)

         # Handle grid dimensions and cluster configuration
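
The effect of the added memset can be reproduced outside Cython with ctypes. The sketch below is an analogy, not the driver struct: `FakeLaunchConfig` and its field names are hypothetical, and the junk-filled bytearray simulates an uninitialized stack variable.

```python
import ctypes


class FakeLaunchConfig(ctypes.Structure):
    """Hypothetical stand-in for CUlaunchConfig; fields are illustrative."""

    _fields_ = [
        ("grid_dim_x", ctypes.c_uint),
        ("block_dim_x", ctypes.c_uint),
        ("h_stream", ctypes.c_void_p),
    ]


def make_config(zero_first):
    # Back the struct with junk bytes to imitate uninitialized memory.
    junk = bytearray(b"\xab" * ctypes.sizeof(FakeLaunchConfig))
    cfg = FakeLaunchConfig.from_buffer(junk)
    if zero_first:
        # Same idea as memset(&drv_cfg, 0, sizeof(drv_cfg)) in the diff.
        ctypes.memset(ctypes.addressof(cfg), 0, ctypes.sizeof(cfg))
    # Only the fields the caller cares about are set; h_stream is untouched.
    cfg.grid_dim_x = 1
    cfg.block_dim_x = 32
    return cfg
```

Without the zeroing step, `h_stream` reads back as a non-null garbage address; with it, ctypes reports the field as `None`, the null pointer.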

cuda_core/cuda/core/_launcher.pyx

Lines changed: 3 additions & 2 deletions
@@ -8,6 +8,7 @@ from cuda.bindings cimport cydriver

 from cuda.core._launch_config cimport LaunchConfig
 from cuda.core._kernel_arg_handler cimport ParamHolder
+from cuda.core._module cimport Kernel
 from cuda.core._resource_handles cimport as_cu
 from cuda.core._stream cimport Stream_accept, Stream
 from cuda.core._utils.cuda_utils cimport (
@@ -77,11 +78,11 @@ def launch(stream: Stream | GraphBuilder | IsStreamT, config: LaunchConfig, kern
     cdef ParamHolder ker_args = ParamHolder(kernel_args)
     cdef void** args_ptr = <void**><uintptr_t>(ker_args.ptr)

-    # TODO: cythonize Module/Kernel/...
     # Note: We now use CUkernel handles exclusively (CUDA 12+), but they can be cast to
     # CUfunction for use with cuLaunchKernel, as both handle types are interchangeable
     # for kernel launch purposes.
-    cdef cydriver.CUfunction func_handle = <cydriver.CUfunction>(<uintptr_t>(kernel._handle))
+    cdef Kernel ker = <Kernel>kernel
+    cdef cydriver.CUfunction func_handle = <cydriver.CUfunction>as_cu(ker._h_kernel)

     # Note: CUkernel can still be launched via cuLaunchKernel (not just cuLaunchKernelEx).
     # We check both binding & driver versions here mainly to see if the "Ex" API is

cuda_core/cuda/core/_linker.py

Lines changed: 5 additions & 5 deletions
@@ -444,29 +444,29 @@ def __init__(self, *object_codes: ObjectCode, options: LinkerOptions = None):
             self._add_code_object(code)

     def _add_code_object(self, object_code: ObjectCode):
-        data = object_code._module
+        data = object_code.code
         with _exception_manager(self):
             name_str = f"{object_code.name}"
             if _nvjitlink and isinstance(data, bytes):
                 _nvjitlink.add_data(
                     self._mnff.handle,
-                    self._input_type_from_code_type(object_code._code_type),
+                    self._input_type_from_code_type(object_code.code_type),
                     data,
                     len(data),
                     name_str,
                 )
             elif _nvjitlink and isinstance(data, str):
                 _nvjitlink.add_file(
                     self._mnff.handle,
-                    self._input_type_from_code_type(object_code._code_type),
+                    self._input_type_from_code_type(object_code.code_type),
                     data,
                 )
             elif (not _nvjitlink) and isinstance(data, bytes):
                 name_bytes = name_str.encode()
                 handle_return(
                     _driver.cuLinkAddData(
                         self._mnff.handle,
-                        self._input_type_from_code_type(object_code._code_type),
+                        self._input_type_from_code_type(object_code.code_type),
                         data,
                         len(data),
                         name_bytes,
@@ -481,7 +481,7 @@ def _add_code_object(self, object_code: ObjectCode):
                 handle_return(
                     _driver.cuLinkAddFile(
                         self._mnff.handle,
-                        self._input_type_from_code_type(object_code._code_type),
+                        self._input_type_from_code_type(object_code.code_type),
                         data.encode(),
                         0,
                         None,
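
The branches in `_add_code_object` choose a call based on the backend in use and on whether `ObjectCode.code` is in-memory bytes or a file path string. A small sketch of that dispatch is given below. `pick_link_call` is a hypothetical helper that only returns the name of the call that would be made; the condition guarding the `cuLinkAddFile` branch lies outside the hunks shown, so treating it as the driver/str case is an assumption.

```python
def pick_link_call(code, use_nvjitlink):
    """Return the name of the linker call matching the input kind.

    Hypothetical helper for illustration: bytes are treated as in-memory
    code, str as a file path, mirroring the isinstance checks in the diff.
    """
    if isinstance(code, bytes):
        return "_nvjitlink.add_data" if use_nvjitlink else "_driver.cuLinkAddData"
    if isinstance(code, str):
        return "_nvjitlink.add_file" if use_nvjitlink else "_driver.cuLinkAddFile"
    raise TypeError(f"expected bytes or str, got {type(code).__name__}")
```

Reading the input through the public `code` and `code_type` properties, as the diff now does, keeps this kind of dispatch independent of how the cdef class stores its data internally.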

cuda_core/cuda/core/_memory/_memory_pool.pxd

Lines changed: 10 additions & 0 deletions
@@ -19,6 +19,16 @@ cdef class _MemPool(MemoryResource):
         object __weakref__


+cdef class _MemPoolAttributes:
+    cdef:
+        MemoryPoolHandle _h_pool
+
+    @staticmethod
+    cdef _MemPoolAttributes _init(MemoryPoolHandle h_pool)
+
+    cdef int _getattribute(self, cydriver.CUmemPool_attribute attr_enum, void* value) except? -1
+
+
 cdef class _MemPoolOptions:

     cdef:

cuda_core/cuda/core/_memory/_memory_pool.pyx

Lines changed: 7 additions & 14 deletions
@@ -29,7 +29,6 @@ from cuda.core._utils.cuda_utils cimport (
 )

 import platform # no-cython-lint
-import weakref

 from cuda.core._utils.cuda_utils import driver

@@ -45,16 +44,15 @@ cdef class _MemPoolOptions:


 cdef class _MemPoolAttributes:
-    cdef:
-        object _mr_weakref
+    """Provides access to memory pool attributes."""

     def __init__(self, *args, **kwargs):
         raise RuntimeError("_MemPoolAttributes cannot be instantiated directly. Please use MemoryResource APIs.")

-    @classmethod
-    def _init(cls, mr):
-        cdef _MemPoolAttributes self = _MemPoolAttributes.__new__(cls)
-        self._mr_weakref = mr
+    @staticmethod
+    cdef _MemPoolAttributes _init(MemoryPoolHandle h_pool):
+        cdef _MemPoolAttributes self = _MemPoolAttributes.__new__(_MemPoolAttributes)
+        self._h_pool = h_pool
         return self

     def __repr__(self):
@@ -64,12 +62,8 @@ cdef class _MemPoolAttributes:
         )

     cdef int _getattribute(self, cydriver.CUmemPool_attribute attr_enum, void* value) except?-1:
-        cdef _MemPool mr = <_MemPool>(self._mr_weakref())
-        if mr is None:
-            raise RuntimeError("_MemPool is expired")
-        cdef cydriver.CUmemoryPool pool_handle = as_cu(mr._h_pool)
         with nogil:
-            HANDLE_RETURN(cydriver.cuMemPoolGetAttribute(pool_handle, attr_enum, value))
+            HANDLE_RETURN(cydriver.cuMemPoolGetAttribute(as_cu(self._h_pool), attr_enum, value))
         return 0

     @property
@@ -197,8 +191,7 @@ cdef class _MemPool(MemoryResource):
     def attributes(self) -> _MemPoolAttributes:
         """Memory pool attributes."""
         if self._attributes is None:
-            ref = weakref.ref(self)
-            self._attributes = _MemPoolAttributes._init(ref)
+            self._attributes = _MemPoolAttributes._init(self._h_pool)
         return self._attributes

     @property
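
The behavioral difference between the old weakref-based attributes object and the new handle-based one can be sketched in plain Python. All class names here (`PoolHandle`, `Pool`, `WeakAttributes`, `HandleAttributes`) and the `reserved` field are hypothetical stand-ins, and an ordinary strong reference plays the role of the shared_ptr inside MemoryPoolHandle.

```python
import gc
import weakref


class PoolHandle:
    """Hypothetical stand-in for MemoryPoolHandle."""

    def __init__(self, reserved):
        self.reserved = reserved


class Pool:
    """Hypothetical stand-in for the memory resource that owns a handle."""

    def __init__(self):
        self.handle = PoolHandle(reserved=4096)


class WeakAttributes:
    """Old pattern: look the owner up through a weak reference on each access."""

    def __init__(self, pool):
        self._pool_ref = weakref.ref(pool)

    @property
    def reserved(self):
        pool = self._pool_ref()
        if pool is None:
            raise RuntimeError("pool is expired")
        return pool.handle.reserved


class HandleAttributes:
    """New pattern: hold the handle directly, which keeps it alive."""

    def __init__(self, handle):
        self._handle = handle

    @property
    def reserved(self):
        return self._handle.reserved
```

Once the owning `Pool` object is deleted, `WeakAttributes.reserved` raises the "expired" error while `HandleAttributes.reserved` keeps working, which matches the commit message's note that attributes remain accessible after the MR is deleted.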
