Multithreading - threads block one another on return. #824

@jricker2


Describe the bug
When running workloads on multiple GPUs, with each workload on a separate thread that has its own context/queue/device, we see threads block until all are complete. I.e., a workload may finish very quickly on one GPU while another still has lots to do, and the thread that has completed all its work is stuck until the others finish. There is no data shared across threads.

Looking into it a bit further, the issue seems to be related to event cleanup: we have narrowed the blocked threads down to being stuck deleting events from kernel enqueues.
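One stdlib-only way to see which Python frame a joined-but-stuck thread is sitting in (a diagnostic sketch we're adding here, not part of the original repro; it shows Python frames only, so a spin inside clReleaseEvent will appear as the enclosing PyOpenCL call):

```python
# Diagnostic sketch: if the program is still running after `timeout_s`
# seconds, dump every thread's Python stack to stderr so we can see where
# each one is stuck. Harmless if everything finishes in time.
import faulthandler
import sys

def arm_hang_dump(timeout_s: float = 60.0) -> None:
    # exit=False: just dump the stacks, keep the process alive.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=sys.stderr)

def disarm_hang_dump() -> None:
    # Call this once the joins complete so no dump is emitted.
    faulthandler.cancel_dump_traceback_later()
```

Arming this before `t2.join()` and disarming after `t1.join()` would show the stacks only in the hung case.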

To Reproduce
Python script below; it assumes platforms[0] is NVIDIA and has multiple GPUs. I believe the same behavior reproduces with a CPU + GPU device pair.

This launches two threads with disparate amounts of work between them. During execution, t2 reaches the point of returning, but is not joined until t1 completes.

import threading
import pyopencl
import numpy

def gpu_workload(gpu_idx, num_indexes):
    platforms = pyopencl.get_platforms()
    devices = platforms[0].get_devices()
    device = devices[gpu_idx]
    context = pyopencl.Context([device])

    data = numpy.random.rand(num_indexes, num_indexes).astype(numpy.float32)
    dest = numpy.ones((num_indexes * num_indexes), dtype=numpy.float32)

    # create buffers
    with pyopencl.CommandQueue(context) as queue:
        data_buf = pyopencl.Buffer(
            queue.context,
            pyopencl.mem_flags.READ_ONLY | pyopencl.mem_flags.COPY_HOST_PTR,
            hostbuf=data,
        )

        dest_buf = pyopencl.Buffer(
            queue.context, pyopencl.mem_flags.WRITE_ONLY, dest.nbytes
        )

        # create kernel with disabled optimizations
        prg = pyopencl.Program(
            queue.context,
            """
        __kernel void foo(ushort n, __global float *a, __global float *c)
        {
          int gid = get_global_id(0);
          c[gid] = 0.0f;
          int rowC = gid/n;
          int colC = gid%n;
          __global float *pA = &a[rowC*n];
          __global float *pB = &a[colC];
          for(int k=0; k<n; k++)
          {
             pB = &a[colC+k*n];
             for(int j=0; j<1; j++)
             {
             c[gid] += (*(pA++))*(*pB);
             }
          }

          c[gid] -= c[gid];
        }
        """,
        ).build(options=["-cl-opt-disable"])
        kernel = prg.foo


        print(f"gpu {gpu_idx} work beginning")
        kernel_event = kernel(
            queue,
            dest.shape,
            None,
            numpy.uint16(num_indexes),
            data_buf,
            dest_buf,
        )
        # create output of all ones -> program should set it all to zeros
        output = numpy.ones_like(dest)
        copy_event = pyopencl.enqueue_copy(queue, output, dest_buf, is_blocking=True)
        print(f"gpu {gpu_idx} work done.")

    print(f"gpu {gpu_idx} returning")
    return


def main():
    t1 = threading.Thread(target=gpu_workload, args=(0, 4096 * 2,))
    t2 = threading.Thread(target=gpu_workload, args=(1, 4096,))
    t1.start()
    t2.start()
    t2.join()
    print("shorter gpu join done")
    t1.join()
    print("longer gpu join done")

if __name__ == "__main__":
    main()

Expected behavior
Threads using separate contexts/queues should not block one another, as far as I know.

Environment (please complete the following information):

  • OS: Linux SLES 15 SP6
  • ICD Loader and version: not sure
  • ICD and version: NVIDIA
  • CPU/GPU: Nvidia RTX A5000, 550/570 drivers tested
  • Python version: 3.9/3.10/3.11
  • PyOpenCL version: 2025.1

Additional context
In our full application we run something similar to this in an asyncio.to_thread as part of a larger RESTful API application. When the thread blocking occurs, it seems to be holding the GIL: we cannot interact with any part of our application, and it uses 100% of CPU during this time as well. From nsys it appears to be constantly calling clReleaseEvent.

We are fairly new to using PyOpenCL, so it could be that we are missing some property or documentation on thread safety/blocking across contexts. Thank you!
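A workaround we are considering in the meantime (a sketch under our assumption that the event-release loop holds the GIL; the helper name below is ours, not from PyOpenCL): give each device its own process instead of a thread, so a busy worker cannot stall the other workers' Python code.

```python
# Workaround sketch (our assumption, not a confirmed fix): one process per
# device instead of one thread per device. Each child creates its own
# pyopencl context/queue, exactly as gpu_workload does today, so a
# GIL-holding release loop in one worker cannot freeze the others.
import multiprocessing as mp

def run_per_device(worker, per_device_args):
    procs = [mp.Process(target=worker, args=args) for args in per_device_args]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

# usage: run_per_device(gpu_workload, [(0, 4096 * 2), (1, 4096)])
```

This trades thread-level sharing for process isolation, which our workloads can tolerate since no data is shared across threads anyway.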
