Multithreading - threads block one another on return. #824

@jricker2


Describe the bug
When running workloads on multiple GPUs, with each workload on a separate thread that has its own context/queue/device, we see threads block until all are complete. I.e., a workload may finish very quickly on one GPU while another still has lots to do, and the thread that has completed all its work is stuck until the others finish. There is no data shared across threads.

Looking into it a bit further, the issue seems to be related to event cleanup: we have narrowed the blocked threads down to being stuck deleting events from kernel enqueues.
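One stdlib-only way to see which Python frame a joined-but-stuck thread is sitting in (a diagnostic sketch we're adding here, not part of the original repro; it shows Python frames only, so a spin inside clReleaseEvent will appear as the enclosing PyOpenCL call):

```python
# Diagnostic sketch: if the program is still running after `timeout_s`
# seconds, dump every thread's Python stack to stderr so we can see where
# each one is stuck. Harmless if everything finishes in time.
import faulthandler
import sys

def arm_hang_dump(timeout_s: float = 60.0) -> None:
    # exit=False: just dump the stacks, keep the process alive.
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=sys.stderr)

def disarm_hang_dump() -> None:
    # Call this once the joins complete so no dump is emitted.
    faulthandler.cancel_dump_traceback_later()
```

Arming this before `t2.join()` and disarming after `t1.join()` would show the stacks only in the hung case.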

To Reproduce
Python script below; it assumes platforms[0] is NVIDIA and has multiple GPUs. I believe the same behavior reproduces with a CPU + GPU device pair.

This launches two threads with disparate amounts of work between them. During execution, t2 reaches the point of returning, but is not joined until t1 completes.

import threading
import pyopencl
import numpy

def gpu_workload(gpu_idx, num_indexes):
    platforms = pyopencl.get_platforms()
    devices = platforms[0].get_devices()
    device = devices[gpu_idx]
    context = pyopencl.Context([device])

    data = numpy.random.rand(num_indexes, num_indexes).astype(numpy.float32)
    dest = numpy.ones((num_indexes * num_indexes), dtype=numpy.float32)

    # create buffers
    with pyopencl.CommandQueue(context) as queue:
        data_buf = pyopencl.Buffer(
            queue.context,
            pyopencl.mem_flags.READ_ONLY | pyopencl.mem_flags.COPY_HOST_PTR,
            hostbuf=data,
        )

        dest_buf = pyopencl.Buffer(
            queue.context, pyopencl.mem_flags.WRITE_ONLY, dest.nbytes
        )

        # create kernel with disabled optimizations
        prg = pyopencl.Program(
            queue.context,
            """
        __kernel void foo(ushort n, __global float *a, __global float *c)
        {
          int gid = get_global_id(0);
          c[gid] = 0.0f;
          int rowC = gid/n;
          int colC = gid%n;
          __global float *pA = &a[rowC*n];
          __global float *pB = &a[colC];
          for(int k=0; k<n; k++)
          {
             pB = &a[colC+k*n];
             for(int j=0; j<1; j++)
             {
             c[gid] += (*(pA++))*(*pB);
             }
          }

          c[gid] -= c[gid];
        }
        """,
        ).build(options=["-cl-opt-disable"])
        kernel = prg.foo


        print(f"gpu {gpu_idx} work beginning")
        kernel_event = kernel(
            queue,
            dest.shape,
            None,
            numpy.uint16(num_indexes),
            data_buf,
            dest_buf,
        )
        # create output of all ones -> program should set it all to zeros
        output = numpy.ones_like(dest)
        copy_event = pyopencl.enqueue_copy(queue, output, dest_buf, is_blocking=True)
        print(f"gpu {gpu_idx} work done.")

    print(f"gpu {gpu_idx} returning")
    return


def main():
    t1 = threading.Thread(target=gpu_workload, args=(0, 4096 * 2,))
    t2 = threading.Thread(target=gpu_workload, args=(1, 4096,))
    t1.start()
    t2.start()
    t2.join()
    print("shorter gpu join done")
    t1.join()
    print("longer gpu join done")

if __name__ == "__main__":
    main()

Expected behavior
Threads using separate contexts/queues should not block one another, as far as I know.

Environment (please complete the following information):

  • OS: Linux SLES 15 SP6
  • ICD Loader and version: not sure
  • ICD and version: NVIDIA
  • CPU/GPU: Nvidia RTX A5000, 550/570 drivers tested
  • Python version: 3.9/3.10/3.11
  • PyOpenCL version: 2025.1

Additional context
In our full application we run something similar to this in an asyncio.to_thread as part of a larger RESTful API application. When the thread blocking occurs, it seems to be holding the GIL: we cannot interact with any part of our application, and it uses 100% of CPU during this time as well. From nsys it appears to be constantly calling clReleaseEvent.

We are fairly new to using PyOpenCL, so it could be that we are missing some property or documentation on thread safety/blocking across contexts. Thank you!
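A workaround we are considering in the meantime (a sketch under our assumption that the event-release loop holds the GIL; the helper name below is ours, not from PyOpenCL): give each device its own process instead of a thread, so a busy worker cannot stall the other workers' Python code.

```python
# Workaround sketch (our assumption, not a confirmed fix): one process per
# device instead of one thread per device. Each child creates its own
# pyopencl context/queue, exactly as gpu_workload does today, so a
# GIL-holding release loop in one worker cannot freeze the others.
import multiprocessing as mp

def run_per_device(worker, per_device_args):
    procs = [mp.Process(target=worker, args=args) for args in per_device_args]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

# usage: run_per_device(gpu_workload, [(0, 4096 * 2), (1, 4096)])
```

This trades thread-level sharing for process isolation, which our workloads can tolerate since no data is shared across threads anyway.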
