
Optimization: Accelerate Torchvision CPU Inference via Multi-threading and Parallelization #532

@gtfrans2re

Description


Currently, when evaluating models on local machines or laptops without CUDA-enabled GPUs in the Streamlit GUI Web App, the inference process is predominantly CPU-bound. In many environments, torchvision operations (such as data transformations and specific model layers) default to single-threaded execution or do not fully leverage available multi-core architectures. This leads to significant bottlenecks during model evaluation and testing, especially on datasets with high-resolution images.

The Problem:
Users without NVIDIA GPUs, myself included, face long wait times for inference tasks that could in principle be parallelized across the available CPU cores; currently, the only real workaround is to offload the work to a high-performance computing cluster.

Proposed Solution:
I propose introducing or enhancing parallelization backends to accelerate CPU-based inference. This would involve integrating or optimizing one of the following high-performance computing (HPC) frameworks within the CPU-specific kernels of the library:

  1. OpenMP: For shared-memory parallelization. This is the most portable and standard way to parallelize loops across CPU cores in C++/PyTorch extensions.
  2. Vectorization (AVX/SIMD): Ensuring that lower-level operations are optimized for modern CPU instruction sets.
  3. TBB (Intel Threading Building Blocks): An alternative to OpenMP that often provides better load balancing for task-based parallelism.
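As a rough illustration of option 1, a hot-path loop such as a per-pixel transform could be parallelized with a single OpenMP pragma. This is a minimal sketch, not torchvision's actual kernel code; `scale_pixels` is a hypothetical stand-in for an identified hot loop:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-pixel operation, sketched to show the OpenMP pattern.
// Real torchvision kernels live in C++/ATen; this is illustrative only.
void scale_pixels(std::vector<float>& pixels, float factor) {
    // Without -fopenmp the pragma is ignored and the loop runs serially,
    // so the function remains correct either way.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0;
         i < static_cast<std::ptrdiff_t>(pixels.size()); ++i) {
        pixels[i] *= factor;
    }
}
```

Because the iterations are independent, this loop satisfies OpenMP's canonical form and scales with the number of available cores when compiled with `-fopenmp`.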

Technical Implementation Plan:

  • Profiling: Identify the specific bottlenecks in the torchvision C++ kernels (e.g., Resize, Crop, or specific Op kernels) using tools like perf or valgrind/callgrind.
  • Parallel Implementation: Implement #pragma omp parallel for (for OpenMP) or equivalent TBB constructs in the identified hot-path loops.
  • Benchmark Suite: Create a comparison script to measure the speedup factor ($S = \frac{T_{serial}}{T_{parallel}}$) on standard architectures (x86/ARM).
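For the benchmark suite, the core of the comparison script is just a wall-clock timer and the speedup ratio from the plan above. A minimal sketch (the helper names `time_ms` and `speedup` are my own, not an existing API):

```cpp
#include <chrono>

// Run `fn` once and return elapsed wall time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Speedup factor S = T_serial / T_parallel, as defined in the plan.
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}
```

In the actual suite, each kernel would be timed with one thread and then with all cores (averaged over several runs to reduce noise), on both x86 and ARM machines.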

Benefits:

  • Performance: Drastic reduction in inference time for researchers and students working on standard laptops.
  • Accessibility: Lowers the barrier to entry for users who do not have access to high-end GPU clusters.

Assignment Request

I have experience in parallel programming and currently work with GPUs on the Compute Canada Narval cluster for an autonomous-robotics master's thesis at UQAM, so I am well placed to help with this issue. I would like to take responsibility for this optimization.

Could you please assign this issue to me? I am ready to submit a PR once I have finalized the initial profiling and proof-of-concept.
