
Optimization: Accelerate Torchvision CPU Inference via Multi-threading and Parallelization #532

@gtfrans2re

Description


Currently, when evaluating models on local machines or laptops without CUDA-enabled GPUs in the Streamlit GUI Web App, the inference process is predominantly CPU-bound. In many environments, torchvision operations (such as data transformations and specific model layers) default to single-threaded execution or do not fully leverage available multi-core architectures. This leads to significant bottlenecks during model evaluation and testing, especially on datasets with high-resolution images.

The Problem:
Users without NVIDIA GPUs, myself included, face long wait times for inference tasks that could in principle be parallelized across the available CPU cores; currently, the only real workaround is to offload the work to a high-performance computing cluster.

Proposed Solution:
I propose introducing or enhancing parallelization backends to accelerate CPU-based inference. This would involve integrating or optimizing one of the following high-performance computing (HPC) frameworks within the CPU-specific kernels of the library:

  1. OpenMP: For shared-memory parallelization. This is the most portable and standard way to parallelize loops across CPU cores in C++/PyTorch extensions.
  2. Vectorization (AVX/SIMD): Ensuring that lower-level operations are optimized for modern CPU instruction sets.
  3. TBB (Intel Threading Building Blocks): An alternative to OpenMP that often provides better load balancing for task-based parallelism.
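As a rough illustration of option 1, a hot-path loop such as a per-pixel transform could be parallelized with a single OpenMP pragma. This is a minimal sketch, not torchvision's actual kernel code; `scale_pixels` is a hypothetical stand-in for an identified hot loop:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-pixel operation, sketched to show the OpenMP pattern.
// Real torchvision kernels live in C++/ATen; this is illustrative only.
void scale_pixels(std::vector<float>& pixels, float factor) {
    // Without -fopenmp the pragma is ignored and the loop runs serially,
    // so the function remains correct either way.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0;
         i < static_cast<std::ptrdiff_t>(pixels.size()); ++i) {
        pixels[i] *= factor;
    }
}
```

Because the iterations are independent, this loop satisfies OpenMP's canonical form and scales with the number of available cores when compiled with `-fopenmp`.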

Technical Implementation Plan:

  • Profiling: Identify the specific bottlenecks in the torchvision C++ kernels (e.g., Resize, Crop, or specific Op kernels) using tools like perf or valgrind/callgrind.
  • Parallel Implementation: Implement #pragma omp parallel for (for OpenMP) or equivalent TBB constructs in the identified hot-path loops.
  • Benchmark Suite: Create a comparison script to measure the speedup factor ($S = \frac{T_{serial}}{T_{parallel}}$) on standard architectures (x86/ARM).
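For the benchmark suite, the core of the comparison script is just a wall-clock timer and the speedup ratio from the plan above. A minimal sketch (the helper names `time_ms` and `speedup` are my own, not an existing API):

```cpp
#include <chrono>

// Run `fn` once and return elapsed wall time in milliseconds.
template <typename Fn>
double time_ms(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

// Speedup factor S = T_serial / T_parallel, as defined in the plan.
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}
```

In the actual suite, each kernel would be timed with one thread and then with all cores (averaged over several runs to reduce noise), on both x86 and ARM machines.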

Benefits:

  • Performance: Drastic reduction in inference time for researchers and students working on standard laptops.
  • Accessibility: Lowers the barrier to entry for users who do not have access to high-end GPU clusters.

Assignment Request

I have experience in parallel programming and currently work with GPUs on the Compute Canada Narval cluster for an autonomous-robotics master's thesis at UQAM, so I am well placed to help with this issue. I would like to take responsibility for this optimization.

Could you please assign this issue to me? I am ready to submit a PR once I have finalized the initial profiling and proof-of-concept.
