Currently, when evaluating models in the Streamlit GUI web app on local machines or laptops without CUDA-enabled GPUs, the inference process is predominantly CPU-bound. In many environments, torchvision operations (such as data transformations and specific model layers) default to single-threaded execution or do not fully leverage available multi-core architectures. This leads to significant bottlenecks during model evaluation and testing, especially on datasets with high-resolution images.
The Problem:
Users without NVIDIA GPUs (myself included) experience long wait times for inference tasks that could, in principle, be parallelized across the available CPU cores; the only current workaround is to offload inference to high-performance computing clusters.
Proposed Solution:
I propose introducing or enhancing parallelization backends to accelerate CPU-based inference. This would involve integrating or optimizing one of the following high-performance computing (HPC) frameworks within the CPU-specific kernels of the library:
- OpenMP: For shared-memory parallelization. This is the most portable and standard way to parallelize loops across CPU cores in C++/PyTorch extensions.
- Vectorization (AVX/SIMD): Ensuring that lower-level operations are optimized for modern CPU instruction sets.
- TBB (Intel Threading Building Blocks): An alternative to OpenMP that often provides better load balancing for task-based parallelism.
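Worth noting before adding anything new: PyTorch already ships with a configurable CPU parallel backend (OpenMP, TBB, or a native thread pool, depending on how the wheel was built), so part of this work may be tuning rather than reimplementing. A minimal sketch for inspecting and adjusting it from Python (the thread count of 4 is illustrative, not a recommendation):

```python
import torch

# Show which CPU parallel backend this PyTorch build uses
# (OpenMP, TBB, or the native thread pool) and its thread counts.
print(torch.__config__.parallel_info())

# Intra-op threads parallelize work *inside* a single operator
# (e.g. one convolution); this is where OpenMP/TBB take effect.
torch.set_num_threads(4)  # illustrative count, tune per machine
print(torch.get_num_threads())  # -> 4
```

This only controls intra-op parallelism; independent operators can additionally run concurrently via inter-op threads, which is a separate knob.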
Technical Implementation Plan:
- Profiling: Identify the specific bottlenecks in the `torchvision` C++ kernels (e.g., `Resize`, `Crop`, or specific Op kernels) using tools like `valgrind` or `perf`.
- Parallel Implementation: Implement `#pragma omp parallel for` (for OpenMP) or equivalent TBB constructs in the identified hot-path loops.
- Benchmark Suite: Create a comparison script to measure the speedup factor ($S = \frac{T_{serial}}{T_{parallel}}$) on standard architectures (x86/ARM).
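As a rough starting point for that benchmark suite, the speedup factor can be measured by timing the same op with 1 thread versus all cores. The sketch below uses a placeholder workload (`avg_pool2d` on a random batch); the real target op would come out of the profiling step:

```python
import os
import time
import torch

def best_time(fn, warmup=3, runs=10):
    """Best wall-clock time of fn() over several runs, after warm-up."""
    for _ in range(warmup):
        fn()
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Illustrative CPU-bound workload standing in for a torchvision
# hot-path kernel; shapes and op are placeholders.
x = torch.randn(8, 3, 512, 512)

def op():
    return torch.nn.functional.avg_pool2d(x, kernel_size=3, stride=1, padding=1)

torch.set_num_threads(1)
t_serial = best_time(op)

torch.set_num_threads(os.cpu_count() or 1)
t_parallel = best_time(op)

print(f"T_serial={t_serial:.4f}s  T_parallel={t_parallel:.4f}s  "
      f"S={t_serial / t_parallel:.2f}x")
```

Taking the best of several runs (rather than the mean) reduces noise from OS scheduling; on a single-core CI runner the measured S will of course be close to 1.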
Benefits:
- Performance: Drastic reduction in inference time for researchers and students working on standard laptops.
- Accessibility: Lowers the barrier to entry for users who do not have access to high-end GPU clusters.
Assignment Request
I have experience in parallel programming and currently work with GPUs on the Compute Canada Narval cluster for my master's thesis in autonomous robotics at UQAM, which positions me well to help with this issue. I would like to take responsibility for this optimization.
Could you please assign this issue to me? I am ready to submit a PR once I have finalized the initial profiling and proof-of-concept.