[RFC] Pipeline Diagnostic Mode

(Sharing broadly what we are working on. Feel free to comment. 😉)

# [RFC] Pipeline Diagnostic Mode

# Problem

When tackling a performance issue at the application level (such as an AI training pipeline), it is difficult to pinpoint where the bottleneck is and what causes it. SPDL [collects runtime performance statistics](https://facebookresearch.github.io/spdl/main/getting_started/logging.html), which can help identify the performance bottleneck, but there are factors that these statistics cannot cover. The most frequently asked questions are related to the GIL, such as "Is there contention caused by the GIL?" and "Which function holds the GIL?" Since the GIL affects everything running in Python, each stage has performance implications for the other stages. Therefore, when all stages run together, it is not possible to tell what contention (not restricted to the GIL) each stage function has.

The best way to answer such questions is to run a specific function in an isolated environment and check its scalability. For example, the [Data Format and Performance](https://facebookresearch.github.io/spdl/main/case_studies/data_format.html) section of the SPDL documentation shows what the performance trend of a pure-Python function (which holds the GIL) looks like.**†** This trend was obtained by running the specific function in an SPDL pipeline.

Writing such a one-off script every time we want to check a function's scalability is tedious, yet the information it yields is very important.

**†** The performance trend of a function that releases the GIL can be found in [the Overview section](https://facebookresearch.github.io/spdl/main/overview.html).
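For reference, the kind of one-off script described above might look like the following. This is a minimal sketch that uses only the Python standard library rather than an SPDL pipeline; `busy_python` and `measure_qps` are hypothetical names introduced for illustration, and the loop size and call counts are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def busy_python(n: int = 20_000) -> int:
    """A pure-Python loop; it holds the GIL while it runs (hypothetical example)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def measure_qps(fn, concurrency: int, num_calls: int = 200) -> float:
    """Call `fn` `num_calls` times on `concurrency` threads; return calls per second."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        t0 = time.perf_counter()
        futures = [pool.submit(fn) for _ in range(num_calls)]
        for future in futures:
            future.result()  # re-raises any exception from the worker thread
        elapsed = time.perf_counter() - t0
    return num_calls / elapsed


if __name__ == "__main__":
    # Sweep the thread count and print the throughput at each setting.
    for concurrency in (1, 2, 4, 8):
        qps = measure_qps(busy_python, concurrency)
        print(f"concurrency={concurrency}: {qps:.1f} QPS")
```

If the QPS stays roughly flat as the thread count grows, the function is likely serialized by some shared resource such as the GIL; if it grows with the thread count, the function scales.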

# Approach

Introduce a diagnostic mode in Pipeline construction, which runs stage functions separately and provides insight into their performance and scalability.

1. When enabled, the `spdl.pipeline.PipelineBuilder.build` method does not build a `Pipeline`. Instead, it builds a system for profiling the stage functions one by one.  
2. In this profiling, each stage function is executed with cached inputs to measure its performance.  
3. When profiling a stage function, a parameter sweep is performed (for example, over the concurrency, to check the scalability).  
4. When done, the statistics are exported.

The following figure illustrates the idea.

<img width="1152" height="864" alt="Image" src="https://github.com/user-attachments/assets/45806267-b808-443e-a238-069a0c2d9f4a" />
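The four steps above can be sketched roughly as follows. This is not SPDL's implementation: `profile_stage` and `run_diagnostics` are hypothetical names, thread-based concurrency is assumed, and feeding each stage's outputs to the next stage as its cached inputs is one possible design, not something this RFC specifies.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor


def profile_stage(fn, cached_inputs, concurrencies=(1, 2, 4)):
    """Sweep the concurrency for one stage function, feeding it cached inputs.

    Returns the QPS per concurrency and the outputs of the last run.
    """
    qps = {}
    outputs = []
    for concurrency in concurrencies:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            t0 = time.perf_counter()
            outputs = list(pool.map(fn, cached_inputs))
            elapsed = time.perf_counter() - t0
        # Guard against a zero reading from the timer on very fast stages.
        qps[concurrency] = len(cached_inputs) / max(elapsed, 1e-9)
    return qps, outputs


def run_diagnostics(source, stages, concurrencies=(1, 2, 4)):
    """Profile stages one by one instead of running them as a pipeline."""
    cached = list(source)  # cache the source items once, up front
    report = {}
    for name, fn in stages:
        # Assumption: the outputs of a stage serve as cached inputs for the next.
        report[name], cached = profile_stage(fn, cached, concurrencies)
    return report


if __name__ == "__main__":
    stages = [("double", lambda x: x * 2), ("to_str", str)]
    report = run_diagnostics(range(1000), stages)
    print(json.dumps(report, indent=2))  # export the statistics as JSON
```

Because inputs are cached before timing starts, each measurement reflects only the stage function itself, not the upstream stages that would normally feed it.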

# Expected Outcome

The analysis can collect the following performance statistics.

1. The speed (QPS) of function execution, while changing the concurrency.  
2. \[Stretch\] The memory consumption (if the execution can be isolated in a process).  
3. \[Stretch\] The CPU consumption (if the execution can be isolated in a process).

By analyzing item 1, we expect to gain insight into the following:

1. Whether a stage function has resource contention, including, but not limited to, the GIL.  
2. The ideal concurrency for the stage which yields the best performance.
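As an illustration of how the QPS measurements could yield these two insights, here is a small sketch. The helper names, the 5% tolerance, and the sample numbers are all hypothetical; the numbers are made up to show the shape of the analysis and are not measurements.

```python
def scaling_efficiency(qps_by_concurrency: dict[int, float]) -> dict[int, float]:
    """Measured QPS divided by ideal linear scaling from the lowest concurrency.

    Values near 1.0 mean the function scales; values that fall quickly toward
    1/concurrency suggest contention on a shared resource.
    """
    base_c = min(qps_by_concurrency)
    base_qps = qps_by_concurrency[base_c]
    return {
        c: qps / (base_qps * c / base_c)
        for c, qps in sorted(qps_by_concurrency.items())
    }


def best_concurrency(qps_by_concurrency: dict[int, float], tolerance: float = 0.05) -> int:
    """Smallest concurrency whose QPS is within `tolerance` of the peak QPS."""
    peak = max(qps_by_concurrency.values())
    return min(
        c for c, qps in qps_by_concurrency.items() if qps >= peak * (1 - tolerance)
    )


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    contended = {1: 100.0, 2: 98.0, 4: 95.0, 8: 90.0}
    scalable = {1: 100.0, 2: 195.0, 4: 390.0, 8: 400.0}
    for label, data in (("contended", contended), ("scalable", scalable)):
        print(label, scaling_efficiency(data), "best:", best_concurrency(data))
```

Picking the smallest concurrency near the peak, rather than the peak itself, avoids spending extra threads on a gain that is within measurement noise.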

# Requirements

- Requires no code change from the user.  
  - Enabled with an environment variable, such as `SPDL_PIPELINE_DIAGNOSTIC_MODE=1`  
- When the profiling is done, the system exits (`sys.exit(1)`).  
- The result is easy to browse.
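A minimal sketch of how the environment-variable switch and the exit behavior could fit together is shown below. The function names and the `build` wrapper are hypothetical and do not reflect SPDL's actual code; accepting values other than `1` (such as `true`) is also an assumption, since this RFC only mentions `SPDL_PIPELINE_DIAGNOSTIC_MODE=1`.

```python
import os
import sys

_ENV_VAR = "SPDL_PIPELINE_DIAGNOSTIC_MODE"


def diagnostic_mode_enabled(environ=None) -> bool:
    """Return True if the diagnostic mode is requested via the environment."""
    env = os.environ if environ is None else environ
    # Assumption: a few common truthy spellings are accepted in addition to "1".
    return env.get(_ENV_VAR, "").strip().lower() in {"1", "true", "yes", "on"}


def build(make_pipeline, run_diagnostics, environ=None):
    """Hypothetical stand-in for a build method that honors the diagnostic mode."""
    if diagnostic_mode_enabled(environ):
        run_diagnostics()  # profile the stages and export the statistics
        sys.exit(1)  # exit once profiling is done, as stated in the requirements
    return make_pipeline()
```

With this shape, user code that calls `build` stays unchanged; running the same script with the variable set switches it from building the pipeline to profiling.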
