(Sharing broadly what we are working on. Feel free to comment. 😉)
[RFC] Pipeline Diagnostic Mode
Problem
When tackling a performance issue at the application level (such as an AI training pipeline), it is difficult to pinpoint where the bottleneck is and what causes it. SPDL collects runtime performance statistics, which can help identify the bottleneck, but there are factors these statistics cannot cover. The most frequently asked questions are related to the GIL, such as "Is there contention caused by the GIL?" and "Which function holds the GIL?" Since the GIL affects everything running in Python, each stage has performance implications for the other stages. Therefore, from a running pipeline alone, it is not possible to tell what contention (not restricted to the GIL) each stage function has.
The best way to answer such a question is to run a specific function in an isolated environment and check its scalability. For example, the Data Format and Performance section of the SPDL documentation shows what the performance trend of a pure-Python function (which holds the GIL) looks like.† This trend is obtained by running the specific function in an SPDL pipeline.
It is tedious to write such a one-off script every time we want to check a function's scalability, yet this is very important information.
† The performance trend of a function that releases the GIL can be found in the Overview section.
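For illustration, a one-off scalability check of the kind described above might look like the following. This is a minimal sketch using only the Python standard library (a thread pool rather than an SPDL pipeline); pure_python_work and measure_qps are hypothetical names, not part of SPDL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def pure_python_work(n: int) -> int:
    # Hypothetical stand-in for a stage function. It is pure Python,
    # so it holds the GIL for its whole duration.
    return sum(i * i for i in range(n))


def measure_qps(fn, arg, concurrency: int, num_calls: int = 200) -> float:
    """Call fn(arg) num_calls times on `concurrency` threads and return calls per second."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        start = time.perf_counter()
        list(pool.map(fn, [arg] * num_calls))  # wait for all calls to finish
        elapsed = time.perf_counter() - start
    return num_calls / elapsed


if __name__ == "__main__":
    # Sweep the concurrency and print the throughput at each setting.
    for concurrency in (1, 2, 4, 8):
        qps = measure_qps(pure_python_work, 10_000, concurrency)
        print(f"concurrency={concurrency}: {qps:.0f} QPS")
```

On a standard (GIL-enabled) CPython build, a function like this is expected to show little or no throughput gain as the thread count grows, while a function that releases the GIL should scale better.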
Approach
Introduce a diagnostic mode in Pipeline construction, which runs stage functions separately and provides insight into their performance and scalability.
- When enabled, the spdl.pipeline.PipelineBuilder.build method does not build a Pipeline. Instead, it builds a system for profiling the stage functions one by one.
- In this profiling, the stage function is executed with cached inputs to measure its performance.
- When profiling a stage function, a parameter sweep is performed (such as over concurrency, to check scalability).
- When done, the statistics are exported.
The following figure illustrates the idea.
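The steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: it uses a standard-library thread pool instead of SPDL's executor, and profile_stages, decode and augment are hypothetical names. Each stage is timed in isolation on cached inputs, and its outputs become the cached inputs for the next stage.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def profile_stages(source, stages, concurrencies=(1, 2, 4), num_inputs=100):
    """Profile each stage function in isolation.

    Returns {stage_name: {concurrency: qps}}.
    """
    # Cache a fixed number of items from the source as inputs to the first stage.
    cached = list(islice(source, num_inputs))
    results = {}
    for fn in stages:
        stats = {}
        outputs = cached
        for concurrency in concurrencies:  # parameter sweep over concurrency
            with ThreadPoolExecutor(max_workers=concurrency) as pool:
                start = time.perf_counter()
                outputs = list(pool.map(fn, cached))
                elapsed = time.perf_counter() - start
            stats[concurrency] = len(cached) / elapsed
        results[fn.__name__] = stats
        # The outputs of this stage serve as cached inputs for the next stage.
        cached = outputs
    return results


# Hypothetical stage functions, used only to exercise the sketch.
def decode(x):
    return x * 2


def augment(x):
    return x + 1


if __name__ == "__main__":
    for name, stats in profile_stages(iter(range(1000)), [decode, augment]).items():
        print(name, {c: round(q) for c, q in stats.items()})
```

Feeding every stage from a cache removes the influence of upstream stages, so the measured throughput reflects only the stage function itself.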
Expected Outcome
The following are the performance statistics the analysis can collect.
1. The speed (QPS) of function execution, while changing the concurrency.
2. [Stretch] The memory consumption (if the execution can be isolated in a process).
3. [Stretch] The CPU consumption (if the execution can be isolated in a process).
By analyzing item 1, we can expect to gain insight into the following:
- Whether a stage function has resource contention, including but not limited to the GIL.
- The ideal concurrency for the stage which yields the best performance.
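As a rough illustration of how the QPS-versus-concurrency statistic could be turned into these two insights, the sketch below computes a scaling efficiency relative to the lowest measured concurrency. summarize_scaling is a hypothetical helper, the sample numbers are made up for demonstration, and how low the efficiency must fall before it counts as contention is left to the reader.

```python
def summarize_scaling(qps_by_concurrency: dict[int, float]) -> dict:
    """Summarize a QPS-vs-concurrency sweep.

    Efficiency is measured QPS divided by the QPS expected under perfect
    linear scaling from the lowest measured concurrency. Values well below
    1.0 suggest resource contention (the GIL or otherwise).
    """
    base_c = min(qps_by_concurrency)
    per_worker_qps = qps_by_concurrency[base_c] / base_c
    efficiency = {
        c: qps / (per_worker_qps * c)
        for c, qps in sorted(qps_by_concurrency.items())
    }
    # The concurrency that yielded the highest throughput in the sweep.
    best = max(qps_by_concurrency, key=qps_by_concurrency.get)
    return {"best_concurrency": best, "efficiency": efficiency}


if __name__ == "__main__":
    # Made-up sample numbers: throughput saturates at 4 workers, then regresses.
    sample = {1: 100.0, 2: 190.0, 4: 200.0, 8: 180.0}
    print(summarize_scaling(sample))
```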
Requirements
- Requires no code change from the user.
- Enabled with an environment variable, such as SPDL_PIPELINE_DIAGNOSTIC_MODE=1.
- When the profiling is done, the system exits (sys.exit(1)).
- The result is easy to browse.
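The first three requirements could be met with a gate along the following lines. This is only a sketch of the control flow, assuming the variable name given above and the value "1" as the enabling value; build, build_pipeline and run_diagnostics are hypothetical placeholders and not the actual SPDL API.

```python
import os
import sys

ENV_VAR = "SPDL_PIPELINE_DIAGNOSTIC_MODE"


def diagnostic_mode_enabled(environ=os.environ) -> bool:
    # Assumption: only the value "1" enables the mode.
    return environ.get(ENV_VAR, "0") == "1"


def build(build_pipeline, run_diagnostics, environ=os.environ):
    """Hypothetical stand-in for the build step.

    build_pipeline: callable that constructs and returns the normal pipeline.
    run_diagnostics: callable that profiles the stages and exports the statistics.
    """
    if diagnostic_mode_enabled(environ):
        run_diagnostics()
        sys.exit(1)  # exit once profiling is done, as stated in the requirements
    return build_pipeline()
```

Because the switch lives inside the build step, user code stays unchanged: the same script either builds the pipeline or runs the diagnostics, depending only on the environment.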