[RFC] Pipeline Diagnostic Mode #903

@mthrok

(Sharing broadly what we are working on. Feel free to comment. 😉)

Problem

When tackling a performance issue at the application level (such as in an AI training pipeline), it is difficult to pinpoint where the bottleneck is and what causes it. SPDL collects runtime performance statistics, which can help identify the bottleneck, but there are factors that these statistics cannot cover. The most frequently asked questions are related to the GIL, such as "Is there contention caused by the GIL?" and "Which function holds the GIL?" Since the GIL affects everything running in Python, each stage has performance implications for the other stages. Therefore, by observing a running pipeline alone, it is not possible to tell what contention (not restricted to the GIL) each stage function has.

The best way to answer such questions is to run a specific function in an isolated environment and check its scalability. For example, the Data Format and Performance section of the SPDL documentation shows what the performance trend of a pure-Python function (which holds the GIL) looks like. This trend was obtained by running that specific function in an SPDL pipeline.

It is tedious to write such a one-off script every time we want to check a function's scalability, yet this is very important information.

The performance trend of a function that releases the GIL can be found in the Overview section.

Approach

Introduce a diagnostic mode in Pipeline construction, which runs the stage functions separately and provides insight into their performance and scalability.

  1. When enabled, the spdl.pipeline.PipelineBuilder.build method does not build a Pipeline. Instead, it builds a system for profiling the stage functions one by one.
  2. In this profiling, each stage function is executed with cached inputs to measure its performance.
  3. When profiling a stage function, a parameter sweep is performed (for example, over concurrency, to check scalability).
  4. When done, the statistics are exported.
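
The steps above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the proposed implementation: measure_qps, sweep_concurrency and sleep_stage are hypothetical names, a thread pool stands in for however SPDL would actually schedule the stage, and time.sleep stands in for a stage function that releases the GIL.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_qps(fn, cached_inputs, concurrency):
    """Run ``fn`` over ``cached_inputs`` in a thread pool; return calls per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # Drain the iterator so every call has finished before the timer stops.
        for _ in pool.map(fn, cached_inputs):
            pass
    elapsed = time.perf_counter() - start
    return len(cached_inputs) / elapsed


def sweep_concurrency(fn, cached_inputs, concurrencies=(1, 2, 4, 8)):
    """Parameter sweep: map each concurrency to the QPS measured at that setting."""
    return {c: measure_qps(fn, cached_inputs, c) for c in concurrencies}


def sleep_stage(x):
    # Hypothetical stand-in for a stage function. time.sleep releases the GIL,
    # so throughput is expected to grow with the number of threads.
    time.sleep(0.01)
    return x


if __name__ == "__main__":
    # The cached inputs are reused for every concurrency setting.
    results = sweep_concurrency(sleep_stage, list(range(64)))
    for c, qps in results.items():
        print(f"concurrency={c}: {qps:.1f} QPS")
```

Replacing sleep_stage with a pure-Python, CPU-bound function would be expected to produce a roughly flat trend instead, which is the kind of contrast the diagnostic mode is meant to surface.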

The following figure illustrates the idea.

[Image: figure illustrating the diagnostic mode]

Expected Outcome

The following are the performance statistics the analysis can collect.

  1. The speed (QPS) of function execution at varying concurrency.
  2. [Stretch] The memory consumption (if the execution can be isolated in a process).
  3. [Stretch] The CPU consumption (if the execution can be isolated in a process).

By analyzing item 1, we expect to gain insight into the following:

  1. Whether a stage function has resource contention, including but not limited to the GIL.
  2. The ideal concurrency for the stage which yields the best performance.
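
As a rough sketch of how such QPS numbers could be interpreted: the helper names, the 5% tolerance, and the sample numbers below are all assumptions made for illustration, not part of the proposal or real measurements. Scaling efficiency compares the measured QPS with perfect linear scaling from the single-worker result; values far below 1.0 hint at contention.

```python
def scaling_efficiency(qps_by_concurrency):
    """Return QPS(c) / (c * QPS(1)) per concurrency; 1.0 means linear scaling.

    Requires a measurement at concurrency 1 to use as the baseline.
    """
    base = qps_by_concurrency[1]
    return {c: qps / (c * base) for c, qps in qps_by_concurrency.items()}


def pick_concurrency(qps_by_concurrency, tolerance=0.05):
    """Return the smallest concurrency whose QPS is within ``tolerance`` of the best."""
    best = max(qps_by_concurrency.values())
    threshold = best * (1 - tolerance)
    return min(c for c, qps in qps_by_concurrency.items() if qps >= threshold)


# Made-up numbers for illustration only.
gil_bound = {1: 100.0, 2: 98.0, 4: 97.0, 8: 95.0}    # flat trend: contention
gil_free = {1: 100.0, 2: 198.0, 4: 390.0, 8: 640.0}  # near-linear trend

if __name__ == "__main__":
    for name, data in (("gil_bound", gil_bound), ("gil_free", gil_free)):
        print(name, scaling_efficiency(data), pick_concurrency(data))
```

Choosing the smallest concurrency within a tolerance of the peak, rather than the peak itself, avoids spending extra threads for gains that are likely within measurement noise.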

Requirements

  • Requires no code change from the user.
    • Enabled with an environment variable, such as SPDL_PIPELINE_DIAGNOSTIC_MODE=1.
  • When the profiling is done, the process exits with sys.exit(1).
  • The result is easy to browse.
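
One way the first two requirements could fit together is sketched below. Only the variable name SPDL_PIPELINE_DIAGNOSTIC_MODE and the sys.exit(1) call come from the text above; diagnostic_mode_enabled, build_or_diagnose, and the two callables are hypothetical, and treating only the value "1" as enabled is an assumption.

```python
import os
import sys

ENV_VAR = "SPDL_PIPELINE_DIAGNOSTIC_MODE"


def diagnostic_mode_enabled(environ=None):
    """Return True when the diagnostic mode is requested via the environment."""
    environ = os.environ if environ is None else environ
    # Assumption: only the literal value "1" turns the mode on.
    return environ.get(ENV_VAR, "0").strip() == "1"


def build_or_diagnose(build_pipeline, run_diagnostics, environ=None):
    """Hypothetical entry point: profile and exit, or build the pipeline as usual."""
    if diagnostic_mode_enabled(environ):
        run_diagnostics()
        # The process stops here, so code that expects a Pipeline never runs.
        sys.exit(1)
    return build_pipeline()
```

Because the switch lives in the environment, the user's script stays unchanged between a normal run and a diagnostic run.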
