Rlin1027/dualpipe-enhanced

DualPipe Enhanced

Analytical tools and benchmarking framework for DualPipe and DualPipeV bidirectional pipeline parallelism algorithms from the DeepSeek-V3 Technical Report.

Pure Python • Zero Dependencies • No GPU Required

This enhanced version provides three modules for analyzing, simulating, and benchmarking bidirectional pipeline parallelism strategies—all in pure Python with no external dependencies. Understand DualPipe scheduling, compare strategies, estimate memory, and find optimal configurations without running distributed training.

What is DualPipe?

DualPipe is a bidirectional pipeline parallelism algorithm that fully overlaps forward and backward computation with communication while reducing pipeline bubbles. Unlike unidirectional schedules such as GPipe and 1F1B, DualPipe feeds micro-batches into the pipeline from both ends at once, so each rank holds two model chunks, one for each direction.

For detailed information, refer to the DeepSeek-V3 Technical Report and profile data.

Visual Guide

DualPipe Scheduling:

[Figure: DualPipe schedule diagram]

Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. Two cells enclosed by a shared black border have mutually overlapped computation and communication.

DualPipeV Scheduling:

[Figure: DualPipeV schedule diagram]

Example DualPipeV scheduling for 4 PP ranks (8 PP stages) and 10 micro-batches. DualPipeV is a concise V-shape schedule derived from DualPipe using a "cut-in-half" procedure (thanks to Sea AI Lab for the blog post).
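The two layouts can be made concrete with a small sketch of which pipeline stages each rank holds. The function names are illustrative and not part of this repository. The mapping is an assumption based on the captions above: a DualPipe rank holds stage `i` plus its mirror stage, and a DualPipeV rank holds two of the `2 * num_ranks` stages folded into a V.

```python
from typing import List, Tuple


def dualpipe_stage_pairs(num_ranks: int) -> List[Tuple[int, int]]:
    """Assumed DualPipe placement: rank i holds stage i and mirror stage PP-1-i."""
    if num_ranks < 2 or num_ranks % 2 != 0:
        raise ValueError("DualPipe assumes an even number of PP ranks")
    return [(rank, num_ranks - 1 - rank) for rank in range(num_ranks)]


def dualpipev_stage_pairs(num_ranks: int) -> List[Tuple[int, int]]:
    """Assumed DualPipeV placement: 2 * num_ranks stages folded into a V shape."""
    num_stages = 2 * num_ranks
    return [(rank, num_stages - 1 - rank) for rank in range(num_ranks)]


# 8 ranks hold 8 stages in both directions (DualPipe).
print(dualpipe_stage_pairs(8))
# 4 ranks hold 8 stages, two per rank (DualPipeV).
print(dualpipev_stage_pairs(4))
```

Under this assumption, the first DualPipeV rank owns both the first and the last stage, which is what makes the schedule V-shaped.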

Enhanced Modules

Module 1: Pipeline Analysis (dualpipe_enhanced/analysis.py) - 49 tests

Computational analysis of pipeline scheduling without a distributed runtime.

Key Classes:

  • ScheduleAnalyzer: Generate operation timelines for DualPipe/DualPipeV, compute bubble ratios, find optimal chunk counts
  • BubbleCalculator: Calculate and compare bubble ratios across all strategies (DualPipe, DualPipeV, 1F1B, GPipe)
  • CommunicationAnalyzer: Count P2P operations, estimate communication volume, analyze bandwidth requirements

Module 2: Schedule Simulator (dualpipe_enhanced/simulator.py) - 43 tests

Pure Python simulation of full pipeline execution across ranks and microbatches.

Key Classes:

  • PipelineEvent: Dataclass representing individual pipeline events (forward, backward, weight, communication)
  • ScheduleSimulator: Simulate complete schedules for DualPipe, DualPipeV, 1F1B, and GPipe strategies
  • ScheduleComparator: Compare strategies by makespan, bubble ratio, memory factor with ASCII reports
  • ConfigOptimizer: Find optimal chunk counts, run sensitivity analysis, determine communication thresholds
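To show the kind of event-level simulation this module describes, here is a minimal GPipe-only sketch. The `Event` dataclass and `simulate_gpipe` function are hypothetical stand-ins, not the repository's `PipelineEvent` or `ScheduleSimulator`. The cost model (fixed forward and backward times, no communication cost) is an assumption, so its numbers are not expected to match the reports shown elsewhere in this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    rank: int
    microbatch: int
    kind: str  # "F" for forward, "B" for backward
    start: float
    end: float


def simulate_gpipe(num_ranks: int, num_chunks: int,
                   f_time: float = 1.0, b_time: float = 2.0) -> List[Event]:
    """Simulate GPipe: all forwards first, then all backwards (assumed cost model)."""
    events: List[Event] = []
    rank_free = [0.0] * num_ranks  # time at which each rank becomes idle
    fwd_end = [[0.0] * num_chunks for _ in range(num_ranks)]
    # Forward phase: micro-batch m reaches rank r after finishing on rank r - 1.
    for m in range(num_chunks):
        for r in range(num_ranks):
            ready = fwd_end[r - 1][m] if r > 0 else 0.0
            start = max(ready, rank_free[r])
            rank_free[r] = fwd_end[r][m] = start + f_time
            events.append(Event(r, m, "F", start, start + f_time))
    # Backward phase: gradients flow from the last rank back to the first.
    bwd_end = [[0.0] * num_chunks for _ in range(num_ranks)]
    for m in range(num_chunks):
        for r in reversed(range(num_ranks)):
            ready = bwd_end[r + 1][m] if r < num_ranks - 1 else 0.0
            start = max(ready, rank_free[r])
            rank_free[r] = bwd_end[r][m] = start + b_time
            events.append(Event(r, m, "B", start, start + b_time))
    return events


events = simulate_gpipe(num_ranks=4, num_chunks=8)
makespan = max(e.end for e in events)
busy_per_rank = 8 * (1.0 + 2.0)
print(f"makespan={makespan}, bubble_ratio={1 - busy_per_rank / makespan:.3f}")
```

With these assumed costs the makespan equals `(num_chunks + num_ranks - 1) * (f_time + b_time)`, the usual closed form for GPipe.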

Module 3: Benchmarking & Performance Estimation (dualpipe_enhanced/benchmark.py) - 40 tests

Estimate memory usage, throughput, and scalability characteristics.

Key Classes:

  • MemoryEstimator: Estimate activation, parameter, and total memory per rank for each strategy
  • ThroughputEstimator: Estimate step time, model FLOPs utilization (MFU), samples/tokens per second
  • ScalabilityAnalyzer: Perform weak and strong scaling analysis, find scaling limits
  • ConfigurationSearch: Search for optimal (num_ranks, num_chunks) configurations under memory constraints
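For intuition about activation memory, a back-of-the-envelope estimate can be sketched as below. This is not how `MemoryEstimator` works internally. The `peak_activation_mb` helper is hypothetical and makes strong simplifying assumptions: one stored tensor of shape (micro-batch, seq_len, hidden_dim) per in-flight micro-batch, 2 bytes per element, and in-flight counts taken from the "Activation Per Device" column of the comparison table (PP for 1F1B, PP+1 for DualPipe) with all micro-batches held for GPipe.

```python
def peak_activation_mb(strategy: str, num_ranks: int, num_chunks: int,
                       batch_size: int, seq_len: int, hidden_dim: int,
                       bytes_per_element: int = 2) -> float:
    """Rough peak activation memory per rank in MiB (assumed model, see lead-in)."""
    in_flight_by_strategy = {
        "gpipe": num_chunks,        # every micro-batch is kept until backward starts
        "1f1b": num_ranks,          # "PP" in the comparison table
        "dualpipe": num_ranks + 1,  # "PP+1" in the comparison table
    }
    if strategy not in in_flight_by_strategy:
        raise ValueError(f"unknown strategy: {strategy}")
    # A rank can never hold more micro-batches than exist in the step.
    in_flight = min(in_flight_by_strategy[strategy], num_chunks)
    micro_batch = batch_size // num_chunks
    bytes_per_chunk = micro_batch * seq_len * hidden_dim * bytes_per_element
    return in_flight * bytes_per_chunk / 2**20


for name in ("dualpipe", "1f1b", "gpipe"):
    mb = peak_activation_mb(name, num_ranks=4, num_chunks=16,
                            batch_size=64, seq_len=2048, hidden_dim=1024)
    print(f"{name}: {mb:.1f} MB activation (rough estimate)")
```

Real activation footprints depend on layer count, attention implementation, and recomputation, so treat these figures as relative rather than absolute.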

Installation

Clone the repository (no pip install needed):

git clone https://github.com/Rlin1027/dualpipe-enhanced.git
cd dualpipe-enhanced

Verify the enhanced modules are present:

ls dualpipe_enhanced/
# Expected: __init__.py, analysis.py, simulator.py, benchmark.py

Quick Start

Example 1: Compare Bubble Ratios

from dualpipe_enhanced.analysis import BubbleCalculator

# Compare bubble ratios for 4 pipeline stages, 16 microbatches
calc = BubbleCalculator(num_ranks=4)
num_chunks = 16

dualpipe_result = calc.dualpipe_bubble(num_chunks)
dualpipev_result = calc.dualpipev_bubble(num_chunks)
gpipe_result = calc.gpipe_bubble(num_chunks)

print(f"DualPipe   bubble_ratio={dualpipe_result['bubble_ratio']:.3f}")
print(f"DualPipeV  bubble_ratio={dualpipev_result['bubble_ratio']:.3f}")
print(f"GPipe      bubble_ratio={gpipe_result['bubble_ratio']:.3f}")

Output:

DualPipe   bubble_ratio=0.108
DualPipeV  bubble_ratio=0.125
GPipe      bubble_ratio=0.375

Example 2: Simulate Full Schedule

from dualpipe_enhanced.simulator import ScheduleComparator

# Simulate all strategies for 4 ranks, 8 chunks
comparator = ScheduleComparator(num_ranks=4, num_chunks=8)
results = comparator.compare_strategies()

# Print ASCII report
print(comparator.generate_report())

Output:

Strategy Performance Comparison
================================
Makespan (lower is better):
  1. dualpipev   : 17.0
  2. dualpipe    : 18.0
  3. 1f1b        : 20.0
  4. gpipe       : 30.0

Bubble Ratio (lower is better):
  1. dualpipev   : 0.059
  2. dualpipe    : 0.067
  3. 1f1b        : 0.150
  4. gpipe       : 0.400

Example 3: Estimate Memory

from dualpipe_enhanced.benchmark import MemoryEstimator

# Estimate memory for 4 ranks, 16 chunks
# Model: 1B parameters, hidden_dim=1024, seq_len=2048, batch_size=64
mem = MemoryEstimator(
    num_ranks=4,
    num_chunks=16,
    model_params_per_rank=250_000_000,
    hidden_dim=1024,
    seq_len=2048,
    batch_size=64
)

for strategy in ["dualpipe", "1f1b", "gpipe"]:
    activation = mem.estimate_activation_memory(strategy)
    print(f"{strategy}: {activation['peak_mb']:.1f} MB activation")

Example 4: Find Optimal Configuration

from dualpipe_enhanced.benchmark import ConfigurationSearch

# Search for optimal (num_ranks, num_chunks) under 80GB GPU memory
searcher = ConfigurationSearch(
    model_params=10_000_000,
    hidden_dim=1024,
    seq_len=512,
    batch_size=64,
    gpu_memory_gb=80.0
)

configs = searcher.search_optimal_config(min_ranks=2, max_ranks=8)

for i, cfg in enumerate(configs[:3], 1):
    print(f"{i}. Ranks={cfg['num_ranks']}, Chunks={cfg['num_chunks']}, "
          f"Throughput={cfg['throughput']:.2f} samples/s")
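The idea behind a memory-constrained search can be sketched as a brute-force loop. This is a hypothetical illustration, not the algorithm used by `ConfigurationSearch`. The `search_configs` helper, its memory model (parameters split evenly across ranks, `min(ranks, chunks)` micro-batches in flight), and its ranking by the GPipe/1F1B bubble fraction `(p - 1) / (m + p - 1)` are all assumptions.

```python
from itertools import product
from typing import Dict, List, Sequence


def search_configs(param_bytes: float, act_bytes_per_sample: float,
                   batch_size: int, gpu_memory_bytes: float,
                   ranks: Sequence[int] = (2, 4, 8),
                   chunks: Sequence[int] = (4, 8, 16, 32)) -> List[Dict[str, float]]:
    """Return feasible (num_ranks, num_chunks) pairs, lowest bubble ratio first."""
    feasible = []
    for p, m in product(ranks, chunks):
        if batch_size % m != 0:
            continue  # micro-batches must divide the batch evenly
        micro_batch = batch_size // m
        # Assumed memory model: sharded parameters plus in-flight activations.
        memory = param_bytes / p + min(p, m) * micro_batch * act_bytes_per_sample
        if memory > gpu_memory_bytes:
            continue
        feasible.append({
            "num_ranks": p,
            "num_chunks": m,
            "bubble_ratio": (p - 1) / (m + p - 1),
            "memory_bytes": memory,
        })
    return sorted(feasible, key=lambda cfg: cfg["bubble_ratio"])


configs = search_configs(param_bytes=8e9, act_bytes_per_sample=1e9,
                         batch_size=32, gpu_memory_bytes=6.5e9)
for cfg in configs:
    print(f"Ranks={cfg['num_ranks']}, Chunks={cfg['num_chunks']}, "
          f"bubble_ratio={cfg['bubble_ratio']:.3f}")
```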

Using Original DualPipe

The original DualPipe implementation is available in the dualpipe/ directory:

python examples/example_dualpipe.py
python examples/example_dualpipev.py

Note: For real-world applications, you will need to implement a custom overlapped_forward_backward method tailored to your specific module.

Testing

Run all 132 tests:

python3 -m pytest tests/ -v

Expected output:

tests/test_analysis.py ................................................. [ 37%]
tests/test_simulator.py ........................................... [ 69%]
tests/test_benchmark.py ........................................ [100%]

======================== 132 passed in 0.09s ========================

Test Coverage:

  • test_analysis.py: 49 tests for ScheduleAnalyzer, BubbleCalculator, CommunicationAnalyzer
  • test_simulator.py: 43 tests for PipelineEvent, ScheduleSimulator, ScheduleComparator, ConfigOptimizer
  • test_benchmark.py: 40 tests for MemoryEstimator, ThroughputEstimator, ScalabilityAnalyzer, ConfigurationSearch

Architecture

dualpipe_enhanced/              # Pure Python analytical modules
├── __init__.py
├── analysis.py                 # Pipeline analysis (49 tests)
├── simulator.py                # Schedule simulation (43 tests)
└── benchmark.py                # Performance estimation (40 tests)

dualpipe/                       # Original DualPipe implementation
├── dualpipe.py
├── dualpipev.py
├── comm.py
└── utils.py

examples/                       # Usage examples for original DualPipe
├── example_dualpipe.py
└── example_dualpipev.py

tests/                          # Test suite (132 tests)
├── test_analysis.py
├── test_simulator.py
└── test_benchmark.py

Zero Dependencies

All core analytical functionality is pure Python: no PyTorch, no CUDA, and no distributed runtime are needed. The modules use only Python's standard library:

  • dataclasses (for PipelineEvent)
  • typing (for type hints)
  • math (for calculations)

Pipeline Bubbles and Memory Comparison

Theoretical comparison of pipeline strategies (same number of PP stages):

| Method | Bubble | Parameter Per Device | Activation Per Device | #Devices |
|---|---|---|---|---|
| 1F1B | (PP-1)(𝐹+𝐵) | 1× | PP | PP |
| ZB1P | (PP-1)(𝐹+𝐵-2𝑊) | 1× | PP | PP |
| DualPipe | (PP/2-1)(𝐹&𝐵+𝐵-3𝑊) | 2× | PP+1 | PP |
| DualPipeV | (PP/2-1)(𝐹&𝐵+𝐵-3𝑊) | 2× | PP+1 | PP/2 |

PP denotes the number of pipeline-parallel (PP) stages, which must be even. 𝐹 denotes the execution time of a forward chunk, 𝐵 the execution time of a full backward chunk, 𝑊 the execution time of a "backward for weights" chunk, and 𝐹&𝐵 the execution time of two mutually overlapped forward and backward chunks.
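The bubble formulas in the table can be evaluated directly. The sketch below is a plain transcription of those four expressions. The `bubble_times` helper and the sample timings are illustrative assumptions, not measurements.

```python
from typing import Dict


def bubble_times(pp: int, f: float, b: float, w: float, fb: float) -> Dict[str, float]:
    """Evaluate the bubble formulas from the comparison table.

    pp: number of PP stages (even); f, b, w: forward, full backward, and
    weight-backward chunk times; fb: time of an overlapped forward+backward pair.
    """
    if pp < 2 or pp % 2 != 0:
        raise ValueError("PP must be an even number of stages")
    dual = (pp / 2 - 1) * (fb + b - 3 * w)
    return {
        "1F1B": (pp - 1) * (f + b),
        "ZB1P": (pp - 1) * (f + b - 2 * w),
        "DualPipe": dual,
        "DualPipeV": dual,  # same bubble expression, half the devices
    }


# Illustrative timings only: F=1, B=2, W=1, F&B=3 with 8 PP stages.
for method, bubble in bubble_times(pp=8, f=1.0, b=2.0, w=1.0, fb=3.0).items():
    print(f"{method:10s} bubble={bubble}")
```

With these sample timings the bubbles come out as 21.0 for 1F1B, 7.0 for ZB1P, and 6.0 for both DualPipe variants.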

Use Cases

  1. Algorithm Research: Understand DualPipe's scheduling properties without distributed training
  2. Configuration Planning: Find optimal rank and chunk counts before launching training
  3. Strategy Comparison: Compare DualPipe vs. 1F1B vs. GPipe for your hardware setup
  4. Memory Analysis: Estimate memory requirements under different configurations
  5. Scaling Studies: Analyze weak/strong scaling characteristics across cluster sizes

References

Requirements

For Enhanced Modules:

  • Python 3.7+
  • No external dependencies (pure Python)

For Original DualPipe:

  • PyTorch 2.0 and above

Developers

Original DualPipe: Created and developed by Jiashi Li, Chengqi Deng, and Wenfeng Liang (DeepSeek-AI)

Enhanced Modules: Analytical framework for pipeline parallelism research and optimization

Citation

@misc{deepseekai2025deepseekv3technicalreport,
      title={DeepSeek-V3 Technical Report},
      author={DeepSeek-AI},
      year={2025},
      eprint={2412.19437},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.19437},
}

License

MIT License. See LICENSE file for details.
