Analytical tools and benchmarking framework for DualPipe and DualPipeV bidirectional pipeline parallelism algorithms from the DeepSeek-V3 Technical Report.
Pure Python • Zero Dependencies • No GPU Required
This enhanced version provides three modules for analyzing, simulating, and benchmarking bidirectional pipeline parallelism strategies—all in pure Python with no external dependencies. Understand DualPipe scheduling, compare strategies, estimate memory, and find optimal configurations without running distributed training.
DualPipe is an innovative bidirectional pipeline parallelism algorithm that achieves full overlap of forward and backward computation-communication phases while reducing pipeline bubbles. Unlike traditional pipeline parallelism (GPipe, 1F1B), DualPipe divides the model vertically into two halves and allows data to flow bidirectionally.
For detailed information, refer to the DeepSeek-V3 Technical Report and profile data.
DualPipe Scheduling:
Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. Two cells enclosed by a shared black border have mutually overlapped computation and communication.
DualPipeV Scheduling:
Example DualPipeV scheduling for 4 PP ranks (8 PP stages) and 10 micro-batches. DualPipeV is a concise V-shape schedule derived from DualPipe using a "cut-in-half" procedure (thanks to Sea AI Lab for the blog post).
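The "cut-in-half" idea can be illustrated with a small stage-to-rank mapping. This is a sketch under the assumption that rank r holds stage r on the downward leg of the V and stage 2P-1-r on the upward leg, mirroring the schedule figure; it is not the library's internal layout:

```python
# Assumed V-shape stage-to-rank mapping for a DualPipeV-style schedule:
# with P ranks and 2P pipeline stages, rank r holds stage r (down leg)
# and stage 2P-1-r (up leg), so each rank carries two stages.

def v_shape_stages(num_ranks):
    total_stages = 2 * num_ranks
    return {r: (r, total_stages - 1 - r) for r in range(num_ranks)}

for rank, (down, up) in v_shape_stages(4).items():  # 4 ranks -> 8 stages
    print(f"rank {rank}: stages {down} and {up}")
# rank 0: stages 0 and 7
# rank 1: stages 1 and 6
# rank 2: stages 2 and 5
# rank 3: stages 3 and 4
```

Note that under this mapping rank 0 holds both the first and last stage, so the model's input and output sit on the same device, and the tip of the V lands on the last rank.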
Computational analysis of pipeline scheduling without a distributed runtime.
Key Classes:
- ScheduleAnalyzer: Generate operation timelines for DualPipe/DualPipeV, compute bubble ratios, find optimal chunk counts
- BubbleCalculator: Calculate and compare bubble ratios across all strategies (DualPipe, DualPipeV, 1F1B, GPipe)
- CommunicationAnalyzer: Count P2P operations, estimate communication volume, analyze bandwidth requirements
Pure Python simulation of full pipeline execution across ranks and microbatches.
Key Classes:
- PipelineEvent: Dataclass representing individual pipeline events (forward, backward, weight, communication)
- ScheduleSimulator: Simulate complete schedules for DualPipe, DualPipeV, 1F1B, and GPipe strategies
- ScheduleComparator: Compare strategies by makespan, bubble ratio, and memory factor with ASCII reports
- ConfigOptimizer: Find optimal chunk counts, run sensitivity analysis, determine communication thresholds
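To give a feel for what a schedule simulator computes, here is a deliberately simplified, self-contained sketch: unit-time forward/backward chunks, no communication cost, and a GPipe-style full flush. The enhanced simulator uses a more detailed event model, so its numbers differ from this toy version:

```python
# Toy model (not dualpipe_enhanced's implementation): a GPipe schedule is a
# forward wave that fills the pipeline, a full flush, then a backward wave
# that drains it. Each wave takes (num_chunks + num_ranks - 1) time slots.

def gpipe_makespan(num_ranks, num_chunks, f=1.0, b=1.0):
    return (num_chunks + num_ranks - 1) * f + (num_chunks + num_ranks - 1) * b

def bubble_ratio(makespan, num_chunks, f=1.0, b=1.0):
    busy = num_chunks * (f + b)          # useful compute per rank
    return (makespan - busy) / makespan  # fraction of time spent idle

span = gpipe_makespan(num_ranks=4, num_chunks=8)
print(span, round(bubble_ratio(span, num_chunks=8), 3))  # 22.0 0.273
```

The same shape of calculation, with per-event start/end times instead of closed-form waves, is what ScheduleSimulator and ScheduleComparator report.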
Estimate memory usage, throughput, and scalability characteristics.
Key Classes:
- MemoryEstimator: Estimate activation, parameter, and total memory per rank for each strategy
- ThroughputEstimator: Estimate step time, model FLOPs utilization (MFU), samples/tokens per second
- ScalabilityAnalyzer: Perform weak and strong scaling analysis, find scaling limits
- ConfigurationSearch: Search for optimal (num_ranks, num_chunks) configurations under memory constraints
Clone the repository (no pip install needed):

```shell
git clone https://github.com/deepseek-ai/DualPipe.git dualpipe-enhanced
cd dualpipe-enhanced
```

Verify the enhanced modules are present:

```shell
ls dualpipe_enhanced/
# Expected: __init__.py, analysis.py, simulator.py, benchmark.py
```

```python
from dualpipe_enhanced.analysis import BubbleCalculator

# Compare bubble ratios for 4 pipeline stages, 16 microbatches
calc = BubbleCalculator(num_ranks=4)
num_chunks = 16

dualpipe_result = calc.dualpipe_bubble(num_chunks)
dualpipev_result = calc.dualpipev_bubble(num_chunks)
gpipe_result = calc.gpipe_bubble(num_chunks)

print(f"DualPipe bubble_ratio={dualpipe_result['bubble_ratio']:.3f}")
print(f"DualPipeV bubble_ratio={dualpipev_result['bubble_ratio']:.3f}")
print(f"GPipe bubble_ratio={gpipe_result['bubble_ratio']:.3f}")
```

Output:

```
DualPipe bubble_ratio=0.108
DualPipeV bubble_ratio=0.125
GPipe bubble_ratio=0.375
```
```python
from dualpipe_enhanced.simulator import ScheduleComparator

# Simulate all strategies for 4 ranks, 8 chunks
comparator = ScheduleComparator(num_ranks=4, num_chunks=8)
results = comparator.compare_strategies()

# Print ASCII report
print(comparator.generate_report())
```

Output:

```
Strategy Performance Comparison
================================

Makespan (lower is better):
  1. dualpipev : 17.0
  2. dualpipe  : 18.0
  3. 1f1b      : 20.0
  4. gpipe     : 30.0

Bubble Ratio (lower is better):
  1. dualpipev : 0.059
  2. dualpipe  : 0.067
  3. 1f1b      : 0.150
  4. gpipe     : 0.400
```
```python
from dualpipe_enhanced.benchmark import MemoryEstimator

# Estimate memory for 4 ranks, 16 chunks
# Model: 1B parameters total (250M per rank), hidden_dim=1024,
# seq_len=2048, batch_size=64
mem = MemoryEstimator(
    num_ranks=4,
    num_chunks=16,
    model_params_per_rank=250_000_000,
    hidden_dim=1024,
    seq_len=2048,
    batch_size=64,
)

for strategy in ["dualpipe", "1f1b", "gpipe"]:
    activation = mem.estimate_activation_memory(strategy)
    print(f"{strategy}: {activation['peak_mb']:.1f} MB activation")
```

```python
from dualpipe_enhanced.benchmark import ConfigurationSearch

# Search for optimal (num_ranks, num_chunks) under 80 GB GPU memory
searcher = ConfigurationSearch(
    model_params=10_000_000,
    hidden_dim=1024,
    seq_len=512,
    batch_size=64,
    gpu_memory_gb=80.0,
)

configs = searcher.search_optimal_config(min_ranks=2, max_ranks=8)
for i, cfg in enumerate(configs[:3], 1):
    print(f"{i}. Ranks={cfg['num_ranks']}, Chunks={cfg['num_chunks']}, "
          f"Throughput={cfg['throughput']:.2f} samples/s")
```

The original DualPipe implementation is available in the dualpipe/ directory:
```shell
python examples/example_dualpipe.py
python examples/example_dualpipev.py
```

Note: For real-world applications, you will need to implement a custom overlapped_forward_backward method tailored to your specific module.
Run all 132 tests:

```shell
python3 -m pytest tests/ -v
```

Expected output:

```
tests/test_analysis.py ......................................... [ 37%]
tests/test_simulator.py .......................................... [ 75%]
tests/test_benchmark.py ........................................... [100%]
======================== 132 passed in 0.09s ========================
```
Test Coverage:
- test_analysis.py: 49 tests for ScheduleAnalyzer, BubbleCalculator, CommunicationAnalyzer
- test_simulator.py: 43 tests for PipelineEvent, ScheduleSimulator, ScheduleComparator, ConfigOptimizer
- test_benchmark.py: 40 tests for MemoryEstimator, ThroughputEstimator, ScalabilityAnalyzer, ConfigurationSearch
```
dualpipe_enhanced/           # Pure Python analytical modules
├── __init__.py
├── analysis.py              # Pipeline analysis (49 tests)
├── simulator.py             # Schedule simulation (43 tests)
└── benchmark.py             # Performance estimation (40 tests)
dualpipe/                    # Original DualPipe implementation
├── dualpipe.py
├── dualpipev.py
├── comm.py
└── utils.py
examples/                    # Usage examples for original DualPipe
├── example_dualpipe.py
└── example_dualpipev.py
tests/                       # Test suite (132 tests)
├── test_analysis.py
├── test_simulator.py
└── test_benchmark.py
```
All core analytical functionality is pure Python—no PyTorch, no CUDA, no distributed runtime needed. Uses only Python's standard library:
- dataclasses (for PipelineEvent)
- typing (for type hints)
- math (for calculations)
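As an illustration of the stdlib-only approach, a PipelineEvent-style record can be built entirely from dataclasses. The field names below are assumptions for the sketch, not the module's actual definition (see dualpipe_enhanced/simulator.py for that):

```python
from dataclasses import dataclass

# Hypothetical sketch of a PipelineEvent-like dataclass; field names
# are assumptions, not the library's real signature.
@dataclass(order=True)  # order=True lets events sort by start time first
class PipelineEvent:
    start: float
    rank: int
    kind: str       # "forward" | "backward" | "weight" | "comm"
    chunk: int
    duration: float = 1.0

    @property
    def end(self):
        return self.start + self.duration

ev = PipelineEvent(start=0.0, rank=0, kind="forward", chunk=0)
print(ev.end)  # 1.0
```

Because dataclasses generates __init__, __repr__, and ordering methods automatically, a simulator built this way needs no third-party packages.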
Theoretical comparison of pipeline strategies (same number of PP stages):
| Method | Bubble | Parameter Per Device | Activation Per Device | #Devices |
|---|---|---|---|---|
| 1F1B | (PP-1)(𝐹+𝐵) | 1× | PP | PP |
| ZB1P | (PP-1)(𝐹+𝐵-2𝑊) | 1× | PP | PP |
| DualPipe | (PP/2-1)(𝐹&𝐵+𝐵-3𝑊) | 2× | PP+1 | PP |
| DualPipeV | (PP/2-1)(𝐹&𝐵+𝐵-3𝑊) | 2× | PP+1 | PP/2 |
PP denotes the number of pipeline (PP) stages, assumed even. 𝐹 denotes the execution time of a forward chunk, 𝐵 the execution time of a full backward chunk, 𝑊 the execution time of a "backward for weights" chunk, and 𝐹&𝐵 the execution time of two mutually overlapped forward and backward chunks.
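The bubble expressions in the table can be evaluated directly. The chunk times below are illustrative assumptions, not profiled values, and F&B is written as FB in code:

```python
# Evaluate the theoretical bubble formulas from the table above.
# Chunk times are illustrative assumptions, not measurements.
F, B, W = 1.0, 2.0, 1.0   # forward, full backward, backward-for-weights
FB = 3.0                  # F&B: overlapped forward + backward chunk
PP = 8                    # number of pipeline stages (even)

bubbles = {
    "1F1B":      (PP - 1) * (F + B),
    "ZB1P":      (PP - 1) * (F + B - 2 * W),
    "DualPipe":  (PP / 2 - 1) * (FB + B - 3 * W),
    "DualPipeV": (PP / 2 - 1) * (FB + B - 3 * W),
}
for name, bubble in bubbles.items():
    print(f"{name:10s} bubble = {bubble:.1f}")
# 1F1B       bubble = 21.0
# ZB1P       bubble = 7.0
# DualPipe   bubble = 6.0
# DualPipeV  bubble = 6.0
```

With these example times, DualPipe and DualPipeV share the same bubble; the difference is that DualPipeV achieves it with half the devices (PP/2), as the #Devices column shows.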
- Algorithm Research: Understand DualPipe's scheduling properties without distributed training
- Configuration Planning: Find optimal rank and chunk counts before launching training
- Strategy Comparison: Compare DualPipe vs. 1F1B vs. GPipe for your hardware setup
- Memory Analysis: Estimate memory requirements under different configurations
- Scaling Studies: Analyze weak/strong scaling characteristics across cluster sizes
- DeepSeek-V3 Technical Report: https://arxiv.org/abs/2412.19437
- Original DualPipe: github.com/deepseek-ai/DualPipe
- Related Work:
- GPipe (Huang et al., 2019)
- 1F1B (Narayanan et al., 2021)
- Sea AI Lab "Cut-in-half" procedure for DualPipeV
For Enhanced Modules:
- Python 3.7+
- No external dependencies (pure Python)
For Original DualPipe:
- PyTorch 2.0 and above
Original DualPipe: Created and developed by Jiashi Li, Chengqi Deng, and Wenfeng Liang (DeepSeek-AI)
Enhanced Modules: Analytical framework for pipeline parallelism research and optimization
```bibtex
@misc{deepseekai2025deepseekv3technicalreport,
      title={DeepSeek-V3 Technical Report},
      author={DeepSeek-AI},
      year={2025},
      eprint={2412.19437},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.19437},
}
```

MIT License. See LICENSE file for details.

