PipeDream: Generalized Pipeline Parallelism for DNN Training
4 minute read ∼ Filed in: A paper note
Question
- Why is PipeDream faster than data parallelism (with ASP or BSP), model parallelism, hybrid parallelism, and prior pipeline parallelism?
It integrates the benefits of all of the systems above.
E.g., it mainly uses pipeline parallelism, adds stage replication (similar to data parallelism), and updates weights immediately while using weight stashing to keep training effective.
Introduction
Background & Motivations
DNNs are getting larger and more computationally expensive to train, thus requiring parallel execution across multiple accelerators.
Gap
Intra-batch parallelization:
- Data parallelism with BSP/ASP, where each worker exchanges gradients with the other workers after processing its portion of a mini-batch.
- The main limitation is communication overhead.
- Data-parallel training suffers from high communication costs at large scale. E.g., with 32 workers, communication can take up to 90% of the total training time.
- Model parallelism:
- Limitations: under-utilization of GPUs and the inconvenience of manual partitioning.
Inter-batch parallelism:
- Pipeline parallelism. E.g., GPipe.
Goal
Propose a pipeline-parallel system that enables faster DNN training by combining intra-batch parallelism with inter-batch parallelism.
It can outperform intra-batch parallelism because:
- Low communication overhead compared with data parallelism.
- Overlaps the computation and communication of different inputs in a pipelined fashion to achieve high hardware efficiency.
- High statistical efficiency (number of iterations needed to reach a particular target accuracy) comparable to data parallelism using the same number of workers.
Challenges
DNN training is bi-directional (a forward pass followed by a backward pass over the same layers), which makes naive pipelining challenging.
- Inject all mini-batches in an epoch before updating the weights.
- => cannot reach the desired target accuracy, since gradients are then averaged over all training samples.
- Inject m mini-batches and update the weights every m mini-batches (GPipe).
- => reduces hardware efficiency, because each pipeline flush leaves every worker idle while the pipeline drains and refills (see the sketch below).
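To see why flushes hurt, here is a rough back-of-the-envelope sketch (my own, not from the paper): with `num_stages` pipeline stages and a weight update every `m` mini-batches, each flush costs roughly `num_stages - 1` idle slots per worker.

```python
def flush_idle_fraction(num_stages: int, m: int) -> float:
    """Approximate fraction of time each worker sits idle when the pipeline
    is flushed after every m mini-batches (GPipe-style), assuming every
    stage takes one time unit per mini-batch."""
    # Each fill/drain costs ~(num_stages - 1) idle slots per worker,
    # amortized over (m + num_stages - 1) slots between weight updates.
    return (num_stages - 1) / (m + num_stages - 1)

# Example: with 4 stages, updating every 4 mini-batches leaves each worker
# idle ~43% of the time; updating every 32 mini-batches drops that to ~9%.
print(flush_idle_fraction(4, 4))   # ~0.43
print(flush_idle_fraction(4, 32))  # ~0.09
```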
Worker Partitioning
Obj:
- High hardware efficiency.
Problem definition:
- Each stage should take roughly the same time per mini-batch, with less idle time => better hardware efficiency.
- Less communication => better hardware efficiency.
- Different operators take different amounts of time; thus, uniformly splitting the model's layers across stages may not be optimal.
Solution:
- Replicate the slowest stage (data parallelism within a stage) to remove the bottleneck and increase throughput.
- Propose a partitioning algorithm that ensures each stage completes work at roughly the same rate while minimizing communication across workers in a topology-aware way. It computes:
- a partitioning of layers into stages,
- a replication factor for each stage,
- the optimal number of in-flight mini-batches to keep the training pipeline busy.
- The partitioning algorithm relies on a profiler that measures each layer's forward/backward pass time, the size of its outputs, and the size of its parameters (see the sketch below).
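As a rough illustration (my own simplification, not PipeDream's actual algorithm), a profiler-driven partitioner can be sketched as splitting a chain of profiled layer times into contiguous stages that minimize the slowest stage; PipeDream's real optimizer also accounts for communication cost and stage replication, which are ignored here. The function `partition_layers` and its inputs are hypothetical.

```python
from functools import lru_cache

def partition_layers(layer_times, num_stages):
    """Toy sketch: split a chain of layers into `num_stages` contiguous
    stages so the slowest stage (the pipeline bottleneck) is as fast as
    possible. `layer_times` are per-layer forward+backward times from a
    profiler; communication and replication are ignored."""
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:
        prefix.append(prefix[-1] + t)

    @lru_cache(maxsize=None)
    def best(i, k):
        # Minimal bottleneck time for layers [0, i) split into k stages.
        if k == 1:
            return prefix[i]
        return min(max(best(j, k - 1), prefix[i] - prefix[j])
                   for j in range(k - 1, i))

    return best(n, num_stages)

# Example: 6 layers with uneven costs split into 3 stages; the best
# bottleneck time is 4.0 (e.g., stages [2,1], [3], [1,1,2]).
print(partition_layers([2.0, 1.0, 3.0, 1.0, 1.0, 2.0], 3))
```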
Worker scheduling
Obj:
- High resource utilization.
Problems:
- Each worker must decide whether to perform a forward or a backward pass next.
- With replicated stages, the system must also decide how mini-batches are routed among the replicas.
Solution:
- 1F1B (one-forward-one-backward) scheduling: each worker alternates between forward and backward passes (see the sketch below).
- Gradients are used to update the model weights immediately. (GPipe needs to wait for gradient aggregation over all m mini-batches, which lowers resource utilization.)
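A minimal sketch of what a 1F1B schedule looks like for a single stage (my reconstruction from the description above, not PipeDream's actual scheduler): after a warm-up of forward passes that fills the pipeline, the stage strictly alternates one backward and one forward pass, then drains the remaining backward passes.

```python
def one_f_one_b_schedule(stage, num_stages, num_microbatches):
    """Toy sketch of the 1F1B order of work for one pipeline stage
    (0-indexed); returns a list of ("F"/"B", mini-batch id) pairs."""
    # Warm-up: earlier stages admit more forward passes to fill the pipeline.
    warmup = min(num_stages - stage, num_microbatches)
    ops = [("F", mb) for mb in range(warmup)]
    fwd, bwd = warmup, 0
    # Steady state: strictly alternate one backward and one forward pass.
    while fwd < num_microbatches:
        ops.append(("B", bwd)); bwd += 1
        ops.append(("F", fwd)); fwd += 1
    # Drain: finish backward passes for the mini-batches still in flight.
    while bwd < num_microbatches:
        ops.append(("B", bwd)); bwd += 1
    return ops

# The last stage of a 4-stage pipeline alternates immediately (F0 B0 F1 B1 ...),
# while the first stage runs 4 forward passes before its first backward pass.
print(one_f_one_b_schedule(stage=3, num_stages=4, num_microbatches=6))
print(one_f_one_b_schedule(stage=0, num_stages=4, num_microbatches=6))
```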
Effective Learning
Obj:
- Effective learning.
Problem:
- The forward pass for a mini-batch may use weight version w0, while its backward pass runs after another mini-batch has already updated the weights to w1; this mismatch between forward and backward weight versions reduces the effectiveness of training.
Solution:
- Use weight stashing: each stage keeps the weight version used in a mini-batch's forward pass and reuses it for that mini-batch's backward pass, avoiding the mismatch within a single stage/GPU (see the sketch below).
- Weight stashing alone cannot guarantee consistency of weight versions across stages.
- Vertical sync:
- Eliminates the potential inconsistency across stages by having all stages use the same weight version for a given mini-batch.
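A toy sketch of the weight-stashing bookkeeping on one stage (scalar weights and fake gradients, an assumed structure rather than PipeDream's implementation): the backward pass for a mini-batch reuses exactly the weight version its forward pass saw, while updates are still applied immediately.

```python
class StageWithWeightStashing:
    """Toy model of one pipeline stage that stashes the weight version
    used in each mini-batch's forward pass for its later backward pass."""

    def __init__(self, weight=1.0, lr=0.1):
        self.weight = weight   # latest weight version on this stage
        self.lr = lr
        self.stash = {}        # mini-batch id -> weight version used in forward

    def forward(self, mb_id, x):
        self.stash[mb_id] = self.weight  # remember the version this mini-batch used
        return self.weight * x           # stand-in for the real forward computation

    def backward(self, mb_id, grad_out):
        w = self.stash.pop(mb_id)        # same version the forward pass used
        grad = grad_out * w              # stand-in for the real gradient computation
        self.weight -= self.lr * grad    # apply the update immediately (no flush)
        return grad

stage = StageWithWeightStashing()
stage.forward(0, x=1.0)
stage.forward(1, x=1.0)            # two mini-batches in flight
stage.backward(0, grad_out=1.0)    # uses the version stashed by mini-batch 0
stage.backward(1, grad_out=1.0)    # uses mini-batch 1's stashed version, not the updated one
```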
Evaluation
Macro-benchmarks
- Speedups in Time-to-target-accuracy.
- Compared with data parallelism, PipeDream is much faster due to lower communication cost.
- Compared with ASP data parallelism, PipeDream reaches higher accuracy thanks to weight stashing.
- Compared with model parallelism (without pipelining or replication), PipeDream is much faster because it adds pipelining and replication.
- Compared with hybrid parallelism, PipeDream is much faster due to overlapping communication with computation.
- It reduces communication overhead without increasing memory usage.
- Compared with GPipe, PipeDream is faster, largely because it avoids GPipe's periodic pipeline flushes.
Microbenchmarks
- Optimizer
- Memory footprint
- Communication overhead
- Effect of pipeline depth