Memory Efficient Pipeline-Parallel DNN Training

Posted on November 8, 2022 · 1 minute read

Introduction

Goal

This paper aims to improve throughput and reduce the memory footprint of pipeline-parallel DNN training.

Gap

Training large models within limited memory is necessary, but the two existing approaches have limitations:

  • Model-parallel training: low throughput from poor resource utilization when partitioning by layer, and high communication cost when partitioning by tensor, which limits good scaling.
  • Pipelined training: naive pipelining hurts model quality because the forward and backward passes of a micro-batch may see inconsistent weight versions, and keeping multiple weight versions to avoid this increases memory usage.

Challenge

  1. It is challenging to train with a low memory footprint while still providing high throughput.
  2. The performance of a pipeline-parallel system depends on how the DNN is partitioned across workers, and finding a good partition is challenging because of:
  • Memory capacity constraints: each stage must fit its parameters and the intermediate activations of in-flight micro-batches (a rough memory sketch follows this list).
  • Heterogeneous network interconnects: bandwidth differs across links (e.g., within a server vs. between servers), so where the model is cut matters.
  • Splitting the operator graph optimally is computationally expensive, since the space of possible partitions is large.
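As a rough back-of-the-envelope illustration of the memory constraint, a stage's footprint can be estimated from its parameter state plus the activations of in-flight micro-batches. The sketch below is a toy estimate with made-up names and numbers, not the paper's cost model.

```python
def stage_memory_bytes(param_count, weight_versions, bytes_per_param,
                       activation_bytes_per_microbatch, in_flight_microbatches,
                       optimizer_state_copies=2):
    """Rough per-stage memory estimate for pipeline-parallel training.

    param_count: number of parameters assigned to this pipeline stage
    weight_versions: weight copies kept for consistency (2 with double-buffering)
    optimizer_state_copies: extra per-parameter buffers (e.g., Adam moments)
    """
    weight_mem = param_count * bytes_per_param * (weight_versions + optimizer_state_copies)
    activation_mem = activation_bytes_per_microbatch * in_flight_microbatches
    return weight_mem + activation_mem


# Example with made-up numbers: 1B parameters per stage in fp16, two weight
# versions, four in-flight micro-batches of ~0.5 GiB activations each.
print(stage_memory_bytes(1_000_000_000, 2, 2, 0.5 * 2**30, 4) / 2**30, "GiB")
```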

Design

The system combines the following techniques to ensure low memory usage and high throughput.

Double-buffered Weight Update

Multiple micro-batches use the same weight version for their forward and backward passes, each producing a gradient.

Once the number of micro-batches reaches the gradient-accumulation threshold, the accumulated gradients are averaged and applied to produce a new weight version.

However, the new version cannot be used immediately by all micro-batches: any micro-batch that already ran its forward pass with the old version must also run its backward pass with that same version.

Thus, each worker maintains at most two weight versions at any time, hence "double-buffered".
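A minimal Python sketch of the idea, assuming a single stage and scalar or array "weights"; the class and method names below are illustrative, not the paper's API. Each micro-batch's backward pass uses the same version its forward pass read, a new version is produced after a fixed number of micro-batches, and only two versions ever coexist.

```python
class DoubleBufferedWeights:
    """Sketch of double-buffered weight updates: at most two weight versions
    are kept so in-flight micro-batches can finish with the version they
    started with."""

    def __init__(self, initial_weights, accumulation_steps, lr=0.1):
        self.versions = {0: initial_weights}   # version id -> weights
        self.current = 0                       # version new forward passes read
        self.accumulation_steps = accumulation_steps
        self.lr = lr
        self.grad_buffer = 0.0
        self.seen = 0

    def forward_version(self):
        # New micro-batches always read the latest version.
        return self.current

    def backward(self, version_id, gradient):
        # 'gradient' was computed against self.versions[version_id]; it is
        # accumulated until enough micro-batches have contributed.
        self.grad_buffer += gradient
        self.seen += 1
        if self.seen == self.accumulation_steps:
            avg_grad = self.grad_buffer / self.accumulation_steps
            new_weights = self.versions[self.current] - self.lr * avg_grad
            # Keep only the previous version (still needed by in-flight
            # micro-batches) and the new one: at most two versions alive.
            self.versions = {self.current: self.versions[self.current],
                             self.current + 1: new_weights}
            self.current += 1
            self.grad_buffer, self.seen = 0.0, 0


# Toy usage with scalar weights and 4 micro-batches per weight update.
w = DoubleBufferedWeights(initial_weights=1.0, accumulation_steps=4)
for mb in range(8):
    v = w.forward_version()       # version the forward pass reads
    w.backward(v, gradient=0.5)   # backward reuses that same version
print(w.current, w.versions)      # only two versions alive after two updates
```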

Flush mechanisms

Periodic pipeline flushes drain all in-flight micro-batches before a weight update, so that gradients are always applied against a consistent weight version.
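A minimal sketch of a flush-based schedule, assuming simple placeholder callables (run_forward_backward, drain_pipeline, apply_update) that are not the paper's interface. Draining the pipeline before every update means no micro-batch straddles a weight update, so a single weight version suffices, at the cost of some idle (bubble) time.

```python
def train_with_periodic_flush(stages, micro_batches_per_update, num_updates,
                              run_forward_backward, drain_pipeline, apply_update):
    """Sketch of pipelined training with periodic flushes: only one weight
    version is needed because no micro-batch spans a weight update."""
    for step in range(num_updates):
        for mb in range(micro_batches_per_update):
            # Inject one micro-batch into the pipeline.
            run_forward_backward(stages, micro_batch_id=mb)
        # Flush: wait until every in-flight micro-batch has finished its
        # backward pass before touching the weights.
        drain_pipeline(stages)
        # All accumulated gradients correspond to the same weight version.
        apply_update(stages)
```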

Auto-partition planner

The partition planner tries to determine the partitioning configuration that yields the best performance.

It decides how the model is partitioned over workers, the batch size, and whether to use memory-saving optimizations such as activation recomputation.
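A toy sketch of what such a planner could look like: enumerate candidate configurations (pipeline depth, data-parallel width, micro-batch size, activation recomputation on or off) and keep the one with the highest estimated throughput that fits the memory budget. The cost and memory estimators here are placeholder callables, not the paper's actual model.

```python
from itertools import product


def plan(num_gpus, candidate_microbatch_sizes, estimate_throughput,
         estimate_stage_memory, memory_budget_bytes):
    """Pick (pipeline depth, data-parallel width, micro-batch size, recompute)
    maximizing estimated throughput subject to a per-GPU memory budget."""
    best_config, best_tput = None, 0.0
    depths = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    for depth, mbs, recompute in product(depths, candidate_microbatch_sizes,
                                         (False, True)):
        width = num_gpus // depth  # data-parallel replicas per pipeline stage
        if estimate_stage_memory(depth, mbs, recompute) > memory_budget_bytes:
            continue               # violates the memory capacity constraint
        tput = estimate_throughput(depth, width, mbs, recompute)
        if tput > best_tput:
            best_config, best_tput = (depth, width, mbs, recompute), tput
    return best_config
```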

Evaluation

The evaluation compares convergence quality, throughput, memory usage, and the planner's partitioning decisions.




