Memory Efficient Pipeline-Parallel DNN Training

Posted on November 8, 2022 · 1 minute read

Introduction

Goal

This paper aims to improve throughput and reduce the memory footprint of pipeline-parallel DNN training.

Gap

Training large models within limited memory is necessary, but the two existing approaches have limitations:

  • Model-parallel training: low throughput from poor resource utilization when partitioning by layer, and high communication cost when partitioning by tensor, which limits good scaling.
  • Pipelined training: naive pipelining hurts model quality because the forward and backward passes of a micro-batch may see inconsistent weight versions, and keeping multiple weight versions to avoid this increases memory usage.

Challenge

  1. It is challenging to train with a low memory footprint while still providing high throughput.
  2. The performance of a pipeline-parallel system depends on how the DNN is partitioned across workers, and finding a good partition is challenging because of:
  • Memory capacity constraints: each stage must fit its parameters and the intermediate activations of in-flight micro-batches (a rough memory sketch follows this list).
  • Heterogeneous network interconnects: bandwidth differs across links (e.g., within a server vs. between servers), so where the model is cut matters.
  • Splitting the operator graph optimally is computationally expensive, since the space of possible partitions is large.
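As a rough back-of-the-envelope illustration of the memory constraint, a stage's footprint can be estimated from its parameter state plus the activations of in-flight micro-batches. The sketch below is a toy estimate with made-up names and numbers, not the paper's cost model.

```python
def stage_memory_bytes(param_count, weight_versions, bytes_per_param,
                       activation_bytes_per_microbatch, in_flight_microbatches,
                       optimizer_state_copies=2):
    """Rough per-stage memory estimate for pipeline-parallel training.

    param_count: number of parameters assigned to this pipeline stage
    weight_versions: weight copies kept for consistency (2 with double-buffering)
    optimizer_state_copies: extra per-parameter buffers (e.g., Adam moments)
    """
    weight_mem = param_count * bytes_per_param * (weight_versions + optimizer_state_copies)
    activation_mem = activation_bytes_per_microbatch * in_flight_microbatches
    return weight_mem + activation_mem


# Example with made-up numbers: 1B parameters per stage in fp16, two weight
# versions, four in-flight micro-batches of ~0.5 GiB activations each.
print(stage_memory_bytes(1_000_000_000, 2, 2, 0.5 * 2**30, 4) / 2**30, "GiB")
```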

Design

The system combines the following techniques to ensure low memory usage and high throughput.

Double-buffered Weight Update

Multiple micro-batches use the same weight version for their forward and backward passes, each producing a gradient.

Once the number of micro-batches reaches the gradient-accumulation threshold, the accumulated gradients are averaged and applied to produce a new weight version.

However, the new version cannot be used immediately by all micro-batches: any micro-batch that already ran its forward pass with the old version must also run its backward pass with that same version.

Thus, each worker maintains at most two weight versions at any time, hence "double-buffered".
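A minimal Python sketch of the idea, assuming a single stage and scalar or array "weights"; the class and method names below are illustrative, not the paper's API. Each micro-batch's backward pass uses the same version its forward pass read, a new version is produced after a fixed number of micro-batches, and only two versions ever coexist.

```python
class DoubleBufferedWeights:
    """Sketch of double-buffered weight updates: at most two weight versions
    are kept so in-flight micro-batches can finish with the version they
    started with."""

    def __init__(self, initial_weights, accumulation_steps, lr=0.1):
        self.versions = {0: initial_weights}   # version id -> weights
        self.current = 0                       # version new forward passes read
        self.accumulation_steps = accumulation_steps
        self.lr = lr
        self.grad_buffer = 0.0
        self.seen = 0

    def forward_version(self):
        # New micro-batches always read the latest version.
        return self.current

    def backward(self, version_id, gradient):
        # 'gradient' was computed against self.versions[version_id]; it is
        # accumulated until enough micro-batches have contributed.
        self.grad_buffer += gradient
        self.seen += 1
        if self.seen == self.accumulation_steps:
            avg_grad = self.grad_buffer / self.accumulation_steps
            new_weights = self.versions[self.current] - self.lr * avg_grad
            # Keep only the previous version (still needed by in-flight
            # micro-batches) and the new one: at most two versions alive.
            self.versions = {self.current: self.versions[self.current],
                             self.current + 1: new_weights}
            self.current += 1
            self.grad_buffer, self.seen = 0.0, 0


# Toy usage with scalar weights and 4 micro-batches per weight update.
w = DoubleBufferedWeights(initial_weights=1.0, accumulation_steps=4)
for mb in range(8):
    v = w.forward_version()       # version the forward pass reads
    w.backward(v, gradient=0.5)   # backward reuses that same version
print(w.current, w.versions)      # only two versions alive after two updates
```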

Flush mechanisms

Periodic pipeline flushes drain all in-flight micro-batches before a weight update, so that gradients are always applied against a consistent weight version.
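A minimal sketch of a flush-based schedule, assuming simple placeholder callables (run_forward_backward, drain_pipeline, apply_update) that are not the paper's interface. Draining the pipeline before every update means no micro-batch straddles a weight update, so a single weight version suffices, at the cost of some idle (bubble) time.

```python
def train_with_periodic_flush(stages, micro_batches_per_update, num_updates,
                              run_forward_backward, drain_pipeline, apply_update):
    """Sketch of pipelined training with periodic flushes: only one weight
    version is needed because no micro-batch spans a weight update."""
    for step in range(num_updates):
        for mb in range(micro_batches_per_update):
            # Inject one micro-batch into the pipeline.
            run_forward_backward(stages, micro_batch_id=mb)
        # Flush: wait until every in-flight micro-batch has finished its
        # backward pass before touching the weights.
        drain_pipeline(stages)
        # All accumulated gradients correspond to the same weight version.
        apply_update(stages)
```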

Auto-partition planner

The partition planner tries to determine the partitioning configuration that yields the best performance.

It decides how the model is partitioned over workers, the batch size, and whether to use memory-saving optimizations such as activation recomputation.
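A toy sketch of what such a planner could look like: enumerate candidate configurations (pipeline depth, data-parallel width, micro-batch size, activation recomputation on or off) and keep the one with the highest estimated throughput that fits the memory budget. The cost and memory estimators here are placeholder callables, not the paper's actual model.

```python
from itertools import product


def plan(num_gpus, candidate_microbatch_sizes, estimate_throughput,
         estimate_stage_memory, memory_budget_bytes):
    """Pick (pipeline depth, data-parallel width, micro-batch size, recompute)
    maximizing estimated throughput subject to a per-GPU memory budget."""
    best_config, best_tput = None, 0.0
    depths = [d for d in range(1, num_gpus + 1) if num_gpus % d == 0]
    for depth, mbs, recompute in product(depths, candidate_microbatch_sizes,
                                         (False, True)):
        width = num_gpus // depth  # data-parallel replicas per pipeline stage
        if estimate_stage_memory(depth, mbs, recompute) > memory_budget_bytes:
            continue               # violates the memory capacity constraint
        tput = estimate_throughput(depth, width, mbs, recompute)
        if tput > best_tput:
            best_config, best_tput = (depth, width, mbs, recompute), tput
    return best_config
```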

Evaluation

The evaluation compares convergence quality, throughput, memory usage, and the planner's partitioning decisions.




