Cerebro: A Data System for Optimized Deep Learning Model Selection
2 minute read ∼ Filed in: A paper note
Introduction
Motivation
It is important to scale model selection automatically on a cluster to increase throughput without raising resource costs. The paper proposes a system to meet these requirements. Specifically, the system design has the following considerations.
- Scalability: model selection should scale with both the number of tasks (configurations) and the size of the data.
- High throughput: how many configurations are evaluated per unit time.
- Resource efficiency:
  - Per-epoch efficiency: time to complete an epoch.
  - Convergence efficiency: time to complete training.
  - Memory/storage efficiency: memory/disk usage.
  - Communication efficiency: network bandwidth usage.
- Reproducibility
However, existing model selection systems do not achieve all of the above well.
- They do not exploit data scalability: the data is not sharded across workers, which limits overall scalability.
- Bringing data scalability into the model selection process without wasting memory or network bandwidth is not easy: decoupling compute from storage leads to high network bandwidth usage or requires caching data at the workers.
Contribution
The paper presents Cerebro, a model selection system based on a well-designed combination of task parallelism and data parallelism called Model Hopper Parallelism (MOP). It can:
- Increase model selection throughput (3-10x).
- Reduce memory/storage usage (8x).
- Reduce network communication (100x).
System Design
Parallelism summary
Parallelism is a popular way to raise throughput.
- Task parallelism requires copying the data to each worker, and there is no communication between workers => high storage cost.
- BSP (Bulk Synchronous Parallel): the master averages the weights/gradients once per epoch => poor convergence.
- Synchronous / asynchronous parameter server => high communication overhead.
- All-reduce (e.g., Horovod) => high communication overhead.
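Cerebro's hybrid, Model Hopper Parallelism (MOP), is designed to avoid these trade-offs: the data is sharded across workers exactly once, and model configurations "hop" between workers, each training for one sub-epoch per shard, so only small model checkpoints cross the network. Below is a minimal sketch of one MOP epoch; it assumes a simple round-robin hop order rather than the paper's actual scheduler, and all names are illustrative.

```python
def mop_epoch(configs, shards, train_sub_epoch):
    """Run one MOP epoch: every config visits every data shard exactly once.

    configs:          list of model states (weights + hyperparameters)
    shards:           list of data shards, one resident on each worker
    train_sub_epoch:  function (model_state, shard) -> updated model_state
    """
    p = len(shards)
    assert len(configs) == p, "simplest case: as many configs as workers"

    for step in range(p):
        # In a real system this inner loop runs in parallel, one
        # (config, shard) pair per worker; only the model checkpoint
        # moves over the network, never the data shard.
        for worker in range(p):
            config_id = (worker + step) % p  # round-robin hop
            configs[config_id] = train_sub_epoch(configs[config_id],
                                                 shards[worker])
    return configs
```

Because every configuration still sees the whole dataset sequentially within an epoch, training behaves like sequential training of that configuration on one machine, which underlies Cerebro's reproducibility claim.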
System
The system consists of two main parts:
- Scheduling: Cerebro tries to find the best per-epoch scheduling policy, i.e., which (configuration, shard) pair each worker trains next.
- Failure recovery: Cerebro detects failures via a periodic heartbeat check between the scheduler and the workers.
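A minimal sketch of such a heartbeat check on the scheduler side is shown below; the class name, timeout value, and `reassign_work_of` hook are hypothetical, not Cerebro's actual API.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a worker is presumed dead (assumed value)

class HeartbeatMonitor:
    def __init__(self, workers):
        # last time each worker was heard from
        self.last_seen = {w: time.time() for w in workers}

    def on_heartbeat(self, worker):
        """Called whenever a worker reports in."""
        self.last_seen[worker] = time.time()

    def check_failures(self):
        """Periodically invoked by the scheduler; returns presumed-failed workers."""
        now = time.time()
        failed = [w for w, t in self.last_seen.items()
                  if now - t > HEARTBEAT_TIMEOUT]
        for w in failed:
            # The sub-epoch that was running on the failed worker is
            # restarted from its last model checkpoint on a healthy worker.
            self.reassign_work_of(w)
            del self.last_seen[w]
        return failed

    def reassign_work_of(self, worker):
        ...  # placeholder: re-queue the failed worker's training unit
```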
Evaluation
Workloads: a grid over two neural architectures and their hyperparameters => 16 configurations.
Baseline for comparison: Horovod.
Setup: a GPU cluster with 8 workers and 1 master.
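For concreteness, a 16-configuration grid like the one in the workload can be built as a cross product over architectures and hyperparameters. The specific architectures and values below are illustrative placeholders, not the paper's exact grid.

```python
from itertools import product

# Hypothetical grid: 2 architectures x 2 learning rates x 2 batch sizes
# x 2 regularization strengths = 16 configurations.
architectures = ["vgg16", "resnet50"]
learning_rates = [1e-3, 1e-4]
batch_sizes = [32, 256]
weight_decays = [1e-4, 1e-5]

configs = [
    {"arch": a, "lr": lr, "batch_size": bs, "weight_decay": wd}
    for a, lr, bs, wd in product(architectures, learning_rates,
                                 batch_sizes, weight_decays)
]
assert len(configs) == 16
```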
Throughput & Efficiency
The experiments show that Cerebro delivers high throughput while using far less storage and network bandwidth than the baseline.
Scalability
Cerebro achieves linear speedups as workers are added because MOP's communication costs are marginal: only model checkpoints move between workers.
In contrast, Horovod exhibits substantially sub-linear speedups because its communication costs grow much faster with the number of workers.
Effect of batch size
As the batch size increases, Cerebro's runtime decreases because larger batches make fuller use of the hardware's compute capacity.
A larger batch size also reduces Horovod's runtime because there are fewer mini-batch steps per epoch and therefore less gradient synchronization.
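A rough back-of-the-envelope model of the Horovod effect, assuming roughly one all-reduce of the model's size per mini-batch step (the function and the numbers are illustrative only):

```python
def allreduce_traffic_per_epoch(num_examples, batch_size, model_bytes):
    # One gradient all-reduce per mini-batch step; larger batches mean
    # fewer steps per epoch and therefore less synchronization traffic.
    steps_per_epoch = num_examples // batch_size
    return steps_per_epoch * model_bytes

# e.g. 1M examples and a 100 MB model:
small = allreduce_traffic_per_epoch(1_000_000, 32, 100e6)   # ~3.1 TB per epoch
large = allreduce_traffic_per_epoch(1_000_000, 256, 100e6)  # ~0.4 TB per epoch
```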
Related systems
Google Vizier, Ray Tune, Dask-Hyperband, SparkDL, and Spark-Hyperopt.