Cerebro: A Data System for Optimized Deep Learning Model Selection

Posted on June 24, 2022 · 2 minute read

Introduction

Motivation

It is important to automatically scale model selection on a cluster to increase throughput without raising resource costs. The paper proposes Cerebro, a system that meets this requirement. Specifically, the system design has the following considerations.

  1. Scalability: model selection should scale both in the number of configurations evaluated (task scalability) and in the size of the training data (data scalability).
  2. High throughput: how many configurations are evaluated per unit time.
  3. Resource efficiency:
    • Per-epoch efficiency: time to complete an epoch.
    • Convergence efficiency: time to complete training.
    • Memory/storage efficiency: memory/disk usage.
    • Communication efficiency: network bandwidth usage.
  4. Reproducibility

However, existing model selection systems do not achieve these goals well.

  1. They do not shard the training data across workers, so they miss out on data scalability.
  2. Bringing data scalability into model selection without wasting memory and network bandwidth is not easy: decoupling compute from storage either forces high network traffic to re-read the data or requires caching extra copies of it.

Contribution

The paper builds a model selection system with a well-designed combination of task parallelism and data parallelism. It can:

  1. Increase model selection throughput (3-10x).
  2. Reduce memory/storage usage (8x).
  3. Reduce network communication (100x).

System Design

Parallelism summary

Parallelism is a popular way to raise throughput. The paper reviews the common approaches and their drawbacks; a sketch of the paper's alternative, model hopper parallelism (MOP), follows the list.

  1. Task parallelism: each worker holds a full copy of the data and trains its configurations independently, with no communication between workers => high storage cost.
  2. BSP (Bulk Synchronous Parallel): the master averages the weights/gradients once per epoch => poor convergence.
  3. Synchronous / asynchronous parameter server => high communication overhead.
  4. All-reduce (e.g., Horovod) => high communication overhead.
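
For contrast, here is a minimal sketch of the scheduling idea behind MOP: the data stays sharded across the workers, and only the per-configuration model state hops between them once per sub-epoch. This is a simplified round-robin variant for illustration only; `Config` and `train_sub_epoch` are hypothetical placeholders, not Cerebro's actual API or scheduler.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class Config:
    hparams: Dict[str, Any]     # one hyperparameter configuration
    model_state: Any = None     # weights/optimizer state that hops between workers

def mop_epoch(configs: List[Config], shards: List[Any],
              train_sub_epoch: Callable[[Any, Dict[str, Any], Any], Any]) -> None:
    """One epoch of model-hopper scheduling: every config visits every shard once.

    Only the small model state moves between workers; the large data shards
    stay in place, which is why the communication cost stays marginal.
    """
    p = len(shards)
    assert len(configs) <= p, "at most one config per worker at any time"
    # Round-robin placement: in round r, config i trains on shard (i + r) % p,
    # so no two configs need the same shard (i.e. the same worker) at once.
    for r in range(p):
        for i, cfg in enumerate(configs):
            cfg.model_state = train_sub_epoch(cfg.model_state, cfg.hparams,
                                              shards[(i + r) % p])
```

Compared with task parallelism, the data is never copied to every worker; compared with per-step all-reduce or parameter servers, state is exchanged only once per sub-epoch.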

System

The system is composed of two main parts.

  1. Scheduling: the scheduler tries to find the best policy for placing model configurations onto workers each epoch (a simplified round-robin sketch appears after the parallelism summary above).
  2. Failure recovery: Cerebro detects failures via periodic heartbeat checks between the scheduler and the workers; a minimal detection sketch follows this list.
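
The sketch below only illustrates the heartbeat-based detection idea; the timeout value and function name are assumptions, not details from the paper.

```python
import time
from typing import Dict, List, Optional

# Illustrative timeout; the paper does not specify a concrete value here.
HEARTBEAT_TIMEOUT_S = 30.0

def find_failed_workers(last_heartbeat: Dict[str, float],
                        now: Optional[float] = None) -> List[str]:
    """Return the workers whose last heartbeat is older than the timeout.

    `last_heartbeat` maps a worker id to the timestamp (in seconds) of the
    most recent heartbeat the scheduler received from that worker.
    """
    now = time.time() if now is None else now
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT_S]
```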


Evaluation

Workloads: two neural architectures crossed with several hyperparameter choices => 16 configurations.
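
For concreteness, one way such a 16-configuration workload can be laid out is a small grid search. The architecture names and hyperparameter values below are placeholders, since the post does not list the exact grid.

```python
from itertools import product

# A hypothetical grid in the spirit of the evaluation setup: two architectures
# crossed with a few hyperparameter choices to yield 16 configurations.
grid = {
    "arch": ["net_A", "net_B"],
    "learning_rate": [1e-3, 1e-4],
    "batch_size": [32, 256],
    "weight_decay": [1e-4, 1e-5],
}
configs = [dict(zip(grid.keys(), values)) for values in product(*grid.values())]
assert len(configs) == 16  # 2 * 2 * 2 * 2
```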

Cerebro is compared against Horovod.

GPU cluster with 8 workers and 1 master.

Throughput & Efficiency

The paper shows the system is fast and resource-efficient.


Scalability

Cerebro achieves near-linear speedups because MOP's communication cost is marginal.

In contrast, Horovod exhibits substantially sub-linear speedups due to its much higher communication costs with multiple workers.


Effect of batch size

As the batch size increases, Cerebro's runtime drops because larger batches make fuller use of the hardware's compute capacity.

A larger batch size also reduces Horovod's runtime because there are fewer mini-batch steps per epoch and hence fewer all-reduce communication rounds.


Related systems

Google Vizier, Ray Tune, Dask-Hyperband, SparkDL, and Spark-Hyperopt.




