Cerebro: A Data System for Optimized Deep Learning Model Selection
2 minute read ∼ Filed in: A paper note
Introduction
Motivation
It is important to scale model selection automatically on a cluster to increase throughput without raising resource costs. The paper proposes a system to meet these requirements. Specifically, the system design has the following considerations.
- Scalability: model selection should scale with both the number of tasks (configurations) and the size of the data.
- High throughput: how many configurations are evaluated per unit time.
- Resource efficiency:
  - Per-epoch efficiency: time to complete an epoch.
  - Convergence efficiency: time to complete training.
  - Memory/storage efficiency: memory/disk usage.
  - Communication efficiency: network bandwidth usage.
- Reproducibility
However, existing model selection systems do not achieve all of the above well.
- They do not exploit data scalability: the data is not sharded across workers, which limits overall scalability.
- Bringing data scalability into the model selection process without wasting memory or network bandwidth is not easy: decoupling compute from storage leads to high network bandwidth usage or requires caching data at the workers.
Contribution
The paper presents Cerebro, a model selection system based on a well-designed combination of task parallelism and data parallelism called Model Hopper Parallelism (MOP). It can:
- Increase model selection throughput (3-10x).
- Reduce memory/storage usage (8x).
- Reduce network communication (100x).
System Design
Parallelism summary
Parallelism is a popular way to raise throughput.
- Task parallelism requires copying the data to each worker, and there is no communication between workers => high storage cost.
- BSP (Bulk Synchronous Parallel): the master averages the weights/gradients once per epoch => poor convergence.
- Synchronous / asynchronous parameter server => high communication overhead.
- All-reduce (e.g., Horovod) => high communication overhead.
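Cerebro's hybrid, Model Hopper Parallelism (MOP), is designed to avoid these trade-offs: the data is sharded across workers exactly once, and model configurations "hop" between workers, each training for one sub-epoch per shard, so only small model checkpoints cross the network. Below is a minimal sketch of one MOP epoch; it assumes a simple round-robin hop order rather than the paper's actual scheduler, and all names are illustrative.

```python
def mop_epoch(configs, shards, train_sub_epoch):
    """Run one MOP epoch: every config visits every data shard exactly once.

    configs:          list of model states (weights + hyperparameters)
    shards:           list of data shards, one resident on each worker
    train_sub_epoch:  function (model_state, shard) -> updated model_state
    """
    p = len(shards)
    assert len(configs) == p, "simplest case: as many configs as workers"

    for step in range(p):
        # In a real system this inner loop runs in parallel, one
        # (config, shard) pair per worker; only the model checkpoint
        # moves over the network, never the data shard.
        for worker in range(p):
            config_id = (worker + step) % p  # round-robin hop
            configs[config_id] = train_sub_epoch(configs[config_id],
                                                 shards[worker])
    return configs
```

Because every configuration still sees the whole dataset sequentially within an epoch, training behaves like sequential training of that configuration on one machine, which underlies Cerebro's reproducibility claim.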
System
The system consists of two main parts:
- Scheduling: Cerebro tries to find the best per-epoch scheduling policy, i.e., which (configuration, shard) pair each worker trains next.
- Failure recovery: Cerebro detects failures via a periodic heartbeat check between the scheduler and the workers.
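A minimal sketch of such a heartbeat check on the scheduler side is shown below; the class name, timeout value, and `reassign_work_of` hook are hypothetical, not Cerebro's actual API.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a worker is presumed dead (assumed value)

class HeartbeatMonitor:
    def __init__(self, workers):
        # last time each worker was heard from
        self.last_seen = {w: time.time() for w in workers}

    def on_heartbeat(self, worker):
        """Called whenever a worker reports in."""
        self.last_seen[worker] = time.time()

    def check_failures(self):
        """Periodically invoked by the scheduler; returns presumed-failed workers."""
        now = time.time()
        failed = [w for w, t in self.last_seen.items()
                  if now - t > HEARTBEAT_TIMEOUT]
        for w in failed:
            # The sub-epoch that was running on the failed worker is
            # restarted from its last model checkpoint on a healthy worker.
            self.reassign_work_of(w)
            del self.last_seen[w]
        return failed

    def reassign_work_of(self, worker):
        ...  # placeholder: re-queue the failed worker's training unit
```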
Evaluation
Workloads: a grid over two neural architectures and their hyperparameters => 16 configurations.
Baseline for comparison: Horovod.
Setup: a GPU cluster with 8 workers and 1 master.
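For concreteness, a 16-configuration grid like the one in the workload can be built as a cross product over architectures and hyperparameters. The specific architectures and values below are illustrative placeholders, not the paper's exact grid.

```python
from itertools import product

# Hypothetical grid: 2 architectures x 2 learning rates x 2 batch sizes
# x 2 regularization strengths = 16 configurations.
architectures = ["vgg16", "resnet50"]
learning_rates = [1e-3, 1e-4]
batch_sizes = [32, 256]
weight_decays = [1e-4, 1e-5]

configs = [
    {"arch": a, "lr": lr, "batch_size": bs, "weight_decay": wd}
    for a, lr, bs, wd in product(architectures, learning_rates,
                                 batch_sizes, weight_decays)
]
assert len(configs) == 16
```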
Throughput & Efficiency
The experiments show that Cerebro delivers high throughput while using far less storage and network bandwidth than the baseline.
Scalability
Cerebro achieves linear speedups as workers are added because MOP's communication costs are marginal: only model checkpoints move between workers.
In contrast, Horovod exhibits substantially sub-linear speedups because its communication costs grow much faster with the number of workers.
Effect of batch size
As the batch size increases, Cerebro's runtime decreases because larger batches make fuller use of the hardware's compute capacity.
A larger batch size also reduces Horovod's runtime because there are fewer mini-batch steps per epoch and therefore less gradient synchronization.
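A rough back-of-the-envelope model of the Horovod effect, assuming roughly one all-reduce of the model's size per mini-batch step (the function and the numbers are illustrative only):

```python
def allreduce_traffic_per_epoch(num_examples, batch_size, model_bytes):
    # One gradient all-reduce per mini-batch step; larger batches mean
    # fewer steps per epoch and therefore less synchronization traffic.
    steps_per_epoch = num_examples // batch_size
    return steps_per_epoch * model_bytes

# e.g. 1M examples and a 100 MB model:
small = allreduce_traffic_per_epoch(1_000_000, 32, 100e6)   # ~3.1 TB per epoch
large = allreduce_traffic_per_epoch(1_000_000, 256, 100e6)  # ~0.4 TB per epoch
```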
Related systems
Google Vizier, Ray Tune, Dask-Hyperband, SparkDL, and Spark-Hyperopt.