Hydrozoa: Dynamic Hybrid-Parallel DNN Training on Serverless Containers
3 minute read ∼ Filed in: A paper note

Introduction
Motivation
Training DNNs in parallel can reduce training time, but it is inconvenient due to cumbersome cluster management and overprovisioning of resources.
Some properties of serverless architectures could solve those problems, but serverless functions are still inefficient for training because of communication costs (workers communicate indirectly, e.g. through storage such as S3) and CPU-only limitations. An alternative is serverless containers, which can use GPUs and communicate with each other directly, though they can still suffer from overprovisioning. As a result, using serverless compute for distributed training requires some effort in system design. Platform examples:
- MLaaS: Azure ML, AWS SageMaker.
- Serverless containers: Azure Container Instances (ACI), a service that executes containers in a serverless fashion.
Existing distributed training systems such as MXNet and PyTorch were not designed for serverless platforms; they cannot handle workers on serverless compute dynamically joining and leaving during training. Existing serverless-based training systems, meanwhile, are less efficient.
Contribution
The paper designs Hydrozoa, a distributed training system on serverless containers that achieves higher throughput-per-dollar by combining the following features:
- Hybrid-parallel training: combines data parallelism and model parallelism => improves training speed.
- Serverless containers: uses containers instead of functions => direct communication, GPU support, fine-grained scaling, pay-per-use, and no cluster management.
- Dynamic worker scaling: workers can be added or removed dynamically during training.
In conclusion, it combines serverless containers with hybrid-parallel training and supports dynamic worker scaling to achieve high throughput and low cost.
System design
Overall, the system has three components:
- planner:
- determines the model partitioning and selects the hybrid-parallel strategy.
- monitors memory usage and sends it to the coordinator.
- worker: each worker runs three threads: computation, gradient communication, and prefetching of training data (see the sketch after this list).
- coordinator:
- starts worker tasks with their configurations.
- gathers worker information.
- adjusts resources when the number of workers changes.
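To make the worker's three-thread design concrete, here is a minimal sketch (my own illustration, not Hydrozoa's code) of how prefetching, computation, and gradient communication could overlap; `run_worker`, the queues, and the stand-in `forward_backward`/`all_reduce` callables are all assumptions.

```python
import queue
import threading

# Illustrative sketch of a worker's three threads (my own code, not Hydrozoa's):
# prefetching, computation, and gradient communication overlap so the compute
# thread is never stalled on I/O. All names and queue sizes are assumptions.

data_q = queue.Queue(maxsize=4)   # prefetched micro-batches
grad_q = queue.Queue(maxsize=4)   # gradients waiting to be exchanged

def prefetch_loop(loader, steps):
    for _, batch in zip(range(steps), loader):
        data_q.put(batch)                     # stay ahead of the compute thread

def compute_loop(forward_backward, steps):
    for _ in range(steps):
        batch = data_q.get()                  # blocks until a micro-batch is ready
        grad_q.put(forward_backward(batch))   # hand gradients to the comm thread

def comm_loop(all_reduce, steps):
    for _ in range(steps):
        all_reduce(grad_q.get())              # exchange gradients with peer workers

def run_worker(loader, forward_backward, all_reduce, steps):
    threads = [
        threading.Thread(target=prefetch_loop, args=(loader, steps)),
        threading.Thread(target=compute_loop, args=(forward_backward, steps)),
        threading.Thread(target=comm_loop, args=(all_reduce, steps)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Toy usage with stand-in callables in place of real data loading and
# NCCL-style collectives:
run_worker(iter([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]),
           forward_backward=lambda batch: [x * 0.1 for x in batch],
           all_reduce=print,
           steps=4)
```

The point of the bounded queues is simply that the three stages can proceed concurrently without any one of them running unboundedly ahead of the others.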
Planner & Parallelism
The system combines data parallelism and model parallelism, so it must determine how many shards to partition the model into and which batch sizes or micro-batch sizes to use.
The planner runs a partitioning algorithm that splits the model into shards in a way that minimizes the total training time for a minibatch of inputs.
- It first profiles the model, recording per-layer compute time and output size, along with communication speed.
- It then formulates the partitioning problem as an optimization problem.
- For each candidate batch size, it computes the best partitioning using dynamic programming (sketched below).
- Finally, it picks the globally best combination of partitioning and batch size.
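To give a feel for the dynamic-programming step, here is a simplified sketch (my own code, not the paper's exact formulation): it splits a chain of profiled per-layer compute times into a fixed number of contiguous shards so that the slowest shard, which bottlenecks the pipeline, is as fast as possible. A real planner would also fold in communication costs from per-layer output sizes and repeat the search over candidate batch sizes.

```python
import functools

# Simplified DP sketch: minimize the bottleneck shard time of a layer chain.
# This is an assumption-laden illustration, not Hydrozoa's actual objective.

def best_partition(layer_times, num_shards):
    n = len(layer_times)
    prefix = [0.0]
    for t in layer_times:                     # prefix sums for O(1) shard sums
        prefix.append(prefix[-1] + t)

    @functools.lru_cache(maxsize=None)
    def dp(i, k):
        """Minimum possible bottleneck time for layers[i:] split into k shards."""
        if k == 1:
            return prefix[n] - prefix[i]      # everything left goes in one shard
        best = float("inf")
        for j in range(i + 1, n - k + 2):     # first shard is layers[i:j]
            first = prefix[j] - prefix[i]
            best = min(best, max(first, dp(j, k - 1)))
        return best

    return dp(0, num_shards)

# Toy usage: per-layer compute times (seconds) from profiling, split 3 ways.
print(best_partition([0.4, 0.1, 0.3, 0.2, 0.5, 0.1], 3))
```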
Evaluation
The evaluation first measures throughput across hardware/serverless configurations to confirm the benefits of the container-based serverless approach.
It then measures throughput across parallelism strategies to confirm the effectiveness of the planner.
Throughput (T): number of inputs processed per second.
Throughput-per-dollar (T/C): number of inputs the system can process per dollar spent, where C is the per-second cost of running the system.
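As a quick sanity check on the units (the numbers here are hypothetical, not from the paper):

```python
# Hypothetical numbers purely to illustrate the units (not from the paper):
T = 200.0        # throughput: inputs processed per second
C = 0.002        # cost of running the system, in dollars per second
print(T / C)     # throughput-per-dollar: 100000.0 inputs per dollar
```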
Throughput
Single worker: GPU + direct communication > CPU + direct communication > CPU + indirect communication.
HL scales poorly since indirect communication (through S3) adds overhead.
It also compares the system with the cloud platform's built-in (MLaaS) training offerings.
The system has a lower cost: although training occurs on the GPU, only a fraction of its memory is actually used, and Hydrozoa uses profiling data to allocate no more resources than needed.
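One way to read that last point, sketched with made-up numbers and a hypothetical safety margin (not the paper's actual policy): size the resource request from profiled peak usage rather than requesting a whole machine.

```python
# Made-up numbers and a hypothetical safety margin (not the paper's policy):
# size the container request from profiled peak usage instead of a full VM.
profiled_peak_memory_gb = 3.1            # peak usage observed during profiling
safety_margin = 1.2                      # headroom against out-of-memory errors
request_gb = profiled_peak_memory_gb * safety_margin
print(f"request {request_gb:.1f} GB rather than a whole 16 GB device")
```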
Planner
Hydrozoa's partitioning algorithm produces partitions that scale throughput effectively.
Auto scaling
When the number of workers increases, the coordinator adjusts the resources accordingly.
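A simplified, hypothetical sketch of one piece of such an adjustment: reassigning dataset shards when the worker count changes. The real coordinator presumably also launches or retires containers and pushes updated configurations, per the component list above.

```python
# Hypothetical sketch: give each worker a roughly equal contiguous slice of
# the dataset whenever the worker count changes. Not Hydrozoa's actual code.

def rebalance(dataset_indices, num_workers):
    n = len(dataset_indices)
    return [dataset_indices[i * n // num_workers:(i + 1) * n // num_workers]
            for i in range(num_workers)]

indices = list(range(1000))
shards = rebalance(indices, 4)    # 4 workers
shards = rebalance(indices, 16)   # after scaling up to 16 workers
print([len(s) for s in shards])   # each of the 16 workers gets ~62-63 indices
```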
In the experiments, training starts with one worker and scales up to 16 workers.
The results show that dynamic worker scaling in Hydrozoa brings significant benefits in training efficiency and cost.