Retiarii: A Deep Learning Exploratory-Training Framework
7 minute read ∼ Filed in: A paper note
Current Problems
There are three ways to create a new model candidate during exploration, but the exploratory-training process is not well supported by current software:
- Search space: In current systems such as PyTorch/TensorFlow, all model variations have to be coded up in one jumbo model, with control flow to pick one variant during model construction. The control flow in this jumbo model makes it hard to optimize memory usage, perform operator fusion, etc.
- Search strategy: The exploration strategy is responsible for:
  - deciding which models to instantiate and train, in what priority, and when to terminate, e.g., random search, grid search, heuristic-based search, Bayesian optimization, and reinforcement learning;
  - managing the execution of training, e.g., stopping the training of poorly performing models, adding resources to promising models, and sharing the weights of overlapping layers.
The implementation of a search strategy is often tightly coupled with a specific model space, which causes two problems:
- Poor reusability: a search strategy designed for one search space is hard to reuse for another search space.
- Poor scalability: it is hard to perform cross-model training or distributed training across multiple GPUs.
Contribution
The paper’s system clearly decouples model space from exploration strategy and enables system optimizations to speed up the exploration process.
- A new programming interface (the Mutator abstraction) to:
  - specify the DNN model space for exploration;
  - specify the exploration strategy, which decides:
    - the order in which models are instantiated and trained,
    - how model training is prioritized,
    - when training is terminated.
Retiarii uses the Mutator abstraction for both specifications (search space and strategy):
- Search space = a set of base models and mutators.
- Search strategy = which base model and mutators to use, and when to apply the mutators to the base model.
Each mutator is fine-grained and captures a logical unit of modification, which makes mutators reusable and composable.
- A just-in-time (JIT) engine that instantiates models, manages the training of the instantiated models, gathers information for the exploration strategy to consume, and executes the strategy's decisions.
- Cross-model optimizations that improve the overall exploratory-training process by using correlation information across models.
Evaluation results
- Reduces the exploration time of popular NAS algorithms by up to 2.58×.
- Improves the scalability of weight-sharing-based NAS with a speed-up of up to 8.58×.
Mutator as the Core Abstraction
- Rather than encoding modifications in one complex jumbo model, the system imports an existing TensorFlow/PyTorch model as a base model and applies mutators to generate new models.
There are three types of mutators in the system:
- Input mutator: mutates the inputs of matched operators.
- Operator mutator: replaces matched operators with other operators.
- Insert mutator: inserts new operators or sub-graphs.
The search space includes the base models and all new models generated from them.
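As a rough illustration of the mutator idea, the sketch below shows an operator mutator that rewrites matched nodes of a base graph into a chosen candidate operator. The `Mutator` base class, the `graph`/`node` objects, and their attributes are hypothetical stand-ins, not Retiarii's actual API.

```python
# Hypothetical sketch: a mutator is a small, reusable unit of modification
# that takes a base graph and returns a new, mutated graph.
import copy
import random


class Mutator:
    """A logical unit of modification applied to a model graph."""

    def mutate(self, graph):
        raise NotImplementedError


class OperatorMutator(Mutator):
    """Replace every matched operator with one sampled candidate operator."""

    def __init__(self, match_op, candidate_ops):
        self.match_op = match_op              # e.g. "conv3x3"
        self.candidate_ops = candidate_ops    # e.g. ["conv5x5", "depthwise_conv3x3"]

    def mutate(self, graph):
        new_graph = copy.deepcopy(graph)      # leave the base model untouched
        chosen = random.choice(self.candidate_ops)
        for node in new_graph.nodes:          # assumed graph/node structure
            if node.op_type == self.match_op:
                node.op_type = chosen         # the actual rewrite
        return new_graph
```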
- The system records the relations between models generated by different mutators from the same base model; e.g., for two instantiations of the same base model, the nodes not modified by any mutator are considered identical. These relations can be used to optimize multi-model training.
- A mutator can be applied to any sub-graph of a model to create a new model instance.
Retiarii Just-In-Time Engine
The engine instantiates models on the fly and manages their training dynamically.
Inputs: base models, mutators, and a policy describing the exploration strategy.
Execution: pick one base model and a mutator, and generate a new model according to the strategy.
The strategy can be:
- a context-free strategy (e.g., random choice),
- a history-based strategy,
- a customized strategy.
The engine records the mutation history, so it knows which nodes are unmodified and remain identical across models. This allows the engine to perform cross-model optimizations such as:
- common sub-expression elimination,
- cross-model operator batching,
- NAS-specific optimizations.
The optimized data-flow graph is then converted to a standard model format so the existing DL framework can perform single-model optimizations before training.
The strategy is also responsible for:
- launching training of new models,
- monitoring training and collecting results,
- adjusting training resource allocation,
- terminating training of less promising models.
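Putting the pieces together, the engine's exploration loop might look roughly like the sketch below. The `strategy` interface, the `train_and_evaluate` routine, and the shape of the history records are assumptions for illustration, not the paper's interfaces.

```python
# Hypothetical sketch of the exploratory-training loop driven by the JIT engine.
def explore(base_models, mutators, strategy, train_and_evaluate, budget):
    history = []  # (model, mutator, result) records seen so far
    for _ in range(budget):
        # The strategy picks a base model and a mutator; it may be
        # context-free (random), history-based, or fully customized.
        base, mutator = strategy.next_choice(base_models, mutators, history)
        model = mutator.mutate(base)

        # Because the mutation history is recorded, the engine knows which
        # nodes stayed identical and can apply cross-model optimizations
        # (CSE, operator batching, weight sharing) before training.
        result = train_and_evaluate(model)
        history.append((model, mutator, result))

        # The strategy can also reallocate resources or stop bad models here.
        strategy.update(history)
    return history
```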
Cross-Model Optimization
Optimization Opportunities
Common sub-expression elimination
Identical operations are computed only once. This can be applied to non-trainable operations such as data loading and preprocessing, since they are deterministic; trainable operations cannot be shared this way because their weights change during training.
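A concrete (hypothetical) PyTorch-style illustration: when several candidate models consume the same dataset and preprocessing, each batch can be loaded and preprocessed once and then fed to every model, instead of once per model.

```python
# Sketch: eliminate the common, non-trainable prefix (data loading and
# preprocessing) by sharing one batch across all candidate models.
def shared_train_step(models, optimizers, batch, loss_fn):
    x, y = batch                        # loaded and preprocessed only once
    for model, opt in zip(models, optimizers):
        opt.zero_grad()
        loss = loss_fn(model(x), y)     # every model reuses the same batch
        loss.backward()
        opt.step()
```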
Operator Batching
Common operators with different inputs and weights can potentially be batched together and computed in a single operator kernel.
- Two graphs that share multiple layers with the same weights can be merged.
- Two operators with different weights can also be batched with special kernels such as grouped convolution or batch_matmul, which compute slices of an input tensor in parallel (see the grouped-convolution sketch below).
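The sketch below illustrates the grouped-convolution case: two convolutions with the same shapes but different weights can be evaluated by a single grouped convolution over channel-concatenated inputs, producing the same results as running them separately. This is a generic PyTorch example, not Retiarii's batching code.

```python
import torch
import torch.nn.functional as F

# Two convolutions with identical shapes but different weights and inputs.
x1, x2 = torch.randn(8, 16, 32, 32), torch.randn(8, 16, 32, 32)
w1, w2 = torch.randn(32, 16, 3, 3), torch.randn(32, 16, 3, 3)

# Unbatched: two separate kernel launches.
y1 = F.conv2d(x1, w1, padding=1)
y2 = F.conv2d(x2, w2, padding=1)

# Batched: one grouped convolution, one kernel launch.
x = torch.cat([x1, x2], dim=1)             # (8, 32, 32, 32)
w = torch.cat([w1, w2], dim=0)             # (64, 16, 3, 3)
y = F.conv2d(x, w, padding=1, groups=2)    # group 0 -> model 1, group 1 -> model 2

assert torch.allclose(y[:, :32], y1, atol=1e-5)
assert torch.allclose(y[:, 32:], y2, atol=1e-5)
```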
Weight sharing
Instead of training a graph's weights from scratch, shared weights are inherited from other graphs and training continues from them; only the differing nodes keep their own weights.
- Users can annotate the operator weights they want to share.
- The system identifies the weight-sharing-enabled operators in common sub-graphs.
Weight sharing introduces a new type of parallelism when constructing executable graphs. The system builds a super-graph automatically, so checkpoints do not have to be stored to disk and reloaded.
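One simple way to picture weight sharing (an illustrative PyTorch sketch, not the paper's super-graph mechanism): two candidate models hold references to the same module objects for their common layers, so those parameters are shared and inherited rather than retrained, while only the differing layers keep separate weights.

```python
import torch.nn as nn

# Shared stem: both candidates reference the *same* module objects,
# so the stem's weights are shared rather than trained from scratch twice.
shared_stem = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
)

# The candidates differ only in their heads; only these weights diverge.
model_a = nn.Sequential(shared_stem, nn.Conv2d(16, 32, kernel_size=3, padding=1))
model_b = nn.Sequential(shared_stem, nn.Conv2d(16, 32, kernel_size=5, padding=2))

# Same parameter objects, hence shared weights.
assert model_a[0][0].weight is model_b[0][0].weight
```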
Executable Graph Construction
To exploit the above optimizations, the system needs to construct executable graphs from the raw models. The construction involves:
- model merging,
- device placement of operators,
- training parallelism.
Device placement
For DFGs sharing the same dataset and preprocessing, these common operators can be merged by common sub-expression elimination. The system runs each model for a few iterations, sorts the models by iteration time, and then packs as many models as possible onto each device.
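A hypothetical sketch of this profile-then-pack idea: briefly profile each model's iteration time and memory use, sort by iteration time, and greedily pack models onto a device under a memory budget. The `profile` callback and the single-budget packing rule are assumptions for illustration.

```python
# Hypothetical greedy packing of models onto one device.
def pack_models(models, profile, memory_budget):
    # profile(model) -> (iteration_time, memory_use), measured over a few iterations.
    stats = [(model, *profile(model)) for model in models]
    stats.sort(key=lambda s: s[1])           # sort by iteration time

    packed, used = [], 0.0
    for model, _, memory_use in stats:
        if used + memory_use <= memory_budget:
            packed.append(model)             # pack as many models as fit
            used += memory_use
    return packed
```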
Mixed parallelism for weight sharing.
The system uses both data parallelism and model parallelism to train the network.
Evaluation
Main findings
- The separation of model space and exploration strategy makes it easy for Retiarii to try different combinations. Retiarii currently supports 27 popular Neural Architecture Search (NAS) solutions; most of them can be implemented with the three mutator classes provided by Retiarii.
- A number of micro-benchmarks show how Retiarii's cross-model optimizations greatly improve training efficiency.
- Retiarii improves the model exploration speed of three NAS solutions by up to 2.58×, compared with traditional approaches.
- Retiarii improves the scalability of weight-sharing-based NAS solutions and brings up to 8.58× speed-up using the proposed mixed parallelism, compared with data parallelism.
Micro benchmarks
Shared data loading and preprocessing
We compare Retiarii with a baseline that runs each model independently without common sub-expression elimination.
Operator batching
An adapter layer is inserted into a pre-trained MobileNet; multiple MobileNets share the same weights, and only the adapters differ.
Overall, Retiarii's operator batching improves the aggregate throughput by 3.08× when batching 192 models, compared with a baseline that can only train at most 12 models together. Retiarii can batch more models than the baseline because it keeps only one copy of the (fixed) MobileNet weights; only the memory for the adapters grows as more models are batched.
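A hedged sketch of that setup (assuming torchvision ≥ 0.13; the module layout is illustrative, not the benchmark's code): many candidates share one frozen MobileNet backbone, and each adds only a small trainable adapter, so batching more candidates adds little memory.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

# One frozen, shared backbone: its weights are never duplicated.
backbone = mobilenet_v2(weights="DEFAULT").features
for p in backbone.parameters():
    p.requires_grad = False

def make_candidate(num_classes=10):
    # Only the adapter/classifier weights differ between candidates.
    adapter = nn.Sequential(
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(1280, num_classes),   # 1280 = MobileNetV2 feature channels
    )
    return nn.Sequential(backbone, adapter)

candidates = [make_candidate() for _ in range(4)]
```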
Weight sharing
Three cases are compared:
- weights are saved and loaded through files;
- weights are saved and loaded through objects in memory;
- the system's super-graph with cross-model optimization.
Speeding up NAS
Evaluated with MnasNet, NASNet, and AmoebaNet.
- Retiarii is substantially faster than the two baselines due to its cross-model optimizations.
Scaling weight-shared training