Retiarii: A Deep Learning Exploratory-Training Framework

Posted on April 18, 2022 · 7 minute read


Current Problems

There are three ways to create a new model candidate.

[Figure: three ways to create a new model candidate]

The exploratory-training process is not well supported by current deep learning software.

  1. Search space: In current frameworks such as PyTorch/TensorFlow, all variations of a model must be coded into one jumbo model, with control flow that picks one variant during model construction. The control flow in this jumbo model makes it hard to optimize memory usage, perform operator fusion, etc.

    [Figure: a jumbo model encoding all variants, with control flow selecting one]

  2. Search Strategy: The exploration strategy is responsible for

    • deciding which models to instantiate and train, in which priority, and when to terminate, e.g., random search, grid search, heuristic-based search, Bayesian optimization, and reinforcement learning;
    • managing the execution of training, e.g., stopping training of poorly performing models, giving more resources to well-performing models, and sharing the weights of overlapping layers.

    The implementation of a search strategy often tightly couples the exploration strategy with a specific model space, which causes two problems:

    • poor reusability: a search strategy designed for one model space is hard to reuse with another model space;
    • hard to scale: it is hard to perform cross-model optimization or distributed training across multiple GPUs.

Contribution

The paper's system, Retiarii, cleanly decouples the model space from the exploration strategy and enables system-level optimizations to speed up the exploration process.

  1. A new programming interface (the Mutator abstraction) to

    • specify the DNN model space to explore;
    • specify the exploration strategy, which decides
      • the order in which models are instantiated and trained,
      • how to prioritize model training,
      • when to terminate training.

    Retiarii uses the Mutator abstraction for both specifications (search space and strategy):

    • Search space = a set of base models + a set of mutators.
    • Search strategy = which base model and which mutator to use, plus when to apply the mutators to the base model.

    Each mutator is fine-grained and captures a logical unit of modification, which makes mutators reusable and composable.

  2. A just-in-time engine that instantiates models, manages the training of the instantiated models, gathers information for the exploration strategy to consume, and executes the strategy's decisions.

  3. Cross-model optimizations that speed up the overall exploratory-training process by exploiting correlation information across models.

Evaluation results

  1. Reduces the exploration time of popular NAS algorithms by 2.57×.
  2. Improves the scalability of NAS using weight sharing, with a speed-up of 8.58×.

Mutator as the Core Abstraction

  1. Rather than encoding modifications in a complex jumbo model, the system imports an existing TensorFlow/PyTorch model as a base model and applies mutators to generate new models.

    There are three kinds of mutators in the system (a minimal sketch follows after this list and figure):

    • Input mutator: mutates the inputs of matched operators.
    • Operator mutator: replaces matched operators with other operators.
    • Insert mutator: inserts new operators or sub-graphs.

    The search space consists of all base models and the new models the mutators can generate.

  2. The system records the relations between models generated by applying different mutators to the same base model, e.g., for two instantiations of the same base model, the nodes not modified by any mutator are considered identical.

    These relations can be used to optimize multi-model training.

  3. A mutator can be applied to any subgraph of a model to create a new model instance.

    [Figure: applying mutators to a base model to create new model instances]
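To make this concrete, here is a minimal, hypothetical sketch of the mutator idea in Python. The `Node`/`Graph` data structures, class names, and the `choose` callback are my own simplifications, not Retiarii's actual API:

```python
from dataclasses import dataclass, field
import random

@dataclass
class Node:
    name: str
    op_type: str                      # e.g. "conv3x3", "maxpool"
    inputs: list = field(default_factory=list)

@dataclass
class Graph:
    nodes: dict                       # name -> Node, in topological order

class OperatorMutator:
    """Replace every matched operator with a candidate chosen by the strategy."""
    def __init__(self, match_op, candidates):
        self.match_op = match_op
        self.candidates = candidates

    def apply(self, graph, choose):
        # `choose` is supplied by the exploration strategy (random, RL, ...),
        # which is what keeps the model space decoupled from the strategy.
        new_nodes = {}
        for name, node in graph.nodes.items():
            if node.op_type == self.match_op:
                node = Node(name, choose(self.candidates), list(node.inputs))
            new_nodes[name] = node
        return Graph(new_nodes)

# The strategy decides *which* candidate to pick; the mutator decides *where*.
base = Graph({"c1": Node("c1", "conv3x3"), "p1": Node("p1", "maxpool", ["c1"])})
mutator = OperatorMutator("conv3x3", ["conv3x3", "conv5x5", "sep_conv3x3"])
new_model = mutator.apply(base, choose=random.choice)
```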

Retiarii Just-In-Time Engine

The engine instantiates models on the fly and manages their training dynamically.

Inputs: base models, mutators, and a policy describing the exploration strategy.

Execution: following the strategy, pick a base model and a mutator to generate a new model.

The strategy can be

  • a context-free strategy, e.g., random choice;
  • a history-based strategy;
  • a customized strategy.

The engine records the mutation history, so it knows which nodes are unmodified and stay identical across models. This lets the engine perform cross-model optimizations such as

  • common sub-expression elimination,
  • cross-model operator batching,
  • NAS-specific optimizations such as weight sharing.

The optimized data-flow graph is then converted into a standard model format so that the existing DL framework can perform single-model optimizations before training.

The strategy is also responsible for

  • launching training of the new model,
  • monitoring training and collecting results,
  • adjusting training resource allocation,
  • terminating training of less promising models.

A minimal sketch of this engine/strategy loop is shown after the figure below.

[Figure: the Retiarii just-in-time engine]
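Putting the pieces together, a hedged sketch of the engine's outer loop with a context-free (random) strategy might look like the following. It reuses the hypothetical `Graph`/mutator types from the earlier sketch; `train_and_evaluate` and `budget` are placeholder helpers, and the real engine additionally records mutation history to drive the cross-model optimizations below:

```python
import random

def explore(base_models, mutators, budget, train_and_evaluate):
    """Context-free (random) strategy: repeatedly pick a base model and a
    mutator, instantiate a new model, then train and score it."""
    history = []                                # (model, score) pairs
    for _ in range(budget):
        base = random.choice(base_models)
        mutator = random.choice(mutators)
        model = mutator.apply(base, choose=random.choice)
        score = train_and_evaluate(model)       # launch training, collect result
        history.append((model, score))
        # A history-based strategy would inspect `history` here to decide what
        # to instantiate next, or which running trials to terminate early.
    return max(history, key=lambda pair: pair[1])
```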

Cross-Model Optimization

Optimization Opportunities

Common sub-expression elimination

Identical operations across models are computed only once.

It can be applied to non-trainable operations such as data loading and preprocessing, since those are deterministic; trainable operations diverge during training because their weights change. A rough sketch of detecting such common subgraphs follows.
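As a rough illustration (using the same hypothetical `Graph`/`Node` types as the earlier mutator sketch), identical non-trainable prefixes of two graphs can be detected by giving each node a structural signature derived from its op type and its inputs' signatures:

```python
def signatures(graph):
    """Structural signature per node: depends on the node's op type and the
    signatures of its inputs, so two nodes match only if their whole upstream
    subgraphs match. Assumes graph.nodes is in topological order."""
    sigs = {}
    for name, node in graph.nodes.items():
        sigs[name] = hash((node.op_type, tuple(sigs[i] for i in node.inputs)))
    return sigs

def shared_nontrainable(graph_a, graph_b, trainable_ops=("conv3x3", "linear")):
    """Nodes identical in both graphs and non-trainable (e.g. data loading,
    preprocessing): when the two models train together, compute them once."""
    sa, sb = signatures(graph_a), signatures(graph_b)
    return [n for n in sa
            if n in sb and sa[n] == sb[n]
            and graph_a.nodes[n].op_type not in trainable_ops]
```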

Evaluation

[Figure: evaluation of common sub-expression elimination]

Operator Batching

Common operators with different inputs and weights can potentially be batched together and computed in a single operator kernel.

  1. Two graphs that share multiple layers with the same weights can be merged, as shown below.
  2. Two operations with different weights can also be batched with special kernels such as grouped convolution and batch_matmul, which compute on slices of an input tensor in parallel (see the sketch after the figures below).

[Figures: examples of cross-model operator batching]

Grouped convolution:

[Figure: grouped convolution]
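A small PyTorch illustration of the grouped-convolution trick (my own example, not Retiarii code): two convolutions of the same shape but with different weights are fused into a single `Conv2d` with `groups=2`, and the fused result matches running them separately:

```python
import torch
import torch.nn as nn

# Two models each contain a conv with the same shape but different weights.
conv_a = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)
conv_b = nn.Conv2d(16, 32, kernel_size=3, padding=1, bias=False)

# Batched version: one grouped conv whose groups are the two original convs.
batched = nn.Conv2d(2 * 16, 2 * 32, kernel_size=3, padding=1, bias=False, groups=2)
with torch.no_grad():
    batched.weight.copy_(torch.cat([conv_a.weight, conv_b.weight], dim=0))

x_a = torch.randn(8, 16, 28, 28)   # input for model A
x_b = torch.randn(8, 16, 28, 28)   # input for model B

# One kernel launch computes both models' convolutions on concatenated inputs.
y = batched(torch.cat([x_a, x_b], dim=1))
y_a, y_b = y[:, :32], y[:, 32:]

assert torch.allclose(y_a, conv_a(x_a), atol=1e-4)
assert torch.allclose(y_b, conv_b(x_b), atol=1e-4)
```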

Weight sharing

Instead of training a graph's weights from scratch, shared weights are inherited from other graphs and training continues from them; only the nodes that differ get their own weights.

  1. Retiarii lets developers annotate the operator weights they want to share.
  2. It then identifies the weight-sharing-enabled operators in common subgraphs.

The system introduces a new type of parallelism when constructing executable graphs.

[Figure: weight sharing across model instances]

The system builds the super-graph automatically, so checkpoints do not need to be stored on disk and reloaded (a small PyTorch sketch of weight inheritance follows).
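In PyTorch terms, inheriting shared weights in memory can be sketched by copying the overlapping part of a trained model's `state_dict` into the mutated model. This is only my illustration of the inheritance idea; Retiarii's automatically built super-graph goes further and avoids even this in-memory copy:

```python
import torch.nn as nn

parent = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
# ... train `parent` for a while ...

# The child mutates only the second conv (3x3 -> 5x5); the first conv overlaps.
child = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(),
)

# Inherit weights of the unmodified (overlapping) layers instead of retraining
# them from scratch; keys with non-matching shapes are simply skipped.
parent_state = parent.state_dict()
child_state = child.state_dict()
inherited = {k: v for k, v in parent_state.items()
             if k in child_state and v.shape == child_state[k].shape}
child.load_state_dict(inherited, strict=False)
```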

Evaluation

[Figure: evaluation of weight sharing]

Executable Graph Construction

To exploit the above optimizations, the system needs to construct executable graphs from the raw models.

The construction involves:

  • model merging,
  • device placement of operators,
  • training parallelism.

Device placement

[Figure: device placement of merged models]

For DFGs sharing the same dataset and preprocessing, the common operators can be merged by common sub-expression elimination.

The system tests each model for a few iterations and sorts the models by their iteration time. It then packs as many models as possible onto each device (a rough sketch of this packing step follows).
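A rough sketch of how such a packing step could work, as I read it: profile each model for a few iterations, sort by iteration time, and greedily fill each device up to a memory budget. The `profile` and `memory_of` helpers are hypothetical placeholders:

```python
def pack_models(models, profile, memory_of, device_memory):
    """Greedy packing sketch.

    profile(model)   -> measured per-iteration time (run a few iterations)
    memory_of(model) -> estimated memory footprint
    Both helpers are hypothetical placeholders.
    """
    # Sort by iteration time so models packed together run at similar speed
    # and do not block each other once merged into one executable graph.
    ranked = sorted(models, key=profile)
    batches, current, used = [], [], 0
    for model in ranked:
        need = memory_of(model)
        if current and used + need > device_memory:
            batches.append(current)
            current, used = [], 0
        current.append(model)
        used += need
    if current:
        batches.append(current)
    return batches   # each batch is merged and trained on one device
```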

Evaluation

[Figure: evaluation of device placement]

Mixed parallelism for weight sharing

The system uses both data parallelism and model parallelism to train the network.

[Figure: mixed parallelism for weight-shared training]

Evaluation

[Figure: evaluation of mixed parallelism]

Evaluation

Main findings

  1. The separation of model space and exploration strategy makes it easy for Retiarii to try different combinations. Retiarii currently supports 27 popular Neural Architecture Search (NAS) solutions. Most of them can be implemented by the three mutator classes provided by Retiarii.

  2. A number of micro-benchmarks show how Retiarii's cross-model optimizations greatly improve training efficiency.

  3. Retiarii improves the model exploration speed of three NAS solutions by up to 2.58×, compared with traditional approaches.

  4. Retiarii improves the scalability of weight sharing-based NAS solutions and brings up to 8.58× speed-up using the proposed mixed parallelism, compared with data parallelism.

Micro-benchmarks

Shared data loading and preprocessing

We compare Retiarii with a baseline that runs each model independently without common sub-expression elimination.

Operator batching

An adapter layer is inserted into a pre-trained MobileNet; multiple MobileNet-based models share the same backbone weights and differ only in their adapters.

[Figures: operator batching with per-model adapters on a shared MobileNet]

Overall, Retiarii's operator batching improves the aggregate throughput by 3.08× when batching 192 models, compared with the baseline, which can train at most 12 models together. Retiarii can batch more models than the baseline because it keeps only one copy of the (fixed) MobileNet weights; only the memory for the adapters grows as more models are batched (a hedged sketch of this setup follows).
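For intuition on the memory argument, here is a hedged PyTorch sketch: one frozen backbone is shared by every model, and only each model's small adapter adds trainable parameters. The `Adapter` module is a made-up stand-in for the adapter layer, with torchvision's `mobilenet_v2` as the backbone:

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2   # assumes torchvision is installed

# One backbone shared by all models; in practice its pre-trained weights would
# be loaded, here it is left untrained for brevity.
backbone = mobilenet_v2().features
for p in backbone.parameters():
    p.requires_grad = False                    # frozen, so a single copy suffices

class Adapter(nn.Module):
    """Made-up per-model head; only these parameters differ across models."""
    def __init__(self, channels=1280, classes=10):
        super().__init__()
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, classes))

    def forward(self, x):
        return self.head(backbone(x))          # shared frozen features + own head

# Batching many such models only adds one small adapter's worth of memory each.
models = [Adapter() for _ in range(8)]
trainable = sum(p.numel() for m in models for p in m.parameters())
print(f"trainable parameters across 8 models: {trainable}")
```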

Weight sharing

Three cases are compared:

  1. weights are saved and loaded through files;
  2. weights are saved and loaded through objects in memory;
  3. the system's super-graph with cross-model optimization.

Speeding up NAS

Evaluated with MnasNet, NASNet, and AmoebaNet.

[Figures: NAS exploration speed-up results]

  1. Retiarii is substantially faster than the two baselines due to its cross-model optimizations.

Scaling weight-shared training








