Neural Architecture Search A Survey
4 minute read ∼ Filed in : A paper note[JMLR-2019] Neural Architecture Search: A Survey
Problems:
- review some basic knowledge about Reinforcement learning, and Bayesian optimization.
Introduction
The paper categorizes methods for NAS according to three dimensions: Search space, Search strategy and Performance estimation strategy.
search space
Incorporating prior knowledge about architecture and tasks can reduce the search space, but introduce human bias.
Search strategy
:
- Goal1: Find well-performing architectures as quickly as possible.
- Goal2: Avoid convergence to a region of suboptimal architectures.
Performance Estimation Strategy
- Goal1: Find architectures that achieve high predictive performance on the validation dataset.
- Goal2: Reduce as much evaluation cost as possible
Search Space
Chain-structured Neural Networks
layer l0’s output is l1’s input.
search space is parameterized by
- number of layers n
- type or operations very layer execution, eg., pooling, convolution, etc
- Hyperparameters of each layer
- number of fully-connected networks.
multi-branch networks
input of layer i can be formally described as a function G() combining previous layer outputs
Block/cell-based networks
Two kinds of cells: Normal cells that preserve dimensionality of input and reduction cell reduce spatial dimension.
Final architecture is built by stacking cells in predified manner
.
Advantages:
- Search space is reduced because cell has fewer layers.
- Architecture built from cells can more easily be transferred or adapted to other dataset.
- Creating architecture by repeating blocks is a more useful design.
Block/cell-based New design
How to choose the macro-architecture: how many cells shall be used and how should they be connected to build the actual model? Hard-coded macro architecture:
Each cell receives the outputs of the two preceding cells as input.
- Manually designed architectures, eg., DenseNet
In general the cell based searching includes 3 steps:
- Define a set of primitive operations
- connect primitive operations and form the cell
- Hard-coded macro-architecture.
Search Strategy
Search strategies: random search, Bayesian optimization, evolutionary methods, reinforcement learning, and gradient-based methods.
Bayesian Optimization
Some papers derive kernel functions for architecture search spaces in order to use classic GP-based BO methods
Optimize both neural architectures and their hyperparameters jointly.
##
reinforcement learning method
Agent’s action: Generation of a neural architecture with action space: identical to the search space.
Agent’s reward: Estimate of the performance of the trained architecture on unseen data.
Different RL approaches differ in how they represent the agent’s policy and how they optimize it.
- Recurrent neural network (RNN) policy to sequentially sample a string that in turn encodes the neural architecture.
- Q-learning to train a policy which sequentially chooses a layer’s type and corresponding hyperparameters.
Sequential decision processes
state is the current (partially trained) architecture
reward is an estimate of the architecture’s performance
action corresponds to an application of function-preserving mutations followed by a training phase of the network.
Evolutionary method
Using gradient-based methods for optimizing weights and solely use evolutionary algorithms
for optimizing the neural architecture itself.
in every evolution step, at least one model from the population is sampled
and serves as a parent
to generate offsprings
by applying mutations to it. In the context of NAS, mutations are local operations, such as adding or removing a layer, altering the hyperparameters of a layer, adding skip connections, as well as altering training hyperparameters. After training the offsprings, their fitness (e.g., performance on a validation set) is evaluated and they are added to the population
.
Challenge
- how to sample parents
- update populations
- generate offsprings
Compare
RL and evolution perform equally well in terms of final test accuracy, with evolution having better anytime performance and finding smaller models.
Random search test error = 3.9% on CIFAR-10 and a top-1 validation error of 21.0% on ImageNet.
Evolution-based method: 3.75% and 20.3% respectively.
Performance Estimation Strategy
To guide the search process, these search strategies need to estimate the performance of a given architecture A they consider.
Lower fidelity estimates
Learning Curve extrapolation
Consider architectural hyperparameters for predicting which partial learning curves are most promising.
Challenge
The main challenge for predicting the performances of neural architectures is good predictions in a relatively large search space need to be made based on relatively few evaluations.
One-Shot Models
Treats all architectures as different sub-graphs of a supergraph (the one-shot model) and shares weights between architectures that have edges of this supergraph in common.
Only the weights of a single one-shot model need to be trained (in one of various ways), and architectures (which are just subgraphs of the one-shot model) can then be evaluated without any separate training by inheriting trained weights from the one-shot model.
Challenge
How the one-shot model is trained.
Future Directions