Neural Architecture Search: A Survey

Posted on January 23, 2022 4 minute read ∼ Filed in : A paper note

[JMLR-2019] Neural Architecture Search: A Survey

Problems:

Review the basics of reinforcement learning and Bayesian optimization.

Introduction

The paper categorizes methods for NAS according to three dimensions: Search space, Search strategy and Performance estimation strategy.

Search space:

Incorporating prior knowledge about architectures and tasks can reduce the size of the search space, but it also introduces human bias.

Search strategy:

Goal1: Find well-performing architectures as quickly as possible.
Goal2: Avoid convergence to a region of suboptimal architectures.

Performance estimation strategy:

Goal1: Find architectures that achieve high predictive performance on the validation dataset.
Goal2: Reduce the evaluation cost as much as possible.

Search Space

Chain-structured Neural Networks

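Layer i receives its input from layer i-1, and its output is the input of layer i+1. For example, the output of layer l0 is the input of layer l1.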

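The search space is parameterized by:

The (maximum) number of layers n.
The type of operation each layer executes, e.g., pooling, convolution, etc.
The hyperparameters of each layer's operation, e.g., the number of filters and kernel size for a convolutional layer, or the number of units for a fully-connected layer.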
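
As a minimal sketch of this parameterization, the snippet below samples one chain-structured architecture at random. The operation names, the hyperparameter ranges and the `sample_chain` helper are illustrative assumptions, not taken from the paper.

```python
import random

# Hypothetical search space: operation types and their hyperparameter choices.
SEARCH_SPACE = {
    "conv": {"filters": [16, 32, 64], "kernel_size": [1, 3, 5]},
    "pool": {"kernel_size": [2, 3]},
    "fc": {"units": [64, 128, 256]},
}

def sample_chain(max_layers, rng):
    """Sample a chain-structured architecture as a list of layer specs."""
    n = rng.randint(1, max_layers)  # number of layers n
    layers = []
    for _ in range(n):
        op = rng.choice(sorted(SEARCH_SPACE))  # operation type of this layer
        # Hyperparameters that belong to the chosen operation.
        hparams = {name: rng.choice(values)
                   for name, values in SEARCH_SPACE[op].items()}
        layers.append({"op": op, **hparams})
    return layers

arch = sample_chain(max_layers=5, rng=random.Random(0))
for i, layer in enumerate(arch):
    print(i, layer)
```

Repeating the call with different seeds gives a plain random-search baseline over this space.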

Multi-branch networks

The input of layer i can be formally described as a function g_i() that combines the outputs of previous layers.
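
For reference, the survey lists three special cases of this combining function: chain-structured networks, residual networks (previous outputs are summed) and DenseNets (previous outputs are concatenated). Written out, with g_i as the combining function of layer i and L_i^{out} as the output of layer i:

```latex
\begin{aligned}
\text{chain:}    \quad & g_i(L_{i-1}^{out}, \dots, L_0^{out}) = L_{i-1}^{out} \\
\text{residual:} \quad & g_i(L_{i-1}^{out}, \dots, L_0^{out}) = L_{i-1}^{out} + L_j^{out}, \quad j < i - 1 \\
\text{DenseNet:} \quad & g_i(L_{i-1}^{out}, \dots, L_0^{out}) = \mathrm{concat}(L_{i-1}^{out}, \dots, L_0^{out})
\end{aligned}
```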

Block/cell-based networks

Two kinds of cells: normal cells, which preserve the dimensionality of the input, and reduction cells, which reduce the spatial dimension.

The final architecture is built by stacking cells in a predefined manner.

Advantages:

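The search space is drastically reduced because a cell has far fewer layers than a whole architecture.
Architectures built from cells can more easily be transferred or adapted to other datasets by varying the number of cells and filters.
Creating architectures by repeating building blocks has proven to be a useful design principle in general, e.g., repeating an LSTM block in RNNs or stacking residual blocks.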

Block/cell-based New design

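A new design choice is how to pick the macro-architecture: how many cells should be used, and how should they be connected to build the actual model? Two hard-coded macro-architectures are common:

Build a sequential model in which each cell receives the outputs of the two preceding cells as input.
Reuse the high-level structure of a manually designed architecture, e.g., DenseNet, and place the cells inside it.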

In general, cell-based search involves 3 steps:

Define a set of primitive operations.
Connect primitive operations to form a cell.
Stack the cells according to a hard-coded macro-architecture.
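
The three steps can be sketched as follows. This is only an illustration under assumptions: the primitive names, the `make_cell` and `build_macro` helpers, and the "N normal cells followed by one reduction cell" layout are hypothetical choices, not the paper's specification.

```python
# Step 1: a set of primitive operations (names are illustrative assumptions).
PRIMITIVES = ["conv3x3", "conv5x5", "maxpool3x3", "identity"]

def make_cell(kind, choices):
    """Step 2: connect primitive operations into a cell.

    `choices` is a list of (input_index, operation) pairs; input indices
    0 and 1 refer to the outputs of the two preceding cells.
    """
    for _, op in choices:
        if op not in PRIMITIVES:
            raise ValueError(f"unknown primitive: {op}")
    return {"kind": kind, "nodes": list(choices)}

def build_macro(normal, reduction, cells_per_stage, stages):
    """Step 3: hard-coded macro-architecture that stacks the cells."""
    network = []
    for stage in range(stages):
        network.extend([normal] * cells_per_stage)
        if stage < stages - 1:  # no reduction cell after the last stage
            network.append(reduction)
    return network

normal = make_cell("normal", [(0, "conv3x3"), (1, "identity")])
reduction = make_cell("reduction", [(0, "maxpool3x3"), (1, "conv5x5")])
network = build_macro(normal, reduction, cells_per_stage=2, stages=3)
print([cell["kind"] for cell in network])
```

Only the two cells are searched; changing `cells_per_stage` or `stages` adapts the same cells to a different dataset size.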

Search Strategy

Search strategies: random search, Bayesian optimization, evolutionary methods, reinforcement learning, and gradient-based methods.

Bayesian Optimization

Some papers derive kernel functions for architecture search spaces in order to use classic Gaussian process (GP) based Bayesian optimization (BO) methods.

Optimize both neural architectures and their hyperparameters jointly.

Reinforcement learning method

Agent’s action: Generation of a neural architecture. The action space is identical to the search space.

Agent’s reward: Estimate of the performance of the trained architecture on unseen data.

Different RL approaches differ in how they represent the agent’s policy and how they optimize it.

One approach uses a recurrent neural network (RNN) policy to sequentially sample a string that in turn encodes the neural architecture.
Another uses Q-learning to train a policy that sequentially chooses a layer’s type and corresponding hyperparameters.
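
To make the action/reward loop concrete, here is a minimal REINFORCE-style sketch. It deliberately swaps the RNN policy for one independent softmax distribution per layer, and it replaces real training with a toy reward; `OPS`, `toy_reward` and `train_policy` are hypothetical names introduced only for this example.

```python
import math
import random

OPS = ["conv3x3", "conv5x5", "maxpool"]  # hypothetical choices per layer
NUM_LAYERS = 3

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_reward(arch):
    """Stand-in for validation accuracy: fraction of conv3x3 layers."""
    return sum(op == "conv3x3" for op in arch) / len(arch)

def train_policy(steps, lr, rng):
    # Policy parameters: one row of logits per layer position.
    logits = [[0.0] * len(OPS) for _ in range(NUM_LAYERS)]
    baseline = 0.0
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Action: sample a full architecture, one operation per layer.
        actions = [rng.choices(range(len(OPS)), weights=p)[0] for p in probs]
        reward = toy_reward([OPS[a] for a in actions])
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
        # REINFORCE: d log pi(a) / d logit_k = onehot(a)_k - probs_k.
        for layer, a in enumerate(actions):
            for k in range(len(OPS)):
                grad = (1.0 if k == a else 0.0) - probs[layer][k]
                logits[layer][k] += lr * advantage * grad
    return logits

logits = train_policy(steps=2000, lr=0.5, rng=random.Random(0))
best = [OPS[max(range(len(OPS)), key=row.__getitem__)] for row in logits]
print(best)
```

Under this toy reward the policy should gradually shift probability toward `conv3x3`; in a real setting each reward evaluation means training and validating a network, which is why the performance estimation strategy matters so much.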

Sequential decision processes

State: the current (partially trained) architecture.

Reward: an estimate of the architecture’s performance.

Action: an application of function-preserving mutations, followed by a training phase of the network.

Evolutionary method

Gradient-based methods optimize the weights, and evolutionary algorithms are used solely to optimize the neural architecture itself.

In every evolution step, at least one model from the population is sampled and serves as a parent to generate offspring by applying mutations to it. In the context of NAS, mutations are local operations, such as adding or removing a layer, altering the hyperparameters of a layer, adding skip connections, or altering training hyperparameters. After training the offspring, their fitness (e.g., performance on a validation set) is evaluated and they are added to the population.

Challenge

How to sample parents.
How to update the population.
How to generate offspring.
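
A minimal sketch of one possible answer to each of these three questions is shown below. Tournament selection, "drop the worst individual" and the toy `fitness` function are assumptions made for illustration; real methods differ exactly in these choices, and real fitness requires training each offspring.

```python
import random

OPS = ["conv3x3", "conv5x5", "maxpool"]  # hypothetical layer types

def fitness(arch):
    """Toy stand-in for validation performance after training."""
    return sum(op == "conv3x3" for op in arch) - 0.1 * len(arch)

def mutate(arch, rng):
    """Local mutations: add a layer, remove a layer, or alter one layer."""
    child = list(arch)
    kind = rng.choice(["add", "remove", "alter"])
    if kind == "add":
        child.insert(rng.randint(0, len(child)), rng.choice(OPS))
    elif kind == "remove" and len(child) > 1:
        child.pop(rng.randrange(len(child)))
    else:
        child[rng.randrange(len(child))] = rng.choice(OPS)
    return child

def evolve(steps, population_size, sample_size, rng):
    population = [[rng.choice(OPS)] for _ in range(population_size)]
    for _ in range(steps):
        # Sample parents: tournament selection over a random subset.
        parent = max(rng.sample(population, sample_size), key=fitness)
        # Generate offspring: mutate a copy of the parent.
        child = mutate(parent, rng)
        # Update population: add the child, then drop the worst individual.
        population.append(child)
        population.remove(min(population, key=fitness))
    return max(population, key=fitness)

best = evolve(steps=200, population_size=10, sample_size=3, rng=random.Random(0))
print(best, fitness(best))
```

Because the worst individual is removed at every step, the best fitness in the population never decreases.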

Compare

RL and evolution perform equally well in terms of final test accuracy, with evolution having better anytime performance and finding smaller models.

Random search: test error of 3.9% on CIFAR-10 and top-1 validation error of 21.0% on ImageNet.

Evolution-based method: 3.75% and 20.3%, respectively.

Performance Estimation Strategy

To guide the search process, these search strategies need to estimate the performance of a given architecture A they consider.

Lower fidelity estimates

Learning Curve extrapolation

Consider architectural hyperparameters for predicting which partial learning curves are most promising.
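
As a rough illustration of extrapolating a partial learning curve, the sketch below fits a power law `error = a * epoch^(-b)` by least squares in log-log space and predicts the error at a later epoch. The power-law form, the threshold and the helper names are assumptions for this example; it does not use architectural hyperparameters as the methods above do.

```python
import math

def fit_power_law(errors):
    """Least-squares fit of log(error) = log(a) - b * log(epoch)."""
    xs = [math.log(t) for t in range(1, len(errors) + 1)]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

def predict_error(errors, final_epoch):
    """Extrapolate the fitted curve to `final_epoch`."""
    a, b = fit_power_law(errors)
    return a * final_epoch ** (-b)

# Validation errors observed during the first 5 epochs (synthetic data).
partial_curve = [0.5 * t ** -0.3 for t in range(1, 6)]
predicted = predict_error(partial_curve, final_epoch=100)
keep_training = predicted < 0.15  # hypothetical termination threshold
print(round(predicted, 4), keep_training)
```

Runs whose predicted final error is above the threshold can be stopped early, which saves most of their training cost.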

Challenge

The main challenge in predicting the performance of neural architectures is that good predictions over a relatively large search space must be made from relatively few evaluations.

One-Shot Models

One-shot architecture search treats all architectures as different subgraphs of a supergraph (the one-shot model) and shares weights between architectures that have edges of this supergraph in common.

Only the weights of a single one-shot model need to be trained (in one of various ways), and architectures (which are just subgraphs of the one-shot model) can then be evaluated without any separate training by inheriting trained weights from the one-shot model.
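
The weight-sharing idea can be sketched with a toy supergraph in which every candidate edge owns one shared weight. A single float stands in for a real weight tensor, and all names (`shared_weights`, `sample_subgraph`, `inherit_weights`) are hypothetical; training the one-shot model itself is not shown.

```python
import random

# Hypothetical supergraph: each candidate edge (src, dst, op) owns one weight.
NODES = [0, 1, 2]
OPS = ["conv3x3", "maxpool"]
rng = random.Random(0)
shared_weights = {
    (src, dst, op): rng.uniform(-1.0, 1.0)
    for src in NODES for dst in NODES if src < dst
    for op in OPS
}

def sample_subgraph(rng):
    """Pick one operation per node pair; the result is one architecture."""
    return [(src, dst, rng.choice(OPS))
            for src in NODES for dst in NODES if src < dst]

def inherit_weights(arch):
    """Look up the architecture's weights in the one-shot model, no training."""
    return {edge: shared_weights[edge] for edge in arch}

arch_a = sample_subgraph(rng)
arch_b = sample_subgraph(rng)
weights_a = inherit_weights(arch_a)
weights_b = inherit_weights(arch_b)

# Edges that both architectures use point at the same shared weight.
common = set(arch_a) & set(arch_b)
print(len(shared_weights), len(arch_a), len(common))
```

Evaluating a candidate then costs only a forward pass with inherited weights, instead of a full training run per architecture.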

Challenge

How to train the one-shot model.
