NASI: Label- and Data-agnostic Neural Architecture Search at Initialization

Posted on December 20, 2021 · 5 minute read

1. Abstract & Introduction

Problems

Neural Architecture Search (NAS) efficiency is limited by the need to train numerous candidate architectures during the search process.

One-shot NAS algorithms share model parameters among candidate architectures, thereby reducing the cost of model training substantially, but still require the training of the one-shot architecture.

Existing training-free algorithms are also insufficient:

  1. Park et al. (2020) approximate the converged performance of candidate architectures by the performance of their corresponding NNGPs (Neural Network Gaussian Processes), but this approximation is costly to evaluate.

  2. Abdelfattah et al. (2021) have investigated several training-free proxies to rank candidate architectures in the search space.

Idea

Can we realize NAS at initialization, such that model training can be completely avoided during the search process?

Solutions

This paper presents a novel NAS algorithm called NAS at Initialization (NASI) that can completely avoid model training to boost search efficiency.

It exploits the capability of the Neural Tangent Kernel (NTK) to characterize the converged performance of candidate architectures at initialization, hence avoiding model training and boosting search efficiency.

Compared with other training-free algorithms, NASI

  1. provides theoretically grounded performance estimation via the NTK (compared with Mellor et al., 2020; Abdelfattah et al., 2021; Chen et al., 2021),
  2. guarantees the transferability of its selected architectures through its provably label- and data-agnostic search under mild conditions (compared with Mellor et al., 2020; Park et al., 2020; Abdelfattah et al., 2021; Chen et al., 2021), and
  3. achieves SOTA performance in a large search space over various benchmark datasets (compared with Mellor et al., 2020; Park et al., 2020; Abdelfattah et al., 2021).

Result

Compared with other NAS algorithms, NASI incurs the smallest search cost while preserving the competitive performance of its selected architectures.

2. Related Works and Background

The NTK stays asymptotically constant during the course of training as the width of the DNN goes to infinity.

[image omitted: illustration that the NTK stays nearly constant during training]

[image omitted: illustration that the weights barely change for larger hidden widths]

The weights don’t change much at all for larger hidden widths.

Taylor expand the network function with respect to the weights around its initialization.
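Written out in standard NTK notation (general background from the NTK literature, not a transcription of the paper's equations), this linearization and the kernel it induces are:

$$
f(x;\theta_t) \approx f(x;\theta_0) + \nabla_\theta f(x;\theta_0)\,(\theta_t - \theta_0),
\qquad
\Theta(x, x') := \nabla_\theta f(x;\theta_0)\,\nabla_\theta f(x';\theta_0)^{\top},
$$

where ∇_θ f(x; θ_0) is the Jacobian of the network outputs with respect to the parameters at initialization. In the infinite-width limit, Θ stays asymptotically constant throughout training, which is exactly what allows the converged performance to be characterized at initialization.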

3. Neural Architecture Search at Initialization

Reformulating NAS via NTK

To completely avoid this training cost, NASI exploits the capability of the NTK to characterize the converged performance of DNNs at initialization.

For an L-layer DNN, we denote the output dimension of its hidden layers and the last layer as n_1 = ··· = n_{L-1} = k and n_L = n, respectively.

[equation omitted]

According to Proposition 1:

[Proposition 1 omitted: it upper-bounds the training loss L_t via the NTK at initialization]

Hence, NAS can be realized at initialization. Specifically, given a fixed and sufficiently large training budget, to select the best-performing architecture, we can simply minimize the upper bound of L_t in (6) over all candidate architectures in the search space for t → ∞:

[equation omitted: the NAS problem reformulated at initialization via the NTK]

However, this constant NTK is computationally costly to evaluate, whereas the trace norm of the NTK at initialization can be efficiently approximated, so the paper mainly works with the latter.

[equation omitted: the reformulated NAS objective based on the trace norm of the NTK at initialization]
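Because the NTK Gram matrix Θ(X, X) is positive semi-definite, its trace norm equals its trace, which decomposes over the m training samples (my own rendering in standard NTK notation, not the paper's exact equation):

$$
\lVert \Theta(X, X) \rVert_{\mathrm{tr}} = \mathrm{tr}\,\Theta(X, X) = \sum_{i=1}^{m} \lVert \nabla_\theta f(x_i;\theta_0) \rVert_F^{2},
$$

so it can be computed from parameter gradients at initialization alone, without any training.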

Approximating the Trace Norm of NTK

[equation omitted: Eq. (9), the trace norm expressed via the Frobenius norm of the Jacobian]

  1. The Frobenius norm of the Jacobian matrix in (9) is costly to evaluate, so this term is approximated by its lower bound.

  2. The per-sample terms in (9) are computed in parallel over mini-batches.

[equation omitted: Eq. (11), the approximation computed over the m/b mini-batches]

  3. The summation over the m/b mini-batches in (11) is further approximated by a single uniformly sampled mini-batch X_j (a code sketch of this estimator follows (12) below).

[equation omitted: Eq. (12), the single-mini-batch estimator of the trace norm]
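A minimal PyTorch sketch of this kind of single-mini-batch estimator (my own illustration: the per-sample gradient of the summed output is used as a cheap surrogate for the Jacobian's Frobenius norm, and the result is rescaled by m/b; the paper's exact estimator in (12) may differ in its scaling and lower-bounding):

```python
import torch
import torch.nn as nn

def ntk_trace_estimate(model: nn.Module, x_batch: torch.Tensor, m: int,
                       create_graph: bool = False) -> torch.Tensor:
    """Single-mini-batch estimate of the NTK trace at initialization (illustrative).

    For each sample, the gradient of the summed network output w.r.t. the
    parameters serves as a cheap surrogate for that sample's Jacobian; its
    squared norm is accumulated and rescaled by m / b for the full dataset.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    b = x_batch.shape[0]
    trace = 0.0
    for i in range(b):
        out = model(x_batch[i:i + 1]).sum()  # scalar surrogate of the i-th output
        grads = torch.autograd.grad(out, params, create_graph=create_graph)
        trace = trace + sum((g ** 2).sum() for g in grads)
    return trace * (m / b)
```

Since this only needs one forward/backward pass per sample of a single mini-batch at initialization, the cost per candidate architecture is tiny compared with training.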

Optimization and Search Algorithm

With the approximation in (12), our reformulated NAS problem (8) can be transformed into:

[equation omitted: Eq. (13), the transformed NAS objective with penalty coefficient μ and constraint ν]

The optimization of (13) over the discrete search space is intractable, so a few relaxations are applied to simplify it.

Next, instead of optimizing (13) directly, we introduce a distribution p_α(A) (parameterized by α) over the candidate architectures in the search space.

[equation omitted: Eq. (14), the objective as an expectation over architectures sampled from p_α(A)]

Then, we apply Gumbel-Softmax to relax the optimization of (14) to be continuous and differentiable using the reparameterization trick.

Specifically, for a given α, to sample an architecture A, we simply have to sample g from p(g) = Gumbel(0, 1) and then determine A using α and g.

[equation omitted: Eq. (15), the Gumbel-Softmax relaxation of (14)]
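A small sketch of this Gumbel-Softmax reparameterization (illustrative; treating α as a [num_edges, num_ops] matrix of operation logits is my assumption about the encoding, not the paper's exact parameterization):

```python
import torch
import torch.nn.functional as F

def sample_architecture(alpha: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Sample a relaxed architecture from p_alpha(A) via Gumbel-Softmax.

    alpha: [num_edges, num_ops] logits of the architecture distribution.
    Returns per-edge soft one-hot operation weights, differentiable w.r.t. alpha.
    """
    u = torch.rand_like(alpha).clamp_(1e-9, 1.0 - 1e-9)
    g = -torch.log(-torch.log(u))                # Gumbel(0, 1) noise
    return F.softmax((alpha + g) / tau, dim=-1)  # relaxed one-hot per edge

# Example with DARTS-like sizes (14 edges, 8 candidate operations; assumed here)
alpha = torch.zeros(14, 8, requires_grad=True)
op_weights = sample_architecture(alpha)
```

Because the Gumbel noise g is sampled independently of α, the sampled (relaxed) architecture remains differentiable with respect to α, which is the point of the reparameterization trick.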

We approximate (15) based on its first-order Taylor expansion at initialization so that it can be optimized efficiently by a gradient-based algorithm.

[equations omitted: the first-order Taylor approximation and the resulting quantity ∆∗ in (17)]

Unfortunately, the expectation in (17) makes the evaluation of ∆∗ intractable. Monte Carlo sampling is thus applied to estimate ∆∗ efficiently.

[equations omitted: Eq. (18), the Monte Carlo estimate of ∆∗]

This simple and efficient solution in (18) already allows us to select architectures with competitive performance.

[image omitted: the resulting NASI search procedure]
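Putting the pieces together, here is a rough sketch of the resulting gradient-based search loop (my reconstruction of the high-level procedure only: `sample_architecture` and `ntk_trace_estimate` are the illustrative helpers above, `build_model` is a hypothetical constructor that turns the relaxed operation weights into a differentiable supernet forward pass, and the hinge-style penalty with μ and ν is one plausible reading of (13), not necessarily its exact form):

```python
import torch

def nasi_search(alpha, build_model, data_iter, m,
                T=100, mc_samples=1, lr=0.1, mu=1.0, nu=1000.0):
    """Illustrative NASI-style search loop (not the authors' code).

    Each step: draw one mini-batch of unlabeled inputs, sample architectures
    from p_alpha via Gumbel-Softmax, evaluate the NTK-trace proxy, and update
    alpha with a Monte Carlo estimate of the penalized objective's gradient.
    """
    opt = torch.optim.SGD([alpha], lr=lr)
    for _ in range(T):
        x_batch = next(data_iter)      # one uniformly sampled mini-batch of inputs
        opt.zero_grad()
        loss = 0.0
        for _ in range(mc_samples):
            weights = sample_architecture(alpha)           # relaxed architecture
            model = build_model(weights)                   # hypothetical supernet slice
            trace = ntk_trace_estimate(model, x_batch, m,
                                       create_graph=True)  # keep graph for d(trace)/d(alpha)
            # favor a large NTK trace while penalizing violations of trace <= nu
            loss = loss - trace + mu * torch.clamp(trace - nu, min=0.0)
        (loss / mc_samples).backward()
        opt.step()
    return alpha.argmax(dim=-1)        # most likely operation per edge
```

Note that no labels and no weight training appear anywhere in this loop; only parameter gradients at initialization are used.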

4. Label- and Data-agnostic Search of NASI

Conclusion: NASI is guaranteed to be label- and data-agnostic under mild conditions, which implies that the architectures it finally selects transfer across different datasets.

Our reformulated NAS problem (8) explicitly reveals that it can be optimized without needing the labels of a dataset, because (12) can be derived using random labels.
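To make this concrete with the earlier sketches: the trace proxy never touches labels, and since the search is also data-agnostic under the paper's mild conditions, even random inputs can be used; a toy usage example (model and shapes are my own assumptions):

```python
import torch
import torch.nn as nn

# No labels appear anywhere in the estimator, and under the mild conditions
# discussed in the paper even random inputs suffice (toy model for illustration).
toy_model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
x_random = torch.randn(64, 3, 32, 32)                      # random CIFAR-10-shaped inputs
score = ntk_trace_estimate(toy_model, x_random, m=50_000)  # estimator sketched earlier
```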

5. Experiments

Search Efficiency and Effectiveness

Three search spaces of NAS-Bench-1Shot1 on CIFAR-10

Search in NAS-Bench-1Shot1.

A lower penalty coefficient μ and a larger constraint ν (i.e., μ=1 and ν=1000) are adopted to encourage the selection of high-complexity architectures in the optimization of (13).

[figure/table omitted: search results on the three search spaces of NAS-Bench-1Shot1]

Search in the DARTS Search Space

We then compare NASI with other NAS algorithms in a more complex search space than NAS-Bench-1Shot1, i.e., the DARTS (Liu et al., 2019) search space.

NASI adopts a higher penalty coefficient μ and a smaller constraint ν (i.e., μ=2 and ν=500).

NASI selects the architecture with a search budget of T = 100 and a batch size of b = 64 in this DARTS search space.

[table omitted: comparison with other NAS algorithms in the DARTS search space]

These results show that NASI is also efficient and effective in large search spaces.

Evaluation of Selected Architectures

CIFAR-100 and ImageNet

We adopt the same search settings as in Sec. 5.1 on the DARTS search space.

Evaluation on CIFAR-10

Compared with popular training-based NAS algorithms, NASI achieves a substantial improvement in search efficiency and maintains a competitive generalization performance.

[table omitted: evaluation of the selected architectures on CIFAR-10]

Evaluation on ImageNet.

The results on ImageNet further confirm the transferability of the architectures selected by NASI to larger-scale datasets.

[table omitted: evaluation of the selected architectures on ImageNet]




