NASI: Label- and Data-agnostic Neural Architecture Search at Initialization
5 minute read ∼ Filed in: A paper note

1. Abstract & Introduction
Problems
Neural Architecture Search (NAS) efficiency is limited by the need to train numerous candidate architectures during the search process.
One-shot NAS algorithms share model parameters among candidate architectures, thereby reducing the cost of model training substantially, but still require the training of the one-shot architecture.
Existing training-free NAS algorithms are also not sufficient:
- Park et al. (2020) approximate the converged performance of candidate architectures by the performance of their corresponding Neural Network Gaussian Process (NNGP), but this estimation is costly.
- Abdelfattah et al. (2021) investigate several training-free proxies to rank candidate architectures in the search space.
Idea
Can we realize NAS at initialization such that model training can be completely avoided during the search process?
Solutions
This paper presents a novel NAS algorithm called NAS at Initialization (NASI) that can completely avoid model training to boost search efficiency.
It exploits the capability of the Neural Tangent Kernel (NTK) to characterize the converged performance of candidate architectures at initialization, thereby avoiding model training and boosting search efficiency.
Compared with other training-free algorithms, NASI
- provides theoretically grounded performance estimation via the NTK (compared with Mellor et al., 2020; Abdelfattah et al., 2021; Chen et al., 2021),
- guarantees the transferability of its selected architectures through its provably label- and data-agnostic search under mild conditions (compared with Mellor et al., 2020; Park et al., 2020; Abdelfattah et al., 2021; Chen et al., 2021),
- achieves SOTA performance in a large search space over various benchmark datasets (compared with Mellor et al., 2020; Park et al., 2020; Abdelfattah et al., 2021).
Result
Compared with other NAS algorithms, NASI incurs the smallest search cost while preserving the competitive performance of its selected architectures.
2. Related Works and Background
The NTK stays asymptotically constant during training as the width of the DNN goes to infinity.
For large hidden widths, the weights barely change during training.
Hence, the network function can be Taylor-expanded with respect to the weights around its initialization.
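As a quick reminder (standard NTK background, in my own notation rather than the paper's equation numbering), the linearization and the induced kernel look as follows:

```latex
% First-order Taylor expansion of the network around its initialization \theta_0
f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0)

% Empirical NTK at initialization, which stays (asymptotically) constant during training
\Theta_0(x, x') = \nabla_\theta f(x; \theta_0)^\top \, \nabla_\theta f(x'; \theta_0)
```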
3. Neural Architecture Search at Initialization
Reformulating NAS via NTK
To completely avoid this training cost, NASI exploits the capability of the NTK to characterize the converged performance of DNNs at initialization.
For an L-layer DNN, we denote the output dimension of its hidden layers and the last layer as n1 = ··· = nL−1 = k and nL = n, respectively.
According to Proposition 1, NAS can be realized at initialization: given a fixed and sufficiently large training budget, the best-performing architecture can be selected by simply minimizing the upper bound of Lt in (6) over all candidate architectures in the search space as t → ∞.
However, this constant NTK is computationally costly to evaluate exactly. Since the trace norm of the NTK at initialization can be approximated efficiently, the paper mainly uses it as the performance proxy.
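Because the NTK is positive semi-definite, its trace norm coincides with its trace, which decomposes into per-sample Jacobian norms; presumably this is what (9) expresses, given the Jacobian Frobenius norm mentioned below (notation mine, with m training samples and the gradient taken as the Jacobian of all n outputs w.r.t. the parameters):

```latex
\| \Theta_0(X, X) \|_{\mathrm{tr}}
  = \mathrm{tr}\big( \Theta_0(X, X) \big)
  = \sum_{i=1}^{m} \big\| \nabla_\theta f(x_i; \theta_0) \big\|_F^2
```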
Approximating the Trace Norm of NTK
- The Frobenius norm of the Jacobian matrix in (9) is costly to evaluate, so this term is approximated by its lower bound.
- The per-sample terms in (9) are computed in parallel over mini-batches.
- The summation over the m/b mini-batches in (11) is further approximated by one single uniformly randomly sampled mini-batch Xj.
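A minimal PyTorch sketch of the quantity being approximated on a single sampled mini-batch; `model`, `x`, and the exact per-output gradient loop are my illustration (the paper itself avoids this cost by lower-bounding the Jacobian Frobenius norm and using only one mini-batch):

```python
import torch

def ntk_trace_minibatch(model, x):
    """Sketch: tr(NTK) on one mini-batch, computed exactly as the sum of squared
    per-sample Jacobian (Frobenius) norms at initialization, i.e.
    sum_i sum_k ||grad_theta f_k(x_i; theta_0)||^2."""
    params = [p for p in model.parameters() if p.requires_grad]
    trace = 0.0
    for i in range(x.shape[0]):                        # samples in the mini-batch
        out = model(x[i:i + 1]).flatten()              # network outputs f(x_i), shape (n,)
        for k in range(out.shape[0]):                  # output dimensions
            grads = torch.autograd.grad(out[k], params,
                                        retain_graph=True, allow_unused=True)
            trace += sum(g.pow(2).sum().item() for g in grads if g is not None)
    return trace
```

Looping over every output dimension is exactly the expensive part, which is why the paper approximates this term with a lower bound instead of computing it exactly.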
Optimization and Search Algorithm
With the approximation in (12), the reformulated NAS problem (8) can be transformed into (13).
The optimization of (13) over the discrete search space is intractable, so several optimization tricks are applied to simplify it.
Next, instead of optimizing (13) directly, we introduce a distribution pα(A) (parameterized by α) over the candidate architectures in this search space.
Then, we apply Gumbel-Softmax to relax the optimization of (14) to be continuous and differentiable using the reparameterization trick.
Specifically, for a given α, to sample an architecture A, we simply have to sample g from p(g) = Gumbel(0, 1) and then determine A using α and g.
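A minimal sketch of this sampling step using PyTorch's built-in `gumbel_softmax`; the per-edge logits `alpha` and the `hard=True` discretization are my illustration and not necessarily the paper's exact parameterization:

```python
import torch
import torch.nn.functional as F

# alpha: learnable logits over candidate operations for each edge of the cell,
# e.g. 14 edges x 8 candidate operations in the DARTS search space.
alpha = torch.zeros(14, 8, requires_grad=True)

# Sample a (relaxed) architecture: Gumbel(0, 1) noise g is added to the logits,
# and a softmax with temperature tau yields near-one-hot operation weights.
op_weights = F.gumbel_softmax(alpha, tau=1.0, hard=True, dim=-1)

# hard=True returns one-hot choices in the forward pass while letting gradients
# flow back to alpha through the soft sample (straight-through estimator).
chosen_ops = op_weights.argmax(dim=-1)   # the discrete architecture A, for inspection
```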
We approximate (15) based on its first-order Taylor expansion at initialization so that it can be optimized efficiently through a gradient-based algorithm.
Unfortunately, the expectation in (17) makes the evaluation of ∆∗ intractable. Monte Carlo sampling is thus applied to estimate ∆∗ efficiently.
This simple and efficient solution in (18) already allows us to select architectures with competitive performance.
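A rough sketch of how the Monte Carlo estimate and the gradient-based update of α could look, combining the previous snippets; `proxy_score` (the NTK-trace-based objective evaluated on a sampled architecture, returning a differentiable scalar) and `num_samples` are placeholders of mine rather than the paper's notation:

```python
import torch
import torch.nn.functional as F

def mc_search_step(alpha, proxy_score, optimizer, num_samples=1, tau=1.0):
    """One gradient step on the architecture parameters alpha: estimate the
    expected proxy objective over architectures sampled from p_alpha by Monte
    Carlo, then backpropagate through the Gumbel-Softmax relaxation."""
    optimizer.zero_grad()
    estimate = 0.0
    for _ in range(num_samples):
        op_weights = F.gumbel_softmax(alpha, tau=tau, hard=True, dim=-1)
        estimate = estimate + proxy_score(op_weights) / num_samples
    (-estimate).backward()   # assuming the proxy is a score to be maximized
    optimizer.step()
    return float(estimate.detach())

# Hypothetical usage, with alpha as defined in the previous snippet:
# optimizer = torch.optim.Adam([alpha], lr=3e-4)
# mc_search_step(alpha, proxy_score=my_ntk_based_score, optimizer=optimizer)
```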
4. Label- and Data-agnostic Search of NASI
Conclusion: NASI is guaranteed to be label- and data-agnostic under mild conditions, which implies the transferability of the architectures selected by NASI across different datasets.
Label-Agnostic Search
The reformulated NAS problem (8) explicitly reveals that it can be optimized without the labels of a dataset, because (12) can be derived using random labels.
Data-Agnostic Search
5. Experiments
Search Efficiency and Effectiveness
NASI is evaluated on the three search spaces of NAS-Bench-1Shot1 on CIFAR-10.
Search in NAS-Bench-1Shot1.
A lower penalty coefficient μ and a larger constraint ν (i.e., μ=1 and ν=1000) are adopted to encourage the selection of high-complexity architectures in the optimization of (13).
Search in the DARTS Search Space
We then compare NASI with other NAS algorithms in a more complex search space than NAS-Bench-1Shot1, i.e., the DARTS (Liu et al., 2019) search space.
NASI adopts a higher penalty coefficient μ and a smaller constraint ν (i.e., μ=2, ν=500).
NASI selects the architecture with a search budget of T=100 and a batch size of b=64 in this DARTS search space.
These results show that NASI is also efficient and effective in large search spaces.
Evaluation of Selected Architectures
CIFAR-100 and ImageNet
We adopt the same search settings as those in Sec. 5.1 on the DARTS search space.
Evaluation on CIFAR-10
Compared with popular training-based NAS algorithms, NASI achieves a substantial improvement in search efficiency and maintains a competitive generalization performance.
Evaluation on ImageNet.
The results on ImageNet further confirm the transferability of the architectures selected by NASI to larger-scale datasets.