Auto-PyTorch: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL
4 minute read ∼ Filed in : A paper note
Introduction
Motivation
An AutoML system should cover both hyperparameter tuning and NAS.
- The black-box hyperparameter tuning approaches below cannot scale to expensive deep-learning training:
  - Bayesian optimization
  - Evolutionary algorithms
  - Reinforcement learning
- There was no multi-fidelity benchmark with learning curves for the joint optimization of architectures and hyperparameters.
Contribution
- Proposes Auto-PyTorch Tabular, a system with automatically-designed portfolios of architectures & hyperparameters, plus ensembling.
  - It performs multi-fidelity optimization on a joint search space of architectural parameters and training hyperparameters for neural nets.
  - It targets tabular data.
  - It combines state-of-the-art approaches from multi-fidelity optimization, ensemble learning, and meta-learning for a data-driven selection of initial configurations to warm-start Bayesian optimization.
- Uses multi-fidelity optimization: cheap fidelities (training for only a few epochs) weed out poor configurations early.
- Introduces LCBench, a new benchmark for studying multi-fidelity optimization on learning curves.
- Experiments show it outperforms several common AutoML frameworks: Auto-Keras, AutoGluon, auto-sklearn, and hyperopt-sklearn.
Some notes on Related Work
- Current NAS design spaces are over-engineered, yielding optimization tasks so simple that even random search can perform well.
- MetaLearning can be used for warm-starting.
- BOHB combines Bayesian optimization (BO) with Hyperband (HB) and has been shown to outperform both BO and HB on many tasks, achieving speedups of up to 55x over random search.
Auto-PyTorch
The system implements and tunes the full DL pipeline, including data preprocessing, neural architecture, network training, and regularization.
Search Space
The search space covers a large number of hyperparameters:
- preprocessing options (e.g. encoding, imputation)
- architectural hyperparameters (e.g. network type, number of layers)
- training hyperparameters (e.g. learning rate, weight decay, dropout probability p used for regularization).
The system can search either:
- Small space: a funnel-shaped network variant with only two architectural search targets:
  - a predefined number of layers, and
  - a maximum number of units.
- Full space: allows reaching state-of-the-art (SOTA) performance.
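As a concrete illustration, here is a minimal sketch of such a joint search space written with the ConfigSpace library (which Auto-PyTorch builds on); the hyperparameter names and ranges are illustrative assumptions, not the exact Auto-PyTorch Tabular space:

```python
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter,
    UniformFloatHyperparameter,
    UniformIntegerHyperparameter,
)

cs = ConfigurationSpace(seed=1)
cs.add_hyperparameters([
    # preprocessing
    CategoricalHyperparameter("imputation", ["mean", "median", "most_frequent"]),
    # architecture
    UniformIntegerHyperparameter("num_layers", 1, 15),
    UniformIntegerHyperparameter("max_units", 64, 1024, log=True),
    # training
    UniformFloatHyperparameter("learning_rate", 1e-4, 1e-1, log=True),
    UniformFloatHyperparameter("weight_decay", 1e-5, 1e-1, log=True),
])

config = cs.sample_configuration()  # one candidate for the optimizer to evaluate
```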
Multi-fidelity Search
It uses BOHB to find well-performing configurations. The key design choice is the budget type, e.g. runtime, number of training epochs, or dataset subsample size.
The paper uses the number of training epochs as the budget type, since it generalizes well across datasets and is easy to interpret.
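For intuition, below is a minimal sketch of the successive-halving schedule that underlies BOHB/Hyperband, using epochs as the budget. `train_and_eval` is a hypothetical callback; real BOHB additionally replaces random sampling of candidates with KDE-guided sampling:

```python
# Minimal successive-halving sketch with epochs as the budget.
# `train_and_eval(cfg, epochs)` is a hypothetical callback that trains a
# configuration for the given number of epochs and returns a validation error.
def successive_halving(configs, train_and_eval, budgets=(12, 25, 50), eta=2):
    survivors = list(configs)
    for budget in budgets:
        # Evaluate every surviving configuration at the current fidelity.
        scored = [(train_and_eval(cfg, epochs=budget), cfg) for cfg in survivors]
        scored.sort(key=lambda pair: pair[0])  # lower validation error is better
        # Keep only the best 1/eta fraction for the next, more expensive budget.
        survivors = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]
    return survivors[0]  # best configuration at the highest fidelity
```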
Parallelism
Because BOHB uses a kernel density estimator (KDE) as its probabilistic model, it scales efficiently to parallel optimization.
Evaluation
The system supports both the hold-out protocol and cross-validation to estimate the accuracy of a configuration, as sketched below.
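A minimal sketch of the two protocols using scikit-learn; `build_model` is a hypothetical factory that turns a configuration into an (unfitted) estimator:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate_holdout(config, X, y, build_model):
    # Single train/validation split: cheap, but a noisier accuracy estimate.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=0)
    return build_model(config).fit(X_tr, y_tr).score(X_val, y_val)

def evaluate_cv(config, X, y, build_model, folds=5):
    # K-fold cross-validation: more reliable, roughly `folds` times the cost.
    return float(np.mean(cross_val_score(build_model(config), X, y, cv=folds)))
```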
Warm-Start
Plain BOHB starts from scratch on every new task, which wastes early evaluations.
The paper adopts warm-starting from PoSH-Auto-Sklearn: BOHB's first iteration starts with a portfolio of complementary configurations that covers a set of meta-training datasets well; afterwards, it transitions to BOHB's conventional sampling.
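For intuition, here is a minimal sketch of greedy portfolio construction in the spirit of PoSH-Auto-Sklearn, assuming a precomputed matrix of validation errors of candidate configurations on the meta-training datasets:

```python
import numpy as np

def greedy_portfolio(error, size):
    """error[i, j]: validation error of configuration i on meta-dataset j.
    Greedily add the configuration that minimizes the portfolio's mean
    per-dataset best error across all meta-datasets."""
    chosen = []
    best_so_far = np.full(error.shape[1], np.inf)  # portfolio's best error per dataset
    for _ in range(size):
        # Mean portfolio error if each candidate were added next.
        mean_err = np.minimum(error, best_so_far).mean(axis=1)
        mean_err[chosen] = np.inf  # never pick the same configuration twice
        pick = int(np.argmin(mean_err))
        chosen.append(pick)
        best_so_far = np.minimum(best_so_far, error[pick])
    return chosen  # indices of the warm-start portfolio
```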
Ensembling
After the search, the system combines the best-performing models into an ensemble.
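A minimal sketch of Caruana-style greedy ensemble selection (the approach popularized by Auto-Sklearn), assuming each candidate model's class probabilities on a held-out validation set; the exact ensembling details in Auto-PyTorch may differ:

```python
import numpy as np

def ensemble_selection(val_preds, y_val, ensemble_size=10):
    """val_preds: list of (n_samples, n_classes) probability arrays, one per
    candidate model. Greedy selection with replacement: repeatedly add the
    model whose inclusion maximizes the averaged ensemble's accuracy."""
    indices = []
    running_sum = np.zeros_like(val_preds[0])
    for _ in range(ensemble_size):
        scores = [
            np.mean(((running_sum + p) / (len(indices) + 1)).argmax(1) == y_val)
            for p in val_preds
        ]
        best = int(np.argmax(scores))
        indices.append(best)
        running_sum += val_preds[best]
    return indices  # repeated indices act as implicit ensemble weights
```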
LCBench
The paper also conducts experiments to investigate how to design multi-fidelity optimization for AutoDL from several perspectives:
- How do configurations relate to datasets?
- Are there configurations that perform well on several datasets?
- Is it possible to cover most datasets with a few complementary configurations?
- How should budgets be chosen?
Experiment
The benchmark samples 2000 configurations and evaluates each of them across 35 datasets and three budgets. Each evaluation is performed with three different seeds on Intel Xeon Gold 6242 CPUs with one core per evaluation, totaling 1,500 CPU hours.
Datasets
The datasets are very diverse in the number of features (5–1,637), data points (690–581,012), and classes (2–355), and cover binary and multi-class classification as well as strongly imbalanced tasks.
Budgets
The budget is the number of training epochs; each configuration is evaluated at 12, 25, and 50 epochs.
Findings:
- Transferring configurations to other datasets is very promising if the configurations are well selected.
- As expected, the adjacent budget pairs (12, 25) and (25, 50) exhibit a larger rank correlation than the more distant pair (12, 50); see the correlation sketch after this list.
- The paper uses fANOVA and Local Hyperparameter Importance (LPI) to quantify hyperparameter importance. It finds that the number of layers (num_layers) is the most important hyperparameter, even more important than the learning rate or weight decay; the maximum number of units (max_units) matters less. A rough proxy for this analysis is sketched below.
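A minimal sketch of the kind of rank-correlation analysis behind the budget finding, using SciPy. The per-configuration errors here are synthetic stand-ins; in LCBench they would come from the recorded learning curves:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Synthetic per-configuration validation errors at each budget (stand-ins
# for the real LCBench learning-curve data).
err_12 = rng.random(2000)
err_25 = err_12 + 0.1 * rng.random(2000)  # adjacent budget: strongly related
err_50 = err_25 + 0.1 * rng.random(2000)

rho_adjacent, _ = spearmanr(err_12, err_25)
rho_distant, _ = spearmanr(err_12, err_50)
# High rank correlation between budgets means cheap fidelities are
# informative about expensive ones, which multi-fidelity search relies on.
print(f"rho(12,25)={rho_adjacent:.3f}  rho(12,50)={rho_distant:.3f}")
```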
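This next sketch is not the paper's fANOVA/LPI machinery, but a quick proxy for the same question: fit a random forest from (encoded) hyperparameter settings to validation error and inspect its impurity-based feature importances. All data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["num_layers", "max_units", "learning_rate", "weight_decay"]
X = rng.random((2000, len(names)))  # synthetic, normalized configurations
# Synthetic target where num_layers dominates, mimicking the paper's finding.
y = 0.5 * X[:, 0] + 0.1 * X[:, 2] + 0.05 * rng.random(2000)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, forest.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```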
Evaluation
The paper evaluates the system on tabular data, ablating each of the following components in isolation, and also shows that it performs well on object recognition in NAS-Bench-201:
- Configuration space
- Multi-fidelity optimization
- Ensembling
- Warm-starting with meta-learning
Warm-start
The system was run on 100 meta-training datasets from OpenML, searching for 300 BOHB iterations.
BOHB Search
The architecture found by BOHB is better than the one found by plain BO.
Ensembling
Building ensembles from different DNNs and fidelities improves performance in the long run, sometimes substantially.
Parallel
Running BOHB in parallel yields speedups of about 3x.
Comparison with Others
The paper compares Auto-PyTorch Tabular to several state-of-the-art AutoML frameworks, i.e. Auto-Keras, AutoGluon, auto-sklearn, and hyperopt-sklearn, and also includes the earlier version Auto-Net 2.0.
Its core components are also useful for other tasks.