Auto-PyTorch: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL
4 minute read ∼ Filed in : A paper note
Introduction
Motivation
An AutoML system should cover both hyperparameter tuning and NAS.
- The black-box hyperparameter tuning approaches below cannot scale to expensive deep-learning training:
  - Bayesian optimization
  - Evolutionary algorithms
  - Reinforcement learning
- There was no multi-fidelity benchmark with learning curves for the joint optimization of architectures and hyperparameters.
Contribution
- Proposes Auto-PyTorch Tabular, a system with automatically-designed portfolios of architectures & hyperparameters, plus ensembling.
  - It performs multi-fidelity optimization on a joint search space of architectural parameters and training hyperparameters for neural nets.
  - It targets tabular data.
  - It combines state-of-the-art approaches from multi-fidelity optimization, ensemble learning, and meta-learning for a data-driven selection of initial configurations to warm-start Bayesian optimization.
- Uses multi-fidelity optimization: cheap fidelities (training for only a few epochs) weed out poor configurations early.
- Introduces LCBench, a new benchmark for studying multi-fidelity optimization on learning curves.
- Experiments show it outperforms several common AutoML frameworks: Auto-Keras, AutoGluon, auto-sklearn, and hyperopt-sklearn.
Some notes on Related Work
- Current NAS design spaces are over-engineered, yielding optimization tasks so simple that even random search can perform well.
- MetaLearning can be used for warm-starting.
- BOHB combines Bayesian optimization (BO) with Hyperband (HB) and has been shown to outperform both BO and HB on many tasks, achieving speedups of up to 55x over random search.
Auto-PyTorch
The system implements and tunes the full DL pipeline, including data preprocessing, neural architecture, network training, and regularization.
Search Space
The search space covers a large number of hyperparameters:
- preprocessing options (e.g. encoding, imputation)
- architectural hyperparameters (e.g. network type, number of layers)
- training hyperparameters (e.g. learning rate, weight decay, dropout probability p used for regularization).
The system can search either:
- Small space: a funnel-shaped network variant with only two architectural search targets:
  - a predefined number of layers, and
  - a maximum number of units.
- Full space: allows reaching state-of-the-art (SOTA) performance.
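As a concrete illustration, here is a minimal sketch of such a joint search space written with the ConfigSpace library (which Auto-PyTorch builds on); the hyperparameter names and ranges are illustrative assumptions, not the exact Auto-PyTorch Tabular space:

```python
from ConfigSpace import ConfigurationSpace
from ConfigSpace.hyperparameters import (
    CategoricalHyperparameter,
    UniformFloatHyperparameter,
    UniformIntegerHyperparameter,
)

cs = ConfigurationSpace(seed=1)
cs.add_hyperparameters([
    # preprocessing
    CategoricalHyperparameter("imputation", ["mean", "median", "most_frequent"]),
    # architecture
    UniformIntegerHyperparameter("num_layers", 1, 15),
    UniformIntegerHyperparameter("max_units", 64, 1024, log=True),
    # training
    UniformFloatHyperparameter("learning_rate", 1e-4, 1e-1, log=True),
    UniformFloatHyperparameter("weight_decay", 1e-5, 1e-1, log=True),
])

config = cs.sample_configuration()  # one candidate for the optimizer to evaluate
```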
Multi-fidelity Search
It uses BOHB to find well-performing configurations. The key design choice is the budget type, e.g. runtime, number of training epochs, or dataset subsample size.
The paper uses the number of training epochs as the budget type, since it generalizes well across datasets and is easy to interpret.
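For intuition, below is a minimal sketch of the successive-halving schedule that underlies BOHB/Hyperband, using epochs as the budget. `train_and_eval` is a hypothetical callback; real BOHB additionally replaces random sampling of candidates with KDE-guided sampling:

```python
# Minimal successive-halving sketch with epochs as the budget.
# `train_and_eval(cfg, epochs)` is a hypothetical callback that trains a
# configuration for the given number of epochs and returns a validation error.
def successive_halving(configs, train_and_eval, budgets=(12, 25, 50), eta=2):
    survivors = list(configs)
    for budget in budgets:
        # Evaluate every surviving configuration at the current fidelity.
        scored = [(train_and_eval(cfg, epochs=budget), cfg) for cfg in survivors]
        scored.sort(key=lambda pair: pair[0])  # lower validation error is better
        # Keep only the best 1/eta fraction for the next, more expensive budget.
        survivors = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]
    return survivors[0]  # best configuration at the highest fidelity
```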
Parallelism
Because BOHB uses a kernel density estimator (KDE) as its probabilistic model, it scales efficiently to parallel optimization.
Evaluation
The system supports both the hold-out protocol and cross-validation to estimate the accuracy of a configuration, as sketched below.
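A minimal sketch of the two protocols using scikit-learn; `build_model` is a hypothetical factory that turns a configuration into an (unfitted) estimator:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split

def evaluate_holdout(config, X, y, build_model):
    # Single train/validation split: cheap, but a noisier accuracy estimate.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=0)
    return build_model(config).fit(X_tr, y_tr).score(X_val, y_val)

def evaluate_cv(config, X, y, build_model, folds=5):
    # K-fold cross-validation: more reliable, roughly `folds` times the cost.
    return float(np.mean(cross_val_score(build_model(config), X, y, cv=folds)))
```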
Warm-Start
Plain BOHB starts from scratch on every new task, which wastes early evaluations.
The paper adopts warm-starting from PoSH-Auto-Sklearn: BOHB's first iteration starts with a portfolio of complementary configurations that covers a set of meta-training datasets well; afterwards, it transitions to BOHB's conventional sampling.
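For intuition, here is a minimal sketch of greedy portfolio construction in the spirit of PoSH-Auto-Sklearn, assuming a precomputed matrix of validation errors of candidate configurations on the meta-training datasets:

```python
import numpy as np

def greedy_portfolio(error, size):
    """error[i, j]: validation error of configuration i on meta-dataset j.
    Greedily add the configuration that minimizes the portfolio's mean
    per-dataset best error across all meta-datasets."""
    chosen = []
    best_so_far = np.full(error.shape[1], np.inf)  # portfolio's best error per dataset
    for _ in range(size):
        # Mean portfolio error if each candidate were added next.
        mean_err = np.minimum(error, best_so_far).mean(axis=1)
        mean_err[chosen] = np.inf  # never pick the same configuration twice
        pick = int(np.argmin(mean_err))
        chosen.append(pick)
        best_so_far = np.minimum(best_so_far, error[pick])
    return chosen  # indices of the warm-start portfolio
```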
Ensembling
After the search, the system combines the best-performing models into an ensemble.
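A minimal sketch of Caruana-style greedy ensemble selection (the approach popularized by Auto-Sklearn), assuming each candidate model's class probabilities on a held-out validation set; the exact ensembling details in Auto-PyTorch may differ:

```python
import numpy as np

def ensemble_selection(val_preds, y_val, ensemble_size=10):
    """val_preds: list of (n_samples, n_classes) probability arrays, one per
    candidate model. Greedy selection with replacement: repeatedly add the
    model whose inclusion maximizes the averaged ensemble's accuracy."""
    indices = []
    running_sum = np.zeros_like(val_preds[0])
    for _ in range(ensemble_size):
        scores = [
            np.mean(((running_sum + p) / (len(indices) + 1)).argmax(1) == y_val)
            for p in val_preds
        ]
        best = int(np.argmax(scores))
        indices.append(best)
        running_sum += val_preds[best]
    return indices  # repeated indices act as implicit ensemble weights
```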
LCBench
The paper also conducts experiments to investigate how to design multi-fidelity optimization for AutoDL from several perspectives:
- How do configurations relate to datasets?
- Are there configurations that perform well on several datasets?
- Is it possible to cover most datasets with a few complementary configurations?
- How should budgets be chosen?
Experiment
The benchmark samples 2000 configurations and evaluates each of them across 35 datasets and three budgets. Each evaluation is performed with three different seeds on Intel Xeon Gold 6242 CPUs with one core per evaluation, totaling 1,500 CPU hours.
Datasets
The datasets are very diverse in the number of features (5–1,637), data points (690–581,012), and classes (2–355), and cover binary and multi-class classification as well as strongly imbalanced tasks.
Budgets
The budget is the number of training epochs; each configuration is evaluated at 12, 25, and 50 epochs.
Findings:
- Transferring configurations to other datasets is very promising if the configurations are well selected.
- As expected, the adjacent budget pairs (12, 25) and (25, 50) exhibit a larger rank correlation than the more distant pair (12, 50); see the correlation sketch after this list.
- The paper uses fANOVA and Local Hyperparameter Importance (LPI) to quantify hyperparameter importance. It finds that the number of layers (num_layers) is the most important hyperparameter, even more important than the learning rate or weight decay; the maximum number of units (max_units) matters less. A rough proxy for this analysis is sketched below.
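A minimal sketch of the kind of rank-correlation analysis behind the budget finding, using SciPy. The per-configuration errors here are synthetic stand-ins; in LCBench they would come from the recorded learning curves:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Synthetic per-configuration validation errors at each budget (stand-ins
# for the real LCBench learning-curve data).
err_12 = rng.random(2000)
err_25 = err_12 + 0.1 * rng.random(2000)  # adjacent budget: strongly related
err_50 = err_25 + 0.1 * rng.random(2000)

rho_adjacent, _ = spearmanr(err_12, err_25)
rho_distant, _ = spearmanr(err_12, err_50)
# High rank correlation between budgets means cheap fidelities are
# informative about expensive ones, which multi-fidelity search relies on.
print(f"rho(12,25)={rho_adjacent:.3f}  rho(12,50)={rho_distant:.3f}")
```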
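This next sketch is not the paper's fANOVA/LPI machinery, but a quick proxy for the same question: fit a random forest from (encoded) hyperparameter settings to validation error and inspect its impurity-based feature importances. All data below is synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = ["num_layers", "max_units", "learning_rate", "weight_decay"]
X = rng.random((2000, len(names)))  # synthetic, normalized configurations
# Synthetic target where num_layers dominates, mimicking the paper's finding.
y = 0.5 * X[:, 0] + 0.1 * X[:, 2] + 0.05 * rng.random(2000)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, forest.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```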
Evaluation
The paper evaluates the system on tabular data, ablating each of the following components in isolation, and also shows that it performs well on object recognition in NAS-Bench-201:
- Configuration space
- Multi-fidelity optimization
- Ensembling
- Warm-starting with meta-learning
Warm-start
The system was run on 100 meta-training datasets from OpenML, searching for 300 BOHB iterations.
BOHB Search
The architecture found by BOHB is better than the one found by plain BO.
Ensembling
Building ensembles from different DNNs and fidelities improves performance in the long run, sometimes substantially.
Parallel
Running BOHB in parallel yields speedups of about 3x.
Comparison with Others
The paper compares Auto-PyTorch Tabular to several state-of-the-art AutoML frameworks, i.e. Auto-Keras, AutoGluon, auto-sklearn, and hyperopt-sklearn, and also includes the earlier version Auto-Net 2.0.
Its core components are also useful for other tasks.