Auto-PyTorch: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL

Posted on May 12, 2022 · 4 minute read



Introduction

Motivation

An AutoML system should support both hyperparameter tuning and neural architecture search (NAS).

  1. The following hyperparameter tuning approaches do not scale well:
    • Black-box Bayesian optimization
    • Evolutionary algorithms
    • Reinforcement learning
  2. There is no multi-fidelity benchmark of learning curves for the joint optimization of architectures and hyperparameters.

Contribution

  1. Proposes a system built on automatically-designed portfolios of architectures & hyperparameters, combined with ensembling.

    • Auto-PyTorch Tabular performs multi-fidelity optimization on a joint search space of architectural parameters and training hyperparameters for neural nets.
    • It targets tabular data.

    The system combines state-of-the-art approaches from multi-fidelity optimization, ensemble learning, and meta-learning for a data-driven selection of initial configurations for warm starting Bayesian optimization.

  2. Uses multi-fidelity optimization: most configurations are evaluated at cheap fidelities (training for only a few epochs), and only the promising ones are promoted to larger budgets.

  3. Introduces LCBench, a new benchmark of learning curves for studying multi-fidelity optimization.

  4. Experiments show it outperforms several other common AutoML frameworks: AutoKeras, AutoGluon, auto-sklearn, and hyperopt-sklearn.

Some Notes on Related Work

  1. The design space of current NAS benchmarks is over-engineered, leading to very simple optimization tasks where even random search can perform well.
  2. Meta-learning can be used for warm-starting the search.
  3. BOHB combines Bayesian optimization (BO) with Hyperband (HB) and has been shown to outperform both on many tasks, achieving speedups of up to 55x over random search.

Auto-PyTorch

The system implements and tunes the full DL pipeline, including data preprocessing, neural architecture, network training, and regularization.
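For reference, here is a minimal usage sketch of the tabular classification API, based on the project's documented example; exact argument names such as `total_walltime_limit` may differ across autoPyTorch versions.

```python
# Minimal Auto-PyTorch Tabular sketch; argument names may vary between versions.
import sklearn.datasets
import sklearn.model_selection
from autoPyTorch.api.tabular_classification import TabularClassificationTask

X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

api = TabularClassificationTask()
# search() tunes the whole pipeline (preprocessing, architecture, training
# hyperparameters) under a wall-clock budget and builds an ensemble at the end.
api.search(
    X_train=X_train, y_train=y_train,
    X_test=X_test, y_test=y_test,
    optimize_metric='accuracy',
    total_walltime_limit=300,        # seconds for the whole search
    func_eval_time_limit_secs=50,    # seconds per single pipeline evaluation
)

y_pred = api.predict(X_test)
print(api.score(y_pred, y_test))
```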


Search Space

The search space spans a large number of hyperparameters:

  1. preprocessing options (e.g. encoding, imputation)
  2. architectural hyperparameters (e.g. network type, number of layers)
  3. training hyperparameters (e.g. learning rate, the weight-decay coefficient used for regularization).

The search space comes in two variants:

  1. Small space: a funnel-shaped variant with only two architectural search targets (see the sketch after this list).
    • A predefined number of layers.
    • A maximum number of units; the per-layer widths are derived from these two values.
  2. Full space: needed to reach state-of-the-art (SOTA) performance.
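To see why the small space needs only two targets, here is an illustrative sketch (my own, not the library's code) of deriving funnel-shaped layer widths from just the number of layers and the maximum number of units:

```python
def funnel_widths(num_layers, max_units, min_units=8):
    """Derive per-layer widths that shrink linearly from max_units down to min_units.

    With a shaped (funnel) network, only num_layers and max_units need to be
    searched; every intermediate width follows from them.
    """
    if num_layers == 1:
        return [max_units]
    step = (max_units - min_units) / (num_layers - 1)
    return [round(max_units - i * step) for i in range(num_layers)]

# Example: a 4-layer funnel capped at 256 units per layer.
print(funnel_widths(4, 256))  # -> [256, 173, 91, 8]
```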


It uses BOHB to find well-performing configurations. A key design choice is the budget type: runtime, number of training epochs, or dataset subsample size.

The paper uses the number of training epochs as the budget type, since it offers good generality and interpretability.
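To make the epoch budget concrete, here is a sketch of one successive-halving bracket over the 12/25/50-epoch budgets used later in LCBench. It is a simplification of what Hyperband/BOHB does; `train_for_epochs` and `successive_halving` are hypothetical placeholders, not library functions.

```python
import random

def train_for_epochs(config, epochs):
    """Placeholder: train `config` for `epochs` epochs and return validation accuracy."""
    return random.random()  # stand-in for a real (expensive) training run

def successive_halving(configs, budgets=(12, 25, 50), eta=2):
    """Evaluate all configs on the cheapest budget, then repeatedly keep the top
    1/eta fraction and re-evaluate only the survivors on the next, larger budget."""
    survivors = list(configs)
    for budget in budgets:
        scored = [(train_for_epochs(cfg, budget), cfg) for cfg in survivors]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        survivors = [cfg for _, cfg in scored[: max(1, len(scored) // eta)]]
    return survivors[0]

# Example: 8 random learning-rate configurations.
candidates = [{"lr": 10 ** random.uniform(-4, -1)} for _ in range(8)]
best = successive_halving(candidates)
```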

Parallelization

Since BOHB uses a kernel density estimator (KDE) as its probabilistic model, which is cheap to refit and to sample from, the optimization scales efficiently to many parallel workers.
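As a rough illustration of the density-ratio (TPE-style) model BOHB relies on: fit one KDE on the best observed configurations and another on the rest, then propose the candidate with the highest ratio. The function `kde_propose` below is my own simplification for a single 1-d hyperparameter, not the BOHB implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_propose(observed_x, observed_y, n_candidates=64, top_fraction=0.15):
    """Fit one KDE on the best observations (good) and one on the rest (bad),
    then propose the candidate maximizing good(x)/bad(x). Lower observed_y is better."""
    order = np.argsort(observed_y)
    n_good = max(2, int(top_fraction * len(observed_x)))
    good, bad = observed_x[order[:n_good]], observed_x[order[n_good:]]
    good_kde, bad_kde = gaussian_kde(good), gaussian_kde(bad)
    candidates = good_kde.resample(n_candidates).ravel()   # sample near the good region
    ratio = good_kde(candidates) / np.maximum(bad_kde(candidates), 1e-12)
    return float(candidates[np.argmax(ratio)])

# Example: 40 log10 learning rates with a made-up loss minimized near -2.5.
rng = np.random.default_rng(0)
xs = rng.uniform(-4, -1, size=40)
ys = (xs + 2.5) ** 2 + rng.normal(0, 0.1, size=40)
print(kde_propose(xs, ys))
```

Because fitting and sampling such a model is cheap, many workers can query it without the model becoming a bottleneck.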

Evaluation

The system supports both a hold-out protocol and cross-validation to estimate the accuracy of a configuration.
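For concreteness, the two protocols amount to something like the following; scikit-learn is used purely to illustrate the evaluation protocols, not Auto-PyTorch's internals.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold-out: train on one split, validate on the held-out part (cheap, higher variance).
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=0)
holdout_acc = MLPClassifier(max_iter=300).fit(X_tr, y_tr).score(X_val, y_val)

# Cross-validation: average over k folds (more robust, k times more expensive).
cv_acc = cross_val_score(MLPClassifier(max_iter=300), X, y, cv=5).mean()
print(holdout_acc, cv_acc)
```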

Warm-Start

Plain BOHB starts from scratch on every new task, which wastes its early iterations.

The paper adopts the warm-starting idea from PoSH-Auto-Sklearn: BOHB's first iteration starts with a portfolio of complementary configurations that cover a set of meta-training datasets well; afterwards, it transitions to BOHB's conventional sampling.
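The portfolio construction can be viewed as a greedy cover of the meta-training datasets: repeatedly add the configuration that most improves the portfolio's per-dataset best performance. The sketch below (function `greedy_portfolio` and the random matrix are illustrative assumptions, not the exact PoSH-Auto-Sklearn procedure) shows the idea.

```python
import numpy as np

def greedy_portfolio(perf, size):
    """Greedily pick `size` configurations from a (n_configs, n_datasets) matrix of
    meta-training performances so that the per-dataset best over the portfolio,
    averaged across datasets, grows as fast as possible."""
    chosen = []
    best_per_dataset = np.full(perf.shape[1], -np.inf)
    for _ in range(size):
        # Gain of each candidate: the average per-dataset score if it were added.
        gains = np.maximum(perf, best_per_dataset).mean(axis=1)
        gains[chosen] = -np.inf              # never pick the same config twice
        nxt = int(np.argmax(gains))
        chosen.append(nxt)
        best_per_dataset = np.maximum(best_per_dataset, perf[nxt])
    return chosen

# Example: 2000 candidate configurations evaluated on 100 meta-training datasets.
rng = np.random.default_rng(0)
meta_perf = rng.random((2000, 100))          # stand-in for validation accuracies
portfolio = greedy_portfolio(meta_perf, size=16)
```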

Ensembling

Instead of returning only the single best model, the system combines the evaluated models into an ensemble.
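This is typically done with greedy ensemble selection, as in Auto-Sklearn: start from an empty ensemble and repeatedly add, with replacement, the model whose inclusion most improves validation accuracy. A minimal sketch (the function `ensemble_selection` and the random data are illustrative, not the library's code):

```python
import numpy as np

def ensemble_selection(val_probs, y_val, n_rounds=20):
    """Greedy ensemble selection over a (n_models, n_samples, n_classes) array of
    validation probabilities: repeatedly add (with replacement) the model whose
    inclusion maximizes the accuracy of the averaged ensemble prediction."""
    chosen = []
    running_sum = np.zeros_like(val_probs[0])
    for _ in range(n_rounds):
        accs = []
        for m in range(len(val_probs)):
            avg = (running_sum + val_probs[m]) / (len(chosen) + 1)
            accs.append((avg.argmax(axis=1) == y_val).mean())
        best = int(np.argmax(accs))
        chosen.append(best)
        running_sum += val_probs[best]
    return chosen

# Example: 5 candidate models, 200 validation samples, 3 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(5, 200))
labels = rng.integers(0, 3, size=200)
print(ensemble_selection(probs, labels))
```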

LCBench

The paper also conducts experiments to investigate how to design multi-fidelity optimization for AutoDL from several perspectives.

  1. How do the configurations relate to the datasets?
    • Are there configurations that perform well on several datasets?
    • Is it possible to cover most datasets with a few complementary configurations?
  2. How should budgets be chosen?

Experiment

LCBench samples 2,000 configurations and evaluates each of them across 35 datasets and three budgets. Each evaluation is performed with three different seeds on Intel Xeon Gold 6242 CPUs with one core per evaluation, totaling 1,500 CPU hours.

Datasets

The datasets are very diverse in the number of features (5-1,637), data points (690-581,012), and classes (2-355), and cover binary and multi-class classification as well as strongly imbalanced tasks.

Budgets

The budget is the number of training epochs; each configuration is evaluated for 12, 25, and 50 epochs.


Findings:

  1. Transferring configurations to other datasets is very promising if the configurations are well selected.

  2. As expected, the adjacent budget pairs (12, 25) and (25, 50) exhibit a higher correlation than the more distant budget pair (12, 50); see the correlation sketch after this list.

  3. The paper uses fANOVA and Local Hyperparameter Importance (LPI) to quantify hyperparameter importance. It finds that the number of layers (num_layers) is the most important hyperparameter, even more important than the learning rate or weight decay.

    The maximum number of neurons (max_units) is less important.
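Finding 2 can be checked on LCBench-style data by computing the rank correlation of per-configuration scores between budget pairs. A minimal sketch with random stand-in accuracies (the real numbers would come from the LCBench learning curves):

```python
import numpy as np
from scipy.stats import spearmanr

# Stand-in for per-configuration validation accuracy at each budget (epochs);
# in practice these come from the LCBench learning curves.
rng = np.random.default_rng(0)
acc = {12: rng.random(2000), 25: rng.random(2000), 50: rng.random(2000)}

for b1, b2 in [(12, 25), (25, 50), (12, 50)]:
    rho, _ = spearmanr(acc[b1], acc[b2])
    print(f"budgets ({b1}, {b2}): Spearman rho = {rho:.2f}")
```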

Evaluation

The paper evaluates the system on tabular data, studying each of the following components in isolation, and also shows that the approach performs well on object recognition in NAS-Bench-201:

  1. Configuration space
  2. Multi-fidelity optimization
  3. Ensembling
  4. Warm-start with meta-learning.

Warm-start

The system was run on 100 meta-training datasets from OpenML, searching for 300 BOHB iterations.


The configurations found by BOHB are better than those found by plain BO.


Ensembling

Building ensembles from different DNNs and fidelities improves performance in the long run, sometimes substantially.


Parallel

Running the search with parallel workers yields speedups of roughly 3x.


Comparison with Others

The paper compares Auto-PyTorch Tabular to several state-of-the-art AutoML frameworks, i.e. Auto-Keras, Auto-Sklearn, and hyperopt-sklearn. It also includes the earlier version, Auto-Net 2.0.


The core components are also useful for other tasks (e.g. object recognition on NAS-Bench-201).





