Towards Demystifying Serverless Machine Learning Training

Posted on June 10, 2022   5 minute read ∼ Filed in  : 

Introduction

Motivation

Serverless platforms: AWS Lambda, Azure Functions, Google Cloud Functions. Applications adopted in serverless: Event processing, API composition, API aggregation, data flow control, etc

Serverless computing has three benefits.

  1. Unlimited elasticity: user can specify the number of such functions that are executed concurrently
  2. Pay per use pricing model: The user specifies an executable function and is only charged for the duration of the function execution.
  3. Lower start-up and set-up overhead.

Most Serverless infrastructure has the following limitations:

  1. Only supports stateless function calls with limited computations resource and duration. (a function call in AWS Lambda can use up to 3GB of memory and must finish within 15 minutes).
  2. Do not allow direct communications between stateless functions.

In summary, FaaS’s resource is on-demand and auto-scaled. While in IaaS, users tend to overprovision to handle peak workloads by reserving more computation resources than needed.

But the advantage of serverless is still inconclusive when compared with server-full infrastructures in ML applications, eg.

  1. When a serverless infrastructure (FaaS) outperforms a server-full infrastructure ( IaaS ) for distributed ML training?

Contribution

The paper implements a platform, LambdaML, to present a comparative study of distributed ML over FaaS and IaaS. And it also develops an analytic model to capture cost/performance tradeoffs that must be considered when opting for a serverless infrastructure.

In summary

  1. Systematically explore algorithm choice and system design for FaaS and IaaS ML training, and depict the tradeoff over an extensive range of ML models, training workloads, and infrastructure choices.
  2. Develop an analytical model that characterizes the tradeoff between FaaS and IaaS-based training
    • Fari comparison: FaaS and IaaS implement the same algorithm (SGD); run the most suitable algorithms with the most suitable hyper-parameters for FaaS and IaaS.
    • End-to-end benchmark: end-to-end training performance

Foundings:

  1. FaaS can be faster than IaaS if communication is very efficient.
  2. When FaaS is much faster, it is not much cheaper. When FaaS is faster, It usually incurs a comparable cost in dollars.

System Design

Basic design dimensions

image-20220610200419126

The paper builds an ML system on top of Amazon Lambda. And four dimensions need to be considered when developing the system, namely:

  1. ML optimization algorithm
    • gradient averaging
    • model averaging
  2. The communication channel
    • Shared-nothing architecture with message passing between executors.
    • shared-disk file system or in-memory kv store, in AWS, there are four alternatives—S3, ElastiCache for Redis, ElastiCache for Memcached, and DynamoDB
  3. Communication patterns can be Gather, AllReduce, or ScatterReduce.
  4. Synchronization protocol can be bulk synchronous parallel (BSP) and asynchronous parallel (ASP)
    • BSP: can be divided into the merge phase and update phase.
    • ASP: Each executor update one shared-global model state without caring about the speeds of other executors.

FaaS Details

Execution time cannot be longer than 15 minutes. So each executor runs 15mins and then restarts by triggering themselves.

image-20220610201516112

Evaluation

Settings

Datasets: CIFAR10, RCV1, HIGG

Models: LR, SVM, MN, RN, KM, EM.

Optimization Algorithms: GA-SGD, MA-SGD, ADMM.

Evaluate algorithms

image-20220610202859642

GA-SGD can converge steadily and achieve a lower loss, and MA is unstable.

Evaluate communication channels.

Compare design choices including Memcached, S3, Redis, and DynamoDB.

image-20220610203137731

S3: no startup time,

Memcached: multiple threading architectures are suitable to support large models’ communication.

DynamoDB: reduce communication time by 20%, but there is startup time. It also only allows messages smaller than 400KB.

Hybrid solutions: A ps container can save 2X communication compared with S3/Memcached.

HybridPS is currently bounded not only by the maximal network bandwidth but also by serialization/ deserialization and model update. Since serialization and deserialization are the bottlenecks, increasing bandwidth by adding more ps cannot speed up the training.

And the hybrid approach is slower than a pure FaaS approach with a large model. (due to seria/desearia.)

Communication pattern

image-20220610205746340

When communication is heavy, a single reducer (in ALL reduce) is the bottleneck, and ScatterReduce is better.

When the communication is small, ScatterReduce is a little bit slow since it will partition all data.

Sync protocols.

Study Synchronous and Asynchronous.

image-20220610205758632

Async: Store a global model at S3, and each FaaS will rewrite it.

Synchronous converges steadily, whereas Asynchronous suffers from unstable convergence.

Compare FAAS & IAAS

Start FaaS and IaaS once the data is submitted into S3, then train the model until it converges.

Compare LambdaML with Hybrid PS, Distributed PyTorch, and Angel.

Datasets and Models:

image-20220610211727153

Loss

image-20220610211928347

Total Run time

runtime = start-up time + communiaction time + data-loading time.

FaaS function can quickly startup. At the same time, the IaaS requires starting EC2 clusters, mounting shared volumes, dispatching scripts, etc.

LambdaML is the fastest, although it has the highest communication time.

image-20220610213010078

Analytical

The paper also provides a cost model to model the execution time.

image-20220610214843261

FaaS incurs a more negligible start-up overhead, while IaaS incurs a smaller communication overhead because of its flexible mechanism and higher bandwidth.

Conclusion:

When the R ( number of epochs to converge, algorithm speed ) is small, or f ( scaling factor, minor f means better scalability ) is small, the FaaS is better than IaaS.

In other words, when the algorithm is fast, and communication is less, FaaS is better than IaaS.





END OF POST




Tags Cloud


Categories Cloud




It's the niceties that make the difference fate gives us the hand, and we play the cards.