Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads

Posted on November 5, 2022

Question

  1. Why is using serverless cheaper than using a cluster of CPUs?

    => With regular CPU instances, users pay not only for compute but also for unneeded resources (e.g., storage).

    Note that Lambdas are a perfect fit for GNNs’ tensor computations. While one could also employ regular CPU instances for computing, using such instances would incur a much higher monetary cost to provide the same level of burst parallelism (e.g., 2.2× in our experiments) since users not only pay for the compute but also other unneeded resources (e.g., storage).

  2. The system updates the weights and gathers vertex data with bounded asynchrony, so it maintains multiple weight versions. When are the weights synchronized?

    => Each parameter server periodically broadcasts its latest weights for aggregation.

Introduction

Background & Motivation

The GNN family (e.g., GCN) has achieved great success. It is therefore essential to train GNNs in an affordable, scalable, and accurate manner.

Gap

Training GNNs requires many GPUs, but:

  • Using GPUs in the cloud is costly.
  • GPUs have limited memory, which hinders scalability.

CPU-based training and graph sampling are commonly used to address these two issues (high cost and poor scalability), but:

  • CPUs offer limited parallelism and thus poor efficiency.
  • Graph sampling adds time overhead and reduces the accuracy of the trained GNN.

Challenge

Serverless computing can scale at low cost, but it is challenging to adapt it to neural network training:

  • How to make the computation fit into a Lambda's limited compute resources:

    • A single Lambda thread is too weak to execute a tensor kernel on large data (large data => high FLOPs => long run time).
    • Breaking data into tiny mini-batches incurs high data-transfer overhead.
  • How to minimize the negative impact of Lambdas' network latency.

    • Roughly one-third of the time is spent on communication.

      E.g., when the number of Lambdas launched reaches 100, the per-Lambda bandwidth drops to 200 Mbps.

Goal

This paper proposes affordable, scalable, and accurate GNN training.

  • Affordable: low-cost.
  • Scalable: handles billion-edge graphs.
  • Accurate: higher accuracy than sampling-based methods.

Details:

  • Use serverless computing and CPU servers.
    • It overcomes the above challenges by dividing the training pipeline into tasks and executing them with appropriate resources.
    • Graph operations => CPU
    • Tensor computation => Lambdas.
  • The bounded pipeline asynchronous computation (BPAC) model reduces the impact of communication overhead.
    • Different tasks overlap with each other (pipelining).
    • It allows asynchrony in the parameter-update and data-gathering processes, while bounding the degree of asynchrony.

System Design and Contribution

A GNN forward pass can be divided into four computation stages: Gather (GA), ApplyVertex (AV), Scatter (SC), and ApplyEdge (AE).
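As a concrete (and heavily simplified) illustration, the sketch below maps one GCN-style layer onto these four stages. This is not Dorylus code; the dense adjacency matrix, ReLU activation, and identity Scatter/ApplyEdge are illustrative assumptions.

```python
import numpy as np

def gather(adj, vals):
    # Gather (GA): each vertex accumulates the values on its in-edges.
    return adj @ vals

def apply_vertex(agg, W):
    # ApplyVertex (AV): per-vertex dense tensor computation (the Lambda part).
    return np.maximum(agg @ W, 0.0)   # linear transform + ReLU

def scatter(h_new):
    # Scatter (SC): push each vertex's new value onto its out-edges.
    return h_new                      # identity here; edges simply reuse vertex values

def apply_edge(edge_vals):
    # ApplyEdge (AE): per-edge computation (identity for a plain GCN layer).
    return edge_vals

def gcn_layer(adj, H, W):
    return apply_edge(scatter(apply_vertex(gather(adj, H), W)))

# toy usage: 4 vertices, 8-dim input features, 16-dim output features
adj = np.eye(4) + np.random.rand(4, 4)
H = np.random.rand(4, 8)
W = np.random.rand(8, 16)
print(gcn_layer(adj, H, W).shape)     # (4, 16)
```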

The graph is partitioned using an edge-cut algorithm.

Tasks and pipelining

Goal: decompose training into fine-grained tasks such that

  • each task's computation fits in a Lambda, and
  • tasks can overlap with each other.
  1. Fine-grained tasks => fit each task into a Lambda.
  • Graph computation (adjacency matrices) => on graph servers (GSs).
  • Tensor computation => on Lambdas, to benefit from massive parallelism.
  • Weight updates => on parameter servers (PSs).
  2. Pipelining enables overlapping so that communication cost can be hidden (see the sketch after this list).
  • Vertices are partitioned into groups called intervals.
  • Each interval computes GA using one thread on a GS. Once GA is done, the result is pushed to a Lambda for AV.
  • The graph-parallel computation of one interval overlaps with the tensor-parallel computation of another interval.
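A minimal sketch of the pipelining idea, using Python threads as stand-ins for GS threads and Lambdas. The interval sizes, feature dimensions, and queue-based hand-off are illustrative assumptions, not the paper's implementation.

```python
import queue
import threading
import numpy as np

NUM_INTERVALS, FEAT, HID = 8, 16, 32
adj_blocks  = [np.random.rand(64, 64) for _ in range(NUM_INTERVALS)]
feat_blocks = [np.random.rand(64, FEAT) for _ in range(NUM_INTERVALS)]
W = np.random.rand(FEAT, HID)

ga_results = queue.Queue()                     # hand-off between the GS and the "Lambdas"

def graph_server():
    for i in range(NUM_INTERVALS):
        agg = adj_blocks[i] @ feat_blocks[i]   # Gather (GA) for interval i, on the CPU
        ga_results.put((i, agg))               # push the result to the tensor workers
    ga_results.put(None)                       # sentinel: no more intervals

def lambda_worker():
    while True:
        item = ga_results.get()
        if item is None:
            ga_results.put(None)               # let the other workers exit too
            break
        i, agg = item
        _ = np.maximum(agg @ W, 0.0)           # ApplyVertex (AV) for interval i

threads = [threading.Thread(target=graph_server)] + \
          [threading.Thread(target=lambda_worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

While the workers run AV on interval i, the graph-server thread is already computing GA for interval i+1, which is what hides the communication and tensor-compute latency.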

Bounded Asynchrony

Goal: asynchronous training may need more epochs to reach a given accuracy, but each epoch takes less time. Experiments show that overall efficiency is higher.

Two kinds of asynchrony

  • Asynchronous weight updates (using the weight-stashing technique)
    • Each Lambda sends its local gradient to a PS and then fetches new weights from it. The PS applies the update immediately, without waiting for other workers, which requires keeping multiple weight versions, one per vertex interval.
    • The latest weights are fully replicated across all PSs to enable load balancing.
    • The stashed versions are partitioned across the PSs to reduce the memory usage of each PS.
    • Each PS periodically broadcasts its latest weights to perform weight aggregation.
  • Asynchronous Gather
    • Vertex intervals progress independently, using stale values of their neighbors without waiting for updates.
    • A fast-moving vertex interval is allowed to run at most S epochs ahead of the slowest-moving interval; otherwise it stops its GA and waits for the slow-moving interval to catch up (see the sketch below).
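A minimal sketch of the bounded-staleness rule for asynchronous Gather, assuming a shared per-interval epoch counter and a condition variable. The names and synchronization scheme are illustrative, not the paper's implementation.

```python
import threading

S = 2                              # staleness bound (chosen experimentally)
NUM_INTERVALS, EPOCHS = 4, 10
epoch = [0] * NUM_INTERVALS        # per-interval progress
cond = threading.Condition()

def run_interval(i):
    for _ in range(EPOCHS):
        with cond:
            # block while this interval is more than S epochs ahead of the slowest one
            while epoch[i] - min(epoch) > S:
                cond.wait()
        # ... GA with possibly stale neighbor values, then AV / SC / AE ...
        with cond:
            epoch[i] += 1
            cond.notify_all()

threads = [threading.Thread(target=run_interval, args=(i,)) for i in range(NUM_INTERVALS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```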

The paper also discusses convergence guarantees.

  • For the convergence of asynchronous weight updates, it mainly cites prior work.
  • For the convergence of asynchronous Gather with bounded staleness S, it provides a new analysis, assuming the number of iterations N goes to infinity.

Lambda Management

Each graph server maintains a Lambda controller.

Optimizations:

  • Task fusion:
    • Merge ApplyVertex of the last layer with its gradient computation.
    • This saves communication between Lambdas and GSs.
  • Rematerialization:
    • Re-compute the intermediate result AHW instead of storing it.
  • Lambda-internal streaming:
    • Overlaps computation with communication.
    • Retrieve the first half of the data and start computing on it while fetching the other half (see the sketch below).
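A toy sketch of Lambda-internal streaming: split the interval's data into two chunks and compute on the first while the second is still being fetched. The chunk sizes, shapes, and thread-pool hand-off are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fetch_chunk(k):
    # stands in for pulling one half of the interval's tensor data from the GS
    return np.random.rand(512, 64)

def compute(chunk, W):
    return np.maximum(chunk @ W, 0.0)          # ApplyVertex on one chunk

W = np.random.rand(64, 128)
with ThreadPoolExecutor(max_workers=2) as pool:
    first  = pool.submit(fetch_chunk, 0)
    second = pool.submit(fetch_chunk, 1)       # fetch of chunk 1 proceeds...
    out0 = compute(first.result(), W)          # ...while chunk 0 is being computed
    out1 = compute(second.result(), W)
result = np.vstack([out0, out1])               # full ApplyVertex output for the interval
```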

Autotuning the number of Lambdas

  • The autotuner automatically adjusts the number of Lambdas by monitoring the task queue on the CPU (graph) servers; one possible policy is sketched below.
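The notes only say that the autotuner watches the task queue; a plausible policy, with made-up thresholds and step sizes, might look like this sketch.

```python
def autotune(num_lambdas, queue_len, min_lambdas=1, max_lambdas=100):
    # made-up thresholds: grow when GA results pile up waiting for Lambdas,
    # shrink when the queue is empty and Lambdas would sit idle
    if queue_len > 10:
        return min(max_lambdas, num_lambdas + 2)
    if queue_len == 0:
        return max(min_lambdas, num_lambdas - 1)
    return num_lambdas

# example: the controller reacts to the observed queue length each round
n = 10
for q in [12, 15, 20, 3, 0, 0]:
    n = autotune(n, q)
    print(q, "->", n)
```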


Evaluation

Metric:

  • Value == performance-per-dollar: V = 1/(T × C), where T is the training time and C is the monetary cost (a worked example follows).
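A quick worked example of the value metric with made-up numbers, under the performance-per-dollar reading V = (1/T)/C = 1/(T·C).

```python
def value(train_time_hours, cost_dollars):
    # performance = 1 / T, so performance per dollar = (1 / T) / C = 1 / (T * C)
    return 1.0 / (train_time_hours * cost_dollars)

# hypothetical numbers: backend A trains in 2 h for $5, backend B in 1 h for $12
print(value(2, 5))    # 0.1  -> A yields more value despite being slower
print(value(1, 12))   # ~0.083
```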

Experiments:

  • Compare instance types to determine the cloud resources that yield the best value for each backend.
    • Compare two CPU instance types over two datasets; C5n consistently gives better value than the other CPU instance.
    • Compare two GPU instance types on one dataset; P3 gives the best value.
  • Compare synchronous and asynchronous training.
    • Async training requires more epochs to converge to the same accuracy as sync training, but each epoch takes less time; overall, async improves efficiency.
    • Experimentally choose the staleness bound S (how far ahead a fast interval may run before it waits for slower ones).
  • Measure value, performance, and scalability as the number of Lambdas varies, and compare with CPU-only and GPU-only implementations.
  • Compare with existing systems:
    • Accuracy versus sampling-based training.
    • Time to reach a target test accuracy.
    • Performance-per-dollar.



