PaSca: A Graph Neural Architecture Search System under the Scalable Paradigm
Introduction
Motivation
- GNNs do not scale well with data size or the number of message-passing steps: the exponential growth of neighborhood size leads to exponential I/O overhead, a major challenge in large-scale GNN training.
- Some work tries to train GNNs in a distributed way, but the neighbor-aggregation procedure bottlenecks training speed.
- There is no general design space for GNNs, and exploring the space of possible designs is expensive.
Contribution
The paper proposes PaSca, the first paradigm and system for systematically constructing and exploring the design space of scalable GNNs:
- Introduces a scalable graph neural architecture paradigm with three abstractions:
  - graph_aggregator: captures structural information via graph aggregation operations.
  - message_aggregator: combines different levels (hops) of structural information.
  - message_updater: generates predictions based on the multi-scale features.

  With these abstractions, the system can define a general design space and decouple graph aggregation (sampling) from training; a minimal sketch of the resulting pipeline appears in the Abstraction section below.
- Proposes a general design space consisting of 6 design dimensions, covering about 150k possible designs of scalable GNNs. The space also includes adaptive aggregation and a complementary post-processing stage.
- Proposes a search system to explore this space automatically (a simplified sketch of the search loop follows this list):
  - Suggestion engine: a multi-objective search algorithm that proposes candidate architectures.
  - Evaluation engine: evaluates the suggested architectures in a distributed manner.
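As a rough, simplified sketch of how such a suggestion/evaluation loop could work: random sampling plus Pareto filtering below stand in for the paper's actual multi-objective optimizer, and the dimension names in `DESIGN_SPACE` are my own placeholders, not the paper's exact six dimensions.

```python
import random

# Hypothetical dimension names, used only to make the sketch concrete.
DESIGN_SPACE = {
    "graph_agg_type":  ["ppr", "mean", "laplacian"],
    "graph_agg_steps": [2, 4, 8, 16],
    "message_agg":     ["concat", "mean", "max", "adaptive"],
    "updater_layers":  [2, 3, 4],
    "post_agg_type":   ["none", "ppr"],
    "post_agg_steps":  [0, 2, 4],
}

def suggest(n):
    """Suggestion engine (simplified to random sampling)."""
    return [{k: random.choice(v) for k, v in DESIGN_SPACE.items()}
            for _ in range(n)]

def evaluate(cfg):
    """Evaluation engine placeholder: would train `cfg` (possibly distributed)
    and report validation error and inference time; random numbers here."""
    return {"cfg": cfg, "error": random.random(), "time": random.random()}

def dominates(a, b):
    """a dominates b if it is no worse on both objectives and better on one."""
    return (a["error"] <= b["error"] and a["time"] <= b["time"]
            and (a["error"] < b["error"] or a["time"] < b["time"]))

def pareto_front(results):
    """Keep only the non-dominated (error, inference time) trade-offs."""
    return [r for r in results if not any(dominates(o, r) for o in results)]

print(pareto_front([evaluate(c) for c in suggest(20)]))
```

The multi-objective view is what later lets the paper report a family of searched architectures (PaSca-V1 to V3) rather than a single winner.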
Abstraction
The paper divides the GNN training process into three stages, and each stage offers many optional operations, which together define the overall search space. Many existing GNN models can be expressed as instances of this search space, as illustrated below.
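A minimal sketch of the three-stage pipeline, assuming a SIGN-like instantiation: normalized-adjacency propagation, concatenation, and a two-layer MLP are my own choices for illustration, not the paper's defaults. The key property is that graph aggregation is precomputed once, so the training stage never touches the graph structure.

```python
import torch

def graph_aggregator(adj_norm, x, k):
    """Pre-processing stage: propagate node features over the graph k times.
    adj_norm is a sparse normalized adjacency matrix; there are no learnable
    parameters, so this runs once, offline, even for very large graphs."""
    msgs = [x]
    for _ in range(k):
        msgs.append(torch.sparse.mm(adj_norm, msgs[-1]))
    return msgs                                   # messages for hops 0..k

def message_aggregator(msgs):
    """Combine multi-hop messages; concatenation is just one of the options
    (mean, max, weighted, adaptive) in the design space."""
    return torch.cat(msgs, dim=1)

class MessageUpdater(torch.nn.Module):
    """Training stage: a plain MLP over the combined messages. Because the
    aggregation was precomputed, a mini-batch is just a slice of rows."""
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(in_dim, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_classes))

    def forward(self, h):
        return self.net(h)
```

In this view, SGC roughly corresponds to keeping only the last hop with a linear updater, SIGN to concatenating all hops before an MLP, and S2GC to averaging the hops, which is why such models can be generalized from the same search space.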
Engines
Experiments
Setting
Datasets:
- citation networks (Citeseer, Cora, and PubMed)
- two social networks (Flickr and Reddit)
- co-authorship graphs (Amazon and Coauthor)
- co-purchasing network (ogbn-products)
- one short-form video recommendation graph (Industry)
Baselines: GCN, GAT, JK-Net, Res-GCN, APPNP, AP-GCN, SGC, SIGN, S2GC, and GBP
Searched Representatives
We apply multi-objective optimization targeting classification error and inference time on Cora.
Training scalability
Choose PaSca-APPNP as a representative and compare it with GraphSAGE.
Train both of them with:
- batch size 8192 for Reddit and 16384 for ogbn-products
- in stand-alone and distributed scenarios, then measure the corresponding speedups
- speedup is calculated from runtime per epoch, relative to one worker in the stand-alone scenario and two workers in the distributed scenario
- without communication cost, the expected speedup grows linearly with the number of workers (since training is asynchronous and distributed)
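My reading of that definition as a formula: speedup(n) = T(n_ref) / T(n), where T(n) is the runtime per epoch with n workers and n_ref is 1 (stand-alone) or 2 (distributed); with no communication or aggregation cost, the ideal curve is roughly linear, speedup(n) ≈ n / n_ref.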
GraphSAGE requires aggregating neighboring nodes during training, so it runs into an I/O bottleneck.
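To make the I/O argument concrete, here is a back-of-the-envelope sketch (the fan-out numbers are hypothetical, not GraphSAGE's actual defaults): with per-layer neighbor sampling, the number of nodes one target node pulls in grows multiplicatively with depth, whereas the precomputed-aggregation pipeline only reads a fixed number of feature rows per batch.

```python
# Hypothetical fan-outs per message-passing layer.
fanouts = [10, 10, 10]

touched = 1          # the target node itself
frontier = 1
for f in fanouts:    # neighbors sampled at each additional hop
    frontier *= f
    touched += frontier

print(touched)       # 1 + 10 + 100 + 1000 = 1111 nodes read for one prediction
```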
Performance-Efficiency Analysis
PaSca-V3 achieves the best performance, at roughly 4× the training time of GBP and PaSca-V1. Note that although PaSca-V1 requires the same training time as GBP, its inference time is lower than GBP's.
So we can choose among PaSca-V1 to V3 and GBP according to different requirements on predictive performance, training efficiency, and inference time.
Model Scalability
PaSca includes an adaptive message_aggregator, which can identify the different message-passing demands of individual nodes and explicitly weight each hop's graph message.
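A minimal sketch of what such node-adaptive weighting could look like: each node gets its own softmax weights over the k+1 hop messages, so nodes that need long-range information can up-weight distant hops. The gating function below is my simplification, not the paper's exact formulation.

```python
import torch

class AdaptiveMessageAggregator(torch.nn.Module):
    """Weight each hop's message per node with a learned gate (simplified)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = torch.nn.Linear(dim, 1)        # scores one hop message per node

    def forward(self, msgs):
        # msgs: list of (num_nodes, dim) tensors for hops 0..k
        h = torch.stack(msgs, dim=1)                # (num_nodes, k+1, dim)
        scores = self.gate(h).squeeze(-1)           # (num_nodes, k+1)
        weights = torch.softmax(scores, dim=1)      # per-node weights over hops
        return (weights.unsqueeze(-1) * h).sum(1)   # (num_nodes, dim)
```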