A Stochastic Optimization Strategy for Parallel Sparse FastTucker Decomposition on GPU Platform
Paper Details
Background & Motivation
Data can often be represented as a High-Order, High-Dimension, and Sparse Tensor (HOHDST), which is infeasible to analyze directly.
Tensor decomposition addresses this by representing the original tensor with multiple low-rank factor matrices or smaller tensors while preserving its multi-order structural information.
Tucker decomposition is a mainstream tensor decomposition method, but its classic implementations, such as HOSVD, HOOI, T-HOSVD, and ST-HOSVD, incur huge memory overhead and computational complexity.
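For reference, this is the standard Tucker form (notation mine, not from the note): an N-order tensor is approximated by a small core tensor multiplied by a factor matrix along each mode.

```latex
\mathcal{X} \approx \mathcal{G} \times_1 A^{(1)} \times_2 A^{(2)} \cdots \times_N A^{(N)},
\qquad
\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N},\quad
\mathcal{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N},\quad
A^{(n)} \in \mathbb{R}^{I_n \times J_n}.
```

The core tensor alone has J_1 × … × J_N entries, which grows exponentially with the order N and is a major source of the memory overhead mentioned above.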
Gap
Previous work (FastTucker) addresses this by reducing the memory overhead and computational complexity, but it suffers from poor GPU utilization and performs many redundant calculations.
Challenge
How to increase GPU utilization and avoid redundant calculations.
Goal
Propose a decomposition strategy that reduces the memory and computational complexity of processing sparse tensors while fully utilizing GPU storage resources.
Specifically,
- It builds on the FastTucker algorithm and further reduces its redundant calculations (see the sketch below).
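The note does not spell out which calculations are redundant, but the flavor can be illustrated with a standard trick; the sketch below is hypothetical (a CP-style update on an N-order tensor standing in for FastTucker, with made-up names and fixed sizes), not the paper's code. Updating mode n for one nonzero needs the Hadamard product of the factor rows of all the other modes; precomputing shared prefix/suffix products builds all N of these in O(N·R) multiplies instead of O(N²·R).

```cuda
#define RANK  16  // assumed rank (illustrative)
#define NMODE 4   // assumed tensor order (illustrative)

// Hypothetical sketch, not the paper's kernel: compute, for one nonzero,
// out[n][r] = product over all modes m != n of rows[m][r], reusing shared
// prefix/suffix intermediates instead of recomputing them per mode.
__device__ void all_but_one_products(const float rows[NMODE][RANK],
                                     float out[NMODE][RANK])
{
    float prefix[NMODE][RANK], suffix[NMODE][RANK];
    for (int r = 0; r < RANK; ++r) {
        prefix[0][r] = 1.0f;             // empty product before mode 0
        suffix[NMODE - 1][r] = 1.0f;     // empty product after the last mode
    }
    for (int n = 1; n < NMODE; ++n)      // prefix[n] = rows[0] ∘ ... ∘ rows[n-1]
        for (int r = 0; r < RANK; ++r)
            prefix[n][r] = prefix[n - 1][r] * rows[n - 1][r];
    for (int n = NMODE - 2; n >= 0; --n) // suffix[n] = rows[n+1] ∘ ... ∘ rows[N-1]
        for (int r = 0; r < RANK; ++r)
            suffix[n][r] = suffix[n + 1][r] * rows[n + 1][r];
    for (int n = 0; n < NMODE; ++n)      // reused by every mode-n gradient
        for (int r = 0; r < RANK; ++r)
            out[n][r] = prefix[n][r] * suffix[n][r];
}
```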
Contributions:
- From an algorithm perspective:
  - Extract and pre-compute reusable, shared intermediate variables to avoid redundant calculations (as sketched above).
  - Analyze the time complexity of both cases, i.e., with and without the reusable, shared intermediate variables.
- From a backend perspective:
  - Use the B-CSF tensor storage format to evenly distribute a large tensor, as multiple sub-tensors, across the computational workers.
  - Decompose the tensor in parallel with multiple workers and threads (see the CUDA sketch after the Optimizations list), where
    - each worker is a warp (a group of 32 threads) on the GPU and is responsible for computing one sub-tensor;
    - each thread in a worker computes one scalar of a given vector.
- Optimizations
  - Use memory coalescing to achieve higher global-memory access efficiency.
  - Use the warp shuffle technique for fast inter-thread communication within a warp.
  - Use the on-chip L1 cache to speed up reads and writes of the reusable intermediates and other frequently used variables.
  - Use registers to hold per-thread values during computation.
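To make the mapping concrete, here is a minimal CUDA sketch of how the worker/thread layout and these optimizations could fit together. It is not the paper's code: the argument names (`subtensor_ptr`, `idx_i`, ...), the 3-order CP-style update standing in for the FastTucker kernel, and the assumption that the rank R is at most 32 are all illustrative.

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE 32

// Illustrative sketch, not the paper's kernels. Assumptions: 3-order tensor,
// rank R <= 32, CSF-style nonzero ranges per sub-tensor in subtensor_ptr.
// One warp (worker) owns one sub-tensor; lane r owns rank component r in a
// register; factor rows are read with coalesced, consecutive-lane accesses;
// the inner product is reduced with warp shuffles instead of shared memory.
__global__ void subtensor_sgd_update(
    const int*   __restrict__ subtensor_ptr, // nonzero range of each sub-tensor
    const int*   __restrict__ idx_i,         // mode-1 index per nonzero
    const int*   __restrict__ idx_j,         // mode-2 index per nonzero
    const int*   __restrict__ idx_k,         // mode-3 index per nonzero
    const float* __restrict__ vals,          // nonzero values
    const float* __restrict__ A,             // I1 x R factor matrix, row-major
    const float* __restrict__ B,             // I2 x R factor matrix, row-major
    float*       __restrict__ C,             // I3 x R factor matrix being updated
    int num_subtensors, int R, float lr)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / WARP_SIZE;
    int lane    = threadIdx.x % WARP_SIZE;
    if (warp_id >= num_subtensors) return;

    for (int n = subtensor_ptr[warp_id]; n < subtensor_ptr[warp_id + 1]; ++n) {
        int i = idx_i[n], j = idx_j[n], k = idx_k[n];

        // Coalesced loads: consecutive lanes read consecutive rank entries.
        float a = (lane < R) ? A[i * R + lane] : 0.0f;
        float b = (lane < R) ? B[j * R + lane] : 0.0f;
        float c = (lane < R) ? C[k * R + lane] : 0.0f;
        float partial = a * b * c;           // per-lane partial, kept in a register

        // Warp-shuffle reduction: sum the R partials without shared memory.
        for (int off = WARP_SIZE / 2; off > 0; off >>= 1)
            partial += __shfl_down_sync(0xffffffff, partial, off);
        float pred = __shfl_sync(0xffffffff, partial, 0); // broadcast lane 0's sum

        // Stochastic (SGD-style) update of the mode-3 factor row.
        float err = vals[n] - pred;
        if (lane < R) C[k * R + lane] += lr * err * a * b;
    }
}
```

A launch such as `subtensor_sgd_update<<<(num_subtensors * 32 + 255) / 256, 256>>>(...)` then assigns one warp to each sub-tensor; concurrent updates to a shared factor row are left lock-free, in the usual Hogwild style of stochastic methods.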
Experiment
Compare the efficiency of:
- the COO and CSF tensor storage formats (see the layout sketch below);
- extracting reusable intermediate variables and shared invariants versus recomputing them.
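For context, the two formats being compared are standard (field names here are illustrative): COO stores a full index tuple per nonzero, while CSF compresses repeated indices into a fiber tree.

```cuda
// Illustrative host-side layouts for a 3-order sparse tensor.
// COO: one complete (i, j, k, val) tuple per nonzero.
struct CooTensor {
    int   *i, *j, *k;   // nnz entries each
    float *val;         // nnz values
    int    nnz;
};

// CSF: indices compressed level by level into a fiber tree, so a mode-1
// index shared by many nonzeros is stored only once.
struct CsfTensor {
    int   *i;           // distinct mode-1 indices (ni tree roots)
    int   *j_ptr, *j;   // j_ptr[ni + 1] delimits the mode-2 children of each i
    int   *k_ptr, *k;   // k_ptr[nj + 1] delimits the mode-3 children of each (i, j)
    float *val;         // one value per leaf, aligned with k
    int    ni, nj, nnz;
};
```

B-CSF additionally splits overly heavy fibers so that each sub-tensor carries a balanced share of nonzeros, which is what lets the warps described above be loaded evenly.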