PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications

Posted on January 12, 2022 · 12 minute read

Problems not solved:

  1. How to load only the model structure (which is small), not the model parameters?
  2. An inference job can preempt a training job; if the model is the same, could this be used for online learning?

Abstract & Introduction

Background

GPU clusters are often over-provisioned for the peak inference load, and the clusters have limited sharing between applications and task types. For example:

  1. Inference tasks cannot be served by training clusters under flash crowds.
  2. Training tasks cannot utilize inference clusters when the inference load is low.
  3. Even for inference alone, production systems typically provision each application its own GPUs to limit interference between applications.

Challenge 1: GPUs have high overhead when switching between tasks. For example, if a GPU switches to a DNN model (e.g., ResNet) that has not been preloaded onto the GPU, it can take multiple seconds before serving the first inference request.

Challenge 2: NVIDIA Multi-Process Service (MPS) and Salus allow multiple processes to use the same GPU, but they require all processes' data (e.g., DNN models) to be preloaded into GPU memory, and GPU memory is limited.

Problems

How to quickly switch the contents of GPU memory so that each of the multiplexed applications can use the entire GPU compute and memory resources during its time slice?

Contribution

They present PipeSwitch, a system that enables unused cycles of an inference application to be filled by training or other inference applications.

Overall, the system leverages pipelined model transmission, unified memory management, and active-standby worker switching to achieve millisecond-scale context switching latency and high throughput.

  1. They propose PipeSwitch, a system that enables GPU-efficient fine-grained time-sharing for multiple DL applications, and achieves millisecond-scale context switching latencies and high throughput.
  2. They introduce pipelined context switching, which exploits the characteristics of DL applications, and leverages pipelined model transmission, unified memory management, and active-standby worker switching to minimize switching overhead and enforce process-level isolation.
  3. They implement a system prototype and integrate it with PyTorch. Experiments on a variety of DL models and GPU cards show that PipeSwitch only incurs a task startup overhead of 3.6–6.6 ms and a total overhead of 5.4–34.6 ms (10–50× better than NVIDIA MPS), and achieves near 100% GPU utilization.

System Overview

[Figure: PipeSwitch system architecture (controller, memory daemon, active worker, standby workers)]

Controller: contains two threads, a TCP thread and a scheduler thread (working together with the memory daemon).

Memory daemon: it manages the GPU memory and the DNN models.

  1. Before starting a task, the user registers the model with the scheduler.
  2. It allocates the GPU memory to the active worker.
  3. It transfers the model from the host memory (scheduler) to the GPU memory.

Active worker: a process that executes a task on one GPU. It contains two threads.

  1. Termination thread: receives the termination signal from the controller and notifies the main thread.
  2. Main thread: manages DNN models and performs computation for inference or training.
  3. The worker only loads the model structure (which is small), not the model parameters.

Standby worker: an idle process that is either initializing a new task or cleaning its environment from the previous task.

Execution steps (a controller-loop sketch follows the list):

  1. The controller queues tasks received from clients.
  2. It uses a scheduling policy to decide which task to execute next.
  3. To start a new task, the controller either waits for the current task to finish (e.g., if it is inference) or preempts it by notifying the active worker to stop (e.g., if it is training). At the same time, the controller notifies an idle standby worker to initialize its environment for the new task.
  4. The active worker completes or stops its task.
  5. The controller notifies the memory daemon and the standby worker to load the new task onto the GPU with pipelined model transmission (if the model is the same, there may be no need to transfer it; online learning?).
  6. The memory daemon allocates memory to the standby worker and transmits the model from host memory to the GPU.
  7. The standby worker becomes the active worker and executes the new task.
  8. The old active worker becomes a standby worker and cleans the environment of the previous task.
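
The flow above can be condensed into a controller loop. The sketch below is Python-flavored pseudocode: every object and method name (scheduler, memory_daemon, the worker methods, etc.) is hypothetical and only paraphrases the numbered steps, not the actual PipeSwitch interfaces.

```python
# Hypothetical controller loop paraphrasing the execution steps above.
# All names are illustrative, not PipeSwitch's real API.
def controller_loop(task_queue, scheduler, memory_daemon, active, standby):
    while True:
        task = scheduler.pick(task_queue)                       # steps 1-2: queue + policy
        if active.running_training():
            active.notify_stop()                                # step 3: preempt training
        standby.init_environment(task)                          # step 3: prepare standby in parallel
        active.wait_until_stopped_or_done()                     # step 4
        memory_daemon.allocate(standby)                         # step 6: hand GPU memory to standby
        memory_daemon.pipelined_transmit(task.model, standby)   # steps 5-6: pipelined transmission
        standby.execute(task)                                   # step 7: standby becomes active
        active.clean_environment_async()                        # step 8: old active cleans up
        active, standby = standby, active                       # swap roles for the next task
```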

PipeSwitch Design

Profiling task switching overhead

Scenario: a typical case where a server stops a training task running on the GPU and then starts an inference task.

Model: ResNet152

Measured: the time to start the inference task and execute it on the GPU.

[Figures: profiling results, breakdown of the task switching overhead components]

All the components take considerable time compared to the inference time, so all those components should be optimized.

Profiling model transmission

The PCIe bandwidth is the physical limit on how fast an arbitrary task can be loaded onto the GPU. In other words, transmitting a task from CPU to GPU is bounded by the PCIe bandwidth.
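
As a rough back-of-envelope check (my own numbers, not from the paper: ResNet152 has roughly 60M fp32 parameters, and PCIe 3.0 ×8 gives on the order of 8 GB/s effective bandwidth):

```python
# Back-of-envelope: time to move a ResNet152-sized model over PCIe.
# Assumptions (not from the paper): ~60M fp32 parameters, ~8 GB/s effective PCIe 3.0 x8 bandwidth.
model_bytes = 60e6 * 4            # ~240 MB of parameters
pcie_bandwidth = 8e9              # bytes per second
print(f"{model_bytes / pcie_bandwidth * 1e3:.0f} ms")   # ~30 ms just for transmission
```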

In DNN, a task does not need to wait for the entire model to be transmitted to the GPU before beginning the computation. Instead, the task can start the computation of a layer as soon as the layer is loaded in the GPU and the input of the layer is ready (i.e., the previous layers have finished their computation), regardless of its following layers.
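
A minimal PyTorch-flavored sketch of this idea (not PipeSwitch's implementation): keep parameters in pinned host memory, enqueue each layer's copy on a side stream, and make the compute stream wait only for the layer it is about to run. The toy Sequential model at the bottom exists only to make the sketch runnable.

```python
import torch
import torch.nn as nn

def pipelined_forward(layers, x):
    """Sketch: overlap per-layer H2D copies with computation using two CUDA streams."""
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()
    ready = []                                     # event i fires when layer i is on the GPU

    # Parameters live in pinned host memory so the copies can be truly asynchronous.
    for layer in layers:
        for p in layer.parameters():
            p.data = p.data.pin_memory()

    with torch.cuda.stream(copy_stream):
        for layer in layers:
            layer.to("cuda", non_blocking=True)    # enqueue this layer's copy
            ev = torch.cuda.Event()
            ev.record(copy_stream)
            ready.append(ev)

    x = x.cuda()
    with torch.no_grad():
        for layer, ev in zip(layers, ready):
            compute_stream.wait_event(ev)          # wait only for *this* layer's copy
            x = layer(x)                           # compute while later layers still transfer
    return x

# Toy usage: a stack of linear layers standing in for a real DNN.
model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(20)])
out = pipelined_forward(list(model), torch.randn(8, 1024))
```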

[Figure: overlapping model transmission with computation, layer by layer]

Optimal model-aware grouping

Why group layers for transmission? To minimize these two sources of overhead:

  1. When transmitting a large amount of data, the overhead is dominated by the data size.
  2. When transmitting layer by layer, the overhead is dominated by the many small PCIe calls.

How to choose the group size? Two insights:

  1. The first group cannot be too large (F3.a).
  2. Beyond the first group, multiple layers can be safely packed into one group based on the progress of the computation without affecting pipeline efficiency (F3.b).

[Figure 3: how the group size affects pipeline efficiency]

The algorithm runs offline to find the grouping strategy, and the resulting strategy is used online by PipeSwitch for context switching.

[Algorithm 1: finding the optimal model-aware grouping]
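
As a rough illustration of what the grouping strategy trades off, here is a hedged sketch (my own toy timing model, not the paper's Algorithm 1) that estimates the pipelined completion time of a candidate grouping from per-layer transmission and computation times:

```python
def pipelined_makespan(transmit_ms, compute_ms):
    """Completion time when group i's computation starts only after both its own
    transmission and group i-1's computation have finished (toy timing model)."""
    t_done, c_done = 0.0, 0.0
    for t, c in zip(transmit_ms, compute_ms):
        t_done += t                        # groups go over PCIe back to back
        c_done = max(c_done, t_done) + c   # compute waits for data and for the previous group
    return c_done

def group_times(layer_t, layer_c, groups, per_call_overhead=0.5):
    """Aggregate per-layer times into per-group times; each group pays one PCIe-call overhead."""
    return ([sum(layer_t[i] for i in g) + per_call_overhead for g in groups],
            [sum(layer_c[i] for i in g) for g in groups])

# Toy example: 12 identical layers, 0.5 ms transmission and 0.5 ms computation each.
layer_t, layer_c = [0.5] * 12, [0.5] * 12
strategies = {
    "per-layer": [[i] for i in range(12)],
    "one group": [list(range(12))],
    "small first group, growing groups": [[0], [1, 2], [3, 4, 5], list(range(6, 12))],
}
for name, groups in strategies.items():
    print(name, pipelined_makespan(*group_times(layer_t, layer_c, groups)))
```

In this toy setting, both per-layer pipelining and a single big group finish in 12.5 ms, while the strategy with a small first group and progressively larger groups finishes in 11.0 ms, matching the two insights above.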

Unified Memory Management

By default:

  1. NVIDIA provides cudaMalloc to allocate memory on GPU.
  2. NVIDIA also provides CUDA unified memory to automatically handle memory movement between the host memory and the GPU memory for applications

Two characteristics of DL applications:

  1. The amount of memory allocated to the DNN model is fixed and does not change during task execution. (The structure is fixed, so the number of parameters is fixed, so the memory is fixed.)
  2. The intermediate results (the output of each layer) change during execution, but they do not cause memory fragmentation.
    1. In inference, after the next layer is computed, they are no longer needed and can be safely freed.
    2. In training, they cannot be freed immediately, because they are also used by the backward pass to update the weights (but they are consumed in first-in-last-out order).

New design:

=> Minimize memory allocation overhead:

The memory daemon uses cudaMalloc to obtain the GPU memory when the system starts, and then dynamically allocates that memory to the workers at runtime (a sketch follows the list below).

  1. This eliminates the overhead of each worker calling cudaMalloc to get memory.
  2. The memory daemon only passes memory pointers to workers.
  3. The memory daemon ensures that only one worker owns the GPU memory at a time, to guarantee memory isolation between workers.
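
A minimal single-process sketch of the idea (the real memory daemon and workers are separate processes, and the daemon calls cudaMalloc from C++; here torch.empty stands in for the one-time allocation, and all class and method names are hypothetical):

```python
import torch

class GPUMemoryPool:
    """Illustrative stand-in for the memory daemon's pool: one big allocation up front
    (torch.empty as a proxy for cudaMalloc), then offset-based handouts at runtime."""
    def __init__(self, total_bytes):
        self.buffer = torch.empty(total_bytes, dtype=torch.uint8, device="cuda")
        self.offset = 0
        self.owner = None                  # at most one worker owns the pool at a time

    def assign(self, worker_id):
        assert self.owner is None, "memory isolation: only one active worker"
        self.owner, self.offset = worker_id, 0

    def release(self, worker_id):
        assert self.owner == worker_id
        self.owner = None                  # pointers handed out earlier become invalid

    def alloc(self, nbytes):
        view = self.buffer.narrow(0, self.offset, nbytes)  # no cudaMalloc at runtime
        self.offset += nbytes
        return view                        # "pointer" handed to the active worker

# Usage sketch: the daemon assigns the whole pool to the active worker,
# which carves model parameters and activations out of it.
pool = GPUMemoryPool(512 * 1024 * 1024)
pool.assign(worker_id=0)
weights = pool.alloc(250 * 1024 * 1024)
pool.release(worker_id=0)
```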

=> Minimize memory copies overhead:

The memory daemon stores the models in host memory, so it can directly transfer a model to the GPU at task startup.

=> Minimize IPC overhead:

After the model is transmitted to the GPU, the memory daemon needs to notify workers and export a GPU memory handle to them, which requires GPU IPC APIs (cudaIpcOpenMemHandle on NVIDIA GPUs).

Because cudaIpcOpenMemHandle incurs high overhead, the memory daemon uses GPU IPC only once to initialize each worker, and then uses cheap CPU IPC to notify the worker which pipeline group has been transmitted.
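
A hedged sketch of this "one GPU IPC, then cheap CPU IPC" pattern, using torch.multiprocessing instead of PipeSwitch's C++ implementation: sending a CUDA tensor through a torch.multiprocessing queue performs the CUDA IPC handle exchange under the hood, and the later group-ready notifications are plain CPU-side messages. The queue names and message format are mine.

```python
import torch
import torch.multiprocessing as mp

def worker(init_q, notify_q):
    # One-time GPU IPC: receiving a CUDA tensor through a torch.multiprocessing
    # queue opens the underlying CUDA IPC memory handle once in this process.
    gpu_buffer = init_q.get()
    while True:
        msg = notify_q.get()                 # cheap CPU IPC afterwards
        if msg is None:
            break
        offset, numel = msg                  # which pipeline group is ready
        group = gpu_buffer.narrow(0, offset, numel)  # no further GPU IPC calls needed
        # ... run the layers whose parameters live in `group` ...

if __name__ == "__main__":
    mp.set_start_method("spawn")
    init_q, notify_q = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(init_q, notify_q))
    p.start()
    shared = torch.empty(1 << 22, device="cuda")   # daemon-owned GPU region
    init_q.put(shared)                             # GPU IPC happens once, here
    notify_q.put((0, 1 << 20))                     # later notifications are CPU-only
    notify_q.put(None)
    p.join()
```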

=> Pin memory:

The OS may swap a memory page to disk if the page is inactive for a certain amount of time. The pages of the memory daemon are pinned to host memory to eliminate this overhead.

Active-Standby Worker Switching

[Figure: comparison of two-process, one-process, and active-standby switching]

Two processes: the old process must clean up its GPU environment, and the new process must warm up (initialize) the GPU.

One process: although the new task can reuse the warm CUDA context, the current task still has to clean up its state, and this approach does not provide process-level isolation between tasks.

Active-standby:

  1. Each worker initializes its own GPU environment (i.e., CUDA context) when it is first created. This eliminates the GPU environment initialization overhead when a new task is assigned to a worker.
  2. When a task is stopped, a major job is to clear the asynchronous CUDA functions queued on the GPU. PipeSwitch inserts synchronization points into training tasks, so the number of queued functions is limited and can be cleared quickly (see the sketch after this list).
  3. When a task is stopped, another job is to free memory. PipeSwitch only deletes the pointers pointing to the tensor data rather than freeing the actual data, so it is safe for the new task to transmit its model to GPU memory at the same time. In other words, the task cleaning of the current task and the pipelined model transmission of the new task can be parallelized, hiding the task cleaning overhead.
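
A hedged sketch of item 2, the synchronization points: a training loop that periodically drains queued CUDA kernels and checks a stop flag. The flag, the period, and the loop itself are illustrative; PipeSwitch does this inside its PyTorch integration rather than in user code.

```python
import threading
import torch

stop_flag = threading.Event()        # set by the termination thread on preemption
SYNC_EVERY = 4                       # bounds how many iterations' kernels can be queued

def train(model, loss_fn, optimizer, data_loader):
    for step, (x, y) in enumerate(data_loader):
        loss = loss_fn(model(x.cuda()), y.cuda())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if step % SYNC_EVERY == 0:
            torch.cuda.synchronize()  # synchronization point: drain queued CUDA work
            if stop_flag.is_set():    # only a bounded amount of work is left to clear
                return                # latest checkpoint lives in host memory; exit quickly
```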

When a task arrives, there should be an idle worker waiting for it. Every standby worker needs to maintain its own CUDA context, which consumes a few hundred MB of GPU memory.

Two standby workers are sufficient to ensure at least one idle worker; this eliminates the waiting time with moderate GPU memory consumption.

Implementation

The prototype is about 3,600 lines of code in C++ and Python.

It is integrated with PyTorch, including added functions for allocating GPU memory, sharing the GPU memory with workers through the CUDA IPC API, and getting the shared GPU memory.

Evaluation

Setup

Experiments are conducted on AWS with PCIe 3.0 ×8 and 32 GB of host memory. The software environment includes PyTorch-1.3.0, torchvision-0.4.2, scipy-1.3.2, and CUDA-10.1.

Workloads

Models used: ResNet152, Inception_v3, and Bert_base.

The default batch size for training is 32, and that for inference is 8.

Training tasks periodically checkpoint their models to host memory and restart from the latest checkpoint after preemption. The checkpointing frequency is set according to the scheduling cycle to minimize checkpointing overhead.

Metrics

Measure throughput and latency.

End-to-End Experiments

Experiment: A client sends an inference task to a GPU server, and the GPU server preempts the training task to execute the inference task and sends a reply back to the client.

Measure end-to-end latency and compare with:

  1. Ready model: no training task; this gives the lowest possible latency for the inference task.
  2. Stop-and-start:
    1. Stop the training task and then start the inference task.
    2. This is the slowest; the main sources of overhead are CUDA context initialization and first-time library loading operations in PyTorch.
  3. NVIDIA MPS:
    1. Multi-process support from NVIDIA that allows the inference process to share the GPU with the training process. The training task occupies the entire GPU memory and does not stop when inference tasks arrive; CUDA unified memory is used for memory swapping.
    2. One source of overhead is contention for both GPU computation and memory, since the training task does not stop when an inference task arrives. Another source is GPU memory swapping.
  4. PipeSwitch:
    1. Performs the best and is close to the lower bound, with roughly 10 ms of overhead for most models.

[Figures: end-to-end latency comparison across models and GPUs]

Measure end-to-end throughput and latency with different scheduling cycles.

We use only ResNet152 for both training and inference on eight p3.2xlarge instances, and switch between these two tasks after each scheduling cycle.

[Figure: throughput and latency under different scheduling cycles]

Pipelined Model Transmission

Experiment: we keep all other components of PipeSwitch the same, and compare the following mechanisms.

[Figure: comparison of pipelined model transmission mechanisms]

Grouped transmission improves over no optimization by combining the layers of the model into one big tensor and transmitting it in one group.

Per-layer pipeline overlaps transmission and computation at the granularity of a layer. But because it incurs PCIe overhead and synchronization overhead for every layer, for models with many layers but relatively light computation (such as ResNet152 and Inception), it can perform worse than grouped transmission and sometimes even worse than no pipeline.

[Figure: latency of the different transmission mechanisms]

This experiment shows that pruning speeds up Algorithm 1.

[Figure: running time of Algorithm 1 with and without pruning]

Unified Memory Management

Experiment: we keep all other components of PipeSwitch the same, and compare the following five mechanisms.

  1. No unified memory management: each worker uses cudaMalloc to allocate GPU memory and transmits the model to the GPU on its own.
  2. No IPC optimization: the memory daemon handles GPU memory allocation and model transmission, but creates and sends GPU memory handles to workers. In comparison, PipeSwitch simply sends a 64-bit integer offset into the shared GPU memory to workers.
  3. No pin memory: all optimizations on unified memory management, except that the pages of the memory daemon are not pinned to host memory.
  4. CUDA unified memory: each worker allocates GPU memory with cudaMallocManaged, and CUDA automatically transmits the model to the GPU when needed.
  5. PipeSwitch: the full unified memory management mechanism used by PipeSwitch.

[Figure: latency of the five memory management mechanisms]

All the optimizations on memory management are effective.

Active-Standby Worker Switching

Experiment: Keep all other components of PipeSwitch the same, and compare the following mechanisms.

  1. Two processes:
    1. The process of the old task cleans the GPU environment, and then another process is created and initialized for the new task.
    2. The new process needs to create a new CUDA environment, which dominates the total time.
  2. One process:
    1. The process cleans the GPU environment for the old task and reuses the environment for the new task.
    2. It reuses the CUDA environment, but still pays the overhead to clean the environment.
  3. PipeSwitch:
    1. The active-standby worker switching mechanism used by PipeSwitch.
    2. It parallelizes old-task cleaning and new-task initialization.

[Figure: latency of the three switching mechanisms]




