Trisk Task-Centric Data Stream Reconfiguration

Posted on April 18, 2023   2 minute read ∼ Filed in  : 

Introduction

Background & Motivation

Distributed Stream Jobs:

  • execution plan: DAG, operators -> physical paralle tasks.
    • Vertices: tasks
    • Edget: data flow between tasks.
  • Workloads: input data of an operator.

Control policy:

  • Monitors the stream job and decide whether or not to update it.
  • Identify the performance bottlenect, and use reconfigurations to optimize it.

Reconfigurations: low-latecy on stream jobs requires system can reconfigure part of dataflow computation dynamically during execution without affecting the correctness of processing logic.

  • Versatility: reconfigurations mainly include operations along three dimension:

    • resource: amount of resources allocated to a task, CPU cores and memory.

      There is need to reallocate resource for stream jobs to achive better resoruce utilization. Thus needs Reconfigurations.

    • workloads: partition the workloads to a task of an operator.

      User can adjust the workloading over tasks, scalling out/in by adding/removing tasks. Each task has a state, thus needs Reconfigurations

    • execution logic: UDF updated, thus needs Reconfigurations

  • Efficiency: Must finish reconfigurations quickly, e,g. minimize the sync time between parallel tasks. sync is to make all parallel tasks to be paused at the same logical time to avoid data loss or data duplication during reconfiguration execution

  • Usability: easy-to-use API

image-20230419114210653

Goal

Control plane solution supports reconfigurations of stream jobs.

Details

Reconfiguration Execution

To reduce the latency and maintain the consistency of , it uses partial pause-and-resume scheme and is able to leverage the mechanisms in native stream systems for tasks synchronization.

Upon receiving reconfigurations request,

  • it uses a coordinator to generate a new DAG execution plan, detect all tasks by comparing the DAG.
  • pause tha affected tasks, and sync the them (with native system’s mechanism, like Flink’s async checkpoint machanism).
  • update task with new configs.

IMPLEMENTATION

Runtime + stream system

Runtime: abstractiona nd reconfiguration APIs for user.

Stream system instrouctions:

  • efficient execution of reconfigurations requred from runtime.
  • leverage sync mechanism in native stream system, and work collaborate with each tasks’ ConfigManager.

On Flink:

  • map abstraction to configurations in Flinks JobGraph and ExecutionGraph




END OF POST




Tags Cloud


Categories Cloud




It's the niceties that make the difference fate gives us the hand, and we play the cards.