Trisk Task-Centric Data Stream Reconfiguration
2 minute read ∼ Filed in : A paper noteIntroduction
Background & Motivation
Distributed Stream Jobs:
- execution plan: DAG, operators -> physical paralle tasks.
- Vertices: tasks
- Edget: data flow between tasks.
- Workloads: input data of an operator.
Control policy:
- Monitors the stream job and decide whether or not to update it.
- Identify the performance bottlenect, and use reconfigurations to optimize it.
Reconfigurations: low-latecy on stream jobs requires system can reconfigure part of dataflow computation dynamically during execution without affecting the correctness of processing logic.
-
Versatility: reconfigurations mainly include operations along three dimension:
-
resource: amount of resources allocated to a task, CPU cores and memory.
There is need to reallocate resource for stream jobs to achive better resoruce utilization. Thus needs Reconfigurations.
-
workloads: partition the workloads to a task of an operator.
User can adjust the workloading over tasks, scalling out/in by adding/removing tasks. Each task has a state, thus needs Reconfigurations
-
execution logic: UDF updated, thus needs Reconfigurations
-
-
Efficiency: Must finish reconfigurations quickly, e,g. minimize the sync time between parallel tasks. sync is to make all parallel tasks to be paused at the same logical time to avoid data loss or data duplication during reconfiguration execution
-
Usability: easy-to-use API
Goal
Control plane solution supports reconfigurations of stream jobs.
Details
Reconfiguration Execution
To reduce the latency and maintain the consistency of , it uses partial pause-and-resume scheme and is able to leverage the mechanisms in native stream systems for tasks synchronization.
Upon receiving reconfigurations request,
- it uses a coordinator to generate a new DAG execution plan, detect all tasks by comparing the DAG.
- pause tha affected tasks, and sync the them (with native system’s mechanism, like Flink’s async checkpoint machanism).
- update task with new configs.
IMPLEMENTATION
Runtime + stream system
Runtime: abstractiona nd reconfiguration APIs for user.
Stream system instrouctions:
- efficient execution of reconfigurations requred from runtime.
- leverage sync mechanism in native stream system, and work collaborate with each tasks’ ConfigManager.
On Flink:
- map abstraction to configurations in Flinks
JobGraph
andExecutionGraph